Some poetry data visualisation
Analysing stress patterns
Shakespeare
Data and code on GitHub.
Shakespeare wrote most of his verse in iambic pentameter, the basic pattern of which goes da-DUM da-DUM da-DUM da-DUM da-DUM. One da-DUM is called an iamb; ‘penta-’ means there are five of them in a line. Other meters have different numbers, or different building blocks (‘feet’) than the iamb.
It’s rare for a poem to stay in fully strict iambic pentameter. Whether to introduce some variety or to emphasise a word by a change in the rhythm, a poet may deviate from the regular alternation of unstressed and stressed syllables in various ways.
The simplest variation is to replace a stressed syllable with an unstressed syllable. In the line
For Brutus is an honourable man,
there are stresses on Bru, hon, and man – the other two would-be-stressed positions are occupied by is and ra. Such a ‘pyrrhic’ substitution does not fundamentally affect the iambic meter. The ear is attuned to hearing regular beats on the even-numbered syllables, and what stresses remain are still on the even syllables; all odd-numbered syllables are unstressed (but I repeat myself).
The notion of “syllable where the beat is” is formalised in the term ictus. Ictuses are usually stressed – the stresses are what set the regular rhythmic beat in the first place – and stressed syllables are usually ictic. But ictus and stress are not identical, and I will be distinguishing them later in the post.
Another metrical variation is to swap the positions of neighbouring stressed and unstressed syllables. Most commonly, this occurs within a single iamb (a ‘trochaic substitution’), going from da-DUM to DUM-da, though it is also possible to swap across an expected “boundary” between two iambs.
Either of these swaps is more aggressive than a pyrrhic substitution. The beat usually moves with the stress, interrupting the regular alternating rhythm:
To be, or not to be, that is the question
(The above line also includes an extra unstressed syllable at the end. These are called feminine endings.)
The process of marking the ictic and nonictic syllables, or stressed and unstressed, is called scansion. I scanned all of Shakespeare’s 154 sonnets using a tool Claude Code coded for me – I paste in a poem and click a button, it breaks the poem into syllables with some default stress markings, and I click on them to change/add/remove, then click another button to download the scansion.
The sonnets, apart from three irregular ones,[1] all have the same structure: 14 lines of iambic pentameter, consisting of three quatrains and a couplet to end. I marked stressed syllables in ictic positions, then had Claude collate the results to show, for each syllable position, the percentage of ictic stresses there.
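As a rough illustration of that collation step, here is a minimal sketch. It is not the code from the repository: the encoding (one string per line, "1" for a marked stress and "0" otherwise) and the function name stress_percentages are assumptions made for the example, and the toy input is made up rather than taken from real scansions. Extra syllables beyond the tenth are ignored here.

```python
from typing import List

# Hypothetical encoding: a sonnet is a list of 14 strings, one per line,
# with "1" for a marked stress and "0" for an unstressed syllable.
Sonnet = List[str]


def stress_percentages(sonnets: List[Sonnet],
                       n_lines: int = 14,
                       n_syll: int = 10) -> List[List[float]]:
    """Return grid[line][position] = % of sonnets stressed at that cell."""
    counts = [[0] * n_syll for _ in range(n_lines)]
    for sonnet in sonnets:
        for i, line in enumerate(sonnet[:n_lines]):
            # Only the first n_syll syllables are counted in this sketch.
            for j, mark in enumerate(line[:n_syll]):
                if mark == "1":
                    counts[i][j] += 1
    total = len(sonnets)
    return [[100.0 * c / total for c in row] for row in counts]


# Toy input (made up): one fully regular "sonnet" and one whose first
# line opens with a trochaic substitution.
regular = ["0101010101"] * 14
trochaic_start = ["1001010101"] + ["0101010101"] * 13

grid = stress_percentages([regular, trochaic_start])
print(grid[0][:4])  # first four syllables of line 1
```

Each row of the grid corresponds to a line of the sonnet and each column to a syllable position, which is the shape a heatmap needs.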
Because we expect most stresses to fall on the even-numbered syllables, I use different colour scales for the even and odd columns in the following heatmap. The 11th column shows the percentage of lines that had a feminine ending; these extra syllables are always unstressed.
It’s immediately apparent from all the red colours in the first column that Shakespeare employed trochaic substitutions most frequently at the start of a line, and especially at the start of a quatrain (lines 1, 5, 9) or the couplet (line 13). Almost a third of the sonnets, by my reading, start with a stressed syllable instead of the expected iambic unstressed syllable.
The most consistently stressed regular beats are the second and fifth (syllables 4 and 10), the second to firmly set the iambic rhythmic context, and the final beat to close out each line strongly. Mid-line substitutions are somewhat more common for the third and fourth beats.
If you do some arithmetic, you’ll find that the percentages add up to fewer than 5 ictic stresses per line: the average is 4.75, meaning that the sonnets have, on average, about 3.5 pyrrhic substitutions each.
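The arithmetic can be spelled out in a few lines. This sketch assumes, as the paragraph above implicitly does, that every beat without a stress counts as one pyrrhic substitution; the function name is made up for the example.

```python
def pyrrhics_per_sonnet(mean_stresses_per_line: float,
                        beats_per_line: int = 5,
                        lines_per_sonnet: int = 14) -> float:
    """Average number of beats per sonnet that carry no stress."""
    # Each line has five beats; whatever is missing from the average
    # is taken to be a pyrrhic substitution.
    missing_per_line = beats_per_line - mean_stresses_per_line
    return missing_per_line * lines_per_sonnet


estimate = pyrrhics_per_sonnet(4.75)
print(estimate)  # 3.5
```

That is a quarter of a missing stress per line, times fourteen lines.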
You can define an irregularity score for each sonnet by summing over all syllables the absolute difference between its stress (1 or 0) and the average stressed fraction for that syllable. The most regular is sonnet 102; the most irregular is sonnet 86.[2]
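Here is a minimal sketch of that score. Two things are assumptions on my part rather than details confirmed by the text: the scansion encoding (strings of "1" and "0", ten syllables per line) and the choice to take the average separately for each line-and-syllable cell rather than pooling all lines together. The toy data is made up.

```python
from typing import List

Sonnet = List[str]  # hypothetical encoding: "1" = stressed, "0" = unstressed


def irregularity_scores(sonnets: List[Sonnet]) -> List[float]:
    """Sum over all cells of |stress - mean stress at that cell|."""
    n = len(sonnets)
    n_lines = len(sonnets[0])
    n_syll = len(sonnets[0][0])
    # Average stressed fraction for each (line, syllable) cell.
    mean = [[sum(int(s[i][j]) for s in sonnets) / n
             for j in range(n_syll)]
            for i in range(n_lines)]
    return [sum(abs(int(s[i][j]) - mean[i][j])
                for i in range(n_lines)
                for j in range(n_syll))
            for s in sonnets]


# Toy input (made up): two regular "sonnets" and one with a trochaic opening.
regular = ["0101010101"] * 14
trochaic_start = ["1001010101"] + ["0101010101"] * 13

scores = irregularity_scores([regular, regular, trochaic_start])
print(scores)
```

A sonnet that departs from what the others do at a given cell is penalised more than one that follows the crowd, so a common variation (such as a stressed opening syllable) costs less than a rare one.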
The scansion is often ambiguous. Even on the most famous opening line in this corpus,
Shall I compare thee to a summer’s day?
readers have a choice between Shall I (Ian McKellen) and Shall I (Patrick Stewart); I went with the former. I felt my own sense of the meter change as I read more of the sonnets, and I went back and corrected many of my earlier readings. Probably I still disagree with myself sometimes.
The ‘e’ in past tense -ed suffixes could be pronounced as a separate syllable or elided. In the original printing in 1609, the spelling distinguished these possibilities, but the modernised text that I was working with writes them all as -ed. Once I settled into the iambic rhythm, the elision or pronunciation came naturally:
O no! It is an ever fixed mark
There needs to be an unstressed syllable before mark, so fixèd has two syllables. This example is unremarkable.
More curious is that sometimes the -ed suffix is in the ictic position, rhymed with a clearly stressed word. Sonnet 86 (the most irregular by my earlier calculation):
Was it his spirit, by spirits taught to write
Above a mortal pitch, that struck me dead?
No, neither he, nor his compeers by night,
Giving him aid, my verse astonished.
Here astonishèd is pronounced with four syllables, and rhymes with dead. Was the -ed suffix really so heavy back then, or was Shakespeare a hack rhymer? (Another fun aspect of this quatrain is that spirit is one syllable but spirits is two; the second syllable could be elided or not as the meter demanded.)
A similar question arises for three-syllable words whose final syllable today is unstressed, like memory, majesty, ignorance, etc. Words like these are often at the end of a line, which sounds weak to my modern ear. Perhaps the ending vowels were pronounced more fully in 1600?
The heatmap above suggests not, and that Shakespeare was a hack rhymer. The percentage of stressed 10th syllables is noticeably higher in the couplet than in the quatrains: 99% versus 95%. Shakespeare wanted his sonnets to end strongly, and was very reluctant to end a sonnet on a weak third syllable, or even on a rhyme involving one. In no case does a couplet line end in a separately pronounced ictic -ed.
I conclude that in 1600 these words were already being stressed similarly to how they are today, and it is fair for me to judge their use at the ends of lines accordingly.
Nevertheless, there is no doubting that the ictus is on the 10th syllable in these cases, however weak the stress. Here is the same style of heatmap as earlier, but now showing the percentage of ictuses at each syllable, stressed or otherwise. It hides the pyrrhic substitutions:
I find an average of 5.001 ictuses per line, thanks to two lines in which I, perhaps heretically, hear six ictic stresses. Sonnet 8 includes
Sweets with sweets war not, joy delights in joy.
I’ve heard actors on YouTube do their best to leave war not unstressed, but it doesn’t work for me at all.
The first quatrain of sonnet 129 contains an explosion on lines three and four:
Th’expense of spirit in a waste of shame
Is lust in action; and till action, lust
Is perjured, murd’rous, bloody, full of blame,
Savage, extreme, rude, cruel, not to trust;
The Wikipedia article, presumably following some distinguished scholars, describes rude on line four as a stressed nonictus. Such a concept is coherent – stress is more continuum than binary – but to me the stress on rude is too strong; there are six beats in the line.
Dolniks
Data on Pastebin.
I started on this little project because I was reading a 1995 paper by Marina Tarlinskaja called ‘Beyond “Loose Iamb”: The Form and Themes of the English “Dolnik”’, although the JSTOR optical character recognition on the sans-serif title refers to ‘Loose Lamb’ instead. (I found the paper via the book chapter ‘How Yeats Learned to Scan’ by Hannah Sullivan, in turn found via a note by Sunil Iyengar.)
Tarlinskaja considers poems such as Robert Frost’s ‘The Road Not Taken’, which is close to being in iambic tetrameter, but most lines have an extra unstressed syllable or two (italicised):
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
Such a poem might be described as “loosely iambic”, but Tarlinskaja argues that it is a separate poetic form, which she calls the dolnik, borrowing from Russian. Her argument for the dolnik as a distinct form is a statistical one based on the frequency of two consecutive unstressed syllables being present between two ictuses. She collects such statistics across more than 150 poems and 14,000 lines, from the mid-18th to mid-20th century.
Unfortunately her analysis is strange and not entirely convincing to me. But her data by poem is given in an appendix, and I am going to present the statistics as I think they should be presented, and then worry that the result is not very robust.
The basic calculational goal is to make a histogram of the percentage of disyllabic inter-ictic intervals. Two possible approaches to this are to weight each poem equally (though she amalgamates some) and to weight each poem by line count. It feels like a 1000-line poem should count for more than an 8-line poem, but weighting by line would effectively reduce the dataset to just the long poems, and short poems are equally a part of the poetic tradition. A reasonable compromise might be to weight by the square root of the number of lines.
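To make the three weighting options concrete, here is a minimal sketch of a weighted histogram. The function name, the 5-point bin width, and the four data points are all made up for illustration; none of them come from Tarlinskaja's appendix.

```python
import math
from typing import Callable, List, Tuple

# Each poem: (percentage of disyllabic inter-ictic intervals, line count).
Poem = Tuple[float, int]


def weighted_histogram(poems: List[Poem],
                       weight: Callable[[int], float],
                       bin_width: float = 5.0,
                       max_pct: float = 100.0) -> List[float]:
    """Share of total weight falling in each percentage bin."""
    n_bins = int(math.ceil(max_pct / bin_width))
    totals = [0.0] * n_bins
    for pct, n_lines in poems:
        # Clamp so that exactly 100% lands in the last bin.
        idx = min(int(pct // bin_width), n_bins - 1)
        totals[idx] += weight(n_lines)
    grand = sum(totals)
    return [t / grand for t in totals]


# Made-up data: one long regular poem, one short regular poem,
# and two shorter, looser poems.
poems = [(1.0, 1000), (2.0, 16), (32.0, 64), (28.0, 8)]

by_poem = weighted_histogram(poems, lambda n: 1.0)       # all poems equal
by_line = weighted_histogram(poems, lambda n: float(n))  # weight by lines
by_sqrt = weighted_histogram(poems, math.sqrt)           # the compromise

print(by_poem[0], by_line[0], by_sqrt[0])  # share in the 0-5% bin
```

With this toy data the single long poem dominates the line-weighted histogram, counts no more than an eight-line poem in the poem-weighted one, and lands in between under the square-root weighting.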
Here is a histogram:
The distribution is bimodal: there is a first peak near zero, which covers poems that are very regular, and a second peak around 30%. There is disproportionately little English verse with between 5% and 15% disyllabic inter-ictic intervals – if poets in this time period wanted to break from strict regularity, they usually broke more decisively towards the rhythms of speech. Tarlinskaja modelled the latter on samples of prose from Dickens and Walter Scott, finding a percentage of about 34%. She calls poems in this part of the distribution ‘dolniks’.
It is a nice story, and it might be true, but the histogram is sensitive to the choice of weighting. (Tarlinskaja first binned the poems and then calculated a total for each bin by multiplying the number of lines in the bin by the number of poems in the bin. Not an intuitive idea to me! It has some desirable properties though, and gives qualitatively similar results to the sqrt(lines) weighting.) Here are the poem-weighted and line-weighted histograms:
It’s a little bit annoyingly inconclusive. It would be nice to see robustness to different weighting choices. My feeling is that the all-poems-equal weighting would satisfactorily show the bimodal distribution with a sample that went further back in time. Surely there was enough regular iambic verse to make the initial peak more prominent.
[1] Sonnet 99 has 15 lines; sonnet 126 isn’t a sonnet at all, but is instead six couplets; sonnet 145 is in iambic tetrameter rather than pentameter.
[2] The full list, ordered from most to least regular (I ignored feminine endings): 102, 74, 50, 93, 28, 43, 112, 133, 83, 11, 123, 20, 36, 45, 49, 18, 120, 96, 81, 92, 65, 150, 78, 12, 63, 98, 148, 79, 131, 73, 56, 23, 5, 152, 21, 16, 87, 135, 119, 14, 3, 149, 141, 85, 84, 40, 37, 57, 94, 88, 53, 130, 17, 24, 58, 55, 138, 97, 110, 128, 38, 64, 26, 82, 33, 44, 72, 47, 69, 121, 22, 7, 77, 134, 154, 30, 9, 67, 29, 101, 25, 31, 139, 27, 35, 129, 2, 104, 41, 39, 153, 136, 34, 143, 132, 8, 109, 48, 68, 6, 146, 147, 100, 54, 144, 71, 52, 114, 118, 13, 46, 61, 80, 15, 151, 127, 113, 95, 111, 1, 103, 105, 66, 142, 124, 59, 42, 62, 115, 140, 117, 125, 32, 122, 89, 108, 106, 60, 76, 90, 116, 51, 75, 19, 107, 70, 10, 137, 4, 91, 86.







I admire your attempt at establishing objective criteria to judge which are the most and least "regular" of the sonnets, though I suspect there will always be a fair amount of subjectivity in practice. For instance, sonnet 107 feels more radically varied to my ear than sonnet 86, most likely because it contains more beat shifts - including the less common patterns containing beat shifts, or in less common positions.
The opening swing (or trochee-iamb combi, if we were to insist on dividing it into individual feet) is so common that it feels a less notable departure than a swing elsewhere in the line, or another variant pattern containing a beat shift. And, indeed, a swing elsewhere in the line *does* produce a more dramatic switch-up in the rhythm.
I wholly disagree there is anything innately faulty in rhyming words that close on a light beat! Or even rhyming such a word with another that closes on a heavy beat! Most recently I memorised "Sonnets from the Portuguese", and Elizabeth Barrett Browning does it often enough: I really don't think it will do to say that she too is a "hack rhymer"!
I am also *very* much in the camp that you can have a fully stressed offbeat within an iambic template. As you mentioned sonnet 86, would you also consider this line to have six beats?
"No, neither he, nor his compeers by night,"
It seems impossible to me to pronounce "NO," at a lower stress than the beat on "NEIther". Unless you're reading that word as two light syllables, producing a swing? But even then, there are countless similar examples.
I recently wrote a long and very thorough post on variation within iambic pentameter (and more to come!), so that might interest you.
"For Brutus is an honourable man," is an example of what I call a "golden line": the default enlarged 3-beat rhythm of the pentameter is brought into sharp relief when the even-numbered beats are light.
I also specifically discuss the line "Sweets with sweets war not, joy delights in joy" in a note!
I am only too aware that no two metrists agree on everything, but my own suggestion is that you keep an open mind even if I say something you don't initially agree with: there's a lot of material here, and it all interlocks, so it may be worthwhile reserving your own opinion until you've covered all of it. Anyway, I hope it's of some interest, at least!: https://poemshape.substack.com/p/the-multifaceted-pentameter-part?utm_campaign=post-expanded-share&utm_medium=web