Long-time readers of this blog know that, along with morphology, a core research area of mine is vocabulary. Prompted by Seumas Macdonald and now as part of the Greek Texts Project, I started putting together some vocabulary coverage statistics for various subcorpora of Greek prose.

I’ve been publishing vocabulary coverage statistics for the Greek New Testament at least since 1996 (see GNT Verse Coverage Statistics and the more recent (and aptly named) Updated Vocabulary Coverage Statistics).

Back in The Core Vocabulary of New Testament Greek, I looked at Wilfred Major’s 50% and 80% lists for Classical Greek and constructed the equivalents for the Greek New Testament.

For a while, on and off, I’ve been working with reconciling Major’s list with the DCC Greek Core Vocabulary, my own GNT work, Helma Dik’s amazing work on Logeion, and other word lists based on frequency in some subcorpus of Ancient Greek. Back in early 2018, I also started https://vocab.perseus.org, primarily to serve up passage-specific vocabulary lists for https://scaife.perseus.org but also to enable exploration of vocabulary frequency in the Perseus Digital Library / Open Greek and Latin corpus.

As part of that last work, I put together an initial “core reading list” subcorpus based on works from reading lists from Harvard, Yale, and Tufts. My eventual goal was to allow the creation of custom reading lists and generate vocabulary for those. The data behind this was all based on an experimental lemmatisation done by Giuseppe Celano along with the “short defs” from Perseus via Logeion.

I’d been slowly getting back to my new vocabulary-tools code library for generating these kinds of stats for any lemmatised text when Seumas Macdonald asked about vocabulary in Plato, Lysias, and Xenophon—typical post-beginner prose.

I took the opportunity to generalise some more of my code (although I haven’t yet added it back to vocabulary-tools).

Plato + Lysias + Xenophon, as lemmatised by Celano, is 745,213 tokens with 13,274 lemmas, 3,457 of which are hapaxes within the subcorpus.
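Those three numbers fall straight out of a frequency table over the lemmatised tokens. Here’s a minimal sketch in Python, assuming lemmas is just a flat list with one lemma string per token (this isn’t the actual vocabulary-tools code):

from collections import Counter

def subcorpus_summary(lemmas):
    counts = Counter(lemmas)
    num_tokens = sum(counts.values())                        # total tokens
    num_lemmas = len(counts)                                 # distinct lemmas
    num_hapaxes = sum(1 for c in counts.values() if c == 1)  # lemmas occurring exactly once
    return num_tokens, num_lemmas, num_hapaxes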

Besides the actual list, what was of interest to both me and Seumas was:

  • how many lemmas are needed for coverage points such as 80% or 98%
  • what coverage particular numbers of lemmas (taken in frequency order) get you to

Now I’ve talked at length here and in conference talks about the limitations of:

  • just going by overall token coverage rather than coverage of larger units like verses, sentences, or paragraphs
  • just going by lemmas and not considering morphology, syntactic constructions, etc.

but this is still useful and interesting data.

Plato + Lysias + Xenophon

Perseus Plato + Lysias + Xenophon subcorpus:

The 50% point is reached at   48 lemmas (2454 occurrences at that point)
The 80% point is reached at  439 lemmas ( 181 occurrences at that point)
The 90% point is reached at 1242 lemmas (  50 occurrences at that point)
The 95% point is reached at 2519 lemmas (  16 occurrences at that point)
The 98% point is reached at 5003 lemmas (   5 occurrences at that point)
---
The 81.40% point is reached at  500 lemmas (159 occurrences at that point)
The 88.15% point is reached at 1000 lemmas ( 66 occurrences at that point)
The 93.59% point is reached at 2000 lemmas ( 25 occurrences at that point)
The 97.21% point is reached at 4000 lemmas (  7 occurrences at that point)
The 99.19% point is reached at 8000 lemmas (  2 occurrences at that point)

Just to quickly unpack that: the first line says that you can account for 50% of the tokens in the subcorpus with just the top 48 lemmas (by frequency). Furthermore, those 48 lemmas each occur at least 2,454 times in the subcorpus.

Similarly, the second-to-last line says that the top 4,000 lemmas by frequency all occur at least 7 times in the subcorpus and account for 97.21% of tokens.
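To make that arithmetic concrete, here’s a minimal sketch in Python of the kind of calculation involved (the function name, defaults, and output format are just for illustration; this isn’t the vocabulary-tools API). It walks the lemmas in frequency order, accumulating token coverage, and reports both kinds of figures above:

from collections import Counter

def coverage_report(lemmas, coverage_targets=(0.50, 0.80, 0.90, 0.95, 0.98),
                    lemma_breakpoints=(500, 1000, 2000, 4000, 8000)):
    # lemmas is assumed to be a flat list with one lemma string per token
    counts = Counter(lemmas).most_common()   # (lemma, count) pairs, most frequent first
    total = sum(count for _, count in counts)

    targets = iter(sorted(coverage_targets))
    target = next(targets, None)
    breakpoints = set(lemma_breakpoints)

    running = 0
    for rank, (_, count) in enumerate(counts, 1):
        running += count
        coverage = running / total
        # coverage points: how many lemmas are needed to reach, say, 80%
        while target is not None and coverage >= target:
            print(f"The {target:.0%} point is reached at {rank} lemmas "
                  f"({count} occurrences at that point)")
            target = next(targets, None)
        # lemma-count breakpoints: what coverage the top 500, 1000, ... lemmas give
        if rank in breakpoints:
            print(f"The {coverage:.2%} point is reached at {rank} lemmas "
                  f"({count} occurrences at that point)")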

Plato

Just looking at Plato (with some extra lemma count breakpoints):

The 50% point is reached at   43 lemmas (1365 occurrences at that point)
The 80% point is reached at  321 lemmas ( 120 occurrences at that point)
The 90% point is reached at  893 lemmas (  32 occurrences at that point)
The 95% point is reached at 1840 lemmas (  10 occurrences at that point)
The 98% point is reached at 3631 lemmas (   3 occurrences at that point)
---
The 84.74% point is reached at  500 lemmas (66 occurrences at that point)
The 90.91% point is reached at 1000 lemmas (27 occurrences at that point)
The 95.46% point is reached at 2000 lemmas ( 9 occurrences at that point)
The 97.34% point is reached at 3000 lemmas ( 4 occurrences at that point)
The 98.32% point is reached at 4000 lemmas ( 3 occurrences at that point)
The 98.93% point is reached at 5000 lemmas ( 2 occurrences at that point)

Plato Selection

With just a selection of Plato (Euthyphro, Apology, Crito, Symposium, Republic):

The 50% point is reached at   43 lemmas (529 occurrences at that point)
The 80% point is reached at  335 lemmas ( 45 occurrences at that point)
The 90% point is reached at  908 lemmas ( 13 occurrences at that point)
The 95% point is reached at 1745 lemmas (  5 occurrences at that point)
The 98% point is reached at 3160 lemmas (  2 occurrences at that point)
---
The 84.31% point is reached at  500 lemmas (28 occurrences at that point)
The 90.85% point is reached at 1000 lemmas (11 occurrences at that point)
The 95.81% point is reached at 2000 lemmas ( 4 occurrences at that point)
The 97.76% point is reached at 3000 lemmas ( 2 occurrences at that point)
The 98.76% point is reached at 4000 lemmas ( 1 occurrences at that point)
The 99.51% point is reached at 5000 lemmas ( 1 occurrences at that point)

New Testament

It’s interesting to compare that selection to MorphGNT, given it’s roughly the same size:

The 50% point is reached at   27 lemmas (662 occurrences at that point)
The 80% point is reached at  316 lemmas ( 48 occurrences at that point)
The 90% point is reached at  890 lemmas ( 13 occurrences at that point)
The 95% point is reached at 1753 lemmas (  5 occurrences at that point)
The 98% point is reached at 3103 lemmas (  2 occurrences at that point)
---
The 84.84% point is reached at  500 lemmas (27 occurrences at that point)
The 90.95% point is reached at 1000 lemmas (11 occurrences at that point)
The 95.80% point is reached at 2000 lemmas ( 4 occurrences at that point)
The 97.85% point is reached at 3000 lemmas ( 2 occurrences at that point)
The 98.94% point is reached at 4000 lemmas ( 1 occurrences at that point)
The 99.66% point is reached at 5000 lemmas ( 1 occurrences at that point)

although note that these figures are based on my own more carefully curated lemmatisation of the New Testament, not Celano’s data, so systematic differences between the two lemmatisations may make this comparison slightly problematic.

Core Reading List

Here’s the “core reading list” subcorpus, with some additional lemma-count breakpoints:

The 50% point is reached at    79 lemmas (2019 occurrences at that point)
The 80% point is reached at  1107 lemmas ( 134 occurrences at that point)
The 90% point is reached at  3020 lemmas (  39 occurrences at that point)
The 95% point is reached at  5948 lemmas (  14 occurrences at that point)
The 98% point is reached at 10920 lemmas (   5 occurrences at that point)
---
The 71.19% point is reached at   500 lemmas (310 occurrences at that point)
The 78.92% point is reached at  1000 lemmas (149 occurrences at that point)
The 86.17% point is reached at  2000 lemmas ( 69 occurrences at that point)
The 89.94% point is reached at  3000 lemmas ( 40 occurrences at that point)
The 92.28% point is reached at  4000 lemmas ( 26 occurrences at that point)
The 93.88% point is reached at  5000 lemmas ( 19 occurrences at that point)
The 95.05% point is reached at  6000 lemmas ( 14 occurrences at that point)
The 95.95% point is reached at  7000 lemmas ( 11 occurrences at that point)
The 96.65% point is reached at  8000 lemmas (  9 occurrences at that point)
The 97.20% point is reached at  9000 lemmas (  7 occurrences at that point)
The 97.66% point is reached at 10000 lemmas (  6 occurrences at that point)
The 98.33% point is reached at 12000 lemmas (  4 occurrences at that point)
The 98.98% point is reached at 15000 lemmas (  2 occurrences at that point)
The 99.57% point is reached at 20000 lemmas (  1 occurrences at that point)

Full Perseus

And finally, here’s the full Perseus / OGL (as of two years ago with the Celano lemmatisation):

The 50% point is reached at   42 lemmas (64483 occurrences at that point)
The 80% point is reached at  648 lemmas ( 3270 occurrences at that point)
The 90% point is reached at 1951 lemmas (  855 occurrences at that point)
The 95% point is reached at 4052 lemmas (  298 occurrences at that point)
The 98% point is reached at 8004 lemmas (   87 occurrences at that point)
---
The 77.44% point is reached at   500 lemmas (4263 occurrences at that point)
The 84.20% point is reached at  1000 lemmas (2018 occurrences at that point)
The 90.19% point is reached at  2000 lemmas ( 825 occurrences at that point)
The 93.14% point is reached at  3000 lemmas ( 480 occurrences at that point)
The 94.93% point is reached at  4000 lemmas ( 303 occurrences at that point)
The 96.10% point is reached at  5000 lemmas ( 208 occurrences at that point)
The 96.92% point is reached at  6000 lemmas ( 151 occurrences at that point)
The 97.53% point is reached at  7000 lemmas ( 114 occurrences at that point)
The 98.00% point is reached at  8000 lemmas (  87 occurrences at that point)
The 98.36% point is reached at  9000 lemmas (  69 occurrences at that point)
The 98.65% point is reached at 10000 lemmas (  55 occurrences at that point)
The 99.06% point is reached at 12000 lemmas (  36 occurrences at that point)
The 99.44% point is reached at 15000 lemmas (  20 occurrences at that point)
The 99.75% point is reached at 20000 lemmas (   8 occurrences at that point)

It is interesting how much more quickly the 50-80-90-95-98 points are hit with the full corpus than with the core reading list. Normally a larger corpus would take longer to reach them, but I think this reflects the fact that the “core reading list” has a richer vocabulary per token than the larger sample (lexical richness per token would be an interesting study in itself for any subcorpus).

Next Steps

Since calculating all this, Seumas and I have been working on a different prose subcorpus for post-beginner learners that combines the Plato selection, the New Testament, other orators in addition to Lysias, and other historians in addition to Xenophon. I’ll talk about that work in some future posts (and hopefully Seumas will too!).

I also want to talk about how the different subcorpora differ in which lemmas they contain. How much of the Plato 80% list is in the New Testament 80% list, for example (and vice versa)? There’s also the question of lexical dispersion. And there’s value in separating function words from content words and in grouping lemmas into word families. Lots more coming.
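To give a sense of the kind of overlap comparison I mean, here’s a minimal sketch (the helper and the plato_lemmas / nt_lemmas inputs are purely illustrative, not existing vocabulary-tools functions): collect the frequency-ordered lemmas needed to reach a coverage target in each subcorpus, then intersect the resulting sets.

from collections import Counter

def lemmas_to_coverage(lemmas, target=0.80):
    # accumulate lemmas in frequency order until the coverage target is reached
    counts = Counter(lemmas).most_common()
    total = sum(count for _, count in counts)
    running = 0
    core = set()
    for lemma, count in counts:
        running += count
        core.add(lemma)
        if running / total >= target:
            break
    return core

# e.g. plato_80 = lemmas_to_coverage(plato_lemmas)
#      nt_80 = lemmas_to_coverage(nt_lemmas)
#      overlap = plato_80 & nt_80    # how much of one 80% list is in the other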

The code for much of this will be added to vocabulary-tools when I get a chance, but if people are interested in other subcorpora before then, please get in contact with me.

UPDATE (2019-11-06): Now see Seumas’s post Sore Thumbs in Subcorpus Vocabulary, looking at particular words that differ in frequency between the New Testament and the larger Classical Greek prose subcorpus we’ve been working with.