GNT Verse Coverage with Frequency Ordering
[if you'll indulge me, I'm trying to get all my thoughts and previous writing on these topics in one place and this list is a good place to do it]
[this is based on a post to b-greek and my blog. I hope the table comes out! ]
It is fairly common, in the context of learning vocabulary for a particular corpus like the Greek New Testament, to talk about what proportion of the text one could read if one learnt the top N words. I even produced such a table for the GNT back in 1996—see New Testament Vocabulary Count Statistics.
But these sorts of numbers are highly misleading because they don't tell you what proportion of sentences (or, as a rough proxy in the GNT case, verses) you could read, only what proportion of words.
Reading theorists have suggested that you need to know 95% of the vocabulary of a sentence to comprehend it. So a more interesting set of statistics would be how many verses one knows at least 95% of the vocab of, given that one knows a certain number of words. Of course, there's a lot more to reading comprehension than knowing the vocab. But it was enough for me to decide to write some code yesterday afternoon to run against my MorphGNT database.
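To make the 95% criterion concrete, here is a minimal sketch of the per-verse check. It is not the code I ran against MorphGNT; the function names and the toy data are made up for illustration, and it assumes coverage is counted over word tokens (a repeated word counts each time it occurs).

```python
def coverage(target_items, known):
    """Fraction of the tokens in one target (e.g. a verse) that are known."""
    if not target_items:
        return 0.0
    hits = sum(1 for item in target_items if item in known)
    return hits / len(target_items)


def is_readable(target_items, known, threshold=0.95):
    """True if at least `threshold` of the target's tokens are known."""
    return coverage(target_items, known) >= threshold


# Hypothetical toy verse: four tokens, three distinct items.
toy_verse = ["a", "b", "a", "c"]
toy_known = {"a", "b"}

print(coverage(toy_verse, toy_known))           # 3 of the 4 tokens are known
print(is_readable(toy_verse, toy_known))        # False at the 95% threshold
print(is_readable(toy_verse, toy_known, 0.75))  # True at a 75% threshold
```

Counting distinct items rather than tokens would be an equally defensible choice and would give slightly different numbers.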
To give you a flavour with a specific example before moving to the final numbers, consider John 3.16, which is, from a vocabulary point of view, a very easy verse to read.
To be able to read 50% of it, you only need to know the top 28 lexemes in the GNT. To read 75% you only need the top 85 (up to κόσμος). With the top 204 lexemes, you can read 90% of the verse, and only a few more (up to 236, αἰώνιος) gives you 95%. The only word you would not have come across learning the top 236 words would be μονογενής, but even that is in the top 1,200.
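The figures for a single verse can be worked out by sorting its tokens by frequency rank and finding the rank at which enough tokens are covered. The sketch below shows the idea only; the function name, the placeholder items and their ranks are hypothetical and are not the John 3.16 data, and coverage is again assumed to be counted over tokens.

```python
def min_vocab_for_coverage(verse_items, rank, threshold):
    """Smallest N such that knowing the top N items covers at least
    `threshold` of the tokens in the verse.

    `rank` maps each item to its 1-based position in the learning order.
    """
    ranks = sorted(rank[item] for item in verse_items)
    total = len(ranks)
    for known_tokens, r in enumerate(ranks, start=1):
        # Knowing everything up to rank r covers at least `known_tokens` tokens.
        if known_tokens / total >= threshold:
            return r
    return None  # only reached for an empty verse or a threshold above 1


# Hypothetical ranks for a toy verse of four tokens.
toy_rank = {"a": 1, "b": 2, "c": 3, "d": 4}
toy_verse = ["a", "a", "b", "d"]

for threshold in (0.5, 0.75, 1.0):
    print(threshold, min_vocab_for_coverage(toy_verse, toy_rank, threshold))
```

In this toy verse the rarest item sits at rank 4, so full coverage needs the top 4 items even though half of the verse is already covered by the single most frequent one.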
This example does highlight some of the shortcomings of this sort of analysis. There's no consideration of necessary knowledge of morphology, syntax, idioms, etc., nor of the fact that the meaning of something like μονογενής is fairly easy to guess from knowledge of more common words. But I still think it's much more useful than the pure word coverage statistics I linked to above.
So let's actually run the numbers on the complete GNT. If you know the top N words, how many verses could you understand 50% of, 75%, 90% or 95% of...
vocab / coverage       any      50%      75%      90%      95%     100%
100                  99.9%    91.3%    24.4%     2.1%     0.6%     0.4%
200                  99.9%    96.9%    51.8%     9.8%     3.4%     2.5%
500                  99.9%    99.1%    82.3%    36.5%    18.0%    13.9%
1,000               100.0%    99.7%    93.6%    62.3%    37.3%    30.1%
1,500               100.0%    99.8%    97.2%    76.3%    53.5%    44.8%
2,000               100.0%    99.9%    98.4%    85.1%    65.5%    56.5%
3,000               100.0%   100.0%    99.4%    93.6%    81.0%    74.1%
4,000               100.0%   100.0%    99.7%    97.4%    90.0%    85.5%
5,000               100.0%   100.0%   100.0%    99.4%    96.5%    94.5%
all                 100.0%   100.0%   100.0%   100.0%   100.0%   100.0%
What this means is that, purely from a vocabulary point of view, if you knew the top 1,000 lexemes, then 37.3% of verses in the GNT would be 95% familiar to you.
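For anyone who wants to see how a table of this shape can be computed, here is a minimal sketch on made-up data. It is an illustration under stated assumptions, not the script I actually ran: the function names and toy verses are hypothetical, coverage is counted over tokens, the "any" column is taken to mean coverage greater than zero, and ties in frequency are broken by first occurrence (which is what `Counter.most_common` does).

```python
from collections import Counter


def frequency_ordering(targets):
    """Items sorted from most to least frequent across all targets."""
    counts = Counter(item for target in targets for item in target)
    return [item for item, _ in counts.most_common()]


def coverage_table(targets, ordering, sizes,
                   thresholds=(0.5, 0.75, 0.9, 0.95, 1.0)):
    """For each vocabulary size, the percentage of targets whose known-token
    fraction is above zero ("any") and at or above each threshold."""
    targets = [t for t in targets if t]  # ignore empty targets
    rows = {}
    for n in sizes:
        known = set(ordering[:n])
        fractions = [sum(1 for item in t if item in known) / len(t)
                     for t in targets]
        row = [100.0 * sum(1 for f in fractions if f > 0) / len(targets)]
        row += [100.0 * sum(1 for f in fractions if f >= th) / len(targets)
                for th in thresholds]
        rows[n] = row
    return rows


# Hypothetical toy corpus of three "verses".
toy_verses = [
    ["the", "word"],
    ["the", "the", "god"],
    ["the", "word", "was", "god"],
]

table = coverage_table(toy_verses, frequency_ordering(toy_verses), sizes=[1, 2, 4])
for n, row in table.items():
    print(n, " ".join(f"{p:6.1f}%" for p in row))
```

Each row is read the same way as the table above: the vocabulary size, then the share of verses reaching each coverage level.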
Note that this uses:
- verses as the reading target
- lexemes as the individual items to be learnt
- frequency of lexemes as the ordering
It is possible to alter any of these variables and in subsequent posts I will do this.
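As a rough illustration of what altering one of these variables involves, the sketch below switches the item being learnt between lexeme and inflected form while keeping verses as the target and frequency as the ordering. Everything here is hypothetical: the three toy "verses" are made up (they are not GNT verses), the names are my own for this example, and coverage is counted over tokens.

```python
from collections import Counter

# Each token is a (form, lexeme) pair; FORM and LEXEME pick which one to learn.
FORM, LEXEME = 0, 1

toy_verses = [
    [("ὁ", "ὁ"), ("λόγος", "λόγος")],
    [("τοῦ", "ὁ"), ("θεοῦ", "θεός")],
    [("ὁ", "ὁ"), ("θεός", "θεός")],
]


def frequency_ordering(verses, field):
    """Items (forms or lexemes) from most to least frequent."""
    counts = Counter(token[field] for verse in verses for token in verse)
    return [item for item, _ in counts.most_common()]


def readable_share(verses, field, n, threshold=0.95):
    """Share of verses whose tokens are at least `threshold` covered
    by the top `n` items of the chosen kind."""
    known = set(frequency_ordering(verses, field)[:n])
    readable = 0
    for verse in verses:
        hits = sum(1 for token in verse if token[field] in known)
        if hits / len(verse) >= threshold:
            readable += 1
    return readable / len(verses)


print(readable_share(toy_verses, LEXEME, 2))  # two items learnt as lexemes
print(readable_share(toy_verses, FORM, 2))    # two items learnt as forms
```

With the same budget of two items, learning lexemes makes more of the toy verses readable than learning forms, because each lexeme stands in for several inflected forms. Changing the target or the ordering would follow the same pattern: swap out the list of targets or the ordering function and leave the rest alone.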
New Testament Vocabulary Count Statistics (via Internet Archive's Wayback Machine): http://web.archive.org/web/19961104033056/www.entmp.org/HGrk/grammar/lexicon/NTcount.shtml
I've checked in my Python code as: http://code.google.com/p/graded-reader/source/browse/trunk/code/vocab-coverage.py
If you're not comfortable running it yourself, I can run it on any data you provide.
(if you send data, I suggest you do it off-list and be careful because a "reply" will go to the entire mailing list)
Remember that, as I said in my post, there's no consideration of necessary knowledge of morphology, syntax, idioms, etc. Over time, we can incorporate that, but for now the results are limited to the somewhat naïve assumptions that:
- comprehension is only at the level of the target (the verse in my example data)
- learning the items (lexemes in the example table I gave) is all that matters to comprehending the target
- all items are equally easy to learn
- there is no dependency between items
and, of course, the table assumes a frequency ordering of items. Soon I'll be starting a separate thread on alternative orderings.
But all that said, the numbers produced are far more useful than misleading notions like "the top 10 words account for 37% of the text".
Incidentally, here is the table when applied to forms in the Greek NT rather than lexemes:
forms / coverage        0%      50%      75%      90%      95%     100%
100                  99.8%    57.7%     1.1%     0.0%     0.0%     0.0%
200                  99.8%    79.2%     6.4%     0.3%     0.0%     0.0%
500                  99.9%    93.0%    27.0%     2.2%     0.5%     0.4%
1,000                99.9%    96.9%    51.4%     7.9%     2.3%     1.7%
2,000                99.9%    98.7%    72.5%    21.8%     7.9%     5.7%
5,000                99.9%    99.7%    91.0%    52.3%    28.6%    21.5%
8,000                99.9%    99.9%    96.7%    71.6%    47.6%    37.8%
12,000              100.0%    99.9%    99.2%    86.3%    64.8%    54.2%
16,000              100.0%   100.0%    99.9%    97.9%    88.9%    82.4%
20,000              100.0%   100.0%   100.0%   100.0%   100.0%   100.0%
The fact that it takes 1,000 forms just to get 2.3% of verses at 95% coverage indicates that frequency alone is not the way to go. Soon, I'll also produce similar tables using clauses (in the OpenText.org sense), rather than verses, as the target.