A post to the graded-reader mailing list from March 29, 2008.
I thought I’d calculate the vocabulary coverage table assuming the ordering generated for the post “just how much can frequency ordering be improved on?”. To do this, I modified vocab-coverage.py to load in an arbitrary learning programme instead of assuming a frequency ordering. The code is now checked in as vocab-coverage-arbitrary.py
.
Here’s the original frequency ordering of forms in the Greek NT (using counts rather than percentages in the cells):
0% 50% 75% 90% 95% 100%
100 7928 4585 88 1 0 0
200 7931 6291 515 26 4 4
500 7935 7388 2149 182 46 39
1000 7937 7700 4085 631 184 141
2000 7938 7838 5765 1736 628 456
5000 7939 7920 7232 4161 2275 1711
8000 7939 7935 7684 5691 3784 3004
12000 7941 7939 7879 6858 5149 4310
16000 7941 7941 7937 7777 7060 6549
20000 7941 7941 7941 7941 7941 7941
And here’s the table with the ordered produced in the “just how much
can frequency ordering be improved on?” post:
0% 50% 75% 90% 95% 100%
100 7896 1762 78 *37* *36* *36*
200 7927 4590 339 *81* *71* *70*
500 7933 6781 1572 *315* *225* *213*
1000 7935 7455 3155 *802* *526* *491*
2000 7936 7739 4872 *1820* *1242* *1144*
5000 7939 7869 6400 3592 *3246* *3244*
8000 7939 7908 7156 5071 *4745* *4742*
12000 7939 7924 7501 6501 *6463* *6463*
16000 7940 7933 7791 7646 *7645* *7645*
20000 7941 7941 7941 7941 7941 7941
I’ve marked with asterisks those instances where the number is better than the frequency ordering.
Note that because the ordering algorithm was highly biased towards reading entire verses, it is actually worse for coverage 75th or below. Even for 90% it’s only better for the first 2000 items.
But for the 100% familiarity level, you can see just how much better even the simple algorithm I used (which I will explain shortly) is than frequency ordering. For 200 forms, you get 70 verses instead of 4!
I’ll repeat the caveats I mentioned in the other post, though: items are considered independent and equally easy to learn, there’s no consideration of morphology, syntax, idiom and this is using verses as targets. We’ll fix all that over time.
James