Vocab Coverage Table for a Better Ordering

A post to the graded-reader mailing list from March 29, 2008.

I thought I’d calculate the vocabulary coverage table assuming the ordering generated for the post “just how much can frequency ordering be improved on?”. To do this, I modified vocab-coverage.py to load in an arbitrary learning programme instead of assuming a frequency ordering. The code is now checked in as vocab-coverage-arbitrary.py.
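As a rough illustration of the calculation (this is not the code in vocab-coverage-arbitrary.py, which isn’t reproduced here), a coverage table for an arbitrary ordering could be computed along these lines. The function name `coverage_table`, its parameters and the toy data are all made up for the sketch, and the `>=` test against each threshold is an assumption about how boundary cases are handled.

```python
def coverage_table(verses, ordering, item_counts, thresholds):
    """Count verses reaching each coverage threshold after learning N items.

    verses      -- list of verses, each a list of forms (tokens)
    ordering    -- forms in the order they are learned (the learning programme)
    item_counts -- numbers of learned items to report rows for, e.g. [100, 200]
    thresholds  -- integer percentages to report columns for, e.g. [0, 50, 100]

    Hypothetical helper: names and boundary handling are assumptions.
    """
    # 1-based position at which each form is first learned.
    rank = {}
    for position, form in enumerate(ordering, start=1):
        rank.setdefault(form, position)

    table = {}
    for n in item_counts:
        row = []
        for pct in thresholds:
            count = 0
            for verse in verses:
                if not verse:
                    continue  # skip empty verses rather than count them
                known = sum(1 for form in verse if form in rank and rank[form] <= n)
                # Integer arithmetic avoids floating-point trouble at the boundary.
                if known * 100 >= pct * len(verse):
                    count += 1
            row.append(count)
        table[n] = row
    return table


# Toy data, not taken from the Greek NT.
toy_verses = [["a", "b"], ["a", "c", "c", "d"], ["e"]]
toy_ordering = ["a", "c", "b", "d", "e"]
toy_table = coverage_table(toy_verses, toy_ordering, [1, 2, 5], [0, 50, 100])
```

Each row of the result gives, for that many learned items, the number of verses at or above each threshold. Passing a frequency-sorted list as `ordering` would give the baseline table; passing any other learning programme gives the alternative.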

Here’s the original frequency ordering of forms in the Greek NT (using counts rather than percentages in the cells):

                   0%     50%     75%     90%     95%     100%

          100    7928    4585      88       1       0        0
          200    7931    6291     515      26       4        4
          500    7935    7388    2149     182      46       39
         1000    7937    7700    4085     631     184      141
         2000    7938    7838    5765    1736     628      456
         5000    7939    7920    7232    4161    2275     1711
         8000    7939    7935    7684    5691    3784     3004
        12000    7941    7939    7879    6858    5149     4310
        16000    7941    7941    7937    7777    7060     6549
        20000    7941    7941    7941    7941    7941     7941

And here’s the table with the ordering produced in the “just how much can frequency ordering be improved on?” post:

                   0%     50%     75%     90%       95%      100%

          100    7896    1762      78     *37*      *36*      *36*
          200    7927    4590     339     *81*      *71*      *70*
          500    7933    6781    1572    *315*     *225*     *213*
         1000    7935    7455    3155    *802*     *526*     *491*
         2000    7936    7739    4872   *1820*    *1242*    *1144*
         5000    7939    7869    6400    3592     *3246*    *3244*
         8000    7939    7908    7156    5071     *4745*    *4742*
        12000    7939    7924    7501    6501     *6463*    *6463*
        16000    7940    7933    7791    7646     *7645*    *7645*
        20000    7941    7941    7941    7941      7941      7941

I’ve marked with asterisks those instances where the number is better than the frequency ordering.
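The marking is just a cell-by-cell comparison against the frequency-ordering table. A small sketch, using a hypothetical helper name `mark_improvements` and the 200-item rows from the two tables above:

```python
def mark_improvements(baseline_row, new_row):
    """Wrap each new count in asterisks where it beats the baseline count.

    Hypothetical helper for illustration; not part of the original scripts.
    """
    return [
        f"*{new}*" if new > base else str(new)
        for base, new in zip(baseline_row, new_row)
    ]


# 200-item rows copied from the two tables in this post.
frequency_200 = [7931, 6291, 515, 26, 4, 4]
improved_200 = [7927, 4590, 339, 81, 71, 70]
marked_200 = mark_improvements(frequency_200, improved_200)
```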

Note that because the ordering algorithm was highly biased towards making entire verses readable, it is actually worse at coverage levels of 75% or below. Even at 90% it’s only better for the first 2000 items.

But for the 100% familiarity level, you can see just how much better even the simple algorithm I used (which I will explain shortly) is than frequency ordering. For 200 forms, you get 70 verses instead of 4!

I’ll repeat the caveats I mentioned in the other post, though: items are considered independent and equally easy to learn, there’s no consideration of morphology, syntax or idiom, and this is using verses as targets. We’ll fix all that over time.

James