A post to the graded-reader mailing list from March 26, 2008.

I’m going to talk in more detail about alternatives to frequency order in a different thread, but first I wanted to share the results of a quite striking little test I did.

In my last post, I showed the vocab/coverage table applied to fully inflected forms in the Greek NT rather than to lexemes. You may have noticed that the 100% coverage column, and even the 95% coverage column, said 0.0% of verses for the 100 most frequent forms.

If you did, you might then have wondered: is this just a rounding error? The answer is no. Even if you knew the 100 most frequent inflected forms in the GNT, there is not a single verse in which you would know every form (assuming, of course, that you couldn’t guess).
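To make the "0.0% of verses" figure concrete, here is a minimal sketch of the calculation. It is not the code that produced the table; the function names and the toy data are made up for illustration, and it assumes each target is simply a list of its items.

```python
from collections import Counter


def rank_items(targets):
    """Map each item to its frequency rank (1 = most frequent).

    `targets` is a hypothetical dict of target id -> list of items.
    """
    counts = Counter(item for items in targets.values() for item in items)
    # Most frequent first; ties are broken alphabetically so the result is stable.
    ordered = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: rank for rank, item in enumerate(ordered, start=1)}


def fully_known_fraction(targets, ranks, n):
    """Fraction of targets in which every item is among the n most frequent."""
    known = sum(
        1 for items in targets.values()
        if all(ranks[item] <= n for item in items)
    )
    return known / len(targets)


# Toy data only, not taken from the GNT.
toy_targets = {
    "t1": ["a", "b"],
    "t2": ["a", "b", "c"],
    "t3": ["a", "d"],
}
toy_ranks = rank_items(toy_targets)

for n in (2, 3, 4):
    print(n, format(fully_known_fraction(toy_targets, toy_ranks, n), ".1%"))
# prints:
# 2 33.3%
# 3 66.7%
# 4 100.0%
```

A target counts as fully known only if its rarest item is within the top n, which is why a single rare form is enough to keep a verse out of the 100% column.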

I wanted to test whether this was because of just one outlier. So I modified the code that produced the table (adding 4 extra lines) to instead output a list of the top ten targets (i.e. verses) whose second least frequent item (i.e. form) is most frequent overall.

Here are the results:

032030      2         [1, 2, 1077]
030146     35         [1, 35, 524]
041135     46         [2, 46, 14597]
130528     66         [5, 19, 38, 45, 49, 59, 65, 66, 235]
071623     66         [5, 19, 38, 45, 59, 66, 235]
070323     68         [3, 3, 29, 65, 68, 131]
020940     72         [8, 18, 22, 22, 44, 49, 49, 72, 102]
012425     78         [36, 78, 2846]
060211     96         [8, 14, 18, 22, 79, 96, 4276]
130519     98         [7, 17, 98, 14731]

Each line gives the target, the frequency rank of its second least frequent form, and the ranks of all its forms. So, for example, target 032030 (Luke 20.30) consists of the 1st, 2nd and 1077th most frequent forms; target 030146 (Luke 1.46) consists of the 1st, 35th and 524th most frequent forms. If the rarest word weren’t needed, these verses would drop from needing the top 1077 forms to just the top 2, and from needing the top 524 forms to the top 35.
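The selection behind a listing like this can be sketched in a few lines. This is only an illustration under my own assumptions, not the actual script: the function name is hypothetical, and the input is assumed to be a dict from target id to the frequency ranks of its items. The three sample entries are copied from the listing above.

```python
def second_rarest_listing(target_ranks, top=10):
    """Return (target, second-rarest rank, sorted ranks) rows.

    Rows are ordered so that targets whose second least frequent item
    is most frequent overall (smallest rank) come first.
    """
    rows = []
    for target, ranks in target_ranks.items():
        ordered = sorted(ranks)
        if len(ordered) < 2:
            continue  # a one-item target has no second-rarest item
        rows.append((target, ordered[-2], ordered))
    rows.sort(key=lambda row: row[1])
    return rows[:top]


# Three entries from the listing above, deliberately out of order.
sample = {
    "041135": [2, 46, 14597],
    "032030": [1, 2, 1077],
    "030146": [1, 35, 524],
}

for target, second, ordered in second_rarest_listing(sample):
    # ordered[-1] is what you need to know now; `second` is what you
    # would need if the single rarest item were not required.
    print(target, second, ordered, "needs", ordered[-1], "vs", second)
# prints:
# 032030 2 [1, 2, 1077] needs 1077 vs 2
# 030146 35 [1, 35, 524] needs 524 vs 35
# 041135 46 [2, 46, 14597] needs 14597 vs 46
```

Sorting on the second-largest rank rather than the largest is the whole trick: it asks how soon each target would become readable if exactly one item were forgiven.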

Now you may argue that many of these are bad examples because the verse doesn’t make sense in isolation (a good reason to be more careful about what to use as targets) or that the one rare word is actually the one carrying most of the semantic weight.

But this little test demonstrates that sometimes a single rare item can massively delay reading an otherwise quite readable target unit.

By the way, here’s the same listing based on lexemes rather than fully inflected forms:

032030      2           [1, 2, 346]
030146      9           [2, 9, 509]
011615      9           [3, 4, 5, 7, 8, 9, 9, 33]
032448     13           [4, 13, 415]
090124     14           [1, 2, 6, 7, 14, 267]
021337     16           [4, 5, 9, 9, 12, 16, 588]
040620     17           [1, 3, 5, 7, 8, 9, 17, 180]
041135     19           [1, 19, 4752]
040426     19           [1, 1, 3, 4, 7, 8, 9, 19, 56]
031934     24           [1, 1, 3, 5, 9, 15, 23, 24, 311]

I’ll check in the code that produces this shortly.

James


It’s now available at

http://code.google.com/p/graded-reader/source/browse/trunk/code/if-only.py

James