Just How Much Can Frequency Ordering Be Improved On?

A post to the graded-reader mailing list from March 26, 2008.

Here's a quick demonstration. Recall that in my previous post I pointed out that learning the top 100 inflected forms gives you 0 (zero, nada) target verses in the GNT. I showed, for example, that target 130528 (1 Thessalonians 5.28) gets excluded because of one form that is #235, while the other eight forms appear in the top 66.
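
To make the blocking effect concrete, here is a minimal Python sketch, not the code from the previous post. Under pure frequency ordering, a target becomes readable only once the reader has learnt as far down the list as its rarest form. The names `forms_needed` and `readable_after_top_n` are hypothetical, and the forms `w1` to `w9` and their ranks are placeholders, chosen only to match the figures above (eight forms within the top 66, one at #235).

```python
def forms_needed(target_forms, rank):
    """How far down the frequency list a reader must learn
    before every form in the target is known."""
    return max(rank[form] for form in target_forms)


def readable_after_top_n(target_forms, rank, n):
    """True if learning the top n forms covers the whole target."""
    return forms_needed(target_forms, rank) <= n


# Placeholder data (an assumption, not real GNT ranks): nine forms,
# eight ranked within the top 66 and one ranked #235.
rank = {"w1": 1, "w2": 2, "w3": 4, "w4": 7, "w5": 12,
        "w6": 19, "w7": 40, "w8": 66, "w9": 235}
target = list(rank)

print(forms_needed(target, rank))               # 235
print(readable_after_top_n(target, rank, 100))  # False
```

A single rare form sets the cost of the whole target, however common the other forms are.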

Well, what if those 9 forms were learnt first? That is:

Χριστοῦ, κυρίου, Ἰησοῦ, ὑμῶν, μετά, τοῦ, χάρις, ἡ, ἡμῶν

Not only could 130528 be read, but also 071623.

Now if the reader learnt πάντων (just one more form), they could read three more verses: 140318, 191325 and 272221.

Now introduce these six forms:

καί, ὑμῖν, ἀπό, εἰρήνη, πατρός, θεοῦ

and suddenly seven more verses are readable: 140102, 070103, 100102, 110102, 090103, 180103, 080102.

This was just with one algorithm I'm experimenting with (which I'll explain and provide code for soon), and there are likely others that do better.

So instead of 100 forms giving 0 verses, we now have just 16 forms giving us 12 entire verses from an actual corpus.
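
The bookkeeping behind a tally like this can be sketched in a few lines of Python. This is not the ordering algorithm mentioned above, which is not described here. It only counts how many targets are fully readable after each batch of forms is learnt, treating a target as readable when every one of its forms is known. The function names and the tiny corpus (`t1` to `t4`, forms `a` to `e`) are made up for illustration.

```python
def readable_targets(targets, known):
    """Return the ids of targets whose forms are all in `known`."""
    known = set(known)
    return [tid for tid, forms in targets.items() if set(forms) <= known]


def cumulative_readable(targets, batches):
    """After each batch of newly learnt forms, record
    (number of forms known so far, number of targets readable)."""
    known = set()
    counts = []
    for batch in batches:
        known.update(batch)
        counts.append((len(known), len(readable_targets(targets, known))))
    return counts


# Made-up miniature corpus: target id -> the forms it contains.
targets = {
    "t1": ["a", "b", "c"],
    "t2": ["a", "c"],
    "t3": ["a", "b", "d"],
    "t4": ["d", "e"],
}
batches = [["a", "b", "c"], ["d"], ["e"]]

print(cumulative_readable(targets, batches))
# [(3, 2), (4, 3), (5, 4)]
```

Run over the real form lists for each verse, the same count would give the "forms learnt versus verses readable" figures quoted in this post.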

The usual caveats apply: items are considered independent and equally easy to learn; there's no consideration of morphology, syntax or idiom; and this is using verses as targets. We'll fix all that over time.

James