Mean Log Frequency of Lexemes

One component of many text readability measures is the mean log word frequency. Here I do a basic calculation of it for each chapter of the Greek New Testament (with code provided).

Usually, the mean log word frequency is used in conjunction with something like the log mean sentence length (for example in the Lexile® framework). The latter is used as a proxy for syntactic complexity but, since we have a syntactic analysis available, I think we can do better. I’ll explore that in a future post.

For now, though, I wanted to get a per-chapter measure just based on mean log frequency of lexemes.

The code is available here. It’s easy to adjust the targets (by default chapters, specified on line 14) and the items (by default lexemes, specified on line 15).

The result of running the script is something like this:

where the first column is -1000 times the mean log frequency (so the higher, the harder to read), the second column is the book and chapter number and the third column is just the number of word tokens in that chapter.
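
To make the calculation concrete, here is a minimal sketch of how such a score could be computed. It is not the script linked above: the function name, the toy data, and the choices of natural log and of relative frequency over the whole corpus are all my assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def mean_log_frequency_scores(tokens):
    """Score each target by -1000 * mean log frequency of its items.

    `tokens` is an iterable of (target, item) pairs, e.g. (chapter, lexeme).
    Assumption: "frequency" is the item's relative frequency in the whole
    corpus, and the log is the natural log.
    Returns {target: (score, token_count)}.
    """
    tokens = list(tokens)
    item_counts = Counter(item for _, item in tokens)
    total = len(tokens)

    log_sums = defaultdict(float)   # sum of log frequencies per target
    token_counts = Counter()        # number of tokens per target
    for target, item in tokens:
        log_sums[target] += math.log(item_counts[item] / total)
        token_counts[target] += 1

    return {
        target: (-1000 * log_sums[target] / token_counts[target],
                 token_counts[target])
        for target in token_counts
    }


# Toy data only (not real Greek New Testament counts).
toy_tokens = [
    ("ch1", "a"), ("ch1", "a"), ("ch1", "b"),
    ("ch2", "a"), ("ch2", "c"), ("ch2", "d"),
]
scores = mean_log_frequency_scores(toy_tokens)
for target, (score, count) in scores.items():
    print(f"{score:.0f} {target} {count}")
```

In the toy data, "ch2" contains more rare items than "ch1", so it gets the higher (harder) score. A different log base would rescale every score by the same factor and leave the ranking unchanged.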

If we sort this output, we should get a list of the easiest chapters to read (at least by the measure of mean log lexeme frequency):
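
As a rough sketch of that sorting step in Python, assuming each output row has been read into a (score, label, token count) tuple. The rows below are hypothetical placeholders, not real results.

```python
# Hypothetical rows in the three-column format described above:
# (score, chapter label, token count). Values are placeholders.
rows = [
    (1425.6, "chapter B", 3),
    (1059.4, "chapter A", 3),
    (1210.0, "chapter C", 5),
]

# Lower score = higher mean log frequency = easier by this measure.
easiest_first = sorted(rows, key=lambda row: row[0])

# Show the ten easiest (here there are only three rows).
for score, label, count in easiest_first[:10]:
    print(f"{score:.0f} {label} {count}")
```

Passing `reverse=True` to `sorted` would list the hardest chapters first instead.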

It is perhaps not surprising that the easiest chapters are from 1 John and John’s gospel (with Rev 10 coming in at number 10).

It will be interesting to see if we get similar results once we factor in some measure of syntactic complexity.

Incidentally, the most difficult chapter to read based on mean log lexeme frequency is 2 Peter 2, although 1 Timothy and Titus also feature quite a bit among the ten most difficult chapters.