Mean Log Frequency of Lexemes

One component of many text readability measures is the mean log word frequency. Here I do a basic calculation across the chapters of the Greek New Testament (with code provided).

Usually, the mean log word frequency is used in conjunction with something like the log mean sentence length (for example in the Lexile® framework). The latter is used as a proxy for syntactic complexity but, having a syntactic analysis, I think we can do better and I'll explore that in a future post.

For now, though, I wanted to get a per-chapter measure just based on mean log frequency of lexemes.

The code is available here. It's easy to adjust the targets (by default chapters, specified on line 14) and the items (by default lexemes, specified on line 15).

The result of running the script is something like this:

6153 0101 436
5757 0102 457
5471 0103 331
5487 0104 428
5437 0105 821
5532 0106 648

where the first column is -1000 times the mean log frequency (so the higher the number, the harder the chapter is to read), the second column is the book and chapter number, and the third column is the number of word tokens in that chapter.
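To make the first column concrete, here is a minimal sketch of how such a score could be computed. It is not the script linked above. It assumes the input is a sequence of (target, item) pairs, that "frequency" means an item's relative frequency across the whole corpus, and that the logarithm is natural; the function name and input shape are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mean_log_frequency_scores(tokens):
    """Return {target: (score, token_count)} for (target, item) pairs.

    Assumptions (not taken from the original script): frequency is the
    item's relative frequency over the whole corpus, and the log is natural.
    """
    tokens = list(tokens)
    item_counts = Counter(item for _, item in tokens)  # corpus-wide counts
    total = len(tokens)

    log_sums = defaultdict(float)  # sum of log frequencies per target
    token_counts = Counter()       # number of word tokens per target
    for target, item in tokens:
        log_sums[target] += math.log(item_counts[item] / total)
        token_counts[target] += 1

    # -1000 * mean log frequency, rounded, so that higher means harder
    return {
        target: (round(-1000 * log_sums[target] / n), n)
        for target, n in token_counts.items()
    }


# Toy data: two "chapters" and two "lexemes"
toy = [("0101", "a"), ("0101", "a"), ("0101", "b"), ("0102", "a")]
for target, (score, n) in sorted(mean_log_frequency_scores(toy).items()):
    print(score, target, n)
```

Because each token contributes the log frequency of its item, a chapter made up mostly of very common lexemes gets a mean closer to zero and so a lower score.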

If we sort this output, we should get a list of the easiest chapters to read (at least by the measure of mean log lexeme frequency):

4704 2304 449
4746 2305 429
4926 0417 498
4949 2301 207
4973 0414 577
5025 0408 905
5036 2303 467
5044 2302 585
5080 0403 657
5090 2710 291

It is perhaps not surprising that the easiest chapters are from 1 John and John's gospel (with Rev 10 coming in at number 10).
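The sorting step can be sketched as follows, assuming the output is available as lines of three whitespace-separated columns in the format shown earlier. The function name is hypothetical, and the sample is just the six lines of output quoted above rather than the full result.

```python
def easiest(lines, n=10):
    """Parse 'score ref tokens' lines and return the n lowest-scoring rows."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        score, ref, token_count = line.split()
        rows.append((int(score), ref, int(token_count)))
    # Lowest score first: lower means easier by this measure
    return sorted(rows, key=lambda row: row[0])[:n]


sample = """\
6153 0101 436
5757 0102 457
5471 0103 331
5487 0104 428
5437 0105 821
5532 0106 648
"""

for score, ref, token_count in easiest(sample.splitlines(), n=3):
    book, chapter = ref[:2], ref[2:]  # e.g. "01", "05"
    print(score, book, chapter, token_count)
```

Reversing the sort (reverse=True) would give the most difficult chapters instead.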

It will be interesting to see if we get similar results once we factor in some measure of syntactic complexity.

Incidentally, the most difficult chapter to read based on mean log lexeme frequency is 2 Peter 2, although 1 Timothy and Titus also feature quite a bit among the ten most difficult chapters.