Ordering Vocabulary by Pericope Dispersion

Jesse Egbert's plenary at JAECS 2020 is giving me a bunch of ideas for things to try on the New Testament and larger Greek corpora. In this post, I briefly explore text dispersion keyness using pericopes as a way of ordering vocabulary.

Back in "Lexical Dispersion in the Greek New Testament Via Gries’s DP" I wrote:

My sense is that dispersion might be a useful input to deciding what vocabulary to learn. For example διδαχή or σκότος might be better to learn before ἀρνίον because, even though they all have the same frequency, you are more likely to encounter διδαχή or σκότος in a random book or chapter.

Egbert's plenary (available here after free signup) encouraged me to try a very simple metric instead of frequency: what proportion of text units in the corpus does the word appear in? Egbert emphasises using linguistically meaningful units of text (definitely not fixed-length windows) and pericopes seem perfect for this. There are dispersion measures that allow for varying sizes of text unit (like Gries's DP) but it seemed to me that simply seeing what proportion of pericopes a word appears in might be a better measure than frequency of how important the word is to learn.

This downplays words that get repeated a lot in just a handful of pericopes and favours those that appear in lots of pericopes, even if only once or twice in each. Intuitively this makes sense: a word that appears 10 times in one passage in the New Testament (and nowhere else) isn't as generally useful to learn as a word that appears once in each of ten different passages. Overall corpus frequency can therefore be misleading because it treats these two cases as the same.
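To make the difference concrete, here is a minimal sketch in plain Python. It is not the code from the gist and does not use vocabulary-tools; the four pericopes and their lemma lists are made-up toy data, and the function names are my own for illustration. It just counts, for each lemma, the proportion of pericopes it appears in at least once, alongside its raw frequency.

```python
from collections import Counter

# Hypothetical toy data (not real NT counts): pericope id -> lemmas in it.
pericopes = {
    "P1": ["λόγος", "θεός", "ἀρνίον", "ἀρνίον", "ἀρνίον"],
    "P2": ["λόγος", "σκότος"],
    "P3": ["θεός", "σκότος", "διδαχή"],
    "P4": ["διδαχή", "λόγος"],
}


def pericope_dispersion(pericopes):
    """Proportion of pericopes in which each lemma appears at least once."""
    containing = Counter()
    for lemmas in pericopes.values():
        # set() so repeats within a single pericope count only once
        containing.update(set(lemmas))
    total = len(pericopes)
    return {lemma: n / total for lemma, n in containing.items()}


def frequency(pericopes):
    """Raw token count of each lemma across the whole corpus."""
    counts = Counter()
    for lemmas in pericopes.values():
        counts.update(lemmas)
    return counts


dispersion = pericope_dispersion(pericopes)
freq = frequency(pericopes)

# Order by dispersion (highest first), breaking ties by frequency, then lemma.
ranked = sorted(dispersion, key=lambda l: (-dispersion[l], -freq[l], l))

for lemma in ranked:
    print(f"{lemma}\t{dispersion[lemma]:.2f}\t{freq[lemma]}")
```

In this toy corpus λόγος and ἀρνίον both occur three times, but λόγος is spread over three of the four pericopes while ἀρνίον is confined to one, so they land at opposite ends of the dispersion ordering.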

With vocabulary-tools it was trivial to produce a list of all the New Testament lemmas sorted by pericope dispersion.

This gist contains the code and the list:

https://gist.github.com/jtauber/fc4b0476a4c4a94d7cb01d068161892e

Eyeballing the resultant list, it seems a very promising ordering although I welcome comments on anything interesting people notice.

Next steps are:

quantitative comparison with pure frequency
application to other lemmatized Greek corpora with meaningful text units similar to pericopes
try other meaningful text units I have for NT such as books or paragraphs or even sentences