Lexical Dispersion in the Greek New Testament Via Gries’s DP

Measures of dispersion are interesting to apply to a corpus because they tell you whether a word is spread across the parts of the corpus in proportion to their size or concentrated in just a few of them. I thought I’d play around with Gries’s DP as a measure of dispersion on the SBLGNT lemmas.

There are lots of measures of dispersion but Stefan Th. Gries’s is perhaps the simplest (see [1] for a detailed survey of lots of different measures as well as the original definition of his own).

Here it is in Python for lemmas:

dp = sum(abs((p[part] / t) - (lp[lemma].get(part, 0) / l[lemma])) for part in p) / 2

where:

p is a dictionary mapping each corpus part to the count of words in that part
t is the total count of words in the corpus (the sum of the values in p)
l is a dictionary mapping each lemma to the count of that lemma in the corpus
lp is a dictionary of dictionaries mapping each lemma and part to the count of the lemma in that part (a part where the lemma never occurs counts as 0)

but see [1] for some simple worked examples.
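To make the one-liner concrete, here is a minimal sketch that builds the three dictionaries from a flat list of (part, lemma) pairs and computes DP for every lemma. The function name gries_dp and the (part, lemma) input shape are assumptions made for illustration, not how the SBLGNT data is actually stored.

```python
from collections import Counter, defaultdict


def gries_dp(tokens):
    """Return {lemma: DP} for an iterable of (part, lemma) pairs."""
    p = Counter()              # part -> count of words in that part
    l = Counter()              # lemma -> count of that lemma in the corpus
    lp = defaultdict(Counter)  # lemma -> part -> count of the lemma in that part
    for part, lemma in tokens:
        p[part] += 1
        l[lemma] += 1
        lp[lemma][part] += 1
    t = sum(p.values())        # total words in the corpus
    # A Counter returns 0 for a part in which the lemma never occurs.
    return {
        lemma: sum(abs(p[part] / t - lp[lemma][part] / l[lemma]) for part in p) / 2
        for lemma in l
    }


# Toy corpus: two parts of two words each (made-up data).
toy = [("A", "x"), ("A", "y"), ("B", "x"), ("B", "z")]
toy_dp = gries_dp(toy)
```

In the toy corpus, "x" occurs once in each equal-sized part and so gets a DP of 0, while "y" and "z" each sit in only one of the two parts and get 0.5.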

One thing Gries doesn’t talk about (email me if you know of any discussion of this) is how to handle very low-frequency words, which crowd out everything else at the high-DP end: a word that occurs only once is necessarily confined to a single part.
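It follows from the formula that a lemma confined to a single part has a DP of one minus that part's share of the corpus, so the smaller the part, the closer the DP gets to 1. The sketch below checks this on made-up part sizes; the function name dp_single_part and the sizes are hypothetical.

```python
def dp_single_part(part_sizes, part):
    """DP of a lemma whose occurrences all fall within `part`."""
    t = sum(part_sizes.values())
    return sum(
        # Observed share is 1 in the lemma's own part and 0 everywhere else.
        abs(size / t - (1.0 if name == part else 0.0))
        for name, size in part_sizes.items()
    ) / 2


# Made-up word counts for three parts of very different sizes.
sizes = {"long": 900, "medium": 90, "short": 10}
t = sum(sizes.values())

in_short = dp_single_part(sizes, "short")  # close to 1 - 10/1000
in_long = dp_single_part(sizes, "long")    # close to 1 - 900/1000
```

This is why a hapax legomenon in a very short book ends up with a higher DP than a hapax in a long one, even though both occur exactly once.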

Using books as the parts, here are the top 10 most evenly dispersed lemmas in the GNT:

0466 ὁ
1085 εἰς
1154 καί
1178 ὅς
1250 εἰμί
1358 ποιέω
1382 γίνομαι
1385 πολύς
1395 μετά
1420 μή

Here are the top 10 least evenly dispersed lemmas (including all frequencies, even hapax legomena):

9984 φιλοπρωτεύω
9984 ἐπιδέχομαι
9984 μειζότερος
9984 Διοτρέφης
9984 φλυαρέω
9982 χάρτης
9982 κυρία
9976 προσοφείλω
9976 ἑκούσιος
9976 ἄχρηστος

but this list looks very different if we, say, restrict ourselves to lemmas that occur 5 times or more:

9827 ἀντίχριστος
9752 καταλαλέω
9687 ἐπιφάνεια
9681 νήφω
9680 ἀρετή
9667 μῦθος
9641 Μελχισέδεκ
9568 πλεονεκτέω
9557 νόημα
9532 ἐνέργεια

or 30 times or more:

8952 ἀρνίον
8085 καυχάομαι
8024 θηρίον
7987 μέλος
7969 εἴτε
7266 συνείδησις
7202 περιτομή
7199 θρόνος
7139 ὑποτάσσω
7116 Παῦλος

If we use chapters as the corpus division, we get a slightly different top ten most evenly distributed by Gries’s DP:

0677 ὁ
1440 καί
1913 εἰμί
2084 εἰς
2117 αὐτός
2259 ἐν
2366 οὗτος
2378 ὅς
2437 δέ
2561 οὐ

and the low-frequency problem is even worse at the other end, because a lemma can never appear in more chapters than it has occurrences.

It’s interesting to look, though, at chapters within a single book. For example, here are the most evenly distributed lemmas in John’s gospel using chapters for parts:

0574 ὁ
0867 καί
0977 αὐτός
1331 οὐ
1391 οὗτος
1440 ὅτι
1480 λέγω
1569 δέ
1576 εἰμί
1658 εἰς

and here are the least evenly distributed lemmas that occur at least 10 times:

9470 σταυρόω
9414 Ἀβραάμ
9126 νίπτω
8958 Πιλᾶτος
8914 πρόβατον
8812 Λάζαρος
8493 καρπός
8426 ἄρτος
8371 προσκυνέω
8221 ψυχή

Obviously Gries’s DP is extremely easy to calculate, and I plan to experimentally include it in the Greek Vocabulary Tool for the Perseus Project but there are still some things to work out with low frequency words.

It’s very interesting, though, as a way of contrasting words that otherwise have the same frequency in a corpus. For example, here are all the lemmas that occur exactly 30 times in the SBLGNT, with their book-based Gries’s DP:

3276 διδαχή
3558 ἐγγύς
3708 σκότος
4143 ἀγοράζω
5360 σκανδαλίζω
5833 συνέρχομαι
6230 ἴδε
6485 ἐπικαλέω
7266 συνείδησις
8952 ἀρνίον

There is a massive range in the DP, from 3276 for διδαχή to 8952 for ἀρνίον, which I think is quite illustrative.

Here is the list with their chapter-based DP (notice how high the lowest DP now is):

8769 ἀγοράζω
8821 σκότος
8869 συνέρχομαι
8958 σκανδαλίζω
9016 ἐγγύς
9016 διδαχή
9034 ἴδε
9083 ἐπικαλέω
9441 συνείδησις
9609 ἀρνίον

One of my reasons for exploring Gries’s DP (and potentially other measures of lexical dispersion) is the application to language learning. My sense is that dispersion might be a useful input to deciding what vocabulary to learn. For example, διδαχή or σκότος might be better to learn before ἀρνίον because, even though they all have the same frequency, you are more likely to encounter διδαχή or σκότος in a random book or chapter.
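One way this could feed into a vocabulary list is to sort by frequency first and break ties with DP. The sketch below does that for the three lemmas just mentioned, using the book-based figures from the list above in the same four-digit form. The tie-breaking rule and the function name learning_order are assumptions for illustration, not a worked-out method.

```python
# (lemma, frequency, book-based DP), copied from the exactly-30-times list above.
entries = [
    ("ἀρνίον", 30, 8952),
    ("σκότος", 30, 3708),
    ("διδαχή", 30, 3276),
]


def learning_order(entries):
    """Most frequent first; among equal frequencies, most evenly dispersed (lowest DP) first."""
    ranked = sorted(entries, key=lambda e: (-e[1], e[2]))
    return [lemma for lemma, _freq, _dp in ranked]


order = learning_order(entries)
```

Here order puts διδαχή first, then σκότος, then ἀρνίον. Sorting on frequency alone could not separate the three.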

[1] Gries, Stefan Th. (2008) Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13:4. John Benjamins.