Measures of dispersion are interesting to apply to a corpus because they tell you whether a word is distributed across parts of the corpus as expected or concentrated more in just some parts. I thought I’d play around with Gries’s DP as a measure of dispersion on the SBLGNT lemmas.
There are many measures of dispersion, but Stefan Th. Gries’s is perhaps the simplest (see [1] for a detailed survey of the alternatives as well as the original definition of his own).
Here it is in Python for lemmas:
dp = sum(abs((p[part] / t) - (lp[lemma][part] / l[lemma])) for part in p) / 2
where:
p[part]
is a dictionary mapping corpus part to the count of words in that part
t
is the total count of words in the corpus (the sum of the counts in p)
l[lemma]
is a dictionary mapping lemmas to the count of that lemma in the corpus
lp[lemma][part]
is a dictionary of dictionaries mapping lemmas and parts to the count of the lemma in that part
but see [1] for some simple worked examples.
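As a minimal sketch of how the one-liner could be wired up end to end, the following builds the three tables from (part, lemma) pairs and wraps the formula in a function. The names build_counts and gries_dp and the toy corpus are my own illustrative assumptions, not anything taken from the SBLGNT data; the only change to the formula is a .get(part, 0) so that parts where the lemma never occurs count as zero.

```python
from collections import Counter, defaultdict


def build_counts(tokens):
    """Build the p, l and lp tables from an iterable of (part, lemma) pairs."""
    p = Counter()              # part -> count of words in that part
    l = Counter()              # lemma -> count of that lemma in the corpus
    lp = defaultdict(Counter)  # lemma -> part -> count of the lemma in that part
    for part, lemma in tokens:
        p[part] += 1
        l[lemma] += 1
        lp[lemma][part] += 1
    return p, l, lp


def gries_dp(lemma, p, l, lp):
    """Gries's DP: half the summed absolute difference between the expected
    share of each part (p[part] / t) and the lemma's observed share in it."""
    t = sum(p.values())  # total count of words in the corpus
    return sum(
        abs(p[part] / t - lp[lemma].get(part, 0) / l[lemma]) for part in p
    ) / 2


# Invented toy corpus (hypothetical data): two parts of two words each.
tokens = [("A", "x"), ("A", "y"), ("B", "x"), ("B", "z")]
p, l, lp = build_counts(tokens)
dp_x = gries_dp("x", p, l, lp)  # "x" occurs once in each part
dp_y = gries_dp("y", p, l, lp)  # "y" occurs only in part A
```

In this toy corpus "x" is spread exactly in proportion to the part sizes, so its DP is 0, while "y" sits entirely in one of two equal parts, so its DP is 0.5.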
One thing Gries doesn’t talk about (email me if you know of any discussion of this) is how to handle very low-frequency words, which will dominate the high end of the DP range.
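A short derivation from the formula itself suggests why this happens. Writing s_i for the expected share of part i (p[part] / t) and v_i for the lemma's observed share in part i (lp[lemma][part] / l[lemma]), both symbols being my own shorthand rather than Gries's notation, consider a lemma whose occurrences all fall in a single part k, so that v_k = 1 and every other v_i = 0:

```latex
\mathrm{DP} = \frac{1}{2}\sum_{i}\lvert s_i - v_i\rvert
            = \frac{1}{2}\Bigl[(1 - s_k) + \sum_{i \neq k} s_i\Bigr]
            = \frac{1}{2}\bigl[(1 - s_k) + (1 - s_k)\bigr]
            = 1 - s_k
```

Every hapax legomenon is necessarily in this situation, so its DP is simply one minus the relative size of the part it occurs in, and the smaller that part, the closer the value gets to 1.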
Using books as the parts, here are the top 10 most evenly dispersed lemmas in the GNT:
0.0466 ὁ
0.1085 εἰς
0.1154 καί
0.1178 ὅς
0.1250 εἰμί
0.1358 ποιέω
0.1382 γίνομαι
0.1385 πολύς
0.1395 μετά
0.1420 μή
Here are the top 10 least evenly dispersed lemmas (including all frequencies, even hapax legomena):
0.9984 φιλοπρωτεύω
0.9984 ἐπιδέχομαι
0.9984 μειζότερος
0.9984 Διοτρέφης
0.9984 φλυαρέω
0.9982 χάρτης
0.9982 κυρία
0.9976 προσοφείλω
0.9976 ἑκούσιος
0.9976 ἄχρηστος
but this list looks very different if we, say, restrict ourselves to lemmas that occur 5 times or more:
0.9827 ἀντίχριστος
0.9752 καταλαλέω
0.9687 ἐπιφάνεια
0.9681 νήφω
0.9680 ἀρετή
0.9667 μῦθος
0.9641 Μελχισέδεκ
0.9568 πλεονεκτέω
0.9557 νόημα
0.9532 ἐνέργεια
or 30 times or more:
0.8952 ἀρνίον
0.8085 καυχάομαι
0.8024 θηρίον
0.7987 μέλος
0.7969 εἴτε
0.7266 συνείδησις
0.7202 περιτομή
0.7199 θρόνος
0.7139 ὑποτάσσω
0.7116 Παῦλος
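The frequency cutoffs used for these lists can be sketched as a filter applied before ranking. This is only one way to do it: the function names, the min_count parameter and the toy counts below are all hypothetical, and the counts are invented rather than drawn from the SBLGNT.

```python
def gries_dp(lemma, p, l, lp):
    """Gries's DP for one lemma, given the p, l and lp tables."""
    t = sum(p.values())  # total count of words in the corpus
    return sum(
        abs(p[part] / t - lp[lemma].get(part, 0) / l[lemma]) for part in p
    ) / 2


def rank_by_dp(p, l, lp, min_count=1, least_even_first=True):
    """Return (dp, lemma) pairs for lemmas occurring at least min_count times."""
    rows = [
        (gries_dp(lemma, p, l, lp), lemma)
        for lemma, count in l.items()
        if count >= min_count  # drop lemmas below the frequency cutoff
    ]
    return sorted(rows, reverse=least_even_first)


# Invented toy counts (hypothetical): two parts of unequal size.
p = {"A": 60, "B": 40}
l = {"the": 10, "rare": 1, "lamb": 5}
lp = {"the": {"A": 6, "B": 4}, "rare": {"B": 1}, "lamb": {"A": 5}}

all_lemmas = [lemma for _, lemma in rank_by_dp(p, l, lp)]
five_plus = [lemma for _, lemma in rank_by_dp(p, l, lp, min_count=5)]
```

With no cutoff the hapax "rare" tops the list, and with min_count=5 it drops out and "lamb" takes its place, which mirrors how the lists above change as the threshold rises.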
If we use chapters as the corpus division, we get a slightly different top ten most evenly distributed lemmas by Gries’s DP:
0.0677 ὁ
0.1440 καί
0.1913 εἰμί
0.2084 εἰς
0.2117 αὐτός
0.2259 ἐν
0.2366 οὗτος
0.2378 ὅς
0.2437 δέ
0.2561 οὐ
and obviously this is even more problematic for lower frequency words at the other end.
It’s interesting to look, though, at chapters within a single book. For example, here are the most evenly distributed lemmas in John’s gospel using chapters for parts:
0.0574 ὁ
0.0867 καί
0.0977 αὐτός
0.1331 οὐ
0.1391 οὗτος
0.1440 ὅτι
0.1480 λέγω
0.1569 δέ
0.1576 εἰμί
0.1658 εἰς
and here are the least evenly distributed lemmas that occur at least 10 times:
0.9470 σταυρόω
0.9414 Ἀβραάμ
0.9126 νίπτω
0.8958 Πιλᾶτος
0.8914 πρόβατον
0.8812 Λάζαρος
0.8493 καρπός
0.8426 ἄρτος
0.8371 προσκυνέω
0.8221 ψυχή
Obviously Gries’s DP is extremely easy to calculate, and I plan to experimentally include it in the Greek Vocabulary Tool for the Perseus Project, but there are still some things to work out with low-frequency words.
It’s very interesting, though, as a way of contrasting words that otherwise have the same frequency in a corpus. For example, here are all the lemmas that occur exactly 30 times in the SBLGNT, with their book-based Gries’s DP:
0.3276 διδαχή
0.3558 ἐγγύς
0.3708 σκότος
0.4143 ἀγοράζω
0.5360 σκανδαλίζω
0.5833 συνέρχομαι
0.6230 ἴδε
0.6485 ἐπικαλέω
0.7266 συνείδησις
0.8952 ἀρνίον
The DP values span a massive range (from 0.3276 to 0.8952), which I think is quite illustrative.
Here is the list with their chapter-based DP (notice how high the lowest DP now is):
0.8769 ἀγοράζω
0.8821 σκότος
0.8869 συνέρχομαι
0.8958 σκανδαλίζω
0.9016 ἐγγύς
0.9016 διδαχή
0.9034 ἴδε
0.9083 ἐπικαλέω
0.9441 συνείδησις
0.9609 ἀρνίον
One of my reasons for exploring Gries’s DP (and potentially other measures of lexical dispersion) is the application to language learning. My sense is that dispersion might be a useful input to deciding what vocabulary to learn. For example, διδαχή or σκότος might be better to learn before ἀρνίον because, even though they all have the same frequency, you are more likely to encounter διδαχή or σκότος in a random book or chapter.
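One possible way to fold dispersion into a learning order, offered only as a sketch and not as a worked-out method, is to sort by frequency first and use DP as a tie-breaker so that the more evenly dispersed lemma comes first. The learning_order function is a hypothetical name; the three entries use the frequency and book-based DP values listed above.

```python
# (lemma, corpus frequency, book-based DP), values copied from the lists above
vocab = [
    ("ἀρνίον", 30, 0.8952),
    ("διδαχή", 30, 0.3276),
    ("σκότος", 30, 0.3708),
]


def learning_order(entries):
    """Sort by descending frequency, then by ascending DP as a tie-breaker."""
    return sorted(entries, key=lambda entry: (-entry[1], entry[2]))


ordered = [lemma for lemma, _, _ in learning_order(vocab)]
```

Here all three lemmas tie on frequency, so the order is decided entirely by DP: διδαχή, then σκότος, then ἀρνίον. Whether a strict tie-breaker is the right way to combine the two measures is an open question; a weighted combination would be another option.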
[1] Gries, Stefan Th. (2008) Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13:4. John Benjamins.