Seumas Macdonald asked me about vocabulary coverage for each work of Plato, assuming one has learnt the vocabulary of the Greek New Testament (GNT).
It turned out to be very simple to do with vocabulary-tools, and you can now see the script in the repo as examples3.py.
But here let me share the results and give some caveats.
In the table below:
- lemmas is the number of unique lemmas in the work; e.g. Crito has 712 unique lemmas
- tokens is the number of total tokens in the work; e.g. Crito has 4,172 tokens
- GNT lemmas is how many of those lemmas are in the GNT; e.g. 433 (of the 712) lemmas in Crito are also in the GNT
- GNT tokens is how many of the total tokens in the work have lemmas in the GNT; e.g. 3,429 of the 4,172 tokens in Crito have lemmas in the GNT
- % GNT lemmas and % GNT tokens just express those counts as percentages; e.g. 60.81% of the lemmas in Crito are in the GNT and 82.19% of tokens in Crito have lemmas seen in the GNT
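To make those column definitions concrete, here is a minimal sketch of the calculation in plain Python. It is not the examples3.py script and does not use the vocabulary-tools API; the function name `coverage` and the toy data are made up purely for illustration, and the input is assumed to be one lemma per token.

```python
from collections import Counter


def coverage(work_lemmas, known_lemmas):
    """Return lemma and token coverage of a work by a known vocabulary.

    work_lemmas: list with one lemma per token of the work (hypothetical input).
    known_lemmas: set of lemmas assumed to be known (e.g. all GNT lemmas).
    """
    counts = Counter(work_lemmas)
    lemma_total = len(counts)           # the "lemmas" column
    token_total = sum(counts.values())  # the "tokens" column
    known_lemma_count = sum(1 for lemma in counts if lemma in known_lemmas)
    known_token_count = sum(n for lemma, n in counts.items() if lemma in known_lemmas)
    return {
        "lemmas": lemma_total,
        "tokens": token_total,
        "known_lemmas": known_lemma_count,
        "known_tokens": known_token_count,
        # Guard against an empty work to avoid dividing by zero.
        "pct_lemmas": round(100 * known_lemma_count / lemma_total, 2) if lemma_total else 0.0,
        "pct_tokens": round(100 * known_token_count / token_total, 2) if token_total else 0.0,
    }


# Toy data only: seven tokens, five distinct lemmas, four of them "known".
work = ["ὁ", "λόγος", "εἰμί", "ὁ", "δαιμόνιον", "ὁ", "σοφιστής"]
known = set(work[:5])  # everything except the last lemma
result = coverage(work, known)
print(result)
```

Because the frequent lemmas count once towards lemma coverage but many times towards token coverage, the token percentage comes out higher than the lemma percentage, just as in the table.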
id | title | lemmas | tokens | GNT lemmas | GNT tokens | % GNT lemmas | % GNT tokens |
---|---|---|---|---|---|---|---|
001 | Euthyphro | 690 | 5,181 | 441 | 4,274 | 63.91% | 82.49% |
002 | Apology | 1,112 | 8,745 | 631 | 7,357 | 56.74% | 84.13% |
003 | Crito | 712 | 4,172 | 433 | 3,429 | 60.81% | 82.19% |
004 | Phaedo | 1,921 | 21,825 | 1,000 | 18,033 | 52.06% | 82.63% |
005 | Cratylus | 1,607 | 17,944 | 781 | 14,701 | 48.60% | 81.93% |
006 | Theaetetus | 2,072 | 22,489 | 966 | 17,962 | 46.62% | 79.87% |
007 | Sophist | 1,598 | 16,024 | 788 | 12,932 | 49.31% | 80.70% |
008 | Statesman | 2,013 | 16,953 | 937 | 13,384 | 46.55% | 78.95% |
009 | Parmenides | 805 | 15,155 | 478 | 12,738 | 59.38% | 84.05% |
010 | Philebus | 1,567 | 17,668 | 800 | 14,076 | 51.05% | 79.67% |
011 | Symposium | 1,949 | 17,461 | 961 | 13,806 | 49.31% | 79.07% |
012 | Phaedrus | 2,266 | 16,645 | 1,027 | 12,935 | 45.32% | 77.71% |
013 | Alcibiades 1 | 1,138 | 10,264 | 628 | 8,356 | 55.18% | 81.41% |
014 | Alcibiades 2 | 711 | 4,268 | 420 | 3,449 | 59.07% | 80.81% |
015 | Hipparchus | 431 | 2,256 | 281 | 1,890 | 65.20% | 83.78% |
016 | Lovers | 473 | 2,391 | 284 | 1,923 | 60.04% | 80.43% |
017 | Theages | 627 | 3,485 | 374 | 2,811 | 59.65% | 80.66% |
018 | Charmides | 919 | 8,311 | 534 | 6,875 | 58.11% | 82.72% |
019 | Laches | 960 | 7,674 | 559 | 6,100 | 58.23% | 79.49% |
020 | Lysis | 911 | 6,980 | 524 | 5,729 | 57.52% | 82.08% |
021 | Euthydemus | 1,268 | 12,453 | 686 | 10,015 | 54.10% | 80.42% |
022 | Protagoras | 1,753 | 17,795 | 869 | 14,306 | 49.57% | 80.39% |
023 | Gorgias | 1,938 | 26,337 | 951 | 21,467 | 49.07% | 81.51% |
024 | Meno | 961 | 9,791 | 534 | 8,066 | 55.57% | 82.38% |
025 | Hippias Major | 958 | 8,448 | 528 | 6,730 | 55.11% | 79.66% |
026 | Hippias Minor | 698 | 4,360 | 396 | 3,387 | 56.73% | 77.68% |
027 | Ion | 721 | 4,024 | 382 | 3,012 | 52.98% | 74.85% |
028 | Menexenus | 958 | 4,808 | 571 | 3,985 | 59.60% | 82.88% |
029 | Cleitophon | 418 | 1,549 | 284 | 1,293 | 67.94% | 83.47% |
030 | Republic | 4,846 | 88,878 | 1,782 | 71,377 | 36.77% | 80.31% |
031 | Timaeus | 2,666 | 23,662 | 1,122 | 18,644 | 42.09% | 78.79% |
032 | Critias | 1,130 | 4,950 | 638 | 3,997 | 56.46% | 80.75% |
033 | Minos | 528 | 2,859 | 309 | 2,333 | 58.52% | 81.60% |
034 | Laws | 5,227 | 103,193 | 1,804 | 82,652 | 34.51% | 80.09% |
035 | Epinomis | 1,014 | 6,309 | 590 | 5,135 | 58.19% | 81.39% |
036 | Epistles | 2,026 | 16,964 | 1,015 | 13,768 | 50.10% | 81.16% |
It’s encouraging how many works are above the 80% level. Here are some caveats, though:
- the Plato lemmatization is from Diorisis, so it has not been checked and may have errors throwing things off
- the GNT lemmatization is MorphGNT and so even if Diorisis got it “right” it may have a different lemmatization scheme than MorphGNT
- this assumes 100% knowledge of GNT lemmas
- this doesn’t take into account word families nor individual forms
- this coverage calculation theoretically favours shorter works. You can see that in how much lower the % GNT lemmas is for longer works like the Laws (34.51%) and Republic (36.77%), although (perhaps significantly) this doesn’t actually seem to skew token coverage, which stays at about 80% for both
Favouring shorter works isn’t necessarily a bad thing if the goal is to find the most readable (by vocabulary) works of Plato post-GNT.
Here’s a run of the code assuming only the 80% level of GNT vocabulary (the most frequent lemmas, enough to cover 80% of GNT tokens) rather than the whole thing.
id | title | lemmas | tokens | GNT lemmas | GNT tokens | % GNT lemmas | % GNT tokens |
---|---|---|---|---|---|---|---|
001 | Euthyphro | 690 | 5,181 | 149 | 3,135 | 21.59% | 60.51% |
002 | Apology | 1,112 | 8,745 | 165 | 5,551 | 14.84% | 63.48% |
003 | Crito | 712 | 4,172 | 150 | 2,581 | 21.07% | 61.86% |
004 | Phaedo | 1,921 | 21,825 | 214 | 13,647 | 11.14% | 62.53% |
005 | Cratylus | 1,607 | 17,944 | 192 | 11,208 | 11.95% | 62.46% |
006 | Theaetetus | 2,072 | 22,489 | 215 | 13,416 | 10.38% | 59.66% |
007 | Sophist | 1,598 | 16,024 | 183 | 9,644 | 11.45% | 60.18% |
008 | Statesman | 2,013 | 16,953 | 194 | 9,577 | 9.64% | 56.49% |
009 | Parmenides | 805 | 15,155 | 140 | 9,852 | 17.39% | 65.01% |
010 | Philebus | 1,567 | 17,668 | 187 | 10,209 | 11.93% | 57.78% |
011 | Symposium | 1,949 | 17,461 | 208 | 10,437 | 10.67% | 59.77% |
012 | Phaedrus | 2,266 | 16,645 | 212 | 9,395 | 9.36% | 56.44% |
013 | Alcibiades 1 | 1,138 | 10,264 | 177 | 6,296 | 15.55% | 61.34% |
014 | Alcibiades 2 | 711 | 4,268 | 142 | 2,566 | 19.97% | 60.12% |
015 | Hipparchus | 431 | 2,256 | 111 | 1,339 | 25.75% | 59.35% |
016 | Lovers | 473 | 2,391 | 104 | 1,427 | 21.99% | 59.68% |
017 | Theages | 627 | 3,485 | 124 | 2,129 | 19.78% | 61.09% |
018 | Charmides | 919 | 8,311 | 158 | 5,277 | 17.19% | 63.49% |
019 | Laches | 960 | 7,674 | 165 | 4,632 | 17.19% | 60.36% |
020 | Lysis | 911 | 6,980 | 150 | 4,204 | 16.47% | 60.23% |
021 | Euthydemus | 1,268 | 12,453 | 181 | 7,640 | 14.27% | 61.35% |
022 | Protagoras | 1,753 | 17,795 | 195 | 10,973 | 11.12% | 61.66% |
023 | Gorgias | 1,938 | 26,337 | 205 | 16,301 | 10.58% | 61.89% |
024 | Meno | 961 | 9,791 | 159 | 6,042 | 16.55% | 61.71% |
025 | Hippias Major | 958 | 8,448 | 154 | 5,123 | 16.08% | 60.64% |
026 | Hippias Minor | 698 | 4,360 | 134 | 2,446 | 19.20% | 56.10% |
027 | Ion | 721 | 4,024 | 133 | 2,236 | 18.45% | 55.57% |
028 | Menexenus | 958 | 4,808 | 161 | 2,877 | 16.81% | 59.84% |
029 | Cleitophon | 418 | 1,549 | 113 | 966 | 27.03% | 62.36% |
030 | Republic | 4,846 | 88,878 | 252 | 53,090 | 5.20% | 59.73% |
031 | Timaeus | 2,666 | 23,662 | 210 | 13,555 | 7.88% | 57.29% |
032 | Critias | 1,130 | 4,950 | 171 | 2,872 | 15.13% | 58.02% |
033 | Minos | 528 | 2,859 | 121 | 1,776 | 22.92% | 62.12% |
034 | Laws | 5,227 | 103,193 | 250 | 58,891 | 4.78% | 57.07% |
035 | Epinomis | 1,014 | 6,309 | 165 | 3,700 | 16.27% | 58.65% |
036 | Epistles | 2,026 | 16,964 | 211 | 10,229 | 10.41% | 60.30% |
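One way to build a reduced vocabulary like the one behind this second table is to take lemmas in descending frequency order until they account for the target share of tokens. The sketch below is my assumption about what an "80% level" means, written in plain Python rather than with the vocabulary-tools API; `lemmas_for_coverage` is a hypothetical helper and the corpus is toy data.

```python
from collections import Counter


def lemmas_for_coverage(corpus_lemmas, target=0.8):
    """Return the most frequent lemmas needed to cover `target` of all tokens.

    corpus_lemmas: list with one lemma per token (hypothetical input).
    Lemmas tied in frequency at the cut-off are taken in whatever order
    Counter.most_common() yields them, so the boundary is somewhat arbitrary.
    """
    counts = Counter(corpus_lemmas)
    total = sum(counts.values())
    selected = set()
    covered = 0
    for lemma, n in counts.most_common():
        if covered >= target * total:
            break
        selected.add(lemma)
        covered += n
    return selected


# Toy corpus: 10 tokens, with frequencies 6, 3 and 1.
corpus = ["ὁ"] * 6 + ["καί"] * 3 + ["λέγω"]
selected = lemmas_for_coverage(corpus, target=0.8)
print(selected)  # the two most frequent lemmas already cover 9 of 10 tokens
```

The resulting set can then stand in for the full GNT lemma set when computing coverage of each work.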
The Plato coverage generally drops from around 80% to 60%, which suggests it might be worth “topping up” one’s vocabulary with some common Plato words not in the GNT before embarking on a specific work. It would be easy to generate such a list with vocabulary-tools.
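As a rough illustration of what such a "top-up" list involves, here is a sketch in plain Python. It does not use the vocabulary-tools API; `top_up_list` is a hypothetical name, and both the work and the known set are invented toy data rather than real Plato or GNT lemmas lists.

```python
from collections import Counter


def top_up_list(work_lemmas, known_lemmas, size=10):
    """Return the `size` most frequent lemmas of a work that are not yet known.

    work_lemmas: list with one lemma per token (hypothetical input).
    known_lemmas: set of lemmas assumed to be known already.
    """
    unknown_counts = Counter(
        lemma for lemma in work_lemmas if lemma not in known_lemmas
    )
    return unknown_counts.most_common(size)


# Toy data only; the "known" set is a stand-in, not real GNT vocabulary.
work = ["σοφιστής", "ὁ", "σοφιστής", "ἐπιστήμη", "λέγω",
        "σοφιστής", "ἐπιστήμη", "διαλεκτική"]
toy_known = {work[1], work[4]}
top_up = top_up_list(work, toy_known, size=2)
for lemma, count in top_up:
    print(lemma, count)
```

Learning the lemmas at the top of such a list first gives the largest gain in token coverage per word learnt.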
But it was quite striking to me in both tables just how little the token % drops due to length (in contrast to the lemma %).
This just goes to show that longer works introduce a lot of new words but very sparsely (probably with only one occurrence in many cases).
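A quick way to check that hunch would be to count how many of the unknown lemmas in a work occur only once. This is a sketch in plain Python on toy data, not something I have run on the Plato texts; `unknown_lemma_profile` is a hypothetical helper and not part of vocabulary-tools.

```python
from collections import Counter


def unknown_lemma_profile(work_lemmas, known_lemmas):
    """Summarise the lemmas of a work that fall outside the known vocabulary.

    Returns (unknown lemma count, unknown token count, number of unknown
    lemmas occurring exactly once in the work).
    """
    counts = Counter(
        lemma for lemma in work_lemmas if lemma not in known_lemmas
    )
    unknown_lemmas = len(counts)
    unknown_tokens = sum(counts.values())
    once_only = sum(1 for n in counts.values() if n == 1)
    return unknown_lemmas, unknown_tokens, once_only


# Toy data only: one known lemma, three unknown ones.
work = ["ὁ", "σοφιστής", "ὁ", "ἐπιστήμη", "ὁ", "διαλεκτική", "ὁ", "σοφιστής"]
toy_known = {work[0]}
profile = unknown_lemma_profile(work, toy_known)
print(profile)
```

If most unknown lemmas in the longer works turn out to be once-only, that would explain why lemma coverage falls with length while token coverage barely moves.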
I might explore that graphically in a follow-up post.