Seumas Macdonald asked me about vocabulary coverage for each work of Plato assuming one has learnt the New Testament vocabulary.

It turned out to be very simple to do with vocabulary-tools and you can now see the script in the repo as examples3.py.

But here let me share the results and give some caveats.

In the table below:

  • lemmas is the number of unique lemmas in the work; e.g. Crito has 712 unique lemmas
  • tokens is the number of total tokens in the work; e.g. Crito has 4,172 tokens
  • GNT lemmas is how many of those lemmas are in the GNT; e.g. 433 (of the 712) lemmas in Crito are also in the GNT
  • GNT tokens is how many of the total tokens in the work have lemmas in the GNT; e.g. 3,429 of the 4,172 tokens in Crito have lemmas in the GNT
  • % GNT lemmas and % GNT tokens just express those counts as percentages; e.g. 60.81% of the lemmas in Crito are in the GNT and 82.19% of tokens in Crito have lemmas seen in the GNT

| id | title | lemmas | tokens | GNT lemmas | GNT tokens | % GNT lemmas | % GNT tokens |
|---|---|---|---|---|---|---|---|
| 001 | Euthyphro | 690 | 5,181 | 441 | 4,274 | 63.91% | 82.49% |
| 002 | Apology | 1,112 | 8,745 | 631 | 7,357 | 56.74% | 84.13% |
| 003 | Crito | 712 | 4,172 | 433 | 3,429 | 60.81% | 82.19% |
| 004 | Phaedo | 1,921 | 21,825 | 1,000 | 18,033 | 52.06% | 82.63% |
| 005 | Cratylus | 1,607 | 17,944 | 781 | 14,701 | 48.6% | 81.93% |
| 006 | Theaetetus | 2,072 | 22,489 | 966 | 17,962 | 46.62% | 79.87% |
| 007 | Sophist | 1,598 | 16,024 | 788 | 12,932 | 49.31% | 80.7% |
| 008 | Statesman | 2,013 | 16,953 | 937 | 13,384 | 46.55% | 78.95% |
| 009 | Parmenides | 805 | 15,155 | 478 | 12,738 | 59.38% | 84.05% |
| 010 | Philebus | 1,567 | 17,668 | 800 | 14,076 | 51.05% | 79.67% |
| 011 | Symposium | 1,949 | 17,461 | 961 | 13,806 | 49.31% | 79.07% |
| 012 | Phaedrus | 2,266 | 16,645 | 1,027 | 12,935 | 45.32% | 77.71% |
| 013 | Alcibiades 1 | 1,138 | 10,264 | 628 | 8,356 | 55.18% | 81.41% |
| 014 | Alcibiades 2 | 711 | 4,268 | 420 | 3,449 | 59.07% | 80.81% |
| 015 | Hipparchus | 431 | 2,256 | 281 | 1,890 | 65.2% | 83.78% |
| 016 | Lovers | 473 | 2,391 | 284 | 1,923 | 60.04% | 80.43% |
| 017 | Theages | 627 | 3,485 | 374 | 2,811 | 59.65% | 80.66% |
| 018 | Charmides | 919 | 8,311 | 534 | 6,875 | 58.11% | 82.72% |
| 019 | Laches | 960 | 7,674 | 559 | 6,100 | 58.23% | 79.49% |
| 020 | Lysis | 911 | 6,980 | 524 | 5,729 | 57.52% | 82.08% |
| 021 | Euthydemus | 1,268 | 12,453 | 686 | 10,015 | 54.1% | 80.42% |
| 022 | Protagoras | 1,753 | 17,795 | 869 | 14,306 | 49.57% | 80.39% |
| 023 | Gorgias | 1,938 | 26,337 | 951 | 21,467 | 49.07% | 81.51% |
| 024 | Meno | 961 | 9,791 | 534 | 8,066 | 55.57% | 82.38% |
| 025 | Hippias Major | 958 | 8,448 | 528 | 6,730 | 55.11% | 79.66% |
| 026 | Hippias Minor | 698 | 4,360 | 396 | 3,387 | 56.73% | 77.68% |
| 027 | Ion | 721 | 4,024 | 382 | 3,012 | 52.98% | 74.85% |
| 028 | Menexenus | 958 | 4,808 | 571 | 3,985 | 59.6% | 82.88% |
| 029 | Cleitophon | 418 | 1,549 | 284 | 1,293 | 67.94% | 83.47% |
| 030 | Republic | 4,846 | 88,878 | 1,782 | 71,377 | 36.77% | 80.31% |
| 031 | Timaeus | 2,666 | 23,662 | 1,122 | 18,644 | 42.09% | 78.79% |
| 032 | Critias | 1,130 | 4,950 | 638 | 3,997 | 56.46% | 80.75% |
| 033 | Minos | 528 | 2,859 | 309 | 2,333 | 58.52% | 81.6% |
| 034 | Laws | 5,227 | 103,193 | 1,804 | 82,652 | 34.51% | 80.09% |
| 035 | Epinomis | 1,014 | 6,309 | 590 | 5,135 | 58.19% | 81.39% |
| 036 | Epistles | 2,026 | 16,964 | 1,015 | 13,768 | 50.1% | 81.16% |
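
To make the column definitions concrete, here is a minimal sketch of how one row of the table could be computed. It is plain Python and not the vocabulary-tools API or the actual examples3.py script; the function name `coverage`, its arguments, and the toy data are all made up for illustration.

```python
from collections import Counter


def coverage(work_lemmas, known_lemmas):
    """Compute the table columns for one work.

    work_lemmas: a list with one lemma per token of the work (hypothetical input).
    known_lemmas: the set of lemmas the reader is assumed to know.
    """
    counts = Counter(work_lemmas)
    lemmas = len(counts)           # unique lemmas in the work
    tokens = sum(counts.values())  # total tokens in the work
    known_lemma_count = sum(1 for lemma in counts if lemma in known_lemmas)
    known_token_count = sum(n for lemma, n in counts.items() if lemma in known_lemmas)
    return {
        "lemmas": lemmas,
        "tokens": tokens,
        "GNT lemmas": known_lemma_count,
        "GNT tokens": known_token_count,
        "% GNT lemmas": round(100 * known_lemma_count / lemmas, 2),
        "% GNT tokens": round(100 * known_token_count / tokens, 2),
    }


# Toy data, invented for the example (not real Plato or GNT counts).
toy_work = ["καί", "ὁ", "λέγω", "ὁ", "φιλόσοφος", "καί", "ὁ", "σοφιστής"]
toy_gnt = {"καί", "ὁ", "λέγω", "θεός"}
row = coverage(toy_work, toy_gnt)
print(row)
```

Note that the membership test is an exact string match, so any difference in how two sources spell or encode a lemma counts as a miss.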

It’s encouraging how many works are above the 80% level. Here are some caveats, though:

  1. the Plato lemmatization is from Diorisis so has not been checked and may have errors throwing things off
  2. the GNT lemmatization is MorphGNT and so even if Diorisis got it “right” it may have a different lemmatization scheme than MorphGNT
  3. this assumes 100% knowledge of GNT lemmas
  4. this doesn’t take into account word families nor individual forms
  5. this coverage calculation theoretically favours shorter works. You can see that in how much lower the % GNT lemmas is for longer works like the Laws and Republic, although (perhaps significantly) this doesn’t actually seem to skew token coverage.

Favouring shorter works isn’t necessarily a bad thing if the goal is to find the most readable (by vocabulary) works of Plato post-GNT.

Here’s a run of the code assuming only the 80% level of GNT vocabulary (the most frequent lemmas, enough to cover 80% of GNT tokens) rather than the whole thing.

| id | title | lemmas | tokens | GNT lemmas | GNT tokens | % GNT lemmas | % GNT tokens |
|---|---|---|---|---|---|---|---|
| 001 | Euthyphro | 690 | 5,181 | 149 | 3,135 | 21.59% | 60.51% |
| 002 | Apology | 1,112 | 8,745 | 165 | 5,551 | 14.84% | 63.48% |
| 003 | Crito | 712 | 4,172 | 150 | 2,581 | 21.07% | 61.86% |
| 004 | Phaedo | 1,921 | 21,825 | 214 | 13,647 | 11.14% | 62.53% |
| 005 | Cratylus | 1,607 | 17,944 | 192 | 11,208 | 11.95% | 62.46% |
| 006 | Theaetetus | 2,072 | 22,489 | 215 | 13,416 | 10.38% | 59.66% |
| 007 | Sophist | 1,598 | 16,024 | 183 | 9,644 | 11.45% | 60.18% |
| 008 | Statesman | 2,013 | 16,953 | 194 | 9,577 | 9.64% | 56.49% |
| 009 | Parmenides | 805 | 15,155 | 140 | 9,852 | 17.39% | 65.01% |
| 010 | Philebus | 1,567 | 17,668 | 187 | 10,209 | 11.93% | 57.78% |
| 011 | Symposium | 1,949 | 17,461 | 208 | 10,437 | 10.67% | 59.77% |
| 012 | Phaedrus | 2,266 | 16,645 | 212 | 9,395 | 9.36% | 56.44% |
| 013 | Alcibiades 1 | 1,138 | 10,264 | 177 | 6,296 | 15.55% | 61.34% |
| 014 | Alcibiades 2 | 711 | 4,268 | 142 | 2,566 | 19.97% | 60.12% |
| 015 | Hipparchus | 431 | 2,256 | 111 | 1,339 | 25.75% | 59.35% |
| 016 | Lovers | 473 | 2,391 | 104 | 1,427 | 21.99% | 59.68% |
| 017 | Theages | 627 | 3,485 | 124 | 2,129 | 19.78% | 61.09% |
| 018 | Charmides | 919 | 8,311 | 158 | 5,277 | 17.19% | 63.49% |
| 019 | Laches | 960 | 7,674 | 165 | 4,632 | 17.19% | 60.36% |
| 020 | Lysis | 911 | 6,980 | 150 | 4,204 | 16.47% | 60.23% |
| 021 | Euthydemus | 1,268 | 12,453 | 181 | 7,640 | 14.27% | 61.35% |
| 022 | Protagoras | 1,753 | 17,795 | 195 | 10,973 | 11.12% | 61.66% |
| 023 | Gorgias | 1,938 | 26,337 | 205 | 16,301 | 10.58% | 61.89% |
| 024 | Meno | 961 | 9,791 | 159 | 6,042 | 16.55% | 61.71% |
| 025 | Hippias Major | 958 | 8,448 | 154 | 5,123 | 16.08% | 60.64% |
| 026 | Hippias Minor | 698 | 4,360 | 134 | 2,446 | 19.2% | 56.1% |
| 027 | Ion | 721 | 4,024 | 133 | 2,236 | 18.45% | 55.57% |
| 028 | Menexenus | 958 | 4,808 | 161 | 2,877 | 16.81% | 59.84% |
| 029 | Cleitophon | 418 | 1,549 | 113 | 966 | 27.03% | 62.36% |
| 030 | Republic | 4,846 | 88,878 | 252 | 53,090 | 5.2% | 59.73% |
| 031 | Timaeus | 2,666 | 23,662 | 210 | 13,555 | 7.88% | 57.29% |
| 032 | Critias | 1,130 | 4,950 | 171 | 2,872 | 15.13% | 58.02% |
| 033 | Minos | 528 | 2,859 | 121 | 1,776 | 22.92% | 62.12% |
| 034 | Laws | 5,227 | 103,193 | 250 | 58,891 | 4.78% | 57.07% |
| 035 | Epinomis | 1,014 | 6,309 | 165 | 3,700 | 16.27% | 58.65% |
| 036 | Epistles | 2,026 | 16,964 | 211 | 10,229 | 10.41% | 60.3% |
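
For readers wondering how an "80% level" vocabulary might be built, here is one possible sketch. It assumes the level means the smallest set of most frequent lemmas that together cover 80% of the tokens in the source text; that reading, the function name `core_vocabulary`, and the toy data are my assumptions here, and none of this is the vocabulary-tools API.

```python
from collections import Counter


def core_vocabulary(lemmas, target=0.8):
    """Return the most frequent lemmas needed to cover `target` of all tokens.

    lemmas: a list with one lemma per token of the source text (hypothetical input).
    """
    counts = Counter(lemmas)
    total = sum(counts.values())
    core = set()
    covered = 0
    # Walk lemmas from most to least frequent, stopping once coverage is reached.
    for lemma, n in counts.most_common():
        if covered >= target * total:
            break
        core.add(lemma)
        covered += n
    return core


# Toy data, invented for the example: 10 tokens, so the 80% target is 8 tokens.
toy_text = ["καί"] * 5 + ["ὁ"] * 3 + ["λέγω", "θεός"]
core = core_vocabulary(toy_text)
print(sorted(core))
```

The resulting set could then stand in for the full GNT lemma set when computing coverage for each work.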

The Plato coverage generally drops from around 80% to around 60%, which suggests it might be worth “topping up” one’s vocabulary with some common Plato words not in the GNT before embarking on a specific work. It would be easy to generate such a list with vocabulary-tools.
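
As a rough illustration of what such a top-up list involves (again in plain Python rather than vocabulary-tools itself, with a hypothetical function name `top_up_list` and invented toy data), one could count the lemmas of a work that are not already known and take the most frequent:

```python
from collections import Counter


def top_up_list(work_lemmas, known_lemmas, size=10):
    """Return the `size` most frequent lemmas in the work that are not yet known.

    work_lemmas: a list with one lemma per token of the work (hypothetical input).
    known_lemmas: the set of lemmas the reader is assumed to know.
    """
    unknown = Counter(lemma for lemma in work_lemmas if lemma not in known_lemmas)
    return unknown.most_common(size)


# Toy data, invented for the example.
toy_work = ["καί", "ἀρετή", "ἀρετή", "σοφιστής", "ἀρετή", "σοφιστής", "ὁ", "ἐπιστήμη"]
toy_known = {"καί", "ὁ"}
top_up = top_up_list(toy_work, toy_known, size=2)
print(top_up)
```

Learning the lemmas at the top of this list first gives the largest gain in token coverage per word learnt.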

But it was quite striking to me in both tables just how little the token % drops due to length (in contrast to the lemma %).

This just goes to show that longer works introduce a lot of new words but very sparsely (probably with only one occurrence in many cases).
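
One way to check that hunch would be to count how many of the unknown lemmas in a work occur only once. The sketch below is an assumption about how that check could look, not something run for this post; the function name `unknown_lemma_profile` and the toy data are invented.

```python
from collections import Counter


def unknown_lemma_profile(work_lemmas, known_lemmas):
    """Summarise the lemmas of a work that are not in the known set.

    Returns (unknown lemmas, tokens with unknown lemmas, unknown lemmas occurring once).
    """
    counts = Counter(lemma for lemma in work_lemmas if lemma not in known_lemmas)
    unknown_lemmas = len(counts)
    unknown_tokens = sum(counts.values())
    single_occurrence = sum(1 for n in counts.values() if n == 1)
    return unknown_lemmas, unknown_tokens, single_occurrence


# Toy data, invented for the example.
toy_work = ["καί", "ἀρετή", "ἀρετή", "σοφιστής", "ἀρετή", "σοφιστής", "ὁ", "ἐπιστήμη"]
toy_known = {"καί", "ὁ"}
profile = unknown_lemma_profile(toy_work, toy_known)
print(profile)
```

If most unknown lemmas in a long work turn out to occur only once, that would account for the lemma percentage falling much faster than the token percentage.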

I might explore that graphically in a follow-up post.