Plato Vocabulary Coverage After the New Testament

Seumas Macdonald asked me about vocabulary coverage for each work of Plato assuming one has learnt the New Testament vocabulary.

It turned out to be very simple to do with vocabulary-tools and you can now see the script in the repo as examples3.py.

But here let me share the results and give some caveats.

In the table below:

lemmas is the number of unique lemmas in the work; e.g. Crito has 712 unique lemmas
tokens is the number of total tokens in the work; e.g. Crito has 4,172 tokens
GNT lemmas is how many of those lemmas are in the GNT; e.g. 433 (of the 712) lemmas in Crito are also in the GNT
GNT tokens is how many of the total tokens in the work have lemmas in the GNT; e.g. 3,429 of the 4,172 tokens in Crito have lemmas in the GNT
% GNT lemmas and % GNT tokens just express those counts as percentages; e.g. 60.81% of the lemmas in Crito are in the GNT and 82.19% of tokens in Crito have lemmas seen in the GNT
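
To make the column definitions concrete, here is a minimal sketch of the calculation in plain Python. It is not the code in examples3.py and does not use the vocabulary-tools API; the function name coverage and the transliterated toy lemmas are made up for illustration, and it assumes each work is available as a list with one lemma per token.

```python
from collections import Counter


def coverage(work_lemmas, known_lemmas):
    """Compute lemma and token coverage of one work by a known vocabulary.

    work_lemmas: list with one lemma per token of the work (hypothetical input)
    known_lemmas: set of lemmas assumed to be known (e.g. those in the GNT)
    """
    counts = Counter(work_lemmas)
    lemmas = len(counts)            # unique lemmas in the work
    tokens = sum(counts.values())   # total tokens in the work
    known_lemma_count = sum(1 for lemma in counts if lemma in known_lemmas)
    known_token_count = sum(n for lemma, n in counts.items() if lemma in known_lemmas)
    return {
        "lemmas": lemmas,
        "tokens": tokens,
        "known lemmas": known_lemma_count,
        "known tokens": known_token_count,
        "% known lemmas": round(100 * known_lemma_count / lemmas, 2),
        "% known tokens": round(100 * known_token_count / tokens, 2),
    }


# Toy data (transliterated, not real counts): 7 tokens, 4 unique lemmas.
work = ["ho", "logos", "kai", "ho", "sophistes", "ho", "logos"]
known = {"ho", "logos", "kai"}

result = coverage(work, known)
print(result["% known lemmas"], result["% known tokens"])  # 75.0 85.71
```

The same arithmetic reproduces the Crito row: 433 / 712 rounds to 60.81% and 3,429 / 4,172 rounds to 82.19%.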

id	title	lemmas	tokens	GNT lemmas	GNT tokens	% GNT lemmas	% GNT tokens
001	Euthyphro	690	5,181	441	4,274	63.91%	82.49%
002	Apology	1,112	8,745	631	7,357	56.74%	84.13%
003	Crito	712	4,172	433	3,429	60.81%	82.19%
004	Phaedo	1,921	21,825	1,000	18,033	52.06%	82.63%
005	Cratylus	1,607	17,944	781	14,701	48.6%	81.93%
006	Theaetetus	2,072	22,489	966	17,962	46.62%	79.87%
007	Sophist	1,598	16,024	788	12,932	49.31%	80.7%
008	Statesman	2,013	16,953	937	13,384	46.55%	78.95%
009	Parmenides	805	15,155	478	12,738	59.38%	84.05%
010	Philebus	1,567	17,668	800	14,076	51.05%	79.67%
011	Symposium	1,949	17,461	961	13,806	49.31%	79.07%
012	Phaedrus	2,266	16,645	1,027	12,935	45.32%	77.71%
013	Alcibiades 1	1,138	10,264	628	8,356	55.18%	81.41%
014	Alcibiades 2	711	4,268	420	3,449	59.07%	80.81%
015	Hipparchus	431	2,256	281	1,890	65.2%	83.78%
016	Lovers	473	2,391	284	1,923	60.04%	80.43%
017	Theages	627	3,485	374	2,811	59.65%	80.66%
018	Charmides	919	8,311	534	6,875	58.11%	82.72%
019	Laches	960	7,674	559	6,100	58.23%	79.49%
020	Lysis	911	6,980	524	5,729	57.52%	82.08%
021	Euthydemus	1,268	12,453	686	10,015	54.1%	80.42%
022	Protagoras	1,753	17,795	869	14,306	49.57%	80.39%
023	Gorgias	1,938	26,337	951	21,467	49.07%	81.51%
024	Meno	961	9,791	534	8,066	55.57%	82.38%
025	Hippias Major	958	8,448	528	6,730	55.11%	79.66%
026	Hippias Minor	698	4,360	396	3,387	56.73%	77.68%
027	Ion	721	4,024	382	3,012	52.98%	74.85%
028	Menexenus	958	4,808	571	3,985	59.6%	82.88%
029	Cleitophon	418	1,549	284	1,293	67.94%	83.47%
030	Republic	4,846	88,878	1,782	71,377	36.77%	80.31%
031	Timaeus	2,666	23,662	1,122	18,644	42.09%	78.79%
032	Critias	1,130	4,950	638	3,997	56.46%	80.75%
033	Minos	528	2,859	309	2,333	58.52%	81.6%
034	Laws	5,227	103,193	1,804	82,652	34.51%	80.09%
035	Epinomis	1,014	6,309	590	5,135	58.19%	81.39%
036	Epistles	2,026	16,964	1,015	13,768	50.1%	81.16%

It's encouraging how many works are above the 80% level. Here are some caveats, though:

the Plato lemmatization is from Diorisis so has not been checked and may have errors throwing things off
the GNT lemmatization is MorphGNT and so even if Diorisis got it “right” it may have a different lemmatization scheme than MorphGNT
this assumes 100% knowledge of GNT lemmas
this doesn’t take into account word families nor individual forms
this coverage calculation theoretically favours shorter works. You can see that in how much lower the % GNT lemmas is for longer works like the Laws and Republic, although (perhaps significantly) this doesn’t actually seem to skew token coverage

Favouring shorter works isn't necessarily a bad thing if the goal is to find the most readable (by vocabulary) works of Plato post-GNT.

Here's a run of the code assuming knowledge of only the 80% level of GNT vocabulary rather than the whole thing.

id	title	lemmas	tokens	GNT lemmas	GNT tokens	% GNT lemmas	% GNT tokens
001	Euthyphro	690	5,181	149	3,135	21.59%	60.51%
002	Apology	1,112	8,745	165	5,551	14.84%	63.48%
003	Crito	712	4,172	150	2,581	21.07%	61.86%
004	Phaedo	1,921	21,825	214	13,647	11.14%	62.53%
005	Cratylus	1,607	17,944	192	11,208	11.95%	62.46%
006	Theaetetus	2,072	22,489	215	13,416	10.38%	59.66%
007	Sophist	1,598	16,024	183	9,644	11.45%	60.18%
008	Statesman	2,013	16,953	194	9,577	9.64%	56.49%
009	Parmenides	805	15,155	140	9,852	17.39%	65.01%
010	Philebus	1,567	17,668	187	10,209	11.93%	57.78%
011	Symposium	1,949	17,461	208	10,437	10.67%	59.77%
012	Phaedrus	2,266	16,645	212	9,395	9.36%	56.44%
013	Alcibiades 1	1,138	10,264	177	6,296	15.55%	61.34%
014	Alcibiades 2	711	4,268	142	2,566	19.97%	60.12%
015	Hipparchus	431	2,256	111	1,339	25.75%	59.35%
016	Lovers	473	2,391	104	1,427	21.99%	59.68%
017	Theages	627	3,485	124	2,129	19.78%	61.09%
018	Charmides	919	8,311	158	5,277	17.19%	63.49%
019	Laches	960	7,674	165	4,632	17.19%	60.36%
020	Lysis	911	6,980	150	4,204	16.47%	60.23%
021	Euthydemus	1,268	12,453	181	7,640	14.27%	61.35%
022	Protagoras	1,753	17,795	195	10,973	11.12%	61.66%
023	Gorgias	1,938	26,337	205	16,301	10.58%	61.89%
024	Meno	961	9,791	159	6,042	16.55%	61.71%
025	Hippias Major	958	8,448	154	5,123	16.08%	60.64%
026	Hippias Minor	698	4,360	134	2,446	19.2%	56.1%
027	Ion	721	4,024	133	2,236	18.45%	55.57%
028	Menexenus	958	4,808	161	2,877	16.81%	59.84%
029	Cleitophon	418	1,549	113	966	27.03%	62.36%
030	Republic	4,846	88,878	252	53,090	5.2%	59.73%
031	Timaeus	2,666	23,662	210	13,555	7.88%	57.29%
032	Critias	1,130	4,950	171	2,872	15.13%	58.02%
033	Minos	528	2,859	121	1,776	22.92%	62.12%
034	Laws	5,227	103,193	250	58,891	4.78%	57.07%
035	Epinomis	1,014	6,309	165	3,700	16.27%	58.65%
036	Epistles	2,026	16,964	211	10,229	10.41%	60.3%
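
How the "80% level" vocabulary is selected isn't spelled out here. One plausible reading, and it is only an assumption, is the smallest set of most frequent GNT lemmas that together cover at least 80% of GNT tokens. A sketch of that selection in plain Python (the function name vocab_for_coverage and the toy data are hypothetical, not part of vocabulary-tools):

```python
from collections import Counter


def vocab_for_coverage(corpus_lemmas, target=0.8):
    """Return the most frequent lemmas needed to reach `target` token coverage.

    corpus_lemmas: list with one lemma per token of the reference corpus
    target: fraction of tokens to cover (0.8 for an "80% level")
    """
    counts = Counter(corpus_lemmas)
    total = sum(counts.values())
    vocab = set()
    covered = 0
    # Walk lemmas from most to least frequent, stopping once the target is met.
    for lemma, n in counts.most_common():
        if covered >= target * total:
            break
        vocab.add(lemma)
        covered += n
    return vocab


# Toy corpus: "a" x6, "b" x3, "c" x1 (10 tokens).
toy_corpus = ["a"] * 6 + ["b"] * 3 + ["c"]
core = vocab_for_coverage(toy_corpus, target=0.8)
print(sorted(core))  # ['a', 'b']
```

Under that reading, the "GNT lemmas" and "GNT tokens" columns in the second table count only lemmas in this smaller core set, which is why they are so much lower than in the first table.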

The Plato coverage generally drops from around 80% to 60% which suggests it might be worth "topping up" one's vocabulary with some common Plato words not in the GNT before embarking on a specific work. It would be easy to generate such a list with vocabulary-tools.
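
As a rough illustration of what such a "top-up" list involves, here is a plain-Python sketch rather than actual vocabulary-tools code. The function name top_up_list and the transliterated toy lemmas are made up; it assumes the work is a list with one lemma per token and the known vocabulary is a set.

```python
from collections import Counter


def top_up_list(work_lemmas, known_lemmas, size=10):
    """Return the `size` most frequent lemmas in the work that are not yet known."""
    unknown_counts = Counter(
        lemma for lemma in work_lemmas if lemma not in known_lemmas
    )
    return unknown_counts.most_common(size)


# Toy data (not real counts).
work = ["kai", "sophistes", "ho", "dialektikos",
        "sophistes", "eidos", "sophistes", "eidos"]
known = {"kai", "ho"}

top_up = top_up_list(work, known, size=2)
print(top_up)  # [('sophistes', 3), ('eidos', 2)]
```

Sorting by frequency within the target work means each word learnt buys the largest possible gain in token coverage for that work.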

But it was quite striking to me in both tables just how little the token % drops due to length (in contrast to the lemma %).

This just goes to show that longer works introduce a lot of new words but very sparsely (probably with only one occurrence in many cases).
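
One way to check the "only one occurrence" hunch would be to count how many of the non-GNT lemmas in a work occur exactly once. A minimal sketch, again in plain Python with a hypothetical function name and toy data, assuming one lemma per token:

```python
from collections import Counter


def unknown_profile(work_lemmas, known_lemmas):
    """Summarise how often the lemmas outside the known vocabulary occur."""
    unknown_counts = Counter(
        lemma for lemma in work_lemmas if lemma not in known_lemmas
    )
    once_only = sum(1 for n in unknown_counts.values() if n == 1)
    return {
        "unknown lemmas": len(unknown_counts),
        "unknown tokens": sum(unknown_counts.values()),
        "occurring once": once_only,
    }


# Toy data (not real counts): three unknown lemmas, two of them seen only once.
work = ["ho", "kai", "x", "ho", "y", "z", "z", "kai", "ho", "ho"]
known = {"ho", "kai"}

profile = unknown_profile(work, known)
print(profile)  # {'unknown lemmas': 3, 'unknown tokens': 4, 'occurring once': 2}
```

A high share of once-only lemmas among the unknown ones would be consistent with lemma coverage falling in long works while token coverage barely moves.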

I might explore that graphically in a follow-up post.