J. K. Tauber

Working on Plato Texts

2020-10-21T14:35:00+08:00

I started working on some Plato texts a while ago but now I'm back to it, integrating various information and hitting some more issues with the Diorisis corpus.

About a year ago, I wrote about some vocabulary statistics I'd put together around various texts. This included a subcorpus of Plato I wanted to put together for the Greek Learner Texts Project. Based on the "core" works list I'd put together for https://vocab.perseus.org, this included:

Euthyphro
Apology
Crito
Symposium
Republic

At the time, I used Giuseppe Celano's experimental lemmatization as the basis for the vocabulary counts.

In the intervening period, I went further with Crito and restructured the citation scheme to be based on units of dialogue and sentences. You can see HTML generated from the result at:

https://jtauber.github.io/plato-texts/

but the underlying data is available at https://github.com/jtauber/plato-texts/blob/master/text/crito.txt.

I also took a first pass at aligning an English translation at the sentence level and the raw data for that is available at https://github.com/jtauber/plato-texts/blob/master/analysis/crito_aligned.txt.

My plan was always to return to the other four texts and last weekend I started on that, freshly bringing in the texts (in both Greek and English) from Perseus, the Diorisis corpus tagging, and the treebanks from AGLDT (Euthyphro) and Vanessa Gorman (the Apology).

I also added:

Phaedo
Meno

and may add others if they seem appropriate for the Greek Learner Texts Project.

The first thing I did was produce a stripped-down tokenized version of the Greek texts from Perseus with minimal markdown. In this process, I found a small number of issues with the Perseus XML which I'll submit corrections for shortly (mostly some stray gammas).

I then wrote a script to extract similar tokens from Diorisis for alignment. As I've written about before, the Diorisis corpus made the odd choice to use betacode for the tokens so I had to do a conversion. Then the real fun began.

Firstly, the Perseus text, based on the Burnet edition, has various editorial markup like <add>, <del>, <corr>, <sic>. I quickly discovered that the Diorisis text drops the <del> and <sic> elements. That's fine although I might seek the advice of people more familiar with Burnet and the text scholarship of Plato as to what the Greek Learner Texts edition should do.

Secondly, in Phaedo at least, named entities are marked up in the Perseus TEI XML. People and places are all appropriately tagged. I don't happen to need that right now although it's potentially useful information. But the Diorisis corpus drops those elements. I don't just mean it drops the tags, it drops the elements, content and all. So if the sentence was <persName>John</persName> loves <persName>Mary</persName>, Diorisis would just give the sentence as loves (at least in Phaedo). Fairly easy to work around for alignment purposes, though.
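To make the difference between dropping tags and dropping elements concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The <s> wrapper and both function names are hypothetical, and this is not how Diorisis itself was produced.

```python
import xml.etree.ElementTree as ET

# Toy sentence from the text above; the <s> wrapper is a hypothetical container.
SENTENCE = "<s><persName>John</persName> loves <persName>Mary</persName></s>"


def strip_tags(xml: str) -> str:
    """Remove the markup but keep all the text inside it."""
    root = ET.fromstring(xml)
    return "".join(root.itertext()).strip()


def drop_elements(xml: str, tag: str) -> str:
    """Remove every direct child <tag> element together with its content."""
    root = ET.fromstring(xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag != tag:
            parts.append("".join(child.itertext()))
        # the tail is the text that follows the child, so it always survives
        parts.append(child.tail or "")
    return "".join(parts).strip()


print(strip_tags(SENTENCE))                 # John loves Mary
print(drop_elements(SENTENCE, "persName"))  # loves
```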

The more time-consuming aspect is the odd way Diorisis handles quotations. It seems to repeat the tokens of each quotation, once in context and then once in a sentence of its own. Except sometimes the repetition is incorporated in an unrelated sentence.

For example, the Homeric quotation in 408a (Republic Book 3) is analyzed inline but then also repeated in another sentence where it's part of the first sentence of 409a ("δικαστὴς δέ γε...") which, unless I'm missing something considerable, is just completely wrong.

I'm manually correcting all this (it comes up as an alignment mismatch and I'm going in and editing the Diorisis XML to remove the duplication). But even without the bad sentence merges, this also means that the vocabulary counts I've previously generated from Diorisis (and in vocabulary-tools) may have doubled up on any words appearing in quotations.
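A duplicated quotation shows up as a run of extra tokens when two token streams are aligned. The following is a minimal sketch of that idea using difflib from the standard library. It is an assumption for illustration, not the actual alignment script, and the token lists are invented.

```python
from difflib import SequenceMatcher


def alignment_mismatches(reference, candidate):
    """Return (op, reference_slice, candidate_slice) for every non-matching span."""
    matcher = SequenceMatcher(a=reference, b=candidate, autojunk=False)
    return [
        (op, reference[i1:i2], candidate[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Hypothetical toy data: q1 q2 stand for a quotation that the candidate
# repeats a second time inside a later sentence.
perseus = ["a", "b", "q1", "q2", "c", "d", "e"]
diorisis = ["a", "b", "q1", "q2", "c", "q1", "q2", "d", "e"]

for op, ref, cand in alignment_mismatches(perseus, diorisis):
    print(op, ref, cand)
```

Each reported "insert" span is a candidate for manual review. Nothing here decides automatically whether the extra tokens are a repeated quotation or a genuine textual difference.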

So there's lots more to do with Plato, not least of all the manual curation of lemmatization. But the goal, like that of the Greek Learner Texts Project as a whole, is to have a set of openly-licensed, high-quality, lemmatized texts for extensive reading by language learners.

Collaboration always welcome. Just ping me on the Greek Learner Texts Project Slack workspace.

Ordering Vocabulary by Pericope Dispersion

2020-10-04T22:35:00+08:00

Jesse Egbert's Plenary at JAECS 2020 is giving me a bunch of ideas of things to try on the New Testament and larger Greek corpora. In this post, I briefly explore text dispersion keyness using pericopes as a way of ordering vocabulary.

Back in Lexical Dispersion in the Greek New Testament Via Gries’s DP I wrote:

My sense is that dispersion might be a useful input to deciding what vocabulary to learn. For example διδαχή or σκότος might be better to learn before ἀρνίον because, even though they all have the same frequency, you are more likely to encounter διδαχή or σκότος in a random book or chapter.

Egbert's plenary (available here after free signup) encouraged me to try a very simple metric instead of frequency: what proportion of text units in the corpus does the word appear in? Egbert emphasises using linguistically meaningful units of text (definitely not fixed-length windows) and pericopes seem perfect for this. There are dispersion measures that allow for varying sizes of text unit (like Gries's DP) but it seemed to me that just seeing what proportion of pericopes the item appears in might be a good measure of the importance to learn (instead of frequency).

This downplays words that might get repeated a lot in just a handful of pericopes and favours those that appear in lots of pericopes even if only one or two times in each pericope. Intuitively this makes sense: a word that appears 10 times in one passage in the New Testament (and nowhere else) isn't as generally useful to learn as a word that appears once in each of ten different passages. Overall corpus frequency can therefore be misleading because it treats these two cases as the same.
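The contrast between the two measures can be sketched in a few lines. This is a minimal illustration on an invented corpus, assuming each text unit is simply a list of lemmas. It is not the vocabulary-tools API or the code in the gist.

```python
from collections import Counter


def frequency(units):
    """Total number of tokens of each lemma across all text units."""
    return Counter(lemma for unit in units.values() for lemma in unit)


def dispersion(units):
    """Proportion of text units in which each lemma appears at least once."""
    in_units = Counter(lemma for unit in units.values() for lemma in set(unit))
    return {lemma: n / len(units) for lemma, n in in_units.items()}


# Hypothetical corpus of ten pericopes: "once" appears once in each,
# "bursty" appears ten times but only in the first.
units = {f"p{i}": ["once"] for i in range(10)}
units["p0"] += ["bursty"] * 10

freq = frequency(units)
disp = dispersion(units)
print(freq["once"], freq["bursty"])   # 10 10
print(disp["once"], disp["bursty"])   # 1.0 0.1
```

Both lemmas have the same frequency, but sorting by dispersion puts "once" well ahead of "bursty".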

With vocabulary-tools it was trivial to produce a list of all the New Testament lemmas sorted by pericope dispersion.

This gist contains the code and the list:

https://gist.github.com/jtauber/fc4b0476a4c4a94d7cb01d068161892e

Eyeballing the resultant list, it seems a very promising ordering although I welcome comments on anything interesting people notice.

Next steps are:

quantitative comparison with pure frequency
application to other lemmatized Greek corpora with meaningful text units similar to pericopes
try other meaningful text units I have for NT such as books or paragraphs or even sentences

More on Plato After GNT

2020-09-14T08:35:00+08:00

In the previous post, we looked at lemma and token coverage in the works of Plato assuming knowledge of Greek New Testament vocabulary. Here we graphically look at those results and make an important observation.

For this first chart, I haven't just shown the GNT 100% and 80% but also the 98%, 95%, and 90% levels. The chart shows, assuming you've learned a certain % of GNT lemmas, how many tokens in the works of Plato are from those lemmas plotted against the length of the Plato work. All the plots here are log-log because of the Zipfian nature of word distributions (although it is more important in subsequent plots than this one).

As mentioned in the previous post, I was actually surprised at how little coverage drops off as a function of the length of the Plato work. A 100,000 token work has very similar token coverage to a 5,000 token work.

Visually this can be seen in how horizontal the best-fit lines are above.

However, when it comes to lemma coverage rather than token coverage, the story is very different:

The drop-off above as the Plato work gets longer is quite dramatic (especially when you consider this is a log-log plot). The points fit quite well to a line, though, indicating how Zipfian the distribution is. This demonstrates the clear relationship between the length of the work and how many lemmas you're likely familiar with. The longer a work is, the more distinct lemmas it will use, although they tend to be low frequency within the work (hence how horizontal the lines in the first chart are).

Notice there are some outliers—some works that seem to have higher coverage than their length would suggest given the best-fit line. I've called out one here, showing just the GNT 80% points and best-fit line (although it's an outlier on the others too):

This suggests that this work might be, in some sense, easier for a GNT reader to read compared with other works of Plato. It suggests that perhaps the vocabulary of that particular work is closer to that of the GNT. The data was all there in the previous post but it's a lot easier to spot the outliers graphically.

The work indicated above is Parmenides. I started to wonder what it was about that work that made it more "GNT like".

Then I took a step back because I realised there may be a confounding factor here. The statement "this work might be easier for a GNT reader to read compared with other works of Plato" stands but note this might not be a property of any GNT/Parmenides shared vocabulary but rather just the word distribution in Parmenides itself. In other words, Parmenides might just be easier compared with other works of Plato and that might have nothing to do with any vocabulary similarity to the GNT.

So I decided to just plot the token-to-lemma counts in the works of Plato. This doesn't involve the GNT at all, just how many tokens each work in Plato has versus how many unique lemmas that work has.

Here is the result with Parmenides called out:

In other words, a large part (and maybe all) of why Parmenides stands off the line in the coverage after GNT is because it simply has fewer lemmas for its overall token count. Its vocabulary is just smaller for its length.
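The same observation can be made numerically by fitting a line to log(lemmas) against log(tokens) and looking at each work's residual. This is a minimal sketch using ordinary least squares on an arbitrary subset of the counts from the table in the previous post. The choice of subset and the function name are assumptions, and the fit is not the one used to draw the charts.

```python
import math

# (tokens, lemmas) for a handful of works, copied from the table in the
# previous post; the choice of this subset is arbitrary.
WORKS = {
    "Euthyphro": (5181, 690),
    "Apology": (8745, 1112),
    "Crito": (4172, 712),
    "Phaedo": (21825, 1921),
    "Parmenides": (15155, 805),
    "Symposium": (17461, 1949),
    "Meno": (9791, 961),
    "Republic": (88878, 4846),
    "Laws": (103193, 5227),
}


def loglog_fit(points):
    """Ordinary least squares of log(lemmas) on log(tokens): (slope, intercept)."""
    xs = [math.log(tokens) for tokens, _ in points]
    ys = [math.log(lemmas) for _, lemmas in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = loglog_fit(list(WORKS.values()))

# residual < 0 means fewer distinct lemmas than the line predicts for that length
residuals = {
    title: math.log(lemmas) - (intercept + slope * math.log(tokens))
    for title, (tokens, lemmas) in WORKS.items()
}
for title, r in sorted(residuals.items(), key=lambda kv: kv[1]):
    print(f"{title:12} {r:+.2f}")
```

On this subset Parmenides has by far the most negative residual, which matches what the chart shows.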

In fact, visually you can see that most of the deviations of works from the line in the earlier charts map to corresponding deviations in this chart (which remember has nothing to do with the GNT).

This is just some visual comparison. There are more quantitative ways of actually measuring how much the deviations in the first three charts can be explained by those in the last chart. But I'll save that for another post.

The important takeaway for now is that, to the extent some works of Plato might be easier to read after the GNT than others, this probably has little to do with any relationship between their vocabularies, and is more to do with the inherent token-to-lemma ratio of the target work of Plato. It is possible to separate out the effects of each, though, and I will explore that in the future.

Note all the caveats I listed in my previous post about this data. Better lemmatization and richer vocabulary models are still needed.

Plato Vocabulary Coverage After the New Testament

2020-09-02T19:40:00+08:00

Seumas Macdonald asked me about vocabulary coverage for each work of Plato assuming one has learnt the New Testament vocabulary.

It turned out to be very simple to do with vocabulary-tools and you can now see the script in the repo as examples3.py.

But here let me share the results and give some caveats.

In the table below:

lemmas is the number of unique lemmas in the work; e.g. Crito has 712 unique lemmas
tokens is the number of total tokens in the work; e.g. Crito has 4,172 tokens
GNT lemmas is how many of those lemmas are in the GNT; e.g. 433 (of the 712) lemmas in Crito are also in the GNT
GNT tokens is how many of the total tokens in the work have lemmas in the GNT; e.g. 3,429 of the 4,172 tokens in Crito have lemmas in the GNT
% GNT lemmas and % GNT tokens just express those counts as percentages; e.g. 60.81% of the lemmas in Crito are in the GNT and 82.19% of tokens in Crito have lemmas seen in the GNT

id	title	lemmas	tokens	GNT lemmas	GNT tokens	% GNT lemmas	% GNT tokens
001	Euthyphro	690	5,181	441	4,274	63.91%	82.49%
002	Apology	1,112	8,745	631	7,357	56.74%	84.13%
003	Crito	712	4,172	433	3,429	60.81%	82.19%
004	Phaedo	1,921	21,825	1,000	18,033	52.06%	82.63%
005	Cratylus	1,607	17,944	781	14,701	48.6%	81.93%
006	Theaetetus	2,072	22,489	966	17,962	46.62%	79.87%
007	Sophist	1,598	16,024	788	12,932	49.31%	80.7%
008	Statesman	2,013	16,953	937	13,384	46.55%	78.95%
009	Parmenides	805	15,155	478	12,738	59.38%	84.05%
010	Philebus	1,567	17,668	800	14,076	51.05%	79.67%
011	Symposium	1,949	17,461	961	13,806	49.31%	79.07%
012	Phaedrus	2,266	16,645	1,027	12,935	45.32%	77.71%
013	Alcibiades 1	1,138	10,264	628	8,356	55.18%	81.41%
014	Alcibiades 2	711	4,268	420	3,449	59.07%	80.81%
015	Hipparchus	431	2,256	281	1,890	65.2%	83.78%
016	Lovers	473	2,391	284	1,923	60.04%	80.43%
017	Theages	627	3,485	374	2,811	59.65%	80.66%
018	Charmides	919	8,311	534	6,875	58.11%	82.72%
019	Laches	960	7,674	559	6,100	58.23%	79.49%
020	Lysis	911	6,980	524	5,729	57.52%	82.08%
021	Euthydemus	1,268	12,453	686	10,015	54.1%	80.42%
022	Protagoras	1,753	17,795	869	14,306	49.57%	80.39%
023	Gorgias	1,938	26,337	951	21,467	49.07%	81.51%
024	Meno	961	9,791	534	8,066	55.57%	82.38%
025	Hippias Major	958	8,448	528	6,730	55.11%	79.66%
026	Hippias Minor	698	4,360	396	3,387	56.73%	77.68%
027	Ion	721	4,024	382	3,012	52.98%	74.85%
028	Menexenus	958	4,808	571	3,985	59.6%	82.88%
029	Cleitophon	418	1,549	284	1,293	67.94%	83.47%
030	Republic	4,846	88,878	1,782	71,377	36.77%	80.31%
031	Timaeus	2,666	23,662	1,122	18,644	42.09%	78.79%
032	Critias	1,130	4,950	638	3,997	56.46%	80.75%
033	Minos	528	2,859	309	2,333	58.52%	81.6%
034	Laws	5,227	103,193	1,804	82,652	34.51%	80.09%
035	Epinomis	1,014	6,309	590	5,135	58.19%	81.39%
036	Epistles	2,026	16,964	1,015	13,768	50.1%	81.16%
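The calculation behind the columns in the table above can be sketched as follows. This is a minimal illustration on invented data, assuming a work is available as a list with one lemma per token. It is not the examples3.py script or the vocabulary-tools API.

```python
def coverage(work_lemmas, known_lemmas):
    """Return (lemma coverage, token coverage) of a work given known lemmas.

    work_lemmas is the lemma of every token in the work, in order;
    known_lemmas is the set of lemmas the reader is assumed to know.
    """
    distinct = set(work_lemmas)
    lemma_cov = len(distinct & known_lemmas) / len(distinct)
    known_tokens = sum(1 for lemma in work_lemmas if lemma in known_lemmas)
    return lemma_cov, known_tokens / len(work_lemmas)


# Hypothetical toy work: 3 distinct lemmas, 6 tokens.
work = ["καί", "λέγω", "καί", "δικαστής", "καί", "λέγω"]
known = {"καί", "λέγω"}
lemma_cov, token_cov = coverage(work, known)
print(f"{lemma_cov:.2%} {token_cov:.2%}")  # 66.67% 83.33%

# The Crito percentages, recomputed from the counts in the table.
print(f"{433 / 712:.2%} {3429 / 4172:.2%}")  # 60.81% 82.19%
```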

It's encouraging how many works are above the 80% level. Here are some caveats, though:

the Plato lemmatization is from Diorisis so has not been checked and may have errors throwing things off
the GNT lemmatization is MorphGNT and so even if Diorisis got it “right” it may have a different lemmatization scheme than MorphGNT
this assumes 100% knowledge of GNT lemmas
this doesn’t take into account word families nor individual forms
this coverage calculation theoretically favours shorter works. You can see that in how much lower the % GNT lemmas is for longer works like the Laws and Republic although (perhaps significantly) this doesn’t actually seem to skew token coverage

Favouring shorter works isn't necessarily a bad thing if the goal is to find the most readable (by vocabulary) works of Plato post-GNT.

Here's a run of the code only assuming the 80% level of GNT vocabulary rather than the whole thing.

id	title	lemmas	tokens	GNT lemmas	GNT tokens	% GNT lemmas	% GNT tokens
001	Euthyphro	690	5,181	149	3,135	21.59%	60.51%
002	Apology	1,112	8,745	165	5,551	14.84%	63.48%
003	Crito	712	4,172	150	2,581	21.07%	61.86%
004	Phaedo	1,921	21,825	214	13,647	11.14%	62.53%
005	Cratylus	1,607	17,944	192	11,208	11.95%	62.46%
006	Theaetetus	2,072	22,489	215	13,416	10.38%	59.66%
007	Sophist	1,598	16,024	183	9,644	11.45%	60.18%
008	Statesman	2,013	16,953	194	9,577	9.64%	56.49%
009	Parmenides	805	15,155	140	9,852	17.39%	65.01%
010	Philebus	1,567	17,668	187	10,209	11.93%	57.78%
011	Symposium	1,949	17,461	208	10,437	10.67%	59.77%
012	Phaedrus	2,266	16,645	212	9,395	9.36%	56.44%
013	Alcibiades 1	1,138	10,264	177	6,296	15.55%	61.34%
014	Alcibiades 2	711	4,268	142	2,566	19.97%	60.12%
015	Hipparchus	431	2,256	111	1,339	25.75%	59.35%
016	Lovers	473	2,391	104	1,427	21.99%	59.68%
017	Theages	627	3,485	124	2,129	19.78%	61.09%
018	Charmides	919	8,311	158	5,277	17.19%	63.49%
019	Laches	960	7,674	165	4,632	17.19%	60.36%
020	Lysis	911	6,980	150	4,204	16.47%	60.23%
021	Euthydemus	1,268	12,453	181	7,640	14.27%	61.35%
022	Protagoras	1,753	17,795	195	10,973	11.12%	61.66%
023	Gorgias	1,938	26,337	205	16,301	10.58%	61.89%
024	Meno	961	9,791	159	6,042	16.55%	61.71%
025	Hippias Major	958	8,448	154	5,123	16.08%	60.64%
026	Hippias Minor	698	4,360	134	2,446	19.2%	56.1%
027	Ion	721	4,024	133	2,236	18.45%	55.57%
028	Menexenus	958	4,808	161	2,877	16.81%	59.84%
029	Cleitophon	418	1,549	113	966	27.03%	62.36%
030	Republic	4,846	88,878	252	53,090	5.2%	59.73%
031	Timaeus	2,666	23,662	210	13,555	7.88%	57.29%
032	Critias	1,130	4,950	171	2,872	15.13%	58.02%
033	Minos	528	2,859	121	1,776	22.92%	62.12%
034	Laws	5,227	103,193	250	58,891	4.78%	57.07%
035	Epinomis	1,014	6,309	165	3,700	16.27%	58.65%
036	Epistles	2,026	16,964	211	10,229	10.41%	60.3%
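The post doesn't spell out how the "80% level" is defined. One plausible reading, assumed here, is the smallest set of most frequent lemmas whose tokens make up at least 80% of the source text. A minimal sketch of that reading, with invented counts and a hypothetical function name:

```python
from collections import Counter


def lemmas_for_coverage(lemma_counts, target):
    """Most frequent lemmas needed to reach `target` token coverage (0..1)."""
    total = sum(lemma_counts.values())
    chosen, covered = set(), 0
    for lemma, count in lemma_counts.most_common():
        if covered / total >= target:
            break
        chosen.add(lemma)
        covered += count
    return chosen


# Hypothetical frequencies, not real GNT counts.
counts = Counter({"ὁ": 50, "καί": 30, "λέγω": 10, "ἀρνίον": 6, "σκότος": 4})
print(sorted(lemmas_for_coverage(counts, 0.8)))
```

Under this reading, the resulting set would stand in for the full GNT lemma list when computing coverage.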

The Plato coverage generally drops from around 80% to 60% which suggests it might be worth "topping up" one's vocabulary with some common Plato words not in the GNT before embarking on a specific work. It would be easy to generate such a list with vocabulary-tools.
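A "top up" list of this kind could be produced along the following lines. This is a minimal sketch on invented data with a hypothetical function name, not the vocabulary-tools API.

```python
from collections import Counter


def top_up_list(work_lemmas, known_lemmas, n):
    """The n most frequent lemmas in a work that are not already known."""
    unknown = Counter(lemma for lemma in work_lemmas if lemma not in known_lemmas)
    return unknown.most_common(n)


# Hypothetical toy data standing in for a lemmatized work and the GNT lemma set.
work = ["καί", "δικαστής", "λέγω", "δικαστής", "εὐήθης", "καί", "δικαστής"]
gnt = {"καί", "λέγω"}
print(top_up_list(work, gnt, 2))  # [('δικαστής', 3), ('εὐήθης', 1)]
```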

But it was quite striking to me in both tables just how little the token % drops due to length (in contrast to the lemma %).

This just goes to show that longer works introduce a lot of new words but very sparsely (probably with only one occurrence in many cases).

I might explore that graphically in a follow-up post.

A Tour of Greek Morphology: Part 48

2020-08-30T21:12:00+08:00

Part forty-eight of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

We previously introduced the (θ)η-aorists. In this post, we'll mention the stem variants and then go over some counts.

In terms of stem variants, we first of all have δέω, where we find the infinitive δεθῆναι alongside the 1SG ἐδεήθην and 3SG ἐδεήθη. The infinitive form suggests a stem of δε-θη whereas the finite forms suggest a stem of ἐ-δεη-θη with an extra η.

Secondly, we have two 3SG forms of ἁρπάζω: ἡρπάγη and ἡρπάσθη.

Finally we have ἀνοίγω with its confused augmentation (which we've seen in other aorists) and also both a θ and non-θ form. Putting aside the ἠνοι- vs ἀνεῳ- vs ἠνεῳ- variation, we have 3SG ἠνοίχθη alongside ἠνοίγη and 3PL ἠνοίχθησαν alongside ἠνοίγησαν.

Notice that in both the ἁρπάζω and ἀνοίγω cases, we have a non-θ form with γ before the η. We'll look at the letters we find before η and θη later in this post.

But first let's do our usual counts of tokens and lemmas.

class	# lemmas	# tokens	# hapakes
-θη-	250	954	130
-η-	34	79	19

As one can see, the non-θ forms are more rare lexically and the lexemes that do take them occur less frequently. They both, however, seem productive.

	-θη-	-η-
INF	166	4
1SG	29	6
2SG	8	2
3SG	489	43
1PL	30	5
2PL	44	3
3PL	188	16

The distribution above is what we might expect except for the INF which are disproportionately -θη-. This is not due to a single lexical item (unlike the 3SG where ἀπεκρίθη dominates).

This will be worth further investigation but we have other things to cover first. For example, is there any phonological reason why a non-θ form might be used rather than a θ-form? We saw previously, for example, that the existence or absence of the sigma in the alphathematic aorists was largely (although not entirely) predicted by the preceding letter.

It turns out, at least in our text (we'll look more broadly later) there's quite a strong correlation between whether a θ is found or not and what the preceding letter is.

For example, if the preceding letter is any of the vowels ε η ι ο υ ω, then we always find the θη form in the SBLGNT. α is the only exception and even then only in one lexical item out of 14, the κατακαίω form κατεκάη. (Notably κατεκαύ(σ)θη is more common elsewhere but we'll have to wait a little to discuss καίω forms in general.)

If the preceding letter is σ, then we always find the θη form. This is actually the most likely letter to precede θ by far, followed by η.

ξ ψ and ζ don't appear in (θ)η aorists in the SBLGNT. Nor do δ τ or θ.

Amongst the velars: κ doesn't appear in (θ)η aorists in the SBLGNT but γ and χ both do. γ is always followed directly by η (and in fact the bigram γθ never appears in the SBLGNT at all). In contrast, χ always takes the θ form (which might be explained by an underlying κ or γ becoming χ because of the following θ but this doesn't explain why the θ would be absent in the -γ-η- instances).

Amongst the bilabials: both π and β are always followed directly by η (and neither πθ nor βθ appear as bigrams in the SBLGNT). φ however is found both in θη and η forms with a slight preference for φθη over φη.

This leaves our resonants: the liquids λ and ρ, and the nasals μ and ν. The bigram λθ is definitely allowed in Greek but we only find -λ-η- aorists, not -λ-θη-. With ν and ρ we find both θ and non-θ forms. There are no μ examples in the SBLGNT, nor do we find the bigram μθ.

Here's a summary with lexeme counts:

	-θη-	-η-
α	13	1
ε	21	-
η	80	-
ι	17	-
ο	11	-
υ	26	-
ω	52	-
σ	108	-
ξ	-	-
ψ	-	-
ζ	-	-
τ	-	-
δ	-	-
θ	-	-
κ	-	-
γ	-	12
χ	37	-
β	-	2
π	-	3
φ	16	10
λ	-	6
ρ	6	8
μ	-	-
ν	16	3

Clearly there are some patterns here. Vowels, σ, and the aspirated stops strongly (or even entirely) favour -θη-. The non-aspirated stops seem to entirely favour a plain η. The resonants are a mixed bag.
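For readers who want to tabulate this themselves, here is a minimal sketch of how the letter before the θη or η could be read off a 3SG form. It assumes accents and breathings can be ignored and that no stem ends in θ before a plain η (true of the SBLGNT data described above, but not checked by the code). The function names are hypothetical.

```python
import unicodedata


def strip_diacritics(word):
    """Drop accents and breathings so only the base letters remain."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify_3sg(form):
    """Return (preceding letter, '-θη-' or '-η-') for a 3SG (θ)η-aorist form."""
    base = strip_diacritics(form)
    if not base.endswith("η"):
        raise ValueError(f"not a 3SG (θ)η-aorist form: {form}")
    stem = base[:-1]
    if stem.endswith("θ"):
        return stem[-2], "-θη-"
    return stem[-1], "-η-"


for form in ["ἀπεκρίθη", "ἠνοίχθη", "ἡρπάγη", "ἐχάρη", "κατεκάη"]:
    print(form, *classify_3sg(form))
```

Counting distinct lemmas per (letter, class) pair over all such forms would give a table like the one above.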

There are definitely some correlations but it's unclear what the causal relationship is. And it raises the important question of where the letter before the θ (or η) comes from in the first place. This relates more broadly to the question of the aorist stem. What is the relationship between the aorist stems used in the active, middle and (θ)η forms? In the next post, we'll start to explore that. Then, after reviewing all our endings so far, we'll move on to the even bigger question: what's the relationship between the aorist stem and the present stem?

A Tour of Greek Morphology: Part 47

2020-08-19T12:20:00+08:00

Part forty-seven of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

We now turn to the (θ)η-aorists. These are often called aorist 'passives' but this is an unhelpful and confusing term. When talking about the form, it's better to give a label that simply refers to the form itself rather than to one of the functions that form may or (often) may not be used for. Naming the form for one of its functions (especially when other forms can be used for the same function) runs the risk of overemphasizing that function and somehow treating other functions as anomalies.

We must be clear, though, that "(θ)η-aorist" is not a category like "root aorist" or "thematic aorist" or "sigmatic aorist" where different lexemes fall (in most cases exclusively) into just one of those categories without there necessarily being a morphosyntactic distinction. The (θ)η-aorist is a new paradigm available to verbs for expressing a certain voice in contrast to the active and middle forms that we've already seen.

A lot more could be said about all this but that's outside the scope of a tour of morphological forms. The main point is that aorists can come in three voice-contrasting paradigms.

Three of the most common (θ)η-aorists in the New Testament, with broad coverage across personal endings are γενηθῆναι, ἀποκριθῆναι, and χαρῆναι.

	γίνομαι	ἀποκρίνομαι	χαίρω
INF	γενηθῆναι	ἀποκριθῆναι	χαρῆναι
1SG	ἐγενήθην	ἀπεκρίθην	ἐχάρην
2SG		ἀπεκρίθης
3SG	ἐγενήθη	ἀπεκρίθη	ἐχάρη
1PL	ἐγενήθημεν		ἐχάρημεν
2PL	ἐγενήθητε		ἐχάρητε
3PL	ἐγενήθησαν	ἀπεκρίθησαν	ἐχάρησαν

All the above forms appear in the SBLGNT.

The "vertical" distinguishers are our familiar endings seen in the root aorist actives:


INF	-ναι
1SG	-ν
2SG	-ς
3SG	-
1PL	-μεν
2PL	-τε
3PL	-σαν

The "horizontal" distinguishers, however, look like this:


INF	-(θ)ῆναι
1SG	-(θ)ην
2SG	-(θ)ης
3SG	-(θ)η
1PL	-(θ)ημεν
2PL	-(θ)ητε
3PL	-(θ)ησαν

The whole category always has a -η- before the ending and most often a -θη-, hence the name (θ)η-aorist.

By far the most common form in the SBLGNT is ἀπεκρίθη (82 tokens). The plural ἀπεκρίθησαν is the third most common form (19 tokens). The second most common form is ἐδόθη (31 tokens).

The next post will look at further counts of these (θ)η-aorists and then we'll look at the relationship between aorist active, middle and (θ)η forms before moving on to the large question of the relationship between perfective and imperfective forms.

A Tour of Greek Morphology: Part 46

2020-08-15T14:00:00+08:00

Part forty-six of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

We saw in Part 42 that the aorist middle endings were:

-σθαι
-μην
-σο (often with loss of sigma and subsequent contraction)
-το
-μεθα
-σθε
-ντο

either:

preceded by alpha
preceded by a theme vowel ε/ο
affixed directly

which correspond to our classes:

alphathematic (first aorists)
thematic (second aorists)
root

Note again what we said in the previous post: this is just a classification based on the distinguisher paradigms and there are other ways of categorizing aorist forms.

Does this cover all aorist middle indicatives and infinitives in SBLGNT? Are there any words in more than one class or with more than one stem? And what are the counts for the different classes and dominant lemmas or forms within each class?

We'll cover that here.

There is one form that doesn't match our distinguisher patterns and that's ξυρᾶσθαι in 1 Cor 11.6. This seems to be an error in the MorphGNT tagging, though. It's clearly a present (imperfective) infinitive not an aorist (perfective) infinitive and so is not relevant here.

Now in terms of multiple stems, we do have an augmentation difference with ἐργάζομαι. We find both ἠργασ- and εἰργασ-.

And in terms of words that seem to appear in more than one class, we have these forms of ἐξαιρέω:

ἐξελέσθαι, which is clearly thematic
ἐξειλάμην and ἐξείλατο, which are clearly alphathematic

We also have these forms of ἀποδίδωμι (which we've brought up before):

ἀπέδετο
ἀπέδοσθε
ἀπέδοντο

We would expect the forms to follow the root pattern. ἀπέδοσθε is unambiguously root, ἀπέδετο is unambiguously thematic. ἀπέδοντο could be taken to be root or thematic. For the purposes of the counts below, we'll take ἀπέδοντο to be root.

Note that ἐκδίδωμι only appears as ἐξέδετο which is also thematic so there's definitely some reanalysis going on with the δίδωμι compounds.

Here are the total counts across classes for aorist middles in SBLGNT:

class	# lemmas	# tokens	# hapakes
alphathematic	109	393	49
thematic	20	320	10
root	14	39	3

(yes, "hapakes" is a running joke, equivalent to calling them "the onces")

ἀπο-δο is the only root ending with ο and the rest of the root endings are θε and compounds:

θε
ἐκ-θε
ἀπο-θε
ἐπι-θε
δια-θε
κατα-θε
παρα-θε
προσ-θε
προ-θε
συν-θε
ἀνα-θε
προσ-ανα-θε
συν-επι-θε

The thematics come from ten families and are:

γεν (γίνομαι)
παρα-γεν
πυθ (πυνθάνομαι)
ἐξ-ελ (-αἰρέω)
συμ-βαλ (-βάλλω)
περι-βαλ
ἀνα-βαλ
ἀνα-σχ (-έχομαι)
ἀπ-ολ (-όλλυμι)
συν-απ-ολ
ἐκ-δο (-δίδωμι)
ἀπο-δο
συλ-λαβ (-λαμβάνω)
κατα-λαβ
ἐπι-λαβ
προσ-λαβ
ἀντι-λαβ
ἀφ-ικ (-ικνέομαι)
ἐφ-ικ
ἐπι-λαθ (-λανθάνομαι)

But note that γίνομαι alone makes up 269 out of the 320 tokens!

Now by person/number:

	alphathematic	thematic	root
INF	63	45	3
1SG	23	16	3
2SG	8	2	1
3SG	208	227	16
1PL	12	0	0
2PL	22	4	2
3PL	57	26	14

Because μι verbs have root forms throughout the middle (not just in the infinitive like in the active in Hellenistic Greek) we don't get the disproportionately high INF root counts that we did in the active.

The 3SG expectedly dominates. This is particularly true in the thematic, in large part due to ἐγένετο. But in addition to the dominance of the 3SG by ἐγένετο, all the 2SG and 2PL thematic aorist middles are γεν and most of the INF, 1SG and 3PL are too, as shown here in our table of dominant forms:

	alphathematic	thematic	root
INF		γενέσθαι 36/45	καταθέσθαι 2/3 ἀποθέσθαι 1/3
1SG		ἐγενόμην 12/16	προεθέμην 1/3 προσανεθέμην 1/3 ἀνεθέμην 1/3
2SG	κατηρτίσω 2/8 ἠρνήσω 2/8	ἐγένου 2/2	ἔθου 1/1
3SG		ἐγένετο 201/227	ἔθετο 7/16
1PL
2PL		ἐγένεσθε 4/4	ἔθεσθε 1/2 ἀπέδοσθε 1/2
3PL	ἤρξαντο 19/57	ἐγένοντο 14/26	ἔθεντο 4/14

As with the actives, there is greater lexical variety amongst the alphathematics than amongst the thematics.

class	token-lemma ratio	% hapakes
alphathematic	3.61	45.0 %
thematic	16.00	50.0 %
root	2.79	21.4 %

The top 5% of alphathematic lemmas make up 32.1% of the tokens whereas the top 5% of thematic lemmas make up a whopping 84.1% of tokens. For the actives, recall the numbers were 44.1% and 60.7% respectively.

In the next couple of posts we'll look at the (θ)η aorists (often misleadingly called aorist "passives").

Picking the Words for Greek Typing

2020-08-01T16:20:00+08:00

Last week, we launched greektyping.com to help people get better at typing Greek. Aurélien Berra asked what the method of choosing words to type was so I thought I'd write a blog post about it.

Step 1.

I took MorphGNT SBLGNT and wrote a script that made a list of words from it as follows:

every token in the text including punctuation
every token in the text with punctuation stripped
every normalized token in the text but if it has a movable final character, add both with and without
the previous but with accents stripped
every lemma in the text
the lemma but with accents stripped

So up to 8 potential "words" from each token in the SBLGNT, but then with duplicates removed. This led to 55,496 unique "words".
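The expansion of one token into up to 8 "words" can be sketched as below. This is a guess at the logic rather than the actual script. It assumes the four MorphGNT fields text, word, normalized word and lemma, assumes a movable final character is written in parentheses in the normalized field (as in ἐστί(ν)), and assumes "accents stripped" means removing all diacritics including breathings.

```python
import unicodedata


def nfc(s):
    return unicodedata.normalize("NFC", s)


def strip_accents(word):
    """Remove all combining diacritics (an assumption: breathings go too)."""
    decomposed = unicodedata.normalize("NFD", word)
    return nfc("".join(ch for ch in decomposed if not unicodedata.combining(ch)))


def expand_movable(norm):
    """'ἐστί(ν)' -> ['ἐστί', 'ἐστίν']; words without '(' come back unchanged."""
    if "(" not in norm:
        return [norm]
    stem, final = norm.rstrip(")").split("(")
    return [stem, stem + final]


def candidate_words(text, word, norm, lemma):
    """Up to 8 typing targets from one token's four fields, duplicates removed."""
    text, word, norm, lemma = (nfc(field) for field in (text, word, norm, lemma))
    forms = expand_movable(norm)
    words = {text, word, lemma, strip_accents(lemma)}
    words.update(forms)
    words.update(strip_accents(form) for form in forms)
    return words


words = candidate_words("ἐστίν,", "ἐστίν", "ἐστί(ν)", "εἰμί")
print(len(words), sorted(words))
```

Taking the union of these sets over every token in the text gives the deduplicated word list.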

Step 2.

I grouped every individual Greek character (209 of them) found in the above word list into 30 "chapter" buckets. For example, I put "κ" in chapter 1 and "ξ" in chapter 4 and "έ" in chapter 8 and "ἤ" in chapter 14 and so on. This wasn't done computationally, just manually. Each chapter has a theme: something new that gets introduced and, other than chapter 5 which covers the uppercase letters, there are no more than 9 new characters in each chapter and usually 5–8.

Step 3.

I then wrote a script that went through all 55,496 "words" from Step 1 and, for each character, looked up which chapter from Step 2 that character was introduced in. Then, for each word, the script noted the earliest chapter by which all the characters in that word have been introduced, which is the latest of those chapters.

In other words, if chapter is a mapping from a character to what chapter number it is in, calculate max(chapter[character] for character in word)

At this point the script has built a table of 55,496 words each with the "target chapter" they can be introduced in.
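As a runnable version of that expression, here is a minimal sketch. Only κ, ξ, έ and ἤ are placed in the chapters mentioned in Step 2. The other table entries and the test strings are invented, and the function name is hypothetical.

```python
import unicodedata


def nfc(s):
    """Precomposed form, so each accented letter is a single character."""
    return unicodedata.normalize("NFC", s)


# Hypothetical fragment of the character-to-chapter table. Only κ, ξ, έ and ἤ
# are placed as the post describes; the other entries are invented.
CHAPTER = {nfc(ch): n for ch, n in
           {"κ": 1, "α": 1, "ι": 1, "ξ": 4, "έ": 8, "ἤ": 14}.items()}


def target_chapter(word, chapter=CHAPTER):
    """Earliest chapter by which every character of `word` has been introduced."""
    return max(chapter[character] for character in nfc(word))


# Toy strings, chosen only to exercise the table above.
for word in ["και", "ξα", "κέ", "ἤκα"]:
    print(word, target_chapter(word))
```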

Step 4.

When a user on greektyping.com is doing a particular chapter, here's what happens:

the table is queried for all the words whose target chapter is the current chapter being done.
a sample of 10 is taken from the result (less than 10 if there are fewer than 10 words for a given target chapter, which happens in chapters 22, 24, 25, 26, and 28)
this sample is sorted by length
the user is presented with that list
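The four steps above can be sketched as a single function. This is a minimal illustration with an invented table, not the site's actual code. The function and parameter names are hypothetical.

```python
import random


def drill_words(table, chapter, size=10, rng=random):
    """Sample up to `size` words whose target chapter is `chapter`, shortest first."""
    candidates = [word for word, target in table.items() if target == chapter]
    sample = rng.sample(candidates, min(size, len(candidates)))
    return sorted(sample, key=len)


# Hypothetical word -> target chapter table of the kind built in Step 3.
table = {"κα": 1, "και": 1, "κακια": 1, "ξα": 4}
print(drill_words(table, 1))
print(drill_words(table, 4))
```

Passing a seeded random.Random instance as rng makes the drill reproducible, which is handy for testing.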

So that's how it works. It would be fairly easy to apply to other Greek texts (they don't have to be analysed to the extent MorphGNT is). But even with just the MorphGNT there's a lot of "replayability". Chapter 8 alone has 16,704 words you could be tested on.

We'll probably add some richer statistics at some point and also typing of longer units of text but for now our focus is on adding instructions for more keyboard layouts (the drills themselves will stay the same, though).

A Tour of Greek Morphology: Part 45

2020-07-28T23:20:00+08:00

Part forty-five of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

We've classified aorist active endings into three classes:

alphathematic (first aorists including kappa, sigmatic, and pseudo-sigmatic)
thematic (second aorists)
root

It's important to stress that this is a classification of distinguisher paradigms. It is related to but distinct from other ways of classifying aorists based on the properties of the stem and how it relates to the imperfective (present) stem. We'll get to those other ways in a few posts' time but for now, our classification is just based on the distinctive set of endings.

As we've done before, we'll now take this classification and look at various counts in the SBLGNT. How many times do we encounter tokens of each class? How many different lemmas are in each class? Which paradigm cells are most common for each class?

Let's start with just the lemma and token counts as well as the number of lemmas that only occur once in the SBLGNT text.

class	# lemmas	# tokens	# hapakes
alphathematic	661	2973	326
thematic	103	2082	33
root	36	262	15

There is more lexical variety in the alphathematic class, especially when compared with the thematic class. This can be seen in the token-lemma ratio and in the percentage of lemmas that are hapakes.

class	token-lemma ratio	% hapakes
alphathematic	4.50	49.3 %
thematic	20.21	32.0 %
root	7.28	41.7 %
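Both derived columns follow directly from the first table: the token-lemma ratio is tokens divided by lemmas, and the hapax percentage is hapakes divided by lemmas. A short Python sketch using the counts given above (the function name `derived` is just for illustration):

```python
# (lemmas, tokens, hapakes) per class, taken from the first table.
counts = {
    "alphathematic": (661, 2973, 326),
    "thematic": (103, 2082, 33),
    "root": (36, 262, 15),
}


def derived(lemmas, tokens, hapakes):
    """Return (token-lemma ratio, % of lemmas that occur only once)."""
    return tokens / lemmas, 100 * hapakes / lemmas


for cls, row in counts.items():
    ratio, pct = derived(*row)
    print(f"{cls}\t{ratio:.2f}\t{pct:.1f} %")
```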

Another way to see this is to ask what percentage of tokens are forms of the most frequent 5%, 10%, 25%, and 50% of lemmas.

	5%	10%	25%	50%
alphathematic	44.1%	57.4%	76.1%	88.7%
thematic	60.7%	76.8%	89.6%	96.4%
root	21.8%	47.7%	80.5%	92.0%

This table is saying that the top 5% of lemmas with alphathematic forms make up 44.1% of alphathematic tokens but the top 5% of lemmas with thematic forms make up 60.7% of thematic tokens.
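For anyone wanting to reproduce this kind of table on their own counts, here is one way it might be computed in Python. The per-lemma counts below are invented, and rounding the number of "top" lemmas up with `ceil` is an assumption; the exact rounding used for the table above isn't stated.

```python
import math


def top_share(token_counts, percent):
    """Percentage of tokens covered by the most frequent `percent` % of lemmas."""
    counts = sorted(token_counts.values(), reverse=True)
    # Assumption: round the number of lemmas up, so at least one lemma is included.
    top_n = math.ceil(len(counts) * percent / 100)
    return 100 * sum(counts[:top_n]) / sum(counts)


# Invented counts for ten placeholder lemmas (100 tokens in total).
token_counts = dict(zip("abcdefghij", [40, 20, 10, 10, 5, 5, 4, 3, 2, 1]))

for percent in (5, 10, 25, 50):
    print(f"top {percent}%: {top_share(token_counts, percent):.1f}% of tokens")
```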

In other words, the thematic aorist active tokens are drawn from a smaller set of lemmas than the alphathematic. In fact, a third of thematic aorist active tokens in SBLGNT are forms of εἶπον (and, as we'll see in a moment, mostly 3SG).

One interesting anomaly perhaps worth coming back to at some stage (I wasn't aware of it until now) is that at the top 5% and top 10% lemma level, the root aorists token % is lower than the alphathematic but at the 25% and 50% level is above.

Okay, that's distribution across the three classes of ending. What about individual paradigm cell counts?

	alphathematic	thematic	root
INF	509	351	95
1SG	224	163	12
2SG	88	30	3
3SG	1244	1143	94
1PL	80	45	3
2PL	119	40	13
3PL	709	310	42

In all cases, the infinitive and third person dominate.

It is interesting that in the alphathematics, 3SG dominates with 3PL next and then INF. In the thematics, 3SG dominates even more followed by INF with 3PL not far behind. In the root aorists, the INF is actually up with the 3SG with 3PL a distant third. Recall the μι verbs have a root form in the INF but nowhere else. This likely explains why the INF makes up such a large proportion of root form tokens.

Within the 1st and 2nd person cells, the 1SG dominates in the alphathematic and especially the thematic. In the root, the 2PL is actually on par with the 1SG.

Again this is worthy of closer inspection but there are definitely individual lexical items at work here.

As we've done before, let's look at which lemmas (if any) dominate particular paradigm cell counts.

	thematic	root
INF		δοῦναι 33/95
1SG	εἶδον 54/163	ἔγνων 6/12 ἀνέβην 3/12
2SG	εἶδες 8/30	ἔγνως 3/3
3SG	εἶπε(ν) 610/1143
1PL		ἐνέβημεν 1/3 ἐπέγνωμεν 1/3 ἐξέστημεν 1/3
2PL	ἐλάβετε 13/40	ἀνέγνωτε 10/13
3PL		ἔγνωσαν 17/42

Consistent with its greater lexical variety, the alphathematic cells are not dominated by any one lexical item at all.

In the thematics, though, we see the disproportionate occurrence of εἶδον in the 1SG and 2SG and especially of εἶπον in the 3SG where it makes up more than half the occurrences of thematic 3SG aorist actives.

Note that no root aorist lemma dominates the 3SG cell but all the other cells have a small set of lemmas covering a lot of occurrences. ἔγνως is the only root 2SG form in the SBLGNT, and ἀνέγνωτε makes up 77% of root 2PL occurrences.

One thing that might be slightly misleading about the lemma numbers for the thematic and (especially) root aorists is inclusion of compound verbs with preverbs. The 103 thematic aorist active lemmas actually come from 27 base verbs (there are 16 lexical items just from ἔρχομαι/ἦλθον for example). The 36 root aorist active lemmas actually come from just 7 base verbs and 3 of those (δίδωμι, τίθημι, and ἀφ-ίημι) only have a root ending in the infinitive.

So the only fully root verbs in the SBLGNT are the γνω family, the βη family, the στη family, and δυ. With the exception of δυ which has only one instance, the rest have reasonable token counts (82 for γνω, 71 for βη, 55 for στη).

The thematic aorist base verbs with the highest token counts are: the εἰπ family (689), the ἐλθ family (538), εἰδ (178), the λαβ family (125), the ἀγαγ family (71), the βαλ family (70), the ἀπο-θαν family (67), εὑρ (58), the πεσ family (57).

Next up we'll look at the aorist middles again.

A Polytonic Greek Typing Tutor

2020-07-26T16:20:00+08:00

I've revived an old web application to help people practice typing Ancient Greek.

Being able to type Greek fluently, diacritics and all, is an often neglected skill for classical and biblical language students but it's one that is increasingly important whether you're doing searches, writing essays, editing electronic editions, or just chatting about (or even better, in) Greek online.

A few years ago, I wrote a simple web application to help me practice typing using the built-in Greek - Polytonic input source on macOS. I grouped all the characters (including with full diacritics) into an ordered sequence with 30 stages then wrote a script to find Greek words in the New Testament that only used letters and diacritics appropriate for each stage in the sequence.

Talking to a classics lecturer a couple of weeks ago, he brought up the increasing need for students to be able to type Greek, and I said: "oh, I have a web app for that". But I realised it needed a bit of polish.

That polish is now done (with some help from my colleague Patrick Altman) and we've launched

https://greektyping.com

The instructions are currently still just for the macOS Greek - Polytonic input source but we've put in place some of the framework to support different instructions depending on your operating system and keyboard layout setup. We'll add new instructions for new keyboard layouts over time.

Even with missing instructions, it should mostly be possible to actually do the timed exercises with any keyboard layout as you are just assessed on the Unicode characters you are producing, not how you produced them on your particular keyboard.

Hopefully, though, this will be a helpful resource to all those who want to be able to type Ancient Greek faster. And we'll continue to improve it over time, both in terms of instructions for other layouts but also some more features, interesting stats, and fun games.

And there is no reason we can't extend it to other writing systems too.

A Tour of Greek Morphology: Part 44

2020-07-06T08:20:00+08:00

Part forty-four of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

Let's now go through the remaining aorist actives (indicatives and infinitives) that exhibit multiple stems or multiple stems and ending classes.

Differences in just the stem

προεφήτευσεν vs ἐπροφήτευσεν

The difference is just whether the augment honours the preverb or not.

| προεφήτευσεν | | προ- | -ε- | -φήτευσεν |
| ἐπροφήτευσεν | ἐ- | -προ- | | -φήτευσεν |

ἀνέῳξεν vs ἤνοιξεν vs ἠνέῳξεν

ἀνέῳξ- was the earlier aorist form and then later we find ἤνοιξ- and ἠνέῳξ-. The SBLGNT has all three in the 3SG. Like in the previous example, this is also a difference in whether the augmentation honours the preverb (ἀνέῳξ-) or not (ἤνοιξ-) but with a third form where it's effectively augmented in both places (ἠνέῳξ-).

| ἀνέῳξεν | ἀν- | -έῳ- | -ξεν |
| ἤνοιξεν | ἤν- | -οι- | -ξεν |
| ἠνέῳξεν | ἠν- | -έῳ- | -ξεν |

πεῖν vs πιεῖν

The aorist active infinitive of πίνω/ἔπῐον can exhibit a "Hellenistic contraction":

-ιει- /-ǐī-/ → -ει- /-ī-/.

ἔγημα vs ἐγάμησε

ἔγημα is the earlier form and ἐγάμησα developed later, presumably by analogy with other -εω verbs (which we'll explore later).

παρήγγειλε vs παρήγγελλε

I think this is a mistake in MorphGNT SBLGNT and should be tagged imperfect (in Luke 8.29).

κατέλειπε vs κατέλιπε

I think this is a mistake in MorphGNT SBLGNT and should be tagged imperfect (in Luke 10.40).

Differences in stem and class (non-μι verbs)

παρέλαβον vs παρελάβοσαν

παραλαμβάνω has a pretty standard thematic aorist but we find the 3PL form παρελάβοσαν alongside the expected παρέλαβον.

ἀνέκραξαν vs ἀνέκραγον

The aorist of ἀνακράζω was originally thematic ἀν-έ-κραγ-ο- but started to develop a sigmatic variant ἀν-έ-κραγ-σ-.

In the SBLGNT we mostly find the later sigmatic variant but the 3PL also appears in the original thematic form (ἀνέκραγον alongside ἀνέκραξαν).

ἤγαγον vs ἦξα (and compounds)

In the SBLGNT, we find συνήγαγον vs συνῆξα and ἐπισυναγαγεῖν vs ἐπισυνάξαι.

In other words, the stem is ἀγ-αγ-ο vs αγ-σ.

Differences in stem and class (μι verbs other than ἵστημι and compounds)

Recall that in the Hellenistic period, the INF was still mostly a root aorist but most of the other forms were kappa alphathematics (not just in the singular, as was the case classically, but in the plural too through levelling). Occasionally a thematic form creeps in though.

τίθημι and compounds

All root in the INF and kappa alphathematics elsewhere.

δίδωμι and compounds

As expected except for the 3PL παρέδοσαν (i.e. the classical form) alongside παρέδωκαν.

ἀφίημι

As expected except for the 2SG ἀφῆκες where we'd expect ἀφῆκας.

ἵστημι and compounds

Unlike the other verbs with μι presents, ἵστημι has no kappa alphathematics. In fact, even classically, the entire aorist paradigm had a full set of root aorist forms alongside a full set of sigmatic alphathematic aorists.

This is true of the compounds too.

We'll later take up the topic of the different usage between the two sets. But for now I just want to highlight that, unlike most of the other examples in this post or the previous one, this is not an example of a shift between aorist classes happening before our eyes but something more ingrained in the earlier history of Greek. We'll definitely return to it, but there are other matters to cover first.

We've now allocated all our aorist active infinitive and indicative forms to inflectional (or at least ending) classes and in the next post we'll look at some counts.

A Tour of Greek Morphology: Part 43

2020-06-29T21:00:00+08:00

Part forty-three of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

Before we get to counts in the various aorist classes, we need to dive a little more into the verbs that appear to be in more than one class.

We've already seen the kappa aorists like ἔδωκα and ἔθηκα that, in the infinitive (and, classically, in the plural), are root aorists but elsewhere have alphathematic endings and a slightly different stem.

In this post we're going to look at the aorist active verbs in the SBLGNT that have a consistent stem throughout but exhibit both thematic (2nd aorist) and alphathematic (1st aorist) variants. In other words, for some cells in the paradigm there is a form that follows the thematic distinguisher pattern, for some cells there is a form that follows the alphathematic distinguisher pattern, and in some cells we find both forms. In theory, both forms might be possible in any cell, but we're just using a small corpus so in practice the paradigms will be sparse.

In all cases, the thematic aorist is the older form and the alphathematic form developed later (particularly during the Hellenistic period) as part of a general movement towards having fewer classes of aorist.

Note that the 3SG ending -ε(ν) is ambiguous as to which class the form is in (between these two classes).

I should also note that the stem and its relationship to the imperfective stem can be used as a diagnostic for aorist class. But we are ignoring that for now and just focusing on the classes of ending (or more precisely, the distinguishers).

The relevant verbs are:

ἔρχομαι/ἦλθον and compounds
λέγω/εἶπον and compounds
φέρω/ἤνεγκα compounds
πίπτω/ἔπεσα and compounds
βάλλω/ἔβαλον and compounds
εὑρίσκω/εὗρον
ὁράω/εἶδον
ἀναιρέω/ἀνεῖλον

ἔρχομαι/ἦλθον and compounds

The alphathematic variants seem more likely in the plural (although we'll defer any actual statistics for now).

Note these could not be reanalyzed as sigmatic or pseudo-sigmatic.

	thematic	alphathematic
INF	ἐλθεῖν
1SG	ἦλθον
2SG	ἦλθες
3SG	ἦλθε(ν)	←
1PL	ἤλθομεν	ἤλθαμεν
2PL		ἤλθατε
3PL	ἦλθον	ἦλθαν

	thematic	alphathematic
INF	ἀπελθεῖν
1SG	ἀπῆλθον	ἀπῆλθα
3SG	ἀπῆλθε(ν)	←
3PL	ἀπῆλθον	ἀπῆλθαν

	thematic	alphathematic
INF	εἰσελθεῖν
1SG	εἰσῆλθον
2SG	εἰσῆλθες
3SG	εἰσῆλθε(ν)	←
1PL	εἰσήλθομεν
2PL		εἰσήλθατε
3PL	εἰσῆλθον

	thematic	alphathematic
INF	ἐξελθεῖν
1SG	ἐξῆλθον
2SG	ἐξῆλθες
3SG	ἐξῆλθε(ν)	←
1PL	ἐξήλθομεν
2PL		ἐξήλθατε
3PL	ἐξῆλθον	ἐξῆλθαν

	thematic	alphathematic
3SG	προσῆλθε(ν)	←
3PL	προσῆλθον	προσῆλθαν

	thematic	alphathematic
INF	συνελθεῖν
3SG	συνῆλθε(ν)	←
3PL	συνῆλθον	συνῆλθαν

λέγω/εἶπον and compounds

Note these could not be reanalyzed as sigmatic or pseudo-sigmatic.

	thematic	alphathematic
INF	εἰπεῖν
1SG	εἶπον	εἶπα
2SG	εἶπες	εἶπας
3SG	εἶπε(ν)	←
2PL		εἴπατε
3PL	εἶπον	εἶπαν

	thematic	alphathematic
1SG	προεῖπον
3SG	προεῖπε(ν)	←
1PL		προείπαμεν

φέρω/ἤνεγκα compounds

Note the stem ends in a kappa and so it resembles a kappa aorist when alphathematic. It is therefore particularly interesting that the indicatives are all alphathematic (or in the case of the 3SG, could be taken as in that class).

In other words, the existence of the kappa may have made speakers feel a little more comfortable using the alpha endings.

	thematic	alphathematic
INF	ἀνενεγκεῖν	ἀνενέγκαι
3SG	ἀνήνεγκε(ν)	←

	thematic	alphathematic
INF	ἀπενεγκεῖν
3SG	ἀπήνεγκε(ν)	←
3PL		ἀπήνεγκαν

	thematic	alphathematic
INF	εἰσενεγκεῖν
1PL		εἰσηνέγκαμεν

	thematic	alphathematic
INF	ὑπενεγκεῖν
1SG		ὑπήνεγκα

πίπτω/ἔπεσα and compounds

Note the stem ends in a sigma and so it resembles a sigmatic aorist when alphathematic. As with ἤνεγκα, it is therefore interesting that the indicatives are all alphathematic (or in the case of the 3SG, could be taken as in that class).

In other words, the existence of the sigma may have made speakers feel a little more comfortable using the alpha endings.

	thematic	alphathematic
INF	πεσεῖν
1SG		ἔπεσα
3SG	ἔπεσε(ν)	←
3PL		ἔπεσαν

	thematic	alphathematic
INF	ἀναπεσεῖν
3SG	ἀνέπεσε(ν)	←
3PL		ἀνέπεσαν

	thematic	alphathematic
INF	ἐκπεσεῖν
3SG	ἐξέπεσε(ν)	←
2PL		ἐξεπέσατε
3PL		ἐξέπεσαν

	thematic	alphathematic
3SG	ἐπέπεσε(ν)	←
3PL		ἐπέπεσαν

	thematic	alphathematic
3SG	προσέπεσε(ν)	←
3PL		προσέπεσαν

βάλλω/ἔβαλον and compounds

Notice that, as often has been the case before, the 3PL appears in both classes. In a future post we'll run some numbers as it could just be that the 3PL is simply more common in general.

The stem here ends in a resonant, so the alphathematics look a little more like pseudo-sigmatics.

	thematic	alphathematic
INF	βαλεῖν
3SG	ἔβαλε(ν)	←
3PL	ἔβαλον	ἔβαλαν

	thematic	alphathematic
INF	ἐπιβαλεῖν
3SG	ἐπέβαλε(ν)	←
3PL	ἐπέβαλον	ἐπέβαλαν

εὑρίσκω/εὗρον

The stem here ends in a resonant, so the alphathematics look a little more like pseudo-sigmatics.

	thematic	alphathematic
INF	εὑρεῖν
1SG	εὗρον
2SG	εὗρες
3SG	εὗρε(ν)	←
1PL	εὕρομεν	εὕραμεν
3PL	εὗρον

ὁράω/εἶδον

Note that, like λέγω/εἶπον, these could not be reanalyzed as sigmatic or pseudo-sigmatic.

	thematic	alphathematic
INF	ἰδεῖν
1SG	εἶδον
2SG	εἶδες
3SG	εἶδε(ν)	←
1PL	εἴδομεν	εἴδαμεν
2PL	εἴδετε
3PL	εἶδον	εἶδαν

ἀναιρέω/ἀνεῖλον

The stem here ends in a resonant, so the alphathematics look a little more like pseudo-sigmatics.

	thematic	alphathematic
INF	ἀνελεῖν
2SG	ἀνεῖλες
3SG	ἀνεῖλε(ν)	←
2PL		ἀνείλατε
3PL		ἀνεῖλαν

In the next post, we'll cover other aorist active verbs that have some variant forms. Then we'll be in a position to do some counts.

Lemmatization for the Morphological Lexicon

2020-06-15T08:42:00-04:00

As I slowly expand my plans for a Morphological Lexicon of New Testament Greek to a Morphological Lexicon of Ancient Greek, I'm dealing with extra challenges in lemmatization.

One of the things I'm doing to verify my work is take existing morphologically-tagged and lemmatized Greek texts and see if my code and (more importantly) data generates the same form given the lemma and morphosyntactic properties. In particular, I'm currently doing this with noun forms in Vanessa Gorman's Greek Dependency Trees.

Along the way, I'm having to amend a number of Gorman's lemmas. Not because they are wrong per se, but because they are serving a different purpose than what I need. This is not a problem unique to the Gorman trees and I gave an entire talk at SBL 2017 about related issues.

As I said there, a lemma provides a link between a token in a text and an entry in a lexical resource. It acts as a key by which to retrieve a record in a lexical database (traditionally a print dictionary).

There are two problems with this, however:

you may wish to link to multiple independent lexical resources and the entries in each may not have a one-to-one mapping
data within the lexical entry may not be valid for all tokens linking to that lexical entry and if the lemma only identifies the lexical entry as a whole, there's no way to discern which specific properties apply in each case.

I give many examples in my SBL 2017 talk. One obvious example of a problem is words with multiple senses. Sometimes texts will be tagged with more sense information, λέγω3 rather than just λέγω for example. This typically assumes a single canonical lexical resource (like the LSJ for Greek).

But the problem exists with other information attached to a lexical entry. Notably relevant in my case is morphological information like stems or inflectional classes.

Now if your goal in lemmatizing a text is to link back to an entry in LSJ (or maybe a particular sense) that's fine but it is not precise enough a reference to hang morphological information off. And this is why I'm having to augment the lemmatization in annotated texts like Vanessa Gorman's (and later the Diorisis corpus).

Much of this has to do with dialect variation. For this reason I didn't discuss many examples in my SBL 2017 talk (which was primarily focused on the New Testament). I did have an example of spelling variation, though, which is similar.

The example I gave there was ἀνάπειρος versus ἀνάπηρος. And, as I said at the time, you may not care to distinguish these if you're doing lexical semantics but if you're a textual critic or phonologist, you might. And I made the following point which is particularly relevant here:

why should ἀναπείρους be lemmatized ἀνάπηρος?

Now again, it seems innocuous if your goal in lemmatizing is just to link to a canonical dictionary. But if you're doing any kind of morphological modelling, then ἀνα-πείρ-ο- and ἀνά-πηρ-ο- are different stems. Because they are different stems, there are different objects needed to hang the "stem" property off and you want the token "ἀναπείρους" to point to the object with stem = ἀνα-πείρ-ο- not the object with stem = ἀνά-πηρ-ο-.

My SBL 2017 talk briefly listed a few other examples:

ἀναλόω ~ ἀναλίσκω
ἀποκτείνω ~ ἀποκτέννω
ἑλκύω ~ ἕλκω
ἵστημι ~ ἱστάνω

as well as:

ἀλλάσσω ~ ἀλλάττω
ἁρμόττω ~ ἁρμόζω
κλαίω ~ κλάω
αὐξάνω ~ αὔξω
μείγνυμι ~ μίγνυμι
οἶμαι ~ οἴομαι

As I tried to emphasize in my talk, choosing to lump (or split) these isn't wrong per se. But for my specific morphological purposes, and to the extent that there is a difference in any morphological property, whether it be the stem or the distinguisher paradigm or the inflectional class, or whatever, there needs to be a separate object to point to.

For this reason I'm adding distinct lemmas for each dialect. So far that has meant changing the lemmas for about 5% of the forms in the Gorman trees (note that's unique forms, not tokens). And so μέλιτταν gets "lemmatized" by me as μέλιττα not as μέλισσα, μέγαθος gets lemmatized as μέγαθος not μέγεθος, μαντηίῳ gets lemmatized as μαντήιον not μαντεῖον, and so on.

That is not to do away with the lumping all together. I can collect variations together into groups and link the group to, say, the LSJ entry. This is then entirely appropriate to use for properties that are shared across dialect / spelling variations.

This is a key point about the lattice approach described in my SBL 2017 talk. You have an object to point to where you need to be specific AND an object to point to where you can be general.

Furthermore, sometimes inflected forms can be ambiguous as to their "narrow" lemma. A ᾱ~η alternation between dialects in an ending, for example, will be neutralized in endings with a short ᾰ. And so even in the morphologically-focused case, there is sometimes a need for lumping across dialects.
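To make the "specific AND general" idea concrete, here is a small Python sketch using the ἀνάπειρος ~ ἀνάπηρος example from above. The `NarrowLemma` class and its field names are purely illustrative assumptions; this is not the actual data model of the Morphological Lexicon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NarrowLemma:
    key: str    # spelling- or dialect-specific lemma
    stem: str   # a morphological property that differs between variants
    group: str  # broader key shared across variants (e.g. a dictionary headword)


# Two narrow lemmas with different stems, lumped into the same group.
ANAPEIROS = NarrowLemma(key="ἀνάπειρος", stem="ἀνα-πείρ-ο-", group="ἀνάπηρος")
ANAPEROS = NarrowLemma(key="ἀνάπηρος", stem="ἀνά-πηρ-ο-", group="ἀνάπηρος")

# A token points at the narrow lemma; the group is reachable from there.
token_lemma = {"ἀναπείρους": ANAPEIROS}

for token, lemma in token_lemma.items():
    print(token, "->", lemma.key, "| stem:", lemma.stem, "| group:", lemma.group)
```

Stem-level properties hang off the narrow object, while anything shared across spellings or dialects can hang off the group.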

It's not just a matter of dialects and spelling variation: suppletion and heteroclisis also come into play here and benefit from this approach.

This is all extra work but I think it's necessary for a more precise, corpus-based language description.

I want to finish working through the Gorman nouns before I share some of this data but that should happen in the coming months. And I want to emphasize that, in most cases, I'm not actually changing the lemmatization, I'm just adding to it (although I am finding the occasional error whose correction I will send upstream).

It's still early days and one thing I haven't settled on is good terminology. I'm inclined to go with the lexeme ~ flexeme distinction. But then do I call the key for the latter the "flemma"?

A Tour of Greek Morphology: Part 42

2020-05-10T09:00:00+08:00

Part forty-two of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

We now turn to the middle aorist endings.

Recall that the imperfect middle endings were:

-μην
-σο (often with loss of sigma and subsequent contraction)
-το
-μεθα
-σθε
-ντο

Adding -σθαι for the infinitive, we unsurprisingly get the following distinguishers for the middles for alpha and thematic aorists:

| INF | Xασθαι | Xέσθαι |
| 1SG | Xάμην | Xόμην |
| 2SG | Xω | Xου |
| 3SG | Xατο | Xετο |
| 1PL | Xάμεθα | Xόμεθα |
| 2PL | Xασθε | Xεσθε |
| 3PL | Xαντο | Xοντο |

Notice that in the 2SG, ασο > αο > ω and εσο > εο > ου (although not all dialects do this).

For reasons we may touch on later, the root aorists don't generally appear in the middle but δίδωμι, τίθημι, and ἵημι (with stems δο-, θε-, and ἑ- respectively) have aorist middle forms that essentially act like root aorists (just as the aorist active plurals do in Classical Greek):

Again notice in the 2SG we get a loss of sigma in the case of δίδωμι and τίθημι and οσο > οο > ου and εσο > εο > ου although this time the contraction is with the root vowel, not a (alpha-)thematic vowel. Presumably εἷσο resists sigma loss and contraction because it's disyllabic.

The ambiguities are straightforward to deal with:

in the 3SG, 2PL, INF, there is an ambiguity between the thematic and τίθημι
in the 1SG, 1PL, 3PL, there is an ambiguity between the thematic and δίδωμι
in the 2SG, there is an ambiguity between the thematic, τίθημι, and δίδωμι

As we've seen before, this all comes down to whether the ε or ο is from the root vowel or theme vowel.

In the next couple of posts, we'll look at the frequency distributions of the aorist classes. We'll then start to explore in more detail the relationship between the perfective (aorist) and imperfective (present and imperfect) stems.

Tool Updates

2020-03-19T15:42:00-04:00

I have made a minor update to greek-normalisation, a more significant update to vocabulary-tools, and have started a new project postag-convert for converting between various morphosyntactic tagging schemes.

greek-normalisation

https://github.com/jtauber/greek-normalisation

The utils module in this package previously had a function for converting U+02BC and U+1FBF to U+2019 but now (in the 0.4 release) additionally provides it as a shell command.

Once the package is installed, you can type to2019 in > out in the shell and the file named in will be converted to a file named out with all the U+02BC and U+1FBF characters changed to U+2019.
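The conversion itself is simple to express in plain Python. The sketch below is a standalone illustration of the mapping, not the package's own code; the name `convert_to_2019` is made up for the example.

```python
# U+02BC MODIFIER LETTER APOSTROPHE and U+1FBF GREEK PSILI
# both become U+2019 RIGHT SINGLE QUOTATION MARK.
TO_2019 = str.maketrans({"\u02BC": "\u2019", "\u1FBF": "\u2019"})


def convert_to_2019(text):
    """Return `text` with U+02BC and U+1FBF replaced by U+2019."""
    return text.translate(TO_2019)


sample = "a\u02BC b\u1FBF c\u2019"
print([hex(ord(ch)) for ch in convert_to_2019(sample) if ord(ch) > 127])
```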

vocabulary-tools

https://github.com/jtauber/vocabulary-tools

I previously mentioned I'd incorporated lemma counts from Vanessa Gorman's treebanks into vocabulary-tools. I didn't check the Unicode normalisation first, though, and it turns out it was inconsistent (which led to bad numbers). That's now been fixed and the data converted to NFC.
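To see why inconsistent normalisation leads to bad numbers, consider this small Python illustration using the standard library's `unicodedata` module. The three strings are different encodings of the same word, chosen for the example; they are not taken from the Gorman data.

```python
import unicodedata
from collections import Counter

# Three encodings of the same word: tonos, oxia, and a combining acute.
lemmas = [
    "\u03ba\u03b1\u03af",        # iota with tonos (U+03AF)
    "\u03ba\u03b1\u1f77",        # iota with oxia (U+1F77)
    "\u03ba\u03b1\u03b9\u0301",  # iota + combining acute accent
]

raw_counts = Counter(lemmas)
nfc_counts = Counter(unicodedata.normalize("NFC", lemma) for lemma in lemmas)

print(len(raw_counts), "distinct keys before normalisation")
print(len(nfc_counts), "distinct key after normalisation")
```

Without normalisation the same lemma is split across several keys, so each gets a lower count (and a worse rank) than it should.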

I've also added the Diorisis lemma counts too and cleaned up the code to share more between the two data sets.

Thirdly, I took a pass at finding the intersection between the passages covered by Gorman and Diorisis and generated separate lemma counts for each version of the intersection. I'll write a dedicated blog post about this later but basically I'm trying to track down systemic problems with various lemmatisations and having identical texts to compare (to make sure discrepancies aren't just subcorpus bias) is very helpful.

Fourthly, I've implemented a calculation of log rank differences between lemmas in two subcorpora—in other words, a measure of how far the rank of a particular lemma differs in two subcorpora. This has (at least) two applications: one is to find which lemmas are disproportionally more common in one text versus another. For example, in the Gorman texts from Thucydides and Xenophon, the lemma Κῦρος is ranked 885th vs 30th—a log rank difference of 4.9. The lemma δράω is ranked 196th in Thucydides but 2345th in Xenophon—a log rank difference of 3.6.

The second application is to compare two lemmatisations of the same subcorpus (e.g. Gorman vs Diorisis) to try to identify systemic problems. It turns out, for example, that the log rank difference of λέγω between Gorman and Diorisis is a whopping 5.974 (you'd expect it to be at or near zero for the same corpus). Turns out that's because Gorman distinguishes λέγω, λέγω2, λέγω3 and Diorisis doesn't.
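The figures quoted above for Κῦρος and δράω are consistent with taking the absolute difference of the base-2 logarithms of the two ranks. The sketch below assumes that definition; the function name is hypothetical and this is not necessarily how vocabulary-tools implements it.

```python
import math


def log_rank_difference(rank_a, rank_b):
    """Absolute difference of the base-2 logs of two ranks (assumed definition)."""
    return abs(math.log2(rank_a) - math.log2(rank_b))


# Ranks quoted above for the Thucydides and Xenophon texts.
print(f"{log_rank_difference(885, 30):.1f}")    # Κῦρος
print(f"{log_rank_difference(196, 2345):.1f}")  # δράω
```

A lemma with the same rank in both subcorpora gets a difference of zero, which is why a large value for the same text under two lemmatisations is a warning sign.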

postag-convert

https://github.com/jtauber/postag-convert

Over the years (actually decades), various projects of mine have had to convert between different schemes for morphosyntactic tagging. Whether the original CCAT scheme, my own variants of that scheme, the Robinson scheme, the Morpheus/Perseus scheme, or its Logeion variant, I've written code at various points to do conversions from one to another. I was well overdue writing a reusable library!

I also want to be able to support Leipzig Glossing Rules and the Universal Dependencies codes as well as fully spelling out properties and values in multiple natural languages (e.g. localising terms like "case" or "accusative").

postag-convert is still in the early days but is intended to eventually be useful for all of the above (and potentially reusable beyond Ancient Greek too).

A Tour of Greek Morphology: Part 41

2020-03-17T12:00:00-05:00

Part forty-one of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In part 39, we outlined the distinguisher paradigms for the sigmatic (first), thematic (second), and root aorists in the active indicative and infinitive:

| INF | Xαι | Xεῖν | Xναι |
| 1SG | Xα | Xον | Xν |
| 2SG | Xας | Xες | Xς |
| 3SG | Xε(ν) | Xε(ν) | X |
| 1PL | Xαμεν | Xομεν | Xμεν |
| 2PL | Xατε | Xετε | Xτε |
| 3PL | Xαν | Xον | Xσαν |

For the sigmatic aorists, I didn't show the actual sigma because it was consistent across the paradigm (and hence not part of the "distinguisher"). This turned out to be a useful way to think about it for other reasons too.

We've already seen (in part 40) that verbs like ἔδωκα and ἔθηκα follow the sigmatic paradigm in the singular (or in both the singular and plural in the Hellenistic period) despite not having a sigma at all.

But there are other verbs that have the alpha endings too but without a sigma either because

(a) the sigma sound is incorporated into the letter ξ or ψ:

(b) the sigma has dropped out because the previous sound is a resonant (nasal: μ, ν; or liquid: λ, ρ):

(forms not in SBLGNT in italics)

We'll discuss this in detail in another post but the loss of sigma in (b) is accompanied by a lengthening of the vowel before the resonant. Hence, for example, ἔμεινα compared with present μένω. These aorists are sometimes called pseudo-sigmatic aorists.

For the purposes of categorising distinguisher paradigms, (a) and (b) still just follow the alpha endings.

And so there are three sets of endings:

alpha endings (including the sigmatic, pseudo-sigmatic and kappa)
thematic endings
root endings

As we shall see later, there are a few other verbs that sometimes take on alpha endings despite not even having an underlying sigma. There are also verbs that mix one type of aorist and another (sometimes with a semantic distinction).

We'll come back to looking at the frequency distribution of the different types of aorist but, before we do that, let's take a look at the middle aorist endings.

Tolkien, Sonnenschein and Language Learning

2020-02-24T03:41:21-05:00

Via an unusual route, I discovered Edward Adolf Sonnenschein and his thoughts at the turn of the 20th century on teaching Latin (and Greek).

It started last week when I was looking through Oronzo Cilli's wonderful book Tolkien's Library: An Annotated Checklist for entries relating to Greek. One of the books mentioned was Sonnenschein's A Greek Grammar for Schools: Based on the Principles and Requirements of the Grammatical Society, marked King Edward's School (where Tolkien went) and with Tolkien's brother Hilary's name.

This was clear evidence that Hilary Tolkien, and possibly John Ronald himself, used Sonnenschein's grammar at King Edward's School. Sonnenschein was a classics professor in Birmingham, editor of a series of grammars (of which the Greek Grammar was one), and co-founder of the Classical Association.

The grammars were published by Swan Sonnenschein, founded by his brother and which, incidentally, merged with George Allen & Co just before it became George Allen & Unwin. Two decades later, of course, Unwin published The Hobbit.

Talking to Seumas Macdonald about the Greek grammar (and Tolkien's classics education), he mentioned he was familiar with Sonnenschein from his Latin readers.

Now quite independent of this, I was looking at The Greek War Of Independence, an easy Greek reader by Charles D. Chambers. We're producing a digital edition of it as part of the Greek Learner Texts Project. Not only was the original book published by Swan Sonnenschein but the preface begins:

This book is an attempt to apply to Greek the methods which Professor Sonnenschein has expounded in his Ora Maritima and Pro Patria. The main principle is that the systematic study of grammar should proceed side by side with the reading of a narrative.

So here was another mention of Sonnenschein's Latin readers. I dug up an online scan of Ora Maritima and discovered the following in the preface (written in 1908):

My apology for adding another to the formidable array of elementary Latin manuals is that there is no book in existence which satisfies the requirements which I have in mind as of most importance for the fruitful study of the language by beginners. What I desiderate is:— 1. A continuous narrative from beginning to end, capable of appealing in respect of its vocabulary and subject matter to the minds and interests of young pupils, and free from all those syntactical and stylistic difficulties which make even the easiest of latin authors something of a problem. 2. A work which shall hold the true balance between too much and too little in the matter of systematic grammar. In my opinion, existing manuals are disfigured by a disproportionate amount of lifeless Accidence. The outcome of the traditional system is that the pupil learns a multitude of Latin forms (Cases, Tenses, Moods), but very little Latin.

I love the phrase "disfigured by a disproportionate amount of lifeless Accidence". It reminds me of the style of Tolkien's reviews in The Year's Work in English Studies in 1924. I also love "the pupil learns a multitude of Latin forms...but very little Latin".

As Fletcher Hardison pointed out when I shared this quote with the team working on the easy Greek readers: "I think we just found our manifesto".

Later in the preface, Sonnenschein writes:

The pupil who has mastered this book ought to be able to read and write the easiest kind of Latin with some degree of fluency and without serious mistakes: in a word, Latin ought to have become in some degree a living language to him.

The use of "no book in existence" at the start makes me wonder whether this was the first real attempt at applying the Direct Method for historical languages. I wonder also if this is the first mention of Latin alongside the phrase "living language".

Ora Maritima also includes an earlier essay Sonnenschein wrote in 1900 entitled New Methods in the Teaching of Latin. Presumably the reader is an attempt to implement the ideas in this essay.

In it, Sonnenschein writes:

Grammar has its proper place in any systematised method of teaching a language; but that place is not at the beginning but rather at the end of each of the steps into which a well-graduated course must be divided.

...

There should be no preliminary study of grammar apart from the reading of a text.

...

Each new grammatical feature of the language would be presented as it is wanted, in an interesting context, and would be firmly grasped by the mind; at convenient points the knowledge acquired would be summed up in a table (the declension of a noun or the forms of a tense).

This is almost identical to what I outlined back in my "New Kind of Graded Reader" video in 2008 (100 years after Ora Maritima!): "we are first introduced to forms as they are used in context and then come back later to consolidate, abstract, and generalize later."

The use of "would" suggests that Sonnenschein does not (yet) consider the idea to have been implemented in any books (hence the goal of Ora Maritima eight years later).

The entire essay is worth reading. It's available at the Internet Archive although I might correct the OCR and make available a proper transcription.

All of this has made me interested in the history of classical language teaching at the turn of the century. What was the relationship of Sonnenschein's work to that of W. H. D. Rouse? Did Sonnenschein know Gouin's book? Was Tolkien exposed to the Direct Method at all?

Via Seumas, I became aware of The Living Word: W. H. D. Rouse and the Crisis of Classics in Edwardian England by Christopher Stray. I promptly ordered a copy as well as Stray's The Classical Association: The First Century 1903-2003.

What started as a tenuous connection to Tolkien's classics education has returned me to a study of the pioneers of the Direct Method applied to classical languages and given me even more inspiration to work on the Greek Learner Texts Project.

UPDATE: I've finished a first pass correcting the OCR of Sonnenschein's 1900 essay Newer Methods in the Teaching of Latin (as reproduced in Ora Maritima) here.

Vanessa Gorman's Lemmatisation Now in vocabulary-tools

2020-02-13T21:19:29-05:00

Last year I started the Python library vocabulary-tools to consolidate the various scripts I've written over the years to analyse vocabulary in (particularly New Testament) texts. I've just added support for the vocabulary in Vanessa Gorman's treebanks.

As part of the greek-texts project it's important to have as many texts lemmatised as possible. For a while (for example in https://vocab.perseus.org) I used Giuseppe Celano's automated lemmatisation of Perseus. Recently I started working on cleaning up the Diorisis Ancient Greek Corpus which is also an automated lemmatisation.

Automated lemmatisation is around 90% accurate but that's quite low for doing vocabulary work, especially as the lemmatisation errors are often systematic and so can throw off stats in quite a significant way.

Ultimately, we need hand-curated lemmatisation and one of the goals of the greek-texts project is to help facilitate that. I obviously already have a lemmatisation of the Greek New Testament. There is also the Ancient Greek Dependency Treebank. But one of the most impressive efforts in this area (especially in light of the fact it is a solo effort) is Vanessa Gorman's work.

There are now over 500,000 tokens of Greek prose treebanked by Professor Gorman. So not only lemmas but morphosyntactic tagging and syntactic dependency tagging as well.

But for the short term, it's the lemmas I'm interested in. I extracted them from the XML format produced by Arethusa and built lemma counts that could be loaded into vocabulary-tools.
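As a rough sketch of that kind of extraction (not the actual script): the element name `word` and the attribute `lemma` below are assumptions about the general shape of the treebank XML, and `SAMPLE` is a tiny hand-made stand-in for a real file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-made sample; the element and attribute names are assumptions
# for illustration, not a guaranteed description of the real files.
SAMPLE = """
<treebank>
  <sentence id="1">
    <word id="1" form="ὁ" lemma="ὁ"/>
    <word id="2" form="λόγος" lemma="λόγος"/>
    <word id="3" form="τοῦ" lemma="ὁ"/>
    <word id="4" form="," lemma=","/>
  </sentence>
</treebank>
"""


def lemma_counts(xml_text):
    """Count lemma attributes on every <word> element, skipping punctuation."""
    root = ET.fromstring(xml_text.strip())
    counts = Counter()
    for word in root.iter("word"):
        lemma = word.get("lemma")
        # Ignore tokens with no lemma and tokens that are pure punctuation.
        if lemma and any(ch.isalpha() for ch in lemma):
            counts[lemma] += 1
    return counts


counts = lemma_counts(SAMPLE)
```

The resulting `Counter` is the sort of lemma-count structure that vocabulary statistics can then be built on.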

There's still a lot of work to do on my library but I can now do things like generate an 80% vocabulary list for the Gorman corpus. Or see what words you'd have to learn to read Plato's Apology if you can read Lysias's On the Murder of Eratosthenes and the New Testament at the 95% level.
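To make those two kinds of question concrete, here is a minimal sketch working directly on a `collections.Counter` of lemma counts. The function names `coverage_list` and `words_to_learn` are hypothetical and are not the vocabulary-tools API; the counts in the usage example are made up.

```python
from collections import Counter


def coverage_list(counts, target):
    """Most frequent lemmas needed to cover `target` (e.g. 0.8) of all tokens."""
    total = sum(counts.values())
    covered = 0
    needed = []
    for lemma, count in counts.most_common():
        if covered / total >= target:
            break
        needed.append(lemma)
        covered += count
    return needed


def words_to_learn(target_counts, known, target):
    """Extra lemmas (most frequent first) needed to reach `target` coverage
    of a text, given a set of lemmas the reader already knows."""
    total = sum(target_counts.values())
    covered = sum(c for lemma, c in target_counts.items() if lemma in known)
    to_learn = []
    for lemma, count in target_counts.most_common():
        if covered / total >= target:
            break
        if lemma not in known:
            to_learn.append(lemma)
            covered += count
    return to_learn


# Made-up counts purely for illustration.
toy = Counter({"καί": 50, "ὁ": 30, "λέγω": 10, "ἀνήρ": 6, "δίκη": 4})
core = coverage_list(toy, 0.8)
extra = words_to_learn(toy, known=set(core), target=0.95)
```

In practice `known` would itself be a coverage list built from other texts, which is how a question like the Apology one above can be framed.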

I also took the opportunity to add more features to vocabulary-tools including incorporating the code I wrote for the Subcorpus Vocabulary Statistics post (which was based on the Celano lemmatisation).

Ultimately I'd like more post-beginner Greek prose and the LXX lemmatised. I'm currently working on Plato's Crito and Epictetus's Enchiridion. If you're interested in any of this stuff, please check out the greek-texts project and join our Slack workspace.

A Tour of Greek Morphology: Part 40

2020-02-04T22:35:16-05:00

Part forty of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In the classical Attic dialect we find the following aorist active forms for δίδωμι and τίθημι:

Looking at this vertically, we might split the constant part from the distinguisher as follows:

but looking horizontally, we might split it this way:

It looks like the singular forms share δω or θη and the plural forms share δο or θε.

The plurals seem to be inflecting like root aorists with δο and θε as the root. The infinitives are consistent with this too (with an -εναι ending).

However, the singular forms with the lengthened grade vowel seem to inflect with the alpha like we saw with the sigmatic aorists, except we have a kappa rather than a sigma.

And so we have:

These will sometimes be referred to as kappa aorists even though the kappa is only found in the singular and the forms are otherwise consistent with root aorists.

In the SBLGNT and other Hellenistic period Greek, we find these forms, though:

although in Luke 1.2 we find the older ἔδοσαν not ἔδωκαν.

The infinitives and singulars have stayed the same, but the plurals have changed to be consistent with the singulars. They have the lengthened vowel grade, the kappa, and the alphathematic endings.

This is an example of paradigm levelling within the active indicatives. The contrast in number between singular and plural was being indicated not only by the personal ending but (redundantly) by the vowel grade, the existence/absence of the kappa and the existence/absence of the alpha theme vowel.

Redundancy is not a bad thing—it improves comprehensibility in the face of noise—but it is still easy to see how this sort of levelling might happen. A speaker might internalise from other verbs that if the aorist active 2SG is Χς, the 2PL is Χτε. This pattern works for root aorists, it works for thematic aorists, and it works for sigmatic aorists. Following that, a speaker familiar with ἔδωκα-ς might naturally produce ἐδώκα-τε. It would be obvious to listeners what was meant, even if the form ἔδοτε was the "correct" one. Over time, ἐδώκατε might be accepted and eventually dominate. A similar thing presumably happened with all the plural forms.

We're almost done with an initial tour of the indicative aorist actives with just a couple more paradigms to look at before we switch to the middles.

Working with the Diorisis Ancient Greek Corpus

2020-01-20T09:03:54-05:00

I've recently started working on cleaning up the Diorisis Ancient Greek Corpus for my own vocabulary and morphology work as well as potential use in Scaife.

I can't remember if I simply didn't know about the Diorisis corpus until recently or had put it on a list of things to look at one day and forgot to get back to it. But it was Rodda et al.'s Vector space models of Ancient Greek word meaning, and a case study on Homer in TAL 60.3 that put me (back) on to it.

The Diorisis Ancient Greek Corpus was produced by Alessandro Vatri and Barbara McGillivray and is a 10-million word corpus of 820 texts from Perseus and some other sources (in TEI XML, non-TEI XML, HTML, and apparently Microsoft Word).

Vatri and McGillivray compiled it for studying semantic change but obviously it's useful for a number of my own research interests. I've been working with Giuseppe Celano's lemmatisation (also used in Scaife) but it has some problems and I'd always planned to clean it up a bit. Diorisis excited me as a potentially better curated (if smaller) corpus.

I also liked the fact that the Diorisis corpus had work-level metadata with dates and genre (which I've always wanted on my corpus for a variety of reasons). And of course, it is cc-by-sa licensed.

So I started a repo to begin processing:

https://github.com/jtauber/diorisis

The first thing I did was write a script to extract the work-level metadata, with this result:

https://github.com/jtauber/diorisis/blob/master/catalog.tsv

Then I started extracting the token-level data. Like the Celano data, the XML format used for the analysed corpus is huge (over 2GB on disk) and I wanted something a little more normalised along the lines of other work I'm doing.

But here's where I hit my first problem. Vatri and McGillivray made the odd decision to use BetaCode for the corpus word forms (although not the lemmas). What is even more odd is their paper (linked above) argues for the benefits of BetaCode over Unicode. The arguments, however, seem to stem from a misunderstanding of Unicode.

Section 3.3 of their paper begins:

All Greek characters have been converted to Beta Code, in order to adopt a uniform and consistent encoding and with a view to automatic parsing and lemmatization. For these purposes, Beta Code was chosen because of its flexibility and ease of use in the following look-up operations:

They then proceed with three arguments. The first is:

Word-forms to be automatically analysed and annotated may or may not start with a capital letter; in order to be matched to entries in a digital dictionary, forms should be converted to the formats corresponding to the entries. Greek lowercase and uppercase letters are encoded as different characters in the Unicode table (e.g. the lower-case letter α corresponds to utf-8 code 0391, the upper-case letter A corresponds to utf-8 code 03B1), which would require an ad-hoc conversion for each character between its lower-case and upper-case versions. Beta Code simply encodes capitalization through the juxtaposition of an asterisk (*) character (lower-case α is encoded as A, and upper-case A is encoded as *A), which can be easily added or removed in the look-up process.

Firstly, this confuses Unicode code points with UTF-8 encoding forms. Secondly, they get the Unicode code points for α and Α around the wrong way. α is U+03B1 (which in UTF-8 would be CE B1). Thirdly, "ad-hoc conversion for each character between its lower-case and upper-case versions" is an overstatement of the problem. A simple .lower() in Python, for example, is arguably easier than removing * (and certainly not "ad-hoc") especially when one considers the wide range of accent and breathing placements one finds in uppercase BetaCode characters in the wild.
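A short illustration of those three points using only built-in Python string operations (the variable names are just for this example):

```python
lower_alpha = "\u03b1"  # α GREEK SMALL LETTER ALPHA
upper_alpha = "\u0391"  # Α GREEK CAPITAL LETTER ALPHA

# The code point is an abstract number; UTF-8 is one way of serialising it.
code_point = f"U+{ord(lower_alpha):04X}"
utf8_bytes = lower_alpha.encode("utf-8")

# Case folding for lookup is a single built-in call, and it also works
# on precomposed characters carrying a breathing.
folded = upper_alpha.lower()
folded_with_breathing = "\u1f08".lower()  # Ἀ -> ἀ
```

Here `code_point` is the string `U+03B1` while `utf8_bytes` is the two-byte sequence CE B1, which is exactly the distinction the paper blurs.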

Their second argument is:

Diacritics such as the Greek diaeresis ( ̈) may or may not appear in dictionary entries (for instance, editors may add them to Greek words to mark hiatuses in metrical texts). Greek characters containing the diaeresis (alone or in combination with other diacritic marks) all have different utf-8 codes (e.g. ϊ = 03ca, ΐ = 0390, ῒ = 1fd2, ῗ = 1fd7), whereas Beta Code encodes the diaeresis through the juxtaposition of a plus sign (+; e.g. ϊ = I+, ΐ = I/+, ῒ = I\+, ῗ = I=+). This makes it very easy to process diacritics in the look-up process.

Again there is a confusion between Unicode code points and UTF-8. "03ca" is a code point, not UTF-8. But more significantly, the argument that BetaCode is superior because it encodes the diaeresis as a separate character completely ignores decomposed characters in Unicode. Ironically, copy-pasting from their paper, ϊ is already decomposed as two code points:

U+03B9 GREEK SMALL LETTER IOTA
U+0308 COMBINING DIAERESIS

In other words, Unicode provides exactly what they are asking for.
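A minimal sketch of what that looks like with Python's standard `unicodedata` module; the helper name `strip_diaeresis` is just for illustration:

```python
import unicodedata


def strip_diaeresis(text):
    """Remove any diaeresis while keeping other diacritics intact."""
    decomposed = unicodedata.normalize("NFD", text)
    without = decomposed.replace("\u0308", "")  # COMBINING DIAERESIS
    return unicodedata.normalize("NFC", without)


# The precomposed ϊ (U+03CA) decomposes into exactly the two code points
# listed above.
parts = [unicodedata.name(ch) for ch in unicodedata.normalize("NFD", "\u03ca")]
```

Because the diaeresis is its own code point after NFD normalisation, removing it is the same one-character operation as dropping a `+` in BetaCode.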

Finally, they argue:

In ag orthography, the grave accent (`) is only used to mark the alteration of the pitch normally marked by an acute accent in connected speech; thus, it never appears in dictionary entries (which only contain acute or circumflex accents). Whereas Unicode has different codes for Greek characters with an acute or a grave accent, Beta Code encodes such diacritics as forward (/) and backward (\) slashes, respectively; this makes grave accents easy to convert into acute accents in the look-up process.

Again this ignores Unicode normalisation and decomposed characters. It is really no harder to convert graves to acutes with Unicode than with BetaCode.
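For instance, a sketch of the grave-to-acute conversion via decomposition, again using only the standard library (the function name is hypothetical):

```python
import unicodedata


def grave_to_acute(text):
    """Replace every grave accent with an acute, whatever vowel carries it."""
    decomposed = unicodedata.normalize("NFD", text)
    # U+0300 COMBINING GRAVE ACCENT -> U+0301 COMBINING ACUTE ACCENT
    swapped = decomposed.replace("\u0300", "\u0301")
    return unicodedata.normalize("NFC", swapped)


kai = grave_to_acute("\u03ba\u03b1\u1f76")  # καὶ -> καί
```

This is the direct analogue of replacing `\` with `/` in BetaCode, and other diacritics on the same vowel (breathings, iota subscript) are left untouched.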

These misunderstandings and misrepresentations of Unicode would be one thing if my argument was just that Unicode is no worse than BetaCode. But the choice of BetaCode is problematic for other reasons.

Most of these problems have to do with ( and especially ). BetaCode texts should use ' for apostrophes marking elision. The Diorisis corpus sometimes does. But it also (in around six thousand cases by my initial estimate) uses ) instead. And so we have KAT) for κατ’ instead of KAT'. Diorisis is hardly the only culprit here. In Perseus I still find cases where ) was used in BetaCode so the (incorrect) Unicode comes out κατ̓. To make matters worse, ( and ) are also used for actual parentheses.

And so in BetaCode in the wild, ) could mean smooth breathing or an apostrophe or a parenthesis. With BetaCode this has to be manually disambiguated. With Unicode it does not (unless incorrectly converted from ambiguous BetaCode).
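To illustrate the kind of disambiguation involved, here is a rough heuristic sketch, not my actual conversion code. It assumes uppercase BetaCode tokens, and the rules (a breathing only follows an initial vowel or diphthong, or the `*` of a capital; a token-final `)` after a consonant is probably elision) are simplifying assumptions. Anything else gets flagged for manual checking.

```python
VOWELS = set("AEHIOUW")


def classify_close_paren(token, index):
    """Guess the role of the ')' at token[index] in an uppercase BetaCode token."""
    before = token[:index]
    letters = [ch for ch in before if ch.isalpha()]
    if before == "*":
        return "breathing"  # capital with breathing, e.g. *)A
    if before.startswith("(") or not letters:
        return "check"  # likely a real parenthesis
    if all(ch in VOWELS for ch in letters) and len(letters) <= 2:
        return "breathing"  # initial vowel or diphthong
    if index == len(token) - 1:
        return "elision"  # e.g. KAT) standing for KAT'
    return "check"


def classify_token(token):
    """Classify every ')' in a token as (index, guess) pairs."""
    return [
        (i, classify_close_paren(token, i))
        for i, ch in enumerate(token)
        if ch == ")"
    ]
```

Even a heuristic like this only narrows things down: the tokens it labels `check` still need a human to look at them.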

And so my process of converting Diorisis to using Unicode is not a trivial one. My initial conversion code flagged almost ten thousand tokens that need to be manually checked. The majority of these seem to be ) for elision but some are for parentheses. Eyeballing what I have so far, there are also cases of multiple words not properly tokenised into two and also some bad text (OCR or keying errors) that needs to be corrected.

Note that some of these issues were likely problems in the upstream text and so my task ahead is partly just doing that correction. But most of the work is addressing ambiguities in the BetaCode that would not exist if Unicode had been used everywhere. This makes the fact BetaCode was chosen for unnecessary reasons even more frustrating.

One could argue I'm talking at most about 0.1% of the text so I could just ignore problematic tokens. Given the automated lemmatisation is considerably less accurate than 99.9% (more like 90% at best) it might seem like a pedantic thing to focus on. But the problematic tokens tend to be of a particular type and aren't uniformly distributed in the corpus and so depending on the task the corpus is being used for, they can become more prominent than one might think from a figure like "0.1%".

My goal in the coming weeks is to have a slightly cleaned up Diorisis corpus completely in Unicode. This can then be used for some initial vocabulary stats work. My next goal after that is to improve the lemmatisation, initially using curated lemmatisations that did not exist when Diorisis was done. Longer term, I plan to continue to curate the lemmatisation.

This improved Diorisis can then form the basis for a lot of the work I previously used the Celano analysis for. There will definitely be blog posts reporting status along the way.

I am extremely grateful for the work that went into producing the Diorisis corpus. It is just a shame that a misunderstanding of Unicode led to a decision that is creating extra work now. But that will be overcome soon and hopefully my incremental improvements will turn out to be useful to others over time.

A Tour of Greek Morphology: Part 39

2020-01-05T16:00:00+08:00

Part thirty-nine of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

Now we'll take an initial look at the aorist active infinitive and indicatives for λύω:

λῦσαι
ἔλυσα
ἔλυσας
ἔλυσε(ν)
ἐλύσαμεν
ἐλύσατε
ἔλυσαν

Probably the most common term for this type is first aorist but this implies some ordering (versus the "second aorists" in particular) that isn't particularly helpful in most cases.

If we're contrasting the indicatives here with their imperfect equivalents, we'd have

	aorist	imperfect
1SG	Xσα	Xον
2SG	Xσας	Xες
3SG	Xσε(ν)	Xε(ν)
1PL	Xσαμεν	Xομεν
2PL	Xσατε	Xετε
3PL	Xσαν	Xον

The existence of the sigma is why these are often alternatively called sigmatic aorists.

But if we just look at the distinguishers within the paradigm, we can drop the sigma and get the following (with the thematic and root aorist distinguisher patterns shown for comparison):

	sigmatic	thematic	root
1SG	Xα	Xον	Xν
2SG	Xας	Xες	Xς
3SG	Xε(ν)	Xε(ν)	X
1PL	Xαμεν	Xομεν	Xμεν
2PL	Xατε	Xετε	Xτε
3PL	Xαν	Xον	Xσαν

Observe that the 2SG, 1PL, 2PL, and 3PL all appear to have the same ending as the thematic aorists but with an alpha instead of the ε/ο theme vowel. This is why these aorists are often alternatively called alphathematic aorists.

In the 1SG, the alpha ending is actually related to the ν in the thematic and root aorists. If a final ν (coming from a Proto-Indo-European -m) follows a consonant, it becomes an α in Greek (coming from a syllabic -m̥ in Proto-Indo-European). This is just a way of making an otherwise unpronounceable sequence pronounceable (in Greek). We see this same phenomenon in the accusative singular nouns (-ν when preceded by a vowel like in the 1st and 2nd declension, -α when preceded by a consonant in the 3rd declension).

So ἔλυσα makes sense instead of ˣἔλυσν. But now we have an interesting question: is the -α the ending or part of the aorist stem? Its origins are clearly as the regular ending but in light of forms like ἔλυσας, ἐλύσαμεν, one might reanalyse it as part of what distinguishes this type of aorist.

In the 3SG we find the bare ε (with movable nu in many cases) presumably by analogy with the thematic aorists. If an α had been used, it would be easily confused (without a nu) for the 1SG and (with the nu) for the 3PL.

The 3PL is particularly interesting because it potentially explains the -σαν ending we see in the root aorists (and even athematic imperfects). Again this comes back to an interesting reanalysis.

ἔλυσαν, if thought about in terms of the historical 3PL ending, could be segmented ἔλυσα-ν. If thought about in the context of the other personal endings in its paradigm, it could be segmented ἔλυσ-αν. But it could also be segmented ἔλυ-σαν, particularly in comparison to the imperfect. The morph -σαν could then have been internalised as indicating 3PL in the aorist or even aorist/imperfect.

There's more we can say about this once we've covered the perfect endings (probably a little while off!) as they're likely involved in this as well but this is yet another example of how morphology isn't really about concatenating morphemes with some compositional meaning resulting. It's a complex interaction of forms within a system.

A Tour of Greek Morphology: Part 38

2020-01-03T16:00:00+08:00

Part thirty-eight of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In the last post, we introduced the root aorist actives. We'll now introduce another type of aorist active.

Here are the aorist active infinitive and indicative forms of λαμβάνω:

INF	λαβεῖν
1SG	ἔλαβον
2SG	ἔλαβες
3SG	ἔλαβε(ν)
1PL	ἐλάβομεν
2PL	ἐλάβετε
3PL	ἔλαβον

Notice that the infinitive -εῖν is like the present (but with a circumflex) and the indicative distinguishers follow IA-1 exactly.

These distinguishers, just as with the thematic imperfects, consist of a theme vowel (an ablauting ε/ο) with the usual endings:

1SG	Xον	ο + ν
2SG	Xες	ε + ς
3SG	Xε(ν)	ε + -	movable nu
1PL	Xομεν	ο + μεν
2PL	Xετε	ε + τε
3PL	Xον	ο + ν	historically ντ but final τ dropping off

These are often called "second" aorists (although we haven't looked at the so-called "first" aorists yet). I'll generally avoid that term and instead use the term thematic aorist because of the theme vowel. Focusing on this distinctive makes it clearer what's going on with these types of aorist.

However, the thematic aorist distinguisher patterns seem to pose an even bigger problem than the root aorist distinguisher patterns: how do these not get confused for imperfects (or presents in the case of the infinitive)?

The answer is the same as for the root aorists: the stem itself is also conveying grammatical information.

The present/imperfect stem is λαμβαν+ε/ο but the aorist stem is λαβ+ε/ο. So λαβεῖν cannot be confused for the present infinitive because that would be λαμβάνειν. ἔλαβον cannot be confused for the imperfect 1SG or 3PL because they would be ἐλάμβανον. ἔλαβες cannot be confused for the imperfect 2SG because that would be ἐλάμβανες.

This does mean, however, that you need to know the stems. If you don't know λαμβαν- / λαβ- at all, you won't know whether ἔλαβες is imperfect or aorist. Xες is ambiguous as to aspect unless you know whether X corresponds to an imperfective (present/imperfect) stem or a perfective (aorist) stem.
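The dependence on knowing the stem can be sketched in a few lines of Python. This is only an illustration: the `STEMS` table, the function names, and the simplification of handling only an ε- augment on diacritic-stripped forms are all assumptions of the example.

```python
import unicodedata

# Known aspect stems (without theme vowel), written without diacritics.
STEMS = {"λαμβαν": "imperfective", "λαβ": "perfective"}


def strip_diacritics(text):
    """Drop accents and breathings so forms can be compared by letters only."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def aspect_of(form, stems=STEMS):
    """Return the aspect of an augmented indicative form, or None if the
    stem is not known (in which case the ending alone cannot tell us)."""
    bare = strip_diacritics(form)
    if not bare.startswith("ε"):
        return None
    rest = bare[1:]
    # Try longer stems first so λαμβαν is not shadowed by a shorter stem.
    for stem in sorted(stems, key=len, reverse=True):
        if rest.startswith(stem):
            return stems[stem]
    return None
```

With the table above, `aspect_of("ἔλαβες")` comes out as perfective and `aspect_of("ἐλάμβανες")` as imperfective, while a form whose stem is missing from the table returns `None`: the Xες ending by itself decides nothing.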

Here are some other examples:

εὑρίσκω has the imperfective stem εὑρισκ+ε/ο but the perfective stem εὑρ+ε/ο
ὁράω has the imperfective stem ὁρα+ε/ο but the perfective stem ἰδ+ε/ο (we'll discuss later why this augments as εἰδ-)
ἔρχομαι has the imperfective stem ἐρχ+ε/ο but the perfective stem ἐλθ+ε/ο
λέγω has the imperfective stem λεγ+ε/ο but the perfective stem εἰπ+ε/ο

We'll talk a lot more about the relationship between these stems in future posts so don't worry about those details just yet. The main thing I want to start to get across here is that the endings don't discriminate imperfective and perfective. The stem itself indicates both the lexeme AND the aspect. For this reason, they are sometimes called aspect stems and, as we have already done above, we can refer to the perfective stem or the imperfective stem. Lots more on that soon!

This has implications for morphological theory and morpheme-based approaches. There's no "morpheme" in ἔλαβες expressing just the perfective aspect.

We'll end this post summarising the differences between the root aorists and thematic aorists (which otherwise share the same endings):

root aorists	thematic aorists
no thematic vowel	thematic vowel
infinitive -ναι	infinitive -εῖν
3rd plural ending -σαν	3rd plural ending -ν < -ντ

A Tour of Greek Morphology: Part 37

2020-01-02T16:00:00+08:00

Part thirty-seven of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

For our exploration of the aorist forms, we're going to start with the aorist active infinitive and indicatives of βαίνω:

βῆναι
ἔβην
ἔβης
ἔβη
ἔβημεν
ἔβητε
ἔβησαν

This may seem a little unusual (don't worry, we'll get to the aorist forms of λύω soon enough) but it will turn out to lay a better foundation, I think.

Here are a few more paradigms of the same type:

	γι(γ)νώσκω	βαίνω	ἵστημι
INF	γνῶναι*	βῆναι (διαβῆναι*)	στῆναι*
1SG	ἔγνων*	ἔβην (ἀνέβην*)	ἔστην (ἀντέστην*)
2SG	ἔγνως*	ἔβης	ἔστης
3SG	ἔγνω*	ἔβη (ἀνέβη*)	ἔστη*
1PL	ἔγνωμεν (ἐπέγνωμεν*)	ἔβημεν (ἐνέβημεν*)	ἔστημεν (ἐξέστημεν*)
2PL	ἔγνωτε (ἐπέγνωτε*)	ἔβητε	ἔστητε
3PL	ἔγνωσαν*	ἔβησαν (ἀνέβησαν*)	ἔστησαν*

* indicates that the form appears in SBLGNT. Where the base form does not appear but a compound with a preverb does, I've included that in parentheses.

Note the following:

the INF does not have an augment but the indicatives do
the INF is always a properispomenon. In other words, it has a circumflex on the penultimate syllable. This could be explained in the above cases by the ending being -εναι with contraction taking place (although we'd want other evidence to be sure)
the consistent, lexeme-specific part of the form within a paradigm is a consonant or consonant cluster followed by a long vowel: γνω, βη, στη
the present/imperfect stem and aorist stem are not the same and, in fact, the relationship between the present/imperfect stem and aorist stem appears to be different for each lexeme so far!
the regular recessive accent means the indicative forms always end up having an acute on the augment
there is no thematic vowel (i.e. no ablauting ε/o at the end of the stem)
there is no vowel length alternation between the singular and plural
the 3PL ending is -σαν like the athematic imperfects
the rest of the endings are like all the IA from IA-1 through IA-9
the fact the endings are like the IA would lead to lack of distinction between the imperfect and aorist if not for the stem differences!

To summarise, our distinguishers (augment aside) are:

INF	-ναι
1SG	-ν
2SG	-ς
3SG	-
1PL	-μεν
2PL	-τε
3PL	-σαν

Because the endings go directly on the verbal root with no thematic vowel and with no other morphological changes, these aorists are often called root aorists. They're not normally introduced first (they aren't common by number of distinct lexemes, although are reasonably so by token count) but I've chosen to start with them in this tour because they lay a good foundation for comparing and contrasting other types of aorist.

In the next post of this tour, we'll introduce another of these types: aorists that do have a thematic vowel.

A Tour of Greek Morphology: Part 36

2019-12-31T16:00:00+08:00

Part thirty-six of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

We've now spent a lot of time looking at distinguishers and inflectional classes within each of the indicative personal ending paradigms (present active, present middle, imperfect active, and imperfect middle) of each lexeme. We also checked the consistency of the lexeme-specific part (the X or "theme") in each paradigm.

But we haven't really talked about the consistency of the lexeme-specific part across the PAs, PMs, IAs, and IMs. Perhaps not surprisingly, the same theme is used by a lexeme for both the PA and the PM (if both voices are used) and likewise the IA and IM. In other words, voice is not indicated by the theme in the present and imperfect, only by the set of endings used.

But what about the theme consistency between the present and the imperfect? That's what we'll look at now.

Given that the present and imperfect differ in their sets of endings (other than in the 1PL and 2PL) there is not a huge need to use any additional mechanism to express present versus imperfect.

But as mentioned when we first started with the imperfects, there is a difference besides the endings, namely the augment in the imperfect: typically either a prefixed ε before an initial consonant or a lengthened initial vowel.

The situation is made slightly more complex by the fact that this augmentation applies before the incorporation of any prepositional "preverb". Greek had a quite productive way of forming new verbs by prefixing base verbs with certain prepositions (a topic worthy of some posts another time).

But for our discussion of augments and the relationship between the present and imperfect forms, we will firstly look at the cases where there is no preverb.

There are 108 lemmas in the SBLGNT without preverbs that start with a consonant in the present and so in the imperfect are just prefixed with ἐ- (e.g. βλεπ- ~ ἐβλεπ-; διδ- ~ ἐδιδ-).

The situation is a little different when the present starts with a vowel. In such cases, the vowel essentially lengthens.

In three cases (ἐάω, ἕλκω, ἔχω) ε becomes ει (e.g. ἐχ- ~ εἰχ-).

In six cases (ἐγγίζω, ἐλπίζω, ἐργάζομαι, ἔρχομαι, ἐρωτάω, ἐσθίω) the ε becomes η (e.g. ἐρχ- ~ ἠρχ-).

We'll explore the difference in a later post.

Initial ευ stays as ευ in two cases (εὐδοκέω, εὐπορέομαι) but becomes ηυ in one (εὔχομαι ~ ηὐχόμην). Sometimes this happens within a single lexeme too (see the end of this post).

In the five cases of an initial ο- (ὁμιλέω, ὁμολογέω, ὀνειδίζω, ὀρθρίζω, ὀφείλω), it becomes ω. Note that it does not become ου.

οι in οἰκοδομέω becomes ῳ (οἰκοδομ- ~ ᾠκοδομ-).

The one case of an initial η (ἥκω) stays as η.

The vowels α, ι, and υ, which can be short or long, just become their long variety but, of course, long α generally becomes η in Attic and Koine unless preceded by ι, ε, or ρ.

There are 19 cases of α becoming η (e.g. 2PL ἀγαπᾶτε ~ ἠγαπᾶτε).

There are two cases of ι becoming a long ι (ἰάομαι, ἰσχύω) and one of υ becoming a long υ (ὑμνέω).

In one case (αἰτέω) αι becomes ῃ and in two cases (αὐλίζομαι, αὐξάνω) αυ becomes ηυ.

We now turn to those verbs with a preverb (or which augment as if they do).

Because a lot of prepositions end in vowels or other sounds that interact with the start of the verb root or with the augment there is often elision or assimilation.

For example ἀναβαίν- (ανα + βαιν-) in the present becomes ἀνεβαιν- (αν’ + ε + βαιν-) in the imperfect. The ἀνα- is intact in the present but elided in the imperfect. ἀνεχ- in the present becomes ἀνειχ- in the imperfect: ἀνα- elided to ἀν’- in both present and imperfect.

The ε augment often has the effect of breaking the consecutive consonants that assimilate in the present. For example ἐμβλεπ- in the present becomes ἐνεβλεπ- in the imperfect, the ν in ἐν- no longer becoming μ in the presence of the following labial β.

συν- is particularly worth observing because you get things like:

συγχαιρ- ~ συνεχαιρ-
συζητ- ~ συνεζητ-
συλλαλ- ~ συνελαλ-
συμβαιν- ~ συνεβαιν-

At some point it might be fun to whip up the exact finite-state model for all this but for now, I'll just note the counts.
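Short of a real finite-state model, the four συν- examples above can be captured in a tiny sketch. It assumes stems written without diacritics and a consonant-initial verb stem after the preverb, and the function name is hypothetical.

```python
def syn_imperfect(present_stem):
    """Map a present stem with (possibly assimilated) συν- to its imperfect stem.

    Only covers consonant-initial verb stems: the augment ε separates the
    preverb from the verb stem, so the ν of συν- surfaces unassimilated.
    """
    if present_stem[:3] in ("συγ", "συλ", "συμ", "συν"):
        # ν assimilated to γ, λ or μ (or left alone) before the stem
        return "συνε" + present_stem[3:]
    if present_stem.startswith("συζ"):
        # ν dropped before ζ, so the verb stem starts at the ζ
        return "συνε" + present_stem[2:]
    raise ValueError(f"not a recognised συν- compound: {present_stem}")


pairs = {stem: syn_imperfect(stem) for stem in ["συγχαιρ", "συζητ", "συλλαλ", "συμβαιν"]}
```

A fuller model would also need the vowel-initial stems and the other preverbs, with their elision and assimilation behaviour.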

There are 100 examples with preverbs plus a consonant-initial verb stem.

Preverbs plus a vowel-initial verb stem follow the same sound rules as without a preverb and the expected elision can be found.

There are 16 cases of α>η with a preverb (e.g. περιαγ- ~ περιηγ-; ἀπαγγελλ- ~ ἀπηγγελλ-). There are two cases of αι>ῃ with a preverb, 8 cases of ε>ει, 11 cases of ε>η, one case of ευ>ευ, 4 cases of η>η, 6 cases of ι>ι, and 3 cases of ο>ω.

Here's a summary of all these counts:

	no preverb	with preverb
C > εC	108	100
ε > ει	3	8
ε > η	6	11
ευ > ευ	2	1
ευ > ηυ	1	-
ο > ω	5	3
οι > ῳ	1	-
η > η	1	4
α > η	19	16
ι > ι	2	6
υ > υ	1	-
αι > ῃ	1	2
αυ > ηυ	2	-
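The table can be read as a small rule set. Here is a sketch that only encodes the patterns in the table, for stems without a preverb written without accents or breathings (a simplification for the example). Where the table has two outcomes for the same initial (ε and ευ), it returns both candidates, since the choice is lexeme-specific. The names `AUGMENT` and `augment_candidates` are hypothetical.

```python
# Longer initials first so that e.g. αι is tried before α.
AUGMENT = [
    ("αι", ["\u1fc3"]),  # ῃ
    ("αυ", ["ηυ"]),
    ("ευ", ["ευ", "ηυ"]),
    ("οι", ["\u1ff3"]),  # ῳ
    ("α", ["η"]),
    ("ε", ["ει", "η"]),
    ("ο", ["ω"]),
    ("η", ["η"]),
    ("ι", ["ι"]),  # length change not visible in writing
    ("υ", ["υ"]),  # length change not visible in writing
]

VOWELS = set("αεηιουω")


def augment_candidates(stem):
    """Possible imperfect stems for an unaccented present stem."""
    for initial, outcomes in AUGMENT:
        if stem.startswith(initial):
            return [out + stem[len(initial):] for out in outcomes]
    if stem[0] in VOWELS:
        # An initial not covered by the table (e.g. ω): refuse to guess.
        raise ValueError(f"no rule for initial of {stem}")
    return ["ε" + stem]
```

So `augment_candidates("βλεπ")` gives a single ε-prefixed stem, while `augment_candidates("εχ")` gives both the ει and η candidates and leaves the lexeme-specific choice to the caller.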

There is the interesting case of εὐαγγελίζω which is treated in Acts as if εὐ were a preverb and the imperfect form εὐηγγελίζοντο is found (with α>η). This is not counted in the 16 above.

We also in Acts find προορώμην which does not have an augment but which clearly has the imperfect middle ending.

Not included in the counts above are εἰμί and compounds which are probably worth their own post at some point (although note that for the most part the imperfect stem is just η). Also not included are εἶμι and compounds where the imperfect stem is ῃ.

We have a handful of cases where the lemma having multiple inflectional classes prevents a trivial mapping between the present and imperfect stems in all instances (ἀφίημι, δέω, συγχέω, ἵστημι, and πλέω, each of which we've discussed before). Once the right stem is chosen to map to, the rules above apply cleanly.

That leaves us with five imperfect stems whose relationship to the present stem has not yet been covered. They are:

δύναμαι having imperfects in both ἐδυν- and ἠδυν-
μέλλω having imperfects in both ἐμελλ- and ἠμελλ-
θέλω having an imperfect in ἠθελ- (because the present was originally ἐθελ-)
εὐκαιρέω having an imperfect in εὐκαιρ- and ηὐκαιρ-
εὑρίσκω having an imperfect in εὑρισκ- and ηὑρισκ-

Other than these and the ε>ει versus ε>η distinction, augmentation of the present stem to form the imperfect is entirely consistent and predictable.

We'll dive into more detail on the augment later on but we've now reached a good time to leave behind the present/imperfect and start to look at the aorist in the new year!

A Tour of Greek Morphology: Part 35

2019-12-30T16:00:00+08:00

Part thirty-five of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

To finish up our coverage of the imperfect indicative endings, we'll now use the disambiguation we did in the previous post to produce SBLGNT counts for the imperfect middles.

class	# lemmas	# tokens	# hapakes	lemmas (* = hapax)
IM-1	54	124	34
IM-2	13	27	9
IM-3	0	0	0
IM-4	6	7	5	ἐμβριμάομαι ἰάομαι προοράω ἐπακροάομαι πειράω μασάομαι*
IM-5	1	1	1	χράομαι*
IM-6	0	0	0
IM-7	1	2	0	ἐκτίθημι
IM-8	0	0	0
IM-9	4	28	2	ἐξίστημι δύναμαι ἀφίσταμαι ἀνθίστημι
IM-10	1	20	0	εἰμί
IM-11	5	13	1	συνανάκειμαι ἀνάκειμαι* κεῖμαι κατάκειμαι ἐπίκειμαι
IM-12	1	11	0	κάθημαι

And the counts for each paradigm cell for each class:

	IM-1	IM-2	IM-4	IM-5	IM-7	IM-9	IM-10	IM-11	IM-12
1SG	8	1	1	0	0	0	15	0	0
2SG	1	0	0	0	0	0	0	0	0
3SG	66	8	2	0	2	18	0	10	11
1PL	1	0	0	0	0	0	5	0	0
2PL	2	1	0	0	0	1	0	0	0
3PL	46	17	4	1	0	9	0	3	0
TOTAL	124	27	7	1	2	28	20	13	11

And just like we did for the actives, let's summarise which forms and distinguisher patterns are most common:

	IM-1	IM-2	IM-4	IM-5	IM-7	IM-9	IM-10	IM-11	IM-12
1SG	ἐβουλόμην 3/8 + other -όμην	ἐφοβούμην 1/1	προορώμην 1/1	-	-	-	ἤμην 15/15	-	-
2SG	ἤρχου 1/1	-	-	-	-	-	-	-	-
3SG	-ετο	-εῖτο	ἰᾶτο 2/2	-	ἐξετίθετο 2/2	ἐδύνατο 11/18 + other -ατο	-	ἔκειτο 4/10, κατέκειτο 4/10 + other -ειτο	ἐκάθητο 11/11
1PL	ἐπορευόμεθα 1/1	-	-	-	-	-	ἤμεθα 5/5	-	-
2PL	διελογίζεσθε 1/2, ἀνείχεσθε 1/2	ἠκαιρεῖσθε 1/1	-	-	-	ἐδύνασθε 1/1	-	-	-
3PL	-οντο	ἐφοβοῦντο 10/17 + other -οῦντο	ἐνεβριμῶντο 1/4, ἐπηκροῶντο 1/4, ἐπειρῶντο 1/4, ἐμασῶντο 1/4	ἐχρῶντο 1/1	-	ἐξίσταντο 6/9, ἠδύναντο 3/9	-	συνανέκειντο 2/3, ἐπέκειντο 1/3	-

That brings our discussion of the imperfect middles up to where we got to three posts ago with the imperfect actives and where we got with the present indicatives many posts ago.

There's one more post I'll do to close out the year (before moving on to the aorist system in 2020). I want to look at the relationship between the "X" in the present and the "X" in the imperfect within each lexeme for which we have forms in each. In other words, we'll take a closer (although not yet comprehensive) look at the augment.

A Tour of Greek Morphology: Part 34

2019-12-29T16:00:00+08:00

Part thirty-four of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

It's now time to sort out any inflectional class (IC) ambiguities in our imperfect middle endings. As usual, I've written code evaluating the forms in the SBLGNT and assigning each one a single IC. The rules used are as follows:

3SG:-ετο or 2PL:-εσθε	is	IM-1 if lemma ends in -ω or -ομαι; IM-7 if lemma ends in -ημι
1SG:-όμην or 1PL:-όμεθα or 3PL:-οντο	is	IM-1 if lemma ends in -ω or -ομαι; IM-8 if lemma ends in -ωμι
1SG:-ούμην or 2SG:-οῦ or 3PL:-οῦντο	is	IM-2 if lemma ends in -έω or -έομαι; IM-3 if lemma ends in -όω or -όομαι
1SG:-ώμην or 2SG:-ῶ or 1PL:-ώμεθα or 3PL:-ῶντο	is	IM-5 if lemma is χράομαι; IM-4 if lemma ends in -άω or -άομαι
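To give a feel for how rules like these turn into code, here is a minimal sketch. It is not the script used to produce the downloadable results. It keys on the set of candidate classes rather than on the cell and ending (a simplification of mine), and it compares lemma endings after stripping accents and breathings so that the check doesn't depend on how the Unicode happens to be normalised.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Remove accents, breathings and other combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def disambiguate_im(candidates, lemma: str) -> str:
    """Pick one imperfect middle class from an ambiguous pair, using the lemma."""
    options = frozenset(candidates)
    bare = strip_marks(lemma)
    if options == {"IM-1", "IM-7"}:        # 3SG -ετο, 2PL -εσθε
        if bare.endswith(("ω", "ομαι")):
            return "IM-1"
        if bare.endswith("ημι"):
            return "IM-7"
    elif options == {"IM-1", "IM-8"}:      # 1SG -όμην, 1PL -όμεθα, 3PL -οντο
        if bare.endswith(("ω", "ομαι")):
            return "IM-1"
        if bare.endswith("ωμι"):
            return "IM-8"
    elif options == {"IM-2", "IM-3"}:      # 1SG -ούμην, 2SG -οῦ, 3PL -οῦντο
        if bare.endswith(("εω", "εομαι")):
            return "IM-2"
        if bare.endswith(("οω", "οομαι")):
            return "IM-3"
    elif options == {"IM-4", "IM-5"}:      # 1SG -ώμην, 2SG -ῶ, 1PL -ώμεθα, 3PL -ῶντο
        if bare == "χραομαι":
            return "IM-5"
        if bare.endswith(("αω", "αομαι")):
            return "IM-4"
    raise ValueError(f"no rule for {sorted(options)} with lemma {lemma}")


print(disambiguate_im({"IM-4", "IM-5"}, "χράομαι"))
print(disambiguate_im({"IM-1", "IM-7"}, "ἐκτίθημι"))
```

Raising an error when no rule applies is deliberate: a form that falls through is exactly the kind of thing worth looking at by hand.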

You can download the results of the disambiguation here. We'll use this to do our counts in the next post.

Let's first ask our usual questions, though...

Are the disambiguated inflectional classes consistent for each lexeme?

Yes. In this case they are, so there's nothing further to comment on.

Is the value of X in our paradigm patterns consistent across a lexeme?

There's only one exception, to do with the augment:

δύναμαι: ἐδυν or ἠδυν

A Tour of Greek Morphology: Part 33

2019-12-28T16:00:00+08:00

Part thirty-three of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In part 29, we summarised our imperfect middle indicative paradigms as follows:

	IM-1	IM-2	IM-3	IM-4	IM-5
1SG	Xόμην	Xούμην	Xούμην	Xώμην	Xώμην
2SG	Xου	Xοῦ	Xοῦ	Xῶ	Xῶ
3SG	Xετο	Xεῖτο	Xοῦτο	Xᾶτο	Xῆτο
1PL	Xόμεθα	Xούμεθα	Xούμεθα	Xώμεθα	Xώμεθα
2PL	Xεσθε	Xεῖσθε	Xοῦσθε	Xᾶσθε	Xῆσθε
3PL	Xοντο	Xοῦντο	Xοῦντο	Xῶντο	Xῶντο

	IM-6	IM-7	IM-8	IM-9
1SG	Xύμην	Xέμην	Xόμην	Xάμην
2SG	Xυσο	Xεσο	Xοσο	Xασο/Xω
3SG	Xυτο	Xετο	Xοτο	Xατο
1PL	Xύμεθα	Xέμεθα	Xόμεθα	Xάμεθα
2PL	Xυσθε	Xεσθε	Xοσθε	Xασθε
3PL	Xυντο	Xεντο	Xοντο	Xαντο

This does not quite cover the imperfect middle indicative forms we find in the SBLGNT.

Recall in the previous post, we said the copula does not appear at all in the imperfect active 1SG in the SBLGNT. It does, however, appear in the middle form ἤμην 15 times.

Furthermore, even though the 1PL active form ἦμεν does appear 8 times, we also find the middle form ἤμεθα 5 times. The SBLGNT text even has both in Galatians 4.3.

We'll create an IM-10 class for these.

We also have imperfect forms of κεῖμαι, so much like we created PM-11, we'll create an IM-11 for 3SG forms like ἔκειτο, ἀνέκειτο, ἐπέκειτο, and κατέκειτο and for 3PL forms like ἐπέκειντο and συνανέκειντο.

Finally, we have 11 occurrences of ἐκάθητο, the imperfect middle 3SG form of κάθημαι, treated as if it were no longer a compound of the preverb κατά + ἧμαι. We'll create an IM-12 for this form.

Hence, we have:

	IM-10	IM-11	IM-12
1SG	ἤμην	-	-
2SG	-	-	-
3SG	-	Xειτο	Xητο
1PL	ἤμεθα	-	-
2PL	-	-	-
3PL	-	Xειντο	-

The remaining cells would be straightforward to fill out, but as we're testing everything against a specific corpus and grammars, we'll have to extend the tests before we include other forms.

In the next two posts, we'll finish up the middles with some disambiguation rules and corpus counts.

A Tour of Greek Morphology: Part 32

2019-12-27T16:00:00+08:00

Part thirty-two of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

We'll now use the disambiguation we did in the previous post to produce SBLGNT counts for the imperfect actives, as we did for the presents earlier. I've included all the lemmas if the list is short enough (and marked the hapakes with an asterisk).

class	# lemmas	# tokens	# hapakes	lemmas (* = hapax)
IA-1	150	540	87
IA-2	68	239	35
IA-3	7	8	6	δολιόω πληρόω ἀξιόω δηλόω (and thematic forms of δίδωμι ἀποδίδωμι παραδίδωμι )
IA-4	9	60	1
IA-5	1	2	-	ζάω
IA-6a	-	-	-
IA-7	3	4	2	τίθημι προστίθημι ἐπιτίθημι
IA-8	4	16	1	δίδωμι ἐπιδίδωμι* παραδίδωμι (and -σαν form of ἔχω)
IA-9	1	43	-	φημί
IA-10	1	435	-	εἰμί
IA-10-COMP	2	3	1	σύνειμι* πάρειμι
IA-11-COMP	3	4	2	ἄπειμι ἔξειμι εἴσειμι

And the counts for each paradigm cell for each class:

	IA-1	IA-2	IA-3	IA-4	IA-5	IA-6a	IA-7	IA-8	IA-9	IA-10	IA-10-COMP	IA-11-COMP
1SG	18	7	-	-	1	-	-	-	-	-	-	-
2SG	3	1	-	-	-	-	-	-	-	8	-	-
3SG	255	127	3	29	-	-	2	12	43	314	-	2
1PL	12	4	-	-	-	-	-	-	-	8	-	-
2PL	8	3	-	2	1	-	-	-	-	10	-	-
3PL	244	97	4+1	29	-	-	2	4	-	95	3	2
TOTAL	540	239	8	60	2	-	4	16	43	435	3	4

One of the things that's obvious about these numbers is the importance of the 3SG and 3PL forms. In every class other than IA-5 (ζάω) those two person-numbers dominate (and there are only two instances of ζάω anyway). Of the 11 inflection classes with forms in the SBLGNT, 7 of them ONLY have forms in either 3SG or 3PL or both. Notice that IA-9 appears only in the 3SG (i.e. ἔφη). Notice also that, despite some showing in the 2SG, 1PL, and 2PL the copula does not appear at all in the imperfect active 1SG in the SBLGNT.
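To put a number on that dominance, here is a small sketch that computes the share of tokens in each class that are 3SG or 3PL. The counts are copied from the table above (with the "4+1" in IA-3 entered as 5, and IA-6a left out because it has no forms); the variable and function names are just for illustration.

```python
CELLS = ["1SG", "2SG", "3SG", "1PL", "2PL", "3PL"]

# Token counts per cell, in the order of CELLS, from the table above.
COUNTS = {
    "IA-1": [18, 3, 255, 12, 8, 244],
    "IA-2": [7, 1, 127, 4, 3, 97],
    "IA-3": [0, 0, 3, 0, 0, 5],      # 3PL is given as 4+1 in the table
    "IA-4": [0, 0, 29, 0, 2, 29],
    "IA-5": [1, 0, 0, 0, 1, 0],
    "IA-7": [0, 0, 2, 0, 0, 2],
    "IA-8": [0, 0, 12, 0, 0, 4],
    "IA-9": [0, 0, 43, 0, 0, 0],
    "IA-10": [0, 8, 314, 8, 10, 95],
    "IA-10-COMP": [0, 0, 0, 0, 0, 3],
    "IA-11-COMP": [0, 0, 2, 0, 0, 2],
}


def third_person_share(counts: list[int]) -> float:
    """Fraction of a class's tokens that are 3SG or 3PL."""
    by_cell = dict(zip(CELLS, counts))
    return (by_cell["3SG"] + by_cell["3PL"]) / sum(counts)


for inflectional_class, counts in COUNTS.items():
    print(f"{inflectional_class}: {third_person_share(counts):.0%}")
```

For IA-1, for instance, that is 499 of the 540 tokens.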

Just as a little experiment, what if we showed the imperfect active indicative paradigms with only what is found in the SBLGNT text and showed complete forms (not just the distinguisher pattern) in any case where a form made up 25% or more of the instances of that cell? The result would be the following:

	IA-1	IA-2	IA-3	IA-4	IA-5	IA-6a	IA-7	IA-8	IA-9	IA-10	IA-10-COMP	IA-11-COMP
1SG	Xον	Xουν	-	-	ἔζων	-	-	-	-	-	-	-
2SG	εἶχες, ἐζώννυες, ἤθελες	περιεπάτεις	-	-	-	-	-	-	-	ἦς, ἦσθα	-	-
3SG	ἔλεγε(ν) + other Xε(ν)	Xει	ἐπλήρου, ἠξίου, ἐδήλου	ἐπηρώτα + other Xᾱ	-	-	προσετίθει, ἐτίθει	ἐδίδου + other Xου	ἔφη	ἦν	-	εἰσῄει
1PL	Xομεν	ἐζητοῦμεν, ἐλαλοῦμεν, παρεκαλοῦμεν, εὐδοκοῦμεν	-	-	-	-	-	-	-	ἦμεν	-	-
2PL	εἴχετε, ἐπιστεύετε + other Xετε	ἐζητεῖτε, ἐποιεῖτε, ἐφρονεῖτε	-	ἠγαπᾶτε	ἐζῆτε	-	-	-	-	ἦτε	-	-
3PL	ἔλεγον + other Xον	Xουν	ἐδίδουν, ἀπεδίδουν, παρεδίδουν + ἐδολιοῦσαν	ἐπηρώτων + other Xων	-	-	ἐτίθεσαν, ἐπετίθεσαν	εἴχοσαν, ἐδίδοσαν, παρεδίδοσαν	-	ἦσαν	παρῆσαν, συνῆσαν	ἀπῄεσαν, ἐξῄεσαν

Here any form making up 25% or more of the tokens for that combination of inflectional class and person-number is shown (if that's all that's shown in a cell, there are no other forms). Forms in bold also occur 10 times or more in the text.
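The selection rule itself is simple enough to sketch. The function below takes the counts of each form in one cell and returns the forms that reach the 25% share, flagging those that would also be bolded. The function name is hypothetical and the counts in the example are placeholders, not SBLGNT figures.

```python
from collections import Counter


def summarise_cell(form_counts: Counter, share_threshold=0.25, bold_threshold=10):
    """Return ([(form, count, is_bold), ...], has_other_forms) for one cell."""
    total = sum(form_counts.values())
    shown = []
    for form, count in form_counts.most_common():
        if count / total >= share_threshold:
            shown.append((form, count, count >= bold_threshold))
    # If some forms fell below the threshold, the cell gets "+ other X...".
    has_other_forms = len(shown) < len(form_counts)
    return shown, has_other_forms


# Placeholder counts for a single cell (not real data).
example = Counter({"form-a": 12, "form-b": 5, "form-c": 2, "form-d": 1})
shown, has_other_forms = summarise_cell(example)
print(shown, has_other_forms)
```

A cell where no single form reaches the threshold would come back with an empty list, which corresponds to showing only the distinguisher pattern.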

That wraps up our exploration of the indicative imperfect active endings. In the next three posts, we'll finish up the middles.

A Tour of Greek Morphology: Part 31

2019-12-26T16:00:00+08:00

Part thirty-one of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In the previous post we went through and made sure we had all our imperfect active indicative endings covered ready for counting. We still had some ambiguities, though, so we need to use rules based around the lemma to disambiguate. We can then apply those rules to generate our data for counting.

Our disambiguation rules are:

2SG:-ης or 3SG:-η	is	IA-5 if lemma ends in -ω; IA-9 if lemma is ἵστημι; IA-9b if lemma is φημί
1PL:-αμεν or 2PL:-ατε or 3PL:-ασαν	is	IA-9 if lemma is ἵστημι; IA-9b if lemma is φημί
2PL:-ετε	is	IA-1 if lemma ends in -ω; IA-7 if lemma ends in -μι
2SG:-εις or 3SG:-ει	is	IA-2 if lemma ends in -ω; IA-7 if lemma ends in -μι
1PL:-οῦμεν or 3PL:-ουν	is	IA-2 if lemma ends in -έω; IA-3 if lemma ends in -όω
1SG:-ων or 1PL:-ῶμεν or 3PL:-ων	is	IA-5 if lemma is ζάω (should really just lemmatise ζήω); IA-4 if lemma is other -άω
2SG:-υς or 3SG:-υ	is	IA-3 if lemma ends in -ω; IA-6a if lemma ends in -υμι (or if form not -ους / -ου); IA-8 if lemma ends in -ωμι
1SG:-υν	is	IA-2 if lemma ends in -έω; IA-3 if lemma ends in -όω; IA-6a if lemma ends in -υμι (or if form not -ουν); IA-8 if lemma ends in -ωμι
1PL:-ομεν	is	IA-1 if lemma ends in -ω; IA-8 if lemma ends in -μι
1SG:-ειν or 3PL:-εσαν	is	IA-7 if lemma is τίθημι or ἵημι (not an issue in SBLGNT); IA-11 if lemma is εἶμι; IA-11-COMP if lemma ends in -ειμι
2SG:-εις	is	IA-2 if lemma ends in -ω (not an issue in SBLGNT); IA-7 if lemma is τίθημι or ἵημι (not an issue in SBLGNT); IA-11 if lemma is εἶμι; IA-11-COMP if lemma ends in -ειμι
1SG:-ην	is	IA-7 if lemma is τίθημι or ἵημι; IA-9 if lemma is ἵστημι; IA-9b if lemma is φημί
2PL:-ῆτε	is	IA-5 if lemma ends in -ω; IA-10-COMP if lemma ends in -μι (not an issue in SBLGNT)

Encapsulating these rules into a Python script and running on our data, we now have an inflectional class for all 1,344 imperfect active indicative forms in the MorphGNT SBLGNT.

The output of my Python script looks like this:

010118  ἦν  3SG IA-10   εἰμί    IA-10   ἦν  _   ἦν
010125  ἐγίνωσκε(ν) 3SG IA-1    γινώσκω IA-1    Xε(ν)   ἐγίνωσκ ε(ν)
010209  προῆγε(ν)   3SG IA-1    προάγω  IA-1    Xε(ν)   προῆγ   ε(ν)
010209  ἦν  3SG IA-10   εἰμί    IA-10   ἦν  _   ἦν
010215  ἦν  3SG IA-10   εἰμί    IA-10   ἦν  _   ἦν
010218  ἤθελε(ν)    3SG IA-1    θέλω    IA-1    Xε(ν)   ἤθελ    ε(ν)
010304  εἶχε(ν) 3SG IA-1    ἔχω IA-1    Xε(ν)   εἶχ ε(ν)
010304  ἦν  3SG IA-10   εἰμί    IA-10   ἦν  _   ἦν
010314  διεκώλυε(ν) 3SG IA-1    διακωλύω    IA-1    Xε(ν)   διεκώλυ ε(ν)
010407  ἔφη 3SG IA-5/IA-9/IA-9b φημί    IA-9b   Xη  ἔφ  η

The columns are:

the book/chapter/verse reference
the normalized form
the morphosyntactic properties
the inflectional classes possible without disambiguation
the lemma
the disambiguated inflectional class
the distinguisher pattern
the theme (the value of X)
the distinguisher
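If you want to work with the output yourself, a row can be read into a named structure along these lines. This is a sketch that assumes the nine columns are separated by whitespace and never contain spaces themselves, which holds for the sample rows above; the `Row` type and its field names are mine.

```python
from typing import NamedTuple


class Row(NamedTuple):
    ref: str                 # book/chapter/verse reference
    form: str                # normalized form
    properties: str          # morphosyntactic properties, e.g. 3SG
    candidates: list         # classes possible without disambiguation
    lemma: str
    inflectional_class: str  # disambiguated class
    pattern: str             # distinguisher pattern
    theme: str               # the value of X ("_" when the pattern has no X)
    distinguisher: str


def parse_row(line: str) -> Row:
    """Split one whitespace-separated output line into its nine columns."""
    ref, form, props, candidates, lemma, ic, pattern, theme, dist = line.split()
    return Row(ref, form, props, candidates.split("/"), lemma, ic, pattern, theme, dist)


sample = "010407  ἔφη 3SG IA-5/IA-9/IA-9b φημί    IA-9b   Xη  ἔφ  η"
row = parse_row(sample)
print(row.lemma, row.candidates, "->", row.inflectional_class)
```

Unpacking into exactly nine names means a malformed line fails loudly rather than being silently misread.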

You can download the entire thing here.

We'll use this to do our counts in the next post.

But before that, there are a couple of things we can check.

Firstly, are the disambiguated inflectional classes consistent for each lexeme?

There are five exceptions, all of which we raised in the previous post:

ἔχω is variously IA-1 or IA-8 (the alternate εἴχοσαν for the 3PL)
ἐρωτάω is variously IA-4 or IA-2 (the alternate ἠρώτουν for the 3PL)
δίδωμι is variously IA-8 or IA-3 (the alternate ἐδίδουν for the 3PL)
παραδίδωμι is variously IA-8 or IA-3 (the alternative παρεδίδουν for the 3PL)
τίθημι is variously IA-7 or IA-2 (the alternative ἐτίθουν for the 3PL)

Notice they are all in the 3PL and, with the exception of the ἐρωτάω case, are alternations between the thematic and athematic ending.
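The consistency check itself is just a grouping operation. Here is a minimal sketch, assuming the disambiguated data has already been reduced to (lemma, inflectional class) pairs; the function name and the handful of sample rows are for illustration only.

```python
import unicodedata
from collections import defaultdict


def inconsistent_lexemes(rows):
    """Return the lemmas that were assigned more than one inflectional class."""
    classes_by_lemma = defaultdict(set)
    for lemma, inflectional_class in rows:
        # Normalise so the same lemma typed two ways still groups together.
        key = unicodedata.normalize("NFC", lemma)
        classes_by_lemma[key].add(inflectional_class)
    return {
        lemma: classes
        for lemma, classes in classes_by_lemma.items()
        if len(classes) > 1
    }


# A few sample pairs consistent with the exceptions listed above.
rows = [
    ("ἔχω", "IA-1"),
    ("ἔχω", "IA-8"),
    ("φημί", "IA-9b"),
    ("εἰμί", "IA-10"),
    ("εἰμί", "IA-10"),
]
result = inconsistent_lexemes(rows)
print(result)
```

The same grouping, keyed on the theme column instead of the class, answers the second question below about the value of X.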

Secondly, is the value of X in our paradigm patterns consistent across a lexeme?

There seem to be four exceptions, three of which are to do with the augment:

εὑρίσκω: ηὑρισκ or εὑρισκ
μέλλω: ἠμελλ or ἐμελλ
εὐκαιρέω: εὐκαιρ or ηὐκαιρ

So far we've glossed over the augment but we shall look at it in detail in a future post.

There is also:

συγχέω: συνεχε or συνεχυνν

which we previously brought up. This is not just an inflectional class difference but also a stem formation difference. We'll talk a bit more about this in future posts, but for now it's probably best thought of as two distinct lemmas that are conventionally conflated under the single headword συγχέω. Notice also that συνέχεον is the one example of an uncontracted IA-2 in the SBLGNT.

A Tour of Greek Morphology: Part 30

2019-12-12T02:26:11-05:00

Part thirty of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

To complete the imperfect active indicatives, there are just a few more tweaks we need to make.

Firstly, we need to add the compound versions of IA-10 and IA-11.

Secondly, because the form ἦστε (for the 2PL of ἦν) appears in one of our test grammars, we need to add that to IA-10. More on that whole paradigm later.

Thirdly, let's just rename IA-6 to IA-6a for consistency with how we named the present once we decided to include the υ.

That results in this update to IA-6a through IA-11-COMP.

	IA-6a	IA-7	IA-8	IA-9	IA-9b	IA-10	IA-10-COMP	IA-11	IA-11-COMP
1SG	Xῡν	Xην/Xειν	Xουν	Xην	Xην	ἦ/ἦν	Xῆ/Xῆν	ᾖα/ᾔειν	Xῇα/Xῄειν
2SG	Xῡς	Xεις	Xους	Xης	Xης/Xησθα	ἦς/ἦσθα	Xῆς/Xῆσθα	ᾔεις/ᾔεισθα	Xῄεις/Xῄεισθα
3SG	Xῡ	Xει	Xου	Xη	Xη	ἦν	Xῆν	ᾔει/ᾔειν	Xῄει/Xῄειν
1PL	Xῠμεν	Xεμεν	Xομεν	Xᾰμεν	Xᾰμεν	ἦμεν	Xῆμεν	ᾖμεν	Xῇμεν
2PL	Xῠτε	Xετε	Xοτε	Xᾰτε	Xᾰτε	ἦτε/ἦστε	Xῆτε/Xῆστε	ᾖτε	Xῇτε
3PL	Xῠσᾰν	Xεσᾰν	Xοσᾰν	Xᾰσᾰν	Xᾰσᾰν	ἦσᾰν	Xῆσᾰν	ᾖσᾰν/ᾔεσᾰν	Xῇσᾰν/Xῄεσᾰν

And just for completeness, here's the rest:

	IA-1	IA-2	IA-3	IA-4	IA-5
1SG	Xον	Xουν	Xουν	Xων	Xων
2SG	Xες	Xεις	Xους	Xᾱς	Xης
3SG	Xε(ν)	Xει	Xου	Xᾱ	Xη
1PL	Xομεν	Xοῦμεν	Xοῦμεν	Xῶμεν	Xῶμεν
2PL	Xετε	Xεῖτε	Xοῦτε	Xᾶτε	Xῆτε
3PL	Xον	Xουν	Xουν	Xων	Xων

Does this handle all the forms in the MorphGNT plus our test grammars?

Almost.

In Romans 3.13, we find ἐδολιοῦσαν, which does not match any of our patterns. What is happening here?

We have a contraction (suggesting IA-2 or IA-3 but, as indicated in the lemma, it's an IA-3) but with a -σᾰν ending like we would expect in an athematic verb. Because the contraction would normally only happen with a theme-vowel, we don't expect to see both -οῦ- and -σαν together.

If you look at IA-3 and IA-8 you can see they are indistinguishable in the singular. In fact IA-8 is acting like a thematic verb in the singular so there was already a merger happening between the classes. Further confusion about which endings to use in the plural makes sense, although here we have an interesting combination of distinguishers, combining the -οῦ- we might expect in an IA-3 plural with the -σᾰν we expect in an athematic plural.

It's worth pointing out that particular phenomenon is fairly common in the Septuagint and Romans 3.13 is a quote from the Septuagint. We can't know for sure if Paul would normally have written ἐδολίουν instead but we can speculate that it's like an American writer keeping British spelling in a quotation of a British author.

In our data, that's the only form that fails to match. But there are others that exhibit a similar phenomenon that we should collect for completeness.

Twice in John 15 we find εἴχοσαν where we might expect εἶχον (and indeed do find outside of John). This is again common in the LXX.

More broadly (and not particularly characteristic of the LXX) is the replacement of athematic verbs with a thematic equivalent.

Twice in Acts we find ἐτίθουν where we would expect ἐτίθεσαν (acting like an IA-2 τιθέω).

Also in Acts we find ἀπεδίδουν (acting like an IA-3 διδόω) and both παρεδίδουν (Acts 27.1) and παρεδίδοσαν (Acts 16.4).

These athematic verbs are inflecting as if they were thematic. Note this actually causes a 1SG / 3PL ambiguity that wouldn't otherwise exist.

There are other examples of athematic verbs inflecting as if thematic:

John 21.18 has ἐζώννυες with a theme vowel (becoming thematic IA-1 instead of IA-6a).

Matthew 21.8 has ἐστρώννυον with theme vowel (becoming thematic IA-1 instead of IA-6a).

Twice in Mark we find ἤφιε(ν) (becoming thematic IA-1 instead of IA-7 where we'd expect ἠφίει).

And, in different categories:

Acts 21.27 has uncontracted 3PL συνέχεον.

Acts 9.22 has 3SG συνέχυννεν as if the lemma were συγχύννω (and I'm tempted to, in fact, lemmatise it that way).

We also have cases of confusion between -αω and -εω verbs (which long happened in Greek dialects). ἠρώτουν in Matthew 15.23 looks like an IA-2 (or IA-3) even though ἠρώτα and ἠρώτων elsewhere suggest IA-4.

Unlike the usual confusions between inflectional classes we've seen above, though, there are no distinguisher patterns shared between IA-2 and IA-4 so the underlying cause is different.

A few other points to raise to round out the full set of imperfect active forms in our data:

In MorphGNT there is an open issue about ἔστηκεν in John 8.44. MorphGNT currently analyses it as an imperfect (it would be the imperfect of στήκω) but with the lemma ἵστημι (which would have a perfect of ἕστηκεν with rough breathing). This needs to be resolved in MorphGNT so doesn't really affect our analysis of imperfect active forms here but I thought I'd mention it.

Another issue that should be considered in MorphGNT is that εἰσῄει in Acts 21 should possibly be normalised with the movable nu as it's an IA-11-COMP.

In the next post, we'll go through resolving any remaining ambiguities in the imperfect active forms.

Release of text-validator 0.3

2019-12-03T23:28:37-05:00

A few weeks ago, I announced the first release of text-validator, my pluggable command-line tool for validating the formatting and orthography of text files.

Since then I've done a couple of small updates.

In 0.2, I added a validator plugin to test tokens against a list of regular expressions. This is great for catching stray characters.

For example, here's the configuration I use for my text of the Enchiridion:

TOKEN_REGEXES = [
    "\\d+\\.\\d+$",
    "[«(]*[\u0370-\u03FF\u1F00-\u1FFF]+\u2019?[.,:;»)]*$",
]
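The first pattern accepts section references like 1.1 and the second accepts a Greek word, optionally preceded by an opening guillemet or parenthesis and followed by an elision mark and closing punctuation. To see what the patterns let through, here is a standalone sketch that tests tokens against them with Python's `re` module. How text-validator applies the list internally is not shown here; using `re.match` on each token is my assumption, and `invalid_tokens` is a hypothetical helper, not part of the tool.

```python
import re

TOKEN_REGEXES = [
    "\\d+\\.\\d+$",
    "[«(]*[\u0370-\u03FF\u1F00-\u1FFF]+\u2019?[.,:;»)]*$",
]

COMPILED = [re.compile(pattern) for pattern in TOKEN_REGEXES]


def invalid_tokens(tokens):
    """Return the tokens that match none of the configured patterns."""
    return [
        token
        for token in tokens
        if not any(regex.match(token) for regex in COMPILED)
    ]


# "logos" has no Greek letters and "12" is not a chapter.section reference.
print(invalid_tokens(["1.1", "«καὶ", "λόγος,", "logos", "12"]))
```

A stray Latin letter or digit inside a Greek word fails the second pattern, which is how this kind of check catches stray characters.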

In 0.3, I made a small but significant change: the tool now returns a non-zero status code if validation fails. This doesn't make much difference if you're just running the tool manually on the command-line but if you're running it as part of a continuous integration (CI) process, this is vital.

With the 0.3 change, I was able to set up a GitHub Action on both the apostolic-fathers and enchiridion repos to automatically run text-validator any time there is a new push or pull request.

You can read more about text-validator and how to use it at https://github.com/jtauber/text-validator and the linked-to wiki pages.

A Tour of Greek Morphology: Part 29

2019-11-29T03:40:34-05:00

Part twenty-nine of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In this post, we review the imperfect middle distinguishers in much the same way as we did the imperfect actives in Part 28 and the present middles in Part 14.

	IM-1	IM-2	IM-3	IM-4	IM-5
1SG	Xόμην	Xούμην	Xούμην	Xώμην	Xώμην
2SG	Xου	Xοῦ	Xοῦ	Xῶ	Xῶ
3SG	Xετο	Xεῖτο	Xοῦτο	Xᾶτο	Xῆτο
1PL	Xόμεθα	Xούμεθα	Xούμεθα	Xώμεθα	Xώμεθα
2PL	Xεσθε	Xεῖσθε	Xοῦσθε	Xᾶσθε	Xῆσθε
3PL	Xοντο	Xοῦντο	Xοῦντο	Xῶντο	Xῶντο

	IM-6	IM-7	IM-8	IM-9
1SG	Xύμην	Xέμην	Xόμην	Xάμην
2SG	Xυσο	Xεσο	Xοσο	Xασο/Xω
3SG	Xυτο	Xετο	Xοτο	Xατο
1PL	Xύμεθα	Xέμεθα	Xόμεθα	Xάμεθα
2PL	Xυσθε	Xεσθε	Xοσθε	Xασθε
3PL	Xυντο	Xεντο	Xοντο	Xαντο

and if we capture the common elements in each row:

	IM-1	IM-2	IM-3	IM-4	IM-5	IM-6	IM-7	IM-8	IM-9
1SG	-μην	-μην	-μην	-μην	-μην	-μην	-μην	-μην	-μην
2SG	-{ο}	-{ο}	-{ο}	-{ο}	-{ο}	-σο	-σο	-σο	-σο/-{ο}
3SG	-το	-το	-το	-το	-το	-το	-το	-το	-το
1PL	-μεθα	-μεθα	-μεθα	-μεθα	-μεθα	-μεθα	-μεθα	-μεθα	-μεθα
2PL	-σθε	-σθε	-σθε	-σθε	-σθε	-σθε	-σθε	-σθε	-σθε
3PL	-ντο	-ντο	-ντο	-ντο	-ντο	-ντο	-ντο	-ντο	-ντο

Just as with the present middles, other than the contraction happening in 2SG (in this case obscuring the historical σο), there is no difference between the thematic and athematic endings.

As with the other paradigms we've seen, some cells across inflectional classes have identical distinguishers and so those cells alone can't identify the inflectional class (and hence all the other forms in that class). In particular:

1SG, 1PL, and 3PL can't distinguish within the set {IM-1, IM-8} or within the set {IM-2, IM-3} or within the set {IM-4, IM-5}
3SG and 2PL can't distinguish within the set {IM-1, IM-7}
The 2SG can't distinguish within the set {IM-2, IM-3} or within the set {IM-4, IM-5}.

Or to flip it around:

classes	characteristics
IM-{1, 7}	ε in 3SG, 2PL
IM-{1, 8}	ό in 1SG, 1PL; ο in 3PL
IM-{2, 3}	ού in 1SG, 1PL; οῦ in 2SG, 3PL
IM-{4, 5}	ώ in 1SG, 1PL; ῶ in 2SG, 3PL

Notice that 1SG, 1PL, and 3PL are the ones with a theme vowel in -ο- and 3SG and 2PL are the ones with a theme vowel in -ε-. There is, of course, nothing magical about this. Cells with an omicron theme vowel will fall together with athematic stems ending in omicron and cells with an epsilon theme vowel will fall together with athematic stems ending in epsilon.

But notice also that cells that fall together because of contraction with an omicron theme vowel will be distinct in contractions with an epsilon theme vowel and vice-versa.

That means that you just need a cell for a person-number that takes an omicron theme vowel and a cell for a person-number that takes an epsilon theme vowel: the two of them are enough to give you the inflectional class. In this sense, the ablaut of the theme vowel actually works to counteract the ambiguity caused by contraction.
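This is easy to check mechanically. The sketch below encodes the distinguishers from the tables above (leaving out the 2SG row to avoid the Xασο/Xω alternation) and returns the classes compatible with whatever cells have been observed. The names `PARADIGMS` and `classes_matching` are mine.

```python
import unicodedata

CLASSES = ["IM-1", "IM-2", "IM-3", "IM-4", "IM-5", "IM-6", "IM-7", "IM-8", "IM-9"]

# One row per cell, one distinguisher per class, in the order of CLASSES.
TABLE = """
1SG όμην ούμην ούμην ώμην ώμην ύμην έμην όμην άμην
3SG ετο εῖτο οῦτο ᾶτο ῆτο υτο ετο οτο ατο
1PL όμεθα ούμεθα ούμεθα ώμεθα ώμεθα ύμεθα έμεθα όμεθα άμεθα
2PL εσθε εῖσθε οῦσθε ᾶσθε ῆσθε υσθε εσθε οσθε ασθε
3PL οντο οῦντο οῦντο ῶντο ῶντο υντο εντο οντο αντο
"""


def nfc(text: str) -> str:
    """Normalise so that visually identical endings compare equal."""
    return unicodedata.normalize("NFC", text)


PARADIGMS = {}
for line in nfc(TABLE).strip().splitlines():
    cell, *endings = line.split()
    PARADIGMS[cell] = dict(zip(CLASSES, endings))


def classes_matching(observed: dict) -> set:
    """Classes whose distinguishers agree with every observed cell."""
    return {
        ic
        for ic in CLASSES
        if all(PARADIGMS[cell][ic] == nfc(ending) for cell, ending in observed.items())
    }


print(sorted(classes_matching({"1SG": "ούμην"})))
print(sorted(classes_matching({"1SG": "ούμην", "3SG": "εῖτο"})))
```

With only the 1SG the answer is still ambiguous between two classes; adding the 3SG, which takes the epsilon theme vowel, narrows it to one.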

This is the sort of systemic view of morphology that I think is very important, rather than just thinking of things in terms of parts of words being combined together.

In an upcoming post we'll check whether this covers all the imperfect middles in MorphGNT.

Mounce on Ablaut (Or Not)

2019-11-21T13:23:33-05:00

Mounce’s Basics of Biblical Greek Grammar is a very popular modern textbook, with over 400,000 copies sold and now in its fourth edition. There’s a lot one could quibble with around the usual suspects of deponency, aspect, or the general grammar-translation approach but it’s particularly odd when basic (and, as far as I know, uncontroversial) terminology is misused or misunderstood. I’m talking in particular about the way “ablaut” is discussed.

Here’s one of his Eight Noun Rules on page 422 (BBG 4th Edition):

There are several problems here:

“ablaut” is not simply “vowels changing their length”
“contraction” is definitely not a form of ablaut
“compensatory lengthening” is definitely not a form of ablaut either (certainly not in the example given, but see UPDATE)

So What is Ablaut?

More generally, vowel gradation is a grammatical alternation expressed via a vowel change. In English, sing ~ sang ~ sung expresses a contrast in tense-aspect; sing ~ song expresses a contrast in part of speech; foot ~ feet expresses a contrast in number.

Ablaut is a specific type of vowel gradation in Proto-Indo-European and many of its daughter languages. In PIE, it involved alternations between ∅ ~ e ~ o ~ ē ~ ō that were related, in part, to accentuation (∅ stands for the absence of the vowel, also called the 'zero grade').

Alternations in English like sing ~ sang ~ sung and sing ~ song can be traced to PIE ablaut. However, alternations like foot ~ feet (or man ~ men) are a different type of vowel gradation caused historically in Germanic languages by an -i that later dropped. This is called umlaut.

Because Greek had ablaut but not umlaut, the term “vowel gradation” is often used synonymously with “ablaut” when talking entirely about Greek. But technically ablaut is just one type of vowel gradation.

Ablaut is behind Greek alternations like λέγω ~ λόγος, πατήρ ~ πατέρα ~ πατρός, the theme vowel changes in -ο-μεν ~ -ε-τε, and the stem change in λείπ-ω ~ ἔ-λιπ-ον ~ λέ-λοιπ-α.

Contraction and compensatory lengthening are unrelated processes in Greek. They do involve vowel change, but not as part of a grammatical alternation.

Does Mounce Say Anything More About Ablaut?

On page 58 he again gives λογο + ι ➝ λογῷ as his example, describing “ablaut” as the “technical term” for the vowel length change. He further elaborates, saying that vowels can shorten (ω ➝ ο) or lengthen (ο ➝ ω) or disappear entirely. This explanation is somewhat okay, although λογῷ is a bad example. Then in a footnote he talks about compensatory lengthening and says “this is a form of ablaut”. This is incorrect and misleading, because compensatory lengthening is not a form of ablaut or even vowel gradation. Ablaut (and vowel gradation) requires an alternation where the vowel difference signals something different grammatically.

On page 132 he describes the vocative singular as the bare stem “sometimes with the stem vowel being changed (ablaut)“. So he gets it right there even though the obvious example of second declension masculine vocatives in -ε (alternating with the ο in the nominative) contradicts his definition of ablaut as “vowels changing their length”.

On page 216, he says that “liquid roots (λ, μ, ν, ρ) are generally used without modification (except for ablaut)“. This is also a valid use of the term ablaut, but note that ablaut between verb stems is not just limited to liquid roots (I gave the example λείπ-ω ~ ἔ-λιπ-ον ~ λέ-λοιπ-α above).

Does This Actually Matter?

In terms of language acquisition, hundreds of millions of English speakers get by fine without knowing that sing ~ sang ~ sung is ablaut. Similarly it’s perfectly possible to learn Greek without even being consciously aware of all the alternations much less putting a label on them.

In this sense, students using BBG are hardly hurt by Mounce’s incorrect (or at the very least misleading) use of the term “ablaut”.

That said, Mounce's textbook is a grammar and is used as students' first introduction to Greek grammar in particular. Students should not be introduced to technical terminology in ways that are incorrect, and will require unlearning later on. Ablaut is neither a recent term (Jacob Grimm coined it 200 years ago and the concept was known to the Sanskrit grammarians two millennia ago) nor is it a contested term.

Thanks to Seumas Macdonald for his feedback on a draft of this post.

UPDATE: "I thought that πατήρ ~ πατέρα was compensatory lengthening from loss of sigma in nominative singular". Some Greek grammars explain it this way and it's potentially half-right. Many Indo-Europeanists (but not all) think that at an earlier stage, PIE *ph₂-tḗr (from where πατήρ came) was probably *ph₂-tér-s, thus keeping the nominative ending consistently -s at that earlier stage. This change *-VRs > *-VːR (where V is a vowel, Vː is a long vowel, and R is a resonant like r or n) is known as Szemerényi's law. While this is broadly a type of compensatory lengthening it has no relationship to various compensatory lengthening processes in Greek itself and the three way alternation *ph₂-tḗr ~ *ph₂-tér-m̥ ~ *ph₂-tr-és already existed in PIE before Greek.

Thus grammars that describe a three-way vowel grade alternation are correctly describing the situation in Greek. Those grammars that mention the possible earlier change in Indo-European should probably make it clear this is not what is normally thought of as compensatory lengthening internal to Greek. And certainly none of this changes anything I've said about Mounce's account.

Dictionary Markup versus Lexical Modelling

2019-11-15T07:49:38-05:00

This year I've been thinking about (and working on) the representation of lexical information quite a bit.

This is nothing new, but recently, thoughts and activity have been motivated on multiple fronts including:

work earlier this year starting to extract Homeric Greek information from Cunliffe
a new project to digitise Tolkien's A Middle-English Vocabulary
work collaborating on the GreekWordNet
contributions to the Ontology-Lexica Community Group on modelling morphology (see a recent joint paper Challenges for the Representation of Morphology in Ontology Lexicons from eLex 2019)
gathering lexical information for the Greek Texts project.

Plus my long-term vision of a comprehensive, machine actionable description of Greek morphology.

One important distinction comes up, though, which I've ranted about a number of times on Twitter.

I guess Abbott-Smith for GNT too. I've long been interested in the relationship (and mismatches) between printed dictionaries and the modelling of the information therein (especially beyond just the glosses/definitions; e.g. morphology, etymology)
— James Tauber (@jtauber) August 16, 2019

In short, marking up a print dictionary is not the same as real modelling of lexical information.

Obviously the intention of a print dictionary is to convey that information but it does so in a form ultimately only appropriate for human interpretation from the page (in print or an online facsimile of some sort). It's not really machine actionable.

Now wait, you might think. All we need to do is mark up the dictionary electronically using some format like TEI.

Just to randomly pick an early entry in the conversion of Abbott-Smith to TEI:

<entry lemma="ἀγαθοποιός" strong="G17">
  <note type="occurrencesNT">1</note>
  <form>**† <orth>ἀγαθοποιός</orth>, <foreign xml:lang="grc">-όν</foreign> = cl. <foreign xml:lang="grc">ἀγαθουργός</foreign>,</form>
  <etym>
    <seg type="septuagint">[in LXX, of a woman who deals pleasantly in order to corrupt, <ref osisRef="Sir.42.14">Si 42:14</ref>*;]</seg>
  </etym>
  <sense><gloss>doing well</gloss>, <gloss>acting rightly</gloss> (Plut.): <ref osisRef="1Pet.2.14">I Pe 2:14</ref> (Cremer, 8; MM, <emph>VGT</emph>, s.v.).†</sense>
</entry>

There's some extractable information here, like the Strong's number, a clear lemma, some biblical references, the number of occurrences in the NT and some glosses. But some of the information goes unanalysed and is merely presented as in the print dictionary. Things like the initial **† in the entry are left unexplicated. The -όν termination indicating the inflectional class (and indirectly the part of speech) is merely marked up as a Greek word. The LXX reference is treated as etymology and yet the classical equivalent has to be decoded from the "= cl." notation.
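To make the "extractable" part concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree on the entry quoted above. The function name and the dictionary keys are my own choices for illustration, not part of the Abbott-Smith conversion. Notice that the Greek in the foreign elements comes out as bare strings: nothing in the markup says the first is an inflectional ending and the second a classical equivalent.

```python
import xml.etree.ElementTree as ET

# The entry quoted above, embedded so the sketch is self-contained.
ENTRY = """<entry lemma="ἀγαθοποιός" strong="G17">
  <note type="occurrencesNT">1</note>
  <form>**† <orth>ἀγαθοποιός</orth>, <foreign xml:lang="grc">-όν</foreign> = cl. <foreign xml:lang="grc">ἀγαθουργός</foreign>,</form>
  <etym>
    <seg type="septuagint">[in LXX, of a woman who deals pleasantly in order to corrupt, <ref osisRef="Sir.42.14">Si 42:14</ref>*;]</seg>
  </etym>
  <sense><gloss>doing well</gloss>, <gloss>acting rightly</gloss> (Plut.): <ref osisRef="1Pet.2.14">I Pe 2:14</ref> (Cremer, 8; MM, <emph>VGT</emph>, s.v.).†</sense>
</entry>"""


def extract(entry_xml):
    """Pull out only what the markup makes explicit (key names are hypothetical)."""
    entry = ET.fromstring(entry_xml)
    return {
        "lemma": entry.get("lemma"),
        "strong": entry.get("strong"),
        "nt_occurrences": int(entry.find("note[@type='occurrencesNT']").text),
        "glosses": [gloss.text for gloss in entry.iter("gloss")],
        "references": [ref.get("osisRef") for ref in entry.iter("ref")],
        # Undifferentiated Greek strings: an ending and a classical equivalent,
        # but the markup does not tell us which is which.
        "foreign": [foreign.text for foreign in entry.iter("foreign")],
    }


info = extract(ENTRY)
print(info)
```

Anything beyond this (the meaning of **†, the role of -όν, the "= cl." relation) would need rules written against the typography rather than the markup.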

This is not to pick specifically on the Abbott-Smith conversion. Marked-up versions of Cunliffe, LSJ, even a version I had of the Barclay Newman glossary in BetaCode in the mid-90s, are all primarily attempting to convey the typography of the printed work, sprinkling a little bit of descriptive rather than purely presentational markup over the original content (so you could at least use a stylesheet to decide to make headwords bold or something, rather than doing it inline).

It would still take a lot of work to extract morphological or etymological information from this type of markup.

A very different kind of approach is to focus on actually modelling the lexical information and only then worrying about mapping it to some visual presentation.
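As a rough illustration of what "model first, present later" could look like, here is a small Python sketch using the same Abbott-Smith entry. All class and field names are hypothetical, the values for part of speech and inflection class are my own reading of the -όν termination, and the render function is just one possible presentation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    glosses: List[str]
    references: List[str] = field(default_factory=list)


@dataclass
class LexicalEntry:
    # All field names are hypothetical, chosen just for illustration.
    lemma: str
    part_of_speech: str
    inflection_class: str
    classical_equivalent: Optional[str] = None
    nt_occurrences: Optional[int] = None
    senses: List[Sense] = field(default_factory=list)


def render(entry):
    """One possible print-style presentation generated from the data."""
    senses = "; ".join(
        ", ".join(s.glosses)
        + (" (" + ", ".join(s.references) + ")" if s.references else "")
        for s in entry.senses
    )
    head = f"{entry.lemma} [{entry.part_of_speech}]"
    if entry.classical_equivalent:
        head += f" = cl. {entry.classical_equivalent}"
    return f"{head}: {senses}"


entry = LexicalEntry(
    lemma="ἀγαθοποιός",
    part_of_speech="adjective",
    inflection_class="two-termination (-ός, -όν)",
    classical_equivalent="ἀγαθουργός",
    nt_occurrences=1,
    senses=[Sense(glosses=["doing well", "acting rightly"], references=["1Pet.2.14"])],
)
print(render(entry))
```

The point of the design is that the typography (brackets, "= cl.", punctuation) lives in render and could be swapped out without touching the data.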

TEI somewhat recognises the distinction and actually offers a variety of approaches to dictionary markup, one that is focused mostly on the markup for display (like in a printed book) and one that is more focused on the underlying data (although there are other formats for that too).

Of course if you're marking up an existing print dictionary you're pretty much doing the former. A more abstract modelling of the lexical information in Abbott-Smith (or LSJ, or Cunliffe, or Tolkien) is no longer a marked up version of that dictionary.

It's a challenging problem for sure (and I'm certainly experiencing it firsthand on the Tolkien project—automatic extraction of even just the etymological information from the Middle English vocabulary entries has required tens of regular expressions).

What we really need moving forward is more focus on the underlying lexical modelling and not assuming that marking up Abbott-Smith or the LSJ better is the solution. And it's not like there isn't a ton of work going on in good electronic representation of lexical information for modern languages.

I ranted a little bit about the more general issue in my BibleTech 2015 talk on Biblical Greek Databases where I talked about how many reference books should be thought of (and indeed produced) more as "UI on top of databases". In other words, you focus on the data and then at some point have a largely automated process for generating printed works from that data. One of my examples in that talk was "Readers Editions" of texts with glossaries on each page. But I think it applies even more to dictionaries.

So I think it's important that we recognise the distinction between:

the presentation of lexical information in a print dictionary (or online equivalent)
the descriptive markup of those dictionaries
the underlying linguistic information

and recognise that collaboration on and exchange of the last of those three is ultimately the most valuable.

UPDATE: Here's an entry from the upcoming Cambridge lexicon. It's about as close as you could get to machine-actionable descriptive markup that is still pretty much following the structure of a print lexicon:

<AE>
  <HG>
    <HL>ξανθό<hyph/>θριξ</HL>
    <Infl>τριχος</Infl>
    <PS>masc.fem.adj</PS>
    <Ety>
      <Ref>θρίξ</Ref>
    </Ety>
  </HG>
  <aS1>
    <Tr>fair<or/>golden-haired</Tr>
    <Au> Sol. Theoc.</Au>
  </aS1>
  <aS1>
    <Indic>of a horse</Indic>
    <Def>with light-coloured hair or mane</Def>
    <Tr>golden</Tr>
    <Au>B.</Au>
  </aS1>
</AE>

Notice that this XML doesn't include the square brackets that will go around the etymology in the print version. They are treated entirely as presentation. The disjunction between the glosses 'fair' and 'golden-haired' is markup, not content. Even whitespace and punctuation have to largely come from the stylesheet. Definitions are distinguished from translation glosses and also applications like 'of a horse'.

This still doesn't go as far as I'm talking about in terms of truly modelling the underlying linguistic information but it's a lot easier to do things with this computationally than just a markup of an existing print dictionary would be.
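Here is a minimal sketch of why this kind of markup is easier to work with, using only Python's standard library on the entry above. The joiner strings (how to render the or and hyph elements) and the dictionary keys are my own guesses for illustration, not the lexicon's actual stylesheet.

```python
import xml.etree.ElementTree as ET

ENTRY = """<AE>
  <HG>
    <HL>ξανθό<hyph/>θριξ</HL>
    <Infl>τριχος</Infl>
    <PS>masc.fem.adj</PS>
    <Ety>
      <Ref>θρίξ</Ref>
    </Ety>
  </HG>
  <aS1>
    <Tr>fair<or/>golden-haired</Tr>
    <Au> Sol. Theoc.</Au>
  </aS1>
  <aS1>
    <Indic>of a horse</Indic>
    <Def>with light-coloured hair or mane</Def>
    <Tr>golden</Tr>
    <Au>B.</Au>
  </aS1>
</AE>"""

# Presentation decisions live here, not in the data (these strings are guesses).
JOINERS = {"or": " or ", "hyph": ""}


def flatten(elem):
    """Flatten mixed content, substituting a presentational string for each empty child."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(JOINERS.get(child.tag, ""))
        parts.append(child.tail or "")
    return "".join(parts).strip()


def read_entry(entry_xml):
    ae = ET.fromstring(entry_xml)
    head_group = ae.find("HG")
    return {
        "headword": flatten(head_group.find("HL")),
        "part_of_speech": head_group.find("PS").text,
        # The square brackets are added at presentation time.
        "etymology_display": "[" + head_group.find("Ety/Ref").text + "]",
        "senses": [
            {child.tag: flatten(child) for child in sense}
            for sense in ae.findall("aS1")
        ],
    }


parsed = read_entry(ENTRY)
print(parsed["senses"])
```

Each sense comes back as a small dictionary keyed by Tr, Def, Indic and Au, which is already much closer to data than to typography.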

Release of text-validator 0.1

2019-11-11T22:37:40-05:00

I've released a first version of a pluggable command-line tool for validating the formatting and orthography of text files.

Various text projects like the apostolic-fathers have sometimes included little custom scripts I've written to validate the files. Is the Unicode normalised? Are there stray characters or bad line endings? Are references in a valid format?
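To give a flavour of the kind of little custom script I mean, here is a minimal standalone sketch of a few such checks. It uses only the standard library and is not the text-validator plugin API; the function name and messages are made up for illustration.

```python
import unicodedata


def check_lines(text):
    """Return (line_number, message) pairs for a few common formatting problems."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if line.endswith("\r"):
            problems.append((number, "CRLF line ending"))
            line = line[:-1]
        if "\t" in line:
            problems.append((number, "tab character"))
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
        if unicodedata.normalize("NFC", line) != line:
            problems.append((number, "not NFC-normalised"))
    return problems


# "e" followed by a combining acute is not in NFC form.
sample = "abc \nde\u0301f\nok"
for number, message in check_lines(sample):
    print(f"line {number}: {message}")
```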

I had also started including some Greek-specific tests in the greek-normalisation library.

But starting the greek-texts project, I decided it would be nice to have a generic framework for writing text file validators that could be used for all sorts of projects and files.

The result is text-validator. Think of it like a code linter but for your text files.

Each validator is its own Python module and, while a few basic tests are included in the library, the idea is that third parties can write their own validators and make them installable Python packages for others to use.

You install text-validator with

pip install text-validator

as well as installing any third-party plugins you want to use.

You then configure your validator plugins with a TOML file like:

["text_validator.plugins.whitespace"]
CHECK_CRLF = true
CHECK_TABS = true
CHECK_TRAILING_WHITESPACE = true
CHECK_NO_EOF_NEWLINE = true

and run the command validate-text to run your suite of configured plugins on the files in your text project.

The GitHub repo is https://github.com/jtauber/text-validator and there you can also read more about How to Write a Plugin and look at the existing plugins in the Plugin Directory.

Create issues in the GitHub repository if you have particular validators you'd like to see or would like to contribute.

I haven't tried it yet but I'd like to try hooking text-validator up as a test that gets run on commits and pull requests on GitHub as part of a CI process.

Off to the UCLA Indo-European Conference Again

2019-11-07T03:51:20-05:00

Today I'm heading off to Los Angeles to attend the Thirty-First Annual UCLA Indo-European Conference.

I went two years ago and you may remember my initial nervousness as a first-timer. But everyone was so nice and I got a lot out of it so I'm headed back (plus this time I'll know more people).

Subcorpus Vocabulary Statistics

2019-11-05T18:03:44-05:00

Long-time readers of this blog know that, along with morphology, a core research area of mine is vocabulary. Prompted by Seumas Macdonald and now as part of the Greek Texts Project, I started putting together some vocabulary coverage statistics for various subcorpora of Greek prose.

I've been publishing vocabulary coverage statistics for the Greek New Testament at least since 1996 (see GNT Verse Coverage Statistics and the more recent (and aptly named) Updated Vocabulary Coverage Statistics).

Back in The Core Vocabulary of New Testament Greek, I looked at Wilfred Major's 50% and 80% lists for Classical Greek and constructed the equivalents for the Greek New Testament.

For a while, on and off, I've been working on reconciling Major's list with the DCC Greek Core Vocabulary, my own GNT work, Helma Dik's amazing work on Logeion, and other word lists based on frequency in some subcorpus of Ancient Greek. Back in early 2018, I also started https://vocab.perseus.org, primarily to serve up passage-specific vocabulary lists for https://scaife.perseus.org but also to enable exploration of vocabulary frequency in the Perseus Digital Library / Open Greek and Latin corpus.

As part of that last work, I put together an initial "core reading list" subcorpus based on works from reading lists from Harvard, Yale, and Tufts. My eventual goal was to allow the creation of custom reading lists and generate vocabulary for those. The data behind this was all based on an experimental lemmatisation done by Giuseppe Celano along with the "short defs" from Perseus via Logeion.

I'd been slowly getting back to my new vocabulary-tools code library for generating these kinds of stats for any lemmatised text when Seumas Macdonald asked about vocabulary in Plato, Lysias, and Xenophon—typical post-beginner prose.

I took the opportunity to generalise some more of my code (although I haven't yet added it back to vocabulary-tools).

Plato + Lysias + Xenophon, as lemmatised by Celano, is 745,213 tokens with 13,274 lemmas, 3,457 of which are hapaxes within the subcorpus.
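For anyone wanting to produce the same three numbers for their own lemmatised text, here is a minimal sketch. It assumes the corpus is simply a list with one lemma per token; the function name and the toy data are made up for illustration.

```python
from collections import Counter


def corpus_summary(lemmas):
    """Count tokens, distinct lemmas, and lemmas occurring exactly once."""
    counts = Counter(lemmas)
    return {
        "tokens": sum(counts.values()),
        "lemmas": len(counts),
        "once_only": sum(1 for freq in counts.values() if freq == 1),
    }


# Toy input: one lemma per token, in text order.
toy = ["kai", "ho", "lego", "kai", "ho", "agathos", "kai"]
print(corpus_summary(toy))
```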

Besides the actual list, what was of interest to both me and Seumas was:

how many lemmas are needed for coverage points such as 80% or 98%
what coverage particular numbers of lemmas get you to (in frequency order)

Now I've talked at length here and in conference talks about the limitations of:

just going by overall token coverage not coverage of larger units like verses or sentences or paragraphs
just going by lemmas and not considering morphology, syntactic constructions, etc

but this is still useful and interesting data.

Plato + Lysias + Xenophon

Perseus Plato + Lysias + Xenophon subcorpus:

The 50% point is reached at   48 lemmas (2454 occurrences at that point)
The 80% point is reached at  439 lemmas ( 181 occurrences at that point)
The 90% point is reached at 1242 lemmas (  50 occurrences at that point)
The 95% point is reached at 2519 lemmas (  16 occurrences at that point)
The 98% point is reached at 5003 lemmas (   5 occurrences at that point)
---
The 81.40% point is reached at  500 lemmas (159 occurrences at that point)
The 88.15% point is reached at 1000 lemmas ( 66 occurrences at that point)
The 93.59% point is reached at 2000 lemmas ( 25 occurrences at that point)
The 97.21% point is reached at 4000 lemmas (  7 occurrences at that point)
The 99.19% point is reached at 8000 lemmas (  2 occurrences at that point)

Just to quickly unpack that: the first line says that you can account for 50% of the tokens in the subcorpus just with the top 48 lemmas (by frequency). Furthermore, those 48 lemmas all occur at least 2,454 times each in the subcorpus.

Similarly, the second-to-last line says that the top 4,000 lemmas by frequency all occur at least 7 times in the subcorpus and account for 97.21% of tokens.
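The two kinds of lines in these tables can be sketched in a few lines of Python. This is a minimal illustration of the calculation as I've described it, not the actual code that produced the tables; the function names and the toy counts are made up.

```python
from collections import Counter


def coverage_point(counts, target):
    """Lemmas needed (in frequency order) to reach the target share of tokens.

    Returns (number of lemmas, frequency of the last lemma needed).
    """
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running / total >= target:
            return rank, freq
    return len(counts), 0


def coverage_at(counts, n):
    """Share of tokens covered by the top n lemmas, and the frequency of the nth."""
    total = sum(counts.values())
    top = counts.most_common(n)
    return sum(freq for _, freq in top) / total, top[-1][1]


# Toy lemma counts standing in for a real lemmatised corpus.
counts = Counter({"a": 6, "b": 2, "c": 1, "d": 1})
lemmas_needed, freq_there = coverage_point(counts, 0.75)
print(f"The 75% point is reached at {lemmas_needed} lemmas ({freq_there} occurrences at that point)")
share, freq_at_n = coverage_at(counts, 2)
print(f"The {share:.2%} point is reached at 2 lemmas ({freq_at_n} occurrences at that point)")
```

The first function gives the lines above the --- in each table and the second gives the lines below it.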

Plato

Just looking at Plato (with some extra lemma count breakpoints):

The 50% point is reached at   43 lemmas (1365 occurrences at that point)
The 80% point is reached at  321 lemmas ( 120 occurrences at that point)
The 90% point is reached at  893 lemmas (  32 occurrences at that point)
The 95% point is reached at 1840 lemmas (  10 occurrences at that point)
The 98% point is reached at 3631 lemmas (   3 occurrences at that point)
---
The 84.74% point is reached at  500 lemmas (66 occurrences at that point)
The 90.91% point is reached at 1000 lemmas (27 occurrences at that point)
The 95.46% point is reached at 2000 lemmas ( 9 occurrences at that point)
The 97.34% point is reached at 3000 lemmas ( 4 occurrences at that point)
The 98.32% point is reached at 4000 lemmas ( 3 occurrences at that point)
The 98.93% point is reached at 5000 lemmas ( 2 occurrences at that point)

Plato Selection

With just a selection of Plato (Euthyphro, Apology, Crito, Symposium, Republic):

The 50% point is reached at   43 lemmas (529 occurrences at that point)
The 80% point is reached at  335 lemmas ( 45 occurrences at that point)
The 90% point is reached at  908 lemmas ( 13 occurrences at that point)
The 95% point is reached at 1745 lemmas (  5 occurrences at that point)
The 98% point is reached at 3160 lemmas (  2 occurrences at that point)
---
The 84.31% point is reached at  500 lemmas (28 occurrences at that point)
The 90.85% point is reached at 1000 lemmas (11 occurrences at that point)
The 95.81% point is reached at 2000 lemmas ( 4 occurrences at that point)
The 97.76% point is reached at 3000 lemmas ( 2 occurrences at that point)
The 98.76% point is reached at 4000 lemmas ( 1 occurrences at that point)
The 99.51% point is reached at 5000 lemmas ( 1 occurrences at that point)

New Testament

It's interesting to compare that to MorphGNT given it’s the same size:

The 50% point is reached at   27 lemmas (662 occurrences at that point)
The 80% point is reached at  316 lemmas ( 48 occurrences at that point)
The 90% point is reached at  890 lemmas ( 13 occurrences at that point)
The 95% point is reached at 1753 lemmas (  5 occurrences at that point)
The 98% point is reached at 3103 lemmas (  2 occurrences at that point)
---
The 84.84% point is reached at  500 lemmas (27 occurrences at that point)
The 90.95% point is reached at 1000 lemmas (11 occurrences at that point)
The 95.80% point is reached at 2000 lemmas ( 4 occurrences at that point)
The 97.85% point is reached at 3000 lemmas ( 2 occurrences at that point)
The 98.94% point is reached at 4000 lemmas ( 1 occurrences at that point)
The 99.66% point is reached at 5000 lemmas ( 1 occurrences at that point)

although note these figures are based on my own more-curated lemmatisation of the New Testament, not Celano's data which may have systematic differences that make this comparison slightly problematic.

Core Reading List

Here’s the “core reading list” with more 1000-markers:

The 50% point is reached at    79 lemmas (2019 occurrences at that point)
The 80% point is reached at  1107 lemmas ( 134 occurrences at that point)
The 90% point is reached at  3020 lemmas (  39 occurrences at that point)
The 95% point is reached at  5948 lemmas (  14 occurrences at that point)
The 98% point is reached at 10920 lemmas (   5 occurrences at that point)
---
The 71.19% point is reached at   500 lemmas (310 occurrences at that point)
The 78.92% point is reached at  1000 lemmas (149 occurrences at that point)
The 86.17% point is reached at  2000 lemmas ( 69 occurrences at that point)
The 89.94% point is reached at  3000 lemmas ( 40 occurrences at that point)
The 92.28% point is reached at  4000 lemmas ( 26 occurrences at that point)
The 93.88% point is reached at  5000 lemmas ( 19 occurrences at that point)
The 95.05% point is reached at  6000 lemmas ( 14 occurrences at that point)
The 95.95% point is reached at  7000 lemmas ( 11 occurrences at that point)
The 96.65% point is reached at  8000 lemmas (  9 occurrences at that point)
The 97.20% point is reached at  9000 lemmas (  7 occurrences at that point)
The 97.66% point is reached at 10000 lemmas (  6 occurrences at that point)
The 98.33% point is reached at 12000 lemmas (  4 occurrences at that point)
The 98.98% point is reached at 15000 lemmas (  2 occurrences at that point)
The 99.57% point is reached at 20000 lemmas (  1 occurrences at that point)

Full Perseus

And finally, here’s the full Perseus / OGL (as of two years ago with the Celano lemmatisation):

The 50% point is reached at   42 lemmas (64483 occurrences at that point)
The 80% point is reached at  648 lemmas ( 3270 occurrences at that point)
The 90% point is reached at 1951 lemmas (  855 occurrences at that point)
The 95% point is reached at 4052 lemmas (  298 occurrences at that point)
The 98% point is reached at 8004 lemmas (   87 occurrences at that point)
---
The 77.44% point is reached at   500 lemmas (4263 occurrences at that point)
The 84.20% point is reached at  1000 lemmas (2018 occurrences at that point)
The 90.19% point is reached at  2000 lemmas ( 825 occurrences at that point)
The 93.14% point is reached at  3000 lemmas ( 480 occurrences at that point)
The 94.93% point is reached at  4000 lemmas ( 303 occurrences at that point)
The 96.10% point is reached at  5000 lemmas ( 208 occurrences at that point)
The 96.92% point is reached at  6000 lemmas ( 151 occurrences at that point)
The 97.53% point is reached at  7000 lemmas ( 114 occurrences at that point)
The 98.00% point is reached at  8000 lemmas (  87 occurrences at that point)
The 98.36% point is reached at  9000 lemmas (  69 occurrences at that point)
The 98.65% point is reached at 10000 lemmas (  55 occurrences at that point)
The 99.06% point is reached at 12000 lemmas (  36 occurrences at that point)
The 99.44% point is reached at 15000 lemmas (  20 occurrences at that point)
The 99.75% point is reached at 20000 lemmas (   8 occurrences at that point)

It is interesting how much quicker the 50-80-90-95-98 points are hit with the full corpus over the core reading list. Normally a larger corpus would take longer but I think it’s indicative of the fact that the “core reading” has a richer vocabulary per token than a larger sample (an interesting study in itself of any subcorpus).

Next Steps

Since calculating all this, Seumas and I have been working on a different prose subcorpus for post-beginner learners that combines the Plato selection, the New Testament, other orators in addition to Lysias and other history in addition to Xenophon. I'll talk about that work in some future posts (and hopefully Seumas will too!)

I also want to talk about how the different subcorpora differ in what lemmas they have. How much of the Plato 80% is in the New Testament 80%, for example (and vice versa)? There's also the question of lexical dispersion. There's value in separating function words from content words, and grouping lemmas into word families. Lots more coming.
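The overlap question, at least, is easy to sketch once you have lemma counts for each subcorpus. This is a minimal illustration with toy data; the function names are made up and this is not code from vocabulary-tools.

```python
from collections import Counter


def lemmas_for_coverage(counts, target):
    """The set of most frequent lemmas needed to reach the target share of tokens."""
    total = sum(counts.values())
    running = 0
    needed = set()
    for lemma, freq in counts.most_common():
        needed.add(lemma)
        running += freq
        if running / total >= target:
            break
    return needed


def share_in(a, b):
    """What fraction of set a is also in set b."""
    return len(a & b) / len(a)


# Toy counts standing in for two lemmatised subcorpora.
plato = Counter({"a": 5, "b": 3, "c": 1, "d": 1})
gnt = Counter({"a": 4, "x": 4, "b": 1, "y": 1})

plato_core = lemmas_for_coverage(plato, 0.75)
gnt_core = lemmas_for_coverage(gnt, 0.75)
print(share_in(plato_core, gnt_core), share_in(gnt_core, plato_core))
```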

The code for much of this will be added to vocabulary-tools when I get a chance but if people are interested in other subcorpora before then, please get in contact with me.

UPDATE (2019-11-06): Now see Seumas's post Sore Thumbs in Subcorpus vocabulary looking at particular words that differ in frequency between the New Testament and the larger Classical Greek prose subcorpus we've been working with.

New Blog Platform

2019-11-03T16:55:52-05:00

Following on from my success with it on the Digital Tolkien Project website, I decided to switch to using Jekyll for the generation of jktauber.com as a static site.

I wanted to switch to static site generation partly for ease of hosting but mostly to make it much easier to author content locally and manage revisions on GitHub. My choice of Jekyll as a platform was a combination of ease-of-use and ability to trivially host on GitHub pages.

There may be a couple of issues with my migration, so let me know if you see anything funny, especially with the formatting of posts. Note also that I haven't brought back full-text search or the Labs yet.

But the change already makes it easier for me to write posts as well as organise existing posts. I've already started adding tags on a handful of posts to make it easier for you to get to all the posts on a particular topic. It's now possible, for example, to easily get a list of my morphology tour posts.

I also made a couple of style tweaks but nothing major.

I'm now back to posting a lot more regularly with lots of good vocabulary and morphology stuff coming up!

Greek Texts Project

2019-11-02

A twitter conversation led to the creation of a new project to work on annotated Greek texts for language learners.

As readers of this blog know, I'm working with Seumas Macdonald on the Apostolic Fathers and had done some earlier work on Epictetus's Enchiridion and with Nathan Smith on the Septuagint. There'd been a couple of conversations earlier in October on Twitter with people wanting to make progress on the Septuagint and I'd also been keen to get back to supporting Seumas's Linguae Graecae Per Se Illustrata project.

So Fletcher Hardison's tweet triggered the idea to maybe get everyone involved in these various projects talking to each other:

We should definitely coordinate file formats for things like so we can build common tools /cc @jeltzz @sleeptillseven @_ndsmith. I'd love to consolidate AF, LXX, Epictetus, LGPSI, MorphGNT, κτλ.
— James Tauber (@jtauber) October 11, 2019

I started a Slack workspace and a landing page: https://jtauber.github.io/greek-texts/.

The results so far have been wonderful. A great group of people are already working together on various texts and data format conventions. I'm back to working on vocabulary lists (more on that in the next few days) and some new Python libraries (more on that also in the next few days).

We'd love to have you join us! Just email me at jtauber@jtauber.com and I can invite you to the Slack workspace.

Release of greek-normalisation 0.3

2019-11-01

In the last couple of weeks I've done a couple of minor releases of the greek-normalisation Python library which brings together various code I use to clean up Greek texts and normalise the forms.

The 0.2 release (which I neglected to announce) just had a small fix to the breathing_check function to support things like ἀϊ (which failed before because it didn't take into account the diaeresis). Soon I'll blog about a new Python tool I've been building which will provide a framework for doing lots of checks like this.

The 0.3 release now installs two command-line scripts toNFC and toNFD to convert a file to either an NFC or NFD Unicode Normalization Form.

Once installed you can do things like:

toNFC source.txt > nfc_version.txt
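If you're unsure what the two forms actually do to Greek text, here is a minimal sketch using Python's standard unicodedata module directly (this shows the normalisation forms themselves, not the internals of the scripts). The code points are written as escapes so the difference is visible.

```python
import unicodedata

# tau, eta with tonos (a single precomposed code point), nu
composed = "\u03c4\u03ae\u03bd"

nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", nfd)

print(len(composed), len(nfd))  # NFD splits the accent off as a combining mark
print([f"U+{ord(ch):04X}" for ch in nfd])
print(nfc == composed)  # NFC puts it back together
```

The two strings look identical on screen, which is exactly why it's worth having a tool that checks.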

The repository is https://github.com/jtauber/greek-normalisation and it's pip-installable as greek-normalisation.

See my previous post The Normalisation Column in MorphGNT for the original work this code came from.

Summer Conferences

2019-07-06

Here are the conferences I'm attending (and in some cases, presenting at) in June through August. I probably should have posted this at the start of my conference travel, but here it is.

First LiLa Workshop: Linguistic Resources & NLP Tools for Latin

I'm excited about the LiLa project, which is about a Linguistic Linked Open Data (LLOD) approach to Latin resources. Because I'm interested in LLOD for Ancient Greek, I was keen to attend the first workshop to get ideas, but then I got asked to speak about Scaife anyway.

Quantitative Approaches to Versification

This was a conference about computational analysis of poetry (especially meter). I had done some work with Sophia Sklaviadis on the relationship between repeating n-grams and metrical position in Homer and presented a paper on it at this conference. Not normally my area but I have some more ideas to pursue that I might write about here at some point.

Vocab@LEUVEN

When I went to the American Association for Applied Linguistics annual meeting last year, I mostly attended the track on vocabulary research. Regular readers of this blog know that, along with morphology, it's my main research area. Well, the Vocab@ conferences are 100% vocabulary research. I did actually submit a paper to this conference that got rejected but I'll be presenting it as a poster at EuroCALL (see below).

Digital Humanities Conference 2019

The big DH conference of the year. Will be my first time attending and I'm sure it will be overwhelming. I'm presenting as part of a panel on Confronting the Complexity of Babel in a Global and Digital Age and I'll specifically be talking about online reading environments to scaffold understanding of texts in historical languages.

After this I'm briefly heading back to Boston for a couple of weeks. Then two Tolkien-related conferences:

Omentielva

This is the International Conference on J.R.R. Tolkien’s Invented Languages. Not speaking (Elvish or otherwise) just attending.

Tolkien 2019

Giving a talk on Tolkien and Digital Philology, basically how we might treat Tolkien's works as the objects of philological study and use the same digital methods one might for, say, an Ancient Greek text. The talk will culminate in me outlining my vision for the Digital Tolkien project.

And finally:

EUROCALL 2019

This is the major European conference for Computer-Aided Language Learning. I'm presenting a poster on what is possibly the longest running topic of this blog: the sequencing of vocabulary learning from texts. There'll be lots more blog posts here on that in the future!

Release of greek-normalisation 0.1

2019-07-06

For years I’ve had Python code for normalising Greek forms, checking for stray characters, etc. I finally got around to consolidating them in a library.

It has a few little utilities like:

>>> strip_last_accent_if_two('γυναῖκά')
'γυναῖκα'

>>> grave_to_acute('τὴν')
'τήν'

>>> breathing_check('ἀι')
False

but the core of it is the normalisation of tokens with knowledge of clitics and elision.

>>> normalise('τὴν')
('τήν', ['grave'])

>>> normalise('γυναῖκά')
('γυναῖκα', ['extra'])

>>> normalise('σου')
('σου', ['enclitic'])

>>> normalise('Τὴν')
('τήν', ['grave', 'capitalisation'])

>>> normalise('ὁ')
('ὁ', ['proclitic'])

>>> normalise('μετ’')
('μετά', ['elision'])

>>> normalise('οὐκ')
('οὐ', ['movable', 'proclitic'])
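To give a sense of how small some of these utilities can be, here is one way a grave-to-acute conversion might be done with just the standard library. This is a sketch under my own assumptions (decompose, swap the combining mark, recompose) and not necessarily how the library implements grave_to_acute; the function name is deliberately different.

```python
import unicodedata

COMBINING_GRAVE = "\u0300"
COMBINING_ACUTE = "\u0301"


def grave_to_acute_sketch(word):
    """Replace any grave accent with an acute by working on the decomposed form."""
    decomposed = unicodedata.normalize("NFD", word)
    swapped = decomposed.replace(COMBINING_GRAVE, COMBINING_ACUTE)
    return unicodedata.normalize("NFC", swapped)


# tau, eta with varia (grave), nu
print(grave_to_acute_sketch("\u03c4\u1f74\u03bd"))
```

Working on the NFD form means one replacement handles every vowel, with or without breathings, rather than needing a table of precomposed characters.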

See my previous post The Normalisation Column in MorphGNT for the original work this code came from.

There are also some regular expressions that I've used to check mistakes in things like the Open Apostolic Fathers.

It's just an initial 0.1 release but parts of the code have already been in use for years.

The repository is https://github.com/jtauber/greek-normalisation and it's pip-installable as greek-normalisation.

A Tour of Greek Morphology: Part 28

2019-04-30

Part twenty-eight of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In this post, we look systematically at the imperfect active distinguishers in much the same way as we did the present active distinguishers in Part 13.

Before we summarise all the distinguisher paradigms we've seen so far, there are actually three forms in the SBLGNT not covered yet: εἰσῄει, παρῆσαν, and συνῆσαν (all in Luke/Acts). εἰσῄει is from εἰς+εἶμι (making it a compound of IA-11) and παρῆσαν is παρά+εἰμί (making it a compound of IA-10). In our text, συνῆσαν is from σύν+εἰμί but could be from σύν+εἶμι. Either way, for completeness we need to add IA-10-COMP and IA-11-COMP.

So with those, here are all the imperfect active distinguisher paradigms we've discussed:

	IA-1	IA-2	IA-3	IA-4	IA-5
1SG	Xον	Xουν	Xουν	Xων	Xων
2SG	Xες	Xεις	Xους	Xᾱς	Xης
3SG	Xε(ν)	Xει	Xου	Xᾱ	Xη
1PL	Xομεν	Xοῦμεν	Xοῦμεν	Xῶμεν	Xῶμεν
2PL	Xετε	Xεῖτε	Xοῦτε	Xᾶτε	Xῆτε
3PL	Xον	Xουν	Xουν	Xων	Xων

	IA-6	IA-7	IA-8	IA-9	IA-9b
1SG	Xῡν	Xην/Xειν	Xουν	Xην	Xην
2SG	Xῡς	Xεις	Xους	Xης	Xης/Xησθα
3SG	Xῡ	Xει	Xου	Xη	Xη
1PL	Xυμεν	Xεμεν	Xομεν	Xαμεν	Xαμεν
2PL	Xυτε	Xετε	Xοτε	Xατε	Xατε
3PL	Xυσαν	Xεσαν	Xοσαν	Xασαν	Xασαν

	IA-10	IA-11	IA-10-COMP	IA-11-COMP
1SG	ἦ/ἦν	ᾖα/ᾔειν	Xῆ/Xῆν	Xῇα/Xῄειν
2SG	ἦς/ἦσθα	ᾔεις/ᾔεισθα	Xῆς/Xῆσθα	Xῄεις/Xῄεισθα
3SG	ἦν	ᾔει(ν)	Xῆν	Xῄει(ν)
1PL	ἦμεν	ᾖμεν	Xῆμεν	Xῇμεν
2PL	ἦτε	ᾖτε	Xῆτε	Xῇτε
3PL	ἦσαν	ᾖσαν/ᾔεσαν	Xῆσαν	Xῇσαν/Xῄεσαν

It will be worth taking some future posts to talk about the -σθα ending that crops up in the 2SG as well as some of the more extraordinary forms in IA-10 and IA-11 (along with compounds).

But for now, just capturing the common element in each row (like we did in Part 13):

	IA-1	IA-2
1SG	-ν
2SG	-ς	-ς/-σθα
3SG	-	-(ν)
1PL	-μεν
2PL	-τε
3PL	-ν	-σαν

As with the present active paradigms, some cells across inflectional classes have identical distinguishers and so those cells alone can't identify the inflectional class (and hence all the other forms in that class). In particular:

The 1SG can't distinguish within the set {IA-2, IA-3, IA-8} or within the set {IA-4, IA-5} or within the set {IA-7 (if η), IA-9}
The 2SG and 3SG can't distinguish within the set {IA-2, IA-7} or within the set {IA-3, IA-8} or within the set {IA-5, IA-9}
The 1PL can't distinguish within the set {IA-2, IA-3} or within the set {IA-4, IA-5} or within the set {IA-1, IA-8}
The 2PL can't distinguish within the set {IA-1, IA-7}
The 3PL can't distinguish within the set {IA-2, IA-3} or within the set {IA-4, IA-5}
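These sets can be derived mechanically from the paradigm tables. Here is a minimal sketch for IA-1 through IA-9 (leaving out IA-9b and the irregular classes), with the X stripped and alternatives like ην/ειν treated as separate variants. The data structure and function name are my own, for illustration only.

```python
import unicodedata
from collections import defaultdict

CELLS = ["1SG", "2SG", "3SG", "1PL", "2PL", "3PL"]

# Distinguishers from the tables above, without the leading X.
PARADIGMS = {
    "IA-1": ["ον", "ες", "ε(ν)", "ομεν", "ετε", "ον"],
    "IA-2": ["ουν", "εις", "ει", "οῦμεν", "εῖτε", "ουν"],
    "IA-3": ["ουν", "ους", "ου", "οῦμεν", "οῦτε", "ουν"],
    "IA-4": ["ων", "ᾱς", "ᾱ", "ῶμεν", "ᾶτε", "ων"],
    "IA-5": ["ων", "ης", "η", "ῶμεν", "ῆτε", "ων"],
    "IA-6": ["ῡν", "ῡς", "ῡ", "υμεν", "υτε", "υσαν"],
    "IA-7": ["ην/ειν", "εις", "ει", "εμεν", "ετε", "εσαν"],
    "IA-8": ["ουν", "ους", "ου", "ομεν", "οτε", "οσαν"],
    "IA-9": ["ην", "ης", "η", "αμεν", "ατε", "ασαν"],
}


def ambiguous_sets(paradigms):
    """For each cell, list the sets of classes that share a distinguisher."""
    result = {}
    for index, cell in enumerate(CELLS):
        classes_by_ending = defaultdict(set)
        for name, endings in paradigms.items():
            for variant in endings[index].split("/"):
                key = unicodedata.normalize("NFC", variant)
                classes_by_ending[key].add(name)
        result[cell] = sorted(
            sorted(classes)
            for classes in classes_by_ending.values()
            if len(classes) > 1
        )
    return result


for cell, sets in ambiguous_sets(PARADIGMS).items():
    print(cell, sets)
```

Because accents are compared as written, the 2PL of IA-4 and IA-9 are kept apart here; stripping accents before grouping would merge them.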

The distinctions from IA-7 on up are less important because they are tiny, non-productive classes. Looking at just IA-1 through IA-6:

{IA-2, IA-3} can't be distinguished by 1SG, 1PL, or 3PL but can by 2SG, 3SG, or 2PL.
{IA-4, IA-5} also can't be distinguished by 1SG, 1PL, or 3PL but can by 2SG, 3SG, or 2PL.

So at least for the first six classes, any of 2SG, 3SG, or 2PL uniquely identifies the class (at least within the imperfect active system).

It is interesting then that the 2SG and 3SG are the very cells most likely to cause confusion within the sets {IA-2, IA-7}, {IA-3, IA-8}, and {IA-5, IA-9} and in those cases, it is the 1PL or 3PL that can come to the rescue in identifying the class (although the value of X itself can do that given the tiny size of the IA-7, IA-8 and IA-9 classes).

If we try to group our classes along the lines we did in Part 13, we get a hierarchy very similar to that in the present:

IA-{1, 2, 3, 4, 5}	3PL in -ν; 1SG and 3PL identical
	IA-{2, 3, 4, 5}		long vowels before the endings; circumflexes in the 1PL and 2PL
		IA-{2, 3}		ου in 1SG, 1PL, and 3PL
		IA-{4, 5}		ω in 1SG, 1PL, and 3PL
IA-{6, 7, 8, 9, 9b, 10, 11, 10-COMP, 11-COMP}	3PL in -σαν
	IA-{6, 7, 8, 9}		2SG only in -ς
	IA-{9b, 10, 11, 10-COMP, 11-COMP}		2SG in -ς/-σθα

along with cross-cutting categories such as:

IA-{2, 3, 8}	ουν in 1SG
IA-{2, 7}	ει in 2SG and 3SG
IA-{3, 8}	ου in 1SG, 2SG, and 3SG
IA-{1, 7}	ετε in 2PL

and, ignoring accents:

IA-{4, 9}	ατε in 2PL

But given the closed nature of IA-7 and up, many of these will be easy to disambiguate. We'll go through the details in a future post.

Consolidating Vocabulary Coverage and Ordering Tools

2019-04-20

One of my goals for 2019 is to bring more structure to various disparate Greek projects and, as part of that, I’ve started consolidating multiple one-off projects I’ve done around vocabulary coverage statistics and ordering experiments.

Going back at least 15 years (when I first started blogging about Programmed Vocabulary Learning) I’ve had little Python scripts all over the place to calculate various stats, or try out various approaches to ordering.

I’m bringing all of that together in a single repository and updating the code so:

it’s all in one place
it’s usable as a library in other projects or in things like Jupyter notebooks
it can be extended to arbitrary chunking beyond verses (e.g. books, chapters, sentences, paragraphs, pericopes)
it can be extended to other texts such as the Apostolic Fathers, Homer, etc (other languages too!)

I’m partly spurred on by a desire to explore more of the stuff Seumas Macdonald and I have been talking about and to be more responsive to the occasional inquiries I get from Greek teachers. Also I have a poster Vocabulary Ordering in Text-Driven Historical Language Instruction: Sequencing the Ancient Greek Vocabulary of Homer and the New Testament that got accepted for EUROCALL 2019 in August and this code library helps me not only produce the poster but also make it more reproducible.

Ultimately I hope to write a paper or two out of it as well.

I’ve started the repo at:

https://github.com/jtauber/vocabulary-tools/

where I’ve basically rewritten half of my existing code from elsewhere so far. I’ve reproduced the code for generating core vocabulary lists and also the coverage tables I’ve used in multiple talks (including my BibleTech talks in 2010 and 2015).

I’ve taken the opportunity to generalise and decouple the code (especially with regard to the different chunking systems) and also make use of newer Python stuff like Counter and dictionary comprehensions which simplifies much of my earlier code.
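As a rough idea of the kind of thing Counter and dictionary comprehensions make short, here is a sketch of chunk-level coverage that doesn't care whether the chunks are verses, sentences, or paragraphs. The function names and toy data are made up for illustration and are not the vocabulary-tools API.

```python
from collections import Counter


def top_lemmas(chunks, n):
    """The n most frequent lemmas across all chunks."""
    counts = Counter(lemma for lemmas in chunks.values() for lemma in lemmas)
    return {lemma for lemma, _ in counts.most_common(n)}


def readable_chunks(chunks, known, threshold=1.0):
    """Ids of chunks where at least `threshold` of the tokens have known lemmas."""
    coverage = {
        chunk_id: sum(1 for lemma in lemmas if lemma in known) / len(lemmas)
        for chunk_id, lemmas in chunks.items()
        if lemmas
    }
    return [chunk_id for chunk_id, share in coverage.items() if share >= threshold]


# Toy data: chunk id -> lemmas of its tokens (chunks could be any unit).
chunks = {"1.1": ["a", "b", "a"], "1.2": ["a", "c"], "1.3": ["d", "b"]}
known = top_lemmas(chunks, 2)
print(readable_chunks(chunks, known))
print(readable_chunks(chunks, known, threshold=0.5))
```

Keeping the chunking as just a mapping from ids to lemma lists is what lets the same function work for any chunking system.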

There are a lot of little things you can do with just a couple of lines of Python and I’ve tried to avoid turning those into their own library of tiny functions. Instead, I’m compiling a little tutorial / cookbook as I go which you can read the beginnings of here:

https://github.com/jtauber/vocabulary-tools/blob/master/examples.rst

There’s still a fair bit more to move over (even going back 11 years to some stuff from 2008) but let me know if you have any feedback, questions, or suggestions. I’m generalising more and more as I go so expect some things to change dramatically.

If you’re interested in playing around with this stuff for corpora in other languages, let me know how I can help you get up and running. The main requirement is a tokenised and lemmatised corpus (assuming you want to work with lemmas, not surface forms, as vocabulary items) and also some form of chunking information. See https://github.com/jtauber/vocabulary-tools/tree/master/gnt_data for the GNT-specific stuff that would (at least partly) need to be replicated for another corpus.
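As a rough illustration of those requirements (a minimal sketch only, not the API of vocabulary-tools; the function names and the toy data are hypothetical), a tokenised, lemmatised, and chunked corpus can be reduced to a list of (chunk id, lemma) pairs. From there, Counter and a dictionary comprehension are enough for a frequency-based core list and per-chunk coverage:

```python
from collections import Counter, defaultdict

# Toy stand-in for a lemmatised corpus: one (chunk_id, lemma) pair per token.
# The chunk ids could be verses, sentences, paragraphs, etc.
TOKENS = [
    ("1.1", "kai"), ("1.1", "logos"),
    ("1.2", "kai"), ("1.2", "theos"), ("1.2", "kai"),
]


def core_vocabulary(tokens, size):
    """Return the `size` most frequent lemmas as a set."""
    counts = Counter(lemma for _, lemma in tokens)
    return {lemma for lemma, _ in counts.most_common(size)}


def coverage_by_chunk(tokens, known):
    """Map each chunk id to the fraction of its tokens whose lemma is known."""
    chunks = defaultdict(list)
    for chunk_id, lemma in tokens:
        chunks[chunk_id].append(lemma)
    return {
        chunk_id: sum(lemma in known for lemma in lemmas) / len(lemmas)
        for chunk_id, lemmas in chunks.items()
    }


core = core_vocabulary(TOKENS, 1)
print(coverage_by_chunk(TOKENS, core))
```

Swapping the chunk ids for a different chunking system leaves the rest of the code untouched, which is the kind of decoupling described above.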

Initial Apostolic Fathers Text Complete

2019-02-01

Exactly three months ago to the day, I announced that Seumas Macdonald and I were working on a corrected, open, digital edition of the Apostolic Fathers based on Lake. That initial work is now complete.

Preparing an Open Apostolic Fathers discussed the original motivation and the rather detailed process we went through.

The corrected raw text files are available on GitHub at https://github.com/jtauber/apostolic-fathers but I also generated a static site at https://jtauber.github.io/apostolic-fathers/ to browse the texts. The corrections will be contributed back to the OGL First1KGreek project.

The next step for us will be to lemmatise the text and there has already been some interest from others in getting the English translation corrected and aligned as well.

Recall that, while we were essentially correcting the Open Greek and Latin text, we used the CCEL text and that in Logos to identify particular places to look at in the printed text. We did this by lining up the CCEL, OGL and Logos texts and seeing where any of them disagreed. Those became the places we went back to, in multiple scans of the printed Lake, to make our corrections to the base text we started with from OGL.

How often did each of those three "witnesses" disagree? Here are some stats. A = CCEL, B = OGL, C = Logos. And so AB/C is where CCEL and OGL agreed against Logos, AC/B is where CCEL and Logos agreed against OGL, A/BC is where OGL and Logos agreed against CCEL, and A/B/C is where all three disagreed.

FILE	AB/C	AC/B	A/BC	A/B/C
001	1.29%	1.15%	7.97%	0.32%
002	0.76%	1.20%	3.39%	0.37%
003	1.58%	2.20%	4.97%	0.28%
004	0.57%	1.33%	7.01%	0.28%
005	1.05%	1.79%	6.21%	0.84%
006	0.88%	1.18%	7.54%	0.69%
007	0.39%	0.88%	3.34%	0.20%
008	0.79%	0.87%	5.41%	0.44%
009	0.25%	1.53%	2.68%	0.38%
010	0.44%	4.05%	4.36%	0.25%
011	0.36%	1.86%	4.23%	0.14%
012	0.92%	1.15%	5.59%	0.43%
013	1.29%	0.90%	6.08%	0.34%
014	1.25%	0.34%	4.91%	0.08%
015	0.96%	0.65%	6.74%	0.50%
TOTAL	1.11%	1.12%	5.98%	0.34%

One can immediately see that CCEL diverged the most from the others (it had considerable lacunae for a start). The numbers for Logos diverging are probably overly high because of a systematic error we only noticed after work had started: a middle dot was often erroneously added after eta. This ultimately didn't affect anything other than perhaps flagging places Seumas and I had to check that we otherwise wouldn't have needed to.
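The four categories in the table can be sketched as a simple classification over aligned positions. This is only an illustration: it assumes the three texts have already been aligned into (CCEL, OGL, Logos) triples, the function names are hypothetical, and the denominator used for the percentages in the table is not stated here, so the sketch simply divides by the number of aligned positions.

```python
from collections import Counter

CATEGORIES = ("AB/C", "AC/B", "A/BC", "A/B/C")


def classify(a, b, c):
    """Classify one aligned position; A = CCEL, B = OGL, C = Logos."""
    if a == b == c:
        return "agree"
    if a == b:
        return "AB/C"   # CCEL and OGL agree against Logos
    if a == c:
        return "AC/B"   # CCEL and Logos agree against OGL
    if b == c:
        return "A/BC"   # OGL and Logos agree against CCEL
    return "A/B/C"      # all three disagree


def disagreement_rates(aligned):
    """Return the share of aligned positions falling into each category."""
    counts = Counter(classify(a, b, c) for a, b, c in aligned)
    return {category: counts[category] / len(aligned) for category in CATEGORIES}


rows = [("x", "x", "x"), ("x", "x", "y"), ("x", "y", "x"), ("y", "x", "x"), ("x", "y", "z")]
print(disagreement_rates(rows))
```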

But at the end of the day, how much did we change? How much of the OGL original remained? How similar was our result to the text on CCEL? And for a bit of fun, how often was my first correction and Seumas's first correction the same as what we ended up with after consensus was achieved? Here's the breakdown by work:

FILE	CCEL	OGL	JT	SM
001	91.27%	99.02%	99.85%	99.91%
002	96.02%	98.90%	99.77%	99.90%
003	94.58%	97.63%	99.77%	99.60%
004	92.42%	98.48%	99.91%	100.00%
005	92.32%	98.32%	99.79%	99.89%
006	91.28%	98.82%	98.82%	99.80%
007	96.07%	98.92%	99.90%	99.90%
008	93.89%	99.30%	100.00%	99.91%
009	96.82%	99.75%	98.60%	99.87%
010	94.94%	96.27%	99.87%	99.68%
011	95.04%	98.54%	99.77%	99.91%
012	93.86%	98.78%	99.87%	99.90%
013	93.15%	99.20%	99.87%	99.83%
014	94.90%	99.62%	99.92%	99.74%
015	92.69%	99.16%	99.96%	99.62%
TOTAL	93.32%	98.97%	99.83%	99.84%

You just beat me Seumas :-)

More Thoughts on Different Morphological Analyses

2019-01-14

In Five Types of Morphological Analysis I outlined five distinct ways of approaching morphological (or potentially any linguistic) analysis. In support of some of these, I have some additional examples from a pair of papers I'm reading and a conference I just attended.

Baayen et al (2018) (co-written by Jim Blevins, my undergraduate advisor from 25 years ago and still a mentor), in describing their own word-based, discriminative approach to morphology, contrast it with both widespread morpheme-based approaches and increasingly popular exponent-focused realizational approaches. I'll leave a discussion of these different approaches to another time, but what is relevant to my previous post is this comment:

[morpheme-based and realizational analyses] may be of practical value, especially in the context of adult second language acquisition. It is less clear whether the corresponding theories, whose practical utility derives ultimately from their pedagogical origins, can be accorded any cognitive plausibility.

Note the distinction they are making between analyses of practical (adult SLA, pedagogical) value and cognitive plausibility.

Again, it's not the point of this post to describe (much less assess) their arguments for why morphemes and exponents might not be cognitively plausible and what the alternative is, merely that they acknowledge certain analyses might be useful for pedagogical purposes independent of their cognitive plausibility (thereby agreeing with my psychological vs pedagogical distinction).

Perhaps cognitive would be another word for my psychological category.

They furthermore suggest:

Constructional schemata, inheritance, and mechanisms spelling out exponents are all products of descriptive traditions that evolved without any influence from research traditions in psychology. As a consequence, it is not self-evident that these notions would provide an adequate characterization of the representations and processes underlying comprehension and production. It seems particularly implausible that children would be motivated to replicate the descriptive scaffolding of [these] theoretical accounts...

Terms like "descriptive traditions" and "descriptive scaffolding of theoretical accounts" refer to what I had in mind with my synchronic category of analysis. Perhaps descriptive and theoretical would be other words for that category.

In a related paper, Baayen et al (2019), they talk about three possible responses to the challenge posed to linguistics (or at least linguistically-informed natural language processing) by the success of machine learning.

Again, it's outside the scope of this post to get into those details, but in short, their suggested possible responses are: (1) admit defeat, (2) claim the hidden layers reflect traditional linguistic representations, (3) rethink the nature of language processing in the brain. They go on to explore the third option in the context of morphology and the lexicon, stating that

the model that we propose here brings together several strands of research across theoretical morphology, psychology, and machine learning.

Note that this is essentially a claim that it's possible to reconcile at least three of the different approaches I've outlined: the synchronic/description/theoretical, the cognitive/psychological, and the algorithmic/machine-learning.

(Missing here is any reference to diachrony or pedagogy, which I think they would agree are distinct approaches to what they are attempting to unify).

Now last week, I attended the Society for Computation in Linguistics meeting, coinciding with the big annual meeting of the Linguistic Society of America. One of the goals of SCiL is to build bridges from the NLP community to the linguistics community so it was of particular interest to me.

But again one of the big things that came up in multiple talks was distinct approaches: the approach of the NLP practitioners, often referred to as the engineering approach, and that of the linguists, often referred to as the scientific approach. At their most self-deprecating, the NLP practitioners confessed their over-obsession with metrics on "tasks" and lack of regard for the underlying scientific "questions". Noah Smith, in fact, joked that NLPers can annoy linguists by asking what their "task" is and linguists can annoy NLPers by asking what their "question" is.

The point of mentioning this is yet another example of a difference in approach and perspective.

Diachrony didn't feature at all in either the Baayen/Blevins papers or at SCiL, but certainly my other distinctions seem more broadly confirmed (albeit with alternative terminology). So I think we have:

algorithmic / engineering / task-oriented
diachronic
synchronic / descriptive / theoretical
psychological / cognitive
pedagogical

Now this is not to say some of these approaches can't be combined (as shown in the Baayen/Blevins papers). But even when one is attempting to combine some of them, I think it's useful to acknowledge (a) the multiple approaches being combined; (b) other approaches with distinct goals and evaluation procedures that aren't being considered but which may still be valuable in other contexts.

At the end of the day, I'm trying to turn arguments of the form "that isn't a good theory/description/implementation/explanation of morphology" into a more nuanced "it probably isn't good for this but it might be good for that".

References

Baayen, R. H., Chuang, Y. Y., and Blevins, J. P. (2018). Inflectional morphology with linear mappings. The Mental Lexicon, 13 (2), 232-270.

Baayen, R. H., Chuang, Y. Y., Shafaei-Bajestan E., and Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019, 1-39.

Five Types of Morphological Analysis

2018-12-10

People talking about morphological analyses can often speak across each other because they have different purposes in mind. Here's an initial attempt to outline five possibly distinct notions one might be referring to.

I'm tentatively labelling them:

algorithmic
diachronic
synchronic
psychological
pedagogical

although the labels matter less than being clear about the distinction.

Algorithmic means I can go from an inflected form to a lemma + morphosyntactic properties (or vice versa) efficiently on a computer. The way this is achieved might not be psychologically plausible or historically accurate but it can be implemented in software to get the job done.

Diachronic means I can explain (or at least speculate) how the inflected form came about: what the roots are, what grammaticalisation took place, what sound changes explain seeming irregularities, etc.

Synchronic means I can describe the inflected forms without recourse to historical data or reconstruction. This might focus on perspicuity rather than computational efficiency or psychological plausibility.

Psychological means the analysis is consistent with what I think is (or was) going on in the minds of native speakers. Some people may equate this with synchronic analyses but I think you can have a psychologically implausible yet still descriptively adequate synchronic analysis.

Pedagogical means a useful way of explaining it to students. This may be diachronic, but might be more synchronic (whether psychologically plausible or not).

Analyses can obviously be compatible with more than one of these. But I think it's helpful to be clear what the goals of any morphological description are. If the goal is to lemmatise and tag a new text, then psychological or historical plausibility, or analytical or pedagogical clarity might not matter. If one's goal is a diachronically-informed analysis to help students, it should be clear why an otherwise perfectly adequate morphological parser might not be producing useful information.

Those who have been following my Tour of Greek Morphology know I've tried to be careful distinguishing, for example, historical explanations from how I think native speakers internalise(d) word forms, or how students should learn them.

I still come across a lot of people who think the "modern" way of understanding morphology is learning the "morphemes" and rules, not memorising paradigms. Besides getting the history somewhat wrong, this is also making the mistake of conflating these different types of analyses and not recognising that one type of analysis might be perfectly valid for one purpose but not another.

Here's a fun game to play: how would you analyse/explain the form λαμβάνω? Or ἔλαβον (especially when 3rd plural) or λήμψομαι? Or μαθητής vs μαθητοῦ? Or ἔδωκεν vs δέδωκα vs δός?

Maybe I haven't quite nailed the labels yet. Maybe there are further distinctions to draw. I welcome people's input.

Preparing an Open Apostolic Fathers

2018-11-01

I'm working with Seumas Macdonald on an open, corrected digital edition of the Apostolic Fathers based on Lake.

Seumas Macdonald asked me a few weeks ago what it would take to expand some of our text and vocab ordering experiments to the text of Apostolic Fathers (we're both desirous of more comprehensible input for Greek learners).

My reply was that we first of all needed to get a good open text and then lemmatise it. I thought the "get a good open text" would be trivial but it turned out not to be.

I asked around without much positive response. I found HTML versions of the Lake texts on the Christian Classics Ethereal Library (CCEL) website but they turned out to be problematic quality-wise (see below).

It then occurred to me to check what was in the Perseus Digital Library. It only had the Epistle of Barnabas but the related First 1000 Years of Greek at the Open Greek and Latin Project had done the rest.

The Perseus/OGL texts were considerably better than the CCEL ones, but were still not without problems. It was clear that the two collections had been produced independently, however, which is important for what follows.

I'm almost certain the CCEL texts were keyed in. There is haplography and dittography galore! The haplography even corresponds almost perfectly to line breaks in the printed Lake editions I looked at.

The Perseus/OGL texts, on the other hand, are the results of OCR with some manual correction.

I wrote some code to extract both the CCEL and Perseus/OGL texts and put them in a comparable format. I then wrote a script to align the two. My thinking was to go through all the places where the two disagreed, check the printed Lake and correct the Perseus/OGL text accordingly.
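As an illustration of the alignment step (not the actual script in the repository; this sketch assumes both texts are already reduced to token lists and uses the standard library's difflib, with a hypothetical function name), the places where two witnesses disagree can be pulled out like this:

```python
from difflib import SequenceMatcher


def disagreements(a_tokens, b_tokens):
    """Return (tag, tokens_from_a, tokens_from_b) for every non-matching span."""
    matcher = SequenceMatcher(a=a_tokens, b=b_tokens, autojunk=False)
    return [
        (tag, a_tokens[i1:i2], b_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Placeholder tokens standing in for two transcriptions of the same passage;
# the second has skipped a word, much as haplography would.
witness_a = ["w1", "w2", "w3"]
witness_b = ["w1", "w3"]
print(disagreements(witness_a, witness_b))
```

Each span returned is then a candidate place to check against the printed edition.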

I decided to throw the Lake text from Logos into the mix as well, not as an input to the correction itself but merely as another "edition" to flag differences with (to then check with the printed Lake).

Thus began a project Seumas and I have been working on the last few weeks. Once differences in any of the three texts are identified, they are flagged for review and Seumas and I independently look at the printed Lake and correct the Perseus/OGL base text.

If our corrections disagree, we continue to work on them until we come to consensus. This three-way comparison followed by two-way independent correction is proving to work very well (although it's a lot of work!)

All the code, the source texts (except Logos), and work-in-progress are available at

https://github.com/jtauber/apostolic-fathers

and you can follow along the status in the README. There are also more detailed notes on the whole process.

Once the candidate versions of all the texts are published, I'll do another post just with some interesting statistics on the nature of errors in the CCEL, Perseus/OGL, and Logos texts. The "scribal errors" in the CCEL text are particularly fascinating but even some of the Perseus/OGL OCR errors will be worth writing about.

Seumas and I will then contribute back the corrections to CCEL, Perseus/OGL, and Logos. Hopefully our texts will also be featured on the Biblical Humanities Dashboard as the go-to open digital text of the Apostolic Fathers (so no one else has to repeat this effort).

Finally, we'll start the process of lemmatisation so the Apostolic Fathers can be included in our open learning materials.

A Tour of Greek Morphology: Part 27

2018-10-18

Part twenty-seven of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

Let's finish our survey of imperfect middle endings in the indicative with the athematic verbs.

	IM-6	IM-7	IM-8	IM-9
1SG	Xύμην	Xέμην	Xόμην	Xάμην
2SG	Xυσο	Xεσο	Xοσο	Xασο/Xω
3SG	Xυτο	Xετο	Xοτο	Xατο
1PL	Xύμεθα	Xέμεθα	Xόμεθα	Xάμεθα
2PL	Xυσθε	Xεσθε	Xοσθε	Xασθε
3PL	Xυντο	Xεντο	Xοντο	Xαντο

The classes are similar to their IA- equivalents except there is no ablaut between the singular and plural.

The intervocalic sigma in 2SG generally does not drop out in the athematics, although it sometimes can, particularly in IM-9, which seems to be the class furthest along in merging with the thematics. Note, though, that the lack of circumflex in this case eliminates confusion with an IM-4 2SG.

The lack of circumflex in the 3SG and 2PL also eliminates confusion with IM-4 in those cells.

IM-7 can be confused for IM-1 in the 3SG and 2PL, though.

In the next few posts we'll summarise the inference rules and ambiguities for the imperfect and look at some type and token frequencies, just like we did for the present.

A Tour of Greek Morphology: Part 26

2018-09-08

Part twenty-six of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

We've looked at the imperfect endings for the thematic actives and middles. Now let's look at the athematic active endings.

	IA-6	IA-7	IA-8	IA-9	IA-9b	IA-10	IA-11
1SG	Xῡν	Xην/Xειν	Xουν	Xην	Xην	ἦ/ἦν	ᾖα/ᾔειν
2SG	Xῡς	Xεις	Xους	Xης	Xης/Xησθα	ἦς/ἦσθα	ᾔεις/ᾔεισθα
3SG	Xῡ	Xει	Xου	Xη	Xη	ἦν	ᾔει/ᾔειν
1PL	Xυμεν	Xεμεν	Xομεν	Xαμεν	Xαμεν	ἦμεν	ᾖμεν
2PL	Xυτε	Xετε	Xοτε	Xατε	Xατε	ἦτε	ᾖτε
3PL	Xυσαν	Xεσαν	Xοσαν	Xασαν	Xασαν	ἦσαν	ᾖσαν/ᾔεσαν

IA-6 is the -νυ- verbs like δείκνυμι. There is ablaut between the singular and plural (ῡ vs υ).

IA-9 is ἵστημι and compounds. There is again the expected singular/plural ablaut (η vs α).

IA-8 is δίδωμι and compounds. There is a vowel alternation but it is ου/ο and not ω/ο ablaut like in the present.

IA-7 is τίθημι, ἵημι and their compounds. The vowel alternation here is ει/ε and not η/ε ablaut like in the present except for the η in the 1SG.

IA-9b is φημί which is like ἵστημι but with the added 2SG Xησθα.

IA-10 and IA-11 are εἰμί and εἶμι respectively. The -σθα 2SG ending comes up again but there are other differences that we will eventually want to unpack.

For the most part, the endings follow those of the thematic imperfects. The consistent difference is the 3PL -σαν (although see below).

We'll save for later posts what's going on with the -σθα ending and with various parts of the IA-10 and IA-11 paradigms. But I want to note something intriguing about the unexpected vowel alternations in IA-7 and IA-8.

Xουν ~ Xους ~ Xου is what we see in IA-3 and Xεις ~ Xει in IA-2. This suggests that these athematic verbs were starting to be inflected as if they were thematic.

Along similar lines, John 21.18 has ἐζώννυες with a theme vowel. Acts 27.1 has παρεδίδουν for the plural (yet παρεδίδοσαν in Acts 16.4).

Back from International Colloquium on Ancient Greek Linguistics

2018-09-06

Last week I attended the ninth International Colloquium on Ancient Greek Linguistics at the University of Helsinki.

It was an excellent conference with a lot of good linguistic and philological content featuring some nice quantitative analyses.

Some of the paper highlights for me:

Paul Kiparsky on a regular sound change explanation (via Optimality Theory) for various alternations usually explained via analogy
Robert Crellin on the ambiguity of Greek without vowels as part of an exploration of why Greek introduced written vowels in the first place
Lucien van Beek on atelic perfects in Homeric Greek
David Goldstein on differential agent marking (dative vs prepositional phrase) in Herodotus
Sandra Rodríguez Piedrabuena on (im)politeness strategies in Ancient Greek

I may do individual follow-up posts to some of these as they inspired potential investigations of my own in the future.

It was also great just catching up with people I've met the last couple of years at Greek and Indo-European conferences at UCLA, Oxford, and Cambridge.

A Tour of Greek Morphology: Part 25

2018-08-25

Part twenty-five of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In the previous part we looked at the endings of the active imperfects with theme vowels. Now we are going to look at the middles.

	IM-1	IM-2	IM-3	IM-4	IM-5
1SG	Xόμην	Xούμην	Xούμην	Xώμην	Xώμην
2SG	Xου	Xοῦ	Xοῦ	Xῶ	Xῶ
3SG	Xετο	Xεῖτο	Xοῦτο	Xᾶτο	Xῆτο
1PL	Xόμεθα	Xούμεθα	Xούμεθα	Xώμεθα	Xώμεθα
2PL	Xεσθε	Xεῖσθε	Xοῦσθε	Xᾶσθε	Xῆσθε
3PL	Xοντο	Xοῦντο	Xοῦντο	Xῶντο	Xῶντο

The vowel differences between these five different classes of verb should largely be familiar to you by now as they're pretty much the same pattern we've seen in the present active, present middle, and imperfect active—namely:

The -2 class historically had an ε before the theme vowel and this led (depending on whether the theme vowel was ε or ο) to ει or ου
The -3 class historically had an ο before the theme vowel and this led (regardless of whether the theme vowel was ε or ο) to ου
The -4 class historically had an α before the theme vowel and this led (depending on whether the theme vowel was ε or ο) to ω or ᾱ
The -5 class is like the -4 class but with an η for the ᾱ
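The list above can be turned into a small lookup table that rebuilds most of the IM-1 to IM-5 distinguishers. This is only a sketch under stated assumptions: accents are ignored, the 2SG is left out (its sigma-drop is a separate step), the theme vowel per cell is read off the IM-1 column, and the names OUTCOME, THEME, PERSONAL and distinguisher are hypothetical.

```python
# Result of combining each class's pre-theme vowel with the theme vowel (ε or ο).
# Class 1 has no preceding vowel, so the theme vowel survives unchanged.
OUTCOME = {
    "1": {"ε": "ε", "ο": "ο"},
    "2": {"ε": "ει", "ο": "ου"},
    "3": {"ε": "ου", "ο": "ου"},
    "4": {"ε": "ᾱ", "ο": "ω"},
    "5": {"ε": "η", "ο": "ω"},
}

# Theme vowel and personal ending per cell of the imperfect middle.
THEME = {"1SG": "ο", "3SG": "ε", "1PL": "ο", "2PL": "ε", "3PL": "ο"}
PERSONAL = {"1SG": "μην", "3SG": "το", "1PL": "μεθα", "2PL": "σθε", "3PL": "ντο"}


def distinguisher(verb_class, cell):
    """Build an unaccented distinguisher such as Xειτο for class 2, 3SG."""
    return "X" + OUTCOME[verb_class][THEME[cell]] + PERSONAL[cell]


for verb_class in OUTCOME:
    print("IM-" + verb_class, [distinguisher(verb_class, cell) for cell in THEME])
```

Apart from the missing accents, the output lines up with the columns of the table above, which is a quick check that the four rules really do account for the differences between the classes.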

One difference in the above table from what we've seen before is that the 2SG ending is identical between IM-2 and IM-3 and between IM-4 and IM-5.

The fact the distinguisher is a bare diphthong might remind you of the 2SG in the present middle, which in part 9 we partially explained as historically coming from a dropped intervocalic sigma (e.g. ε+σαι > εαι > ηι > ῃ). This is indeed what happened here too.

The pattern is clearer put alongside the 3SG and 3PL as well.

	PM-1	IM-1
2SG	ε+σαι > ῃ	ε+σο > ου
3SG	ε+ται	ε+το
3PL	ο+νται	ο+ντο

We can see here that, prior to the dropping of the sigma (and subsequent contraction) to a long-ο written as a spurious diphthong ου, the present and imperfect endings in the 2SG, 3SG, and 3PL just differed in a final αι/ο alternation (which is tantalisingly close to just a iota/no-iota alternation like we might expect).

If we try to summarise the historical origins of the personal endings, we might get something like the following:

	PA	IA	PM	IM
1SG	μι	μ	μαι	μην
2SG	σι	σ	σαι	σο
3SG	τι	τ	ται	το
1PL	μεν	μεν	μεθα	μεθα
2PL	τε	τε	σθε	σθε
3PL	ντι	ντ	νται	ντο

There is a clear μ/σ/τ/ντ pattern in the 1SG/2SG/3SG/3PL. Cross-cutting this there is a clear ι/-/αι/ο pattern in the PA/IA/PM/IM. The exception is the μην in the IM 1SG (where we might expect μο).
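The cross-cutting pattern can be checked mechanically. In this sketch (the variable names are hypothetical, and the data is just the summary table above), each ending is predicted as a person marker plus a series marker and compared with the table:

```python
PERSON = {"1SG": "μ", "2SG": "σ", "3SG": "τ", "3PL": "ντ"}
SERIES = {"PA": "ι", "IA": "", "PM": "αι", "IM": "ο"}

# The relevant rows of the summary table, keyed by (person/number, series).
SUMMARY = {
    ("1SG", "PA"): "μι", ("1SG", "IA"): "μ", ("1SG", "PM"): "μαι", ("1SG", "IM"): "μην",
    ("2SG", "PA"): "σι", ("2SG", "IA"): "σ", ("2SG", "PM"): "σαι", ("2SG", "IM"): "σο",
    ("3SG", "PA"): "τι", ("3SG", "IA"): "τ", ("3SG", "PM"): "ται", ("3SG", "IM"): "το",
    ("3PL", "PA"): "ντι", ("3PL", "IA"): "ντ", ("3PL", "PM"): "νται", ("3PL", "IM"): "ντο",
}

# Predict every cell as person marker + series marker.
predicted = {(p, s): PERSON[p] + SERIES[s] for p in PERSON for s in SERIES}

# Cells where the simple two-factor pattern fails.
exceptions = [cell for cell in predicted if predicted[cell] != SUMMARY[cell]]
print(exceptions)
```

The only cell that does not fit the two-factor pattern is the IM 1SG, where the prediction is μο but the table has μην.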

The 1PL and 2PL seem to be playing by a different set of rules and notice they don't make a distinction between the present and imperfect at all.

Note that this summary of endings, while providing a historical background to the Greek forms we see, is really in the realm of Indo-European comparative linguistics rather than Greek. It's the foundation of how Ancient Greek came to be the way it was, but it doesn't reflect the way native speakers would have internalised inflections, nor should it suggest how they should be taught nowadays.

The goal here is to explain some things once the actual endings are already familiar.

A Tour of Greek Morphology: Part 24

2018-07-29

Part twenty-four of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

Now let's look at the imperfect forms corresponding to the active omega verbs we looked at in the present way back in part 4.

We'll use IA-1 through IA-5 for the distinguisher patterns corresponding to the verbs that followed PA-1 through PA-5 in the present.

	IA-1	IA-2	IA-3	IA-4	IA-5
1SG	Xον	Xουν	Xουν	Xων	Xων
2SG	Xες	Xεις	Xους	Xᾱς	Xης
3SG	Xε(ν)	Xει	Xου	Xᾱ	Xη
1PL	Xομεν	Xοῦμεν	Xοῦμεν	Xῶμεν	Xῶμεν
2PL	Xετε	Xεῖτε	Xοῦτε	Xᾶτε	Xῆτε
3PL	Xον	Xουν	Xουν	Xων	Xων

Recall:

PA-1	barytone omega verbs
PA-2	circumflex omega verbs with INF -εῖν / 3SG -εῖ
PA-3	circumflex omega verbs with INF -οῦν / 3SG -οῖ
PA-4	circumflex omega verbs with INF -ᾶν / 3SG -ᾷ
PA-5	ζάω + compounds

It is clear that the imperfect endings shown above had a theme vowel (alternating ο/ε exactly as with the present) which historically contracted with the preceding vowel (if it existed) under exactly the same rules as with the present forms (explained in detail in part 8).

	theme vowel - ending
1SG	ο - ν
2SG	ε - ς
3SG	ε -
1PL	ο - μεν
2PL	ε - τε
3PL	ο - ν

Too often with paradigms we only look at the person/number alternations within a fixed tense/aspect/voice. Let's now look at the possible present / imperfect alternations in the endings we've seen (ignoring the augment for now):

	present	imperfect
1SG	Xω	Xον
	Xῶ	Xουν or Xων
2SG	Xεις	Xες
	Xεῖς	Xεις
	Xοῖς	Xους
	Xᾷς	Xᾱς
	Xῇς	Xης
3SG	Xει	Xε(ν)
	Xεῖ	Xει
	Xοῖ	Xου
	Xᾷ	Xᾱ
	Xῇ	Xη
3PL	Xουσι(ν)	Xον
	Xοῦσι(ν)	Xουν
	Xῶσι(ν)	Xων

With 1PL and 2PL endings identical between present and imperfect.

The Normalisation Column in MorphGNT

2018-07-23

Eliran Wong asked for a more detailed description of the “normalisation” column in MorphGNT so I promised him I’d write a blog post about it.

I first outlined the objective of the column in a 2005 blog post but enough time has passed and new work done that I thought it was worthy of a new post.

The core idea of the normalised column is to give the inflected form as it would be stated in isolation.

To use the example from the 2005 post, consider the phrase in Matthew 1.20:

τὴν γυναῖκά σου

If you were to ask someone what the accusative singular feminine definite article is, you'd expect the answer τήν and not τὴν. Similarly if you asked what the accusative singular of γυνή is, you'd expect the answer γυναῖκα and not γυναῖκά. The differences in Matthew 1.20 are contextual and, for many applications (particularly morphology) aren't of much interest.

And so years ago, I went about adding a new column that normalised this sort of thing. Similarly μετά, μεθ', μετ', and μετὰ all get normalised to μετά in this separate column.

Back in the 2005 post, I enumerated the normalisations as:

existing text may exhibit elision (e.g. μετ' versus μετά)
existing text may exhibit movable ς or ν
final-acute may become grave
enclitics may lose an accent
word preceding an enclitic may gain an extra accent
the οὐ / οὐκ / οὐχ alternation

When I published the SBLGNT analysis, another normalisation was added, namely the normalisation of capitalisation at the start of paragraphs or direct speech. The capitalisation is not an inherent part of the inflected form in isolation, only the particular context of the token, and so it is normalised.

In Analysing the Verbs in Nestle 1904 I covered some differences between the SBLGNT and Nestle 1904 analyses that normalisation would have smoothed over. Note that normalisation COULD go further (for example, spelling differences) but I chose not to do that in the normalisation column.

In brief, the things NOT normalised include:

spelling
crasis (e.g. κἀγώ vs καὶ ἐγώ)

In Annotating the Normalization Column in MorphGNT: Part 1 I started talking about annotating WHY each token was normalised the way it was and you can see some counts there for how many tokens underwent normalisation of accent or capitalisation, and how many had elision or a movable nu or sigma.

In many cases, the normalisation can be automated without any need for human intervention (by having a list of elidable words, enclitics, etc). I'll soon publish my latest Python code for doing this. In some cases, manual checking is needed (although lemmatisation generally resolves a lot of the ambiguities). In Direct Speech Capitalization and the First Preceding Head I talked about the start of some work to go through all capitalisation and identify the reason for it. Similarly New MorphGNT Releases and Accentuation Analysis discusses work on annotating the reason for all accentuation changes.

There is still lots more work to do this for the SBLGNT but I did apply the idea when working on Seumas Macdonald's Digital Nyssa project. For that, I produced a file the first five lines of which are:

Ἦλθε                    ἦλθε                    capitalisation
καὶ                     καί                     grave
ἐφ’                     ἐπί                     elision
ἡμᾶς                    ἡμᾶς                    
ἡ                       ἡ                       proclitic

Here each token is normalised in the second column with the third column giving the reason for any difference between the token and the normalised form (and also indicating proclitics).

The possible annotations (and there can be more than one on a token) are:

grave
capitalisation
elision
movable
extra
proclitic
enclitic
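A few of these normalisations can be sketched with nothing more than Unicode decomposition. This is an illustration rather than the MorphGNT code: the function name and the tiny elision list are hypothetical, and it only handles elision, the grave accent, and the extra accent before an enclitic (capitalisation, movable letters, clitics, and the οὐ / οὐκ / οὐχ alternation are left out).

```python
import unicodedata

ACUTE, GRAVE, CIRCUMFLEX = "\u0301", "\u0300", "\u0342"

# Hypothetical, deliberately tiny list of elided forms (a real list would be much longer).
ELIDED = {
    unicodedata.normalize("NFC", form): unicodedata.normalize("NFC", full)
    for form, full in {"μετ'": "μετά", "μεθ'": "μετά", "ἐφ'": "ἐπί"}.items()
}


def normalise(token):
    """Return (normalised form, list of annotations) for a single token."""
    # Treat the typographic apostrophe the same as the ASCII one.
    token = unicodedata.normalize("NFC", token.replace("\u2019", "'"))
    if token in ELIDED:
        return ELIDED[token], ["elision"]
    # Decompose so that accents become separate combining characters.
    chars = list(unicodedata.normalize("NFD", token))
    reasons = []
    accents = [i for i, c in enumerate(chars) if c in (ACUTE, GRAVE, CIRCUMFLEX)]
    if len(accents) > 1:
        # A second accent is the extra one gained before an enclitic: drop it.
        del chars[accents[-1]]
        reasons.append("extra")
    if GRAVE in chars:
        # A grave only arises in context; in isolation the form has an acute.
        chars = [ACUTE if c == GRAVE else c for c in chars]
        reasons.append("grave")
    return unicodedata.normalize("NFC", "".join(chars)), reasons


for token in ["τὴν", "γυναῖκά", "μετ'", "ἐφ’", "ἡμᾶς"]:
    print(token, *normalise(token))
```

Working on the decomposed form keeps breathings, iota subscripts, and diaereses untouched, since only the three accent characters are ever inspected.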

I hope to eventually be able to provide the same for the entire SBLGNT (and other Greek texts).

Doing all this normalisation has a number of benefits. It makes it easier to extract forms for studying morphology, and it allows searches to work more as expected (you don't want to have to think up all the possible ways a form could actually be written in a text to search for it). It also allows much easier searching for particular phenomena (for example, particular clitic accentuation).

It also allows for more rigorous validation of things like accentuation. Work in this area has already uncovered a number of accentuation errors in the SBLGNT text, for example, and could help with automated checking of OCR, etc.

A Tour of Greek Morphology: Part 23

2018-05-26

Part twenty-three of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

Okay, so we want to contrast two forms of the indicative generally referred to as the "present" and "imperfect".

As we always do with paradigms, we'll keep certain things constant (in this case, the lexeme, voice and mood) and vary things along one axis (person / number agreement) and another axis (present vs imperfect).

	present	imperfect
1SG	λύω	ἔλυον
2SG	λύεις	ἔλυες
3SG	λύει	ἔλυε
1PL	λύομεν	ἐλύομεν
2PL	λύετε	ἐλύετε
3PL	λύουσι	ἔλυον

There are numerous things which should stand out:

the imperfect forms all have an initial ἐ-
this is then followed by the same λυ root found in the present
this is then followed by an ε/ο "theme" vowel
the 1SG and 3PL are identical in the imperfect
the present and imperfect share the same ending in the 1PL and in the 2PL

There's another perhaps more subtle thing you may notice:

the endings in the imperfect 2SG and 3SG are the same as the present without the ι

Recall also that the -ουσι ending in the present 3PL historically came from -οντι. Without the ι, that would be -οντ and given Greek words can only end in ν, ς, or a vowel, dropping the τ from -οντ would give us the -ον we see.

Furthermore, if we consider the athematic 1SG ending -μι and drop the ι, we get -μ. This is not one of the sounds a Greek word can end in and historically, this was changed to a ν. This gives us the -ον we see in the 1SG.

So it seems that historically the relationship between the two sets of endings has to do with the existence or non-existence of an ι. The only exceptions are the 1PL and 2PL. Interestingly these are the only two-syllable endings (counting the theme vowel).

It could even be stated (at least in the earlier history) as: imperfect has ἐ- but not -ι- and the present has -ι- but not ἐ-, except in the two-syllable ending cases where the only contrast is the existence or absence of ἐ-.
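The surface side of these observations can be checked mechanically. Here is a minimal sketch that does so. It is not one of the scripts behind this series: the names `strip_diacritics`, `drop_iota` and `relation` are made up for illustration, and it only compares accent-stripped strings from the table above, so it says nothing about the historical -οντι and -μι explanations.

```python
import unicodedata

PRESENT = {"1SG": "λύω", "2SG": "λύεις", "3SG": "λύει",
           "1PL": "λύομεν", "2PL": "λύετε", "3PL": "λύουσι"}
IMPERFECT = {"1SG": "ἔλυον", "2SG": "ἔλυες", "3SG": "ἔλυε",
             "1PL": "ἐλύομεν", "2PL": "ἐλύετε", "3PL": "ἔλυον"}


def strip_diacritics(form):
    """Remove accents and breathings so only the base letters are compared."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def drop_iota(form):
    """Remove every iota from an (already accent-stripped) form."""
    return form.replace("ι", "")


def relation(cell):
    """Classify how the imperfect form in a cell relates to the present form."""
    present = strip_diacritics(PRESENT[cell])
    imperfect = strip_diacritics(IMPERFECT[cell])
    if imperfect == "ε" + present:
        return "augment only"
    if imperfect == "ε" + drop_iota(present):
        return "augment, no iota"
    return "other"


for cell in PRESENT:
    print(cell, relation(cell))
```

On this data the 1PL and 2PL come out as "augment only", the 2SG and 3SG as "augment, no iota", and the 1SG and 3PL as "other", which are exactly the two cells that needed the historical explanation.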

We've only looked at λύω / ἔλυον so far, so in the next couple of posts we'll look to see how the imperfect endings work in other lexemes.

A Tour of Greek Morphology: Part 22

2018-05-16

Part twenty-two of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

I’ve deliberated for a while about whether to follow the present with the imperfect or with the aorist. I had recently elected to go with the aorist but as I sketched out what I wanted to say, I realised it would be easier if I’d said some things about the imperfect first.

And so I’ve decided to do a few posts about the imperfect.

We won’t talk about the endings in this post. I want us to start thinking about the imperfect and its relationship to the present not in terms of endings but in terms of the overall paradigm structure.

In previous posts, we saw that the present comes in two voices: an active and a middle (although we haven’t yet touched on the notion of presents coming in both versus just one of these). Within each voice, we looked at six indicative forms (corresponding to patterns of person and number agreement) and an infinitive (which effectively just has no person or number). We haven’t yet covered this, but each present voice also has imperative forms, subjunctive and optative forms, and participles in each of three genders.

The imperfect, in contrast, only has the indicative forms. No infinitive, no participles, no imperative, no subjunctive, and no optative.

We might be tempted to think of this in terms of the imperfect somehow being “defective”, as if we were doing a feature comparison like this:

	present	imperfect
indicatives	✓	✓
infinitives	✓	✗
imperatives	✓	✗
subjunctives	✓	✗
optatives	✓	✗
participles	✓	✗

But another way is to think of the imperfect as being part of the “present” family and providing a contrasting set of indicatives.

So we have:

indicatives 1 (“present”)
indicatives 2 (“imperfect”)
infinitives
imperatives
subjunctives
optatives
participles

This model suggests that, say, the infinitive or imperatives or participles, are just as much the infinitive, imperatives, or participles of the imperfect as they are of the present.

This also leads to the need for a new name for this entire family. Traditionally it’s referred to as the “present system” because of the shared stems, but as I've ranted on this blog before, I think it’s unfortunate to use “present” for both the entire system and for one of the two types of indicatives within it.

For reasons we'll touch on later, the system could perhaps better be called the “imperfective system”.

But the remainder of the posts on the imperfects will focus on their endings and, in particular, the contrast with the other set of indicatives (the “present” indicatives we’ve been talking about in the previous posts).

First Impressions of John Lee’s Accents Book

2018-04-25

John Lee’s Basics of Greek Accents was released today. Here are some first impressions.

Like D. A. Carson’s 1985 book Greek Accents: A Student’s Manual, Lee’s new book (based on notes from a class he taught at Macquarie University) is designed to backfill knowledge of Greek accents for those students whose beginning Greek skipped over them.

At least since Wenham’s Elements of New Testament Greek, there has been a trend in beginning New Testament Greek (and perhaps Classical Greek) textbooks to do away with instruction about accentuation. I haven’t investigated, but I suspect this correlates with a reduction in English-to-Greek exercises in textbooks too.

Lee, like Carson before him, considers an understanding of accents to be vital to learning Greek. The book, published by Zondervan, is clearly (in name and cover design) intended by them to fill the gap left by Mounce’s Basics of Biblical Greek.

Lee’s book is small—110 pages and about the size of a 5 x 7 photograph. It’s compact but lucid nevertheless. The modern typography makes for more pleasant reading than both Carson’s book and Probert’s 2003 New Short Guide to the Accentuation of Ancient Greek.

It’s a gentler introduction than either Carson or Probert. There are eight chapters or "lessons" and each has two sets of exercises (marked as "In Class" and "Homework"). All exercises involve adding accents to unaccented text. Examples and exercises are NT focused but not exclusively and the book would be more than suitable for Classical Greek students as well.

As is understandable given its goals, there are no theoretical underpinnings given and little historical explanation.

I’ve found a few places where, given it’s for beginners (albeit those who know some Greek), I wish Lee had been a little more explicit. For example he says that "Aorist active infinitives in -σαι accent on second last" but never explains when one might expect an acute versus a circumflex. A one line rule with several examples is typical. But it is rare that all the edge cases are covered.

After saying that the verb is generally recessive, he gives various forms of λύω including the subjunctive λυθῶ. He gives contraction as the reason for this one deviant form, but that is the last thing he says about subjunctives other than a remark a couple of pages later about ἀποδῷ being the pattern for compound -μι verbs.

While Lee is a gentler introduction, one thing I like about Carson’s book on accents is he’ll often be a little more exploratory, considering a new form and whether previous rules are adequate to cover the evidence, and only once motivated, introduce a new rule. In doing this, students are encouraged to think a little more about how the rules interact. In a way, Carson’s approach is more like what I’ve been trying to do with my morphology blog posts.

While there’s much to commend it as a first introduction to accents, I do find Lee often misses the forest and instead just catalogs the trees. There’s little view of the whole as a system, how the parts interact. I understand why you don’t start with that, but I feel you need to get to it eventually.

As an example, I recently summarised the first and second declension noun accents as follows:

by default the accent is persistent
however, if the ending is a different length than in the base form (nominative singular), the law of limitation may require an accent change (e.g. X́XS -> XX́L, L̃S -> ĹL, ĹL -> L̃S)
if the base form is oxytone, it becomes perispomenon (X́->L̃) in oblique cases (genitive and dative)
in the 1st declension, the genitive plural is always perispomenon -ῶν (even if the base is not oxytone)

I gave examples of contrasting pairs for every accentuation and syllable length combination in both the first and second declension, and highlighted various things like the importance of building an intuition for the L̃S ~ ĹL alternation (the σωτῆρα rule). I also pointed out that the oblique case perispomenon (XL̃) is only possible because all oblique case endings are long.

Now, I’m not suggesting that this is sufficient—it needs a certain amount of unpacking and is jargon heavy. But this, or something similar, makes a nice summary that ties multiple things together in explaining the first and second declension. It covers the fact that persistence and the law of limitation might be in conflict and how that gets resolved. It explains what happens to oxytones in the oblique cases, and gives the exception of 1st declension genitive plural, pointing out this is not limited just to the oxytones like the previous rule.

In contrast, Lee covers the relevant rules but never brings them together in the context of a single paradigm (other than θεός which hardly demonstrates most of the points). The statement about the genitive plural is 28 pages later than the statement about circumflexes in the oblique when the base form is oxytone. His examples of the law of limitation do cover a couple of direct~oblique alternations but that is isolated from the chapter on noun accentuation and is never explained in the context of vowel length patterns in the noun endings.

All in all, however, I think Lee’s book is a good first introduction to Greek accentuation and its presentation is undoubtedly cleaner than that of previous books. My main criticism is that it is incomplete and students would benefit from some consolidation of the principles taught. Some of that criticism may be mitigated in a classroom situation, for which it was originally intended. Students working alone might have more questions than the book answers. I would recommend something like Probert as a follow on (it will also make a better reference). That said, I think Lee achieves his aim in providing the "basics" and (to quote the back cover blurb) "a foundation [students] will use as they continue their studies".

Conference Time

2018-03-18

I’m off for another string of conferences, this time in Copenhagen, Chicago, and New Orleans.

First is a workshop on Original Language Resources for Bible Translation and Education organised by Nicolai Winther-Nielsen of the Global Learning Initiative and Reinier de Blois of the United Bible Societies. David Instone-Brewer put it best when he responded to the workshop invitation with "All the key people in one place with lots of time to talk and plan. How could I miss this?" Perhaps most exciting for me is I finally get to meet Ulrik Sandborg-Petersen for the first time after working together for more than twelve years!

I fly from Copenhagen to Chicago at the end of the week for the annual conference of the American Association for Applied Linguistics. It will be my first time attending the conference and I'm looking forward to learning a lot (although in contrast to the Copenhagen workshop, I'll know virtually no one).

I have to leave AAAL slightly early though, to go down to New Orleans for the first US VueConf. Vue.JS is an important technology in the Scaife Viewer and DeepReader reading environments. I went to the first European VueConf last year and gave a lightning talk on DeepReader. I had hoped to give a talk on the Scaife Viewer at VueConf US but my talk wasn't accepted so I'm hoping at least for another lightning talk.

A Tour of Greek Morphology: Part 21

2018-03-10

Part twenty-one of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

I started this series with

I ultimately hope to cover everything that a beginner-intermediate grammar might but in a much more exploratory fashion. I’ll occasionally touch on morphological theory but I mostly want to point out phenomena in the language that students have already seen but perhaps have not thought about in any depth.

(emphasis added)

In short, the primary goal has been (and will continue to be) to take data the reader already is assumed to know and to make observations and construct relationships that the reader perhaps didn’t already realise or know. The secondary goal is to talk a little bit about linguistic theory and historical linguistics in relation to the specific phenomena being discussed.

Now that we’re finished our first pass over (particularly the endings of) the present indicatives and infinitives, I wanted to summarise a few key points we’ve touched on that are of a more conceptual nature.

A paradigm is a way of showing related forms next to one another for comparison. We often keep some morphosyntactic properties constant while varying others. We often, but not always, keep the lexeme constant.
We can look at paradigms along (at least) three dimensions: (1) we can take one lexeme’s inflection and look at what stays the same and what changes in different cells; (2) we can take a morphosyntactic property set and look at what stays the same and what changes across different lexemes; (3) we can take a subset of morphosyntactic properties and vary them while keeping the rest of the set (and the lexeme) fixed.
Greek rarely has a one-to-one mapping between an individual morphosyntactic property and some surface property of the inflected form.
There are some cells in a paradigm that are highly predictable and others that are highly predictive.
There are relationships between cells which are often more helpful than relationships between a cell and its underlying or historical stem.
The primary role of morphology is to discriminate between alternatives, not build up compositional meaning.
Ambiguity in morphology can be tolerated if other things (syntax, context) help disambiguate.
There is a big difference between looking at patterns in the surface forms and exploring the historical reasons those patterns developed. While the latter is vital for answering “why”, it is not a crucial part of language acquisition. (Native English speakers don’t acquire strong verbs by understanding how Proto-Indo-European ablaut patterns led to Germanic inflectional classes!)

As well as these conceptual points, we’ve talked about the actual endings, inflectional classes, vowel contractions, frequency effects, and which cells might be the best to use as a lemma.

We also spent time actually testing our models against the corpus data with some Python scripts and showed how that uncovered some patterns we hadn’t previously considered.

We haven’t looked at everything to do with the presents, but it’s time to move on, at least for a while, to a different part of the verbal system.

That said, if you have any questions about the previous twenty parts, or any questions you're hoping will be answered in subsequent posts, just leave a comment (or email me if you want to ask anonymously).

A Tour of Greek Morphology: Part 20

2018-03-05

Part twenty of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In part 17, we went through counts for our present active (infinitive and indicative) classes. Now we'll wrap things up by doing the same for the middle.

Recall this is based on the analysis of 820 tokens available here which was described in the last two parts.

Let us first of all look at the number of distinct lemmas in each of our 14 classes.

PM-1	barytone thematics with INF -εσθαι / 3SG -εται	105
PM-2	circumflex thematics with INF -εῖσθαι / 3SG -εῖται	21
PM-3	circumflex thematics with INF -οῦσθαι / 3SG -οῦται (ζηλόω, ἐλαττόω, λυτρόομαι, διαβεβαιόομαι)	4
PM-4	circumflex thematics with INF -ᾶσθαι / 3SG -ᾶται	11
PM-5	circumflex thematics with INF -ῆσθαι / 3SG -ῆται (χράομαι and compound)	2
PM-6a	INF -υσθαι / 3SG -υται (ἀπόλλυμι, ἐνδείκνυμι, συναναμίγνυμι)	3
PM-7	INF -εσθαι / 3SG -εται (compound of τίθημι)	3
PM-8	INF -οσθαι / 3SG -οται	-
PM-9	INF -ασθαι / 3SG -αται (δύναμαι, compounds of ἵστημι)	8
PM-10	ἧμαι	-
PM-10-COMP	compounds of ἧμαι (κάθημαι)	1
PM-11	κεῖμαι	1
PM-11-COMP	compounds of κεῖμαι	7
PM-12	οἶμαι	1

Again, even the small counts are elevated due to compound verbs. Folding compounds of the same base verb, only PM-1, PM-2, PM-3, PM-4, and PM-6a have more than one or two members (and PM-6a only has three).

This is just looking at the number of unique lemmas in each class but there are two other sets of numbers that are worth looking at: (1) the total number of tokens in the SBLGNT; (2) the distribution of classes amongst the hapax legomena.

class	lemmas	tokens	hapax	hapax details
PM-1	105	523	45
PM-2	21	57	7
PM-3	4	5	3	ζηλόω ἐλαττόω λυτρόομαι
PM-4	11	33	4	μυκάομαι κοιμάομαι καταράομαι ἐγκαυχάομαι
PM-5	2	2	2	χράομαι and συγχράομαι
PM-6a	3	9	-
PM-7	3	5	2	διατίθεμαι and μετατίθημι
PM-8	-	-	-
PM-9	8	156	4	ἐξίστημι ἐφίστημι ἀνθίστημι ἀφίσταμαι
PM-10	-	-	-
PM-10-COMP	1	5	-
PM-11	1	9	-
PM-11-COMP	7	15	-
PM-12	1	1	1	οἶμαι

Recall the hapax legomena matter because they give an indication of what classes were still productive.

If we fold compounds under their base verb, only PM-1, PM-2, PM-3, and PM-4 have more than one hapax legomenon.

Let's now look at counts for each paradigm cell for each class:

	PM-1	PM-2	PM-3	PM-4	PM-5	PM-6a	PM-7	PM-8	PM-9	PM-10-C	PM-11	PM-11-C	PM-12
INF	89	15	4	8	-	4	-	-	12	2	-	3	-
1SG	85	17	-	3	-	1	4	-	9	1	1	-	1
2SG	19	1	-	5	-	-	-	-	7	-	-	-	-
3SG	228	7	-	8	-	-	-	-	74	2	7	11	-
1PL	20	4	-	3	1	3	-	-	9	-	1	-	-
2PL	24	9	-	3	-	-	1	-	32	-	-	-	-
3PL	58	4	1	3	1	1	-	-	13	-	-	1	-
	523	57	5	33	2	9	5	-	156	5	9	15	1

As in the active, the 3SG and INF dominate with only a few interesting exceptions. The third person (especially 3SG but also 3PL) is unusually low in PM-2. In PM-9, the 2PL is unusually high. This is almost certainly just because of particular lexical items that happen to be in those classes rather than an inherent characteristic of the class itself, although because the origins of some classes are derivational, there may occasionally be tendencies on semantic grounds.

If the goal is just to identify the person/number, not the class, (which is true in reception but not learning) then most of these numbers collapse because of shared endings. Here are the counts just focused on the common endings (without accents):

INF	-σθαι	137
1SG	-μαι	122
2SG	-{ι}	25
2SG	-σαι	7
3SG	-ται	337
1PL	-μεθα	41
2PL	-σθε	69
3PL	-νται	82
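The collapse from per-class counts to per-ending counts is just a matter of summing across each row of the earlier table. Here is a small sketch of that arithmetic with the table typed in by hand (so it is not the script that produced the counts, and the names `CLASSES`, `COUNTS`, `cell_totals` and `class_totals` are made up for illustration). The split of the 2SG into 25 and 7 is reproduced on the assumption that the seven -σαι tokens are the seven PM-9 forms, which the numbers are consistent with but which the table alone doesn't prove.

```python
CLASSES = ["PM-1", "PM-2", "PM-3", "PM-4", "PM-5", "PM-6a", "PM-7",
           "PM-8", "PM-9", "PM-10-C", "PM-11", "PM-11-C", "PM-12"]

# One count per class, in the order above; 0 where the table has "-".
COUNTS = {
    "INF": [89, 15, 4, 8, 0, 4, 0, 0, 12, 2, 0, 3, 0],
    "1SG": [85, 17, 0, 3, 0, 1, 4, 0, 9, 1, 1, 0, 1],
    "2SG": [19, 1, 0, 5, 0, 0, 0, 0, 7, 0, 0, 0, 0],
    "3SG": [228, 7, 0, 8, 0, 0, 0, 0, 74, 2, 7, 11, 0],
    "1PL": [20, 4, 0, 3, 1, 3, 0, 0, 9, 0, 1, 0, 0],
    "2PL": [24, 9, 0, 3, 0, 0, 1, 0, 32, 0, 0, 0, 0],
    "3PL": [58, 4, 1, 3, 1, 1, 0, 0, 13, 0, 0, 1, 0],
}

# Row totals: tokens per paradigm cell regardless of class.
cell_totals = {cell: sum(row) for cell, row in COUNTS.items()}

# Column totals: tokens per class regardless of cell.
class_totals = {
    cls: sum(row[i] for row in COUNTS.values())
    for i, cls in enumerate(CLASSES)
}

# Assumption: the 2SG forms in -σαι are exactly the PM-9 ones.
two_sg_sai = COUNTS["2SG"][CLASSES.index("PM-9")]
two_sg_other = cell_totals["2SG"] - two_sg_sai

print(cell_totals)
print(two_sg_other, two_sg_sai)
print(sum(class_totals.values()))
```

The row totals match the ending counts above, the column totals match the bottom row of the table, and the grand total is the 820 tokens the analysis started from.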

And that's it for the present middles. I'll do a brief summary post next and then we'll start exploring beyond the presents.

New Draft Morphological Tags for MorphGNT

2018-02-03

I’ve finally done the work in translating the MorphGNT tagging system to a new proposal for initial feedback.

At least going back to my initial collaboration with Ulrik Sandborg-Petersen in 2005, I've been thinking about how I would do morphological tags in MorphGNT if I were starting from scratch.

Much later, in 2014, I had some discussions with Mike Aubrey at my first SBL conference and put together a straw proposal. There was a rethinking of some parts-of-speech, handling of tense/aspect, handling of voice, handling of syncretism and underspecification.

Even though some of the ideas were more drastic than others, a few things have remained consistent in my thinking:

there is value in a purely morphological analysis that doesn't disambiguate on syntactic or semantic grounds
this analysis does not need the notion of parts-of-speech beyond purely Morphological Parts of Speech
this analysis should not attempt to distinguish middles and passives in the present or perfect system

As part of the handling of syncretism and underspecification, I had originally suggested a need for a value for the case property that didn't distinguish nominative and accusative and a need for a value for the gender property like "non-neuter".

In the absence of feedback beyond a vague feeling that something like this should be done, I didn't immediately make further progress but, a year later, started gathering more notes on handling ambiguity. That then led to a more concrete proposal just around gender and case (although not without open questions).

I've now implemented those smaller-scale proposals as a first draft for the MorphGNT SBLGNT and plan to apply them to other GNT texts soon. The new-tags branch for MorphGNT SBLGNT is available at: https://github.com/morphgnt/sblgnt/tree/new-tags.

This adds a new column (the intention is not to replace existing analyses yet, just augment them) that:

makes voice formal not functional (while still using P in the aorist and future for what Carl Conrad would call MP2)
does not give morphosyntactic properties for uninflected words
implements basic nominative/accusative case syncretism in the neuter with a single value
implements basic non-neuter, non-feminine, and (in most genitive plurals) complete gender syncretism with a value for each

One immediate effect of this is that a list I have from Randall Tan of disagreements between the MorphGNT SBLGNT analysis and that of the Nestle 1904 largely goes away because many of them were merely different judgements of gender or case on non-morphological grounds. This new tag retains the uncertainty. Another benefit of the tagging scheme is that it provides a reasonable output for an automated morphological analysis system which can then, in a separate step, be disambiguated syntactically (or semantically), potentially with human input.

There are some important things to note, however, as just saying "this is a purely morphological analysis that doesn't disambiguate" oversimplifies things greatly.

Firstly, while punting distributional and semantic part-of-speech questions like "is this an adverb or a conjunction" or "what type of pronoun is this" is extremely helpful, there are still some questions that impact a purely morphological tagging such as whether to represent a fossilised verb acting as a particle as having morphological inflection.

Secondly, there are what I have called extended syncretisms not modelled where there can be uncertainty between properties taken as a pair. For example 1st person singular vs 3rd person plural in -ον, or 1st declension genitive singular vs accusative plural in -ας. It may be worth still conveying this ambiguity but just through disjunction, saying for example that a word is GSF^APF. These are almost always phonological coincidences rather than structural syncretism and so should be modelled differently.

Related to this is the "double" syncretism between accusative singular masculine and neuter on the one hand and nominative and accusative singular neuter on the other hand. If we model the latter as CSN then we've lost the former (which, if by itself could be modelled as ASY). So, in a sense CSN and ASY are syncretic (but also share an overlapping cell). CSN^ASY doesn't quite seem right because of that overlap and the fact that this isn't just a phonological coincidence as best I know.
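To make the overlap concrete, here is a rough sketch that expands underspecified tags into the fully specified cells they cover. It assumes three-letter case-number-gender strings like the GSY, CSN and ASY above, with C read as "nominative or accusative" and Y as "masculine or neuter"; those readings are inferred from the examples in this post, the actual column in the new-tags branch may be laid out differently, and the function names `expand` and `expand_disjunction` are hypothetical.

```python
from itertools import product

# Assumed value sets: only C and Y are underspecified in this sketch.
CASE_VALUES = {"N": {"N"}, "G": {"G"}, "D": {"D"}, "A": {"A"}, "C": {"N", "A"}}
NUMBER_VALUES = {"S": {"S"}, "P": {"P"}}
GENDER_VALUES = {"M": {"M"}, "F": {"F"}, "N": {"N"}, "Y": {"M", "N"}}


def expand(tag):
    """Return the set of fully specified cells covered by one tag."""
    case, number, gender = tag
    return {
        c + n + g
        for c, n, g in product(
            CASE_VALUES[case], NUMBER_VALUES[number], GENDER_VALUES[gender]
        )
    }


def expand_disjunction(tags):
    """Expand a disjunction written with ^, such as GSF^APF."""
    cells = set()
    for tag in tags.split("^"):
        cells |= expand(tag)
    return cells


print(sorted(expand("CSN")))                  # ['ASN', 'NSN']
print(sorted(expand("ASY")))                  # ['ASM', 'ASN']
print(sorted(expand("CSN") & expand("ASY")))  # ['ASN']
print(sorted(expand_disjunction("GSF^APF")))  # ['APF', 'GSF']
```

The shared ASN cell is the overlap that makes a plain CSN^ASY disjunction feel wrong: the two tags are not independent alternatives in the way GSF and APF are.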

Thirdly, I have only modelled basic syncretism, not endings in wildly different parts of the paradigms (so would definitely not be called syncretism) that also happen to have converged by phonological change. For example both -ου and -ον can be nominal endings or unrelated verbal endings (with quite a few interpretations, mind you, especially for -ου). No attempt has been made to capture this in a single tag (although a disjunctive representation might be possible).

And finally (although related to the previous point), a certain amount of lexical disambiguation is applied. There are many cases where not being familiar with the lexeme makes a form highly ambiguous but that ambiguity goes away if the lemma is known. A simple example is imperfects versus second aorists where the principal parts resolve the ambiguity. The draft new tags for MorphGNT SBLGNT effectively assume the lemmatisation has been done and is correct.

In light of this, some people might be surprised, therefore, that υἱοῦ is tagged GSY and not GSM given it's lexically masculine. My current argument (at least in my own head) is that, regardless of a specific lexeme like υἱοῦ, GSM, as a morphological tag, doesn't really make sense in the Greek paradigmatic system because, by nature, genitive singulars have the same form in the non-feminines. I think there's definitely a difference, if subtle, between true ambiguity and underspecification. It's not that υἱοῦ is ambiguous as to gender, it's just that the cell doesn't distinguish masculine from neuter. Lexical knowledge is still being used, otherwise it could be feminine (or even a middle imperative!).

So, in short, syncretism inherent to the paradigmatic system is captured well but other forms of ambiguity will need to be handled other ways (potentially via a disjunctive list of possibilities). This seems a reasonable, practical compromise.

Let me know your thoughts. There's definitely still more to do and I do plan on expressing more ambiguity with some form of disjunction. I'll probably do a post soon with some more thoughts (and stats) on that.

Lexical Dispersion in the Greek New Testament Via Gries’s DP

2018-01-21

Measures of dispersion are interesting to apply to a corpus because they tell you whether a word is distributed across parts of the corpus as expected or concentrated more in just some parts. I thought I’d play around with Gries’s DP as a measure of dispersion on the SBLGNT lemmas.

There are lots of measures of dispersion but Stefan Th. Gries's is perhaps the simplest (see [1] for a detailed survey of lots of different measures as well as the original definition of his own).

Here it is in Python for lemmas:

dp = sum(abs((p[part] / t) - (lp[lemma][part] / l[lemma])) for part in p) / 2

where:

p is a dictionary mapping each corpus part to the count of words in that part
t is the total count of words in the corpus (the sum of the values in p)
l is a dictionary mapping each lemma to the count of that lemma in the corpus
lp is a dictionary of dictionaries mapping lemmas and parts to the count of the lemma in that part

but see [1] for some simple worked examples.
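For anyone wanting to try it, here is a self-contained sketch that builds those counts from a toy corpus and applies the same formula. The function name `gries_dp` and the input shape (a dictionary from part name to a list of lemmas) are assumptions made for this example rather than how my own script is organised. Using `Counter` means a lemma that is absent from a part counts as zero instead of raising a `KeyError`.

```python
from collections import Counter, defaultdict


def gries_dp(lemmas_by_part):
    """Return a dict mapping each lemma to its DP over the given parts."""
    # p: size of each part; t: size of the whole corpus
    p = {part: len(lemmas) for part, lemmas in lemmas_by_part.items()}
    t = sum(p.values())

    # l: corpus-wide count per lemma; lp: per-part count per lemma
    # (names kept the same as in the formula above)
    l = Counter()
    lp = defaultdict(Counter)
    for part, lemmas in lemmas_by_part.items():
        for lemma in lemmas:
            l[lemma] += 1
            lp[lemma][part] += 1

    return {
        lemma: sum(abs(p[part] / t - lp[lemma][part] / l[lemma]) for part in p) / 2
        for lemma in l
    }


toy = {
    "A": ["the", "cat", "the"],
    "B": ["the", "dog", "the"],
}
for lemma, dp in sorted(gries_dp(toy).items(), key=lambda item: item[1]):
    print(f"{dp:.4f} {lemma}")
```

In the toy corpus the two parts are the same size, "the" is split evenly between them and gets a DP of 0, while "cat" and "dog" each occur in only one half of the corpus and get 0.5.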

One thing Gries doesn't talk about (email me if you know of any discussion of this) is how to handle very low frequency words as they'll dominate the high DP values.

Using books as the parts, here are the top 10 most evenly dispersed lemmas in the GNT:

0.0466 ὁ
0.1085 εἰς
0.1154 καί
0.1178 ὅς
0.1250 εἰμί
0.1358 ποιέω
0.1382 γίνομαι
0.1385 πολύς
0.1395 μετά
0.1420 μή

Here are the top 10 least evenly dispersed lemmas (including all frequencies, even hapax legomena):

0.9984 φιλοπρωτεύω
0.9984 ἐπιδέχομαι
0.9984 μειζότερος
0.9984 Διοτρέφης
0.9984 φλυαρέω
0.9982 χάρτης
0.9982 κυρία
0.9976 προσοφείλω
0.9976 ἑκούσιος
0.9976 ἄχρηστος

but this list looks very different if we, say, restrict ourselves to lemmas that occur 5 times or more:

0.9827 ἀντίχριστος
0.9752 καταλαλέω
0.9687 ἐπιφάνεια
0.9681 νήφω
0.9680 ἀρετή
0.9667 μῦθος
0.9641 Μελχισέδεκ
0.9568 πλεονεκτέω
0.9557 νόημα
0.9532 ἐνέργεια

or 30 times or more:

0.8952 ἀρνίον
0.8085 καυχάομαι
0.8024 θηρίον
0.7987 μέλος
0.7969 εἴτε
0.7266 συνείδησις
0.7202 περιτομή
0.7199 θρόνος
0.7139 ὑποτάσσω
0.7116 Παῦλος

If we use chapters as the corpus division, we get a slightly different top ten most evenly distributed by Gries's DP:

0.0677 ὁ
0.1440 καί
0.1913 εἰμί
0.2084 εἰς
0.2117 αὐτός
0.2259 ἐν
0.2366 οὗτος
0.2378 ὅς
0.2437 δέ
0.2561 οὐ

and obviously this is even more problematic for lower frequency words at the other end.

It's interesting to look, though, at chapters within a single book. For example, here are the most evenly distributed lemmas in John's gospel using chapters for parts:

0.0574 ὁ
0.0867 καί
0.0977 αὐτός
0.1331 οὐ
0.1391 οὗτος
0.1440 ὅτι
0.1480 λέγω
0.1569 δέ
0.1576 εἰμί
0.1658 εἰς

and here are the least evenly distributed lemmas that occur at least 10 times:

0.9470 σταυρόω
0.9414 Ἀβραάμ
0.9126 νίπτω
0.8958 Πιλᾶτος
0.8914 πρόβατον
0.8812 Λάζαρος
0.8493 καρπός
0.8426 ἄρτος
0.8371 προσκυνέω
0.8221 ψυχή

Obviously Gries's DP is extremely easy to calculate, and I plan to experimentally include it in the Greek Vocabulary Tool for the Perseus Project but there are still some things to work out with low frequency words.

It's very interesting, though, as a way of contrasting words that otherwise have the same frequency in a corpus. For example, here are all the lemmas that occur exactly 30 times in the SBLGNT, with their book-based Gries's DP:

0.3276 διδαχή
0.3558 ἐγγύς
0.3708 σκότος
0.4143 ἀγοράζω
0.5360 σκανδαλίζω
0.5833 συνέρχομαι
0.6230 ἴδε
0.6485 ἐπικαλέω
0.7266 συνείδησις
0.8952 ἀρνίον

There is a massive range in the DP which I think is quite illustrative.

Here is the list with their chapter-based DP (notice how high the lowest DP now is):

0.8769 ἀγοράζω
0.8821 σκότος
0.8869 συνέρχομαι
0.8958 σκανδαλίζω
0.9016 ἐγγύς
0.9016 διδαχή
0.9034 ἴδε
0.9083 ἐπικαλέω
0.9441 συνείδησις
0.9609 ἀρνίον

One of my reasons for exploring Gries's DP (and potentially other measures of lexical dispersion) is the application to language learning. My sense is that dispersion might be a useful input to deciding what vocabulary to learn. For example διδαχή or σκότος might be better to learn before ἀρνίον because, even though they all have the same frequency, you are more likely to encounter διδαχή or σκότος in a random book or chapter.
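As a sketch of how that might work in practice, one simple option is to order vocabulary by frequency first and break ties by DP, so that among equally frequent lemmas the more evenly dispersed come first. This is only one possible scheme, not something the Greek Vocabulary Tool does, and the name `learning_order` is hypothetical. The numbers are the book-based values from the list above.

```python
# (lemma, frequency in the SBLGNT, book-based DP)
VOCAB = [
    ("ἀρνίον", 30, 0.8952),
    ("διδαχή", 30, 0.3276),
    ("συνείδησις", 30, 0.7266),
    ("σκότος", 30, 0.3708),
]


def learning_order(items):
    """Sort by descending frequency, then by ascending DP as a tie-break."""
    return sorted(items, key=lambda item: (-item[1], item[2]))


for lemma, frequency, dp in learning_order(VOCAB):
    print(frequency, f"{dp:.4f}", lemma)
```

With these four lemmas the order comes out as διδαχή, σκότος, συνείδησις, ἀρνίον.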

[1] Gries, Stefan Th. (2008) Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13:4. John Benjamins.

Some Unix Command Line Exercises Using MorphGNT

2017-12-24

I thought I’d help a friend learn some basic Unix command line tools (a small set, although pretty comprehensive for this type of work) with some practical graded exercises using MorphGNT. It worked out well so I thought I’d share them in case they are useful to others.

The point here is not to actually teach how to use bash or commands like grep, awk, cut, sort, uniq, head or wc but rather to motivate their use in a gradual fashion with real use cases and to structure what to actually look up when learning how to use them.

This little set of commands has served me well for over twenty years working with MorphGNT in its various iterations (although I obviously switch to Python for anything more complex).

Task 0

Clone https://github.com/morphgnt/sblgnt in git.

Task 1

Using wc and the concept of wildcards/globbing (and relying on the fact I have one line-per-word in those files) work out how many words are in the main text of SBLGNT.

Task 2

Using grep and wc work out how many times μονογενής appears. (You might be able to do it with just grep and appropriate options, but try using grep without options and wc and understand the concept of "piping" the output of one command to the input of another)

Task 3

How many verbs (tokens) are there in John’s gospel? (still doable just with grep and wc)

Task 4

How many unique verbs (lemmas) are there in John’s gospel?

(learn how to use awk to extract fields, and how to use sort and uniq in tandem)

Task 5

What are the 5 most common verbs (lemmas) in John’s gospel? (you might want to use head)

Task 6

Get counts in John’s Gospel of how many tokens appear in each tense/aspect (hint: use cut) and write the results to a file called john.txt rather than just output it in the terminal.

Task 7

Come up with your own question that you think could be answered using the types of operations and try it out.

SBL Papers Now Online

2017-11-22

I’ve put my two SBL papers this year (from both the recent Annual Meeting and the International Meeting) online and also sync’d my Annual Meeting slides to audio I recorded on my iPhone.

SBL 2017 Annual: Linking Lexical Resources for Biblical Greek
[slides] [video]
SBL 2017 International: The Route to Adaptive Learning of Greek
[slides]

For completeness, here are my other SBL talks:

SBL 2016 Annual: An Online Adaptive Reading Environment for the Greek New Testament
[slides]
SBL 2015 Annual: A Morphological Lexicon of New Testament Greek
[slides]

Speaking at SBL 2017 on Linking Lexical Resources

2017-11-18

I’m again speaking at the SBL Annual Meeting, this time in Boston. My topic is basically the “lemma lattice” work started by Ulrik Sandborg-Petersen and me back in 2006 but which I’ve never presented in this sort of setting before.

Here's the official abstract:

Linking Lexical Resources for Biblical Greek

As more resources for Biblical Greek, both old and new, become openly available, the opportunities for integrating them become greater. At the level of the word, it might seem a trivial task to match based on lemma. But no two texts are lemmatised the same way and no two lexicons will make the same choices of headwords. Numerical solutions such as Strongs and Goodrick-Kohlenberger solve some problems but introduce new ones. After surveying the various issues and challenges, this talk will provide both a framework for moving forward and a report on practical ways that a variety of texts, lexicons, and other resources such as principal-part lists are being linked in the service of open, biblical digital humanities.

I'll certainly post my slides after my talk but I'll also try to record it on my iPhone like I did at BibleTech 2015.

Four Types of But

2017-11-03

In his talk on adversative conjunction in Gothic at the 29th UCLA Indo-European Conference, Jared Klein started with a wonderful example paragraph in English.

In order to finish the project, I don't need money but₂ time. I would like to be done by the end of this year, but₃ I don't think that is going to happen. Nobody is to blame for this but₁ me, because I've wasted a lot of time on things that have proved to be irrelevant. But₄ this is too depressing; let's talk about something else.

He went on to talk about the Gothic equivalents for each but I thought it was a great illustration of four distinct types of adversatives all using "but" in English.

Klein didn't necessarily use the following terms but the four could be described as:

prepositional
phrasal
clausal
discourse

A Tour of Greek Morphology: Part 19

2017-11-02

Part nineteen of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

It's now time to do for the middle forms what we did for the actives in part 16, namely come up with the rules to help disambiguate inflectional classes. These were sketched out in theory in part 14 but now it's time to actually write the rules and test them in code against the SBLGNT.

This is what my Python script does:

INF:Xεσθαι or 3SG:Xεται or 2PL:Xεσθε	is	PM-1 if lemma ends in ω or ομαι; PM-7 if lemma ends in ημι
1SG:Xομαι or 1PL:Xόμεθα or 3PL:Xονται	is	PM-8 if lemma ends in δίδομαι; PM-1 if lemma ends in ω or otherwise ends in ομαι
1SG:Xοῦμαι or 3PL:Xοῦνται	is	PM-2 if lemma ends in έω or έομαι; PM-3 if lemma ends in όω or όομαι
1SG:Xῶμαι or 1PL:Xώμεθα or 3PL:Xῶνται	is	PM-5 if lemma ends in χράομαι; PM-4 if lemma otherwise ends in άομαι
2SG:Xῇ	is	PM-2 if lemma ends in έω or έομαι; PM-5 if lemma ends in άομαι
1PL:Xύμεθα	is	PM-2 if lemma ends in έω or έομαι; PM-3 if lemma ends in όω or όομαι (not needed in SBLGNT); PM-5 otherwise (not needed in SBLGNT)
3SG:Xεῖται or 2PL:Xεῖσθε	is	PM-2 if lemma ends in έω or έομαι; PM-11 if lemma ends in εῖμαι
1PL:Xείμεθα	is	PM-11 if lemma is κεῖμαι; PM-11-COMPOUND otherwise (not needed in SBLGNT)
INF:Xεῖσθαι	is	PM-2 if lemma ends in έω or έομαι; PM-11 if lemma is κεῖμαι (not needed in SBLGNT); PM-11-COMPOUND otherwise
INF:Xῆσθαι	is	PM-10-COMPOUND if lemma is κάθημαι; PM-5 otherwise (not needed in SBLGNT)

I decided to cover a bunch of ambiguities not specifically needed by the SBLGNT—not strictly necessary but it will help when the script is extended to run on a larger corpus.

Note the special-casing of δίδομαι, κεῖμαι, κάθημαι, and χράομαι. χράομαι is an example, like ζάω in part 16, that is misleadingly lemmatized with an alpha. More on that later!
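A minimal sketch of how a few of these rules could be encoded in Python follows. It is not the actual script: the names `RULES` and `disambiguate_middle` are hypothetical, only three rows of the table are covered, and forms are assumed to arrive as a cell label plus a normalized form and lemma.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so tonos/oxia variants of the same letter compare equal."""
    return unicodedata.normalize("NFC", text)


# Each rule: form ending per cell, then ordered (lemma endings, class) choices.
# Only three rows of the rules table are encoded here (illustrative subset).
RULES = [
    (
        {"INF": "εσθαι", "3SG": "εται", "2PL": "εσθε"},
        [(("ημι",), "PM-7"), (("ω", "ομαι"), "PM-1")],
    ),
    (
        {"1SG": "ομαι", "1PL": "όμεθα", "3PL": "ονται"},
        [(("δίδομαι",), "PM-8"), (("ω", "ομαι"), "PM-1")],
    ),
    (
        {"1SG": "ῶμαι", "1PL": "ώμεθα", "3PL": "ῶνται"},
        [(("χράομαι",), "PM-5"), (("άομαι",), "PM-4")],
    ),
]


def disambiguate_middle(cell, form, lemma):
    """Return one PM class for an ambiguous form, or None if no rule applies."""
    form, lemma = nfc(form), nfc(lemma)
    for endings, choices in RULES:
        ending = endings.get(cell)
        if ending is None or not form.endswith(nfc(ending)):
            continue
        # More specific lemma endings are listed first, so the first hit wins.
        for lemma_endings, cls in choices:
            if lemma.endswith(tuple(nfc(e) for e in lemma_endings)):
                return cls
    return None


print(disambiguate_middle("3SG", "ἔρχεται", "ἔρχομαι"))
print(disambiguate_middle("3PL", "καυχῶνται", "καυχάομαι"))
```

Listing the special cases (δίδομαι, χράομαι) ahead of the general lemma endings keeps the "otherwise" wording of the table without extra branching.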

We now have an inflectional class for all 820 present middle infinitive or indicative forms in the MorphGNT SBLGNT.

You can download the entire output of my Python script here.

Are there multiple classes for a particular lexeme (like there was in the active)?

Two of the 167 lexemes show multiple classes:

δύναμαι: PM-9 normally but a 2SG:δύνῃ that comes up as a PM-1 (PM-9 would predict a Xασαι)
κάθημαι: PM-10-COMPOUND normally but a 2SG:κάθῃ that comes up as a PM-1 (PM-10-COMPOUND would predict a Xησαι)

If κάθῃ were καθῇ, we'd have the possibility of reanalysis as a PM-5 and it's still possible that's what's going on and the accentuation just doesn't reflect that.

δύνῃ for δύνασαι is somewhat less expected and it should be noted that both forms appear in the SBLGNT, sometimes within the same author. That the PM-4 2SG all show up with an un-contracted ᾶσαι adds slightly more mystery.

For now we'll leave δύνῃ and κάθῃ as PM-1 but we'll revisit them later.

In the next part, we'll look at counts for the present middles across the SBLGNT.

Off to the UCLA Indo-European Conference

2017-11-01

Tomorrow I’m off to Los Angeles for the Twenty-Ninth Annual UCLA Indo-European Conference.

Indo-European studies are notoriously impenetrable, even for linguists, but a couple of months ago, I finally decided now was the time to attend this major conference (to the extent an IE conference can be "major").

I'm not great at conferences at the best of times, especially when I'm not a speaker and/or don't know very many people, so this will be quite a stepping-out-of-the-comfort-zone for me.

But as an aspiring comparative philologist, I'm sure it's going to be very rewarding for me.

A Tour of Greek Morphology: Part 18

2017-10-27

Part eighteen of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In Part 13 we summarised the present active endings and in part 15 posed the question "Do these paradigms cover all the forms in the Greek New Testament?"

Now we're going to answer the same question for the middle endings summarised in part 14.

Again, I've written a short Python program that reveals there are 16 forms in 23 instances that do NOT match.

Two of these forms are of κάθημαι: the 1SG itself plus the 3SG κάθηται. The 3SG bears a resemblance to the PM-5 3SG (differing only in accent) but this is not a circumflex verb. The existence of the η in the 1SG rather than an ῶ indicates this is an athematic verb. It is in fact a compound verb κατά+ἧμαι.

We don't have a paradigm class for ἧμαι OR its compounds so let's add them now.

	PM-10	PM-10-COMPOUND
INF	ἧσθαι	Xῆσθαι
1SG	ἧμαι	Xημαι
2SG	ἧσαι	Xησαι
3SG	ἧται	Xηται
1PL	ἥμεθα	Xήμεθα
2PL	ἧσθε	Xησθε
3PL	ἧνται	Xηνται

(we don't actually need PM-10 for the SBLGNT but I've included it for completeness)

Next we have κεῖμαι and ITS compounds which account for 10 more forms. Here again we have an athematic verb with a vowel we haven't covered before.

	PM-11	PM-11-COMPOUND
INF	Xεῖσθαι	Xεῖσθαι
1SG	Xεῖμαι	Xειμαι
2SG	Xεῖσαι	Xεισαι
3SG	Xεῖται	Xειται
1PL	Xείμεθα	Xείμεθα
2PL	Xεῖσθε	Xεισθε
3PL	Xεῖνται	Xεινται

Note that INF and 1PL are identical between the two of them (so will be an ambiguity we'll need to cover, although not for the SBLGNT).

Our next word is οἶμαι which only appears in the SBLGNT in the 1SG. We won't reconstruct the entire paradigm (we may come back to it later) but will use PM-12 to designate the οἶμαι form.

This leaves us with three forms, all 2SG:

καυχᾶσαι
ὀδυνᾶσαι
κατακαυχᾶσαι

In all cases, this looks a lot like a PM-4 that just hasn't dropped the sigma in -ᾶσαι to form -ᾷ. In fact, all the PM-4s in the SBLGNT seem to have this behaviour so we probably shouldn't treat it as a separate paradigm but rather an alternative realisation within the PM-4 2SG cell (similar to Xῃ/Xει in the PM-1). We'll discuss in a later post why PM-4 might exhibit this when other circumflex middle paradigms don't seem to.

But with this tweak and the additions of PM-10, PM-10-COMPOUND, PM-11, PM-11-COMPOUND, and PM-12 we now have full coverage of the present middle indicatives and infinitives in the SBLGNT.

You may be wondering whether we could have just identified these paradigms way back when we first laid out the different present middle paradigms. We absolutely could have. But I think the way we've discovered them demonstrates an important concept: that of rigorously testing a linguistic model against a corpus.

This whole blog series is, in fact, laying the groundwork for a rigorous description of Greek morphology that has been my goal to write for many years.

But coming back to the short term: we still have to explore the disambiguation of assigning inflectional classes to the middle forms, like we did for the actives in part 16. We'll do that in the next part.

A Tour of Greek Morphology: Part 17

2017-10-16

Part seventeen of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

As mentioned in the last post in the series, we now have an inflectional class for all 5,314 present active infinitive or indicative forms in the MorphGNT SBLGNT in a file that looks like the following:

010120 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010123 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010202 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010206 εἶ 2SG PA-10 εἰμί PA-10
010213 μέλλει 3SG PA-1 μέλλω PA-1
010213 ζητεῖν INF PA-2 ζητέω PA-2
010218 εἰσί(ν) 3PL PA-10 εἰμί PA-10
010222 βασιλεύει 3SG PA-1 βασιλεύω PA-1
010303 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010309 λέγειν INF PA-1 λέγω PA-1
010309 ἔχομεν 1PL PA-1/PA-8 ἔχω PA-1

Where the columns are:

the book/chapter/verse reference
the normalized form
the morphosyntactic properties
the inflectional classes possible without disambiguation
the lemma
the disambiguated inflectional class
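As a small sketch of how a file in this format could be read, assuming whitespace-separated columns exactly as in the sample lines above (the `Row` type and `parse_line` function are hypothetical names, not part of the original script):

```python
from collections import namedtuple

# One record per token; field names are illustrative labels for the six columns.
Row = namedtuple("Row", "ref form cell possible lemma resolved")


def parse_line(line):
    """Split one output line into its six columns.

    The fourth column may hold several classes joined by "/", so it is
    returned as a list.
    """
    ref, form, cell, possible, lemma, resolved = line.split()
    return Row(ref, form, cell, possible.split("/"), lemma, resolved)


row = parse_line("010309 ἔχομεν 1PL PA-1/PA-8 ἔχω PA-1")
print(row.cell, row.possible, row.resolved)
```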

Now it's time to do some counts.

Let us first of all look at the number of distinct lemmas in each of our 13 classes.

The numbers for classes PA-5 and above are low enough that we should look at them individually:

PA-1	barytone omega verbs	338
PA-2	circumflex omega verbs with INF -εῖν / 3SG -εῖ	145
PA-3	circumflex omega verbs with INF -οῦν / 3SG -οῖ	21
PA-4	circumflex omega verbs with INF -ᾶν / 3SG -ᾷ	31
PA-5	ζάω + compound (συζάω)	2
PA-6a	ὀμνύω; δείκνυμι + compound (ἀμφιέννυμι)	3
PA-7	τίθημι + compounds (ἐπιτίθημι παρατίθημι περιτίθημι); compounds of ἵημι (ἀφίημι συνίημι)	6
PA-8	δίδωμι + compounds (διαδίδωμι ἀποδίδωμι μεταδίδωμι παραδίδωμι)	5
PA-9	compounds of ἵστημι (καθίστημι μεθίστημι συνίστημι); compound of φημί (σύμφημι); that one weird case of συνίημι	5
PA-9-ENC	φημί	1
PA-10	εἰμί	1
PA-10-COMP	compounds of εἰμί (ἄπειμι ἔξεστι(ν) πάρειμι)	3
PA-11-COMP	compounds of εἶμι (ἔξειμι εἴσειμι)	2

Notice that even the small counts are elevated due to compound verbs. Folding compounds of the same base verb, the classes from PA-5 on have only one or two members.

class	lemmas	tokens	hapax	hapax details
PA-1	338	2563	151
PA-2	145	856	65
PA-3	21	35	15
PA-4	31	117	16
PA-5	2	41	1	συζάω
PA-6a	3	5	2	ὀμνύω ἀμφιέννυμι
PA-7	6	37	3	εἴσειμι παρίστημι παρατίθημι
PA-8	5	35	2	διαδίδωμι μεταδίδωμι
PA-9	5	9	3	συνίημι σύμφημι μεθίστημι
PA-9-ENC	1	22	0
PA-10	1	1551	0
PA-10-COMP	3	39	1	ἄπειμι
PA-11-COMP	2	4	1	εἴσειμι
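Counts of this kind can be sketched in a few lines of Python. This is an illustration under assumptions rather than the code that produced the table: the input is taken to be one (lemma, disambiguated class) pair per token, and a hapax is taken to mean a lemma occurring exactly once within a class. `class_stats` is a hypothetical name.

```python
from collections import Counter, defaultdict


def class_stats(pairs):
    """Summarise lemmas, tokens and hapax legomena per inflectional class.

    pairs: iterable of (lemma, inflectional_class), one entry per token.
    """
    per_class = defaultdict(Counter)
    for lemma, cls in pairs:
        per_class[cls][lemma] += 1

    stats = {}
    for cls, counts in per_class.items():
        stats[cls] = {
            "lemmas": len(counts),           # distinct lemmas (types)
            "tokens": sum(counts.values()),  # total occurrences
            "hapax": sorted(l for l, n in counts.items() if n == 1),
        }
    return stats


sample = [("λέγω", "PA-1"), ("λέγω", "PA-1"), ("μέλλω", "PA-1"), ("ζητέω", "PA-2")]
for cls, row in sorted(class_stats(sample).items()):
    print(cls, row["lemmas"], row["tokens"], len(row["hapax"]), " ".join(row["hapax"]))
```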

Why do the hapax legomena matter? Well they give an indication of what classes were still productive.

Note, however, that the hapax in PA-5 and above are VERY low in number and, with the exception of ὀμνύω in PA-6a, they are all compounds. This strongly suggests that only PA-1, PA-2, PA-3, and PA-4 were productive.

Notice that the token numbers for PA-6a, PA-9 and PA-11-COMP are particularly low too. Potentially relevant in the case of PA-6a and PA-9 is that these are the classes most likely to have developed thematic alternatives. This might be worthy of a future post in this series!

Let's now look at counts for each paradigm cell for each class:

	PA-1	PA-2	PA-3	PA-4	PA-5	PA-6a	PA-7	PA-8	PA-9	PA-9-ENC	PA-10	PA-10-COMP	PA-11-COMP
INF	394	171	5	21	13	1	11	10	1	-	124	3	3
1SG	460	116	3	21	6	1	7	10	2	4	138	1	-
2SG	164	46	-	5	2	-	-	1	-	-	92	1	-
3SG	923	295	16	35	13	3	11	13	5	17	896	31	-
1PL	141	52	2	19	5	-	1	-	-	-	52	1	-
2PL	218	99	4	8	1	-	4	-	-	-	93	1	-
3PL	263	77	5	8	1	-	3	1	1	1	156	1	1
	2563	856	35	117	41	5	37	35	9	22	1551	39	4

What is obvious from this is just how important, regardless of inflectional class, the 3SG form is. The INF is also very important. We've seen in a previous post that both cells are very good predictors of inflectional class (much better than 1SG) but they are also just both very common. The 1SG, despite being a bad predictor, is still important in terms of frequency.

The 3PL is a distant fourth with one apparent deviation: it is very common in PA-10 (i.e. the copula), more so than the INF or 1SG. In fact, the proportion of 3PL in this class is actually average; it's the INF and 1SG that are unusually low (with much of the frequency drop taken up by the 3SG).

As well as εἰμί, φημί (PA-9-ENC) is also disproportionately 3SG.

Of course, given how common PA-1 is, even the plurals there outnumber the most common cells in the other classes.

If the goal is just to identify the person/number, not the class, (which is true in reception but not learning) then a lot of those numbers collapse because of shared endings. Here are the counts just focused on the common endings (without accents):

INF	-ν	604
	-ναι	153
1SG	-ω	606
	-μι	163
2SG	-{ι}ς	217
	-ς	1
	(-)ει	93
3SG	-{ι}	1282
	-σι(ν)	49
	(-)εστι(ν)	927
1PL	-μεν	273
2PL	-τε	448
3PL	-σι(ν)	511
	-ασι(ν)	7
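Comparing endings "without accents" needs some care with Unicode, since accents and breathings are usually stored in precomposed characters. Here is a minimal sketch, assuming only the three accent marks (grave, acute, circumflex) should be removed while breathings and iota subscripts are kept; `strip_accents` and `has_ending` are hypothetical helper names.

```python
import unicodedata

# Combining grave, acute, and Greek perispomeni (circumflex).
ACCENTS = {"\u0300", "\u0301", "\u0342"}


def strip_accents(form):
    """Remove accents but keep breathings, diaeresis and iota subscript."""
    decomposed = unicodedata.normalize("NFD", form)
    kept = "".join(ch for ch in decomposed if ch not in ACCENTS)
    return unicodedata.normalize("NFC", kept)


def has_ending(form, ending):
    """True if the form ends with the given ending, ignoring accents."""
    return strip_accents(form).endswith(strip_accents(ending))


print(strip_accents("ζητεῖν"))
print(has_ending("διδόναι", "ναι"))
```

Decomposing to NFD first means the same function handles a letter whether it was typed with tonos or with oxia.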

This just emphasises even more (even though it was in the previous table) that there is only 1 2SG in -ς (without an iota, subscripted or otherwise): the παραδίδως in Luke 22.48.

The 7 3PLs in -ασι(ν) are:

τιθέασι(ν) in Matt 5.15
ἐπιτιθέασι(ν) in Matt 23.4
περιτιθέασι(ν) in Mark 15.17
φασί(ν) in Rom 3.8
συνιᾶσι(ν) in 2Co 10.12
εἰσίασι(ν) in Heb 9.6
διδόασι(ν) in Rev 17.13

One could argue that these are subsumed by saying 3PL ends in -σι(ν) but given that, in the very same lexemes, -σι(ν) can also indicate 3SG, it is useful calling out the α, even though the root vowel alternation is enough to distinguish singular and plural.

That's it (for now) for counts of the present actives. In the next couple of posts, we'll turn to the middle forms.

pyuca 1.2 Released with Support for New Versions of Unicode

2017-09-25

pyuca is my pure-Python implementation of the Unicode Collation Algorithm—a library I use almost every day to properly sort Greek (although the library is not Greek-specific). I was recently asked how to use pyuca with a more recent DUCET than 6.3.0. That led to me needing to make a number of changes to the core code so it now supports 8.0.0, 9.0.0 and 10.0.0 as long as you have the right Python version.

pyuca has always supported custom collation element tables, but when someone tried the DUCET from Unicode 8.0.0, the test suite failed.

At first I thought perhaps that was because the test suite is from 6.3.0 (or 5.2.0 if running Python 2.7) but when I got around to trying the 8.0.0 test suite on the 8.0.0 DUCET it too failed.

It turned out that the Unicode Consortium had made a few changes to which code points are considered CJK Unified Ideographs. These ranges are hard-coded in pyuca because they're required for implementing the implicit weight calculations (weights for certain CJK ideographs are calculated programmatically rather than explicitly listed in the DUCET).
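To make the dependency on the ideograph ranges concrete, here is a sketch of the implicit weight calculation as described in UTS #10. It is not pyuca's code: `implicit_weight` is a hypothetical name, and the base value is passed in by the caller precisely because choosing it requires knowing which code points a given Unicode version treats as unified ideographs.

```python
# Base values as I understand them from UTS #10 (treat as assumptions):
BASE_CORE_HAN = 0xFB40   # unified ideographs in the main CJK / compatibility blocks
BASE_OTHER_HAN = 0xFB80  # other unified ideographs (extension blocks)
BASE_OTHER = 0xFBC0      # any other code point without an explicit weight


def implicit_weight(code_point, base):
    """Return the two collation elements derived for a code point.

    Each element is a (primary, secondary, tertiary) tuple.
    """
    aaaa = base + (code_point >> 15)
    bbbb = (code_point & 0x7FFF) | 0x8000
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]


for cp, base in [(0x4E00, BASE_CORE_HAN), (0x20000, BASE_OTHER_HAN)]:
    elements = implicit_weight(cp, base)
    print(hex(cp), [tuple(hex(w) for w in el) for el in elements])
```

If a new Unicode version moves a code point from one category to another, the base changes and so does the resulting sort key, which is consistent with the test suite failures described above.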

In 9.0.0 the collation element table format was slightly changed to add a new @implicitweights directive so for things to work with 9.0.0, I had to implement that. Then in 10.0.0, more changes were made to what code points are considered CJK Unified Ideographs.

It didn't stop there, though. Because pyuca relies on Python's unicodedata library for getting information on character categories, certain versions of Python won't work with certain versions of Unicode.

So I added some logic (both to pyuca itself, and to the test suite) to use the appropriate collation code (with the right implicit weight calculations) and appropriate DUCET depending on what version of Python you are running.

Some of this dispatching-based-on-Python-version had already been written by Chris Beaven, Paul McLanahan, and Michal Čihař as part of their backporting of pyuca to 2.7 (after I'd declared I'd only support 3). So I just extended this with the following results:

Python 2.7: test and use 5.2.0
Python 3.3: test 5.2.0, 6.3.0 and use 6.3.0 by default
Python 3.4: test 5.2.0, 6.3.0 and use 6.3.0 by default
Python 3.5: test 5.2.0, 6.3.0, 8.0.0 and use 8.0.0 by default
Python 3.6: test 5.2.0, 6.3.0, 8.0.0, 9.0.0 and use 9.0.0 by default
Python 3.7-dev: test 5.2.0, 6.3.0, 8.0.0, 9.0.0, 10.0.0 (so we're ready)
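The mapping above amounts to a simple dispatch on the interpreter version. The following is an illustrative sketch only, not the code in pyuca: `default_ducet_version` is a hypothetical function, and the choices for versions not listed above (anything before 3.3 other than 2.7, and 3.7 onwards) are assumptions.

```python
import sys


def default_ducet_version(version_info=None):
    """Pick a DUCET version string for a given (major, minor) Python version."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    if major_minor < (3, 3):
        return "5.2.0"   # listed for 2.7; assumed for anything else this old
    if major_minor < (3, 5):
        return "6.3.0"   # 3.3 and 3.4
    if major_minor < (3, 6):
        return "8.0.0"   # 3.5
    if major_minor < (3, 7):
        return "9.0.0"   # 3.6
    return "10.0.0"      # assumption: the list above only says 10.0.0 is tested on 3.7-dev


print(default_ducet_version())
```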

pyuca 1.2 has now been released and is available on PyPI. The repository is at https://github.com/jtauber/pyuca.

A Tour of Greek Morphology: Part 16

2017-09-07

Part sixteen of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In the previous post we went through and made sure we had all our active endings covered ready for counting. As pointed out (and in detail in Part 13), though, we still had some ambiguities. If we want to assign just a single inflectional class to each form in the SBLGNT, we need some way of disambiguating. Fortunately, the lemma does this (even if it resorts to using fake forms like the uncontracted circumflex 1SG).

This allows us to write code that basically follows these rules:

1SG:Xημι or 3SG:Xησι(ν)	is	PA-7 if lemma ends in τίθημι or ίημι; PA-9 if lemma ends in ίστημι or φημι
1PL:Xῶμεν or 3PL:Xῶσι(ν)	is	PA-5 if lemma is ζάω; PA-4 otherwise
1PL:Xοῦμεν or 3PL:Xοῦσι(ν)	is	PA-2 if lemma ends in έω; PA-3 if lemma ends in όω
2PL:Xετε	is	PA-1 if lemma ends in ω; PA-7 if lemma ends in ημι
1PL:Xομεν	is	PA-1 if lemma ends in ω; PA-8 if lemma ends in ωμι
1SG:Xῶ	is	PA-2 if lemma ends in έω; PA-3 if lemma ends in όω; PA-5 if lemma is ζάω; PA-4 if lemma otherwise ends in άω
INF:Xέναι	is	PA-7 if lemma ends with ίημι; PA-11-COMPOUND if lemma ends with ειμι
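One way to express lemma-based rules like these in Python is to key them on the set of classes a form could belong to. The sketch below is a simplified illustration, not the actual script: `LEMMA_RULES` and `disambiguate_active` are hypothetical names, only four of the ambiguous sets are covered, and the caller is assumed to have already worked out the candidate classes from the form.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so accented letters compare reliably."""
    return unicodedata.normalize("NFC", text)


# candidate class set -> ordered (lemma endings, class) choices
LEMMA_RULES = {
    frozenset(["PA-7", "PA-9"]): [(("τίθημι", "ίημι"), "PA-7"), (("ίστημι", "φημι"), "PA-9")],
    frozenset(["PA-2", "PA-3"]): [(("έω",), "PA-2"), (("όω",), "PA-3")],
    frozenset(["PA-1", "PA-7"]): [(("ω",), "PA-1"), (("ημι",), "PA-7")],
    frozenset(["PA-1", "PA-8"]): [(("ω",), "PA-1"), (("ωμι",), "PA-8")],
}


def disambiguate_active(possible, lemma):
    """Return a single class, or None if the lemma matches no rule."""
    if len(possible) == 1:
        return possible[0]
    lemma = nfc(lemma)
    for endings, cls in LEMMA_RULES.get(frozenset(possible), []):
        if lemma.endswith(tuple(nfc(e) for e in endings)):
            return cls
    return None  # no rule matched: the form needs a closer look


print(disambiguate_active(["PA-1", "PA-8"], "ἔχω"))
print(disambiguate_active(["PA-7", "PA-9"], "συνίστημι"))
```

Returning None rather than guessing makes it easy to list the forms where the lemma does not line up with any of the candidate classes.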

Part 13 also mentioned the 2SG:Xης ambiguity between PA-7 and PA-9 but that doesn't crop up in the SBLGNT: there are in fact no PA-7 OR PA-9 2SGs in the SBLGNT.

There ARE however three 1PL forms which do still cause a problem with the rules above:

ἀφίομεν
ἱστάνομεν
συνιστάνομεν

Each of these matches 1PL:Xομεν BUT the MorphGNT lemmas are ἀφίημι, ἵστημι, and συνίστημι respectively.

What is happening here is that new forms have developed belonging to a different inflectional class than the particular form chosen for the lemma. For example ἱστάνομεν is an ω verb but it's otherwise the same as the athematic ἵστημι. Arguably the MorphGNT lemmatization could be changed to ἱστάνω if you consider a difference in inflectional class to be a new lexeme. This is a topic I'll be covering in my talk at SBL 2017 in Boston in November. For now, in our Python code, we'll just special-case these as PA-1 but we will come back to discussing this more. Note that we only caught this here because it was an ambiguous form so we were checking for particular lemma patterns.

We now have an inflectional class for all 5,314 present active infinitive or indicative forms in the MorphGNT SBLGNT.

The output of my Python script begins:

010120 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010123 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010202 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010206 εἶ 2SG PA-10 εἰμί PA-10
010213 μέλλει 3SG PA-1 μέλλω PA-1
010213 ζητεῖν INF PA-2 ζητέω PA-2
010218 εἰσί(ν) 3PL PA-10 εἰμί PA-10
010222 βασιλεύει 3SG PA-1 βασιλεύω PA-1
010303 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010309 λέγειν INF PA-1 λέγω PA-1
010309 ἔχομεν 1PL PA-1/PA-8 ἔχω PA-1

The columns are:

the book/chapter/verse reference
the normalized form
the morphosyntactic properties
the inflectional classes possible without disambiguation
the lemma
the disambiguated inflectional class

You can download the entire thing here.

We'll use this to do our counts in the next post.

One question comes to mind: are the disambiguated inflectional classes consistent for all the forms of a lexeme (beyond the three exceptions we already saw above)?

Well, looking at the full output of the script, we find there are a few more in the SBLGNT:

lemma	cell	form	class
ὀμνύω	INF	ὀμνύναι	PA-6a
	INF	ὀμνύειν	PA-1
	all other forms		PA-1
δείκνυμι	INF	δεικνύειν	PA-1
	2SG	δεικνύεις	PA-1
	1SG	δείκνυμι	PA-6a
	3SG	δείκνυσι(ν)	PA-6a
συνίστημι	1PL	συνιστάνομεν	PA-1
	INF	συνιστάνειν	PA-1
	1SG	συνίστημι	PA-9
	3SG	συνίστησι(ν)	PA-9
ἀφίημι	1PL	ἀφίομεν	PA-1
	3PL	ἀφίουσι(ν)	PA-1
	2SG	ἀφεῖς	PA-2
	all other forms		PA-7
συνίημι	INF	συνιέναι	PA-7
	2PL	συνίετε	PA-7
	3PL	συνίουσι(ν)	PA-1
	3PL	συνιᾶσι(ν)	PA-9

In each case we have an originally athematic verb occasionally acting like it's thematic (and, in the case of ὀμνύω even the lemma is written as if it was thematic). We WILL have more to say about this in a few posts but we've now done enough that we can count how many times each inflectional class appears in the SBLGNT and how many different lexemes follow each inflectional class. We'll do that in the very next post.
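A consistency check of this kind is easy to sketch: group the disambiguated classes by lemma and report any lemma with more than one. The function name is hypothetical and the input is assumed to be (lemma, class) pairs, one per token.

```python
from collections import defaultdict


def lemmas_with_multiple_classes(pairs):
    """Map each lemma that shows more than one class to its sorted class list."""
    classes = defaultdict(set)
    for lemma, cls in pairs:
        classes[lemma].add(cls)
    return {lemma: sorted(found) for lemma, found in classes.items() if len(found) > 1}


sample = [
    ("δείκνυμι", "PA-1"),
    ("δείκνυμι", "PA-6a"),
    ("λέγω", "PA-1"),
    ("λέγω", "PA-1"),
]
print(lemmas_with_multiple_classes(sample))
```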

There is still another thing worth checking: is the value of X in our paradigm patterns consistent across a lexeme too? Yes it is, accent aside, if you only compare within the same inflectional class. The X for the δείκνυμι cells in PA-6a is always δείκν, for example, but the PA-1 cases have X = δεικνύ.

UPDATE: I just discovered a mis-disambiguated παριστάνετε that needs to be special-cased as a PA-1.

A Tour of Greek Morphology: Part 15

2017-09-05

Part fifteen of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In the previous two posts in this series (part 13 and part 14) we summarized the paradigms we've seen so far for the present infinitive and indicative both in the active and middle.

Do these paradigms cover all the forms in the Greek New Testament? Which paradigms are more common? Which are productive? We'll explore these questions in the next few posts.

Let's start with the active forms.

The first test is whether every present active infinitive and indicative verb in the MorphGNT SBLGNT matches with one of the patterns we've discussed GIVEN ITS MORPHOSYNTACTIC PROPERTY SET. We want to test, for example, whether every verb tagged as -PAN---- matches one of Xειν, Xεῖν, Xοῦν, Xᾶν, Xῆν, Xύναι, Xέναι, Xόναι, Xάναι, or εἶναι. Or whether every verb tagged as 2PAI-S-- matches one of Xεις, Xεῖς, Xοῖς, Xᾷς, Xῇς, Xυς, Xης, Xως, Xης, or εἶ.

Running a short Python script over the MorphGNT, it turns out there are 14 forms in 69 instances that do NOT match.
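The shape of such a coverage test can be sketched as follows. This is an illustration rather than the script itself: only the infinitive tag is filled in, using the endings listed above, and `PATTERNS`, `WHOLE_FORMS`, `matches` and `non_matching` are hypothetical names.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so accented letters compare reliably."""
    return unicodedata.normalize("NFC", text)


# Endings standing for the X-patterns of one morphosyntactic property set.
PATTERNS = {
    "-PAN----": ["ειν", "εῖν", "οῦν", "ᾶν", "ῆν", "ύναι", "έναι", "όναι", "άναι"],
}
# Forms that must match as a whole rather than by ending.
WHOLE_FORMS = {
    "-PAN----": ["εἶναι"],
}


def matches(tag, form):
    """True if the form fits one of the known patterns for its tag."""
    form = nfc(form)
    if form in {nfc(f) for f in WHOLE_FORMS.get(tag, [])}:
        return True
    return any(form.endswith(nfc(e)) for e in PATTERNS.get(tag, []))


def non_matching(tokens):
    """Return the distinct (tag, form) pairs not covered by any pattern."""
    return sorted({(tag, nfc(form)) for tag, form in tokens if not matches(tag, form)})


tokens = [("-PAN----", "λέγειν"), ("-PAN----", "εἶναι"), ("-PAN----", "παρεῖναι")]
print(non_matching(tokens))
```

Because the accents are part of the patterns, a compound such as παρεῖναι is reported as uncovered even though it differs from εἶναι only in its prefix and breathing.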

Three of these forms are φημί. The issue here is that φημί is enclitic in the indicative and so, even though it otherwise follows a PA-9 paradigm, the accentuation doesn't match. If we want to capture the enclitic nature of φημί in its inflection class, we'll need to create a variant of PA-9 that is enclitic.

	PA-9	PA-9-ENCLITIC
INF	Xάναι	Xάναι
1SG	Xημι	Xημί
2SG	Xης	Xής
3SG	Xησι(ν)	Xησί(ν)
1PL	Xαμεν	Xαμέν
2PL	Xατε	Xατέ
3PL	Xᾶσι(ν)	Xασί(ν)

The 2SG appears more frequently as φῄς in Classical Greek but neither form appears in the SBLGNT so we'll put that issue aside for now.

Another eight of these forms are compounds of the copula and so have different accentuation and breathing (but are otherwise identical to PA-10).

	PA-10	PA-10-COMPOUND
INF	εἶναι	Xεῖναι
1SG	εἰμί	Xειμι
2SG	εἶ	Xει
3SG	ἐστί(ν)	Xεστι(ν)
1PL	ἐσμέν	Xεσμεν
2PL	ἐστέ	Xεστε
3PL	εἰσί(ν)	Xεισι(ν)

The only additional variation here is εἰσίασιν in Hebrews 9.6 but this is not, in fact, derived from εἰς + εἰμί but rather εἰς + εἶμι. Let's create a new paradigm for εἶμι even though it doesn't appear in the SBLGNT just so we can derive a paradigm for the compound case from it.

Here PA-11 and PA-11-COMPOUND are shown alongside PA-10 for comparison (note the italic forms don't appear in the SBLGNT):

	PA-10	PA-11	PA-11-COMPOUND
INF	εἶναι	ἰέναι	Xιέναι
1SG	εἰμί	εἶμι	Xειμι
2SG	εἶ	εἶ	Xει
3SG	ἐστί(ν)	εἶσι(ν)	Xεισι(ν)
1PL	ἐσμέν	ἴμεν	Xιμεν
2PL	ἐστέ	ἴτε	Xιτε
3PL	εἰσί(ν)	ἴασι(ν)	Xίασι(ν)

PA-11 and PA-11-COMPOUND are very similar to PA-6a through PA-9 except with ει/ι instead of υ/υ, η/ε, ω/ο, η/α. The INF being ιε is a little unexpected but outside the scope of the current discussion as we really are just wanting to capture the 3PL of PA-11-COMPOUND for now.

Note that εἰσιέναι in Acts 3.3 is also from εἰς + εἶμι but this slipped us by because we have a Xέναι pattern already. Similarly, we have ἐξιέναι in Acts 20.7 and 27.43. With the addition of PA-11-COMPOUND we now have a slight ambiguity with PA-7 (in the INF) and PA-10-COMPOUND (in the 1SG and 2SG). This isn't a problem at the moment but will come up again (as will other ambiguities) in the next post.

Adding these paradigm variants covers 12 of our originally non-matching forms. The remaining two are the impersonal χρή and ἔνι which represent fossilized phrases with the copula elided. For our stats we'll ignore them.

In the next post, we'll see if we can categorize the lexemes in the SBLGNT into inflection classes based on these paradigms and therefore be able to study how frequent they are from both a type and token perspective.

More Vocabulary Statistics

2017-09-02

With a boost in numbers on vocab.oxlos.org, this post looks at some slightly more detailed statistics from the first activity.

Just 5 days ago there were 82 sign ups with 52 people having completed the first activity. Now there have been a total of 116 signups and 79 people have done at least the first activity (with 44 having done more than one). Thank you very much everyone!

In my last post we looked at mean item difficulty (what proportion of people get an item correct) by frequency bucket.

We saw that the coarse frequency buckets had an okay correlation with item difficulty but not great. We’ll explore that a little more in the near future but in this post I want to introduce another dimension: the ability of the person being asked the item.

I should note that in psychometrics (and in item response theory in particular, which we’ll be getting to) the term "ability" is used in a specific sense of the measurement we’re trying to take of the person (with no assumption of whether it’s innate or even desirable). It’s just the person-specific construct we’re trying to measure.

As an initial proxy for this "ability" in the context of the first activity on the site, I’ve used the total percentage of items in that activity answered correctly by a given person. This is just the raw percentage of items answered correctly, not quite the same as the estimate of NT vocabulary coverage shown on the site. This raw percentage is then used to group people into buckets (just in the context of the first activity for now).
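As a rough sketch of that grouping, assuming five equal-width buckets and assuming a score exactly on a boundary goes into the higher bucket (the names `ability_bucket` and `group_people` are hypothetical):

```python
from collections import defaultdict


def ability_bucket(correct, total, n_buckets=5):
    """Label the bucket for a raw score of `correct` out of `total` items."""
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    index = min(correct * n_buckets // total, n_buckets - 1)
    width = 100 // n_buckets
    return "{}-{}%".format(index * width, (index + 1) * width)


def group_people(scores):
    """scores: dict of person -> (correct, total); returns bucket -> people."""
    buckets = defaultdict(list)
    for person, (correct, total) in scores.items():
        buckets[ability_bucket(correct, total)].append(person)
    return dict(buckets)


print(group_people({"a": (7, 10), "b": (10, 10), "c": (1, 10)}))
```

Passing `n_buckets=10` gives the finer-grained ten-bucket grouping with the same code.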

Now we can tabulate item frequency buckets vs person ability buckets with the following result:

First off, you can see we’re still somewhat lacking in numbers of people of beginning-intermediate ability.

But importantly, you can see how mean item difficulty (the number in each cell) varies by ability bucket (the column). We’ve already seen that mean item difficulty isn’t a great predictor of item frequency bucket. Splitting out different abilities like we do above makes discrimination easier in some cases. But the important thing to note in the table above is that the mean item difficulty WITHIN a frequency bucket (row) is a good indicator of a person’s overall ability bucket.

This is less the case in the bucket for the most frequent items (the row labeled 1), which makes ability buckets 20% and above difficult to discriminate. Similarly, the less frequent item buckets aren’t as good at discriminating between the lower ability buckets. This is what we would expect.

But overall, frequency buckets 2 through 5 (and especially 3 and 4) do an excellent job of discriminating each of the ability buckets above 20%. 5 seems particularly well suited for each of the buckets at 40% ability and above and 1 only really between the 0–20% bucket and the rest.

I suspect it’s going to be interesting to have more fine-grained item frequencies but even MORE interesting to put aside frequency altogether and bucket them by overall difficulty. I’ll do that in a subsequent post once I’ve done the analysis. At some point I’ll also look at individual items and their ability to discriminate ability.

For now, though, I did want to share a finer-grained bucketing of ability, with ten buckets instead of five:

The lack of people below the 50% ability mark makes this a little less useful and there are adjacent ability buckets that cease to be discriminating at this level of granularity.

But the important pattern is still there, assuming for now frequency is a proxy for difficulty: if an item is easy, it can’t discriminate people of higher ability, although may be great at discriminating those of lower ability; and if an item is hard, it can’t discriminate people of lower ability, although may be great at discriminating those of higher ability.

A Tour of Greek Morphology: Part 14

2017-08-29

Part fourteen of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

Now we summarize our middle distinguishers. As we did for PA-6a, we'll include the upsilon for PM-6a.

	PM-1	PM-2	PM-3	PM-4	PM-5	PM-6a	PM-7	PM-8	PM-9
INF	Xεσθαι	Xεῖσθαι	Xοῦσθαι	Xᾶσθαι	Xῆσθαι	Xυσθαι	Xεσθαι	Xοσθαι	Xασθαι
1SG	Xομαι	Xοῦμαι	Xοῦμαι	Xῶμαι	Xῶμαι	Xυμαι	Xεμαι	Xομαι	Xαμαι
2SG	Xῃ or Xει	Xῇ or Xεῖ	Xοῖ	Xᾷ	Xῇ	Xυσαι	Xεσαι	Xοσαι	Xασαι
3SG	Xεται	Xεῖται	Xοῦται	Xᾶται	Xῆται	Xυται	Xεται	Xοται	Xαται
1PL	Xόμεθα	Xούμεθα	Xούμεθα	Xώμεθα	Xώμεθα	Xύμεθα	Xέμεθα	Xόμεθα	Xάμεθα
2PL	Xεσθε	Xεῖσθε	Xοῦσθε	Xᾶσθε	Xῆσθε	Xυσθε	Xεσθε	Xοσθε	Xασθε
3PL	Xονται	Xοῦνται	Xοῦνται	Xῶνται	Xῶνται	Xυνται	Xενται	Xονται	Xανται

and if we capture the common elements in each row:

	PM-1	PM-2	PM-3	PM-4	PM-5	PM-6a	PM-7	PM-8	PM-9
INF	-σθαι	-σθαι	-σθαι	-σθαι	-σθαι	-σθαι	-σθαι	-σθαι	-σθαι
1SG	-μαι	-μαι	-μαι	-μαι	-μαι	-μαι	-μαι	-μαι	-μαι
2SG	-{ι}	-{ι}	-{ι}	-{ι}	-{ι}	-σαι	-σαι	-σαι	-σαι
3SG	-ται	-ται	-ται	-ται	-ται	-ται	-ται	-ται	-ται
1PL	-μεθα	-μεθα	-μεθα	-μεθα	-μεθα	-μεθα	-μεθα	-μεθα	-μεθα
2PL	-σθε	-σθε	-σθε	-σθε	-σθε	-σθε	-σθε	-σθε	-σθε
3PL	-νται	-νται	-νται	-νται	-νται	-νται	-νται	-νται	-νται

Notice that, other than the contraction happening in 2SG obscuring the historical σαι, and unlike the active, there is no difference between the thematic and athematic endings.

That does mean, however, that the INF is no longer completely predictive of the other forms and, in fact no cells are (2SG getting close but failing because of the -ῇ ambiguity).

INF, 3SG, and 2PL can't distinguish within the set {PM-1, PM-7}
1SG, 1PL, and 3PL can't distinguish within the set {PM-1, PM-8}, the set {PM-2, PM-3}, or the set {PM-4, PM-5}
2SG (at least if ῇ) can't distinguish within the set {PM-2, PM-5}

That means, even if you had the INF, 3SG, AND 2PL of a word, you might not be able to predict its other forms (but if you had a single one of those other forms, all the rest would be predictable). And if you had the 1SG, 1PL, and/or 3PL of a word, you might not be able to predict its other forms (but again, if you had a single one of those other forms, all the rest would be predictable).
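These ambiguity sets can be derived mechanically from the paradigm table. The sketch below does so for three of the cells; it is an illustration with hypothetical names (`CELLS`, `ambiguous_sets`), with the distinguishers copied from the first table above and alternatives written with "|".

```python
import unicodedata
from collections import defaultdict

CLASSES = ["PM-1", "PM-2", "PM-3", "PM-4", "PM-5", "PM-6a", "PM-7", "PM-8", "PM-9"]

# One entry per class, in the order of CLASSES; "|" separates alternatives.
CELLS = {
    "INF": ["Xεσθαι", "Xεῖσθαι", "Xοῦσθαι", "Xᾶσθαι", "Xῆσθαι", "Xυσθαι", "Xεσθαι", "Xοσθαι", "Xασθαι"],
    "1SG": ["Xομαι", "Xοῦμαι", "Xοῦμαι", "Xῶμαι", "Xῶμαι", "Xυμαι", "Xεμαι", "Xομαι", "Xαμαι"],
    "2SG": ["Xῃ|Xει", "Xῇ|Xεῖ", "Xοῖ", "Xᾷ", "Xῇ", "Xυσαι", "Xεσαι", "Xοσαι", "Xασαι"],
}


def ambiguous_sets(cell):
    """Return the groups of classes that share a distinguisher in this cell."""
    by_ending = defaultdict(list)
    for cls, endings in zip(CLASSES, CELLS[cell]):
        for ending in endings.split("|"):
            by_ending[unicodedata.normalize("NFC", ending)].append(cls)
    return sorted(group for group in by_ending.values() if len(group) > 1)


for cell in CELLS:
    print(cell, ambiguous_sets(cell))
```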

This mirrors the ambiguous categories we've already seen.

PM-{1, 7}	ε in INF, 3SG, and 2PL
PM-{1, 8}	ο in 1SG, 1PL, and 3PL
PM-{2, 3}	οῦ in 1PL and 3PL
PM-{4, 5}	ῶ in 1PL and 3PL

Plus:

PM-{2, 5}	ῇ ending in 2SG

Also, without accentuation, PM-4 and PM-9 would be indistinguishable in INF, 3SG, and 2PL. And, similarly, PM-1 and PM-2 in 2SG.

In the next part, we'll look at the MorphGNT to see whether the distinguishers here and in part 13 fully cover all present infinitive and indicative verbs in the SBLGNT. We'll also look at some frequency data. How (relatively) common are each of the paradigms we've identified? Which seem to be productive and which not? We'll also briefly touch on words that change inflectional class (and hence paradigm) and what role ambiguous forms might play in this.

Some Initial Vocabulary Statistics

2017-08-29

Here are some very preliminary statistics from the Greek Vocab site’s first month.

So far 82 people have signed up to http://vocab.oxlos.org/ and 52 have completed at least the first activity, a common noun receptive vocabulary leveling test based on a test form developed (for English) by Paul Nation.

Recall from my initial post on the site, that vocabulary items in that activity are classified into one of five buckets based on how many times they occur in the Greek New Testament.

Here are the mean results (with standard error) for each bucket for the first activity (N=52):

bucket	occurrences	mean ± std err
1	32 or more times	0.966 ± 0.008
2	16 to 31 times	0.837 ± 0.028
3	4 to 15 times	0.667 ± 0.041
4	2 or 3 times	0.556 ± 0.049
5	1 time	0.582 ± 0.047

The first four buckets get increasingly more difficult, as one would expect. But notice that buckets 4 and 5 are indistinguishable within the standard error of the two means.
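As a rough illustration of what "indistinguishable within the standard error" can mean, here is a minimal Python sketch. The function names are hypothetical, and the criterion (two means count as separated when their difference exceeds the standard error of that difference, combining the two standard errors in quadrature) is an assumption made for this sketch, not necessarily the exact comparison used for the table.

```python
import math
import statistics


def mean_and_stderr(scores):
    """Return (mean, standard error of the mean) for a list of per-person scores."""
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


def separated(mean_a, se_a, mean_b, se_b):
    """True if the means differ by more than the standard error of their difference.

    This criterion is an assumption for illustration only.
    """
    return abs(mean_a - mean_b) > math.sqrt(se_a ** 2 + se_b ** 2)


# Bucket means and standard errors copied from the table above (N=52).
buckets = {
    1: (0.966, 0.008),
    2: (0.837, 0.028),
    3: (0.667, 0.041),
    4: (0.556, 0.049),
    5: (0.582, 0.047),
}

# Compare each bucket with the next one down in frequency.
for a in range(1, 5):
    b = a + 1
    print(f"buckets {a} and {b} separated: {separated(*buckets[a], *buckets[b])}")
```

Under that assumed criterion, buckets 1 through 4 come out separated from their neighbours while buckets 4 and 5 do not.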

Here are the results of the next three activities of the same type.

bucket	GNT Nouns 2	GNT Nouns 3	GNT Nouns 4
	N=30	N=19	N=15
1	0.985 ± 0.004	0.991 ± 0.005	0.985 ± 0.007
2	0.894 ± 0.020	0.901 ± 0.021	0.930 ± 0.018
3	0.631 ± 0.046	0.661 ± 0.039	0.689 ± 0.051
4	0.602 ± 0.060	0.570 ± 0.067	0.574 ± 0.059
5	0.450 ± 0.048	0.556 ± 0.064	0.611 ± 0.050

GNT Nouns 2 actually does successfully separate buckets 4 and 5 (apparently the hapax legomena in that test were harder) but it doesn’t do a great job distinguishing buckets 3 and 4. GNT Nouns 3 fails to distinguish buckets 4 and 5 and only barely separates 3 and 4. GNT Nouns 4 likewise doesn’t really distinguish buckets 4 and 5 and only barely separates 3 and 4.

It should be noted that the ability level of the average person doing an activity increases with each activity. This isn’t clear from the data presented here but is from other data. This is likely because a person who has done reasonably well on one activity is more likely to continue to do more activities.

I COULD mitigate this problem by only including results for earlier activities from people who have completed all four. But before I do that, I’d actually like to just see more people do all four activities.

Furthermore, the vast majority of people doing these activities are scoring above 50% and, in fact, no one scoring below 40% has attempted activities beyond the first. I NEED MORE BEGINNER-INTERMEDIATE LEVEL PEOPLE to do all four tests! They will better discriminate mid-to-hard difficulty items (more on that concept later).

But preliminary indications are that I haven’t quite got the buckets right yet. Fortunately, I can re-run analyses with different bucketing even if the distribution of items chosen for the tests is based on the existing bucketing scheme.

I’ll continue to blog more statistics over time. Some topics I’d like to explore include inter-test reliability, G-theory, ANOVA, and IRT modeling.

Thank you to everyone who is contributing to this. Please spread the word!

Greek Letter Frequencies

2017-08-27

I recently saw a nice visualisation of English letter bigram frequencies and decided to replicate it with Greek New Testament data.

You can see the English original in this post on All Things Linguistic. That's not where I originally saw it, though. I think I saw a link on Twitter to a Reddit post.

I wrote a quick Python script to generate the same style of visualisation based on word types (not tokens) in the SBLGNT after stripping accents and folding to lowercase (but keeping the apostrophe used to mark elision). This is the result:

The intensity of red in the left column indicates the relative frequency of that letter overall. Each row then indicates (via ordering and the intensity of blue) the relative frequencies of what letter follows that red letter. The superscript then indicates the single most likely letter to follow that sequence of two letters. So it shows all unigram frequencies, all bigram frequencies, and the most common trigram for each bigram.
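The counting behind a visualisation like this can be sketched in a few lines of Python. This is not the original script: the function names and the tiny sample word list are hypothetical, and it assumes that "stripping accents" means dropping every combining mark (which also removes breathings and iota subscripts).

```python
import unicodedata
from collections import Counter, defaultdict


def normalize(word):
    """Drop combining marks and fold to lowercase; apostrophes are left alone."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped).lower()


def ngram_counts(word_types):
    """Count letters, letter pairs, and which letter follows each pair."""
    unigrams, bigrams = Counter(), Counter()
    trigrams = defaultdict(Counter)  # bigram -> Counter of the following letter
    for word in word_types:
        w = normalize(word)
        unigrams.update(w)
        for i in range(len(w) - 1):
            bigrams[w[i:i + 2]] += 1
        for i in range(len(w) - 2):
            trigrams[w[i:i + 2]][w[i + 2]] += 1
    return unigrams, bigrams, trigrams


# Hypothetical sample of word types (each distinct form counted once).
sample_types = {"λόγος", "λόγον", "λέγω", "θεός"}
unigrams, bigrams, trigrams = ngram_counts(sample_types)

print(unigrams.most_common(3))           # most frequent letters overall
print(bigrams["λο"])                     # how often ο follows λ
print(trigrams["λο"].most_common(1))     # most likely letter after λο
```

The unigram counts correspond to the red column, the bigram counts to the ordering and shading of each row, and the most common follower of each bigram to the superscript.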

I also used the same bigram and trigram data to generate pseudowords, much like the English original did. At the time, I only tweeted about this second part.
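For readers curious how trigram-based generation works in principle, here is a minimal sketch. It is not the code from the gist mentioned below: the names, the `^` and `$` boundary markers, and the small word list are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def build_model(words):
    """Map each two-character context to a Counter of the characters that follow it."""
    model = defaultdict(Counter)
    for word in words:
        padded = "^^" + word + "$"  # ^ marks the word start, $ the word end
        for i in range(len(padded) - 2):
            model[padded[i:i + 2]][padded[i + 2]] += 1
    return model


def generate(model, rng, max_length=20):
    """Build a pseudoword by repeatedly sampling a follower of the last two characters."""
    context, letters = "^^", []
    while len(letters) < max_length:
        followers = model[context]
        next_char = rng.choices(list(followers), weights=list(followers.values()))[0]
        if next_char == "$":
            break
        letters.append(next_char)
        context = context[1] + next_char
    return "".join(letters)


# Hypothetical, already accent-stripped word types.
words = ["λογος", "λογον", "λεγω", "λεγομεν", "θεος"]
model = build_model(words)
rng = random.Random(0)  # fixed seed so runs are repeatable
print([generate(model, rng) for _ in range(5)])
```

Because each letter is chosen in proportion to how often it follows the previous two, the output tends to look plausible locally even when the whole word does not exist.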

Trigram-based generation of Greek-like words seems promising: ὀκρός θρωτοί δελθομοῦς ἐδωσῖνα ἐπιδάς εὑόν εἰπῆς ἐνησόφος πόδου δόξηλθον μετέ
— James Tauber (@jtauber) August 7, 2017

Patrick Burns asked me for the pseudoword generation code so I extracted it, cleaned it up a bit and posted it to a gist here.

I never got around to posting my letter frequency visualisation, but Seumas Macdonald (not knowing I'd already done the work) pointed me to the All Things Linguistic blog post and asked about the possibility of doing the same for Greek. It was enough of a nudge to get this blog post written.

Thanks Seumas and Patrick!

A Tour of Greek Morphology: Part 13

2017-08-26

Part thirteen of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

Let's summarize all 10 active distinguisher paradigms we've seen so far (this will probably only lay out properly if your browser is wide):

	PA-1	PA-2	PA-3	PA-4	PA-5	PA-6	PA-7	PA-8	PA-9	PA-10
INF	Xειν	Xεῖν	Xοῦν	Xᾶν	Xῆν	Xναι	Xέναι	Xόναι	Xάναι	εἶναι
1SG	Xω	Xῶ	Xῶ	Xῶ	Xῶ	Xμι	Xημι	Xωμι	Xημι	εἰμί
2SG	Xεις	Xεῖς	Xοῖς	Xᾷς	Xῇς	Xς	Xης	Xως	Xης	εἶ
3SG	Xει	Xεῖ	Xοῖ	Xᾷ	Xῇ	Xσι(ν)	Xησι(ν)	Xωσι(ν)	Xησι(ν)	ἐστί(ν)
1PL	Xομεν	Xοῦμεν	Xοῦμεν	Xῶμεν	Xῶμεν	Xμεν	Xεμεν	Xομεν	Xαμεν	ἐσμέν
2PL	Xετε	Xεῖτε	Xοῦτε	Xᾶτε	Xῆτε	Xτε	Xετε	Xοτε	Xατε	ἐστέ
3PL	Xουσι(ν)	Xοῦσι(ν)	Xοῦσι(ν)	Xῶσι(ν)	Xῶσι(ν)	Xασι(ν)	Xέασι(ν)	Xόασι(ν)	Xᾶσι(ν)	εἰσί(ν)

As we've already noted, some cells have identical distinguishers (for example, the ῶ of PA-2, PA-3, PA-4 and PA-5). More on that shortly.

But first note something about PA-6: it subsumes the next three paradigms and, in fact, in the case of 2SG subsumes every paradigm except PA-10. In other words, a word form from another paradigm technically matches PA-6 too. If you go back to part 10, you'll see that our exemplar for PA-6 was δεικνύναι, δείκνυμι, and so on. The only reason PA-6 doesn't have a vowel like PA-7, PA-8, PA-9 is that the vowel is always υ and hence it was dropped out of the distinguisher analysis. But we have no reason at this stage not to suppose the upsilon is an important part of the PA-6 paradigm (it just doesn't distinguish cells within the paradigm). So I'm going to tentatively put it back for the purposes of comparing across paradigms. I'll call this modified distinguisher paradigm PA-6a.

Repeating the paradigm of paradigms with this small modification:

	PA-1	PA-2	PA-3	PA-4	PA-5	PA-6a	PA-7	PA-8	PA-9	PA-10
INF	Xειν	Xεῖν	Xοῦν	Xᾶν	Xῆν	Xύναι	Xέναι	Xόναι	Xάναι	εἶναι
1SG	Xω	Xῶ	Xῶ	Xῶ	Xῶ	Xυμι	Xημι	Xωμι	Xημι	εἰμί
2SG	Xεις	Xεῖς	Xοῖς	Xᾷς	Xῇς	Xυς	Xης	Xως	Xης	εἶ
3SG	Xει	Xεῖ	Xοῖ	Xᾷ	Xῇ	Xυσι(ν)	Xησι(ν)	Xωσι(ν)	Xησι(ν)	ἐστί(ν)
1PL	Xομεν	Xοῦμεν	Xοῦμεν	Xῶμεν	Xῶμεν	Xυμεν	Xεμεν	Xομεν	Xαμεν	ἐσμέν
2PL	Xετε	Xεῖτε	Xοῦτε	Xᾶτε	Xῆτε	Xυτε	Xετε	Xοτε	Xατε	ἐστέ
3PL	Xουσι(ν)	Xοῦσι(ν)	Xοῦσι(ν)	Xῶσι(ν)	Xῶσι(ν)	Xύασι(ν)	Xέασι(ν)	Xόασι(ν)	Xᾶσι(ν)	εἰσί(ν)

Now let's capture the common elements in the rows:

	PA-1	PA-2	PA-3	PA-4	PA-5	PA-6a	PA-7	PA-8	PA-9	PA-10
INF	-ν	-ν	-ν	-ν	-ν	-ναι	-ναι	-ναι	-ναι	-ναι
1SG	-ω	-ῶ	-ῶ	-ῶ	-ῶ	-μι	-μι	-μι	-μι	-μί
2SG	-{ι}ς	-{ι}ς	-{ι}ς	-{ι}ς	-{ι}ς	-ς	-ς	-ς	-ς	εἶ
3SG	-{ι}	-{ι}	-{ι}	-{ι}	-{ι}	-σι(ν)	-σι(ν)	-σι(ν)	-σι(ν)	ἐστί(ν)
1PL	-μεν	-μεν	-μεν	-μεν	-μεν	-μεν	-μεν	-μεν	-μεν	-μέν
2PL	-τε	-τε	-τε	-τε	-τε	-τε	-τε	-τε	-τε	-τέ
3PL	-σι(ν)	-σι(ν)	-σι(ν)	-σι(ν)	-σι(ν)	-ασι(ν)	-ασι(ν)	-ασι(ν)	-ᾶσι(ν)	-σί(ν)

The INF, although coming in two variants, has the property that it gives us enough information to know every form of the word in the present indicative active.

No other slots in our paradigms do that.

The 1SG can't distinguish within the set {PA-2, PA-3, PA-4, PA-5} or within the set {PA-7, PA-9}
The 2SG can't distinguish within the set {PA-7, PA-9}
The 3SG can't distinguish within the set {PA-7, PA-9}
The 1PL can't distinguish within the set {PA-2, PA-3}, the set {PA-1, PA-8}, or the set {PA-4, PA-5}
The 2PL can't distinguish within the set {PA-1, PA-7}
The 3PL can't distinguish within the set {PA-2, PA-3} or within the set {PA-4, PA-5}
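The ambiguity sets above can be checked mechanically. The following Python sketch copies the distinguishers from the PA-6a version of the table and groups the paradigms that share an identical distinguisher in a given cell; the names `PARADIGMS`, `TABLE`, and `ambiguous_sets` are just for illustration.

```python
import unicodedata
from collections import defaultdict

PARADIGMS = ["PA-1", "PA-2", "PA-3", "PA-4", "PA-5",
             "PA-6a", "PA-7", "PA-8", "PA-9", "PA-10"]

# One row per cell, distinguishers in the same order as PARADIGMS.
TABLE = {
    "INF": "Xειν Xεῖν Xοῦν Xᾶν Xῆν Xύναι Xέναι Xόναι Xάναι εἶναι",
    "1SG": "Xω Xῶ Xῶ Xῶ Xῶ Xυμι Xημι Xωμι Xημι εἰμί",
    "2SG": "Xεις Xεῖς Xοῖς Xᾷς Xῇς Xυς Xης Xως Xης εἶ",
    "3SG": "Xει Xεῖ Xοῖ Xᾷ Xῇ Xυσι(ν) Xησι(ν) Xωσι(ν) Xησι(ν) ἐστί(ν)",
    "1PL": "Xομεν Xοῦμεν Xοῦμεν Xῶμεν Xῶμεν Xυμεν Xεμεν Xομεν Xαμεν ἐσμέν",
    "2PL": "Xετε Xεῖτε Xοῦτε Xᾶτε Xῆτε Xυτε Xετε Xοτε Xατε ἐστέ",
    "3PL": "Xουσι(ν) Xοῦσι(ν) Xοῦσι(ν) Xῶσι(ν) Xῶσι(ν) Xύασι(ν) Xέασι(ν) Xόασι(ν) Xᾶσι(ν) εἰσί(ν)",
}


def ambiguous_sets(cell):
    """Return the groups of paradigms whose distinguisher in this cell is identical."""
    forms = unicodedata.normalize("NFC", TABLE[cell]).split()
    groups = defaultdict(list)
    for paradigm, form in zip(PARADIGMS, forms):
        groups[form].append(paradigm)
    return [members for members in groups.values() if len(members) > 1]


for cell in TABLE:
    print(cell, ambiguous_sets(cell))
```

An empty list for a cell means that cell alone identifies the paradigm, which is the case only for the INF.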

Among other things, this is why the 1SG isn't a great choice of lemma (or headword, or citation form) for a lexeme. It's the reason so many dictionaries and lemmatizations show the circumflex verbs uncontracted (e.g. ποιέω for ποιῶ) even though in many dialects, including the Koine, that's a nonsense word. Even then, most dictionaries don't distinguish PA-7 from PA-9 (τίθημι vs ἵστημι) although admittedly that's not as important as they aren't productive.

In almost all respects, the present active infinitive is the perfect lemma for the present active forms of a verb. Some have argued against the infinitive as lemma as it doesn't form a clause by itself (although nor do verbs with obligatory complements). A close candidate is the 3SG, the benefit of which is how common it is. The main downside is just that it doesn't distinguish PA-7 and PA-9. But one could hardly go wrong focusing on the INF and 3SG as the forms to most associate with each present active verb.

It should be noted that even though the 1SG is the worst predictively, it's completely predictable from any other forms. Also, despite some ambiguity in the 1PL and 3PL, they can be predicted from one another. Similarly, all the singulars in the PA-7 and PA-9 words predict each other.

Another way of thinking about this is to group our paradigm classes by their shared properties:

PA-{1, 2, 3, 4, 5}	INF ends in -ν, 1SG in -ω/-ῶ	thematic or omega verbs
	PA-{2, 3, 4, 5}		circumflex throughout endings	circumflex or contract verbs
		PA-{2, 3}			οῦ in 1PL and 3PL
		PA-{4, 5}			ῶ in 1PL and 3PL
PA-{6a, 7, 8, 9, 10}	INF ends in -ναι, 1SG in -μι	athematic or μι verbs
	PA-{6a, 7, 8, 9}		3SG in -σι(ν), 3PL in -ασι(ν)
		PA-{7, 9}			η in singulars

There are the other cross-cutting categories:

PA-{1, 8}	1PL ends with ομεν
PA-{1, 7}	2PL ends with ετε

If one ignores accentuation, one could conceivably also come up with cross-cutting categories such as PA-{1,2} which shares the ει in the INF, 2SG, and 3SG. Or PA-{4, 9} which both have ατε in 2PL. Or PA-{1, 2, 3} which all have ουσι(ν) in 3PL.

Next we'll look at the middles.

A Tour of Greek Morphology: Part 12

2017-08-16

Part twelve of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

There is one very important verb we haven't looked at the paradigm of yet: the copula.

For comparison, we'll put the present infinitive and indicative forms alongside the common endings of the μι verbs we saw in part 10.

INF	εἶναι	-ναι
1SG	εἰμί	-μι
2SG	εἶ	-ς
3SG	ἐστί(ν)	-σι(ν)
1PL	ἐσμέν	-μεν
2PL	ἐστέ	-τε
3PL	εἰσί(ν)	-ασι(ν)

Notice:

all but the INF and 2SG are enclitic
in the INF, 1SG, 1PL and 2PL we find the expected ending
the 3SG and 3PL are slightly different
the 2SG is lacking the ending altogether
with all the endings removed, we sometimes have ἐσ and sometimes εἰ

Recall in part 9 we said that "it was not uncommon for Attic-Ionic to have σι for τι in other dialects" (a type of lenition). Perhaps the 3SG ending was originally τι(ν) and it just became σι(ν) in all the μι verbs except the copula.

And in part 11 we questioned "why the active 2SG and 3SG forms don’t end in σι and τι to mirror σαι and ται." Well, what if they originally did and some change masked this?

The 3SG τι(ν) would be explained as an original τι with the occasional movable nu. The 3SG σι(ν) would just come from τι(ν) via the tendency for τι to become σι in Attic-Ionic.

The 2SG εἶ is perfectly explainable as coming from ἐσι with the intervocalic sigma dropping. In fact, we find ἐσσί in Homer, Pindar and other writings in older or more conservative dialects. If εἶ came from an older ἐσσί, that would not only suggest a -σι ending but a ἐσ stem. [EDIT: it's also possible, or even likely given the evidence of other Indo-European languages, that the first sigma was dropped much earlier in Proto-Indo-European and the instances of ἐσσί are actually a reintroduction of a double sigma by analogy with the 3SG!]

Is it plausible that εἶναι came from ἐσ+ναι and εἰμί from ἐσ+μι? Absolutely! A sigma dropping and the preceding vowel lengthening would explain those forms. But why would we still find ἐσμέν rather than, say, εἰμέν? Well it turns out Homer and Herodotus do have εἰμέν. There is clearly tension between keeping the ἐσ and going to εἰ and different dialects went a different way even at the level of different cells in the paradigm.

In the 3PL, we do find that Homer (as well as εἰσί) has ἔᾱσι, following the 3PL ending of the other μι verbs, but much as the ω verb ending -ουσι comes from -οντι, we can explain εἰσί from ἐσ+ντι.

Further justification of earlier forms comes from comparison with other Indo-European languages but doing that would take us too far afield for this survey. For now, we'll just summarize what we have for this new paradigm.

We'll call this PA-10 but because of the ἐσ/εἰ alternation, we can't really isolate distinguishers across the entire paradigm other than the full words themselves.

	PA-10
INF	εἶναι	ἐσ+ναι
1SG	εἰμί	ἐσ+μι
2SG	εἶ	ἐσ+σι
3SG	ἐστί(ν)	ἐσ+τι
1PL	ἐσμέν	ἐσ+μεν
2PL	ἐστέ	ἐσ+τε
3PL	εἰσί(ν)	ἐσ+ντι

As always, I stress this is a historical explanation, not an explanation of what was going on in the minds of native Greek speakers nor the best way to initially learn the forms of the copula.

The μι/σι/τι/ντι pattern is fascinating, though, with its parallel to the middle μαι/σαι/ται/νται.

There are still, of course, open questions, like the relationship between these endings and those of the ω verbs that differ (not least of which -μι vs -ω itself!) Or the fact that our other μι verbs seemed to use a different vowel in the singular than the plural and there's no sign of that in the copula. [EDIT: also as noted, ἐσσι as the original form is problematic; it was likely ἐσι in Proto-Greek.]

One earlier observation we can say a little bit more about now, though, is the alpha in the -ασι(ν) ending which previously seemed inexplicable. As we shall see later on, when a ν can't be pronounced in a particular context, it often became an α rather than just dropping out completely. Given we reconstruct a ν in the 3PL ending, this ν becoming an α rather than dropping out entirely explains -ασι(ν) (with no compensatory lengthening). Because the μι verbs (unlike the ω verbs) have a 3SG ending in σι(ν), keeping the α around was useful to discriminate between the singular and plural. In the case of the copula, though, the 3SG retained the τ so there was less reason to keep the old ν (pronounced as α) around and it could just drop out entirely.

We've now covered the major present infinitive and indicative paradigms. In the next few posts in this series we're going to step back a little and talk about the relationship between paradigms, the notion of lemmas and citation forms, some more about cell filling and class inference, and some statistics about the frequency of these different paradigms we've looked at. Then we'll move beyond the present and look at a whole new set of paradigms!

Speaking in Berlin

2017-08-05

This afternoon I'm heading off to Berlin for my first Society of Biblical Literature International Meeting, where I'll be speaking on adaptive reading environments for Biblical Greek.

I've attended a number of SBL Annual Meetings in the US and spoken at two but this will be my first International Meeting. At the invitation of Professor Nicolai Winther-Nielsen, I'll be giving an update on the talk I gave at last year's Annual Meeting.

Here's my abstract:

The Route to Adaptive Learning of Greek

One of the promises of machine-actionable linguistic data linked to biblical texts is the enablement of new types of language learning tools. At their simplest, such tools might involve adding the necessary scaffolding to enable students to read more text than they otherwise might by providing glosses for rarer words or help on idioms, irregular morphology, and unusual syntactic constructions. Such tools, however, are hardly novel and have long been manually produced in printed form. Equivalent electronic versions don't really take advantage of what's possible. In this paper I discuss an online reading environment for Ancient Greek, and the Greek New Testament in particular, that takes advantage of the availability of open, machine-actionable resources such as treebanks and morphological analyses for more automated and consistent generation of scaffolding but which goes a step further by being adaptive to an individual student's knowledge at a given point. Such knowledge need not be explicitly provided (although it can be: to align with a particular textbook, for example). It can also be built up implicitly from what the reader is requesting more information or help on: What words are they having trouble remembering the meaning of? What forms are they having trouble parsing? The model of student knowledge is then integrated with learning tools such as spaced-repetition flash cards and parsing drills with the results of these tools then feeding back into better adapting scaffolding for reading. The online reading environment will be open source and potentially applicable to a wide range of other languages and texts provided the necessary linguistic data is available.

Thank you to Professor Winther-Nielsen for inviting me.

First Week of New Vocab Site

2017-08-05

Last week I launched a site for Greek vocabulary. Here's how the first week has gone.

Over time http://vocab.oxlos.org/ will contain a variety of tools for learning and assessing Greek vocabulary. As mentioned in my blog post a week ago, I'm starting with some experiments based on the work of Paul Nation.

I'm delighted with the response so far and am very thankful to everyone who has participated. In the first week 58 people signed up, 37 people completed at least one full activity with 19 completing more than one and six people completing at least four activities.

Thanks to Seumas Macdonald, I expanded the initial New Testament vocabulary testing a couple of days ago to some Patristic vocabulary. I'll also be adding some classical Greek vocabulary soon.

As my previous post says, some of my initial research questions are:

how reliable is a test like Nation's vocabulary level test at estimating one’s NT Greek vocabulary size?
how much is frequency a factor in how likely a student is to know a word?
what other factors contribute to the likelihood a student knows a word?

I do need to continue to gather data but so far the Nation-style test seems to be working well and individual frequency bands actually do seem very good indicators of overall vocabulary size. I'll publish results with analysis over time. I'll also continue to release new activities.

As well as expanding the vocabulary to broader corpora and other parts of speech besides nouns, I also want to explore the impact of English cognates and relatedness between lexemes due to derivation. I'll also be adding some additional activity types based on the work of other vocabulary acquisition researchers such as Schmitt and Meara.

Thanks again to everyone who has participated so far and please continue to do so (and share a link to the site with Greek students, particularly those at a less-advanced level).

A Tour of Greek Morphology: Part 11

2017-08-03

Part eleven of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In part 10, we looked at some new active forms. Now it's time to look at the corresponding middle forms.

In the middle forms, there is no change in the vowel and so it doesn't need to be included in the distinguisher. In this sense, we really only have one distinguisher paradigm for all these forms in the middle.

However, if we were contrasting against the active forms as well, we could identify a PM-6, PM-7, PM-8, and PM-9 paired up with PA-6, PA-7, PA-8, PA-9:

	PM-6	PM-7	PM-8	PM-9
INF	Xσθαι	Xεσθαι	Xοσθαι	Xασθαι
1SG	Xμαι	Xεμαι	Xομαι	Xαμαι
2SG	Xσαι	Xεσαι	Xοσαι	Xασαι
3SG	Xται	Xεται	Xοται	Xαται
1PL	Xμεθα	Xέμεθα	Xόμεθα	Xάμεθα
2PL	Xσθε	Xεσθε	Xοσθε	Xασθε
3PL	Xνται	Xενται	Xονται	Xανται

But the common endings for the μι verbs are very clear. Here they are alongside our previously reconstructed endings for the previous middle paradigms:

INF	-σθαι	ε σθαι
1SG	-μαι	ο μαι
2SG	-σαι	ε σαι > ῃ
3SG	-ται	ε ται
1PL	-μεθα	ο μεθα
2PL	-σθε	ε σθε
3PL	-νται	ο νται

This not only provides clear support for the ε+σαι reconstruction of the ῃ MID 2SG form but also makes clear how the ω verbs (both barytone and circumflex) use the same endings as the μι verbs but with the ε/ο vowel (the so-called thematic vowel) attached to the stem before the ending. In the middle, this is the only difference (slightly obscured when ῃ is used in the 2SG).
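To make the "same endings plus a thematic vowel" point concrete, here is a small Python sketch. The rule it encodes (ο before an ending that starts with μ or ν, ε otherwise) is simply read off the seven cells of the table above, and the names are hypothetical. It produces the uncontracted reconstructions, so the 2SG comes out as εσαι rather than ῃ.

```python
# Middle endings of the μι verbs, as listed in the table above.
MI_MIDDLE_ENDINGS = {
    "INF": "σθαι",
    "1SG": "μαι",
    "2SG": "σαι",
    "3SG": "ται",
    "1PL": "μεθα",
    "2PL": "σθε",
    "3PL": "νται",
}


def thematic_vowel(ending):
    """ο before an ending starting with μ or ν, otherwise ε (pattern taken from the table)."""
    return "ο" if ending[0] in "μν" else "ε"


def omega_middle_ending(cell):
    """Reconstructed (uncontracted, unaccented) middle ending of an ω verb."""
    ending = MI_MIDDLE_ENDINGS[cell]
    return thematic_vowel(ending) + ending


for cell in MI_MIDDLE_ENDINGS:
    print(cell, "-" + MI_MIDDLE_ENDINGS[cell], "-" + omega_middle_ending(cell))
```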

As mentioned in part 9, there are some tantalising patterns here: the αι in 5 out of 7 cells; the μ/σ/τ in the 1st/2nd/3rd person.

The appearance of μι in the ACT 1SG is particularly interesting because we now have a μι/μαι contrast in the 1SG between active and middle which exactly mirrors the οντι/ονται contrast in the 3PL.

One might well question why the active 2SG and 3SG forms don't end in σι and τι to mirror σαι and ται. Or why the active infinitive isn't σθι. Or why the 1PL and 2PL have only a vague relationship between the active and middle. And we still have the question of where the alpha in the ACT 3PL ασι(ν) ending comes from. We'll touch on some of these questions in the next post and we will reveal some more historical and dialectal patterns.

But it is again worth reiterating that the primary role of a distinguisher is not to be decomposable but merely to discriminate meaning. That there are patterns between the distinguishers at all is not a fundamental requirement of the role they play in conveying information. There may be historical reasons for the patterns (as we've already seen) and learnability pressures that favour them (or even conspire to introduce them over time) but we should not expect them and therefore view their absence as any kind of defect or irregularity.

A Tour of Greek Morphology: Part 10

2017-08-02

Part ten of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In previous posts we've explored five distinct active and middle paradigms in the present indicative and infinitive.

There are still a number of inflectional classes in the present we haven't covered yet and we'll introduce a few more active forms in this post.

INF	δεικνύναι †	τιθέναι	διδόναι	-ιστάναι †
1SG	δείκνυμι	τίθημι	δίδωμι	-ίστημι
2SG	δείκνυς †	τίθης	-δίδως	ἵστης †
3SG	δείκνυσι(ν)	τίθησι(ν)	δίδωσι(ν)	-ίστησι(ν)
1PL	δείκνυμεν	-τίθεμεν	δίδομεν	ἵσταμεν †
2PL	δείκνυτε	τίθετε	δίδοτε	ἵστατε †
3PL	δεικνύασι(ν)	τιθέασι(ν)	διδόασι(ν)	ἱστᾶσι(ν)

In the above table, italics indicates the form does not appear in the NT but the cell is filled from elsewhere; a preceding hyphen indicates the NT only contains the form with a preverb; and † indicates the NT has another form from one of the inflectional classes we've already seen (more on that later).

It is worth noting that there are very few verbs that follow these paradigms but they are very common. In a future post, we'll look at the frequencies in more detail.

Let's start with the distinguishers (removing the common elements in each column):

	PA-6	PA-7	PA-8	PA-9
INF	Xναι	Xέναι	Xόναι	Xάναι
1SG	Xμι	Xημι	Xωμι	Xημι
2SG	Xς	Xης	Xως	Xης
3SG	Xσι(ν)	Xησι(ν)	Xωσι(ν)	Xησι(ν)
1PL	Xμεν	Xεμεν	Xομεν	Xαμεν
2PL	Xτε	Xετε	Xοτε	Xατε
3PL	Xασι(ν)	Xέασι(ν)	Xόασι(ν)	Xᾶσι(ν)

At this point, the relationship between PA-6 and each of PA-7, PA-8, PA-9 seems to mirror that between PA-1 and each of PA-2, PA-3, PA-4 respectively. This is especially evident in the infinitive and plurals, where (-, ε, ο, α) is to (PA-6, PA-7, PA-8, PA-9) as it is to (PA-1, PA-2, PA-3, PA-4).

If we isolate just the common endings (recurring horizontally) and place them alongside the endings we reconstructed in part 9, we get:

INF	-ναι	ε εν
1SG	-μι	ω -
2SG	-ς	ε ις
3SG	-σι(ν)	ε ι
1PL	-μεν	ο μεν
2PL	-τε	ε τε
3PL	-ασι(ν)	ο ντι > ουσι(ν)

Notice that:

thematic vowels seem to be entirely missing
the 3PL has an alpha, though
some endings seem identical except for the lack of thematic vowel (1PL and 2PL)
some are close (2SG and 3PL)
some are not so close (INF and 3SG)
but now the 3SG and 3PL are almost identical to each other in these new paradigms
the 1SG seems completely unrelated

Because of the lack of thematic vowels (seen most strikingly in the 1PL and 2PL forms), these types of verbs are often called athematic verbs. Because of the completely different ending μι in the 1SG, they are also often called μι verbs. They could be called ναι verbs, but I'm not aware of anyone who does that. Those three things are the most obvious contrasts, though.

When we look back at the full forms, we also notice:

the vowel preceding the endings is different in the singular and the plural
ἱστᾶσι(ν) is accented in a way that suggests a contraction, probably from αα which makes sense given the other plural forms.
έα and όα haven't contracted in the 3PL (and note if they did, they would be identical to the 3SG in PA-7 and PA-8)

It is as if the stems are τιθη, διδω, and ἱστη in the singular and τιθε, διδο, and ἱστα in the infinitive and plural. This is noteworthy for at least three reasons.

Firstly, it's the first time we've seen a contrast that only indicates number and not person.

Secondly, it's not (just) a different ending indicating the number but a change in the vowel.

And thirdly, it's redundant as the ending alone still conveys number.

On the surface, it appears that δεικνυ keeps its vowel the same although length is not clear yet.

It is important to note that, unlike the circumflex verbs PA-2 through PA-5 which, as we have shown, all have the same endings (as each other and as PA-1), PA-6 through PA-9 have a new set of common endings distinct from those of PA-1 through PA-5 (with some overlap). The paradigms cannot be explained merely as stems interacting differently with the same endings.

We will pick up this point again soon, but first (in the next post), we'll look at the middle forms of our new verbs.

NT Book Similarity by Jaccard Distance of Lemma Sets

2017-07-29

I was thinking about vocabulary differences between books of the New Testament and decided to see what happens when you do a hierarchical clustering analysis of NT books using the Jaccard distance of their lemma sets.

UPDATE: I'm now convinced much (although not all) of this is due to length effects. If you think about it, the Jaccard distance between a large set and a small set is going to be large just by virtue of the large set having more in it than the small set. This will naturally group the non-letters together, the short letters together, Romans and the Corinthian letters together and so on. So until I come up with a way to correct Jaccard distance for text length, I'd take this post with a huge grain of salt.

This is some old-school stylometry but the results are still pretty interesting. For each book, I calculated the set of lemmas and then, for each pair of books, calculated the Jaccard coefficient (the size of the intersection of the two sets divided by the size of their union). The Jaccard distance is one minus that coefficient.
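Here is a minimal Python sketch of the calculation, using plain sets. The function names are hypothetical and the "lemma sets" are just ranges of integers standing in for real lemmas. The second half illustrates the length effect described in the update above: when a small set is entirely contained in a much larger one, the distance is still large.

```python
def jaccard_coefficient(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """One minus the Jaccard coefficient."""
    return 1.0 - jaccard_coefficient(a, b)


# Stand-in data: a short "book" whose 100 lemmas all occur in a long "book"
# with 1000 lemmas.
short_book = set(range(100))
long_book = set(range(1000))

print(jaccard_distance(short_book, long_book))  # large, despite full overlap
print(jaccard_distance(short_book, short_book))  # identical sets
```

A matrix of such pairwise distances is the kind of input a hierarchical clustering routine would then work from.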

I then did a cluster analysis using Ward's criterion and rendered the results as a dendrogram:

Notice that the first split is between the letters and non-letters.

Within the non-letters, John's Gospel and Revelation cluster together as do Acts and the Synoptics. The Synoptics cluster with each other more than they do with Acts. Matthew and Mark cluster together more than they do with Luke.

The highest division in the letters is between:

the non-pastoral Pauline epistles plus Hebrews, James and 1 Peter
the pastorals plus the rest of the general epistles (2 Peter, the Johannine epistles and Jude)

That first division of letters further clusters into:

Galatians, Ephesians, Philippians, Colossians, 1 Thessalonians, 2 Thessalonians
Romans, 1 Corinthians, 2 Corinthians, Hebrews, James and 1 Peter

Ephesians and Colossians cluster together, the two epistles to the Thessalonians cluster together, and Galatians and Philippians cluster together.

Romans, 1 Corinthians, and 2 Corinthians cluster (although 1 Corinthians clusters closer to Romans than to 2 Corinthians). James and 1 Peter cluster. Hebrews is in the same overall group but clusters closer to the Romans/Corinthian subgroup.

The second division of letters clusters into:

Philemon, 2 John, 3 John
Titus, 1 Timothy, 2 Timothy
Jude, 1 John, 2 Peter

with the second and third clustering slightly closer than the first.

2 John and 3 John cluster much closer to each other than to Philemon. The epistles to Timothy cluster slightly closer together than they do to Titus. 1 John and 2 Peter cluster slightly closer together than they do with Jude.

I haven't thought about length effects here but they may influence the clustering of very short books together (and possibly very long books). A lot of the clustering does follow similar lengths so it's definitely worth thinking more about.

Of course, there's nothing new about this kind of analysis. As I said at the start, it's old school—the sort of thing I can imagine being published in a "humanities computing" journal in the 80s. But it's still interesting. And it might be even more interesting to apply to finer-grained text divisions and/or with properties other than lemmas.

New Site for Vocabulary Experiments

2017-07-29

I've put together a new little site to host various activities to research vocabulary knowledge and acquisition in the context of Ancient and Biblical Greek.

The new site is at:

http://vocab.oxlos.org/

While eventually there will be a range of activity types and some spaced repetition practice, there is just a single activity type at the moment, based on work by vocabulary acquisition expert Paul Nation in the 1980s and 1990s.

It is a receptive vocabulary test, which means it focuses on whether you can understand a word when you come across it in text rather than whether you can produce the word in the right context. Each step of the activity asks you to select a word that best matches a given gloss, taken over a list of word-gloss pairs with a range of different frequencies.

Nation's original tests (for English as a Foreign Language learners) used word lists split into frequency bands like the top 2000, top 3000, top 5000, and so on.

I took the common nouns in the Greek New Testament and similarly broke them into frequency bands. Rather than have identically-sized buckets, I went by frequency cutoffs:

bucket 1 : 32 or more times
bucket 2 : 16 to 31 times
bucket 3 : 4 to 15 times
bucket 4 : 2 or 3 times
bucket 5 : 1 time

(Whether these are appropriate buckets will be assessed as part of this work.)

From each bucket, 36 word-gloss pairs were randomly chosen (the glosses coming from Dodson's public domain glosses of NT lexemes). Of those 36, only 18 are tested, with the other 18 used as distractors. This follows Nation's approach.

So each activity of this type involves 90 items. I've so far generated two activities but it's easy for me to generate more over time. I'll also expand the items to other parts of speech and a larger Greek corpus (including Classical). As long as I have frequency information and glosses, I can easily generate activities.
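The bucketing and item selection described above could be sketched roughly as follows in Python. This is an illustration under stated assumptions rather than the site's actual code: the function names are hypothetical, the word-gloss pairs are placeholders, and it assumes every bucket has at least 36 pairs available.

```python
import random


def bucket_for(count):
    """Assign a lemma to a bucket from its number of occurrences in the GNT."""
    if count >= 32:
        return 1
    if count >= 16:
        return 2
    if count >= 4:
        return 3
    if count >= 2:
        return 4
    if count == 1:
        return 5
    raise ValueError("count must be at least 1")


def build_activity(pairs_by_bucket, rng, per_bucket=36, tested=18):
    """Pick per_bucket pairs from each bucket; the first `tested` are test items,
    the remainder serve only as distractors."""
    activity = {}
    for bucket, pairs in pairs_by_bucket.items():
        chosen = rng.sample(pairs, per_bucket)
        activity[bucket] = {"tested": chosen[:tested], "distractors": chosen[tested:]}
    return activity


# Placeholder data: 50 made-up word-gloss pairs per bucket.
pairs_by_bucket = {
    b: [(f"word{b}_{i}", f"gloss{b}_{i}") for i in range(50)] for b in range(1, 6)
}
activity = build_activity(pairs_by_bucket, random.Random(0))
total_tested = sum(len(items["tested"]) for items in activity.values())
print(total_tested)  # 5 buckets x 18 tested items
```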

I also have some other types of activities I'd like to implement, based on the research literature. I'd like to roll out a new activity once every couple of weeks or so.

There are some fairly basic, fundamental questions that I'll be able to start to answer once I get more people trying the initial activities:

how reliable is a test like this at estimating one's NT Greek vocabulary size?
how much is frequency a factor in how likely a student is to know a word?
what other factors contribute to the likelihood a student knows a word?

Future activities will be able to explore some of this in more detail such as the impact of English cognates or relatedness between lexemes due to derivation, etc.

Ultimately this is all input into producing better learning tools. It will feed directly into the adaptive online reading environment I'm currently working on.

Thank you to everyone who has tried the activities so far and PLEASE continue to do more activities as I roll them out and help spread the word. The more people of varying ability I get doing these activities, the richer and more insightful the data will be.

I'll share those insights on this blog as things progress.

A Tour of Greek Morphology: Part 9

2017-07-23

Part nine of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In part 8 we saw, amongst other things, that the present active infinitive has a spurious diphthong ει from ε+ε whereas the present active second and third person singulars have a ει that is a true ε+ι diphthong.

This somewhat justifies our observation of the ις and ι pattern in the second and third person singulars across all the present actives we've seen so far.

If we show the "inert" part of the endings separated from the vowel that interacts with a preceding stem vowel to form the circumflex verbs, we get something like this:

	active	middle
INF	ε ε ν	ε σθαι
1SG	ω -	ο μαι
2SG	ε ις	η ι (sometimes ε ι)
3SG	ε ι	ε ται
1PL	ο μεν	ο μεθα
2PL	ε τε	ε σθε
3PL	ου σι(ν)	ο νται

You can see the predominance of initial ε and ο with three exceptions:

the ω of the ACT 1SG
the ου of the ACT 3PL
the η of the MID 2SG

We now know to ask the question: is ου in ACT 3PL a spurious diphthong (from ο+ο) or a true diphthong (from ο+υ)? If υ works the same way as ι in our contraction rules, it must be a spurious diphthong.

There's additional evidence for this:

In the Western Greek dialects (like Doric) we find -οντι
It was not uncommon for Attic-Ionic to have σι for τι in other dialects (we'll encounter more examples later)
Dentals like ν drop out in Attic-Ionic when followed by σ and this generally causes the preceding vowel to lengthen (what is called compensatory lengthening)

So it seems our ουσι(ν) was originally from the -οντι preserved in Doric.

This introduces interesting parallels with the -ονται in the middle.

What about the ῃ in the MID 2SG? We don't need to go to another dialect to see traces of what's going on. In the NT we have the PM-4 circumflex verb:

INF	
1SG	καυχῶμαι
2SG	καυχᾶσαι
3SG	
1PL	καυχώμεθα
2PL	καυχᾶσθε
3PL	καυχῶνται

with ᾶσαι for ᾷ. The ᾶσαι can be explained as the stem vowel α interacting with the ending εσαι. The ᾷ can be explained simply through the σ dropping out (and similarly the ῃ in the PM-1 and PM-2 and so on) plus our contraction rules.

Interestingly, later Greek restored the uncontracted ending and we find it again in Modern Greek.

And so we have the reconstructed endings:

	active	middle
INF	ε εν	ε σθαι
1SG	ω -	ο μαι
2SG	ε ις	ε σαι > ῃ
3SG	ε ι	ε ται
1PL	ο μεν	ο μεθα
2PL	ε τε	ε σθε
3PL	ο ντι > ουσι(ν)	ο νται

There are some tantalising patterns here, especially in the middle: the αι in 5 out of 7 cells; the μ/σ/τ in the 1st/2nd/3rd person.

As usual I want to emphasize that the reconstructed forms in this table help explain things historically but should not necessarily be taken as an indication of a process that went on synchronically in the minds of native speakers. I'm not aware of any evidence that native speakers would have, for example, thought of ουσι as being an underlying οντι, or ῃ as being an underlying εσαι.

We haven't yet explained what's going on with the ACT 1SG nor why ει would have been an alternative for ῃ in the MID 2SG.

But other than the ACT 1SG, all other endings start with either an ε or ο. We'll talk more about this later (including why this vowel is called the thematic vowel) but note that which of the two vowels is used is completely predictable by what follows.

If the following segment is nasal (μ or ν), the vowel is ο. If the following segment is ε, ι, σ, or τ, the vowel is ε. Most descriptions consider the ε the default and the nasal context leading to ο being the exception. But we could also look for features that ε, ι, σ, and τ have that μ and ν don't (other than just being NON-nasal).
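To make the predictability concrete, here is a minimal Python sketch of that rule, treating ε as the default and ο as the vowel before a nasal. The function name and the list of "inert" ending parts (taken from the reconstructed table above, leaving out the unexplained ACT 1SG) are just for illustration.

```python
NASALS = {"μ", "ν"}


def thematic_vowel(following):
    """Return ο before a nasal (μ or ν) and the default ε otherwise.

    `following` is the non-empty material that comes after the vowel.
    """
    return "ο" if following[0] in NASALS else "ε"


# The parts of the reconstructed endings that follow the thematic vowel
# (active then middle), omitting the ACT 1SG.
inert = [
    "εν", "ις", "ι", "μεν", "τε", "ντι",
    "σθαι", "μαι", "σαι", "ται", "μεθα", "σθε", "νται",
]
endings = [thematic_vowel(part) + part for part in inert]
```

Running this rebuilds the reconstructed endings (εεν, εις, ει, ομεν, ετε, οντι and so on) from the inert parts alone, which is all "completely predictable" means here.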

A Tour of Greek Morphology: Part 8

2017-07-17

Part eight of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

So far, just for the active, we've suggested the following contraction rules.

έει > εῖ
έω > ῶ
έε > εῖ
έο > οῦ
έου > οῦ
άω > ῶ
άε > ᾶ
άει > ᾷ (in the indicative) and ᾶ (in the infinitive)
άο > ῶ
άου > ῶ
όω > ῶ
όε > οῦ
όει > οῖ (in the indicative) and οῦ (in the infinitive)
όο > οῦ
όου > οῦ
ήω > ῶ
ήε > ῆ
ήει > ῇ (in the indicative) and ῆ (in the infinitive)
ήο > ῶ
ήου > ῶ

In this post I want to explain why these aren't just an arbitrary set of sound changes and that they are really quite systematic. We'll say a little bit about Greek orthography and build a model using some simple phonological features that explains the core contraction rules quite compactly.

Before I do that, though, I want to emphasize again that I'm not suggesting these "rules" need to be learned by the language learner. They are historical explanations for the spelling of circumflex verb endings in certain dialects and I'm discussing them to give people a flavour for linguistic description. The best way to learn the circumflex verbs is to produce and read them in context. It really doesn't take long to just intuitively know that ἀγαπᾷς is a second person singular or that ἀγαπᾶν is an infinitive. You don't need to know the contraction rules or how to model them with phonological features.

But if you're interested in WHY the forms are ἀγαπᾷς and ἀγαπᾶν (including why one has an iota subscript and the other doesn't) keep reading!

Orthography

You've probably been told that ε and ο are always short vowels. As far as the LETTERS themselves go, in our standard Greek orthography, that is true. But a long ε and a long ο existed as sounds in Classical Greek and earlier. Different dialects wrote these differently. Some just wrote Ε and Ο regardless of whether they were long or short. This is similar to Α, Ι, or Υ, which could be used for both the short and long variants. The Ionians, however, used the digraphs ΕΙ and ΟΥ for the long-Ε and long-Ο respectively. At the time, this was NOT the same sound as the diphthongs ΕΙ and ΟΥ, despite being written the same. It is likely that the long ε and long ο were pronounced with the tongue a little higher up (hence closer to the way ι and υ were pronounced) to reduce any confusion with η and ω which were pronounced with a lower tongue, closer to α. The digraphs ΕΙ and ΟΥ, when used for the long ε and long ο, are sometimes called "spurious diphthongs" because they weren't actually diphthongs at all; they were long monophthongs.

The Greeks started to standardize on the Ionian spelling and, in 403 BC, Athens officially adopted the Ionian spelling.

This purely orthographic convention explains why εε > ει and οο > ου. That doesn't mean ALL occurrences of ει are long ε or all occurrences of ου are long ο. ει and ου CAN be true diphthongs, but when they come from ε+ε or ο+ο respectively, they are just long monophthongs.

Now as already mentioned, both short and long α were just written as α and so αα > α is a similarly straightforward contraction (the result being a long α). If you have a circumflex or an iota subscript, the α must have been long.

Basic Contractions

So the diagonals of this contraction table make sense:

	ε	ο	α
ε	ει	ου	η
ο	ου	ου	ω
α	α	ω	α

Now ε+ο and ο+ε both result in a long ο (written ου). The order doesn't matter. The ο wins out over the ε and the ε assimilates to ο resulting in the equivalent to ο+ο.

Both α+ο and ο+α result in ω and again order doesn't matter. At the time of the spelling standardization, ω was effectively in between α and ο so this makes sense.

Note, however, that α+ε and ε+α don't behave the same way in our table above. α+ε results in α but ε+α results in η. We might expect both to be η given how α+ο and ο+α behaved. It seems that order matters in some cases but not others.

Phonological Features

One way we can model all this is by assigning each of the vowels binary features of low, back, and round and making generalisations about those categories.

In other words:

	low	back	round
ε	-	-	-
ο	-	+	+
η	+	-	-
ω	+	+	+
α	+	+	-

Note that not all combinations are possible and +round implies +back.

(We haven't included ι or υ here as they don't play a part in this analysis.)

Now all the ε, ο, α contractions can be explained in terms of assimilation of +low and +round and partial assimilation of +back, as follows:

the output is +low if either input vowel is +low
the output is +round if either input vowel is +round
the output is +back if the first input vowel is +back
the output is +back if it is +round

The rules also explain why any vowel + ω goes to ω. In fact, if you work them through, these simple rules explain all 23 contractions in our list at the top of the post (and more that haven't come into play yet) with just one additional rule:

if you have more than two vowels, the contraction is left associative
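If you'd like to work the rules through mechanically, here is a small Python sketch of them. It is only an illustration under some assumptions: the names are mine, the output is always spelled as a long vowel (so a long ε comes out as ει and a long ο as ου), and spurious ει and ου on the input side should be passed in as plain ε and ο. The iota of a true diphthong is outside the model, just as ι and υ are left out of the feature table.

```python
from functools import reduce

# (low, back, round) values from the feature table above
FEATURES = {
    "ε": (False, False, False),
    "ο": (False, True, True),
    "η": (True, False, False),
    "ω": (True, True, True),
    "α": (True, True, False),
}

# How the long vowel with each feature bundle is spelled
LONG_SPELLING = {
    (False, False, False): "ει",
    (False, True, True): "ου",
    (True, False, False): "η",
    (True, True, True): "ω",
    (True, True, False): "α",
}


def combine(first, second):
    """Apply the four assimilation rules to two feature bundles."""
    low = first[0] or second[0]    # +low if either input is +low
    rnd = first[2] or second[2]    # +round if either input is +round
    back = first[1] or rnd         # +back if the first is +back, or if +round
    return (low, back, rnd)


def contract(*vowels):
    """Contract two or more vowels, working left to right."""
    bundles = [FEATURES[v] for v in vowels]
    return LONG_SPELLING[reduce(combine, bundles)]


# Rebuild the 3x3 table from the Basic Contractions section.
table = {(a, b): contract(a, b) for a in "εοα" for b in "εοα"}
```

For example, `contract("ε", "α")` gives η while `contract("α", "ε")` gives α, which is exactly the asymmetry noted earlier, and `contract("ο", "ε", "ε")` gives ου thanks to the left-to-right rule.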

There are likely other solutions with other features and rules but my analysis roughly follows that of Sommerstein in The Sound Pattern of Ancient Greek, that of Bubeník in The Phonological Interpretation of Ancient Greek: A Pandialectal Analysis (which also considers differences in things like the Doric dialect), and apparently that of Lejeune in Phonétique historique du mycénien et du grec ancien (on which Bubeník's is based). This style of analysis is typical of the early second half of the twentieth century so I'm not claiming it's in any way state-of-the-art. But it demonstrates that the contraction rules are very systematic.

The Difference in the Infinitive vs Indicative

There is one final thing we haven't explicitly addressed but which is fully explained by these simple rules on features: why is άει sometimes ᾷ and sometimes ᾶ (and likewise why is όει sometimes οῖ and sometimes οῦ)?

The answer is simply that if the ει is a spurious diphthong (i.e. actually just a long εε) then our simple rules will result in long ᾶ but if it's a true diphthong, the result is long α + ι which is written ᾷ. Similarly in the case of όει, a spurious diphthong will result in οῦ (from οεε > οοε > οο > ου) but a true diphthong in οῖ (οει > οοι > οι).

What this tells us is that the ει in the ειν ending in the infinitive is a spurious diphthong but the ει in εις and ει in the second and third person singular actives are true diphthongs.

A Man Walks Into A Bar

2017-07-16

I’ve thought for a while that “A man walks into a bar” jokes are a great example of how definiteness works in English. I mentioned this to Jonathan Robie in Cambridge and he seemed to like the example too so I thought I’d share it more broadly.

Consider the standard joke form:

A man walks into a bar. The bartender says X. The man says Y.

Notice this has two indefinite articles and two definite articles. When do we use the indefinite article and when do we use the definite article?

In our sentence above, we’ve neither been introduced to the man nor the bar before. And so we use the indefinite article.

We can’t say “* The man walks into a bar” unless he’s been introduced before. Likewise we can’t say “* the bar” unless the bar’s been introduced before. For example,

Chris is one crazy guy! The man walks into a bar...

is fine if we take the man to be Chris. Similarly,

You know that bar on 52nd Street? A man walks into the bar...

works if the bar in the joke is the one on 52nd Street.

If we were telling a second joke, we could use the to indicate the man (or the bar) was the same but notice we’d have to use something like another and NOT a for introducing a second bar (or man):

Later, the man walks into another bar...

Later, another man walks into the bar...

Notice in our original joke, the third sentence starts “The man”. This makes sense because that man has already been introduced. We wouldn’t say “* The man walks into a bar. The bartender says X. A man says Y.” Even if it were a different man, we’d probably use something like “Another man”.

But notice we did use the with the bartender even though he or she has NOT been introduced yet. The reason is our frame for a bar is that it has a bartender. The existence of the bartender has effectively been set up by us having a bar and that’s the bartender we want to reference so it’s not a completely new reference. Saying “* A man walks into a bar. A bartender says X” would be odd. Notice also that even if the bartender is a man, the following “The man says Y” is unambiguous.

Even if there were more than one bartender (certainly possible, although not prototypical for the frame) we’d have to say something like “One of the bartenders says X”.

This can be demonstrated with an example where we EXPECT multiple instances.

A man walks into a classroom. One of the students says X.

In this case, it would be odd to say “* A student says X” and even odder to say “* the student says X”. We want definiteness (because the classroom frame has already established the likelihood of a group of students and that’s the group we want to reference a member of) but because it’s a group, we need to say “one of” to call out an individual.

“One of the” calls out an indefinite member of a definite group.

A Tour of Greek Morphology: Part 7

2017-07-14

Part seven of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

κλῶμεν in 1Co 10.16 is clearly ACT 1PL but we can't tell from just that if it's a PA-4 or PA-5. In authors like Galen and Hippocrates we find the MID 3SG κλᾶται which we've called PM-4, which strongly suggests it's a PA-4 in the active.

If that's the case, we'd expect an ACT 2SG of κλᾷς, an ACT 3SG of κλᾷ, and an ACT 3PL of κλῶσι(ν).

But in various authors we can find the respective forms κλάεις, κλάει, and κλάουσι.

This suggests that α plays the same role in PA-4 and PM-4 as ε did in PA-2 and PM-2.

For this to work,

άω > ῶ
άε > ᾶ
άει > ᾷ (in the indicative) and ᾶ (in the infinitive)
άο > ῶ
άου > ῶ

We'll discuss the άει issue in the next post.

What about PA-3 and PM-3? We're basically trying to solve for x given:

xω > ω
xε > ου
xει > οι (in the indicative) and ου (in the infinitive)
xο > ου
xου > ου

It's difficult to find examples in the present verb forms of other dialects and texts, but even in the New Testament it's not difficult to find cases where οε and οο are alternatively spelled ου (e.g. ἀγαθοεργ- in 1 Tim and ἀγαθουργ- in Acts). This makes ο a possible candidate for x and note, in particular, the ACT 3SG forms have so far all been quite transparent in what vowel ends the stem.

So we appear to have:

όω > ῶ
όε > οῦ
όει > οῖ (in the indicative) and οῦ (in the infinitive)
όο > οῦ
όου > οῦ

And although a proper argument will get us quite far afield (maybe one day), it turns out PA-5 and PM-5 can be explained by:

ήω > ῶ
ήε > ῆ
ήει > ῇ (in the indicative) and ῆ (in the infinitive)
ήο > ῶ
ήου > ῶ

So, in summary, the circumflex verbs can be explained through a historical interaction (generally referred to as a contraction) between a vowel at the end of the original stem and the vowel at the start of what is added to it.

PA-2 and PM-2 come from a stem originally ending in έ
PA-3 and PM-3 come from a stem originally ending in ό
PA-4 and PM-4 come from a stem originally ending in ά
PA-5 and PM-5 come from a stem originally ending in ή

Often circumflex verbs are referred to as contract verbs but, while contraction is indeed the historical explanation for how the circumflex verbs got their forms, I like the name circumflex verbs because it describes an actual synchronic characteristic of the verb forms rather than an explanation of how they happened to get like that. It's interesting that ancient grammarians like Dionysius Thrax called them perispomenon verbs (the term for words with a circumflex on the last syllable) and called PA-1/PM-1 verbs barytone verbs (the term for words with NO ACCENT on the last syllable).

In the next post, we'll explore why the contraction rules are not random but, in fact, are quite systematic. We'll also touch on why the contractions don't seem to work quite the same way in the infinitive.

A Tour of Greek Morphology: Part 6

2017-07-11

Part six of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

Every form we've seen of λύω so far starts with λυ, unchanged except for accent. Also, all the forms that start with λυ (or λύ) have been forms of λύω.

Every form we've seen so far that's active first person plural ends with μεν. Also, all the forms that end with μεν have been active first person plural.

Put another way, the λύ in λύομεν has nothing to do with being active first person plural and the μεν in λύομεν has nothing to do with being a form of λύ (at least based on every paradigm we've seen so far).

What about the ο in between them? It cannot (at least at the moment) be said to only depend on the fact we have a form of λύω nor can it be said to only depend on the fact we have an active first person plural form. The vowel seems to depend BOTH on the lexical item AND the morphosyntactic properties of voice, person, and number.

Similarly with ποιεῖτε. The initial ποι indicates and only indicates the lexical item. The final τε indicates and only indicates the active second person plural. The fact we have εῖ rather than ο (or ε or οῦ or any other vowel) is because of BOTH the lexical item and the morphosyntactic properties.

What is happening here becomes very clear when we look at some older texts or texts in more conservative dialects. For example, in Herodotus, written in the Ionic dialect, we don't find ποιεῖτε but instead ποιέετε. In fact, here's what we find:

ACT INF	ποιέειν
ACT 1SG	ποιέω
ACT 2SG	ποιέεις
ACT 3SG	ποιέει
ACT 1PL
ACT 2PL	ποιέετε
ACT 3PL

There are a couple of things about this that are remarkable. Firstly, if we split off the common part (now ποιέ rather than ποι) then our distinguishers are all IDENTICAL to those of λύω. Secondly, this restores the accent placement to be properly recessive.

Our ποιῶ and ποιεῖτε are so accented (and not *ποίω or *ποίειτε) because the accent has remained on the same mora (relative to the start) as the older form.

The vowels are thus explained by noting that historically:

έει > εῖ
έω > ῶ
έε > εῖ

Even without finding the necessary forms in Herodotus, we can infer (assuming the ποιέ is consistent and the distinguishers are those of λύω) the forms missing above and hence the following additional historical vowel changes:

έο > οῦ
έου > οῦ

And making the same assumption about the middle forms add:

έῃ > ῇ

All the PA-2 and PM-2 endings can now be explained by:

the verb-specific common part (the stem) ending in ε
the voice / person / number endings originally being identical to those of λύω
the six historical vowel changes listed (referred to as contractions)
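As a quick illustration of the third point, here is a Python sketch that applies the six listed changes to an uncontracted form as plain string replacement. This is only a toy under some assumptions: the function name is made up, it only knows the accented-ε sequences listed in this post (so a form with the accent elsewhere is returned unchanged), and it normalizes to NFC so that differently encoded accents compare equal.

```python
import unicodedata


def nfc(text):
    """Normalize so that tonos/oxia variants of an accent compare equal."""
    return unicodedata.normalize("NFC", text)


# The six historical vowel changes listed in this post
CONTRACTIONS = {
    nfc("έει"): nfc("εῖ"),
    nfc("έω"): nfc("ῶ"),
    nfc("έε"): nfc("εῖ"),
    nfc("έο"): nfc("οῦ"),
    nfc("έου"): nfc("οῦ"),
    nfc("έῃ"): nfc("ῇ"),
}


def contract_form(uncontracted):
    """Replace the first matching vowel sequence, trying longer ones first."""
    form = nfc(uncontracted)
    for old in sorted(CONTRACTIONS, key=len, reverse=True):
        if old in form:
            return form.replace(old, CONTRACTIONS[old], 1)
    return form


herodotus = ["ποιέειν", "ποιέω", "ποιέεις", "ποιέει", "ποιέετε"]
contracted = [contract_form(f) for f in herodotus]
```

Trying the longer sequences first matters: ποιέεις has to match έει rather than έε, otherwise the iota would be left stranded.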

In the tour's next post, we'll see if we can similarly explain the other forms we've seen. Then, in a subsequent post, we'll come back to these vowel changes and see what's systematic about them.

I want to close by emphasizing that I am only trying to describe HOW the circumflex verbs came about, not suggest anything about how native speakers processed or generated the contracted forms. As an analogy: it might be interesting to learn why the English words foot and feet are spelled the way they are relative to how they are pronounced but that explanation doesn't bear much, if any, relation to what's going on in the minds of native speakers nor is it necessarily of any use to people learning English as a second language. I'll touch on that again in a few posts' time, but you can also read my 2015 post The Dangers of Reconstructing Too Much Morphophonology.

Categories of Reader Work

2017-07-10

I sometimes get people expressing an interest in my Greek reader work or get asked about the status of my "reader" and I have to ask them to clarify which reader they mean. I thought I might do a quick post where I spell out various "reader" projects I have worked on and am working on.

My interest in tools for helping read Greek (especially, but by no means only, the New Testament) goes back at least thirteen or fourteen years. In a 2004 post copied over to this blog, I talk about algorithms for ordering vocabulary to accelerate verse coverage. It was around this time I was also working on what became Quisition, a flashcard site with spaced repetition.

In November 2005, I registered the domain readjohn.com with a view to building a site to help people learn Greek by reading through John's gospel. The reason for John was not only the simplicity of its Greek but the fact it's the one thing I had the OpenText analysis for at the time. As proof I had more than just the GNT in mind, I point out that I registered readhomer.com just two months later. I wasn't just thinking Greek either, as I registered readdante.com at the same time.

Vocabulary was just an initial part of the model of what it takes to be able to read a text. It happens to be the easiest to model because all it takes, to first approximation, is a lemmatized text. But it illustrates the basic concept: if you model what is needed to read a text and you model what a student knows, you can:

help order texts (including individual clauses or even phrases) in a way that's appropriate to the student's level
appropriately scaffold the texts with just enough information to fill in the gap in their understanding
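To make that concrete for the vocabulary-only case, here is a toy Python sketch. It is not the algorithm from my earlier posts; the function names, the passage labels, and the idea of ranking by the fraction of unknown lemmas are all simplifying assumptions for illustration.

```python
def unknown_fraction(lemmas, known):
    """Fraction of a (non-empty) passage's lemma tokens the student doesn't know."""
    return sum(1 for lemma in lemmas if lemma not in known) / len(lemmas)


def order_passages(passages, known):
    """Order passage labels from most to least readable for this student."""
    return sorted(passages, key=lambda ref: unknown_fraction(passages[ref], known))


def scaffold(lemmas, known):
    """The lemmas that would need glossing for this student."""
    return sorted(set(lemmas) - set(known))


# Toy lemmatized passages with hypothetical labels
passages = {
    "a": ["καί", "λέγω", "αὐτός"],
    "b": ["καί", "λέγω"],
    "c": ["εὑρίσκω", "ἀδελφός"],
}
known = {"καί", "λέγω"}
ordering = order_passages(passages, known)
to_gloss = scaffold(passages["a"], known)
```

Here the student would get passage "b" first (nothing unknown), then "a" with αὐτός glossed, then "c". A fuller model would swap the lemma sets for whatever else is needed to read the text.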

One thing I was experimenting with for scaffolding was inlining Greek that the student could understand (according to the ordering generated by my vocabulary algorithms) in larger text kept in English. So in the first lesson, the student might be given something like John 1.41 in this form:

He first found his own brother Simon καὶ λέγει αὐτῷ, "We have found the Messiah!"

The combination of vocabulary ordering algorithms (driven by clause-level analysis of John's gospel) with this sort of inlining I was calling a New Kind of Graded Reader and you can find a lot of posts from around March 2008 on this blog about it including this video. I subsequently did a full-length talk at BibleTech 2010. There's also a post with an extended example of the inlining approach.

That initial category of reader work is still alive and by no means abandoned; it's just taking a long time to get the analysis broadened to take into account not just vocabulary but inflectional morphology, lexical relatedness, syntactic constructions, etc. In fact, a large part of my linguistic analysis work is motivated by the reader work (which was a big theme of my BibleTech 2015 talk).

The second, somewhat independent (although still very much corpus-driven and using much of the same machine-actionable linguistic data) reader project was the semi-automated generation of more traditional print readers (the sort with rarer words glossed in footnotes and perhaps more obscure syntactic constructions or idioms commented on). You can read more about it in this post. One aim with the semi-automatic generation of printed readers was being able to customize them quite easily to a particular level. The scaffolding wouldn't necessarily be adaptive but it could be personalized.

Again this is still of great interest to me and motivates a lot of work on machine-actionable data. While I might experiment with approaches other than using TeX, I still want to do more in this area, most likely collaborating with people interested in particular texts (and able to help work on glosses and syntactic commentary).

A third category of work is a loose collection of various little prototypes over the years for ways of presenting information in a reader. This includes things like interlinears, colour-coded texts, various ways of showing dependency relations, etc. Brian Rosner and I consolidated these prototypes in a framework for generating static HTML files in https://github.com/jtauber/online-reader. There are various online demos linked in the README.

That repo did initially include a dynamic reading environment written in Vue.js but that was broken out as the starting point for DeepReader (see below).

The fourth category of work (which goes back to my vision for readjohn.com, readhomer.com and readdante.com when I registered the domains) is an online adaptive reading environment with integrated learning tools. I talked about this at SBL 2016 in San Antonio, a Global Philology workshop in Leipzig in May, and I will be talking about it at SBL International 2017 in Berlin next month.

The idea is to integrate vocabulary and morphological drills with the reading environment so the text drives what to drill, the results of the drills help determine the text, the scaffolding needed, etc.

So the adaptive reading environment will model:

what's needed to understand an upcoming passage
what the student has already seen
what the student has inquired about
what is at an optimal recall interval
what the student is good or not so good at understanding (based on explicit assessment including meta-cognitive questions)

This is what I'm most actively working on at the moment. As with the other categories of readers, it relies heavily on linguistic resources so I'm doing a lot in that area.

From an implementation point-of-view, this is being implemented as a Vue.js-based application running in the browser talking to a range of microservices on the backend. Much of the "heavy lifting" will be done by the microservices. The generic parts of the frontend application are being broken out by Brian and me as a framework called DeepReader which could be used for all sorts of readers (even just Kindle-style EPUB readers). I'll have a lot more to say about DeepReader in the future as well as the specific application of it to building an adaptive reading environment for Greek.

So there are really four distinct categories of reader projects that I've been working on on and off for the last thirteen or fourteen years:

a "New Kind of Graded Reader"
semi-automatic generation of printed readers
framework for generating static HTML files
online adaptive reading environment with integrated learning tools

They are all related in that they build on the same linguistic data (which is where most of the effort actually goes).

Hopefully all that provides a little bit of a high-level guide to all the reading stuff talked about on this blog, on Twitter, and which is implemented in various repositories on GitHub.

I should stress none of the code is specific to the New Testament or even to Greek. I'd be happy to collaborate with anyone on producing the necessary linguistic data for other texts and other languages.

A Tour of Greek Morphology: Part 5

2017-07-06

Part five of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In part four, we introduced the circumflex verbs in the present active. Now we're going to look at their middle forms.

Here they are alongside the middle of λύω:

As you can see, the circumflex pervades except in the 1PL where the law of limitation prohibits it. This is also the one place the λύω accent is on the distinguisher.

Note also that, as was the case with the active, the forms in each row essentially have the same endings just with a vowel change.

Here are the common elements of each row of the distinguisher in both the active and middle:

	active	middle
INF	-ν	-σθαι
1SG	-	-μαι
2SG	-{ι}ς	-{ι}
3SG	-{ι}	-ται
1PL	-μεν	-μεθα
2PL	-τε	-σθε
3PL	-σι(ν)	-νται

The iota in the 2SG active and middle and the 3SG active is questionable because we're splitting a diphthong but we'll return to that in another post.

The vowels prior to this common element seem to change as follows:

if the distinguisher has a monophthong ε in λύω,
it will have ει, ου, α, η in the other paradigms
if the distinguisher has a monophthong ο in λύω,
it will have ου, ου, ω, ω in the other paradigms

This applies to the active too (although the diphthongs there are found in more cells of the λύω paradigm).

We'll explore this more in the next post.

Before we end this one, though, let's label the paradigms for our present middle distinguishers:

	PM-1	PM-2	PM-3	PM-4	PM-5
INF	Xεσθαι	Xεῖσθαι	Xοῦσθαι	Xᾶσθαι	Xῆσθαι
1SG	Xομαι	Xοῦμαι	Xοῦμαι	Xῶμαι	Xῶμαι
2SG	Xῃ or Xει	Xῇ or Xεῖ	Xοῖ	Xᾷ	Xῇ
3SG	Xεται	Xεῖται	Xοῦται	Xᾶται	Xῆται
1PL	Xόμεθα	Xούμεθα	Xούμεθα	Xώμεθα	Xώμεθα
2PL	Xεσθε	Xεῖσθε	Xοῦσθε	Xᾶσθε	Xῆσθε
3PL	Xονται	Xοῦνται	Xοῦνται	Xῶνται	Xῶνται

Notice that the 1SG, 1PL, and 3PL distinguishers are identical for PM-2 vs PM-3 and for PM-4 vs PM-5. This was similar to what we saw in the active case (although there, the 1SG was even less helpful in identifying the paradigm).

Notice also that these are exactly the rows where the distinguisher in λύω starts with an omicron.

A Tour of Greek Morphology: Part 4

2017-07-02

Part four of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In the previous part we saw that more than half of the verb lexemes in the NT appearing in the present indicative follow the exact pattern of λύω, i.e. PA-1 in the active and PM-1 in the middle. In the next few parts to this series, we're going to look at some of the verbs that do NOT.

Here's our first example, placed alongside λύω for comparison (a paradigm of paradigms again):

Look closely at each pair on a row and notice a few things:

in the infinitive, in all singulars, and in the third plural, the distinguishers are identical EXCEPT for accent
in the first and second plurals, the only other difference is ου vs ο and ει vs ε
whereas λύω never has the accent on the distinguisher, the seven forms of ποιῶ above ALWAYS do and it is always a circumflex
the accent is not strictly recessive the way it is in λύω and PA-1 verbs in general

We are going to call this new pattern PA-2.

There are many other verbs that follow the PA-2 pattern and yet others that are quite similar but with small differences.

Here are some examples placed side-by-side with λύω and ποιῶ:

It will be clearer to see the similarities and differences by just showing the distinguishers.

	PA-1	PA-2	PA-3	PA-4	PA-5
INF	Xειν	Xεῖν	Xοῦν	Xᾶν	Xῆν
1SG	Xω	Xῶ	Xῶ	Xῶ	Xῶ
2SG	Xεις	Xεῖς	Xοῖς	Xᾷς	Xῇς
3SG	Xει	Xεῖ	Xοῖ	Xᾷ	Xῇ
1PL	Xομεν	Xοῦμεν	Xοῦμεν	Xῶμεν	Xῶμεν
2PL	Xετε	Xεῖτε	Xοῦτε	Xᾶτε	Xῆτε
3PL	Xουσι(ν)	Xοῦσι(ν)	Xοῦσι(ν)	Xῶσι(ν)	Xῶσι(ν)

I've given each of these patterns a label: PA-3, PA-4, PA-5.

All four of the new patterns have circumflex accents on the distinguisher in every cell. For this reason we will call these verbs circumflex verbs.

Notice that in 1SG, the distinguisher is identical across all the circumflex verbs (-ῶ). What that means is, given just the 1SG form of a circumflex verb, you can't tell exactly which of the patterns will be followed overall. Xῶ could be PA-2, PA-3, PA-4 OR PA-5. You CAN tell, however, that it's not a PA-1 verb (because of the circumflex).

In contrast to 1SG, if you know ANY of the INF, the 2SG, the 3SG, or the 2PL, you can tell exactly which pattern is being followed.

That leaves the interesting case of the 1PL and 3PL. An ου in either cell distinguisher means we have a PA-2 or PA-3 but don't know which. An ω in either cell distinguisher means we have a PA-4 or PA-5 but don't know which.

Put another way: presented with a 1PL ending in -οῦμεν, we can tell (at least given what we've seen up until this point) what the 1SG and 3PL must be but we're left with two possibilities for all the other cells. The moment we know just one of those OTHER cells, though, we can tell what every cell must be.
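The reasoning above can be sketched in a few lines of Python. This is only an illustration: the table is copied from the distinguishers shown earlier, and the function names (compatible_patterns, narrow) are my own for this sketch rather than anything from an existing library.

```python
import unicodedata


def nfc(s):
    # Normalise so that precomposed and decomposed Greek compare equal.
    return unicodedata.normalize("NFC", s)


CELLS = ("INF", "1SG", "2SG", "3SG", "1PL", "2PL", "3PL")

# Distinguishers for each pattern, in the order given by CELLS.
ROWS = {
    "PA-1": "ειν ω εις ει ομεν ετε ουσι(ν)",
    "PA-2": "εῖν ῶ εῖς εῖ οῦμεν εῖτε οῦσι(ν)",
    "PA-3": "οῦν ῶ οῖς οῖ οῦμεν οῦτε οῦσι(ν)",
    "PA-4": "ᾶν ῶ ᾷς ᾷ ῶμεν ᾶτε ῶσι(ν)",
    "PA-5": "ῆν ῶ ῇς ῇ ῶμεν ῆτε ῶσι(ν)",
}
PATTERNS = {name: dict(zip(CELLS, nfc(row).split())) for name, row in ROWS.items()}


def compatible_patterns(cell, distinguisher):
    """Return the patterns whose given cell has this distinguisher."""
    d = nfc(distinguisher)
    return sorted(name for name, cells in PATTERNS.items() if cells[cell] == d)


def narrow(observations):
    """Intersect the candidates for several observed cells ({cell: distinguisher})."""
    candidates = set(PATTERNS)
    for cell, distinguisher in observations.items():
        candidates &= set(compatible_patterns(cell, distinguisher))
    return sorted(candidates)
```

Calling narrow with only a 1PL in -οῦμεν leaves PA-2 and PA-3 as candidates; adding a single observation from one of the other cells (say a 2PL in -εῖτε) leaves just one.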

We'll continue to explore these new patterns (and their corresponding middle patterns) over the next few posts.

Collapsible Treedown

2017-06-30

Jonathan Robie's Treedown format is a really nice way of conveying basic syntactic structure in real texts. I recently experimented a little with some code for collapsing and expanding of the structure.

You can read about Treedown in more detail but the idea is to convey structure in a plain text format that still conveys meaning. The name "treedown" is a nod to "markdown" and the philosophy is very similar—convey information visually but in a way that's easy to transmit and edit in plain text.

One of the things that appeals to me about Treedown is how easily it can be used to just initially sketch out high level argument structure without getting into the weeds. But even if the analysis does go a little deeper, you want to be able to pull back and see the high-level structure without getting too much in the way of just reading. So to this end, I hacked together a bit of HTML, CSS and JS to demonstrate some UI to support this "collapsibility".

This is just plain Treedown (or one proposal for it; it's still a work in progress) but with some lightweight interactivity that lets the reader determine how much structure they want to see. Square brackets around the Treedown label indicate a further analysis that can be expanded.

I made a variant that lets you get a "preview" of the next level of structure when you hover over it, using labelled brackets:

I then thought that perhaps this preview might be better conveyed just with colour, where each Treedown label gets its own colour. Here's what that might look like:

These are all just quick prototypes but let me know what you think.

A Tour of Greek Morphology: Part 3

2017-06-29

Part three of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In the first two parts (part one and part two), we looked at the present indicative forms of λύω.

I want to now add the infinitives, λύειν (for the active) and λύεσθαι (for the middle).

So we now have:

	active	middle
INF	λύειν	λύεσθαι
1SG	λύω	λύομαι
2SG	λύεις	λύῃ or λύει
3SG	λύει	λύεται
1PL	λύομεν	λυόμεθα
2PL	λύετε	λύεσθε
3PL	λύουσι(ν)	λύονται

Adding the infinitives does make certain commonalities jump out even more: all the 'ει' in the active and both the 'αι' and '(σ)θ' in the middle.

But one of the big questions to address next is: does any of this have anything to do with the present indicative (and infinitive) forms of any other words besides λύω?

Fortunately (otherwise it might not have been the best of starting places) it does. In the MorphGNT, there are 645 distinct lexemes appearing in the present indicative and 383 of them follow exactly the same pattern as λύω above including the accentuation.

In the present active indicative, there are 10 verbs that exhibit all six cells in the paradigm: θέλω, ἀκούω, λέγω, μένω, λαμβάνω, γινώσκω, πιστεύω, μέλλω, ἔχω, βλέπω (note that λύω is not, in fact, among them).

In the middle, there are no words filling all six cells in the MorphGNT but there are five verbs that fill five of the cells: βούλομαι, λογίζομαι, ἔρχομαι, ἐργάζομαι, προσεύχομαι.

But allowing for the missing cells, 271 lexemes follow this active pattern in the present indicative and 160 lexemes follow this middle pattern (with overlap in the case of lexemes that have both active and middle forms):

	active	middle
INF	Xειν	Xεσθαι
1SG	Xω	Xομαι
2SG	Xεις	Xῃ or Xει
3SG	Xει	Xεται
1PL	Xομεν	Xόμεθα
2PL	Xετε	Xεσθε
3PL	Xουσι(ν)	Xονται

The accent is recessive in every case so will be an acute on the right-most syllable of X in every case but Xόμεθα where the law of limitation means the accent can't go back as far as X. I could skip accents altogether but they'll turn out to be very important in the next few posts so it's actually helpful to include them in this template where they fall on the distinguisher (the part other than the X that varies from cell to cell). And note that if the distinguisher doesn't have an accent in the template it's because it doesn't have the accent in the full form.
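As a rough illustration of how such a template can be applied, here is a minimal Python sketch. It assumes X is supplied with its acute already in place (e.g. λύ) and that the accent is removed from X whenever the distinguisher carries one, as in Xόμεθα. The names PA1, PM1 and fill are just for this sketch, and the alternative 2SG middle in Xει is left out for brevity.

```python
import unicodedata

# Combining grave, acute and perispomeni (circumflex).
ACCENTS = {"\u0300", "\u0301", "\u0342"}

PA1 = {"INF": "ειν", "1SG": "ω", "2SG": "εις", "3SG": "ει",
       "1PL": "ομεν", "2PL": "ετε", "3PL": "ουσι(ν)"}
PM1 = {"INF": "εσθαι", "1SG": "ομαι", "2SG": "ῃ", "3SG": "εται",
       "1PL": "όμεθα", "2PL": "εσθε", "3PL": "ονται"}


def has_accent(s):
    return any(ch in ACCENTS for ch in unicodedata.normalize("NFD", s))


def strip_accents(s):
    # Remove accents but keep breathings, iota subscripts and diaereses.
    decomposed = unicodedata.normalize("NFD", s)
    kept = "".join(ch for ch in decomposed if ch not in ACCENTS)
    return unicodedata.normalize("NFC", kept)


def fill(template, x):
    """Fill every cell of a template, given X with its accent marked."""
    forms = {}
    for cell, distinguisher in template.items():
        # If the distinguisher carries the accent, X must not.
        stem = strip_accents(x) if has_accent(distinguisher) else x
        forms[cell] = unicodedata.normalize("NFC", stem + distinguisher)
    return forms
```

With X = λύ, fill(PM1, "λύ") gives λυόμεθα in the 1PL cell and accents X in every other cell.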

I'm going to call the active and middle pattern above PA-1 and PM-1 respectively.

We must avoid the temptation to talk of stems at this point. Even though X above does correspond to what's normally thought of as the stem, we will encounter many paradigm templates (including in the next few posts in this series) where that is not the case and it's better to be precise and avoid confusion from the start.

A Tour of Greek Morphology: Part 2

2017-06-25

Part two of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In the first part we took an initial look at the present active indicative paradigm for λύω, repeated below for easy reference:

λύω
λύεις
λύει
λύομεν
λύετε
λύουσι(ν)

There are a number of morphosyntactic properties we could alter to see the effect on the paradigm, but in this post, we'll look at the middle voice:

λύομαι
λύῃ or λύει
λύεται
λυόμεθα
λύεσθε
λύονται

So again, we're showing, side-by-side, the various number-person forms for λύω, keeping the tense, aspect, voice, and mood constant. In this way we can see, by comparing the paradigms (a paradigm of paradigms!), how the active/middle alternation is realized in Greek (at least for the present indicative λύω!).

A few things may immediately jump out at you:

the forms continue to all start with λυ
the υ is always followed by a vowel (and mostly ε or ο)
the second person singular has two possible forms
three of the forms end in -αι
both the first person forms have a μ and both the third person forms have a τ
the first and second plural both have a θ and there seems to be more of a link between the active and middle forms (ομεν/ομεθα, ετε/εσθε)

We have to be careful not to make too much of some of these yet. Many a bad linguistic analysis has come from noticing patterns in a small number of instances without seeing if the same pattern applies more broadly! We need more data. But these initial observations are at least things to keep in the backs of our minds as we explore more forms. Some of them will prove particularly interesting later on.

For now I just want to explore the two second person singular forms, λύῃ and λύει. You'll notice one of these forms is identical to the third singular active form. Isn't this potentially confusing?

Yes, but there are two things to note here: one, it should generally be clear from the context, regardless of the ending, whether a third person active or second person middle is intended. Ambiguities in morphology like this are far more likely in cases where multiple morphosyntactic properties vary at once (in this case both person AND voice) and where the larger context is likely to make clear which alternative is meant. It's worth also noting, for example, that -ει can also end a dative noun (and in fact does in over 300 cases in the NT).

Two, the -ῃ forms are much more common in the NT than the -ει and, in fact, there's actually only one second person -ει form in the SBLGNT text and it is βούλει where lexically the word must be middle anyway and so even the context isn't needed to disambiguate.

As to why two forms developed in the first place, we'll have to wait a bit to discuss that.

Another European Trip

2017-06-25

I was here last month but I'm back again for a series of conferences and then my graduation.

Last week I attended the inaugural Language, Data and Knowledge conference in Galway, Ireland including the OntoLex Model Workshop which preceded it. The LDK conference was a nice intersection of linguistics and linked data very much in the spirit of the work described on this website. I got to meet a few people I've known of for a while as well as some new people I hope to stay in touch with and potentially collaborate with. The conference will be biennial, with the next one in Leipzig. I definitely plan to submit something for that one!

Then I attended VueConf in Wrocław, Poland. Vue.js is the JavaScript framework I'm using for my online reading environment work and the timing turned out perfectly for me to attend. I gave a lightning talk on the DeepReader project (which I'll also blog about here soon).

I'm currently in Leipzig just to visit some people at the Humboldt Chair of Digital Humanities again.

Then I'm heading to Cambridge for the Tyndale House Workshop in Greek Prepositions. Looking forward to seeing a lot of my friends there and having some good discussions, not just about the topic at hand but more broadly as well.

Then I'm heading to Lampeter, Wales for my graduation on July 7th. Three years ago, I decided that it might be useful for me to have a qualification in Classical Greek as well as in linguistics and so I started pursuing a postgraduate diploma at the University of Wales Trinity Saint David. Two days ago, I found out I'm being awarded the diploma with Distinction which was my unspoken hope despite occasionally doing poorly at my unseen translations.

A Tour of Greek Morphology: Part 1

2017-06-23

This is the first post in a (likely long) series exploring the inflectional morphology of Greek. My goal is to work through various aspects of Greek morphology to help students think more systematically about the subject.

I ultimately hope to cover everything that a beginner-intermediate grammar might but in a much more exploratory fashion. I'll occasionally touch on morphological theory but I mostly want to point out phenomena in the language that students have already seen but perhaps have not thought about in any depth.

We'll start with a paradigm familiar to all students of New Testament Greek:

λύω
λύεις
λύει
λύομεν
λύετε
λύουσι(ν)

At its most basic, a paradigm is just a showing of related forms next to one another for comparison. The idea is to get a sense of how forms and meaning relate by showing contrastive examples.

In most cases, there's something held constant across all the cells. In the list above, all the forms are present active indicative forms of the word λύω. What distinguishes them from the point of view of their morphosyntactic properties is the person and number.

Respectively the list above is:

the first person singular (present active indicative form of the word λύω)
the second person singular
the third person singular
the first person plural
the second person plural
the third person plural

It may not be the case that the forms all have something in common, although in this case you can see they all start with λύ. It may be tempting to make the simple analysis that λύ itself means "the present active indicative form of the word λύω" and, say, εις means "the second person singular". But as we shall see, that's not the most helpful analysis in general.

It's worth thinking about other possibilities we could draw from just this tiny example (even though many theories will be ruled out once we look at other data): perhaps λ indicates indicative; perhaps εις indicates not only second person singular but present active too; perhaps εις is only used if the word starts with an λ.

About all we can say at this stage is the way you discriminate between, say, a second person singular and a third person singular, in the case of the present active indicative of λύω, is the εις vs ει. And that particular example, in the absence of seeing the other cells, may even lead one to conclude you get from the third singular to the second singular by adding a sigma.

The point is there's a LOT we can't tell yet. What we CAN tell, within the set of forms with the properties held constant, is how to discriminate across forms with the morphosyntactic properties that vary. In other words, IF we have a present active indicative of λύω, how do we tell the person and number?

There is one very important property of Greek morphology that we can see just in the paradigm so far: there is no consistent way person is discriminated for a given number, nor number for a given person. In other words, the relationship between the forms λύω and λύομεν seems completely unrelated to that between λύεις and λύετε. And the relationship between λύω and λύεις seems completely unrelated to that between λύομεν and λύετε even though they differ in meaning in only one property. Or put another way, we can't just tell the person OR number, only the person AND number. We will talk more about this in future posts.

Finally, you may be wondering "why is λύω used so often?". There are multiple reasons for this choice. Firstly, as we shall see later, λύω has completely regular stem formation. Secondly the υ is robust in the face of what sounds follow it. Some Classical Greek textbooks will use παύω for the same reasons.

Modelling Derivational Morphology

2017-05-31

While most of my focus has been on inflectional morphology, I've done a little bit of work on modelling derivational morphology and it's been a desideratum for my reader and learning algorithm work dating back to at least the original 2008 "New Kind of Graded Reader" presentations.

In the 90s I was even in conversation with Harold Greenlee about putting his work online. There are numerous problems with this kind of work, though. The first is just mistakes and dubious connections. John Lee's 2013 paper Etymological Follies: Three Recent Lexicons of the New Testament gives numerous examples. Lee is always worth listening to when it comes to lexicons!

There's another major issue which is that expressing etymology (or even just cognate groupings) doesn't really tell you what I actually care about which is how easy is the meaning of a lexical item to learn based on other cognate lexical items you've learned. I've previously talked about modelling the cost of learning a new form in the context of inflectional morphology but I'm also interested (as mentioned in various "New Kind of Graded Reader" presentations) in the derivational equivalent between lexemes. There's some interesting theoretical work in this area going back to at least Jackendoff's 1975 paper Morphological and Semantic Regularities in the Lexicon. This was picked up in Bochner's 1993 book Simplicity in Generative Morphology which was a huge influence on me in thinking about morphology as paradigmatic relationships between words rather than morpheme-based approaches.

So for my purposes, at least, I want to model how easy it is to work out the meaning of a word from known cognates potentially given similar analogical pairs of cognates. What I'd ultimately like to develop is some sort of weighting between pairs that represents how transparent the connection in meaning is from their cognate forms.

Take for example the pair

Ἰταλία:Ἰταλικός

If that pair is known, then something like

Γαλατία:Γαλατικός

is much easier to understand. So if you understand Ἰταλία, Ἰταλικός, and Γαλατία, you can almost certainly take a stab at guessing the meaning of Γαλατικός. I care about that because a big part of my research is modelling how "easy" a passage might be for a student to read.

The analogy might be abstracted as

-ια:-ικος::place:person-from-that-place

but it also applies to things like

Πόντος:Ποντικός

which is -ος:-ικος so first/second declension doesn't matter.
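The formal side of this analogy is easy to sketch in Python. The function below (derive_ikos, a name made up for this sketch) proposes an -ικός form for a base in -ία or -ος by stripping the accent and swapping the ending. It only models the form, not how transparent the meaning relationship is, and it makes no claim that the proposed word is actually attested.

```python
import unicodedata

# Combining grave, acute and perispomeni (circumflex).
ACCENTS = {"\u0300", "\u0301", "\u0342"}


def strip_accents(s):
    # Remove accents but keep breathings so that Ἰ- and ὀ- survive.
    decomposed = unicodedata.normalize("NFD", s)
    kept = "".join(ch for ch in decomposed if ch not in ACCENTS)
    return unicodedata.normalize("NFC", kept)


def derive_ikos(base):
    """Propose a hypothetical -ικός derivative, or None if no rule applies."""
    plain = strip_accents(base)
    for ending in ("ια", "ος"):
        if plain.endswith(ending):
            return unicodedata.normalize("NFC", plain[: -len(ending)] + "ικός")
    return None
```

Bases that don't end in -ία or -ος (εἰρήνη, στοά) return None here; covering them would need further ending rules.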

Given a new place, you could probably easily construct a plausible denominal adjective for someone from that place with -ικος. A Greek speaker unfamiliar with the philosophical school would still immediately recognize Στοϊκός as suggesting "someone from the στοά" although we might want to score the transparency of that lower than those based on geographical proper nouns.

But now consider

κοινωνία:κοινωνικός

εἰρήνη:εἰρηνικός

The meaning of the root clearly transfers to the lexical items in each pair but the relationship between the items in each pair is a little less transparent. It's still there if you think about it but it almost certainly needs to be weighted less. κοινωνία and εἰρήνη are not physical places. The -ικος derivative is still in some sense about something coming from somewhere but rather than a person from a place, it seems to be a state coming from another state (metaphorical place).

Then you get something like

ὄνος:ὀνικός

If you think really really hard about it you can see how ὀνικός (in the sense of millstone) might have come from ὄνος (donkey). But this is at best a potentially useful mnemonic for learners rather than a productive derivation. It should be weighted even lower (no pun intended). And then where might

κέραμος:κεραμικός

fit in this weighting? (and to what extent do English cognates help too in cases such as this?)

I'm not yet sure how best to produce weightings for this kind of lexical relatedness. My guess is a first pass could be achieved by crowdsourcing on oxlos. Ultimately, some of the weighting could be calculated via regression based on vocabulary quizzes (although I worry about confounding factors unless the students are beginners). Even just doing the crowdsourcing would be interesting to see how much agreement there was in the "obvious relatedness" ordering of pairs like Πόντος:Ποντικός > στοά:Στοϊκός > κοινωνία:κοινωνικός > ὄνος:ὀνικός.

Finally, it occurs to me this gives a potential measure of "false friendship" amongst cognates as a mismatch between the obviousness of relatedness in form vs in meaning.

I have some old work at https://github.com/morphgnt/morphological-lexicon/tree/master/projects/derivational_morphology which I probably need to clean up at some point for all this.

As is often the case, this blog post was triggered by Jonathan Robie asking me something and me realising I'd never written up my thoughts on the topic despite having thought about it on and off for a decade :-)

Comparing Analyses from Herodotus

2017-05-24

An analysis I did of a couple of chapters of Herodotus looks like it might be an interesting example to use for various treebanking approaches—both in terms of how things are structured as well as how they are visualised.

As the last assignment for my Postgraduate Diploma in Ancient Greek, I had to write a brief commentary of Herodotus 2.35–36, which catalogs (with hasty generalisations galore) differences between Egypt and the rest of the world. The catalog consists of a series of statements of the form “Egyptians do THIS whereas everyone else does THAT” or “[In Egypt] the men do THIS and the women do THAT [as opposed to the other way around like everywhere else]”.

In his commentary, Lloyd notes that this sort of catalog could be quite monotonous but that Herodotus avoids this through “skilful stylistic variation”. My commentary spent a decent proportion of its short word count digging deeper into this variation.

Quite coincidentally, Greg Crane sent me some examples of student treebanking recently in the context of how to compare analyses and they happened to be of Herodotus 2.35. They differ from each other and from my own way of thinking about the sentences. Note that these aren’t difficult or ambiguous sentences, though! The syntax is easy, I just don’t think most analysis conventions and visualisation tools do a great job of capturing what’s going on.

In my assignment, I started off presenting a canonical example of the construction and it’s that example that I want to show here. The original sentence is

τὰ ἄχθεα οἱ μὲν ἄνδρες ἐπὶ τῶν κεφαλέων φορέουσι, αἱ δὲ γυναῖκες ἐπὶ τῶν ὤμων.

But I started off considering these sentences:

οἱ ἄνδρες τὰ ἄχθεα ἐπὶ τῶν κεφαλέων φορέουσι

αἱ γυναῖκες τὰ ἄχθεα ἐπὶ τῶν ὤμων φορέουσι

The verb (in the present, as always in these comparisons), direct object, and prepositional phrase construction are identical. What is being contrasted (shown in bold) is how the particular location (the complement in the prepositional phrase) varies with the subject.

Herodotus sets up the contrast with μέν and δέ postpositives.

οἱ μὲν ἄνδρες τὰ ἄχθεα ἐπὶ τῶν κεφαλέων φορέουσι

αἱ δὲ γυναῖκες τὰ ἄχθεα ἐπὶ τῶν ὤμων φορέουσι

He then alters the “constants” in the comparison, topicalising the direct object and eliding repetition of the verb. This results in:

τὰ ἄχθεα
    μὲν
        οἱ … ἄνδρες
            ἐπὶ τῶν κεφαλέων φορέουσι
    δὲ
        αἱ … γυναῖκες
            ἐπὶ τῶν ὤμων [φορέουσι]

The above was an indented structure I manually constructed for my commentary. It’s not machine actionable and is missing a lot but I think it does a decent job of capturing some of what's going on. It makes clear:

the topicalisation of τὰ ἄχθεα
the μέν and δέ construction as a whole
the elision of the verb

It is these three properties that I think make this a particularly interesting example.
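One way the indented structure could be made machine actionable is to parse the indentation into a tree and record the bracketed (supplied) words separately. The Python below is only a sketch under those assumptions: four spaces per level, square brackets marking a supplied word, and node fields (label, elided, children) that I've made up for illustration.

```python
import re

EXAMPLE = """\
τὰ ἄχθεα
    μὲν
        οἱ … ἄνδρες
            ἐπὶ τῶν κεφαλέων φορέουσι
    δὲ
        αἱ … γυναῖκες
            ἐπὶ τῶν ὤμων [φορέουσι]
"""


def parse_indented(text, indent=4):
    """Turn indentation-structured text into a list of nested dict nodes."""
    root = {"label": None, "elided": [], "children": []}
    stack = [(-1, root)]  # (depth, node) pairs for the currently open ancestors
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // indent
        label = line.strip()
        node = {
            "label": label,
            # Words in [square brackets] are treated as supplied by the analyst.
            "elided": re.findall(r"\[(.+?)\]", label),
            "children": [],
        }
        # Close any nodes at the same or a deeper level before attaching.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root["children"]
```

The result has τὰ ἄχθεα as the single top-level node with the μέν and δέ branches as equal children, and the supplied verb is recoverable from the elided field of the last leaf. It still doesn't co-reference that verb to its antecedent, which a fuller model would need.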

Here’s the first student treebank analysis:

The student supplies the elided verb (although it’s not co-referenced in any way) but not the elided direct object. There’s no indication of the topicalisation.

It doesn’t quite seem right to me to say the two clauses are conjoined by δέ with the μέν hanging off the verb. I think of the μέν and δέ as equal partners in this construction and as tagging the two things being compared.

Here’s the second student treebank analysis:

This analysis seems a lot more confused. The coordination is shown as being done with the μέν this time, with the δέ dangling. The prepositional phrases are shown as governed by the subjects rather than the verb.

To be clear, I’m not trying to critique the students so much as raise questions for analysis conventions and visualisation, especially for reading environments and querying.

Again, this sentence (like the others in Herodotus 2.35–36) isn’t difficult. I doubt either student had any trouble understanding the sentence. I just think it wasn’t clear how to adequately model their understanding of the structure.

I think elision and conjunction are the biggest issues in most analyses like this and good structures and visualisation for handling those will go a long way to making treebanks more consistent and more useful.

Using this sentence from Herodotus as an example, what are better ways of making sure analyses both enable useful queries and can be visualised in more perspicuous ways?

UPDATE: perhaps "coordination" would be better than conjunction as one of the "biggest issues" and I think "theticals" (HT: Jonathan Robie) could be added to that list to make the triad: elision, coordination, and theticals.

UPDATE 2: I also need to stop saying elision when I mean ellipsis! I'm spending too much time with morphophonology and not enough time with syntax :-)

Headed to Germany Next Week

2017-05-03

Next week I'm headed to Germany for a whirlwind trip to Göttingen, Heidelberg, and Leipzig to share and discuss ideas with other scholars.

I'll be speaking at a Global Philology workshop in Göttingen, attending a Digital Classics conference in Heidelberg (where I'll also have to sit the final exam for my Postgraduate Diploma in Greek if I can find someone to invigilate), and then spending a few days in Leipzig meeting with the team at the Humboldt Chair of Digital Humanities at Universität Leipzig.

I'm very excited to now be working more closely with the digital classics community and meeting many people whose names I've known for a while.

I'm also thrilled to visit Leipzig again after more than ten years and get my fill of musical history there. I'm also hoping for a bit of a physics history fill too given the importance of both Göttingen and Leipzig in the history of quantum mechanics.

Handling Morphological Ambiguity

2017-04-21

On my now page, I currently list "finalising an improved set of morphology tags to use" under Medium Term. As I find myself sometimes having to clarify the motivation for and state of this, I thought I'd share what I just wrote in the Biblical Humanities Slack.

Firstly, some background on previous notes...

Back in 2014, I wrote down some notes Proposal for a New Tagging Scheme after discussions with Mike Aubrey. In 2015, after some discussions with Emma Ehrhardt, I wrote down Handling Ambiguity. Then in February 2017, after discussion on the Biblical Humanities Slack, I put forward a concrete Proposal for Gender Tagging.

Here's a slightly cleaned up version of what I wrote in Slack...

All I've done is propose a way of representing certain single-feature ambiguities (especially gender but also nom/acc in neuter). I have not proposed anything for multi-feature ambiguities nor have I actually DONE any work that uses these proposals.

Multi-feature ambiguities at the morphology level (1S vs 3P, GS vs AP, etc) are rarely ambiguous at the syntactic or semantic level for very good reason: the syntactic/semantic-level disambiguation is what allows one to tolerate the ambiguity at the morphology level (one reason that, as a cognitive scientist, I quite like discriminative models of morphology).

But if I continue with my goal to produce a purely morphological analysis, without "downward" disambiguation, then I want to be able to provide a way of representing form over function AND representing ambiguity.

I want to stress again that I think nom vs acc in neuter, or gender in genitive plurals is a DIFFERENT kind of ambiguity than 1S vs 3P or GS vs AP. For these multi-feature ambiguities (or what my wiki page calls extended syncretism although not sure I really like that term) it may come down to just providing a disjunction of codes, e.g. GSF∨APF.
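To make the distinction concrete, here is a small Python sketch that takes a disjunction of codes and reports which features vary across the alternatives. It assumes three-letter nominal codes in case-number-gender order, as in GSF, and it writes single-feature ambiguities as disjunctions too purely for illustration (the gender tagging proposal represents those differently). The function names are hypothetical.

```python
# Assumed layout of a three-letter nominal code such as GSF.
FEATURES = ("case", "number", "gender")


def alternatives(code):
    """Split a disjunction like 'GSF∨APF' into its alternative codes."""
    return code.split("∨")


def varying_features(code):
    """Name the features whose value differs across the alternatives."""
    alts = alternatives(code)
    return [
        name
        for i, name in enumerate(FEATURES)
        if len({alt[i] for alt in alts}) > 1
    ]


def classify(code):
    """Label a code as unambiguous, single-feature or multi-feature."""
    count = len(varying_features(code))
    if count == 0:
        return "unambiguous"
    return "single-feature" if count == 1 else "multi-feature"
```

Here classify("NSN∨ASN") comes out as single-feature (only case varies) while classify("GSF∨APF") is multi-feature (case and number vary together), which is the difference in kind described above.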

Also just in terms of motivation: clearly a morphological analysis that ignores downward disambiguation from syntax or semantics is unhelpful (and potentially even misleading) for exegesis and so a lot of use cases wouldn’t want to do it. HOWEVER, my goal is threefold:

(1) I want to have a way to model the output of automated morphological analysis systems prior to either automated or human downward disambiguation;
(2) as someone studying how morphology works from a cognitive point of view, I care about modelling how ambiguity is resolved at different levels and so want a model that can handle that;
(3) because a student is quite likely to be confronted with this ambiguity, it needs to be in my learning models. I want to be able to search for cases where 1S vs 3P ambiguity or GSF vs APF ambiguity or NSN vs ASN ambiguity is resolved by syntax or semantics so they can be illustrated to the student. I want to know, for a given passage, whether such ambiguity exists so learning can be appropriately scaffolded. And note that, for me, this extends to ambiguity resolved by just accentuation as well (which is another potentially useful thing to model for various applications).

In conclusion, I want to again state I'm not at all against a functional, fully-disambiguated parse code existing. I have NEVER proposed REPLACING the existing tagging schemes. I just want to add a new column useful for the reasons I've listed above in (1) – (3) and produce new resources that perhaps ONLY use that purely morphological parse code.

Finally I want to note there's an important difference between what we put in our data and how we present it to users. People should not assume that when I'm describing codes to use in data that I'm suggesting that's what end-users should see.

UPDATE: one topic I didn't discuss here is ambiguity in endings that is resolved by knowledge of the stems or principal parts. For example, without a lexicon, there are ambiguities between imperfect and aorist that are easily resolved with additional lexical-level information.

An Initial Reboot of Oxlos

2017-04-18

In a recent post, Update on LXX Progress, I talked about the possibility of putting together a crowd-sourcing tool to help share the load of clarifying some parse code errors in the CATSS LXX morphological analysis. Last Friday, Patrick Altman and I spent an evening of hacking and built the tool.

Back at BibleTech 2010, I gave a talk about Django, Pinax, and some early ideas for a platform built on them to do collaborative corpus linguistics. Patrick Altman was my main co-developer on some early prototypes and I ended up hiring him to work with me at Eldarion.

The original project was called oxlos after the betacode transcription of the Greek word for "crowd", a nod to "crowd-sourcing". Work didn't continue much past those original prototypes in 2010 and Pinax has come a long way since so, when we decided to work on oxlos again, it made sense to start from scratch. From the initial commit to launching the site took about six hours.

At the moment there is one collective task available—clarifying which of a set of parse codes is valid for a given verb form in the LXX—but as the need for others arises, it will be straightforward to add them (and please contact me if you have similar tasks you'd like added to the site).

If you're a Django developer, you are welcome to contribute. The code is open source under an MIT license and available at https://github.com/jtauber/oxlos2. We have lots we can potentially add beyond merely different kinds of tasks.

If your Greek morphology is reasonably strong, I invite you to sign up at

http://oxlos.org/

and help out with the LXX verb parsing task.

It's probably not that relevant anymore, but you can watch the original 2010 talk below. I'd skip past the Django / Pinax intro and go straight to about 37:00 where I start to discuss the collective intelligence platform.

Analysing the Verbs in Nestle 1904

2017-04-17

The last couple of weeks, I've been working on getting my greek-inflexion code working on Ulrik Sandborg-Petersen's analysis of the Nestle 1904. The first pass of this is now done.

The motivation for doing this work was (a) to expand the verb stem database and stemming rules; (b) to be able to annotate the Nestle 1904 with additional morphological information for my adaptive reader and some similar work Jonathan Robie is doing.

My usual first step when dealing with a new text is to automatically generate as many new entries in the lexicon / stem-database as I can (see the first step in Update on LXX Progress).

In some cases, this is just a new stem for an existing verb because of a new form of an already known verb. But sometimes it's an entirely new verb.

I thought the Nestle 1904 would be considerably easier than the LXX because the text is so similar but there were numerous challenges that arose.

It became clear very quickly that there were considerable differences in lemma choice between the Nestle 1904 and the MorphGNT SBLGNT. This didn't completely surprise me: I've spent quite a bit of time cataloging lemma choice differences between lexical resources and there are considerable differences even between BDAG and Danker's Concise Lexicon.

But even these aside, there were 7,743 out of 28,352 verbs mismatching after my code had already done its best to automatically fill in missing lexical entries and stems.

A. The normalisation column in Nestle 1904 doesn't normalise capitalisation, clitic accentuation, or moveable nu, all of which greek-inflexion assumes has been done.

Capitalisation alone accounted for 1042 mismatches. Clitic accentuation alone accounted for 1008 mismatches. Moveable nu alone accounted for 4153 mismatches.
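A normalisation pass covering these three things might look roughly like the Python below. This is a sketch rather than the code I actually ran. It assumes verb forms only (so lowercasing is safe), treats "clitic accentuation" as just the extra accent a word picks up before an enclitic, and writes movable nu as (ν) only for forms in -σι/-σιν. The 3SG -ε(ν) case and the accentuation of the enclitics themselves are left out.

```python
import unicodedata

ACUTE, CIRCUMFLEX = "\u0301", "\u0342"


def normalise(form):
    """Sketch: lowercase, drop a second accent, mark movable nu as (ν)."""
    # Capitalisation (e.g. at the start of a sentence).
    chars = list(unicodedata.normalize("NFD", form.lower()))

    # Clitic accentuation: keep the first accent, drop any later one.
    accent_positions = [
        i for i, ch in enumerate(chars) if ch in (ACUTE, CIRCUMFLEX)
    ]
    for i in accent_positions[1:]:
        chars[i] = ""
    word = unicodedata.normalize("NFC", "".join(chars))

    # Movable nu, restricted here to forms ending in -σι / -σιν.
    if word.endswith("σιν"):
        word = word[:-1] + "(ν)"
    elif word.endswith("σι"):
        word += "(ν)"
    return word
```

Both λέγουσι and Λέγουσιν come out as λέγουσι(ν), and ἤκουσά (as it appears before an enclitic) comes out as ἤκουσα.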

B. Nestle 1904 systematically avoids assimilation of συν and ἐν preverbs.

Taken alone, these accounted for 91 mismatches. Mapping prior to analysis by greek-inflexion is somewhat of a hack that I'll address in later passes.
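The kind of mapping involved can be sketched in Python as follows. It is deliberately crude: it looks only at forms that begin with συν or ἐν followed by a labial, a velar or λ, and applies the standard assimilation of ν in those environments. It makes no attempt to check that the prefix really is a preverb, and other environments (before σ, ζ or ρ) are left alone. The function name is made up for this sketch.

```python
import unicodedata

LABIALS = "πβφψμ"  # ν becomes μ before these
VELARS = "κγχξ"    # ν becomes γ before these


def assimilate_preverb(form):
    """Sketch: assimilate the ν of an initial συν- or ἐν- to what follows."""
    word = unicodedata.normalize("NFC", form)
    for preverb in ("συν", "ἐν"):
        preverb = unicodedata.normalize("NFC", preverb)
        if word.startswith(preverb) and len(word) > len(preverb):
            head = preverb[:-1]             # preverb without its final ν
            following = word[len(preverb)]  # first character after the preverb
            rest = word[len(preverb):]
            if following in LABIALS:
                return head + "μ" + rest
            if following in VELARS:
                return head + "γ" + rest
            if following == "λ":
                return head + "λ" + rest
    return word
```

Forms where a vowel follows the preverb (including augmented forms) pass through unchanged, since accented and unaccented vowels never match the consonant classes.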

C. There were 8 spelling differences in the endings which required an update to stemming.yaml:

κατασκηνοῖν (PAN) in Matt 13:32
κατασκηνοῖν (PAN) in Mark 4:32
ἀποδεκατοῖν (PAN) in Heb 7:5
φυσιοῦσθε (PMS-2P) in 1Cor 4:6
εἴχαμεν (IAI.1P) in 2John 1:5
εἶχαν (IAI.3P) in Mark 8:7
εἶχαν (IAI.3P) in Rev 9:8
παρεῖχαν (IAI.3P) in Acts 28:2

D. The different parse code scheme (Robinson's vs CCAT) had to be mapped over.

This should have been straightforward but voice in the formal morphology field sometimes seemed to be messed up (which I corrected as part of G. below).

E. There were 182 differences (type not token) in lemma choice, mostly active vs middle forms.

See https://gist.github.com/jtauber/28ddfeee3175903026dade4ab965ac6c#file-lemma-differences-txt for the full list.

F. There were a small handful of per-form lemma corrections I made:

ἐπεστείλαμεν AAI.1P ἀποστέλλω ἐπιστέλλω
ἀγαθουργῶν PAP.NSM ἀγαθοεργέω ἀγαθουργέω
συνειδυίης XAP.GSF συνοράω σύνοιδα
γαμίσκονται PMI.3P γαμίζω γαμίσκω

G. Finally, I made 69 (type not token) parse code changes.

See https://gist.github.com/jtauber/28ddfeee3175903026dade4ab965ac6c#file-parse-txt for the list.

With all this, the greek-inflexion code (on a branch not yet pushed at the time of writing) can correctly generate all the verbs in the Nestle 1904 morphology.

There are definitely improvements I need to make in a second pass and at least a small number of corrections that I think need to be made to the Nestle 1904 analysis.

But it's now possible for me to produce an initial verb stem annotation for the Nestle 1904 and I'm a step closer to a morphological lexicon with broader coverage.

UPDATE: I've added some more parse corrections but not yet updated the gist.

Update on LXX Progress

2017-04-10

As mentioned in previous posts, I've been working through the LXX, initially making sure my greek-inflexion library can generate the same analysis of verbs as the CATSS LXX Morphology and adding to the verb stem database accordingly. This is a preliminary to being able to run the code on alternative LXX editions such as Swete and provide a freely available morphologically-tagged LXX.

The general process has been, one book at a time:

programmatically expand the stem database with missing stems where the analysis given by CATSS fits what greek-inflexion stemming rules expect
where the analysis from CATSS doesn't fit what greek-inflexion expects, evaluate if it's
- a parse error in the CATSS (at this stage by far the most common problem, but also the most time consuming to identify and fix)
- a missing stemming rule (very rare at this stage)
- some temporary limitation of greek-inflexion (it could be smarter about some accentuation, for example)

Working a few hours a week, it took about a month to do 1 Kings (i.e. 1 Samuel), in part because it had close to 100 parsing errors in the CATSS, many of them quite inexplicable (like getting the voice wrong when the ending should make that very easy to determine).

The work up until this point covers about 35% of the LXX, but I decided for the rest to go broad rather than book-by-book.

In other words, I've expanded the stem database (per step one above) for the entire LXX in one go and will now work through the problem cases.

What is very encouraging is that expanding the verbs attempted from 35% to 100% only led to 731 analysis mismatches in 1,875 locations. Given the LXX has just over 100,000 verbs, that's less than a 2% error rate.

Let me be clear, however, what I'm claiming. I'm NOT saying I can morphologically tag verbs with 98% accuracy. I'm merely saying that 98% of the CATSS LXX morphological analysis can be explained by the rules and data in greek-inflexion. The other 2% is likely to MOSTLY be errors in the CATSS analysis with a few errors in my stem database, stemming rules, or accentuation rules.
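The arithmetic behind the "less than 2%" figure can be checked in a couple of lines. The verb count is only given as "just over 100,000", so the sketch below uses 100,000 as a lower bound, which makes the computed rate an upper bound.

```python
mismatch_locations = 1_875   # locations where the analyses disagree
verb_tokens = 100_000        # assumed lower bound ("just over 100,000")

rate = mismatch_locations / verb_tokens
print(f"mismatch rate is at most {rate:.3%}")
print(f"explained by greek-inflexion: at least {1 - rate:.3%}")
```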

At the rate I worked through 1 Kings, going through the rest of the mismatches might take the rest of the year, but I think I can speed things up by batching similar kinds of mismatches together. For example, there are 586 forms where greek-inflexion didn't generate the form in the CATSS analysis with the morphosyntactic properties given but was able to generate the form with different morphosyntactic properties. In almost all cases that corresponds to a mistake in the CATSS analysis. It's the most time consuming part to deal with but batching them up together (especially dealing with the same mismatch across all remaining books at once) should speed things up.

It may also lend itself to crowd-sourcing. I could probably pretty easily whip up a little website that shows people the form and asks them to choose between the CATSS analysis and the greek-inflexion analysis (not telling them which is which).

It may be worth me spending a few hours setting that up!

New MorphGNT Releases and Accentuation Analysis

2017-02-15

Over the last few weeks, I've made a number of new releases of the MorphGNT SBLGNT analysis fixing some accentuation issues mostly in the normalization column. This came out of ongoing work on modelling accentuation (and, in particular, rules around clitics).

Back in 2015, I talked about Annotating the Normalization Column in MorphGNT. This post could almost be considered Part 2.

I recently went back to that work and made a fresh start on a new repo gnt-accentuation intended to explain the accentuation of each word in the GNT (and eventually other Greek texts). There are two parts to that: explaining why the normalized form is accented the way it is, and then explaining why the word-in-context might be accented differently (clitics, etc). The repo is eventually going to do both but I started with the latter.

My goal with that repo is to be part of the larger vision of an "executable grammar" I've talked about for years where rules about, say, enclitics, are formally written up in a way that can be tested against the data. This means:

students reading a rule can immediately jump to real examples (or exceptions)
students confused by something in a text can immediately jump to rules explaining it
the correctness of the rules can be tested
errors in the text can be found

It is the fourth point that meant that my recent work uncovered some accentuation issues in the SBLGNT, normalization and lemmatization. Some of that has been corrected in a series of new releases of the MorphGNT: 6.08, 6.09, and 6.10. See https://github.com/morphgnt/sblgnt/releases for details of specifics. The reason for so many releases was I wanted to get corrections out as soon as I made them but then I found more issues!

There are some issues in the text itself which need to be resolved. See the Github issue https://github.com/morphgnt/sblgnt/issues/52 for details. I'd very much appreciate people's input.

In the meantime, stay tuned for more progress on gnt-accentuation.

First Pass of MorphGNT Verb Coverage and LXX Beginnings

2017-01-02

In greek-inflexion and an Update on the Morphological Lexicon I said that all the verbs in the MorphGNT SBLGNT analysis should be done by the end of the year. I hit that goal and made a decent start on the Septuagint.

As mentioned in that previous post, by May 2016 I could generate every single verb form in:

Louise Pratt’s intermediate grammar
Helma Dik’s Greek verb handouts
Andrew Keller & Stephanie Russell’s beginner-intermediate text book

On December 8th, I'd actually finished coverage of all the verbs in the MorphGNT SBLGNT (with a little bit of help from Nathan Smith).

The stem database is available at https://github.com/jtauber/greek-inflexion/blob/morphgnt/morphgnt_lexicon.yaml. I should emphasize, though, this is just a first pass and there's more work to do but the coverage is now there.

I immediately started work on applying the greek-inflexion code and stemming rules to the CATSS analysis of the LXX. By the end of 2016, I'd built a stem database and updated the stemming rules to cover the Pentateuch, 1 Maccabees, Jonah, Nahum, and Ezra-Nehemiah. Work on the rest of the CATSS analysis will continue over the next few months.

I decided to start a new stem database from scratch for the LXX (although I recently wrote a script to compare stem databases for inconsistencies). My primary reason for this was to see if I ended up with the same analysis for a verb stem as a way of catching potential errors in my original MorphGNT analysis. The classical Greek exemplars listed above, the MorphGNT SBLGNT and the LXX analysis all share the same stemming rules, though.

My reasons for doing the stem analysis on the CATSS morphological analysis were threefold:

expand coverage of the stem database to more parts for existing verbs as well as new verbs
provide broader tests for the stemming rules
prepare for a morphological analysis of the Swete text of the LXX/OG.

A fourth benefit quickly emerged, though: I found errors in the CATSS analysis.

I've been maintaining patch files which, after a review pass, I'll contribute back to CCAT (if they are interested). Fun fact: it was contributing corrections back to the CCAT's GNT analysis which started me on the path to MorphGNT 24 years ago!

The patches are available at https://github.com/jtauber/greek-inflexion/tree/lxx/lxxmorph. They need to be reviewed as they all pretty much assume the text is correct (including accentuation, which was a major reason for the corrections I made) and I've redone the analysis without considering context. An easy way to contribute would be to help review these patch files.

All this work on greek-inflexion has led to some improvements to the underlying inflexion library as well as numerous corrections to greek-accentuation.

Work on the LXX coverage will continue as well as expansion to other texts (both Hellenistic and Classical).

Also in an early stage is better modeling of stem formation and endings.

Finally, the fruits of all this will soon be applied to the online Greek reader I talked about at SBL 2016, with a goal to release a prototype for the Johannine gospel and epistles in a couple of months.

Diacritic Stacking in Skolar PE Fixed

2016-12-04

Back in Polytonic Greek Unicode Still Isn’t Perfect and An Updated Solution to Polytonic Greek Unicode’s Problems I talked about problems with stacking vowel length and other diacritics. At least in terms of the font used on this site, the problems are now solved.

After discussions on the Unicode mailing list, it was clear that the solution to better handling of complex diacritic stacking in polytonic Greek was NOT more precomposed forms but better support in fonts, etc. So I reached out to David Březina, the creator of the Skolar typeface, used on this site, to see if the issues could be addressed.

I'm delighted to say that Březina's foundry Rosetta Type has released new versions of Skolar PE that address all the issues I had.

I've now switched over this site to use the new version, which does mean those old posts complaining about the issues will read a little funny as they won't actually show examples of the problems they purport to.

Thank you, David, for listening to my input and making my favourite Greek typeface even better!

UPDATE (2017-01-06): turns out I also needed to add font-feature-settings: "ccmp"; for it to work on Safari.

greek-inflexion and an Update on the Morphological Lexicon

2016-12-02

Exactly seven months ago, I released a generic library, inflexion, and said I'd soon follow it up with the Greek-specific stuff. While I did open-source the latter on GitHub as greek-inflexion shortly thereafter, I didn't want to announce it here until it was further along. I'm happy to say it now is.

If you recall, I said back in May that "it can currently generate every single verb form in Louise Pratt’s intermediate grammar, on Helma Dik’s Greek verb handouts and in Andrew Keller & Stephanie Russell’s beginner-intermediate text book". It now also has much better tooling for parsing new verb forms and guessing the stem of a given form. It also has the start of noun and adjective support.

On a separate morphgnt branch, it now has tooling for testing verb form generation against the MorphGNT/SBLGNT text. The coverage of the stem database is the gospel and epistles of John, Galatians and Mark. I expect to have complete MorphGNT/SBLGNT verb coverage by the end of the year.

The repo is at https://github.com/jtauber/greek-inflexion. Note that it's not pip-installable at the moment and that hasn't been a priority as it's not a library.

As mentioned in my May post, most of the value (and effort) is not so much in the code but in the data. The stemming rules and, in particular, the stem database form the core of the Morphological Lexicon I've been working on for a few years.

The best discussion of the Morphological Lexicon can be found in my SBL 2015 Slides although the vision can be found way back in this blog post from 2004 where I say:

the idea is that surface forms, lexical forms, spelling variations, roots, stems, suppletion, morphophonological rules, etc. will all be catalogued with relationships between them expressed as a directed labelled graph.

So good progress is being made (and it's all available openly as work progresses) and the initial stem and morphophonological rule databases should be completed in the next month.

Alongside that I'm also looking at better representing relationships between stems and also relationships between the stemming rules.

Ultimately, as discussed in my SBL 2015 talk and elsewhere, my goals are to:

freely provide, in a machine-actionable way, all of the morphological information normally found in a Greek lexicon
facilitate tagging of new Greek texts
provide the underlying information to drive a new generation of adaptive Greek readers (the topic of my 2016 SBL talk)
contribute a comprehensive analysis of Ancient Greek of interest to general morphologists
experiment with the notion of an "executable grammar" where all paradigms, rules and assertions are tested automatically against a corpus and, with it, replace the existing plethora of books on paradigms and principal parts.

Particular thanks to Jonathan Robie, who continues to provide the inspiration and encouragement for a lot of this work.

More on Diagramming Greek Accent Placement

2016-11-26

I've put together slides and a voice-over to further explain Greek accent placement from a moraic point-of-view.

After posting Diagramming Greek Accent Placement, a couple of people asked me to unpack the second diagram, so I put together a series of slides with a view to perhaps doing a voice-over to accompany them.

I put the slides up at https://www.academia.edu/29725241/Basic_Greek_Accentuation and immediately got a suggestion to do a voice-over.

Here's the resultant video:

Basic Greek Accentuation from James Tauber on Vimeo.

greek-accentuation 1.0.4 Released

2016-11-26

Three weeks ago I fixed a few bugs in greek-accentuation and ended up doing three releases (although I only blogged about two at the time). I've now done a fourth bug fix release: 1.0.4.

1.0.3 was the bug fix mentioned in Diagramming Greek Accent Placement where paroxytone wasn't being given as a possible accentuation when the penult was long and length of ultima was unknown (e.g. an unmarked alpha).

To this, 1.0.4 adds two new fixes:

syllabify.is_diphthong now works with uppercase letters (fixes a syllabification bug when capitalized word begins with diphthong)
syllabify.add_necessary_breathing now returns a NFKC normalized form (improving rebreath/debreath roundtripping)

You can pip install greek-accentuation==1.0.4. The repo is at https://github.com/jtauber/greek-accentuation.

Diagramming Greek Accent Placement

2016-11-07

Cleaning up code as part of another bug fix to greek-accentuation led me to update an old diagram I'd done showing the Greek accentuation possibilities in terms of morae.

Back in 2014 I came up with the following diagram to try to explain that the "law of limitation" was fairly easy to understand in terms of morae. Once you understand the acute and circumflex accents in terms of morae, it's clear that the accent can just go on one of the final three morae but that if the penult is long and the ultima short, the next-to-last mora is skipped over.
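The same law of limitation can be stated at the syllable level rather than in morae. The sketch below is a simplified assumption, not the greek-accentuation code: the function name `allowed_accents` is hypothetical, it assumes a word of at least three syllables whose penult and ultima lengths are known, and it ignores special cases.

```python
def allowed_accents(penult_long, ultima_long):
    """Accent patterns permitted for a word of three or more syllables."""
    allowed = []
    if not ultima_long:
        allowed.append("proparoxytone")    # acute on the antepenult
    if penult_long and not ultima_long:
        allowed.append("properispomenon")  # circumflex on the penult
    else:
        allowed.append("paroxytone")       # acute on the penult
    allowed.append("oxytone")              # acute on the ultima
    if ultima_long:
        allowed.append("perispomenon")     # circumflex on the ultima
    return allowed


# Walk through the four combinations of penult/ultima length.
for penult_long in (False, True):
    for ultima_long in (False, True):
        print(penult_long, ultima_long, allowed_accents(penult_long, ultima_long))
```

Each combination yields exactly three options, and the long-penult, short-ultima case is the one where an acute on the penult is replaced by a circumflex.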

In trying to fix a bug in greek-accentuation, I was stepping through all the possibilities again (with the additional complexity that the code there sometimes can't tell if a vowel is long or short). I realised it might be clearer to put the four combinations of penult/ultima length in a 2-by-2 matrix.

I added a bit more information on the resulting accents and came up with this:

Let me know what you think. Do other people find this a helpful way to conceptualise things visually?

greek-accentuation 1.0.2 Released (and How Persistent Accentuation Works)

2016-11-04

Hot on the heels of the 1.0.1 bug fix, I've released 1.0.2 with another fix, this time in the persistent accent placement. So I thought I'd explain how persistent accent placement is implemented and what the bug was.

greek_accentuation.accentuation has a function persistent used for placing accents that are persistent, that is, they stay in place through different inflections as much as is allowed by basic accentuation rules.

The persistent function takes both the unaccented word to be accented and a lemma or base form that IS accented.

The first step is seeing which syllable of this base form the accent is on and what type of accent it is. Note that the position of the accent is determined by the syllable position counting from the left, not the right. The code syllabifies both the word-to-be-accented and the base form. It then works out which three (or fewer) syllable placements are allowed on the word-to-be-accented based on the basic accentuation rules. This is provided by another function possible_accentuations.

Now the first thing that's tried is whether the exact syllable position and accent type of the base is in the possible accentuations for the word-to-be-accented. If so, we're done. If not, however, we try changing the accent type from acute to circumflex while keeping it in the same position. If that's still not allowed, we iterate back, trying to place an acute on each successively later syllable until it's an accentuation allowed by the basic rules.

However, this algorithm hit a problem with accenting Ἰουδαιων using the base Ἰουδαῖος.

The first thing it tries is Ἰουδαῖων which of course is not permitted so it immediately jumps to an acute on the next position: Ἰουδαιών. However this is incorrect. The bug was that only a change from acute to circumflex was attempted before trying later positions. In this case, the correct thing to do was try an acute in the same position as the original circumflex.

This was an easy addition and results in the correct answer: Ἰουδαίων
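The corrected search order can be sketched independently of the library. This is an illustration under stated assumptions, not the actual implementation of persistent: accentuations are modelled as hypothetical (syllable index from the left, accent type) pairs, and the set of allowed accentuations for Ἰουδαιων is written out by hand rather than computed.

```python
ACUTE, CIRCUMFLEX = "acute", "circumflex"


def persistent_placement(base_pos, base_type, allowed, num_syllables):
    """Pick an accentuation for the word, staying close to the base form's."""
    # 1. Same syllable, same accent type as the base form.
    if (base_pos, base_type) in allowed:
        return (base_pos, base_type)
    # 2. Same syllable, other accent type (in either direction: the fix).
    other = CIRCUMFLEX if base_type == ACUTE else ACUTE
    if (base_pos, other) in allowed:
        return (base_pos, other)
    # 3. An acute on each successively later syllable.
    for pos in range(base_pos + 1, num_syllables):
        if (pos, ACUTE) in allowed:
            return (pos, ACUTE)
    return None


# Ἰ-ου-δαῖ-ος has a circumflex on syllable 2 (counting from 0 on the left).
# Ἰ-ου-δαι-ων has a long penult and a long ultima, so (by hand) it allows:
allowed_for_ioudaion = {(2, ACUTE), (3, ACUTE), (3, CIRCUMFLEX)}

print(persistent_placement(2, CIRCUMFLEX, allowed_for_ioudaion, 4))
```

With step 2 restricted to acute-to-circumflex only, the search would fall through to step 3 and return the acute on the final syllable, which is the incorrect Ἰουδαιών described above.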

You can pip install greek-accentuation==1.0.2. The repo is at https://github.com/jtauber/greek-accentuation.

greek-accentuation 1.0.1 Released

2016-11-03

A minor bug fix release that fixes a problem with add_necessary_breathing.

My library for accenting Greek which includes a function for adding missing breathing was throwing an exception if given a word beginning with an initial uppercase vowel, e.g. Ιησους

The bug has now been fixed.

You can pip install greek-accentuation==1.0.1. The repo is at https://github.com/jtauber/greek-accentuation.

Thoughts on Voice

2016-09-11

Occasionally I get into conversations about the Greek middle (or voice in general) but I've never written down my thoughts on the topic. Here's an attempt to summarize my current thinking although there's nothing particularly novel about it.

Imagine a transitivity spectrum of high object-affectedness at one end and high subject-affectedness at the other end.

When describing an event, there may be some freedom in where on the spectrum to go but for different choices, there's an ordering of where they would be placed relatively on the spectrum. For example, consider:

I broke the vase
The vase broke
The vase was broken by me

These three descriptions of the same event would be placed, relatively, from left to right on the spectrum.

Now consider each of the following pairs. If being used to describe the same event, the first of the pair would be placed on the spectrum (again, relatively) to the left of the second of the pair:

take / choose (choosing might just be a mental decision but taking involves action)
destroy / perish
resolve / deliberate (resolve is a more active step beyond merely deliberating)
stop / cease
honor / value (you might value something but honoring it is taking action in response to that value)
show / appear (you can just appear but you can also actively show someone)

Now in the imperfective, Greek offers two sets of endings that can (and I stress can) be used to capture the distinction between more to the left and more to the right on the spectrum. In the perfective, Greek offers three sets of endings.

However, where the line is drawn between these two or three segments of the spectrum to map them to the different endings is somewhat arbitrary between different words and it isn't always directly comparable between different tense-aspect forms either. A single set of endings might cover a pretty large part of the spectrum. There is also no "requirement" that a single lexeme use all ending sets available, either. Instead, voice is available as a potential way of conveying the kinds of distinctions in the pairs above and in the three-way distinction in the vase example.

Where distinctions don't need to be made, it should not surprise us to find only "middle" forms in use, especially in cases of lower object affectedness (like in mental verbs). This does mean in the imperfective there is not a separate form for a passive but passivization is less useful (and hence less likely) in these cases. But it should also not surprise us if some mental verbs use active forms.

It should also not surprise us to find, say, the future using the middle where the present uses the active. If the imperfectives only need a two-way distinction, the perfectives can also make just a two-way distinction even if choosing to use the two middle-passive forms to do so.

And if only a one-way distinction is required, there is nothing odd about a lexical item choosing to use a particular one of any of the three available voice endings (although we would expect broad tendencies to be based on object-affectedness).

The "active" is often described as unmarked with the "middle" marked for subject-affectedness but I think it's actually helpful to think less about markedness and more about this transitivity spectrum of relative object-affectedness vs subject-affectedness. One can then think of voice as a largely lexically-determined tool for making relative contrasts on this spectrum.

This way of thinking means that the names of voices should probably not be so absolute but somehow be expressed in purely relative terms. The use of "middle" for the middle of the three isn't bad but "active" and "passive" are highly misleading although they are "more active" and "more passive" than the "middle" when directly contrasting within the same lexeme.

greek-accentuation 1.0.0 Released

2016-07-27

greek-accentuation has finally hit 1.0.0 with a couple more functions and a module layout change.

The library (which I've previously written about here) has been sitting on 0.9.9 for a while and I've been using it successfully in my inflectional morphology work for 18 months. There were, however, a couple of functions that lived in the inflectional morphology repos that really belonged in greek-accentuation. They have now been moved there.

There is syllabify.debreath which removes smooth breathing and replaces rough breathing with an h. And there is syllabify.rebreath which reverses this.
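To make the idea concrete, here is a self-contained sketch of what removing breathings might look like using only the standard library. It is not the library's implementation: the name `debreath_sketch` is hypothetical, and placing the h at the very start of the word is an assumption. Going the other way is harder, since the breathing has to land on the right vowel of an initial diphthong, so the sketch only covers one direction.

```python
import unicodedata

SMOOTH = "\u0313"  # combining comma above (smooth breathing)
ROUGH = "\u0314"   # combining reversed comma above (rough breathing)


def debreath_sketch(word):
    """Drop smooth breathing; turn rough breathing into a leading h."""
    decomposed = unicodedata.normalize("NFD", word)
    has_rough = ROUGH in decomposed
    stripped = decomposed.replace(SMOOTH, "").replace(ROUGH, "")
    result = unicodedata.normalize("NFC", stripped)
    return "h" + result if has_rough else result


print(debreath_sketch("ἀγάπη"))
print(debreath_sketch("ἑξήκοντα"))
```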

The other big change made is there are no longer three top-level modules; everything is enclosed in a greek_accentuation package so instead of from syllabify import * you say from greek_accentuation.syllabify import *.

You can pip install greek-accentuation==1.0.0. The repo is at https://github.com/jtauber/greek-accentuation.

greek-accentuation is made available under an MIT license.

Thanks to Kyle Johnson of the wonderful Classical Language Toolkit project for encouraging me to finally do the 1.0.0 release.

More Parsing of the DCC Principal Parts

2016-07-24

This is part 7 of a series of blog posts about modelling stems and principal part lists and looks in even more detail at the format of the principal parts list in the DCC verbs.

In the previous blog post, I used regular expressions to match DCC principal parts.

In moving from merely matching patterns to actually extracting parts correctly, I encountered further ambiguities.

Recall that previously, I just did matches like

{grk}, {grk}, {grk}, {grk}, {grk}

where {grk} matched any Greek word.

This weekend, I expanded that to patterns more like

{present}, {future}, {aorist}, {perfect_active}, {aorist_passive}
{present}, {future}, {perfect_active}, {perfect_middle}, {aorist_passive}
{present}, {future}, {aorist}, {perfect_middle}, {aorist_passive}

which actually took into account the endings of the Greek words (for example {perfect_middle} only matches Greek words ending in μαι).

Note that the one pattern from the previous blog post becomes three patterns. These more precise patterns, however, enable easier extraction of the actual parts with their morphosyntactic properties.

They also reveal some more inconsistencies. For example, 2nd aorists are not, it turns out, always explicitly marked.

Also, the four-part pattern

{grk}, {grk}, {grk}, {grk}

actually could be any of

{present}, {future}, {aorist}, {perfect_active}
{present}, {future}, {aorist}, {perfect_middle}
{present}, {future}, {aorist}, {aorist_passive}
{present}, {future}, {perfect_middle}, {aorist_passive}
{present}, {future}, {aorist_passive}, {perfect_middle}

The last pattern is necessitated by

δύναμαι, δυνήσομαι, ἐδυνήθην, δεδύνημαι

which is, presumably, an error with ἐδυνήθην and δεδύνημαι transposed.

Besides errors like this, there is at least one ambiguity where the endings aren't enough to disambiguate.

χαίρω, χαιρήσω, κεχάρηκα, κεχάρημαι, ἐχάρην

is ambiguous because κα is a possible aorist ending. The ambiguity can obviously be resolved by looking at the entire form, but given some parts are annotated elsewhere to avoid possible misreading, it might be better to write the above as

χαίρω, χαιρήσω, pf. κεχάρηκα, κεχάρημαι, ἐχάρην

to make perfectly clear the aorist form has been skipped over.

Again, my point is not to nitpick the DCC principal parts list, but rather make explicit the assumptions that principal parts in this format make.

In determining what part a particular form is, the following needs to be considered:

explicit annotation (e.g. pf. for perfects)
ending (μαι ending a form other than the first two parts indicates the perfect middle)
position in the list (both absolutely and relative to other forms whose part is worked out from other considerations)
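The three considerations above can be combined in a small sketch. This is not the conversion script used for the YAML: the function name `identify_parts`, the set of labels handled and the θην test for aorist passives are all simplifying assumptions, and unlabelled forms with no telling ending simply take the next slot in the usual order.

```python
import unicodedata

ORDER = ["present", "future", "aorist",
         "perfect_active", "perfect_middle", "aorist_passive"]
LABELS = {"fut.": "future", "1 aor.": "aorist", "2 aor.": "aorist"}


def identify_parts(entry):
    """Guess which principal part each comma-separated form is."""
    entry = unicodedata.normalize("NFC", entry)
    results = []
    previous = -1  # index in ORDER of the last part identified
    for item in entry.split(", "):
        label, _, form = item.rpartition(" ")
        past_first_two = previous >= 1
        if label in LABELS:                        # 1. explicit annotation
            part = LABELS[label]
        elif label == "pf.":
            part = "perfect_middle" if form.endswith("μαι") else "perfect_active"
        elif label:                                # a label this sketch ignores
            part = None
        elif past_first_two and form.endswith("μαι"):   # 2. ending
            part = "perfect_middle"
        elif past_first_two and form.endswith("θην"):
            part = "aorist_passive"
        elif previous + 1 < len(ORDER):            # 3. position
            part = ORDER[previous + 1]
        else:
            part = None
        if part is not None:
            previous = ORDER.index(part)
        results.append((form, part))
    return results


for form, part in identify_parts("χαίρω, χαιρήσω, pf. κεχάρηκα, κεχάρημαι, ἐχάρην"):
    print(form, part)
```

Dropping the pf. label from that entry makes the sketch read κεχάρηκα as the aorist, which is exactly the misreading the explicit annotation is meant to prevent.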

And the main upshot of all this is I've now converted the DCC principal parts to a YAML format that I'll shortly merge in with the parts from Pratt and Morwood.

Parsing the DCC Principal Parts

2016-07-16

This is part 6 of a series of blog posts about modelling stems and principal part lists and looks more precisely at the format of the principal parts list in the DCC verbs.

We've already discussed that the DCC principal parts are presented slightly differently than the Pratt or Morwood inasmuch as the latter two are in tabular form whereas the DCC list just has a string of comma-separated parts.

In Formatting of Principal Parts we touched on many of the properties of the DCC format but in the spirit of precise modeling, what I've done below is actually write a set of regular expressions that match and enable parsing of every entry in the DCC list.

In the regex patterns below, I've used {grk} for Greek words, optionally preceded by a hyphen. In my code this expands to the regex (-?[\u0370-\u03FF\u1F00-\u1FFF]+). I also have {grk2} which just allows an optional second Greek word separated with "or" or "and". {grk2} hence expands to ({grk}( (or|and) {grk})?). And finally, in a couple of examples, I have {gloss} for glosses consisting of English words including a comma. This expands to ([a-z, ]+).
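For concreteness, here is a minimal sketch of how templates like these can be expanded into real regular expressions using the three expansions just given. The helper name `compile_pattern`, the anchoring with ^ and $, and the NFC normalisation step are my assumptions for illustration, not necessarily what the original code does.

```python
import re
import unicodedata

GRK = r"(-?[\u0370-\u03FF\u1F00-\u1FFF]+)"
GRK2 = "(" + GRK + "( (or|and) " + GRK + ")?)"
GLOSS = r"([a-z, ]+)"


def compile_pattern(template):
    """Expand {grk}, {grk2} and {gloss} placeholders and compile the result."""
    regex = (template
             .replace("{grk2}", GRK2)
             .replace("{grk}", GRK)
             .replace("{gloss}", GLOSS))
    return re.compile("^" + regex + "$")


# NFC keeps accented letters as single precomposed code points,
# which is what the two character ranges in GRK expect.
entry = unicodedata.normalize(
    "NFC", "ἔρχομαι, fut. εἶμι or ἐλεύσομαι, 2 aor. ἦλθον, ἐλήλυθα")

pattern = compile_pattern(r"{grk}, fut\. {grk2}, 2 aor\. {grk}, {grk}")
match = pattern.match(entry)
print(match.group(1))  # the first part
print(match.group(2))  # the whole "X or Y" alternative
```

Because {grk2} nests two {grk} groups inside its own group, the group numbers shift after it, so code that extracts parts needs to account for that (or use named groups).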

The simplest of cases just have a comma-separated list of Greek words. There may only be 1–5 rather than the full six although in these cases, the only gaps in the parts are in the final parts.

"{grk}, {grk2}, {grk}, {grk}, {grk}, {grk2}"
"{grk}, {grk}, {grk}, {grk}, {grk}"
"{grk}, {grk}, {grk}, {grk}"
"{grk}, {grk}, {grk}"
"{grk}, {grk}"
"{grk}"

As mentioned in the previous blog posts, when the third part is a 2nd aorist, that's made explicit. Again, sometimes the 5th and 6th, or 4th, 5th and 6th parts are omitted.

"{grk}, {grk}, 2 aor\. {grk2}, {grk2}, {grk}, {grk}"
"{grk}, {grk}, 2 aor\. {grk}, {grk}"
"{grk}, {grk}, 2 aor\. {grk}"

One pattern skips the second part but this is clear because of the explicit labeling of the third part.

"{grk}, 2 aor\. {grk}, {grk}, {grk}"

However, in one case, "ἔρχομαι, fut. εἶμι or ἐλεύσομαι, 2 aor. ἦλθον, ἐλήλυθα", the second part is explicitly labeled fut. because it is suppletive, even though it is unambiguously the second part by position.

"{grk}, fut\. {grk2}, 2 aor\. {grk}, {grk}"

Sometimes both a 1st and 2nd aorist are given as separate parts. In the sigmatic case, "ἁμαρτάνω, ἁμαρτήσομαι, ἡμάρτησα, 2 aor. ἥμαρτον, ἡμάρτηκα, ἡμάρτημαι, ἡμαρτήθην", the 1st aorist is not explicitly labeled and so the 2nd aorist is actually in the fourth position, the fourth part in the fifth position and so on.

"{grk}, {grk}, {grk}, 2 aor\. {grk}, {grk}, {grk}, {grk}"

However, sometimes the 1st aorist in this case is labeled because it is not sigmatic and so at a glance could be confused for a perfect.

"{grk}, {grk}, 1 aor\. {grk}, 2 aor\. {grk}, {grk}, {grk}, {grk}"
"{grk}, {grk}, 1 aor\. {grk}, 2 aor\. {grk}, {grk}, {grk}"
"{grk}, {grk}, 1 aor\. {grk}"

One example of this (matching the first line above) is "φέρω, οἴσω, 1 aor. ἤνεγκα, 2 aor. ἤνεγκον, ἐνήνοχα, ἐνήνεγμαι, ἠνέχθην".

In one case, "μιμνήσκω, -μνήσω, -έμνησα, pf. μέμνημαι, ἐμνήσθην", the fourth part is skipped and the fifth is labeled pf.. It would probably be clearer if this were labeled pf. mid. or similar.

"{grk}, {grk}, {grk}, pf\. {grk}, {grk}"

In another case, "ἥκω, ἥξω, pf. ἧκα", the perfect active is labeled explicitly because there's no third part and the kappa in the imperfective stem makes the perfect form perhaps harder to identify.

"{grk}, {grk}, pf\. {grk}"

Sometimes an explicit imperfect is given. This is usually at the end, after the usual parts are given.

"{grk}, {grk}, impf\. {grk}"
"{grk}, {grk}, {grk}, {grk}, impf\. {grk}"
"{grk}, {grk2}, 2 aor\. {grk}, {grk}, impf\. {grk}"
"{grk}, {grk}, 2 aor\. {grk}, {grk2}, {grk}, impf\. {grk}"

In one case, "οἴομαι or οἶμαι, οἰήσομαι, impf. ᾤμην, aor. ᾠήθην", (perhaps inconsistently) the imperfect is given before the aorist.

"{grk2}, {grk}, impf\. {grk}, aor\. {grk}"

In one case, "ἀκούω, ἀκούσομαι, ἤκουσα, ἀκήκοα, plup. ἠκηκόη or ἀκηκόη, ἠκούσθην", where there is no fifth part, two forms of the pluperfect are given instead.

"{grk}, {grk}, {grk}, {grk}, plup\. {grk2}, {grk}"

In another case, however, "καθίστημι, καταστήσω, κατέστησα, κατέστην, καθέστηκα, plupf. καθειστήκη, κατεστάθην", this turns out to be a little tricky because it has both a 1st and root aorist but that fact is not made explicit.

"{grk}, {grk}, {grk}, {grk}, {grk}, plupf\. {grk}, {grk}"

Also note the inconsistent use of "plup." vs "plupf.".

There are four cases of just providing various non-standard parts such as imperfects, infinitives or participles (in one case three participle parts).

"{grk}, impf\. {grk2}, infin\. {grk}"
"{grk}, {grk}, impf\. {grk}, infin\. {grk}"
"{grk}, ptc\. {grk}"
"{grk}, infin\. {grk}, ptc\. {grk}, {grk}, {grk}"

In the case of εἶδον, the first part is actually the suppletive 2nd aorist of another verb.

"{grk}, 2 aor\. of {grk}, act\. infin\. {grk}, mid\.infin\. {grk}"

For our purposes this may end up getting treated differently.

There are five other cases where there is additional annotation:

"{grk}, infin\. {grk}, imper\. {grk}, plupf\. used as impf\. {grk}"
"{grk}, {grk}, {grk}, {grk}, {grk} \(but usu\. {grk} instead\), {grk}"
"{grk}, {grk}, {grk}, {grk}, {grk} \(but commonly {grk} instead\), {grk}"
"{grk} \(usually mid\. {grk}\), {grk}, {grk}, {grk}, {grk}"
"{grk}, {grk}, {grk}, 2 aor\. mid\. {grk}, pf\. {grk} \(“I have utterly destroyed”\) or {grk} \(“I am undone”\)"

And finally there are five cases that are clearly typos where the crucial comma delimiter has been omitted or accidentally replaced with a period.

"{grk} {grk} {gloss}, {grk} {gloss}, 2 aor\. {grk} {gloss}, {grk} {gloss}, plup\. {grk} {gloss}, {grk} {gloss}"
"{grk}, {grk}, {grk}, {grk}\. {grk}, {grk}"
"{grk} {grk}"
"{grk} {grk}, 2 aor\. {grk}"
"{grk} {grk}, {grk}, {grk}, {grk}, {grk}"

These correspond to:

ἵστημι στήσω will set, ἔστησα set, caused to stand, 2 aor. ἔστην stood, ἕστηκα stand, plup. εἱστήκη stood, ἐστάθην stood
τυγχάνω, τεύξομαι, ἔτυχον, τετύχηκα. τέτυγμαι, ἐτύχθην
προσήκω προσήξω
ἕπομαι ἕψομαι, 2 aor. ἑσπόμην
βουλεύω βουλεύσω, ἐβούλευσα, βεβούλευκα, βεβούλευμαι, ἐβουλεύθην

These cases should probably just be fixed upstream.
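Delimiter typos of this kind can also be flagged directly rather than via a failed template match. The following is only a rough sketch under the assumption that two Greek words should never be separated by bare whitespace or by a period; the names GREEK, MISSING_COMMA, PERIOD_FOR_COMMA and delimiter_problems are hypothetical.

```python
import re

# Assumed character class for a run of Greek letters and combining marks.
GREEK = r"[\u0370-\u03FF\u1F00-\u1FFF\u0300-\u036F]+"

# Two Greek words separated only by whitespace (comma missing) ...
MISSING_COMMA = re.compile(rf"{GREEK}\s+-?{GREEK}")
# ... or by a period where a comma was expected.
PERIOD_FOR_COMMA = re.compile(rf"{GREEK}\.\s+-?{GREEK}")


def delimiter_problems(entry):
    """Return a list of suspected delimiter problems in a principal parts string."""
    problems = []
    if MISSING_COMMA.search(entry):
        problems.append("missing comma")
    if PERIOD_FOR_COMMA.search(entry):
        problems.append("period for comma")
    return problems
```

Abbreviations such as "pf." or "2 aor." are not flagged because the text before the period is Latin script, not Greek.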

Now, admittedly, it probably would have been quicker for me to just manually convert the 149 strings into some completely unambiguous format rather than write regular expressions that match them all, handling typos and idiosyncrasies. But the approach highlights both specific issues with the DCC list (which admittedly are quite minor; I don't want to detract from the wonderful resource the DCC Core List is) and the value of precise modeling like this in identifying inconsistencies and potential ambiguities in the way this sort of information is presented.

While it's outside the scope of this blog series, I've been exploring for a while similar tests on entire lexicon entries. This pretty quickly exposes inconsistencies. Even in cases where a markup language such as XML is used, unless it's very fine-grained markup (like the Cambridge Lexicon is/was using) lots of inconsistencies and ambiguities can creep in.

All of this comes back to what I talked about in my 2015 SBL and BibleTech talks under the heading of Technical Aspects of Openness and what's involved in making linguistic data truly machine-actionable.

Formatting of Principal Parts

2016-06-26

This is part 5 of a series of blog posts about modelling stems and principal part lists and covers the format of the principal parts themselves in the Pratt, Morwood and DCC verb lists.

Now that we've looked at how the various lemmas interrelate, let's turn our attention to the individual part formatting. Here I just describe the various idiosyncrasies. In subsequent posts, I'll discuss how to bring together (the relevant parts of) this information in a single, machine-actionable format.

Pratt

unattested form cells have emdash —
forms only found with a prefix but listed under the base verb are prefixed - (often still with breathing but sometimes inconsistently not)
alternative forms separated by /
- active vs middle (this will be an important distinction in later posts)
- different augment handling
- stem alternatives
- other spelling differences
some single-letter spelling differences are just indicated with parenthetical letter (could be expanded to just use / as above)
aorists sometimes indicate the root in parentheses where it might not be predictable from the part (particularly useful later for inferring unaugmented stems, etc)
(rarely) section number with paradigm is referenced
(rarely) part-specific gloss is included
forms taken from another synonymous verb indicated by * (although not all suppletion indicated this way)
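To make a few of these Pratt conventions concrete, here is a rough sketch of parsing a single cell. It only handles the em dash, the slash, the leading hyphen and the asterisk; parenthetical letters, roots, section references and glosses are ignored. The Form class, parse_pratt_cell and the assumption that the asterisk may appear anywhere in the form are all hypothetical, not taken from the actual data files.

```python
from dataclasses import dataclass


@dataclass
class Form:
    text: str
    prefix_only: bool = False  # written with a leading hyphen
    suppletive: bool = False   # marked with * (taken from another verb)


def parse_pratt_cell(cell):
    """Parse one principal part cell into a list of alternative forms."""
    cell = cell.strip()
    if cell == "\u2014":  # em dash: unattested, so no forms at all
        return []
    forms = []
    for alt in cell.split("/"):  # alternatives are separated by slashes
        alt = alt.strip()
        suppletive = "*" in alt
        alt = alt.replace("*", "")
        prefix_only = alt.startswith("-")
        forms.append(Form(alt.lstrip("-"), prefix_only, suppletive))
    return forms
```

Returning a list of forms (possibly empty) for every cell means that unattested parts, single forms and alternatives can all be handled by the same downstream code.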

Morwood

includes seventh part for future passive
vowel lengths indicated
pre-contracted forms (especially in future) are shown in parentheses
rare forms are in italics
forms only found with a prefix but listed under the base verb are prefixed - (not normally with breathing but one or two inconsistencies)
alternative forms separated by , (or on new line, see below)
imperfect form sometimes listed under aorist column (marked impf.)
specifically transitive or intransitive forms sometimes marked (tr.) or (intr.)
because alternative lemmas get their own line, corresponding forms can be lined up
(rarely) page number references
(rarely) part-specific glosses
poetic spelling variants sometimes indicated

DCC Greek Core List

unlike Pratt and Morwood, the parts are just a comma-separated list
missing forms are not indicated as such so sometimes fewer than six forms are listed; if there are gaps, the next form is sometimes annotated with which part it is (and sometimes it’s annotated even when it doesn’t need to be)
second aorists are annotated with 2 aor.
where there is a first and second aorist, they can both be given as separate, comma-separated parts (with first annotated as 1 aor.)
non-standard parts are sometimes given (e.g. impf., infin., ptc.)
forms only found with a prefix but listed under the base verb are prefixed - (not normally with breathing)
occasionally further annotated in parentheses, e.g.:
- πειράω (usually mid. πειράομαι)
- προστέθειμαι (but commonly προσκεῖμαι instead)
- τέθειμαι (but usu. κεῖμαι instead)
in a couple of cases forms are glossed (although inconsistently presented):
- pf. ἀπολώλεκα (“I have utterly destroyed”) or ἀπόλωλα (“I am undone”)
- ἵστημι στήσω will set, ἔστησα set, caused to stand, 2 aor. ἔστην stood, ἕστηκα stand, plup. εἱστήκη stood, ἐστάθην stood (note missing comma between first two parts)
alternative forms just listed separated by or

Merging the DCC Lemmas

2016-06-22

This is part 4 of a series of blog posts about modelling stems and principal part lists and covers the Dickinson College Commentaries (DCC) Greek Core lemmas and issues in merging them with the existing merge of Pratt and Morwood.

It was relatively straightforward to merge in the lemmas from the DCC.

Of the 149 verb entries in the DCC, 111 of them matched exactly with an existing Pratt or Morwood lemma (dropping length in the latter as the DCC doesn't include it).

The remaining 38 cases were simple and fell into one of nine categories:

1. multiple spellings [3]

There is only one case of multiple spellings in the DCC verbs (the first one below). In the other two cases, DCC only gives one of the spellings given by Pratt or Morwood.

οἴομαι/οἶμαι is given as "οἴομαι or οἶμαι"
only σκοπέω of σκέπτομαι/σκοπέω is given
only μίγνυμι of μείγνυμι/μίγνυμι is given

2. difference in voice [4]

ἀποκρίνω is given, not ἀποκρίνομαι
πορεύω is given, not πορεύομαι
φοβέω is given, not φοβέομαι
πειράω is given as "πειράω (usually mid. πειράομαι)"

3. compounds of existing base (also in DCC) [17]

DCC focuses more on useful vocabulary rather than useful principal parts in its choice of which verbs to include. In this sense it's the opposite of Morwood. As a result, it includes compounds where Pratt or Morwood would only include the base. In all the cases below, the DCC also includes the base (but they all fall into the categories of 111 words matching exactly).

ἀναιρέω, ἀφαιρέω
ὑπάρχω
συμβαίνω
ἀποδίδωμι, παραδίδωμι
πάρειμι
προσήκω
παρέχω
ἀποθνῄσκω
ἀφίημι
καθίστημι, προστίθημι
καταλαμβάνω, ὑπολαμβάνω
διαφέρω, συμφέρω

ἀποδίδωμι here is somewhat debatable as we already have ἀποδίδομαι in Pratt and Morwood but only under πωλέω.

4. compounds of existing base (not in DCC) [2]

In two cases, DCC has a compound whose base is already in Pratt or Morwood but not in DCC itself.

ἀποκτείνω
ἀπαλλάσσω

5. compounds where other compound but not base existed [1]

In one case, DCC has a compound whose base is not in Pratt, Morwood or DCC but another compound of the same base is.

κατασκευάζω (no σκευάζω but παρασκευάζω existed)

6. compounds with no base existing [1]

And in one case, DCC has a compound where neither its base nor any other compound of that base is in Pratt, Morwood or DCC.

κατηγορέω

7. σσ vs ττ [3]

DCC favours σσ over ττ (whereas Pratt and Morwood use the latter, although Morwood does have ἀλλάσσω alongside ἀλλάττω)

πράσσω
τάσσω
φυλάσσω

8. words appearing under different entry due to suppletion [3]

δέδοικα (Pratt has under δείδω)
εἶδον (Pratt and Morwood have under ὁράω)
εἶμι (Pratt and Morwood have under ἔρχομαι)

9. completely new words [4]

These are unique to DCC.

ἔρομαι
λαλέω
πολεμέω
ἔοικα

Merging the Morwood and Pratt Lemmas

2016-06-21

This is part 3 of a series of blog posts about modelling stems and principal part lists and covers the Morwood lemmas and issues in merging them with Pratt's.

Like Pratt, Morwood conflates the lemma with the first principal part and similarly calls the relevant column “present”.

One of the first differences one notices is that Morwood’s principal parts list indicates vowel length. This is useful in many cases for the accentuation stage of my form generating code. That Morwood indicates length and Pratt doesn’t has at least two implications: (1) it means that any matching between the lists will have to strip length (not a big deal); (2) it raises the question of whether forms in Pratt but not Morwood should somehow be tagged as underspecified for length (perhaps to be later inferred from accentuation or looked up manually in other sources).
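Stripping length for matching purposes can be sketched with the standard unicodedata module: decompose, drop the combining macron (and breve, in case a list uses it), then recompose. The function name strip_length is hypothetical and this is only an illustration of the idea.

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON
BREVE = "\u0306"   # COMBINING BREVE


def strip_length(word):
    """Remove vowel-length marks so lemmas from different lists can be compared."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if ch not in (MACRON, BREVE))
    return unicodedata.normalize("NFC", stripped)
```

Because the result is renormalized to NFC, a Morwood lemma such as κρῑ́νω ends up as the same code point sequence as the NFC form of Pratt's κρίνω.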

Like Pratt, Morwood indicates where a base form is used but a particular compound is more common. As we saw previously, Pratt does this by saying αἰνέω {ἐπαινέω}. Morwood, in turn, says αἰνέω (ἐπ-). Each is fairly easily derivable from the other and whatever our own internal format will be, we should be able to reconstruct both the Pratt and Morwood display. However Morwood will sometimes include more than one preverb. For example στέλλω (ἀπο-, ἐπι-). In this case Pratt just gives στέλλω.
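As a rough sketch of deriving the Pratt display from the Morwood one, assuming a single preverb and a base that needs nothing more than its breathing removed (so no aspiration of the preverb before a rough breathing, and no multiple or alternative preverbs). The function names are hypothetical.

```python
import re
import unicodedata

BREATHINGS = {"\u0313", "\u0314"}  # combining smooth and rough breathing


def strip_breathing(word):
    """Remove breathing marks (a compound has none on the base's vowel)."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in BREATHINGS)
    return unicodedata.normalize("NFC", kept)


def morwood_to_pratt(entry):
    """Convert e.g. 'αἰνέω (ἐπ-)' to 'αἰνέω {ἐπαινέω}' (single preverb only)."""
    m = re.fullmatch(r"(\S+) \(([^\s/,()-]+)-\)", entry)
    if m is None:
        raise ValueError(f"not a single-preverb Morwood entry: {entry!r}")
    base, preverb = m.groups()
    compound = preverb + strip_breathing(base)
    return f"{base} {{{compound}}}"
```

Entries with more than one preverb, such as στέλλω, deliberately raise an error here since Pratt's notation has no equivalent for them.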

Sometimes a single preverb will have alternative spellings (depending on assimilation) which Morwood indicates like πίμπλημι (ἐμ-/ἐν-).

One somewhat unusual feature of Morwood is it will group synonyms such as βιόω and ζάω, or πωλέω and ἀποδίδομαι. It still puts them on separate lines, though, which enables other parts to be correlated.

A similar approach is taken to spelling variations. In Morwood, these are:

ἀλλάσσω and ἀλλάττω
ἁρμόττω and ἁρμόζω
κλαίω and κλᾱ́ω (the latter of which Morwood annotates with (in prose))
αὐξάνω and αὔξω
μείγνῡμι and μῑ́γνῡμι
οἶμαι and οἴομαι

each expressed as a pair of lines.

There are only two other things to note about Morwood’s first column: (1) where he groups βιόω and ζάω, the latter is inexplicably put in square brackets; (2) italics is occasionally used to indicate a form that is rare or non-attested. This is more often seen in parts other than the first but it does occur in the first part in Morwood’s second list in two cases: βλώσκω and δαρθάνω (κατα).

Matching up Pratt and Morwood

There are 73 entries identical in lemma between Pratt and Morwood’s first list. There are 27 entries identical in lemma between Pratt and Morwood’s second list.

There are 14 entries where Morwood simply adds vowel length but otherwise the lemmas are the same (10 in first list, 4 in second).

In three cases the lemmas are in fact the same but the common compound is just formatted differently:

αἰνέω {ἐπαινέω} vs αἰνέω (ἐπ-)
θνῄσκω {ἀποθνῄσκω} vs θνῄσκω (ἀπο-)
κτείνω {ἀποκτείνω} vs κτείνω (ἀπο-)

Similarly in two cases, Pratt just adds the preverb analysis:

[ἀνα]λίσκω vs ἀνᾱλίσκω
[ἀφ]ικνέομαι vs ἀφικνέομαι

(although note ἀνᾱλίσκω also adds vowel length)

In one case, Pratt gives the common compound on the base entry but Morwood doesn't:

ἵημι {ἀφίημι} vs ῑ̔́ημι

(and Morwood adds vowel length)

In five cases, Pratt gives a compound with preverb analysis but Morwood has base (showing common preverb):

[ἀν]οίγνυμι/[ἀν]οίγω vs οἴγνῡμι (ἀν-)
[ἀπ]όλλυμι vs ὄλλῡμι (ἀπ-)
[καθ]εύδω vs εὕδω (καθ-)
[κατα]δαρθάνω vs δαρθάνω (κατα)
[δια]φθείρω vs φθείρω (δια-)

(although note Pratt also has φθείρω as separate entry; Morwood adds vowel length for οἴγνῡμι (ἀν-) and ὄλλῡμι (ἀπ-); Morwood doesn’t have the alternative ἀνοίγω for ἀνοίγνῡμι)

In three cases, Pratt gives the base (as does Morwood) but Morwood adds a common preverb:

μιμνῄσκω vs μιμνῄσκω (ἀνα-)
πίμπλημι vs πίμπλημι (ἐμ-/ἐν-)
στέλλω vs στέλλω (ἀπο-/ἐπι-)

φθείρω vs φθείρω (δια-) would be included here but Pratt separately has [δια]φθείρω.

Also, Pratt has an unmatched [ἀπο]κρίνομαι but Pratt and Morwood have a separate κρίνω and κρῑ́νω respectively.

In two cases, Pratt gives middle form but Morwood gives active form:

μαίνομαι vs μαίνω
ψεύδομαι vs ψεύδω

And in two cases, Morwood gives an indefinite form where Pratt gives 1st singular:

δέω (2) vs δεῖ
μέλω vs μέλει

There are 105 lemmas unique to Pratt (although this includes [δια]λέγομαι and [συλ]λέγω which could be mapped to λέγω). Most of these entries appear to be regular and so, given Morwood’s focus on irregular verbs, it is not surprising there are omissions.

Morwood’s first list adds three new lemmas: ἀποδίδομαι (grouped under πωλέω with which it's suppletive in 3rd part), βιόω and χρή.

Morwood’s second list adds 42 new lemmas: ἄγνῡμι, αἰδέομαι, ἀλείφω, ἅλλομαι, ἁρμόττω / ἁρμόζω, βλώσκω, ἐξετάζω, ζεύγνῡμι, ζέω, καθαίρω, καλύπτω, κείρω, κεράννῡμι, κερδαίνω, κηρῡ́ττω, κρεμάννῡμι, νέω, ὄζω, ὀνινημι, ὀρύττω, ὀσφραίνομαι, ὀφλισκάνω, παίω, περαίνω, πέρδομαι, πετάννῡμι (ἀνα-), πέτομαι, πήγνῡμι, πίμπρημι (ἐμ-/ἐν-), πνέω, σβέννῡμι, σκάπτω, σπάω, σπείρω, σπένδω, σφάλλω, τελέω, τήκω, ὑφαίνω, φείδομαι, χρῑ́ω, ὠθέω.

Concluding Thoughts

Inclusion of vowel length and differences in how common compounds are shown are easy to handle in any model merging these two lists. If bases and compounds get individual entries containing their parts but are otherwise linked via additional properties, we get around those issues too.

However there remain four open issues to deal with:

whether spelling differences that don't span all parts should get separate entries.
how to handle one list giving form in active but another in middle
how to handle one list giving indefinite its own entry, the other putting it under the first person singular
situations where one list uses forms from one lexeme for some of the parts of another

Sources of Principal Part Lists

2016-06-18T17:35:09

This is part 1 of a series of blog posts about modelling stems and principal part lists and covers the three sources of Attic Greek principal parts used to expand and test the Morphological Lexicon.

Because Louise Pratt’s The Essentials of Greek Grammar was the basis for testing a lot of paradigms, it made sense to use it as the starting point for Attic Greek principal parts as well. Pratt lists the principal parts (the standard six, i.e. not separating out the so-called “future passive”) for 247 verbs. The reason for her particular choice of verbs is not indicated beyond their being "common Attic Verbs".

The second source is James Morwood’s Oxford Grammar of Classical Greek. Morwood has two lists, one of "Top 101 irregular verbs" and one of (81) "More principal parts". The title of the first list suggests common verbs are omitted if regular. I have included both lists (although can treat them separately). Morwood includes a seventh part for the “future passive” (when and why this is useful is worthy of a separate blog post).

For my third source I used Chris Francese’s principal parts in the wonderful DCC Greek Core Vocabulary list. The DCC core vocabulary consists of 500 common words of which 151 are verbs.

All three lists included the occasional form outside the usual six or seven principal parts and a future post in this blog series will address the modelling of that.

The DCC principal parts were in electronic form and so were relatively easy to deal with (although I’ll discuss specifics in a later post). Both the Pratt and Morwood lists I did not have in electronic form and so manually keyed them in over the course of a few weeks (mostly in Vienna earlier this year).

I have also referred at times to Wilfred Major’s 80% list (discussed elsewhere on this blog) but, as it doesn’t contain principal parts, it was more of a reference for lemma choice and additional metadata than an input for testing part generation itself.

Of course many other lists could be included but these three are sufficient to establish most of the modelling issues and ensure the code works correctly. Data from other lists can be incorporated later relatively easily.

Lemmas in the Pratt Principal Parts

2016-06-18T23:51:03

This is part 2 of a series of blog posts about modelling stems and principal part lists and covers the complexities in the notion of a lemma identifying lexical entries, specifically in the Pratt principal parts.

Before we get to the other principal parts beyond the first, there is a lot to be discussed just about the first part and its use as a lemma, identifying the lexical entry to which all the parts belong. In this post, we’ll start just looking at the presentation of lemmas in the Pratt list and in the next post move on to the other sources and the problems of merging multiple lists that may differ in choice of lemma for the same lexical entry.

The canonical lemma / first principal part is the present active (or middle) indicative first person singular of the verb but there are at least eight ways in which the first column in the Pratt principal parts table differs from this ideal.

1. Contract verbs

The present active indicative first person singular of a contract verb like ἀγαπάω is, of course, not ἀγαπάω but ἀγαπῶ. The pre-contract version is often used (and is indeed used by Pratt) in lemmas and the first principal part so the stem vowel is explicit (as it’s necessary for generating other forms).

2. Base Verbs With a More Common Compound

Where a base verb gets its own entry but there is a more common compound, Pratt includes the latter in braces:

αἰνέω {ἐπαινέω}
ἀπατάω {ἐξαπατάω}
θνῄσκω {ἀποθνῄσκω}
ἵημι {ἀφίημι}
κτείνω {ἀποκτείνω}

Note that the other parts in this case are still given just for the base verb, even if that means they are not attested in Greek texts.

3. Compound Verbs

In some cases only one compound verb gets an entry, but the preverb is indicated in square brackets:

[ἀνα]λίσκω
[ἀν]οίγνυμι/[ἀν]οίγω
[ἀπ]αντάω
[ἀπο]κρίνομαι
[ἀπ]όλλυμι
[ἀπο]λογέομαι
[ἀφ]ικνέομαι
[δια]λέγομαι
[δια]νοέομαι
[δια]φθείρω
[δι]ηγέομαι
[ἐκ]πλήττω
[ἐπι]θυμέω
[ἐπι]μελ(έ)ομαι
[ἐπι]τηδεύω
[ἐπι]χειρέω
[καθ]εύδω
[καθ]ίζω
[κατα]δαρθάνω
[παρα]σκευάζω
[συλ]λέγω
[ὑπ]οπτεύω

It seems that compound verbs with a common base used for other compound verbs don’t get their own entries at all in Pratt and the base verb is to be referred to in that case. This is one example where bringing in metadata from Major’s list is potentially useful, in making sure common compound verbs can easily be looked up in their base verb form.

4. Multiple Present Stems Conjoined with Slashes

In these cases there are multiple alternative present (or more properly imperfective) stems conjoined with a slash.

[ἀν]οίγνυμι/[ἀν]οίγω
αὔξω/αὐξάνω
καίω/κάω
κλάω/κλαίω
μείγνυμι/μίγνυμι
οἴομαι/οἶμαι
σκέπτομαι/σκοπέω

While these could arguably be treated as separate lemmas (and hence lexical entries) there are two arguments against doing this: (1) the two forms given are really just alternative spellings; (2) the lexical entries converge in other parts.

5. Homographs That Differ In Other Parts

δέω has two senses that, while identical in form in the first part, differ in other parts.

6. Spelling Differences with Optional Letter in Parentheses

There are two cases where an optional epsilon is given in parentheses:

[ἐπι]μελ(έ)ομαι
οἰκτ(ε)ίρω

In some cases the spelling alternative continues into other parts.
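The bracket, brace and slash conventions described in 2, 3 and 4 are regular enough to pull apart mechanically. Here is a minimal sketch; the function name and the dictionary keys are hypothetical, and optional letters in parentheses (as in 6) are left untouched rather than expanded, since dropping the letter can also affect accentuation.

```python
import re


def parse_pratt_lemma(cell):
    """Split a Pratt first-column cell into alternatives and a common compound."""
    common_compound = None
    m = re.search(r"\s*\{(.+?)\}\s*$", cell)  # trailing {more common compound}
    if m:
        common_compound = m.group(1)
        cell = cell[:m.start()]
    alternatives = []
    for alt in cell.strip().split("/"):  # alternative present stems
        m = re.fullmatch(r"\[(.+?)\](.+)", alt)  # leading [preverb]
        if m:
            preverb, rest = m.groups()
        else:
            preverb, rest = None, alt
        alternatives.append(
            {"preverb": preverb, "rest": rest, "form": (preverb or "") + rest}
        )
    return {"alternatives": alternatives, "common_compound": common_compound}
```

Note that "rest" is just what follows the bracket in the printed form, not necessarily the base verb as it would be cited on its own (which would carry a breathing if it begins with a vowel).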

7. Lexemes Where Other Lexemes are Merged In for Other Parts

These aren’t marked in the lemma itself but I’ve included them here as they represent a particular choice of lemma to group parts under. The actual parts from other lexemes are indicated by an asterisk in Pratt. Note that this is not the same as suppletion although arguably there is a fine line worth exploring in more detail at some point.

ἔρχομαι
ἐρωτάω
ἐσθίω
λέγω
πωλέω
ὠνέομαι

8. Lexemes Without An Imperfective Stem

Some words like οἶδα have a lemma which is from a part other than the first. While in some cases when this happens, the lexeme has been merged with another (see 7), this category covers the case where it hasn’t been.

Concluding Thoughts

We’ll see further issues when we look at the other lists and how to merge them but for now let’s discuss possible solutions to the issues seen already.

It is important to note that the information in the first column of the Pratt principal parts table (headed “present”) in the book is serving a number of distinct purposes:

providing an identifier for the entire row (what could properly be called the “lemma”)
providing the first principal part (and hence the present / imperfective stem)
providing additional information about the lexeme such as its preverb / base

By separating these out we have a much clearer way forward. The lemma proper can really be any unique identifier and it can be treated completely opaquely. The first principal part (or parts when there is more than one under a single lemma) can be a separate field. Finally, information such as preverb / base decomposition can be expressed in yet further separate fields. This keeps the first principal part free of extra characters and the lemma opaque.
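One way to picture that separation is a small record type. This is only a sketch: the class name, field names and the identifier format are invented for illustration and are not the Morphological Lexicon's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LexicalEntry:
    lemma_id: str                  # opaque unique identifier (format is arbitrary)
    first_parts: List[str]         # one or more first principal parts
    preverb: Optional[str] = None  # preverb / base decomposition, if any
    base: Optional[str] = None
    # How each source displays the entry, keyed by a source name.
    display: Dict[str, str] = field(default_factory=dict)


entry = LexicalEntry(
    lemma_id="verb-0001",  # hypothetical identifier
    first_parts=["ἀνοίγνυμι", "ἀνοίγω"],
    preverb="ἀν",
    base="οἴγνυμι",
    display={"pratt": "[ἀν]οίγνυμι/[ἀν]οίγω"},
)
```

Keeping the source-specific display strings alongside the clean fields means the original presentation can still be reproduced without polluting the first principal part itself.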

Modelling Stems and Principal Part Lists

2016-06-17

This is part 0 of a series of blog posts about modelling stems and principal part lists, particularly for Attic Greek but hopefully more generally applicable. This is largely writing up work already done but I’m doing cleanup as I go along as well.

A core part of the handling of verbs in the Morphological Lexicon is the set of terminations and sandhi rules that can generate paradigms attested in grammars like Louise Pratt’s The Essentials of Greek Grammar. Another core part is the stem information for a broader range of verbs usually conveyed in works like Pratt’s in the form of lists of principal parts.

A rough outline of (future) posts is:

the sources of principal part lists for this work
lemmas in the Pratt principal parts
merging the Morwood and Pratt lemmas
merging the DCC lemmas
formatting of principal parts
parsing the DCC principal parts
more parsing the DCC principal parts
how to model a merge of the lists
inferring stems from principal parts
stems, terminations and sandhi
relationships between stems
???

pyuca Published in The Journal of Open Source Software

2016-05-19

A research career requires publication in peer-reviewed journals but what if some of your scholarly output is in the form of software? The Journal of Open Source Software attempts to solve that by essentially wrapping peer-reviewed software packages up as lightweight papers. My pyuca library was just accepted for publication by the journal.

pyuca is a Python implementation of the Unicode Collation Algorithm and is a vital part of most of my Greek work because it lets me properly sort Greek words. It's not limited to Greek, though, and the library is potentially useful for anyone doing text processing using Python on natural languages other than English.

pyuca has always been citable in an ad-hoc fashion, but thanks to publication in The Journal of Open Source Software, it can now be cited as a peer-reviewed journal article.

The submission process was straightforward. I dug up an ORCID (a persistent identifier for researchers) I'd acquired a while ago but never used and set up my GitHub repo on Zenodo so a Digital Object Identifier (DOI) gets minted for each release.

I then added a specially-formatted paper.md file to the repo (including my ORCID, abstract about the software and any references) and submitted the repo for consideration.

JOSS reviews are done openly using GitHub issues. A reviewer stepped up and gave some excellent feedback on the usage example in my README and on adding contributor guidelines. Once I'd addressed that feedback, the paper was accepted by the reviewer and the editor-in-chief and a new DOI was minted for the paper itself.

I also got a notification from ORCID that Crossref had found a new work to be added to my ORCID record.

Of course, I could at some point write an article about pyuca but an article about software is not the same as the software itself (they would likely have quite different audiences) and so citing an article about particular software is not the same as citing the software itself. Thanks to JOSS, the distinction can be maintained while still keeping within a framework of peer-reviewed journal articles.

I'm particularly excited that JOSS accepted software with a digital humanities application rather than their typical scientific computing applications.

So if you publish a work that made use of pyuca, you can now cite it as:

Tauber, J. K. (2016). pyuca: a Python implementation of the Unicode Collation Algorithm. The Journal of Open Source Software. DOI: 10.21105/joss.00021

Varro’s Four Parts of Speech for Latin

2016-05-04

In my post Morphological Parts of Speech in Greek last year, I presented a model of five or six parts of speech based purely on what they inflect for. I just found out Varro suggested something similar for Latin over two thousand years ago.

In his article Dionysius Thrax vs Marcus Varro in Historiographia Linguistica 17:1-2 (1990), Daniel Taylor argues for the greater significance of Varro over Thrax in the history of Greco-Roman linguistics.

I actually started reading the article for comparisons made with Theodosius but his description of Varro's parts of speech caught my eye. After introducing Thrax's list of eight parts of speech for Greek (noun, verb, participle, article, pronoun, preposition, adverb, and conjunction) which has dominated since, he describes Varro's for Latin:

His definitions are exclusively grammatical, and there are but four parts of speech: one with case, one with tense, one with both, one with neither.
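Varro's scheme as quoted amounts to a two-feature classification, which can be sketched in a few lines. The function name and the returned labels are mine, not Varro's or Taylor's terminology.

```python
def varro_class(has_case, has_tense):
    """Classify a word by whether it inflects for case and/or tense."""
    if has_case and has_tense:
        return "case and tense"  # e.g. participles
    if has_case:
        return "case only"       # e.g. nouns, adjectives, pronouns
    if has_tense:
        return "tense only"      # e.g. finite verbs
    return "neither"             # e.g. adverbs, conjunctions
```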

This results in a similar division to the first table in my earlier blog post although it conflates infinitives and finite verbs (which Thrax does as well).

It's certainly appealing as an initial taxonomy of parts of speech, for Greek as well as Latin.

Inflexion: Generic Code for Morphological Generation and Parsing

2016-05-01

Over the last few years, I've worked on a number of iterations of code that can generate Ancient Greek verb forms. I've now broken out the Greek-specific pieces and released a generic library called inflexion.

There's nothing particularly innovative about the approach from a computational morphology point of view: it just uses a stem database combined with a list of endings including sandhi rules. I talked a bit about the endings / sandhi rules in my SBL talk last year.
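To give a flavour of the stem plus ending plus sandhi idea, here is a toy sketch. It is emphatically not inflexion's API or data format: the rule table, the function name and the use of unaccented forms are all simplifications chosen for illustration, and only a handful of vowel contractions are covered.

```python
# Toy contraction (sandhi) rules on unaccented forms:
# (stem-final vowel, start of ending) -> contracted result.
CONTRACTIONS = {
    ("ε", "ει"): "ει",
    ("ε", "ου"): "ου",
    ("ε", "ε"): "ει",
    ("ε", "ο"): "ου",
    ("ε", "ω"): "ω",
    ("α", "ο"): "ω",
    ("α", "ω"): "ω",
    ("α", "ε"): "α",
}


def combine(stem, ending):
    """Join a stem and an ending, applying a contraction rule if one matches."""
    for width in (2, 1):  # prefer the longer (diphthong) match
        key = (stem[-1:], ending[:width])
        if len(ending) >= width and key in CONTRACTIONS:
            return stem[:-1] + CONTRACTIONS[key] + ending[width:]
    return stem + ending
```

With these rules, the stem ποιε and the ending ομεν give ποιουμεν, while a stem like λυ that triggers no rule is simply concatenated with its ending. Accentuation would be a separate step.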

It takes a very practical approach, though, and, with a suitable stem database, ending / sandhi rules and accentuation code (all of which I'm releasing separately shortly) it can currently generate every single verb form in Louise Pratt's intermediate grammar, on Helma Dik's Greek verb handouts and in Andrew Keller & Stephanie Russell's beginner-intermediate text book.

There's some support for parsing forms if the stem is known and I'll soon be working on support for when the necessary stem is not yet in the database. There's not yet any notion of stems being related and that will be a big part of future work which might be more interesting from a computational morphology point of view.

In a way, the real power (or "knowledge") is in the pieces not included in this library itself but I wanted to break out the generic code partly in case other people wanted to use it for other inflected languages but mostly just to keep my own code more modular.

The GitHub repo is https://github.com/jtauber/inflexion and example-based documentation is available.

Stay tuned for new releases of the inflexion library but also the stem database, ending / sandhi rules and accentuation code that are specific to Greek.

17th International Morphology Meeting

2016-02-19

I'm currently in Vienna for the International Morphology Meeting.

It's been quite an adventure to get here, which you can read about elsewhere.

If four days full of morphology weren't enough there are workshops specifically on computational methods and discriminative approaches, both of which are obviously of huge interest to me.

I'm also hoping to catch up with Jim Blevins who is a sort of undergraduate version of a Doktorvater to me.

I'm sure in the coming months you'll see a lot on this blog the seeds of which will have been sown at this conference.

(and yes, that was a legitimate use of the future perfect)

An Updated Solution to Polytonic Greek Unicode’s Problems

2016-02-09

In Polytonic Greek Unicode Still Isn’t Perfect, I enumerated various challenges that still exist with using Polytonic Greek when vowel length needs to be marked. I now have a better appreciation of what solutions are actually realistic.

After discussions with people on the Unicode mailing list, it's clear the solution is NOT to add more precomposed character code points to Unicode (or rather, such a solution will never be adopted by Unicode). Rather, the solution likely lies in the tools just understanding grapheme clusters. For more background, see Grapheme Cluster Boundaries in the Unicode Standard Annex on Unicode Text Segmentation.

Perl 6 already has support for this: a layer above code points representing what are considered single graphemes even if made up of multiple code points. See, for example, Jonathan Worthington's slides on Normal Form Grapheme.
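A very rough approximation of the idea in Python is to attach each combining mark to the preceding character. This is only a sketch, and much simpler than the full grapheme cluster rules in the Unicode Text Segmentation annex; the function name simple_clusters is hypothetical.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # unicodedata.combining returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

Under this approach an alpha with macron followed by a combining acute counts as one cluster, even though len() reports two code points.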

So my plan is to at the very least implement a similar approach for Python 3 (unless someone else already has). That will still mean the problem has to separately be solved by:

font foundries
text editor developers
keyboard / input source software developers
operating system developers

I'll try to engage with each of these groups and will keep people posted on my progress.

Thanks to Ken Whistler for making clear that the path forward is not in more precomposed characters but in working with system vendors and font foundries.

Thanks to Markus Scherer and Elizabeth Mattijsen for their pointers to TR29 and the Perl 6 work.

UPDATE (2016-12-04): Now see Diacritic Stacking in Skolar PE Fixed.

Polytonic Greek Unicode Still Isn’t Perfect

2016-01-28

Whether we're talking about fonts, programming languages, keyboard entry or even the command-line, support for polytonic Greek has greatly improved even in the last 10 years, let alone over the 23 years I've been doing computational analysis of Greek texts.

UPDATE (2016-12-04): The Skolar examples in this post will no longer make sense as the issues have now been fixed. See Diacritic Stacking in Skolar PE Fixed.

With configurable input sources in OS X, it's easy to type polytonic Greek and the default fonts support all the Unicode codepoints for polytonic Greek. I can now just type Greek (rather than a transliteration or BetaCode) in data files or forum posts or emails or tweets or GitHub issues. There are still some display issues with using polytonic Greek in fixed-width fonts but that's improving. Last year I talked about the bug I reported that got fixed in the Atom editor.

Python has long supported Unicode and Python 3 made it even easier to deal with text processing of Unicode files. It doesn't sort polytonic Greek correctly out of the box, but I wrote pyuca to solve that problem!

The situation seemed almost perfect until I started doing a lot more work that required me to track vowel length and, in particular use a macron ˉ to distinguish long α, ι, and υ from short. It's okay when the macron is the only diacritic on a vowel: the problems start when a vowel has both an acute and a macron. (There is no need for a macron and a circumflex as the circumflex already implies the vowel is long. Same with an iota subscript.)

Problem 1: No precomposed character code points

ᾱ can be written as the decomposed U+03B1 U+0304 or the precomposed U+1FB1:

>>> import unicodedata
>>> len('ᾱ')
1
>>> [hex(ord(ch)) for ch in 'ᾱ']
['0x1fb1']
>>> [unicodedata.name(ch) for ch in 'ᾱ']
['GREEK SMALL LETTER ALPHA WITH MACRON']
>>> unicodedata.decomposition('ᾱ')
'03B1 0304'

ά can be written as the decomposed U+03B1 U+0301 or the precomposed U+03AC (assuming normalization to a tonos which the Greek Polytonic Input Source on OS X does):

>>> len('ά')
1
>>> [hex(ord(ch)) for ch in 'ά']
['0x3ac']
>>> [unicodedata.name(ch) for ch in 'ά']
['GREEK SMALL LETTER ALPHA WITH TONOS']
>>> unicodedata.decomposition('ά')
'03B1 0301'

But there's no precomposed character ᾱ́:

>>> len('ᾱ́')
2
>>> [hex(ord(ch)) for ch in 'ᾱ́']
['0x1fb1', '0x301']
>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', 'ᾱ́')]
['0x1fb1', '0x301']

As you can see, even Python 3 views ᾱ́ as two characters. This also screws up font metrics in many text editors and browser text areas (like the one I'm writing this post in).

Problem 2: Many fonts with otherwise excellent polytonic Greek support don't display it properly

The Skolar PE font I use on this site can't properly display ᾱ́. It displays it as ᾱ́. Ironically this is one time the fixed width fonts do a better job!

Problem 3: You can't normalize an alternative ordering of diacritics

If you already have a GREEK SMALL LETTER ALPHA WITH TONOS and you add a COMBINING MACRON you end up (at least in the fonts I've tried) with something that even visually looks different from the GREEK SMALL LETTER ALPHA WITH MACRON followed by COMBINING ACUTE ACCENT:

>>> "\u03ac\u0304"
'ά̄'

(Notice that ά̄ != ᾱ́ and oddly, Skolar PE does a better job of the former than the latter: ά̄ vs ᾱ́)

And to make matters worse, you can't normalize one to the other:

>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', '\u03ac\u0304')]
['0x3ac', '0x304']

You have to combine the components in the correct order with the macron FIRST:

>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', '\u03b1\u0304\u0301')]
['0x1fb1', '0x301']
>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', '\u03b1\u0301\u0304')]
['0x3ac', '0x304']

This is not a bug: technically ά̄ and ᾱ́ are distinct graphemes, but it's still an annoyance because any code that adds diacritics needs to know the correct order in which to add them.
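
One way to shield calling code from this ordering rule is to reorder the combining marks before recomposing. What follows is only a minimal sketch, not anything taken from greek-accentuation: the name macron_first is made up for illustration, and it deliberately treats the two orderings as interchangeable, which is a convenience for vowel-length work rather than something Unicode itself sanctions.

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON


def macron_first(text):
    """Move a combining macron directly after its base letter, then recompose.

    Hypothetical helper: it assumes the caller is happy to treat
    acute+macron and macron+acute as the same thing.
    """
    result = []
    marks = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            marks.append(ch)
            continue
        # New base character: flush the marks collected for the previous one.
        # sorted() is stable, so only the macron changes position.
        result.extend(sorted(marks, key=lambda m: m != MACRON))
        marks = []
        result.append(ch)
    result.extend(sorted(marks, key=lambda m: m != MACRON))
    return unicodedata.normalize("NFC", "".join(result))


fixed = macron_first("\u03ac\u0304")
print([hex(ord(ch)) for ch in fixed])  # ['0x1fb1', '0x301']
```

With something like this in place, code that adds a macron to an already accented vowel doesn't need to know which diacritic arrived first.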

Problem 4: No support in the Greek Polytonic Input Source

The Greek Polytonic Input Source supports typing a digraph (diacritic then base) to produce precomposed characters but you can't use a trigraph to enter ᾱ́. In fact, every time I've needed to type ᾱ́ in this post, I've needed to copy paste it from an earlier usage (and manually minted one via Python the first time).

Problem 5: My existing syllabification heuristics didn't work

I recently had to tweak the syllabification heuristics in my greek-accentuation Python library to correctly syllabify words like φῡ́ω. Prior to 0.9.4, it put a syllable division between the macron and the acute!

This would not have happened if Unicode (and hence Python) treated ῡ́ as a single character.
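
A general way to avoid this class of bug is to group each base letter with its trailing combining marks before doing any per-character work. The sketch below is an illustration only: it is not how greek-accentuation is implemented, the function name clusters is made up, and it is a much rougher approximation than the grapheme cluster rules in TR29.

```python
import unicodedata


def clusters(word):
    """Split a word into base characters, each with its combining marks attached."""
    result = []
    for ch in word:
        if result and unicodedata.combining(ch):
            # A combining mark belongs to whatever came just before it.
            result[-1] += ch
        else:
            result.append(ch)
    return result


# phi, upsilon-with-macron + combining acute, omega
word = "\u03c6\u1fe1\u0301\u03c9"
print(len(word))            # 4 code points
print(len(clusters(word)))  # 3 units
```

Any syllable division made between these units can no longer fall between a macron and an acute.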

Problem 6: There's also breathing

I thought I was all set after fixing Problem 5 but then I hit the imperfect of ἵστημι which starts in most cases with ῑ́̔/ῑ̔́ (yes, that should be a rough breathing and acute with a macron.) I'm in the process of working around this problem in greek-accentuation now.

The Solution

The root cause of all this is just that Unicode-based code can't treat ῑ́̔ or ῡ́ or ᾱ́ as single characters because Unicode doesn't have a codepoint for the precomposed characters. I imagine it's a long road to get the Unicode Consortium to "fix" this, if it's even possible. And even if some future version of Unicode fixed it, I'd have to wait for Python and OS X to catch up before the problem really goes away. For now I'll just have to continue to work around the problem in code like my greek-accentuation library. That still doesn't solve the problem with the Skolar PE fonts but I might be able to raise that issue with the font foundry.

It's possible there are additional workarounds or tricks I'm not aware of. If there are, please let me know.

CORRECTION: Thanks to Tom Gewecke for pointing out an earlier misstatement about the Polytonic Greek Input Source on OS X producing combining characters. It does not. It supports digraphs to produce precomposed characters.

CORRECTION: Thanks to Martin J. Dürst for pointing out that ά̄ and ᾱ́ are distinct graphemes and so the fact they aren't normalized to each other isn't a problem with Unicode as such.

UPDATE: I remarked at the end of Problem 1 about font metrics in editors / text areas but really I should make that a separate problem. Related (and perhaps yet another problem) is selecting characters with multiple diacritics.

Updated Solution

Now see my later post: An Updated Solution to Polytonic Greek Unicode’s Problems.

greek-utils 0.1 Released

2016-01-18

While I write and release a lot of Python code for working with Ancient Greek, it tends to be either throwaway code for data wrangling or fairly specialized code for things like accentuation or inflectional morphology.

I decided there needed to be a place to put lightweight utilities that can be used by a range of different projects. This is the motivation for greek-utils.

The initial 0.1 release of greek-utils just provides the following features:

Convert BetaCode to Unicode
Turn an iterable into a generator over trigrams
A Trie datastructure
MorphGNT BCV string to human-readable verse reference

greek-utils is pip installable and the repo is at

https://github.com/jtauber/greek-utils

Full documentation is included there.

I'll be moving a lot more out of gists and individual project repos over the coming months.

Direct Speech Capitalization and the First Preceding Head

2016-01-17

As part of my explicit annotation of the normalization column in MorphGNT, I started down the rabbit hole of capitalization conventions which led to an interesting experiment with direct speech and the GBI syntax trees.

Back in Annotating the Normalization Column in MorphGNT: Part 1, I talked about wanting to catalogue the reasons why a word in the text differs from the normalized form, and annotate the text on a per-case basis. One difference mentioned was capitalization.

In Greek texts printed nowadays, there are three reasons why a word might start with an uppercase letter:

it's a proper noun
it's the start of a paragraph
it's the start of direct speech

So I obviously want to be able to say explicitly, in each case, which it is (of course it could be more than one or even all three, potentially).

The heuristic for the proper nouns is easy if you actually have tagged the proper nouns or lemmatized the text (although there are some inconsistencies as I've already mentioned which need to get cleaned up in MorphGNT).

The start of a paragraph heuristic should be straightforward as the electronic SBLGNT text has paragraphs indicated but there are some oddities I'm looking at (including 30 cases where a word after a paragraph break is not capitalized, some of which are inconsistencies in SBLGNT itself).

The direct speech is most interesting. I started by assuming that, if the lemma isn't capitalized and the word isn't at the start of a paragraph, it must be the start of direct speech. There are 2,225 cases of this in the SBLGNT text underlying the MorphGNT.

Then I implemented a little heuristic where I traversed up the heads from the start of the direct speech (using the dependency version of the GBI Syntax Trees) until hitting a word that preceded the direct speech. Let's call that the first preceding head.
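
The traversal itself is simple enough to sketch. This is a hedged illustration, not the script I ran: it assumes the dependency analysis has already been flattened into a dict mapping each token position to the position of its head (None for the root), which is a made-up representation rather than the actual GBI Syntax Trees format.

```python
def first_preceding_head(start, heads):
    """Walk up the heads from `start` until one precedes it in the text.

    `heads` is a hypothetical mapping of token position -> head position,
    with None marking the root. Returns None if no ancestor precedes `start`.
    """
    seen = set()
    current = heads.get(start)
    while current is not None and current not in seen:
        if current < start:
            return current
        seen.add(current)  # guard against accidental cycles in the data
        current = heads.get(current)
    return None


# Toy example: token 0 is the verb of saying and the root; direct speech
# starts at token 2, whose head is token 3, whose head is token 0.
heads = {0: None, 1: 0, 2: 3, 3: 0}
print(first_preceding_head(2, heads))  # 0
```

A return value of None corresponds to the "no first preceding head" cases.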

My hypothesis was that the first preceding head would be some verb of communication (saying, writing, etc). In theory one might also expect a complementizer but the GBI Syntax Trees don't treat complementizers as heads so they don't come up in practice.

In 1,641 instances, the first preceding head was a form of λέγω. In much rarer instances (no lexeme with more than 64 instances) there were other verbs like γράφω, ἀποκρίνομαι, φημί, ἐπερωτάω, or κράζω.

In some cases the first preceding head was clearly not a verb of communication (and often not a verb at all). Going through the first half of Matthew so far, here are the explanations I've discovered:

in Matt 6.31, three instances of direct speech are disjoined and the GBI Trees model disjunction in such a way that the second and third instances are linked to the first rather than the actual verb of communication, λέγοντες
in Matt 8.9, the verb of communication is elided in the second and third cases so the GBI Tree attaches the direct speech elsewhere
Matt 9.13 has "μάθετε τί ἐστιν" and Matt 12.7 has "εἰ δὲ ἐγνώκειτε τί ἐστιν" and the GBI Trees end up hanging the direct speech (or "meaning") off τί

There were 118 cases in the entire text where there was no first preceding head. Going through the first half of Matthew again, the majority of these are cases where there is no direct speech but a word has been capitalized without an actual paragraph break. However, there are a couple of other interesting scenarios:

in Matt 11.21, we might expect ἤρξατο ὀνειδίζειν to be linked to the direct speech with a participle of saying but none is provided
similarly in Matt 13.33, there is direct speech but no participle linking to ἐλάλησεν

My plan is to go through the rest of the text and describe all the scenarios, but as this is somewhat of an unexpected rabbit hole, it might take me a while.

If anyone is interested in a raw dump of the data with my explanations (covered above) so far, see https://gist.github.com/jtauber/39d85cff34c71a2df169.

MorphGNT 6.07 Released

2016-01-16

The latest release of MorphGNT (with a corresponding release of the Python library py-sblgnt) fixes some lemmatization issues along with a couple of accent and part-of-speech changes.

use acute at end of sentence in Luke 10.38
use ἄγω as lemma of ἄγε per issue #39
use ἱερός lemma in all situations per issue #36
fix accent in συνίημι lemma in Acts 28.26 per issue #37
fixed θαρσέω lemmas where forms use ρσ as well per issue #38
fixed προώρισε(ν) lemma in Acts 4.28 per issue #40
elaborated on part of speech and parsing codes in README
corrected lemmatization of ἤρχοντο in John 4.30 per issue #41
changed μακράν to adverb when lemma is μακράν per issue #33
changed lemma for ἔδει to δέω per issue #24

Thanks Scott Fleischmann, Ulrik Sandborg-Petersen and Emma Ehrhardt.

MorphGNT is available at https://github.com/morphgnt/sblgnt and all issues should be filed there.

py-sblgnt 0.5 is now available on PyPI for those wanting to access MorphGNT via a pip-installable Python API.

Gouin on Language Learning

2016-01-13

I recently found out about François Gouin, a sort of proto-Charles Berlitz who wrote (in French) a book called The art of teaching and studying languages, published in 1880 and then translated and published in English in 1892.

I've only skimmed the book so far but it looks like it contains some real gems relating to the teaching of Greek.

Gouin was a classics professor who attempted to learn German initially using the grammar-translation method used for Latin and Greek. The beginning of the book recounts what an utter failure it was and it's quite an amusing read with section headings such as "An attempt at conversation—Disgust and fatigue—Reading and translation, their worthlessness demonstrated".

After observing three-year-olds playing with language, the light went on and he developed his Series Method, described in the bulk of the rest of the book.

He ends the book discussing implications of his findings for the teaching of Greek and Latin. Again, I haven't read in detail but I did enjoy his scathing remarks about the uselessness of dictionaries for learning a language and his bafflement at the fact students can spend 12 years learning Latin at school and still know nowhere near what someone learning German for six months under his method would know.

If you're interested in the history of second language teaching with particular reference to Latin and Greek, the book might be worth checking out. It's available at Internet Archive.

Off to the Linguistic Society of America’s 90th Annual Meeting

2016-01-06

I'm heading off to the LSA's annual meeting for the first time.

This morning my twitter timeline was filled with classicists heading off to the SCS annual meeting (okay, maybe not filled, but there were three or four). I must follow more classicists than linguists because I didn't see anyone tweeting about heading off to Washington DC for the LSA annual meeting.

The fact they are on at the same time on different sides of the country doesn't exactly help cross-disciplinary collaboration and for a brief moment I wondered which to go to. It was actually an easy choice. I'm far more of a linguist than a classicist, even though most of my linguistics for the last twenty two years has been Ancient Greek related. A quick look at the programmes of each conference reassured me I'd made the right decision.

I don't yet know if anyone I personally know will be there, which always makes conferences awkward for me. I'm also sitting an exam being proctored at a local university on Monday which I need to spend a decent amount of time studying for.

That exam is actually the main reason I haven't blogged much since SBL. That will hopefully change next week when I'm done!

A (Not So) New Numbering System for Greek New Testament Lexemes

2015-12-15T11:40:13

Ten years ago, when Ulrik Sandborg-Petersen and I started collaborating, we came up with a way of referencing lexemes that would satisfy both the lumpers and splitters. At the time we wrote a paper that we circulated to a small audience but now it's finally up on Academia.edu.

The 2006 unpublished paper is entitled A New Numbering System for Greek New Testament Lexemes.

Here's the abstract:

Numbering systems (such as Strong’s) are a popular way to reference the lexemes of the Greek New Testament corpus but a straight enumeration is not without problems, particularly when there is disagreement about whether two forms are the same lexeme or not. We present a way of referencing lexemes that allows competing viewpoints to be represented simultaneously. Existing numbering systems can be mapped into this new system without any loss of granularity and new analyses can be expressed without violating the integrity of existing references into the system.

Functional Dependency in the MorphGNT Table

2015-12-15T17:06:47

Often it's useful to see whether certain columns in a table can be entirely determined by others. For example, can you unambiguously get the lemma from just the form (the answer is no, so a more useful question is which forms are ambiguous as to lemma)? Does knowing the part-of-speech help? Here we provide some code and give some examples.

At the end I provide the script used.

Run in the same directory as the MorphGNT SBLGNT, it runs like this:

$ ./dep.py 6 7
45

What this is telling us is that there are 45 times where the value of column 6 (the normalized form) gives us multiple possible values for column 7 (the lemma). In relational database terms we say that column 7 is not functionally dependent on (or not functionally determined by) column 6 because of those 45 cases.
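
The core of such a check fits in a few lines. The following is a minimal sketch of the idea and not necessarily how dep.py does it: the function name violations is made up, rows are assumed to be already split into tuples, and columns are indexed from 0 here, which may not match the numbering dep.py takes on the command line.

```python
from collections import defaultdict


def violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value.

    `determinant` and `dependent` are lists of 0-based column indexes
    (an assumption made for this sketch).
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in determinant)
        value = tuple(row[i] for i in dependent)
        seen[key].add(value)
    return {key: values for key, values in seen.items() if len(values) > 1}


# Toy rows of (normalized form, lemma)
rows = [
    ("καλῶν", "καλός"),
    ("καλῶν", "καλέω"),
    ("λόγος", "λόγος"),
]
print(len(violations(rows, [0], [1])))  # 1
```

An empty result means the dependent columns are functionally determined by the determinant columns, at least in the rows given.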

If you run:

$ ./dep.py -v 6 7

it will actually list all 45, starting with something like:

ἄμωμον {'ἄμωμος', 'ἄμωμον'}
ἴδε {'ἴδε', 'ὁράω'}
ὑποταγῇ {'ὑποταγή', 'ὑποτάσσω'}
καλῶν {'καλός', 'καλέω'}
Ἰουδαίας {'Ἰουδαῖος', 'Ἰουδαία'}
...

You can also give more than one column for either the determinant or dependent.

For example, does knowing the form AND part-of-speech determine the lemma?

Turns out there are only 8 exceptions in the current MorphGNT/SBLGNT:

$ ./dep.py -v 6,2 7
Ἅννα N- {'Ἅννα', 'Ἅννας'}
ἀνώτερον A- {'ἀνώτερος', 'ἀνώτερον'}
ἀλάβαστρον N- {'ἀλάβαστρος', 'ἀλάβαστρον'}
χρυσᾶ A- {'χρύσεος', 'χρυσοῦς'}
μακράν A- {'μακράν', 'μακρός'}
ὕστερον A- {'ὕστερον', 'ὕστερος'}
ταχύ A- {'ταχύ', 'ταχύς'}
ἤρχοντο V- {'ἄρχω', 'ἔρχομαι'}
8

There are other things that can be explored with this. How many lemmas have more than one part-of-speech in the MorphGNT/SBLGNT?

$ ./dep.py 7 2
70

How many forms have more than one parse analysis extant in the text, even if you know the lemma and part-of-speech:

$ ./dep.py 6,7,2 3
903

Given a lemma, part-of-speech and parse analysis, how many cases are there where multiple alternative forms are seen:

$ ./dep.py 7,2,3 6
132

Looking at these with the -v option, you can see some are unavoidable:

ὁράω V- 1AAI-P-- {'εἴδομεν', 'εἴδαμεν'}
κλείς N- ----APF- {'κλεῖς', 'κλεῖδας'}

whereas others are likely corrections that need to be made to the lemmatization:

τις RI ----GSM- {'τινος', 'τινός'}

The most recent set of corrections to MorphGNT/SBLGNT (which will be in release 6.07) stem from this sort of analysis.

There are still more to discuss and resolve, however. See https://github.com/morphgnt/sblgnt/issues/32 and other issues on GitHub for details and to help in the discussion.

The script

Annotating the Normalization Column in MorphGNT: Part 1

2015-11-27

Since the Series-6 release, MorphGNT has had a column that normalizes the word forms in the text for contextual things like accent changes, elision, movable nu and capitalization. I thought it would be useful to provide an annotation of exactly what normalization had been done for each word in the text and why.

I wrote a short Python script that runs some heuristics on each case where the "word" column and "norm" column differ to determine the nature of the in-context change.

In this post, I'll just report on some statistics. In later posts, I'll dive into further details that rely on actually looking at the surrounding context (rather than just the difference in one row).

There are 47,630 times where the word and norm columns differ.

38,523 times there is a change of accent (clitics, oxytones taking graves, etc).

3,721 times there is a change in capitalization.

1,221 times there is elision: 984 times a straight dropping of a final vowel, 237 times an additional aspiration of the preceding consonant.

5,223 times there is a movable nu. Note that both the existence and absence of nu is normalized to (ν) so this covers all cases where a nu could be dropped as well as the 142 times when it actually is.

226 times there is a movable sigma (20 times where it's actually dropped). This doesn't count ἐξ (another 234 times). There are also 825 times οὐκ appears and 105 times οὐχ appears.

In addition to the 47,630 cases above, there are also 32 other instances of two types of discrepancy that need to be resolved. One is ἑλπίδι with a rough breathing in Romans. The other is the cases where Χριστός appears with lower case χ. I'm not sure what the solution to the former is but the latter might just involve having two distinct lemmata for Χριστός vs χριστός.

All these statistics might seem of trivial interest but they are side effects of a more important task of both verifying the normalization and, as will be covered in subsequent posts, testing context-sensitive accentuation rules.
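
To give a flavour of the kind of per-row heuristic involved, here is a simplified sketch. It is not the script that produced the numbers above: the function name describe_change and the label strings are made up, the example pairs are illustrative, and elision and movable sigma are left out entirely.

```python
import unicodedata

ACCENTS = {"\u0300", "\u0301", "\u0342"}  # grave, acute, circumflex


def strip_accents(text):
    """Decompose and drop accents, keeping breathing and other marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch not in ACCENTS)


def describe_change(word, norm):
    """Return a set of labels explaining how `word` differs from `norm`."""
    labels = set()
    if norm.endswith("(ν)"):
        labels.add("movable nu")
        stem = norm[:-3]
        # Compare against whichever variant the text actually has.
        norm = stem + "ν" if strip_accents(word).endswith("ν") else stem
    w, n = strip_accents(word), strip_accents(norm)
    if w != n and w.lower() == n.lower():
        labels.add("capitalization")
    with_accents_w = unicodedata.normalize("NFD", word).lower()
    with_accents_n = unicodedata.normalize("NFD", norm).lower()
    if w.lower() == n.lower() and with_accents_w != with_accents_n:
        labels.add("accent")
    return labels


print(describe_change("καὶ", "καί"))  # {'accent'}
```

Because a row can get more than one label, category counts produced this way overlap rather than summing to the total number of differing rows.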

Back to a More Sustainable Blogging Pace

2015-11-23

Well, I did it! I blogged a post for every day in the four weeks leading up to my talk at SBL. It was a fantastic motivator but I can't sustain the pace.

I'll try to at least blog once a week with a substantial post at least once a month but we'll see.

There'll hopefully be a lot of ongoing progress to report but I'll also try to occasionally step back and write some more well-thought-out pieces, particularly on general linguistics. For thoughts-in-progress, I'll likely use ThoughtStreams.

I'm really hoping to collaborate with others on all the work I've been talking about over the last four weeks and in my SBL talk, so if you're interested, email me at jtauber@jtauber.com.

And because blogging won't be as regular, please subscribe to get email updates if you haven't already. Just fill out your email address in the form to the right (if you're on the site).

A Morphological Lexicon of New Testament Greek: My SBL 2015 Slides

2015-11-22

This morning I gave my talk at SBL 2015 on my Morphological Lexicon project.

I've put the slides up here.

Analyzing Verbal Morphology: Part 1

2015-11-21

In anticipation of my SBL talk tomorrow, here's an update on my verbal analysis.

As I mentioned in Analyzing Nominal Morphology: Part 1, I started off with nominal morphology but the last couple of years have been more focused on the verb (until a couple of months ago, when I switched back to the noun).

My current modeling approach is actually my third attempt at verbs. Perhaps in a later post I'll describe the earlier approaches and why I backed out and started from scratch twice. I'm happy with the path I'm following now, though.

Unlike the approach I took later with nouns, my verb analysis didn't focus on theme/distinguisher but on stem/suffix with sandhi rules. One reason for this is one of my immediate goals was stem generation.

Prior to running on all the MorphGNT verbs, I started with Helma Dik's Nifty Greek Handouts and the verb paradigms in Louise Pratt's The Essentials of Greek Grammar. Coverage is now those plus all the MorphGNT verbs except for imperatives, subjunctives and optatives.

The code and data are currently available at https://github.com/jtauber/greek-inflection although I may move at least the GNT-specific data to be in the morphological-lexicon repo soon.

The basic approach is to have an "endings" database and a "stems" database. The "endings" database looks like:

PAI.1S:
    - "|>ω<ω|"
    - "|ε>ῶ<ω|"
    - "|ο>ῶ<ω|"
    - "|α>ῶ<ω|"
    - "|ο!>ω<_1|μι"
    - "|ε!>η<_1|μι"
    - "|υ!>υ<_1|μι"
    - "|α!>η<_1|μι"
    - "|ει!>ει<_1|μι"

AAI.1S:
    - "|><|α"
    - "|%>ο<T_1|ν"
    - "|α^>η<_1|ν"
    - "|ε^>η<_1|ν"
    - "|ο^>ω<_1|ν"
    - "|α!>η<_1|ν"

where endings and sandhi are expressed. You can see various stem diacritics like ! for athematic, ^ for root aorists and % for second aorists. T_1 represents a thematic vowel and _1 a particular ablaut pattern.

Alongside this is a larger stem database:

ἀγαπάω:
    stems:
        1-: ἀγαπα
        1+: ἠγαπα
        2-: ἀγαπησ
        3-: ἀγαπησ
        3+: ἠγαπησ
        4-: ἠγαπηκ
        5-: ἠγαπη
        7-: ἀγαπηθησ
ἀναλαμβάνω:
    compound: ἀνά++λαμβάνω
    stems:
        1-: ἀναλαμβαν
        3-: ἀναλαβ%
        3+: ἀνελαβ%
        6-: ἀναλημφθ
        6+: ἀνελημφθ

Stems are keyed by a principal-part-like scheme where - / + marks unaugmented and augmented stems respectively. The 7- stem is the future passive.
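
As a small illustration of how such keys can be consumed, here is a hedged sketch. It is not code from greek-inflection: the helper names parse_stem_key and find_stem are made up, the data is a hand-copied subset of the ἀγαπάω entry above, and it assumes (based on pairs like 1-: ἀγαπα and 1+: ἠγαπα) that + marks the augmented stem.

```python
# A hand-copied subset of the stem database, as a plain dict.
STEMS = {
    "ἀγαπάω": {
        "1-": "ἀγαπα",
        "1+": "ἠγαπα",
        "3-": "ἀγαπησ",
        "3+": "ἠγαπησ",
    },
}


def parse_stem_key(key):
    """Split a key like '3+' into (principal part number, augmented?)."""
    number, sign = key[:-1], key[-1]
    if sign not in ("+", "-"):
        raise ValueError("stem key must end in + or -: %r" % key)
    return int(number), sign == "+"


def find_stem(stems, lemma, part, augmented):
    """Look up the stem for a lemma, or None if that stem isn't listed."""
    for key, stem in stems[lemma].items():
        if parse_stem_key(key) == (part, augmented):
            return stem
    return None


print(parse_stem_key("3+"))  # (3, True)
```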

The stem database can also do overrides for individual paradigm cells, show preverbs, mark enclitics and more.

All this gets tested against the Dik and Pratt examples and the verb forms in the MorphGNT in two ways:

given a lemma and features, is the correct form generated?
given a form, lemma and features, is the correct stem identified?

Once the imperatives, subjunctives and optatives are done, I'll work on stem relationships, essentially treating the stems as another paradigm. I may also at some point generate distinguishers for each verb form (within a particular aspect/tense-voice form).

Further work will involve using it to actually analyze new texts, particularly handling the case where the stem is not yet in the stem database.

Greek Accentuation Library

2015-11-20

I knew that a necessary component of a comprehensive morphological analyzer for Ancient Greek was going to be a library for handling accentuation, so back in January 2014, I started the greek-accentuation Python library.

It consists of three modules:

characters
syllabify
accentuation

The characters module provides basic analysis and manipulation of Greek characters in terms of their Unicode diacritics as if decomposed. So you can use it to add, remove or test for breathing, accents, iota subscript or length diacritics.

>>> base('ᾳ')
'α'

>>> iota_subscript('ᾳ') == IOTA_SUBSCRIPT
True

>>> add_diacritic('α', IOTA_SUBSCRIPT)
'ᾳ'

The syllabify module provides basic analysis and manipulation of Greek syllables. It can syllabify words, give you the onset, nucleus, coda, rime or body of a syllable, judge syllable length or give you the accentuation class of a word.

>>> syllabify('γυναικός')
['γυ', 'ναι', 'κός']

>>> penult('οἰκία')
'κί'

>>> paroxytone('λόγος')
True
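
To show roughly what an accentuation-class test involves, here is a self-contained sketch. It is not the library's implementation: the function accent_position is made up for illustration, and it takes an already-syllabified word (split by hand in the example) rather than calling syllabify.

```python
import unicodedata

ACCENTS = {"\u0300", "\u0301", "\u0342"}  # grave, acute, circumflex


def accent_position(syllables):
    """Return 1 if the accent is on the ultima, 2 for penult, 3 for antepenult.

    Returns None if no syllable carries an accent.
    """
    for position, syllable in enumerate(reversed(syllables), start=1):
        decomposed = unicodedata.normalize("NFD", syllable)
        if any(ch in ACCENTS for ch in decomposed):
            return position
    return None


print(accent_position(["λό", "γος"]))  # 2
```

A paroxytone test, for example, would then amount to checking for an acute at position 2.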

The accentuation module uses the other two modules to accentuate Ancient Greek words. As well as listing possible_accentuations for a given unaccented word, it can produce recessive and (given another form with an accent) persistent accentuations.

The library is open source under an MIT license. You can get the package on PyPI and the source repo is https://github.com/jtauber/greek-accentuation.

The Dangers of Reconstructing Too Much Morphophonology

2015-11-19

What is the genitive singular ending for 2nd declension nouns?

The beginner student probably thinks the ending is ου.

Those that are told the stem ends in ο might be tempted to conclude the actual ending is υ. At least one popular introductory text teaches this but it's incorrect.

Those more familiar with the sandhi rules might conclude the ου could come from ο+σο or ε+σο via οο. Those who know some Homer might speculate an ο+ιο, but ου is also found in Homer (especially in the pronouns) which might seem confusing.

Those who study proto-Indo-European might know of *osyo becoming *ohyo in Proto-Greek then *oyyo.

How should this be modeled synchronically? I think there's too much of a tendency in morphophonology to adopt an "ontogeny recapitulates phylogeny" approach and assume that speakers are storing a historical underlying form and then replaying millennia of sound changes.

The problem here is there's no way a Koine speaker would have reconstructed *osyo during acquisition. In my stem+ending annotations I tentatively used ο+ιο but I'm reconsidering that. There is no evidence I can think of that would have helped a native Koine speaker choose between ο+ιο, ο+σο or ο+ο as underlying.

And given that there is a class of 1st declension masculine nouns whose genitive singular ends in ου despite the α stem ending (which could not result in ου unless the α was actually dropped), it may actually be best to view the speakers' knowledge as the ending just being "ου", the naïve view we wrote off at the start.

At the very least, we need to be very careful when saying "the stem is X, the ending is Y" as to whether we are trying to explain the form historically or the speakers' synchronic knowledge.

Full Citation Forms and Inflectional Classes

2015-11-18

Back in July and August 2014, I started looking at patterns in the full citation forms of nouns in Danker's Concise Lexicon. My goal was partly to explore, in a systematic way, the relationship between inflectional classes and the information expressed in the common pattern of {nominative form}, {genitive ending}, {article}. I also wanted to put together a kind of automated test to catch typos and inconsistencies in the lexicon.

I started drafting a paper with my findings as I went along and I intend to get back to it at some point but I wanted to mention this little project here, point to the code and mention a couple of things coming out of it so far.

The code is available at https://github.com/morphgnt/morphological-lexicon/tree/master/projects/citation_forms.

In particular, the file citation_form_data.py contains the rules (still needing some work outside the basic {nominative form}, {genitive ending}, {article} pattern) for what a full citation form can look like.

Each row in this file contains a tuple of:

a tuple of regexes matching the full citation form, Mounce's category and Dobson's part-of-speech/gender (the last mostly to catch errors in that file)
a tentative new label for the inflectional class
a (potentially empty) list of child rules

For example:

((r"α, ας, ἡ$", r"^n-1a$", r"^N:F$"), "1.1/a1/F", []),

These rules are organized in a hierarchy starting with the most general rules and containing, as children, more specific subsets. The inflectional class labels like 1.1/a1/F are intended to reflect this hierarchy. For example, here are the ancestors of the above rule:

((r"^(\w+), (\w+), (\w+)$", r"^n-", r"^N"), "", [
    ((r"[αη]ς, {art}$", r"^n-1", r"^N:.$"), "1", [
        ((r"ας, ἡ$", r"^n-1", r"^N:F$"), "1.1/F", [
            ((r" ας, ἡ$", r"^n-1", r"^N:F$"), "1.1/F", [
                ((r"α, ας, ἡ$", r"^n-1[ah]$", r"^N:F$"), "1.1/a/F", [

The first line is the most general rule for any nouns whose citation form in Danker has three parts. The next level (given the class 1) are those that have a citation form ending with either ας or ης and then an article. This is further subset (class 1.1/F) into citation forms ending with ας and a feminine singular article. This is further subset into citation forms with no other letters before ας in the genitive ending provided. This is further subset (class 1.1/a/F) into those citation forms whose nominative form ends with α. Because this still results in a Mounce category of n-1a or n-1h, this is further refined into the first line we saw with the inflectional class 1.1/a1/F.
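
Walking such a hierarchy to find the most specific matching class can be sketched as follows. This is an illustration under assumptions rather than the code in the repo: the function name inflectional_class is made up, the rule tree is a cut-down version of the one above (without the {art} placeholder), and the input values in the example are illustrative.

```python
import re
import unicodedata


def nfc(text):
    return unicodedata.normalize("NFC", text)


def inflectional_class(rules, citation, mounce, dobson, label=None):
    """Descend through nested rules, returning the most specific label found."""
    fields = (nfc(citation), mounce, dobson)
    for patterns, rule_label, children in rules:
        if all(re.search(nfc(p), f) for p, f in zip(patterns, fields)):
            # Keep the parent's label if this rule has an empty one.
            return inflectional_class(
                children, citation, mounce, dobson, rule_label or label
            )
    return label


# A cut-down rule tree in the same (patterns, label, children) shape.
RULES = [
    ((r"^(\w+), (\w+), (\w+)$", r"^n-", r"^N"), "", [
        ((r"ας, ἡ$", r"^n-1", r"^N:F$"), "1.1/F", [
            ((r"α, ας, ἡ$", r"^n-1a$", r"^N:F$"), "1.1/a1/F", []),
        ]),
    ]),
]

print(inflectional_class(RULES, "ἡμέρα, ας, ἡ", "n-1a", "N:F"))  # 1.1/a1/F
```

An entry that matches a parent rule but none of its children is exactly the kind of case worth inspecting for inconsistencies.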

From these rules certain inconsistencies show up. For example, "γῆ, γῆς, ἡ" is the only "η, ης, ἡ" entry that gives the full genitive form rather than just the genitive ending. Five of the six masculine words with genitive in "τος" give "τος" with the preceding vowel as the genitive ending but the other one gives the full genitive form. 34 feminine words with genitive in "τος" give just the preceding vowel but one gives the preceding consonant + vowel.

For a lexicon whose editors want consistency in their citation forms, this kind of thing is useful to be able to check programmatically.

Lots more to say when I get around to finishing the paper but I wanted to at least share the code and (in-progress) rules. For the tie-in to inflectional class modeling, I'll soon integrate this work with my recent work on Analyzing Nominal Morphology but I'll also use the "automatic consistency checking" aspect of the work to ensure better consistency in the Morphological Lexicon.

Modern Greek Text to Speech for Biblical Greek

2015-11-17

Text-to-speech is pretty good these days but a lot of people don't realize that operating systems like OS X have support for languages other than English, including Modern Greek. So I thought I'd experiment with using it to read the Greek New Testament.

On OS X, if you go to System Preferences > Dictation and Speech, then select "Customize..." under System Voice, you can download or upgrade your Greek voices. There are a male and female voice you can try: Nikos and Melina respectively.

There are two ways I know of that you can then get those voices to read Greek for you.

The first way is, with Nikos or Melina selected as the System Voice, you select any Greek text in another app (such as TextEdit), right click and select Speech > Start Speaking. This will honour the speed setting in System Preferences > Dictation and Speech. Slowing down the speech drops quality dramatically, though.

The second way is on the command line with say. I can't work out if say supports slowing down the reading (it doesn't honour the speed setting in System Preferences) but it does support outputting the result to an AIFF file.

Note that you can't feed it polytonic Greek so you need to strip breathing and convert accents. I did that to produce a text like this:

Ήν δέ άνθρωπος εκ τών Φαρισαίων, Νικόδημος όνομα αυτώ, άρχων τών Ιουδαίων· ούτος ήλθεν πρός αυτόν νυκτός καί είπεν αυτώ· Ραββί, οίδαμεν ότι από θεού ελήλυθας διδάσκαλος· ουδείς γάρ δύναται ταύτα τά σημεία ποιείν ά σύ ποιείς, εάν μή ή ο θεός μετ’ αυτού. απεκρίθη Ιησούς καί είπεν αυτώ· Αμήν αμήν λέγω σοι, εάν μή τις γεννηθή άνωθεν, ου δύναται ιδείν τήν βασιλείαν τού θεού.
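
The conversion behind a text like the one above can be approximated in a few lines. This is a sketch rather than the exact script I used: the function name to_monotonic is made up, and it simply drops breathings and iota subscripts and turns graves and circumflexes into acutes (which recompose as tonos).

```python
import unicodedata

DROP = {"\u0313", "\u0314", "\u0345"}  # smooth breathing, rough breathing, iota subscript
TO_ACUTE = {"\u0300", "\u0342"}        # grave, circumflex


def to_monotonic(text):
    """Reduce polytonic Greek to base letters plus a single accent mark."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if ch in DROP:
            continue
        out.append("\u0301" if ch in TO_ACUTE else ch)
    return unicodedata.normalize("NFC", "".join(out))


print(to_monotonic("Ἦν δὲ ἄνθρωπος"))
```

Note that this leaves an accent on every originally accented word, including monosyllables, just as in the sample text.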

I then used

say -v Nikos -f john_3_1.txt -o john_3_1

to produce the following AIFF file.

A pretty decent reading of the Greek New Testament with Modern Greek pronunciation.

The only oddity is that the ου in the last clause is spelled out. Not sure how to fix that.

What excites me about this is less the generation of long audio files of entire passages, but more how it could be used in conjunction with an intelligent tutor to pronounce individual words and phrases that the student is currently studying.

Actual Core Vocab Lists for Greek New Testament

2015-11-16

Back in The Core Vocabulary of New Testament Greek I talked about Wilfred Major's 2008 paper on core vocabulary lists for Classical Greek and provided code for producing the same for the Greek New Testament along with some discussion of the results. I didn't actually include the full results, however.

Prompted by Paul-Nitz's request on the B-Greek forum, I put together https://github.com/jtauber/core-gnt-vocab which includes not only the code but actually generated lists (currently 50% and 80% lemma lists).

I've included as a starting point glosses from Dodson but I'd love people to file issues (or even better, pull requests) if they have improvements they'd like to see.

I'm also interested if people think certain lexemes should be split like Major does (e.g. suppletive verbs).

You can get the raw lists at:

First Prototype of New Online Reader

2015-11-15

Over in the lab section of this site, I've added a little prototype Patrick Altman and I built last night.

At the moment it just shows the first paragraph of John 3 but if you click on a word it gives the lemmatization and parsing from MorphGNT, the gloss from Dodson and links to the head and child dependencies based on the GBI Syntax trees.

You can try it out at https://jktauber.com/labs/reader.html.

The source code is available at https://github.com/morphgnt/reader-demo.

Besides the obvious extension to the rest of the GNT text, I'll soon bring in information from the Morphological Lexicon to help readers understand why the form is what it is.

Longer term, I'd like to add user accounts so authenticated users can bookmark passages, words and forms. Giving users the ability to mark which words they do or don't understand means that the site can then produce custom quizzes, recommend what to read next, etc.

This is starting to get to the real heart of learning tools driven by better linguistic databases.

If you're a Django and/or React developer who would like to help with this, let me know. If you teach intermediate students and have feedback on what would make this more useful, I'd also love to hear from you.

Analyzing Nominal Morphology: Part 2

2015-11-14

In Analyzing Nominal Morphology: Part 1, I talked about putting together a list of nominal distinguishers and verifying it on the MorphGNT, generating a per-lexeme theme + distinguisher analysis. Here, I'll outline some further steps I've taken.

As well as producing a YAML file with entries for each lexeme, I also now generate a (space-delimited) tabular form that looks like this:

ἀβαρής a-4a -- M n-3d(2aA) ἀβαρ AS ἀβαρῆ ἀβαρ ῆ εσ+α
ἄβυσσος n-2b -- F n-2b ἀβυσσ GS ἀβύσσου ἀβύσσ ου ο+ιο
ἄβυσσος n-2b -- F n-2b ἀβυσσ AS ἄβυσσον ἄβυσσ ον ο+ν
ἀγαθοποιέω verb PA M n=3c(5b-OU) ἀγαθοποι NS ἀγαθοποιῶν ἀγαθοποι ῶν ουντ+
ἀγαθοποιέω verb PA M n=3c(5b-OU) ἀγαθοποι NP ἀγαθοποιοῦντες ἀγαθοποι οῦντες ουντ+ες
ἀγαθοποιέω verb PA M n=3c(5b-OU) ἀγαθοποι AP ἀγαθοποιοῦντας ἀγαθοποι οῦντας ουντ+ας
ἀγαθοποιέω verb PA F n-1c ἀγαθοποιουσ NP ἀγαθοποιοῦσαι ἀγαθοποιοῦσ αι α+ι
ἀγαθοποιΐα n-1a -- F n-1a ἀγαθοποιϊ DS ἀγαθοποιΐᾳ ἀγαθοποιΐ ᾳ α+ι
ἀγαθοποιός a-3a -- M n-2a ἀγαθοποι GP ἀγαθοποιῶν ἀγαθοποι ῶν +ων
ἀγαθός a-1a(2a) -- M n-2a ἀγαθ NS ἀγαθός ἀγαθ ός ο+ς

The columns are:

lemma
Mounce category (or verb for participles) for overall lexeme
aspect / voice (for participles)
gender
Mounce category used for particular sub-paradigm (different from overall lexeme for adjectives or participles)
lexeme-level theme
case / number
form
form-specific theme
form-specific distinguisher
stem ending and suffix

What's helpful about this format is you can use awk, grep, sort, wc and other Unix tools to very quickly extract information. (I may soon put it in SQL and expose a web interface too). So you can see all the times a particular distinguisher is used, or all the times it's used for a particular case / number. Or what all the sandhi rules are.
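The same kind of query is easy in Python too. The following is a rough sketch rather than anything from the repository: the field names in Row are my own labels for the eleven columns listed above, and SAMPLE just repeats a few of the rows shown earlier.

```python
from collections import Counter, namedtuple

# Field names are invented for this sketch; the order follows the column list.
Row = namedtuple("Row", [
    "lemma", "lexeme_category", "aspect_voice", "gender", "paradigm_category",
    "lexeme_theme", "case_number", "form", "form_theme", "distinguisher",
    "stem_ending_suffix",
])

SAMPLE = """\
ἀβαρής a-4a -- M n-3d(2aA) ἀβαρ AS ἀβαρῆ ἀβαρ ῆ εσ+α
ἄβυσσος n-2b -- F n-2b ἀβυσσ GS ἀβύσσου ἀβύσσ ου ο+ιο
ἄβυσσος n-2b -- F n-2b ἀβυσσ AS ἄβυσσον ἄβυσσ ον ο+ν
ἀγαθοποιός a-3a -- M n-2a ἀγαθοποι GP ἀγαθοποιῶν ἀγαθοποι ῶν +ων
ἀγαθός a-1a(2a) -- M n-2a ἀγαθ NS ἀγαθός ἀγαθ ός ο+ς
"""


def parse_rows(text):
    """Yield one Row per well-formed (11-field) space-delimited line."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 11:
            yield Row(*fields)


def distinguishers_for(rows, case_number):
    """Count how often each distinguisher occurs for a given case / number."""
    return Counter(r.distinguisher for r in rows if r.case_number == case_number)


rows = list(parse_rows(SAMPLE))
print(distinguishers_for(rows, "AS"))
```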

I've already written a Python script that generates a list of paradigms based on this (keyed off Mounce category for now, until I've finalized my own, which will actually be defined by these paradigms).

The paradigms look like:

n-3b(1) M (10):
    NS:   ξ          {κ+ς}
    GS:   κος        {κ+ος}
    DS:   κι         {κ+ι}
    AS:   κα         {κ+α}
    NP:   κες        {κ+ες}
    GP:   κων        {κ+ων}
    AP:   κας        {κ+ας}
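
A listing in this shape can be produced by grouping the tabular rows. What follows is only a sketch under assumptions, not the script in the repository: it groups by the sub-paradigm category and gender columns, and it treats the number in parentheses as a count of distinct lexemes, which is my guess at what that figure means here.

```python
from collections import defaultdict

SAMPLE = """\
ἄβυσσος n-2b -- F n-2b ἀβυσσ GS ἀβύσσου ἀβύσσ ου ο+ιο
ἄβυσσος n-2b -- F n-2b ἀβυσσ AS ἄβυσσον ἄβυσσ ον ο+ν
ἀγαθοποιός a-3a -- M n-2a ἀγαθοποι GP ἀγαθοποιῶν ἀγαθοποι ῶν +ων
ἀγαθός a-1a(2a) -- M n-2a ἀγαθ NS ἀγαθός ἀγαθ ός ο+ς
"""

CELL_ORDER = ["NS", "GS", "DS", "AS", "VS", "NP", "GP", "DP", "AP", "VP"]


def build_paradigms(text):
    """Group rows by (sub-paradigm category, gender)."""
    # key -> case/number -> set of (distinguisher, stem ending + suffix)
    paradigms = defaultdict(lambda: defaultdict(set))
    # key -> set of lemmas seen with that key
    lexemes = defaultdict(set)
    for line in text.splitlines():
        f = line.split()
        if len(f) != 11:
            continue
        key = (f[4], f[3])
        paradigms[key][f[6]].add((f[9], f[10]))
        lexemes[key].add(f[0])
    return paradigms, lexemes


def show(paradigms, lexemes):
    """Print each paradigm with its cells in a fixed case / number order."""
    for key in sorted(paradigms):
        print(f"{key[0]} {key[1]} ({len(lexemes[key])}):")
        for cell in CELL_ORDER:
            for dist, analysis in sorted(paradigms[key].get(cell, ())):
                print(f"    {cell}:   {dist:<10} {{{analysis}}}")


paradigms, lexemes = build_paradigms(SAMPLE)
show(paradigms, lexemes)
```

Note that this sketch leaves the accents on the distinguishers as they appear in the tabular rows.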

There's actually a feedback loop where inconsistencies and errors spotted in this paradigm output inform corrections to the underlying distinguisher rules.

The code and data are available at https://github.com/morphgnt/morphological-lexicon/tree/master/projects/nominal_distinguishers.

Initial Thoughts on the Cost of Learning a Form

2015-11-13

Over the years, when generating vocab coverage stats or orderings for graded readers, I've used either lemmas or inflected forms as the items being learnt.

The problem with using inflected forms is that it assumes knowing one form of a lexeme has nothing to do with knowing any other form of that lexeme. The problem with using lemmas is that it assumes knowing one form of a lexeme is enough to know all of them.

Of course, the path forward lies somewhere in between and one of the motivations for all my Morphological Lexicon work is to have the necessary data in machine-actionable form to take a much more intelligent approach to the relationship between knowing one form and knowing another.

This gets into some very deep areas of psycholinguistics and learnability but, for now, I'm mostly just looking for a better measure of the "cost" or "effort" of learning a new form for the purposes of judging readability, etc. than just assuming all forms are equal or that learning a lemma gives you all the forms.

An initial improvement could be made by using themes and distinguishers. Consider λόγου, whose theme is λογ and distinguisher is ου. The theme identifies the lexeme (by definition it's the part of the word shared by all cells in a paradigm for a particular lexeme). The distinguisher both identifies some morphosyntactic properties (the fact it's a genitive singular, assuming we can tell it's a nominal) and gives some hints as to inflectional class (i.e. it reduces the possible distinguishers other cells in the paradigm can take).

So a simple way of modeling things is to say that, in order to understand λόγου, you need to know λογ and ου. Breaking apart the themes and distinguishers is an improvement over just looking at lexemes or forms. Using the theme takes care of suppletive stems too. (Although it does raise the question: does learning that two suppletive stems are the same lexeme cost effort or save it?)
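
As a very rough sketch of that simple model in Python (the function name and the flat cost of 1 per new theme or distinguisher are assumptions purely for illustration, with no discounts of any kind):

```python
def form_cost(theme, distinguisher, known):
    """Count how many of the two components are new, then mark them as learnt."""
    items = {("theme", theme), ("distinguisher", distinguisher)}
    new = items - known
    known |= new  # update the learner's state in place
    return len(new)


known = set()
print(form_cost("λογ", "ου", known))     # both parts are new
print(form_cost("λογ", "ον", known))     # theme already known
print(form_cost("ἀνθρωπ", "ου", known))  # distinguisher already known
```

Summing form_cost over the forms of a text in reading order gives a crude effort measure that sits between counting lemmas and counting inflected forms.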

There are a few situations that need more consideration though. The first is stems that aren't truly suppletive but are systematically derived from one another (e.g. λαμβαν / λαβ). To a first approximation, you could just model this as full suppletion in terms of effort but a more refined approach would be to give a "discount" on the effort of learning λαμβαν if you already know λαβ or vice-versa. Even then, you'd likely only want to provide that discount once learning the nu-infix pattern had been costed.

Secondly, consider families of distinguishers for the same properties that differ because of sandhi (either in that particular cell or in others, causing the theme to have less of the stem). For example here are the 28 distinguishers for dative singular nominals according to my current analysis: -ᾳ, -αντι, -ατι, -γι, -δι, -ει, -ειρι, -ενι, -εντι, -ῃ, -ι, -ιδι, -κι, -κτι, -νι, -ντι, -οϊ, -ονι, -οντι, -οτι, -ουντι, -πι, -ρι, -τι, -τῳ, -υϊ, -ῳ, -ωντι. The reason 28 are needed is sandhi in other cells such as the nominative singular. The only ending is -ι so you really only need to know that one thing (plus perhaps that iota is subscripted after a long alpha, eta or omega). The distinguisher analysis is still useful (particularly for its role in hinting at inflectional class) but the cost should be massively discounted once you recognize the -ι pattern.

Thirdly, I haven't yet talked about costs and discounts for the actual sandhi rules. Should the -ους ending in the genitive singular (for stems ending in εσ or οσ) be discounted if you know both the genitive singular ending -ος and the εσ+ος → ους / οσ+ος → ους sandhi rules?

And finally, while I've talked a couple of times here about the distinguisher hinting at the inflectional class, that information hasn't been incorporated into any costing or discounting in our discussions yet. It's worthy of a little more research into the psycholinguistics literature, but presumably seeing something like πίνακος primes you for recognizing πίναξ. It's also potentially useful for disambiguation: if you know the nominative plural ends in -ες, for example, then you know that -ος is a genitive singular not a nominative singular.

There's clearly lots more to explore but it reinforces what I keep saying: having data like the distinguisher analysis opens us up to explore this sort of thing and potentially incorporate it in new learning tools.

In this post, I've just talked about morphology, but things can of course be extended (and need to be extended) to constructions beyond the word. That, of course, requires richer analysis beyond what I'm doing with the Morphological Lexicon but that is something I eventually want to tackle as well.

Analyzing Nominal Morphology: Part 1

2015-11-12

While much of my work going back 10 years or more was on the nominals, the last few years I've been focused on verbal morphology. I decided that for my SBL paper, however, I'd revisit some of my noun work and ended up exploring some ideas afresh.

By nominals I mean nouns, adjectives, determiners, pronouns, proforms, participles. Basically anything marked for case (see Morphological Parts of Speech in Greek).

I wanted to, at the very least, generate themes and distinguishers for the nominals. But once you have that, you have a nice set up to explore stems, endings and sandhi. This is a nice interface into some of the general (i.e. not language-specific) morphology I was doing for my PhD. Finally, it enables me to get back to my long-running goal of laying out a system of inflectional classes that improves on Funk, Mounce and others.

You can see the work in progress at https://github.com/morphgnt/morphological-lexicon/tree/master/projects/nominal_distinguishers.

The first phase involved enumerating the possible distinguishers for each combination of case/number/gender. This was done incrementally, running a Python script that (a) showed me forms that weren't covered by the existing list; (b) showed me lexemes that had more than one theme. In some cases, having multiple themes reflected legitimate suppletion but in other cases it meant I hadn't gotten the theme/distinguisher split right. Because I had them in electronic form, I also used Mounce's inflectional classes as a hint to disambiguate distinguishers.

So the first phase involved creating a file that looked something like this (just a very small subset of what is currently an 851-line file):

NSM:
    - ας n-1d α+ς
    - ης n-1f η+ς
    - ος n-2a ο+ς
    - ψ n-3a\(1\) π+ς
    - ψ n-3a\(2\) β+ς
    - ξ n-3b\(1\) κ+ς
    - ξ n-3b\(2\) γ+ς
    - ξ n-3b\(3\) χ+ς
    - ους n=3c\(2-OD\) οδ+ς
    - ς n-3c\(1\) τ+ς
    - ς n-3c\(2\) δ+ς
    - ς n-3c\(3\) θ+ς

You'll notice I annotated each distinguisher with the underlying stem ending and inflectional ending. You can see I needed to use Mounce's codes (for now) to disambiguate distinguishers like ψ, ξ and ς. You'll also notice I had to invent my own temporary extensions to Mounce in the case of οδ+ς → ους because there are deliberately no sandhi rules built in to my scripts (more on that later).

My initial script takes the above file, runs across all forms in the MorphGNT SBLGNT and produces entries like the following:

ἀγαλλίασις:
    forms:
        F:
            theme(s): ἀγαλλιασ
            NS: ἀγαλλίασις ἀγαλλίασ|ις ϳ+ς
            GS: ἀγαλλιάσεως ἀγαλλιάσ|εως ϳ+ος
            DS: ἀγαλλιάσει ἀγαλλιάσ|ει ϳ+ι
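
The core step of splitting a form into theme|distinguisher can be sketched like this. It is a simplified illustration, not the script in the repository: it compares accent-insensitively and just takes the longest candidate that ends the form, whereas the real rules also use the Mounce codes to disambiguate. The names strip_accents and split_form are made up.

```python
import unicodedata

ACCENTS = {"\u0300", "\u0301", "\u0342"}  # grave, acute, circumflex


def strip_accents(s):
    """Remove accents but keep breathings, diaereses and iota subscripts."""
    decomposed = unicodedata.normalize("NFD", s)
    kept = "".join(c for c in decomposed if c not in ACCENTS)
    return unicodedata.normalize("NFC", kept)


def split_form(form, candidates):
    """Return (theme, distinguisher) using the longest candidate ending the form.

    Assumes the NFC form keeps one code point per letter once accents are
    stripped, which holds for ordinary precomposed polytonic Greek.
    """
    form = unicodedata.normalize("NFC", form)
    bare = strip_accents(form)
    for cand in sorted(candidates, key=len, reverse=True):
        cand = strip_accents(cand)
        if bare.endswith(cand) and len(cand) < len(bare):
            cut = len(form) - len(cand)
            return form[:cut], form[cut:]
    return None  # form not covered by the current distinguisher list


print("|".join(split_form("ἀγαλλιάσεως", ["εως", "ος", "ς"])))
```

Forms for which split_form returns None correspond to the "not covered by the existing list" report mentioned above.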

In some (not necessarily immediately) following posts, I'll talk more about additional outputs and other scripts in the pipeline.

This mini-project is a great example of where having a deterministic verification process on manually tweaked rules works well (over, say, trying to automate the generation of the rules entirely).

Technical Aspects of Openness

2015-11-11

In my previous post, I talked about the legal / licensing aspects of open linguistic data but there are also technical aspects to linguistic data being open.

To illustrate, consider an out-of-copyright, printed lexicon. From a licensing point of view, it's open—it can be redistributed with or without modifications, etc. But that doesn't make it particularly usable for computational work.

A while ago I came across something Greg Crane had written where he talked about things being machine-actionable. I like this a lot more than "machine-readable" because it isn't just about being able to "read" the work, but about being able to actually do interesting things with it.

There are various facets of this so I thought I'd try to enumerate some of them.

correctable — can I make corrections if I find mistakes?
verifiable — can I write code to check for errors?
reproducible — can I reproduce the results others have found?
extensible — can I extend it with my own data or data from other sources?
queryable — can I search, filter, or sort the data to get subsets of interest?
reusable — can I use the same data for multiple applications?
repurposable — can I use the data for purposes not conceived of initially?
adaptable — can I produce different variants of the data applicable to different users?

My BibleTech 2015 talk touched on a number of these.

I should note that it's entirely possible to have works that are proprietary from a licensing point of view but completely open technically. I may be able to purchase a database that I can't redistribute but which is in a clean, consistent format I can write software to process. It has the disadvantage that I can't make corrections available to others or redistribute derivative works, but it's better than a closed-license work that's also closed with regard to the facets discussed in this post.

Why I Use CC-BY-SA Licenses

2015-11-10

I don't think I've ever articulated why I favour a Creative Commons CC-BY-SA license on all my New Testament Greek data.

I don't mean why do open scholarship in general, but why my specific choice of Attribution-ShareAlike?

I avoid NoDerivs (ND) because I want people to build on my work, make corrections, add new analyses.

I use ShareAlike (SA), though, because I want to be able to incorporate corrections and new analyses back and want to avoid private forks of projects. Note that when it comes to software, I generally favour MIT/BSD-style licenses that aren't viral. But when it comes to data and analyses, I want the openness to be viral.

Perhaps more controversially, I avoid NonCommercial (NC). My reason is simple: I don't want someone who wants to use my work in a commercial package to have to waste time reinventing the wheel and redoing everything just so they can use it. Duplication of effort doesn't help anyone. Because of the ShareAlike, a commercial project can't make private forks. I don't care if someone is making money as long as improvements they make to my work are shared back.

Creative Commons doesn't have a license that requires ShareAlike but not Attribution but, even if they did, I'd use Attribution (BY). Particularly in scholarship, I think it's important to give credit where credit is due. Plus having a chain of who did the work is useful for providing corrections upstream.

My arguments for using ShareAlike and Attribution are why I don't like just putting things in the "public domain" / under a CC0 license. (Incidentally, I put "public domain" in quotes because it's an ill-defined concept, which is why the CC0 license was developed in the first place. Even if you're not persuaded by my arguments for BY-SA, at least use CC0 rather than saying "public domain").

Finally, I'd be remiss if I didn't acknowledge the great work of the Creative Commons organization in making all this possible.

Mean Log Frequency of Dependency Paths

2015-11-09

Adding another potential readability metric, let's look at the mean log frequency of dependency paths.

So far we've looked at the mean log frequency of lexemes, the mean log frequency of forms, and, after calculating dependency paths or "swords", the mean dependency depth.

What we haven't looked at is the mean log frequency of those dependency paths—a rough proxy for a target having common (rather than merely shallow) syntactic structures.
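
The calculation itself is the same as for lexemes and forms, just over a different kind of item. Here is a minimal sketch of the idea, with the caveat that the representation of a path as a string and the toy data are invented for illustration, and that the scaling used to get the integer scores below isn't reproduced here.

```python
import math
from collections import Counter


def mean_log_frequency(target_items, corpus_counts):
    """Mean of -log(relative frequency) over a target; lower means more common."""
    total = sum(corpus_counts.values())
    scores = [-math.log(corpus_counts[item] / total) for item in target_items]
    return sum(scores) / len(scores)


# Toy corpus: each item stands for one word's dependency path (made up).
corpus = ["S", "S", "S", "S>O", "S>O", "S>O>ADV", "S", "S>O"]
counts = Counter(corpus)

chapter_a = ["S", "S", "S>O"]          # mostly common paths
chapter_b = ["S>O>ADV", "S>O", "S>O"]  # includes a rare path

print(mean_log_frequency(chapter_a, counts) < mean_log_frequency(chapter_b, counts))
```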

By this measure, the top five (i.e. lowest scoring) books are:

4832 1 Corinthians
4929 3 John
4935 1 John
4938 John
5027 James

and the top 10 chapters are:

4183 1 Corinthians 13
4362 1 Corinthians 9
4386 1 Corinthians 14
4485 Romans 14
4486 John 16
4550 1 John 3
4558 2 Corinthians 11
4564 1 Corinthians 6
4566 1 Corinthians 7
4576 John 7

It is interesting just how much 1 Corinthians features here. The book (and those chapters featured above) do poorly in terms of mean log frequency of lexemes.

If 1 Corinthians is actually syntactically easy to read, I wonder if that's an argument for having some readings which, because of vocab, need to be heavily footnoted with glosses but which are still worth reading early because of the syntax.

At the Half Way Point

2015-11-08

Exactly two weeks ago I said I'd be blogging every day until my talk at SBL. Well, that's two weeks away so I'm at the half way point. I think the blogging has gone well.

Many of the posts have been things I've had drafts of for a while. Others have been ideas that haven't taken long to get down in a post. Attempting to blog every day means I haven't really worked on posts that represent multiple days, much less weeks or months, of work.

In the next two weeks I do hope to talk about a few longer-running projects but, that said, I do enjoy getting down an idea or concept that's just a short post but which has been on my mind for years.

Thanks to the people who have so far engaged with my posts via email and elsewhere. My interactions with you are a huge motivation for me doing this.

Generating Readers

2015-11-07

Back in April 2014, Brian Renshaw posted a Good Friday Greek Reader. It was presumably manually produced but I knew such things could be generated automatically and so went about building a system to do so.

You can see a sample PDF at https://github.com/jtauber/greek-reader/blob/master/example/reader.pdf which roughly looks like what Brian produced.

From a code point of view, it's a fairly simple Python 3 script that generates LaTeX that is then typeset using XeTeX. There is also an experimental backend using SILE. The code is open source under an MIT license and is available at https://github.com/jtauber/greek-reader. It assumes you're comfortable with those tools and editing text files to tweak things, but my hope is eventually a website could be built around this.

To produce a reader like this, whether manually or automatically, you need:

1. a text
2. lemmatization
3. frequency counts
4. glosses
5. full citation forms / headwords (e.g. λαμπάς, άδος, ἡ) for nominals
6. parsing (e.g. AAI 3S) for verbs

MorphGNT gave me 1, 2, 3 and 6. 4 came from Dodson (although you can override both globally and per verse) and 5 came from Danker's Concise Lexicon.

What's nice about doing this programmatically, besides the fact that you can make corrections upstream and have them applied to all the generated readers, is that you can make this adaptive. In the example, I chose which words to annotate based on frequency but it could just as easily be based on other criteria such as what a particular student has learnt up to this point or what has been covered in a particular textbook up to this point.
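
To make the adaptive part concrete, here is a small sketch of the selection step. It is not the code in greek-reader: the function name, the max_count threshold and the counts are all invented for illustration (the numbers are toy values, not real corpus counts).

```python
from collections import Counter


def lemmas_to_annotate(passage_lemmas, corpus_counts, max_count=30, known=frozenset()):
    """Pick the lemmas in a passage that are rare in the corpus and not yet learnt."""
    return sorted(
        lemma
        for lemma in set(passage_lemmas)
        if corpus_counts[lemma] <= max_count and lemma not in known
    )


# Toy counts for illustration only.
corpus_counts = Counter({"ὁ": 1000, "λέγω": 200, "λαμπάς": 5, "ἔλαιον": 8})

passage = ["ὁ", "λαμπάς", "λέγω", "ἔλαιον"]
print(lemmas_to_annotate(passage, corpus_counts))
print(lemmas_to_annotate(passage, corpus_counts, known={"ἔλαιον"}))
```

Swapping the frequency test for "not in this student's known set" or "not yet covered in this textbook" only changes the condition, not the rest of the pipeline.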

One major feature I want to add, though, is richer annotation both morphologically AND syntactically so it becomes possible to generate something more akin to Zerwick and Grosvenor's A Grammatical Analysis of the Greek New Testament.

One major motivation for my continuing work on a Morphological Lexicon is being able to provide more focused, helpful annotations for readers indicating not just a lemma but a principal part or some additional information that helps the student understand the form.

For the syntax, I'd like to eventually develop a catalog of constructions so, much like forms are only annotated if they are less frequent (or otherwise unknown to the student), particular syntactic constructions in a text can be called out based on similar criteria. Some of this is possible with existing syntactic analyses; the trick is knowing which annotations to include and which are already obvious. (I have some ideas for how to crowdsource difficult constructions, but more on that later).

The greek-reader project is a great example of a pretty simple tool that can do a lot because it builds on rich data. As we get better and better data, we can build better and better tools.

Inline Annotation of Sandhi

2015-11-06

In many Greek morphology projects, I've wanted a way of conveying the surface form of an inflected word while also conveying the underlying components prior to the application of the sandhi rule. A couple of years ago, I came up with a simple representation for inline annotation.

Say you want to convey the fact that φιλοῦμεν comes from φιλε + ομεν by application of the rule that ε + ο → ου. In the representation I've been using you'd write φιλ|ε>ου<ο|μεν.

This enables you to see the stem and affix easily but also the result of sandhi.

So what A|B>C<D|E means is there is a sandhi rule that B + D → C and that rule has been applied in AB + DE to form ACE.

Using Stump's terminology introduced in a previous post:

A / φιλ is the theme
CE / ουμεν is the distinguisher
AB / φιλε is the stem
DE / ομεν is the affix

It also means that you can search for |B>C<D| to find where that particular sandhi rule has been applied.
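
Pulling the pieces back out of an annotated string is straightforward. A quick sketch in Python (the function name and the returned dictionary keys are just for illustration):

```python
import re

# A|B>C<D|E where none of the parts contains a delimiter character.
SANDHI = re.compile(
    r"^(?P<A>[^|<>]*)\|(?P<B>[^|<>]*)>(?P<C>[^|<>]*)<(?P<D>[^|<>]*)\|(?P<E>[^|<>]*)$"
)


def parse_sandhi(annotated):
    """Split an A|B>C<D|E annotation into its named components."""
    m = SANDHI.match(annotated)
    if m is None:
        raise ValueError(f"not in A|B>C<D|E form: {annotated!r}")
    a, b, c, d, e = m.group("A", "B", "C", "D", "E")
    return {
        "surface": a + c + e,
        "theme": a,
        "distinguisher": c + e,
        "stem": a + b,
        "affix": d + e,
        "rule": (b, d, c),  # B + D → C
    }


print(parse_sandhi("φιλ|ε>ου<ο|μεν"))
```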

Morphological Parts of Speech in Greek

2015-11-05

The parts of speech in a particular language can be drawn up on the basis of syntactic properties, morphological properties, and/or (perhaps most problematically) semantic properties.

What if we just want to classify lexemes in the MorphGNT based on what morphosyntactic and morphosemantic features they have?

Minimally, we might get something like this:

case	person	aspect
−	−	−	conjunctions, adverbs, interjections, prepositions, particles, indeclinable nouns and adjectives
+	−	−	nouns, adjectives, pronouns, articles
−	−	+	infinitives
+	−	+	participles
−	+	+	finite verbs
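
Expressed as a lookup, the table above amounts to something like the following sketch. The feature names and the idea of passing a parse as a dictionary are assumptions for illustration; MorphGNT itself stores these values in a positional parse code.

```python
# (has case, has person, has aspect) -> class, following the table above
CLASSES = {
    (False, False, False): "uninflected",
    (True, False, False): "nouns, adjectives, pronouns, articles",
    (False, False, True): "infinitives",
    (True, False, True): "participles",
    (False, True, True): "finite verbs",
}


def classify(features):
    """Classify a word by which of case / person / aspect it is marked for."""
    key = tuple(bool(features.get(name)) for name in ("case", "person", "aspect"))
    return CLASSES.get(key, "unattested combination")


print(classify({"case": "N", "number": "S", "gender": "M"}))
print(classify({"person": "3", "aspect": "A", "number": "S"}))
```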

We could consider voice, but it co-occurs with aspect, so its value is predictable.

Mood only appears in finite verbs, which means it's also predictable (arguably, co-occurrent with person but see below).

Number is predictable as it co-occurs with case or person.

As things stand above, gender is also predictable (it co-occurs with case).

However, let's consider the distinction between the 1st/2nd person pronouns on the one hand and the proforms on the other.

(There are strong arguments beyond just morphology for distinguishing the (1st/2nd person) personal pronouns and proforms. See Bhat's book Pronouns for cross-linguistic arguments for the distinction.)

The 1st/2nd person pronouns, unlike the proforms, don't inflect for gender. So let's add gender to the mix:

case	person	gender	aspect
−	−	−	−
+	?	−	−
+	−	+	−
−	−	−	+
+	−	+	+
−	+	−	+

The ? under person for the personal pronouns is because they don't really inflect for person. Person is lexical in the personal pronouns.

Interestingly, though, if we do give it a + then we don't need gender to distinguish the category.

You may wonder what about degree. I'm currently of the inclination that degree is better modeled derivationally rather than inflectionally, although that's worthy of a separate post.

Mean Log Frequency of Forms

2015-11-04

In a previous post, we looked at which chapters had the highest mean log frequency of lexemes. The code provided there was applicable to other items, though, so let's now take a look at mean log frequency of forms.

The code change is a simple change to one line.

The top 10 are:

6277 2304 449
6373 2305 429
6500 2302 585
6558 0403 657
6562 2303 467
6596 1001 401
6600 0408 905
6617 2301 207
6640 0702 287
6646 2720 406

In other words:

1 John 4 (also 1st for lexemes)
1 John 5 (also 2nd for lexemes)
1 John 2 (8th for lexemes)
John 3 (9th for lexemes)
1 John 3 (7th for lexemes)
Ephesians 1 (11th for lexemes)
John 8 (6th for lexemes)
1 John 1 (4th for lexemes)
1 Corinthians 2 (32nd for lexemes)
Revelation 20 (14th for lexemes)

Generally form frequency will track pretty closely with lexeme frequency because a form being common makes the lexeme common. This makes 1 Corinthians 2 interesting.

Frequent words and forms obviously don't necessarily mean shallow syntax, though. 1 John 4, 5 and 2 are respectively the 36th, 67th and 38th by mean dependency depth. There are no chapters that are in the top ten of both mean log form frequency AND mean dependency depth.

So we now have mean log frequencies for lexemes and forms as well as mean dependency depth. In future posts, I'll add parse codes and the actual dependency path to the mix and then we can look at combining all five metrics. I'll also look at paragraphs rather than chapters as targets.

Distinguishers in Morphology

2015-11-03

A few years ago, I was introduced by Greg Stump to the notion of distinguishers in morphological description. The analysis of inflected forms in terms of theme + distinguisher is a very helpful concept and one that I make extensive use of in my ongoing work on New Testament Greek morphology.

Take a word like φιλοῦμεν. The underlying stem is φιλε and the suffix is ομεν. The sandhi rule ε + ο → ου has been applied.

So in the surface form of the word, the φιλ is part but not all of the stem. It's the part that will likely (unless there is suppletion) be common with other cells in the paradigm. Similarly οῦμεν is not the suffix but it is the part that is indicating "first person plural" (as well as indicating that the stem likely ends in ε or ο).

Stump calls φιλ the theme and οῦμεν the distinguisher. The theme is what the cells in a paradigm have in common, the distinguisher is what distinguishes them from one another.

SPOILER ALERT: I'm working on a full theme/distinguisher and stem/suffix analysis of every inflected form in the Greek New Testament as part of my Morphological Lexicon of New Testament Greek.

Atom Editor 1.1 Fixes Polytonic Greek Bug

2015-11-02

Release 1.1 of GitHub's Atom Editor fixes a problem I had with using it for polytonic Greek.

I was an early adopter of Atom Editor despite some initial rough edges. I now use it for all my development, including Greek-related stuff talked about on this blog—not just code but data files as well.

Most of the rough edges got sorted out early on and certainly before the 1.0 release but there was one problem, highly relevant to this blog, that persisted.

Basically, Atom was miscalculating the width of characters formed from Unicode combining characters which made it quite difficult to work with text files containing polytonic Greek.

You can see the problem in this screenshot:

Notice that the existence of diacritics on the alpha at the end of some of the lines actually changes the width of preceding characters, even though a fixed-width font is being used. As well as just looking weird, it made files difficult to use as the cursor position didn't correspond visually to where typing would occur.

I filed a bug report back in March and was disappointed a fix didn't make the Atom 1.0 release. But once I found out what was involved in fixing it (it didn't just affect polytonic Greek but a lot of non-ASCII use cases) I was impressed. If you want the raw details, see here and here.

A couple of weeks ago Atom 1.1 came out and it includes all that work that (amongst other things) fixes the bug I filed.

Now it works perfectly:

Renaming Non-Indicative Tense-Forms

2015-11-01

I think it's confusing that we name the non-indicative tense-forms with the same terms as indicative tense-forms. For example “present indicative” and “present infinitive”. The word “present” doesn't mean the same thing in both cases.

When there is a past/non-past alternation in Greek (e.g. imperfect/present or pluperfect/perfect), only one of the pair is possible in non-indicatives.

The reason for this is simple: only the indicative mood makes a past/non-past distinction. In other cases, only aspect is conveyed.

But this is undermined when we then go and choose for the non-indicative, "aspect only" forms the same terms that, in the indicative mood, are specifically conveying a non-past tense.

It would be far better to use a term with the non-indicatives that conveys only the aspect.

"Imperfective" and "perfective" are obvious choices instead of "present" and "aorist" respectively (although it's not clear what we'd use for the perfect or future non-indicatives).

The same issue arises in discussion of "systems" and "stems". Rather than the "present system" or the "present stem" should we instead talk about the "imperfective system" and "imperfective stem" in Greek?

If we use "perfective stem" rather than "aorist stem" we avoid the asymmetry of talking about an augmented/un-augmented aorist stem but not (or at least not without some awkwardness) an augmented/un-augmented present stem. (One might be forgiven for thinking Greek involves a morphological process of removing an augment if some descriptions of the aorist/perfective system are to be believed.)

Of course even in the above, there is the confusing use of terminology for what to call the bundle of aspect and tense.

Sometimes the bundles themselves are called "tenses" and the tense axis (as opposed to aspect) is referred to as "time".

Sometimes the bundles are called "tense-forms", which I think is better but still slightly confusing as that should really be "tense-aspect-forms" or, perhaps, "aspect-tense-forms".

As an aside: the use of "form" is interesting as it places the bundling squarely in the realm of form, not meaning. In other words, even though the realization involves cumulative exponence (to adopt the terminology of Matthews), the meaning is just the union of the tense and aspect.

All of this plays into morphological tagging as well. I've suggested for the rethink of the parse codes in MorphGNT 7 that tense and aspect be split into two features.

An Experimental REST API to MorphGNT

2015-10-31

Back in July, I thought I'd prototype a REST API for MorphGNT with resources for books, paragraphs, sentences, verses and words.

The prototype is available on http://api.morphgnt.org/ and the underlying code here.

The API exposes in JSON not only the normal MorphGNT data but also the paragraphs from the SBLGNT proper, the sentence divisions from the GBI syntax analysis AND the dependency relationships discussed in Converting the GBI Syntax Trees to a Dependency Analysis. So for now, at least, it's the only place you can get all that info.

The prototype is currently served up using Django hitting a PostgreSQL database but it would be possible to just generate the roughly 150,000 JSON files once and serve them up from a CDN.

There's only one thing using the API that I know of at the moment and that's the lab on this site. It doesn't make use of a lot of the rich word-level information but it does demo how you can navigate through paragraphs of the GNT purely using the links in a book's first_paragraph or a paragraph's prev and next fields.

Note that the /v0/ prefix is used in URLs because there is no commitment to keep this API. It is subject to rapid change at the moment.

The URI patterns are:

/v0/root.json
/v0/book/{osis_id}.json
/v0/paragraph/{paragraph_id}.json
/v0/sentence/{sentence_id}.json
/v0/verse/{verse_id}.json
/v0/word/{word_id}.json

A word (currently) looks something like this:

{
    "@id": "/v0/word/64001001005.json",
    "@type": "word",
    "verse_id": "/v0/verse/640101.json",
    "sentence_id": "/v0/sentence/640001.json",
    "paragraph_id": "/v0/paragraph/64001.json",
    "crit_text": "λόγος,",
    "text": "λόγος,",
    "word": "λόγος",
    "norm": "λόγος",
    "lemma": "λόγος",
    "pos": "N",
    "case": "N",
    "number": "S",
    "gender": "M",
    "dep_type": "S",
    "head": "/v0/word/64001001002.json"
}

A verse (currently) looks something like this:

{
    "@id": "/v0/verse/640101.json",
    "@type": "verse",
    "prev": null,
    "next": "/v0/verse/640102.json",
    "book": "/v0/book/John.json",
    "words": [...]
}

where words is a list of objects like the word above.

A paragraph and sentence are very similar to a verse (with an @id, @type, prev, next, book and words list).

A book (currently) looks something like this:

{
    "@id": "/v0/book/1Cor.json",
    "@type": "book",
    "name": "1 Corinthians",
    "root": "/v0/root.json",
    "first_paragraph": "/v0/paragraph/67001.json",
    "first_verse": "/v0/verse/670101.json",
    "first_sentence": "/v0/sentence/670001.json"
}
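As a rough sketch of the paragraph navigation described above (following a book's first_paragraph and then each paragraph's next), something like the following would work. The names here are assumptions for illustration only: `fetch` is a hypothetical callable you supply that maps an API path to the decoded JSON, `FAKE_API` is made-up data rather than real responses, and I'm assuming next is null on the last paragraph.

```python
def iter_paragraphs(fetch, book_path):
    """Yield each paragraph of a book by following first_paragraph / next links.

    `fetch` is a hypothetical helper: any callable mapping an API path
    such as "/v0/book/John.json" to the decoded JSON (a dict).
    """
    book = fetch(book_path)
    path = book["first_paragraph"]
    while path is not None:
        paragraph = fetch(path)
        # Assumption: stop if the links ever lead into a different book.
        if paragraph.get("book") != book["@id"]:
            break
        yield paragraph
        path = paragraph.get("next")


# Made-up stand-in for the API so the sketch runs offline (not real responses).
FAKE_API = {
    "/v0/book/Demo.json": {
        "@id": "/v0/book/Demo.json",
        "first_paragraph": "/v0/paragraph/1.json",
    },
    "/v0/paragraph/1.json": {
        "@id": "/v0/paragraph/1.json",
        "book": "/v0/book/Demo.json",
        "prev": None,
        "next": "/v0/paragraph/2.json",
    },
    "/v0/paragraph/2.json": {
        "@id": "/v0/paragraph/2.json",
        "book": "/v0/book/Demo.json",
        "prev": "/v0/paragraph/1.json",
        "next": None,
    },
}

paragraph_ids = [
    p["@id"] for p in iter_paragraphs(FAKE_API.__getitem__, "/v0/book/Demo.json")
]
```

Against the live prototype, `fetch` could be a small wrapper around `urllib.request.urlopen` and `json.load`, but given the /v0/ caveat above that should be treated as subject to change.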

Feedback is greatly appreciated to make this more useful. I'd particularly like to work with some front-end developers to do some more complex demos built on the API.

The Core Vocabulary of New Testament Greek

2015-10-30

In a 2008 paper, Wilfred Major constructs what he calls the 50% and 80% vocab lists for Classical Greek. That is, the lemmata that account for 50% and 80% respectively of tokens in the Classical Greek corpus. In this post I provide the code for the equivalent for the Greek New Testament and talk about some of the results.

Major's paper is It’s Not the Size, It’s the Frequency: The Value of Using a Core Vocabulary in Beginning and Intermediate Greek and as well as listing the 65 words in the "50% List" he lists the roughly 1,100 words in the "80% List" complete with glosses in both cases.

Major also discusses other issues near and dear to this blog such as the relevance of form frequency as well as lemma frequency. I'll respond to him on some of these topics in later blog posts.

Now, for many years I've talked about the limitations of a purely frequency-based approach to vocab ordering but that doesn't mean producing such lists is useless, just that there are things we can do to improve on that approach. But I still thought it would be interesting to produce GNT 50% and 80% lists.

The code is available here.
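For a sense of the calculation without following the link, here is a minimal sketch of the idea (not the linked script itself). The function name `coverage_list` and the toy token list are my own illustration; the input is assumed to be one lemma per token in the corpus.

```python
from collections import Counter


def coverage_list(items, threshold):
    """Return the most frequent items that together cover `threshold` of tokens.

    `items` has one entry per token (e.g. the lemma of each word);
    `threshold` is a fraction such as 0.5 or 0.8.
    """
    counts = Counter(items)
    total = sum(counts.values())
    chosen, covered = [], 0
    for item, count in counts.most_common():
        if covered >= threshold * total:
            break
        chosen.append(item)
        covered += count
    return chosen


# Toy data only: ten tokens, four distinct lemmata.
toy_lemmas = ["ὁ"] * 5 + ["καί"] * 3 + ["λέγω"] + ["θεός"]
fifty = coverage_list(toy_lemmas, 0.5)
eighty = coverage_list(toy_lemmas, 0.8)
```

Passing surface forms instead of lemmata gives the corresponding form-based list.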

The 50% list consists of just 27 lemmata. The only verbs are γίνομαι, εἰμί, ἔχω, and λέγω. The only nouns are θεός, κύριος, and Ἰησοῦς.

The 80% list consists of 317 lemmata.

As expected, this is considerably smaller than Major's Classical Greek lists which are based on a considerably larger corpus.

It's easy to tweak the code to look at forms rather than lemmata. The 50% forms list for the GNT consists of 97 forms from 52 lemmata.

Interestingly, those 97 forms consist of 16 forms of the article, 15 forms of the (1st/2nd person) personal pronouns, and 6 forms of αὐτός. This suggests that even without arguments on morphological grounds, it's worth learning the full paradigms for the article, the personal pronouns and αὐτός really early on.

Unsurprisingly, λέγω gets a decent showing with 4 forms: εἶπεν, λέγει, λέγω and λέγων. I've long thought it's worth learning those right away without needing to introduce full paradigms.

There's a lot more that could be explored even with this frequency-based approach. And lots more to say based on the other things Major talks about in his paper.

Finally, it should be stressed that very few full verses of the GNT would be readable with just the 80% list and probably none with the 50% list. I may do another post later on to confirm that.

UPDATE: Now see Actual Core Vocab Lists for Greek New Testament

Mean Dependency Depth

2015-10-29

With dependency paths calculated for the Greek New Testament, we can use mean dependency depth as a proxy for syntactic complexity.

In Mean Log Frequency of Lexemes I mentioned that, as well as mean log word frequency, reading comprehension measures such as the Lexile® framework use average sentence length. Now that we have Dependency Paths calculated, we can explore potentially more useful proxies for syntactic complexity.

As an initial experiment, we'll simply take the mean dependency depth of each target where our targets are chapters and by "dependency depth" I simply mean the number of labels in the dependency path. In other words np-O-CL-CL will count as 4 and we'll just average across all the words in each chapter.
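A minimal sketch of that calculation, under a couple of assumptions: the input is (word id, dependency path) pairs, and the chapter can be read off the first five digits of the word id (two for the book, three for the chapter), which is how ids like 64003016001 appear to be structured. The function name is hypothetical.

```python
from collections import defaultdict


def mean_depth_by_chapter(rows):
    """rows: iterable of (word_id, dependency_path) pairs."""
    totals = defaultdict(lambda: [0, 0])  # chapter -> [depth sum, word count]
    for word_id, path in rows:
        chapter = word_id[:5]  # assumption: 2-digit book + 3-digit chapter
        depth = len(path.split("-"))  # number of labels in the path
        totals[chapter][0] += depth
        totals[chapter][1] += 1
    return {chapter: s / n for chapter, (s, n) in totals.items()}


sample = [
    ("64003016011", "det-np-O-CL-CL"),
    ("64003016012", "np-O-CL-CL"),
    ("64003016013", "CL-CL"),
    ("64004001001", "CL"),
]
depths = mean_depth_by_chapter(sample)
```

Slicing fewer digits from the id (just the book) would give book-level averages instead.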

An initial run reveals one interesting problem. Luke 3 is given a considerably higher score than anything else because of the analysis of the genealogy (A the son of B the son of C...and so on, leads to very long paths). Reading that genealogy is arguably not that taxing syntactically which highlights one flaw in the dependency depth approach (or, perhaps the analysis chosen for the genealogy).

This aside, let's look at what this measure identifies as easiest chapters:

Interestingly, the top 10 chapters for lowest mean dependency depth are all in Romans, 1 Corinthians and Galatians.

If we average, instead, across entire books, the top ten are:

3 John
1 Corinthians
1 John
James
Galatians
John
Romans
Matthew
Mark
2 John

which is perhaps a little less surprising.

The hardest chapters, Luke 3 aside, are the first chapters of Ephesians, 2 Timothy and Colossians, which probably isn't much of a surprise either. The hardest books overall are Ephesians and Colossians.

The code is available here (tweak line 13 to get book-level stats).

Note, this all may be quite sensitive to the choice of analysis. It would be an interesting exercise to see, for example, what the PROIEL dependency analysis yields.

In future posts, we'll try a few more measures and then try to bring them together to see how chapters (or books, or authors) compare across multiple criteria.

Dependency Paths

2015-10-28

For numerous corpus linguistics applications, it's useful to have a word-level indication of syntax. A presentation by Vanessa and Robert Gorman gave me the idea of using dependency paths for this purpose so I've now calculated them for the GNT based on the GBI syntax trees.

The presentation by the Gormans was entitled Greek Historiography Through Dependency Syntax Treebanking and they refer to the dependency paths as "syntactic words" or "swords" for short.

While their particular interest is authorship, the Gormans make an excellent point about the value of these dependency paths:

The chief advantage of recasting dependencies as syntax words is that they are immediately valuable: with trivial modifications such texts can be put into standard text-processing software to produce type-token ratios, word frequency histograms, etc., providing detailed syntactic information about individual authors.

I've previously written about Converting the GBI Syntax Trees to a Dependency Analysis so it's just a small step to producing dependency paths.

So if we take the output for the first part of John 3.16 from this dependency conversion:

64003016001 Οὕτως 64003016003 ADV
64003016002 γὰρ 64003016003 conj
64003016003 ἠγάπησεν None CL
64003016004 ὁ 64003016005 det
64003016005 θεὸς 64003016003 S
64003016006 τὸν 64003016007 det
64003016007 κόσμον 64003016003 O
64003016008 ὥστε 64003016013 conj
64003016009 τὸν 64003016010 det
64003016010 υἱὸν 64003016013 O
64003016011 τὸν 64003016012 det
64003016012 μονογενῆ 64003016010 np
64003016013 ἔδωκεν, 64003016003 CL

we can easily build up the dependency paths / swords:

64003016001 Οὕτως ADV-CL
64003016002 γὰρ conj-CL
64003016003 ἠγάπησεν CL
64003016004 ὁ det-S-CL
64003016005 θεὸς S-CL
64003016006 τὸν det-O-CL
64003016007 κόσμον O-CL
64003016008 ὥστε conj-CL-CL
64003016009 τὸν det-O-CL-CL
64003016010 υἱὸν O-CL-CL
64003016011 τὸν det-np-O-CL-CL
64003016012 μονογενῆ np-O-CL-CL
64003016013 ἔδωκεν, CL-CL

So it will tell you that μονογενῆ is qualifying the object of a subordinate clause (at least according to the GBI analysis). We've thrown away the noun it's modifying (υἱὸν) and the verb in the subordinate clause it's the object of (ἔδωκεν) and the verb in the main clause (ἠγάπησεν), but np-O-CL-CL is a decent label for its syntactic role as qualifying the object of a subordinate clause.
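The path-building step can be sketched as below. This is an illustration rather than the script I actually ran: the function name and the (word id, head id, label) triple format are assumptions, and it assumes the heads form a proper tree with no cycles.

```python
def dependency_paths(rows):
    """rows: (word_id, head_id or None, label) triples -> {word_id: path}."""
    heads = {word_id: head for word_id, head, _ in rows}
    labels = {word_id: label for word_id, _, label in rows}
    paths = {}
    for word_id in heads:
        parts = []
        current = word_id
        # Collect labels while walking from the word up to the root.
        while current is not None:
            parts.append(labels[current])
            current = heads[current]
        paths[word_id] = "-".join(parts)
    return paths


rows = [
    ("64003016003", None, "CL"),
    ("64003016010", "64003016013", "O"),
    ("64003016011", "64003016012", "det"),
    ("64003016012", "64003016010", "np"),
    ("64003016013", "64003016003", "CL"),
]
paths = dependency_paths(rows)
```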

The code I used is available here.

Mean Log Frequency of Lexemes

2015-10-27

One component of many readability measures on texts is the mean log word frequency. Here I do a basic calculation across chapters in the Greek New Testament (with code provided).

Usually, the mean log word frequency is used in conjunction with something like the log mean sentence length (for example in the Lexile® framework). The latter is used as a proxy for syntactic complexity but, having a syntactic analysis, I think we can do better and I'll explore that in a future post.

For now, though, I wanted to get a per-chapter measure just based on mean log frequency of lexemes.

The code is available here. It's easy to adjust the targets (by default chapters, specified on line 14) and the items (by default lexemes, specified on line 15).

The result of running the script is something like this:

6153 0101 436
5757 0102 457
5471 0103 331
5487 0104 428
5437 0105 821
5532 0106 648

where the first column is -1000 times the mean log frequency (so the higher, the harder to read), the second column is the book and chapter number and the third column is just the number of word tokens in that chapter.
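To make the first column concrete, here is a minimal sketch of one way to compute it (not the linked script). It assumes "frequency" means relative frequency across the whole corpus and that the log is the natural log; the function name and the (target, item) pair format are also assumptions.

```python
import math
from collections import Counter, defaultdict


def mean_log_frequency_scores(pairs):
    """pairs: list of (target, item), e.g. (chapter, lexeme), one per token.

    Returns {target: (score, token_count)} where score is
    -1000 times the mean log relative frequency of the target's items.
    """
    counts = Counter(item for _, item in pairs)
    total = sum(counts.values())
    by_target = defaultdict(list)
    for target, item in pairs:
        by_target[target].append(math.log(counts[item] / total))
    return {
        target: (round(-1000 * sum(logs) / len(logs)), len(logs))
        for target, logs in by_target.items()
    }


toy_pairs = [("0101", "a"), ("0101", "a"), ("0102", "a"), ("0102", "b")]
scores = mean_log_frequency_scores(toy_pairs)
```

In the toy data, chapter 0102 contains the rarer item "b" and so gets the higher (harder) score.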

If we sort this output, we should get a list of the easiest chapters to read (at least by the measure of mean log lexeme frequency):

4704 2304 449
4746 2305 429
4926 0417 498
4949 2301 207
4973 0414 577
5025 0408 905
5036 2303 467
5044 2302 585
5080 0403 657
5090 2710 291

It is perhaps not surprising that the easiest chapters are from 1 John and John's gospel (with Rev 10 coming in at number 10).

It will be interesting to see if we get similar results once we factor in some measure of syntactic complexity.

Incidentally, the most difficult chapter to read based on mean log lexeme frequency is 2 Peter 2 although 1 Timothy and Titus feature quite a bit in the most difficult ten chapters as well.

Updated Vocabulary Coverage Statistics

2015-10-26

In various mailing list posts, blog posts and talks, I've shown vocabulary coverage statistics. It's time to update the code to use more recent data and republish the results here.

The vocabulary coverage tables have a number of different parameters:

what are the items being learnt: lexemes or forms or something else?
what are the targets: verses or sentences or something else?
what ordering is being used: item frequency or something else?

and, of course, what text and lemmatization is being used.

Most of my published stats before were based on the UBS3 version of MorphGNT. Here I'm going to use the latest MorphGNT based on the SBLGNT (MorphGNT 6.06) and I'm going to explore not just verses but (in followup posts) clauses and sentences from the GBI Syntax Trees and paragraphs from the SBLGNT.

I also want to start incorporating the information from my morphological lexicon into the item/target modeling and ordering algorithms.

But first let's just update the basic stats.

Verses-Lexemes with Frequency Ordering

A target-item file for verses-lexemes can be achieved with:

awk '{print $1,$7}' sblgnt/*-morphgnt.txt

if we then feed that to vocab-coverage.py we get the following result:

             ANY    50.00%    75.00%    90.00%    95.00%   100.00%
------------------------------------------------------------------
   100    99.91%    91.07%    24.36%     2.13%     0.64%     0.48%
   200    99.92%    96.83%    51.80%     9.75%     3.43%     2.54%
   500    99.97%    99.13%    82.23%    36.57%    17.81%    13.81%
  1000    99.99%    99.71%    93.60%    62.57%    37.28%    29.99%
  2000   100.00%    99.92%    98.41%    84.95%    65.38%    56.43%
  5000   100.00%   100.00%   100.00%    99.51%    96.44%    94.58%
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

What this table is saying is that if you learn, say, the 200 most frequent lexemes, you'll be able to read 95% of the lexemes in 3.43% of verses.
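A single cell of a table like this can be sketched as follows. This is not vocab-coverage.py itself, just an illustration of the idea: the function name is hypothetical, the input is the (target, item) pairs produced above, and I'm assuming coverage within a verse is counted over tokens.

```python
from collections import Counter, defaultdict


def coverage(pairs, item_count, threshold):
    """Fraction of targets in which at least `threshold` of the tokens are
    known after learning the `item_count` most frequent items."""
    counts = Counter(item for _, item in pairs)
    known = {item for item, _ in counts.most_common(item_count)}
    by_target = defaultdict(list)
    for target, item in pairs:
        by_target[target].append(item in known)
    readable = sum(
        1 for flags in by_target.values()
        if sum(flags) >= threshold * len(flags)
    )
    return readable / len(by_target)


toy_pairs = [
    ("v1", "a"), ("v1", "a"), ("v1", "b"),
    ("v2", "a"), ("v2", "c"),
    ("v3", "d"),
]
half = coverage(toy_pairs, 1, 0.5)
full = coverage(toy_pairs, 1, 1.0)
```

The ANY column would correspond to testing `any(flags)` rather than comparing against a threshold.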

Verses-Forms with Frequency Ordering

A target-item file for verses-forms can be achieved with:

awk '{print $1,$6}' sblgnt/*-morphgnt.txt

if we then feed that to vocab-coverage.py but with 10000 added as an item count, we get the following result:

             ANY    50.00%    75.00%    90.00%    95.00%   100.00%
------------------------------------------------------------------
   100    99.82%    57.63%     1.10%     0.04%     0.01%     0.01%
   200    99.86%    78.86%     6.51%     0.34%     0.05%     0.05%
   500    99.91%    92.85%    26.95%     2.23%     0.59%     0.52%
  1000    99.94%    96.95%    51.23%     7.75%     2.31%     1.74%
  2000    99.96%    98.65%    72.52%    21.74%     7.86%     5.80%
  5000    99.97%    99.74%    90.97%    52.13%    28.52%    21.61%
 10000   100.00%    99.94%    98.31%    78.28%    55.19%    45.28%
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

What this table is saying is that if you learn, say, the 500 most frequent forms, you'll be able to read 75% of the forms in 26.95% of verses.

Various talks, including those at BibleTech in 2010 and 2015, explain a ton of caveats around these numbers, but I wanted to at least refresh them (and the code) with the latest data.

Blogging Every Day Between Now and SBL Annual Meeting

2015-10-25

It's exactly four weeks until I'm presenting at the SBL Annual Meeting in Atlanta. As I have a long backlog of posts I've wanted to do for a while, I thought I might try to blog every day between now and my talk on November 22nd.

As well as motivating me to finish up some posts and also get some other ideas down in writing, I also hope the blogging will get people more interested in what I'm going to be talking about at the SBL meeting and lay a foundation for some conversations I hope to have with people while there.

Speaking At The SBL Annual Meeting in Atlanta

2015-07-15

I've just finished up registration for the SBL Annual Meeting. Here's the paper I'll be presenting.

A Morphological Lexicon of New Testament Greek

Morphological analyses such as analytical lexicons have typically involved indicating lemma, part-of-speech, morphosyntactic and morphosemantic information (such as case, number, person, gender, tense, voice, mood and degree). Much progress has been made in recent years making analyses of this sort freely available in digital formats, but the kind of information they contain has not advanced significantly for decades. This paper will provide an overview of the work of the MorphGNT project to develop an electronic Morphological Lexicon of New Testament Greek that adds inflectional classes, roots and stems, stem formation and morphophonological processes, principal parts, and derivational morphology. Beyond serving as a database of linguistic information, the goal of the morphological lexicon is to provide an "executable grammar" so particular grammar points discussed in beginner grammars, intermediate grammars or advanced reference grammars can be tested against a corpus in a way that makes completely transparent where the "rules" are followed and where they fall down. This data also provides useful data for pedagogical tools such as intelligent tutoring systems that typically require better modeling of latent traits in order to determine what a student actually knows and what items best test that knowledge. All data for the Morphological Lexicon of New Testament Greek is available under a Creative Commons license, and all code used for both the generation and verification of the morphological lexicon is open source.

Types of Disagreement in Syntactic Analyses

2015-07-13

As helpful as the GBI Syntax Trees are, I have disagreements with them. Randall and Andi are receptive to feedback but there are very different types of disagreement that can arise in syntactic analysis so I thought I'd start to note down what they are.

Some things aren't disagreements, just corrections. Some are differences of interpretation of the Greek. Some are differences in overall approach.

Here's a first attempt at a more refined categorization of types. I'll call the person/group who did the initial (published) analysis A1 and the person/group who has the change/disagreement A2.

I. correction—A1 actually agrees with A2 but simply made a mistake and can uncontroversially update their analysis accordingly
II. ambiguity—both A1 and A2's analysis is possible in the eyes of the other, but based on other factors, A1 and A2 disagree which analysis to go with. Perhaps this could further be refined into:
- IIA. cases where A1 and A2 each think their own analysis is the more likely; versus
- IIB. cases where A1 and A2 each think their own analysis is the only likely one.
III. terminology/framework—A1 and A2 agree on structure and relationship up to a certain isomorphism but not in the specifics. This could be further split into:
- IIIA. cases where A1 and A2's analyses are structurally identical but just different in labels
- IIIB. cases where A1 and A2's analyses differ in structure even though they are derivable from one another
IV. irreconcilable—A1 and A2 disagree on the way the language actually works and the analyses can't easily be mapped to one another.

I think many of my disagreements with the GBI Trees so far are of type IIIB which means it is likely possible for me to programmatically generate an alternative analysis with my preferred structure. Indeed, converting to a dependency analysis is a simple example of this but even different choices of head within the constituent structure (which is a major source of systemic disagreement) are easy to make.

The great thing about type III in general is that even if you disagree with A1, you can still use the analysis to explore the syntactic phenomenon you want (you just have to map your queries to their labels and their conventions).

I should also note that an important aspect to dealing with this is proper documentation of conventions followed.

With these thoughts down, I'm now interested in other work that has already been done in this area.

Converting the GBI Syntax Trees to a Dependency Analysis

2015-07-02

With one child on each branch identified as the head, a constituent analysis can be converted to a dependency analysis. Fortunately, the GBI syntax trees have an explicit indication of the head, so I went ahead and converted them to a dependency format.

Non-leaf nodes in the GBI syntax trees have a Head attribute which indicates the index of the child considered the head.

So the algorithm is fairly straightforward. For each leaf-node:

walk up the tree until you find a node whose Head attribute is NOT the index of the child we just came from
follow the Head attributes back down the tree until you hit another leaf-node
that second leaf-node is the head of the leaf-node you started on
the "type" of the dependency is the Cat of the second-to-last node you visited walking up in step 1.
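The steps above can be sketched on a toy tree as follows. This is not the gist's code: the `Node` class and `to_dependencies` function are hypothetical stand-ins for the GBI XML, with `cat` and `head` playing the role of the Cat and Head attributes. It also assumes that a leaf reached by heads all the way from the root has no head and takes the root's Cat as its type.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    cat: str
    head: int = 0                      # index of the head child (unused on leaves)
    children: list = field(default_factory=list)
    word: Optional[str] = None         # set only on leaves


def to_dependencies(root):
    """Return (word, head word or None, type) for each leaf, left to right."""
    parents = {}  # id(node) -> (parent node, index within parent)
    leaves = []
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            leaves.append(node)
        for i, child in enumerate(node.children):
            parents[id(child)] = (node, i)
        stack.extend(reversed(node.children))

    deps = []
    for leaf in leaves:
        node = leaf
        # Step 1: walk up while we are the head child of our parent.
        while id(node) in parents:
            parent, index = parents[id(node)]
            if parent.head != index:
                break
            node = parent
        else:
            # Reached the root via heads only (assumption: no head, root's Cat).
            deps.append((leaf.word, None, node.cat))
            continue
        # Step 2: follow Head indices back down to a leaf.
        target = parent
        while target.children:
            target = target.children[target.head]
        deps.append((leaf.word, target.word, node.cat))
    return deps


# Toy tree for "ὁ θεὸς ἠγάπησεν" (structure invented for illustration).
tree = Node("CL", head=1, children=[
    Node("S", head=0, children=[
        Node("np", head=1, children=[
            Node("det", word="ὁ"),
            Node("noun", word="θεὸς"),
        ]),
    ]),
    Node("V", head=0, children=[Node("verb", word="ἠγάπησεν")]),
])
deps = to_dependencies(tree)
```

Overriding the choice of head in certain contexts then amounts to changing `head` on the relevant nodes before converting.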

The only catch is the source data this script uses omits a Head altogether in three types of cases. The original GBI analysis treated the Head as being "1" in these cases so I special case that in the code. I don't necessarily agree with the choice but it's easy to change (see below).

I've put the code in a gist: https://gist.github.com/jtauber/c02d0928811b7ed21c9a

The result (on the first part of John 3.16) is:

64003016001 Οὕτως 64003016003 ADV
64003016002 γὰρ 64003016003 conj
64003016003 ἠγάπησεν None CL
64003016004 ὁ 64003016005 det
64003016005 θεὸς 64003016003 S
64003016006 τὸν 64003016007 det
64003016007 κόσμον 64003016003 O
64003016008 ὥστε 64003016013 conj
64003016009 τὸν 64003016010 det
64003016010 υἱὸν 64003016013 O
64003016011 τὸν 64003016012 det
64003016012 μονογενῆ 64003016010 np
64003016013 ἔδωκεν, 64003016003 CL

The dependency relationship color highlighting experiment on this site shows a possible way of conveying this dependency information in a text (in this case, 2 John).

As mentioned, I don't necessarily always agree with the GBI choice of head; however, it's fairly straightforward to alter the code to override the choice of head in certain contexts.

For example, if you consider the complementizer the head, you can just add code that takes Head="0" where Rule="that-VP" and so on. Similarly with prepositions, determiners, etc.

Finally note that it's not quite possible to reconstruct the original tree from the dependency data because the algorithm effectively eliminates information on some intermediate nodes. Some may consider this an advantage.

pyuca supports Python 2 again

2015-05-13

Thanks to Chris Beaven, Paul McLanahan and Michal Čihař, Python 2 support is back in pyuca 1.1.

There was a small amount of complaining about me dropping Python 2 support for the big release of pyuca last year.

I didn't have the time or motivation to bring it back, though.

Fortunately, other people did and thanks to Chris, Paul and Michal, pyuca 1.1 supports Python 2 and 3.

The repo is at https://github.com/jtauber/pyuca and you can get pyuca from PyPI with pip install pyuca.

My BibleTech 2015 Talk

2015-05-06

BibleTech talks were not recorded but I turned on my iPhone's Voice Memo recording and later sync'd the audio with my slides to make this video.

The abstract:

In an update on the ongoing work he has spoken about in previous Bible Tech conferences, James will talk about recent developments in open source learning software and the MorphGNT linguistic database, and how the two work together to provide tools for improving the learning of New Testament Greek.

Version 1.0 of pyuca released

2014-02-01

pyuca is my pure Python implementation of the Unicode Collation Algorithm (for sorting, amongst other things, Greek).

I've just released version 1.0 for Python 3.3 and above, and it passes 100% of the UCA conformance tests.

I implemented enough back in 2006 to be able to sort Ancient Greek and released it on PyPI in 2012.

Since then, with input from others, I've made various improvements but in October last year I decided to start testing against the comprehensive UCA conformance tests provided by the Unicode Consortium. The last couple of days I've had an intense sprint where I got 100% of the tests passing and also 100% code coverage.

I also made the decision to ditch Python 2 support as part of my encouragement to get people to move to Python 3.

The repo is available at https://github.com/jtauber/pyuca/ but you can most easily get pyuca with

pip install pyuca

and then use it as follows:

from pyuca import Collator
c = Collator("allkeys.txt")

sorted_words = sorted(words, key=c.sort_key)

UPDATE (2015-05-13): Python 2 support is back in 1.1

Rebasing MorphGNT off SBLGNT

2011-01-18

The last three months, I've been working on rebasing the MorphGNT database off the SBLGNT text rather than the UBS3.

While I have had permission to work with the CCAT database for over a decade, the fact the UBS3 text can be extracted from it has always been problematic. The existence of the SBLGNT solves the problem of having a critical text with clear licensing and so, in October 2010, I started the process of moving the MorphGNT analysis to the SBLGNT text.

This task is mostly done and the work-in-progress is available on GitHub at https://github.com/morphgnt/sblgnt.

It was a three-step process, done one book at a time.

A Python script was used to do a first-pass alignment. The script allowed for differences in punctuation, accentuation, capitalization and movable-nu.
Any differences were then manually inspected and corrected. In 90% of cases it was a simple re-ordering of words but in the other 10%, a fresh analysis had to be made. These analyses were then checked against various sources such as BDAG, Perseus and the Lexham Reverse Interlinear.
Finally, I wrote another Python script that checked various heuristics.
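To illustrate the kind of tolerance the first-pass alignment needs, here is a rough sketch of a comparison key that ignores punctuation, accentuation, capitalization and movable-nu. It is not the actual alignment script: the function name is hypothetical, and the movable-nu handling is deliberately crude (any final -εν or -ιν on a word longer than two letters is treated as movable), so it is only suitable for deciding whether two tokens are worth comparing further.

```python
import unicodedata


def alignment_key(token):
    """Crude comparison key ignoring punctuation, accents, case and movable nu."""
    decomposed = unicodedata.normalize("NFD", token.casefold())
    letters = "".join(
        ch for ch in decomposed
        # Drop combining marks (accents, breathings) and punctuation.
        if not unicodedata.category(ch).startswith(("M", "P"))
    )
    # Assumption: treat any final -εν / -ιν on a longer word as movable nu.
    if len(letters) > 2 and letters.endswith(("εν", "ιν")):
        letters = letters[:-1]
    return letters


same_nu = alignment_key("ἔδωκεν,") == alignment_key("ἔδωκε")
same_accent = alignment_key("θεὸς") == alignment_key("θεός")
same_case = alignment_key("Λόγος") == alignment_key("λόγος,")
```

Tokens whose keys match can be aligned automatically; anything else is left for manual inspection in the second step.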

I'm in the process of making a batch of corrections based on the third step and then I'll formally release what will be called MorphGNT 6.0 (although possibly as a beta such as 6.0b1).

The next step (which I've started in parallel) will merge in the Robinson analysis and parse codes on the road to a completely new set of parse codes for MorphGNT 7.0.

originally published on morphgnt.org

Inline Replacement for John 2

2010-04-25

A post to the graded-reader mailing list from April 25, 2010.

This afternoon and evening, I updated and open sourced my code for doing inline replacement and did a rough literal translation of John 2, marked up with the PROIEL clause (and in some cases phrase) boundaries.

I then just ran a next-best ordering based on forms only, with the targets that are PRED or multi-word SUB (adding the latter works quite well).

I've included the complete results below. The main outstanding issue is it doesn't yet properly handle discontinuous clauses (the parenthetical in 2.9) or clauses that span verses (2.9,2.10; 2.14,2.15,2.16; 2.24,2.25).

All the code (and my annotated translation) is available on GitHub.

James

[343427] John 2.2
ὁ Ἰησοῦς and his disciples were invited to the wedding

[343464] John 2.4
ὁ Ἰησοῦς says to her , what (concern is that) to me and you , woman ? My hour is not yet come

[343517] John 2.7
ὁ Ἰησοῦς says to them : fill the water-jars with water and they filled them up to the top

[343607] John 2.11
This beginning of signs ὁ Ἰησοῦς did in Cana of Galilee and revealed his glory and his disciples believed in him

[343665] John 2.13
and near was the passover of the Jews and ὁ Ἰησοῦς went up to Jerusalem

[343841] John 2.22
so when he was raised from the dead , his disciples remembered that he was saying this and they believed the Scripture and the word which ὁ Ἰησοῦς said

[343430] John 2.2
ὁ Ἰησοῦς and οἱ μαθηταὶ αὐτοῦ were invited to the wedding

[343623] John 2.11
This beginning of signs ὁ Ἰησοῦς did in Cana of Galilee and revealed his glory and οἱ μαθηταὶ αὐτοῦ believed in him

[343642] John 2.12
After this , he and his mother and his brothers and οἱ μαθηταὶ αὐτοῦ went down into Capernaum and there they remained not many days

[343736] John 2.17
οἱ μαθηταὶ αὐτοῦ remembered that it has been written : the zeal for your house will devour me

[343825] John 2.22
so when he was raised from the dead , οἱ μαθηταὶ αὐτοῦ remembered that he was saying this and they believed the Scripture and the word which ὁ Ἰησοῦς said

[343753] John 2.18
so οἱ Ἰουδαῖοι answered and said to him : what sign are you showing us that you do these things ?

[343788] John 2.20
so οἱ Ἰουδαῖοι said : this temple was built in forty-six years and you will raise it in three days ?

[343549] John 2.9,2.10
as ὁ ἀρχιτρίκλινος tasted the water having become wine and didn't know from where it came ( but the servants who drew the water knew ) the head-steward calls the groom and says to him : all men first put out the good wine and when they are drunk , the inferior . you have kept the good wine until now

[343574] John 2.9,2.10
as ὁ ἀρχιτρίκλινος tasted the water having become wine and didn't know from where it came ( but the servants who drew the water knew ) ὁ ἀρχιτρίκλινος calls the groom and says to him : all men first put out the good wine and when they are drunk , the inferior . you have kept the good wine until now

[343514] John 2.7
λέγει αὐτοῖς ὁ Ἰησοῦς : fill the water-jars with water and they filled them up to the top

[343531] John 2.8
καὶ λέγει αὐτοῖς , draw now and carry it to the head-steward and they brought it

[343428] John 2.2
καὶ ὁ Ἰησοῦς καὶ οἱ μαθηταὶ αὐτοῦ were invited to the wedding

[343481] John 2.5
ἡ μήτηρ αὐτοῦ says to the servants : do whatever he tells you to

[343634] John 2.12
After this , he and ἡ μήτηρ αὐτοῦ and his brothers and οἱ μαθηταὶ αὐτοῦ went down into Capernaum and there they remained not many days

[343418] John 2.1
And on the third day , a wedding was happening in Cana of Galilee and ἡ μήτηρ τοῦ Ἰησοῦ was there

[343451] John 2.3
There was no wine because the wedding wine had been finished off . Then ἡ μήτηρ τοῦ Ἰησοῦ says to him : there is no wine

[343576] John 2.9,2.10
as ὁ ἀρχιτρίκλινος tasted the water having become wine and didn't know from where it came ( but the servants who drew the water knew ) ὁ ἀρχιτρίκλινος calls the groom λέγει αὐτῷ all men first put out the good wine and when they are drunk , the inferior . you have kept the good wine until now

[343755] John 2.18
so οἱ Ἰουδαῖοι answered and εἶπαν αὐτῷ : what sign are you showing us that you do these things ?

[343785] John 2.20
εἶπαν οὖν οἱ Ἰουδαῖοι : this temple was built in forty-six years and you will raise it in three days ?

[343750] John 2.18
ἀπεκρίθησαν οὖν οἱ Ἰουδαῖοι and εἶπαν αὐτῷ : what sign are you showing us that you do these things ?

[343754] John 2.18
ἀπεκρίθησαν οὖν οἱ Ἰουδαῖοι καὶ εἶπαν αὐτῷ : what sign are you showing us that you do these things ?

[343770] John 2.19
Jesus answered and εἶπεν αὐτοῖς : destroy this temple and in three days I will raise it

[343767] John 2.19
ἀπεκρίθη Ἰησοῦς and εἶπεν αὐτοῖς : destroy this temple and in three days I will raise it

[343769] John 2.19
ἀπεκρίθη Ἰησοῦς καὶ εἶπεν αὐτοῖς : destroy this temple and in three days I will raise it

[343872] John 2.24,2.25
αὐτὸς Ἰησοῦς did not entrust himself to them because he knows everyone and because he had no need that anyone should testify about man for he knew what was in man

[343638] John 2.12
After this , he and ἡ μήτηρ αὐτοῦ and οἱ ἀδελφοὶ αὐτοῦ and οἱ μαθηταὶ αὐτοῦ went down into Capernaum and there they remained not many days

[343632] John 2.12
After this , αὐτὸς καὶ ἡ μήτηρ αὐτοῦ καὶ οἱ ἀδελφοὶ αὐτοῦ καὶ οἱ μαθηταὶ αὐτοῦ went down into Capernaum and there they remained not many days

[343444] John 2.3
There was no wine because ὁ οἶνος τοῦ γάμου had been finished off . Then ἡ μήτηρ τοῦ Ἰησοῦ says to him : there is no wine

[343442] John 2.3
There was no wine because συνετελέσθη ὁ οἶνος τοῦ γάμου . Then ἡ μήτηρ τοῦ Ἰησοῦ says to him : there is no wine

[343461] John 2.4
λέγει αὐτῇ ὁ Ἰησοῦς , what (concern is that) to me and you , woman ? My hour is not yet come

[343765] John 2.18
ἀπεκρίθησαν οὖν οἱ Ἰουδαῖοι καὶ εἶπαν αὐτῷ : what sign are you showing us that ταῦτα ποιεῖς ?

[343459] John 2.3
There was no wine because συνετελέσθη ὁ οἶνος τοῦ γάμου . Then ἡ μήτηρ τοῦ Ἰησοῦ says to him : οἶνος οὐκ ἔστιν

[343740] John 2.17
οἱ μαθηταὶ αὐτοῦ remembered that γεγραμμένον ἐστίν : the zeal for your house will devour me

[343734] John 2.17
ἐμνήσθησαν οἱ μαθηταὶ αὐτοῦ ὅτι γεγραμμένον ἐστίν : the zeal for your house will devour me

[343476] John 2.4
λέγει αὐτῇ ὁ Ἰησοῦς , what (concern is that) to me and you , woman ? ἡ ὥρα μου is not yet come

[343543] John 2.8
καὶ λέγει αὐτοῖς , draw now and carry it to the head-steward οἱ δὲ ἤνεγκαν

[343829] John 2.22
so when he was raised from the dead , οἱ μαθηταὶ αὐτοῦ remembered that τοῦτο ἔλεγεν and they believed the Scripture and the word which ὁ Ἰησοῦς said

[343479] John 2.5
λέγει ἡ μήτηρ αὐτοῦ τοῖς διακόνοις : do whatever he tells you to

[343439] John 2.3
καὶ οἶνον οὐκ εἶχον ὅτι συνετελέσθη ὁ οἶνος τοῦ γάμου . Then ἡ μήτηρ τοῦ Ἰησοῦ says to him : οἶνος οὐκ ἔστιν

[343416] John 2.1
And on the third day , a wedding was happening in Cana of Galilee and ἦν ἡ μήτηρ τοῦ Ἰησοῦ ἐκεῖ

[343580] John 2.10
λέγει αὐτῷ πᾶς ἄνθρωπος first put out the good wine and when they are drunk , the inferior . you have kept the good wine until now

[343534] John 2.8
καὶ λέγει αὐτοῖς , ἀντλήσατε νῦν and carry it to the head-steward οἱ δὲ ἤνεγκαν

[343537] John 2.8
καὶ λέγει αὐτοῖς , ἀντλήσατε νῦν and φέρετε τῷ ἀρχιτρικλίνῳ οἱ δὲ ἤνεγκαν

[343536] John 2.8
καὶ λέγει αὐτοῖς , ἀντλήσατε νῦν καὶ φέρετε τῷ ἀρχιτρικλίνῳ οἱ δὲ ἤνεγκαν

[343619] John 2.11
This beginning of signs ὁ Ἰησοῦς did in Cana of Galilee and revealed his glory and ἐπίστευσαν εἰς αὐτὸν οἱ μαθηταὶ αὐτοῦ

[343557] John 2.9,2.10
as ὁ ἀρχιτρίκλινος tasted the water having become wine and οὐκ ᾔδει πόθεν ἐστίν ( but the servants who drew the water knew ) ὁ ἀρχιτρίκλινος calls the groom λέγει αὐτῷ πᾶς ἄνθρωπος first put out the good wine and when they are drunk , the inferior . you have kept the good wine until now

[343547] John 2.9,2.10
as ἐγεύσατο ὁ ἀρχιτρίκλινος τὸ ὕδωρ οἶνον γεγενημένον and οὐκ ᾔδει πόθεν ἐστίν ( but the servants who drew the water knew ) ὁ ἀρχιτρίκλινος calls the groom λέγει αὐτῷ πᾶς ἄνθρωπος first put out the good wine and when they are drunk , the inferior . you have kept the good wine until now

[343555] John 2.9,2.10
as ἐγεύσατο ὁ ἀρχιτρίκλινος τὸ ὕδωρ οἶνον γεγενημένον καὶ οὐκ ᾔδει πόθεν ἐστίν ( but the servants who drew the water knew ) ὁ ἀρχιτρίκλινος calls the groom λέγει αὐτῷ πᾶς ἄνθρωπος first put out the good wine and when they are drunk , the inferior . you have kept the good wine until now

[343796] John 2.20
εἶπαν οὖν οἱ Ἰουδαῖοι : ὁ ναὸς οὗτος was built in forty-six years and you will raise it in three days ?

[343661] John 2.13
and near was the passover of the Jews and ἀνέβη εἰς Ἱεροσόλυμα ὁ Ἰησοῦς

[343656] John 2.13
and near was τὸ πάσχα τῶν Ἰουδαίων and ἀνέβη εἰς Ἱεροσόλυμα ὁ Ἰησοῦς

[343654] John 2.13
Καὶ ἐγγὺς ἦν τὸ πάσχα τῶν Ἰουδαίων and ἀνέβη εἰς Ἱεροσόλυμα ὁ Ἰησοῦς

[343660] John 2.13
Καὶ ἐγγὺς ἦν τὸ πάσχα τῶν Ἰουδαίων καὶ ἀνέβη εἰς Ἱεροσόλυμα ὁ Ἰησοῦς

[343718] John 2.14,2.15,2.16
he found, sitting in the temple , the ones selling oxen and sheep and doves , and the coin-dealers and, having made a whip out of ropes , he threw out of the temple all the sheep and the oxen and he threw out the coins of the money-changers and he overturned the tables and τοῖς τὰς περιστερὰς πωλοῦσιν εἶπεν take these things from here . don't make my father's house a market-place

[343711] John 2.14,2.15,2.16
he found, sitting in the temple , the ones selling oxen and sheep and doves , and the coin-dealers and, having made a whip out of ropes , he threw out of the temple all the sheep and the oxen and he threw out the coins of the money-changers and τὰς τραπέζας ἀνέστρεψεν and τοῖς τὰς περιστερὰς πωλοῦσιν εἶπεν take these things from here . don't make my father's house a market-place

[343423] John 2.2
ἐκλήθη δὲ καὶ ὁ Ἰησοῦς καὶ οἱ μαθηταὶ αὐτοῦ εἰς τὸν γάμον

[343720] John 2.16
and τοῖς τὰς περιστερὰς πωλοῦσιν εἶπεν ἄρατε ταῦτα ἐντεῦθεν . don't make my father's house a market-place

[343570] John 2.9,2.10
as ἐγεύσατο ὁ ἀρχιτρίκλινος τὸ ὕδωρ οἶνον γεγενημένον καὶ οὐκ ᾔδει πόθεν ἐστίν ( but the servants who drew the water knew ) ὁ ἀρχιτρίκλινος calls the groom λέγει αὐτῷ πᾶς ἄνθρωπος first put out the good wine and when they are drunk , the inferior . you have kept the good wine until now

[343575] John 2.9,2.10
as ἐγεύσατο ὁ ἀρχιτρίκλινος τὸ ὕδωρ οἶνον γεγενημένον καὶ οὐκ ᾔδει πόθεν ἐστίν ( but the servants who drew the water knew ) ὁ ἀρχιτρίκλινος calls the groom λέγει αὐτῷ πᾶς ἄνθρωπος first put out the good wine and when they are drunk , the inferior . you have kept the good wine until now

[343563] John 2.9,2.10
as ἐγεύσατο ὁ ἀρχιτρίκλινος τὸ ὕδωρ οἶνον γεγενημένον καὶ οὐκ ᾔδει πόθεν ἐστίν ( οἱ διάκονοι οἱ ἠντληκότες τὸ ὕδωρ knew ) ὁ ἀρχιτρίκλινος calls the groom λέγει αὐτῷ πᾶς ἄνθρωπος first put out the good wine and when they are drunk , the inferior . you have kept the good wine until now

[343474] John 2.4
λέγει αὐτῇ ὁ Ἰησοῦς , what (concern is that) to me and you , woman ? οὔπω ἥκει ἡ ὥρα μου

[398696] John 2.4
λέγει αὐτῇ ὁ Ἰησοῦς , τί ἐμοὶ καὶ σοί , woman ? οὔπω ἥκει ἡ ὥρα μου

[343819] John 2.22
so when ἠγέρθη ἐκ νεκρῶν , οἱ μαθηταὶ αὐτοῦ remembered that τοῦτο ἔλεγεν and they believed the Scripture and the word which ὁ Ἰησοῦς said

[343823] John 2.22
ὅτε οὖν ἠγέρθη ἐκ νεκρῶν ἐμνήσθησαν οἱ μαθηταὶ αὐτοῦ ὅτι τοῦτο ἔλεγεν and they believed the Scripture and the word which ὁ Ἰησοῦς said

[343845] John 2.23
when δὲ ἦν ἐν τοῖς Ἱεροσολύμοις ἐν τῷ πάσχα ἐν τῇ ἑορτῇ , many believed in his name , seeing his signs which he was doing

[343832] John 2.22
ὅτε οὖν ἠγέρθη ἐκ νεκρῶν ἐμνήσθησαν οἱ μαθηταὶ αὐτοῦ ὅτι τοῦτο ἔλεγεν and ἐπίστευσαν τῇ γραφῇ καὶ τῷ λόγῳ ὃν εἶπεν ὁ Ἰησοῦς

[343831] John 2.22
ὅτε οὖν ἠγέρθη ἐκ νεκρῶν ἐμνήσθησαν οἱ μαθηταὶ αὐτοῦ ὅτι τοῦτο ἔλεγεν καὶ ἐπίστευσαν τῇ γραφῇ καὶ τῷ λόγῳ ὃν εἶπεν ὁ Ἰησοῦς

[343449] John 2.3
καὶ οἶνον οὐκ εἶχον ὅτι συνετελέσθη ὁ οἶνος τοῦ γάμου . εἶτα λέγει ἡ μήτηρ τοῦ Ἰησοῦ πρὸς αὐτόν : οἶνος οὐκ ἔστιν

[343782] John 2.19
ἀπεκρίθη Ἰησοῦς καὶ εἶπεν αὐτοῖς : destroy this temple and ἐν τρισὶν ἡμέραις ἐγερῶ αὐτόν

[343804] John 2.20
εἶπαν οὖν οἱ Ἰουδαῖοι : ὁ ναὸς οὗτος was built in forty-six years and σὺ ἐν τρισὶν ἡμέραις ἐγερεῖς αὐτόν ?

[343773] John 2.19
ἀπεκρίθη Ἰησοῦς καὶ εἶπεν αὐτοῖς : λύσατε τὸν ναὸν τοῦτον and ἐν τρισὶν ἡμέραις ἐγερῶ αὐτόν

[343778] John 2.19
ἀπεκρίθη Ἰησοῦς καὶ εἶπεν αὐτοῖς : λύσατε τὸν ναὸν τοῦτον καὶ ἐν τρισὶν ἡμέραις ἐγερῶ αὐτόν

[343585] John 2.10
λέγει αὐτῷ πᾶς ἄνθρωπος πρῶτον τὸν καλὸν οἶνον τίθησιν and when they are drunk , the inferior . you have kept the good wine until now

[398697] John 2.10
λέγει αὐτῷ πᾶς ἄνθρωπος πρῶτον τὸν καλὸν οἶνον τίθησιν and ὅταν μεθυσθῶσιν τὸν ἐλάσσω . you have kept the good wine until now

[343587] John 2.10
λέγει αὐτῷ πᾶς ἄνθρωπος πρῶτον τὸν καλὸν οἶνον τίθησιν καὶ ὅταν μεθυσθῶσιν τὸν ἐλάσσω . you have kept the good wine until now

[343594] John 2.10
λέγει αὐτῷ πᾶς ἄνθρωπος πρῶτον τὸν καλὸν οἶνον τίθησιν καὶ ὅταν μεθυσθῶσιν τὸν ἐλάσσω . σὺ τετήρηκας τὸν καλὸν οἶνον ἕως ἄρτι

[343743] John 2.17
ἐμνήσθησαν οἱ μαθηταὶ αὐτοῦ ὅτι γεγραμμένον ἐστίν : ὁ ζῆλος τοῦ οἴκου σου will devour me

[343747] John 2.17
ἐμνήσθησαν οἱ μαθηταὶ αὐτοῦ ὅτι γεγραμμένον ἐστίν : ὁ ζῆλος τοῦ οἴκου σου καταφάγεταί με

[343628] John 2.12
Μετὰ τοῦτο κατέβη εἰς Καφαρναοὺμ αὐτὸς καὶ ἡ μήτηρ αὐτοῦ καὶ οἱ ἀδελφοὶ αὐτοῦ καὶ οἱ μαθηταὶ αὐτοῦ and there they remained not many days

[343647] John 2.12
Μετὰ τοῦτο κατέβη εἰς Καφαρναοὺμ αὐτὸς καὶ ἡ μήτηρ αὐτοῦ καὶ οἱ ἀδελφοὶ αὐτοῦ καὶ οἱ μαθηταὶ αὐτοῦ and ἐκεῖ ἔμειναν οὐ πολλὰς ἡμέρας

[343645] John 2.12
Μετὰ τοῦτο κατέβη εἰς Καφαρναοὺμ αὐτὸς καὶ ἡ μήτηρ αὐτοῦ καὶ οἱ ἀδελφοὶ αὐτοῦ καὶ οἱ μαθηταὶ αὐτοῦ καὶ ἐκεῖ ἔμειναν οὐ πολλὰς ἡμέρας

[343890] John 2.24,2.25
αὐτὸς Ἰησοῦς did not entrust himself to them because he knows everyone and because he had no need that τις μαρτυρήσῃ περὶ τοῦ ἀνθρώπου for he knew what was in man

[343887] John 2.24,2.25
αὐτὸς Ἰησοῦς did not entrust himself to them because he knows everyone and because οὐ χρείαν εἶχεν ἵνα τις μαρτυρήσῃ περὶ τοῦ ἀνθρώπου for he knew what was in man

[343794] John 2.20
εἶπαν οὖν οἱ Ἰουδαῖοι : τεσσεράκοντα καὶ ἓξ ἔτεσιν οἰκοδομήθη ὁ ναὸς οὗτος and σὺ ἐν τρισὶν ἡμέραις ἐγερεῖς αὐτόν ?

[343799] John 2.20
εἶπαν οὖν οἱ Ἰουδαῖοι : τεσσεράκοντα καὶ ἓξ ἔτεσιν οἰκοδομήθη ὁ ναὸς οὗτος καὶ σὺ ἐν τρισὶν ἡμέραις ἐγερεῖς αὐτόν ?

[343613] John 2.11
This beginning of signs ὁ Ἰησοῦς did in Cana of Galilee and ἐφανέρωσεν τὴν δόξαν αὐτοῦ and ἐπίστευσαν εἰς αὐτὸν οἱ μαθηταὶ αὐτοῦ

[343705] John 2.14,2.15,2.16
he found, sitting in the temple , the ones selling oxen and sheep and doves , and the coin-dealers and, having made a whip out of ropes , he threw out of the temple all the sheep and the oxen and τῶν κολλυβιστῶν ἐξέχεεν τὸ κέρμα and τὰς τραπέζας ἀνέστρεψεν and τοῖς τὰς περιστερὰς πωλοῦσιν εἶπεν ἄρατε ταῦτα ἐντεῦθεν . don't make my father's house a market-place

[343809] John 2.21
ἐκεῖνος δὲ ἔλεγεν περὶ τοῦ ναοῦ τοῦ σώματος αὐτοῦ

[343897] John 2.25
and because οὐ χρείαν εἶχεν ἵνα τις μαρτυρήσῃ περὶ τοῦ ἀνθρώπου αὐτὸς γὰρ ἐγίνωσκεν τί ἦν ἐν τῷ ἀνθρώπῳ

[343760] John 2.18
ἀπεκρίθησαν οὖν οἱ Ἰουδαῖοι καὶ εἶπαν αὐτῷ : τί σημεῖον δεικνύεις ἡμῖν ὅτι ταῦτα ποιεῖς ?

[343519] John 2.7
λέγει αὐτοῖς ὁ Ἰησοῦς : γεμίσατε τὰς ὑδρίας ὕδατος and they filled them up to the top

[343525] John 2.7
λέγει αὐτοῖς ὁ Ἰησοῦς : γεμίσατε τὰς ὑδρίας ὕδατος καὶ ἐγέμισαν αὐτὰς ἕως ἄνω

[343874] John 2.24,2.25
αὐτὸς Ἰησοῦς did not entrust himself to them because he knows everyone and because οὐ χρείαν εἶχεν ἵνα τις μαρτυρήσῃ περὶ τοῦ ἀνθρώπου αὐτὸς γὰρ ἐγίνωσκεν τί ἦν ἐν τῷ ἀνθρώπῳ

[343409] John 2.1
Καὶ τῇ ἡμέρᾳ τῇ τρίτῃ γάμος ἐγένετο ἐν Κανὰ τῆς Γαλιλαίας and ἦν ἡ μήτηρ τοῦ Ἰησοῦ ἐκεῖ

[343415] John 2.1
Καὶ τῇ ἡμέρᾳ τῇ τρίτῃ γάμος ἐγένετο ἐν Κανὰ τῆς Γαλιλαίας καὶ ἦν ἡ μήτηρ τοῦ Ἰησοῦ ἐκεῖ

[343602] John 2.11
ταύτην ἐποίησεν ἀρχὴν τῶν σημείων ὁ Ἰησοῦς ἐν Κανὰ τῆς Γαλιλαίας and ἐφανέρωσεν τὴν δόξαν αὐτοῦ and ἐπίστευσαν εἰς αὐτὸν οἱ μαθηταὶ αὐτοῦ

[343612] John 2.11
ταύτην ἐποίησεν ἀρχὴν τῶν σημείων ὁ Ἰησοῦς ἐν Κανὰ τῆς Γαλιλαίας καὶ ἐφανέρωσεν τὴν δόξαν αὐτοῦ καὶ ἐπίστευσαν εἰς αὐτὸν οἱ μαθηταὶ αὐτοῦ

[343725] John 2.16
and τοῖς τὰς περιστερὰς πωλοῦσιν εἶπεν ἄρατε ταῦτα ἐντεῦθεν . μὴ ποιεῖτε τὸν οἶκον τοῦ πατρός μου οἶκον ἐμπορίου

[343492] John 2.5
λέγει ἡ μήτηρ αὐτοῦ τοῖς διακόνοις : ὅ τι ἂν λέγῃ ὑμῖν ποιήσατε

[343668] John 2.14,2.15,2.16
καὶ εὗρεν ἐν τῷ ἱερῷ τοὺς πωλοῦντας βόας καὶ πρόβατα καὶ περιστερὰς καὶ τοὺς κερματιστὰς καθημένους and, having made a whip out of ropes , he threw out of the temple all the sheep and the oxen and τῶν κολλυβιστῶν ἐξέχεεν τὸ κέρμα and τὰς τραπέζας ἀνέστρεψεν and τοῖς τὰς περιστερὰς πωλοῦσιν εἶπεν ἄρατε ταῦτα ἐντεῦθεν . μὴ ποιεῖτε τὸν οἶκον τοῦ πατρός μου οἶκον ἐμπορίου

[343690] John 2.14,2.15,2.16
καὶ εὗρεν ἐν τῷ ἱερῷ τοὺς πωλοῦντας βόας καὶ πρόβατα καὶ περιστερὰς καὶ τοὺς κερματιστὰς καθημένους and, ποιήσας φραγέλλιον ἐκ σχοινίων πάντας ἐξέβαλεν ἐκ τοῦ ἱεροῦ τά τε πρόβατα καὶ τοὺς βόας and τῶν κολλυβιστῶν ἐξέχεεν τὸ κέρμα and τὰς τραπέζας ἀνέστρεψεν and τοῖς τὰς περιστερὰς πωλοῦσιν εἶπεν ἄρατε ταῦτα ἐντεῦθεν . μὴ ποιεῖτε τὸν οἶκον τοῦ πατρός μου οἶκον ἐμπορίου

[343684] John 2.14,2.15,2.16
καὶ εὗρεν ἐν τῷ ἱερῷ τοὺς πωλοῦντας βόας καὶ πρόβατα καὶ περιστερὰς καὶ τοὺς κερματιστὰς καθημένους and, ποιήσας φραγέλλιον ἐκ σχοινίων πάντας ἐξέβαλεν ἐκ τοῦ ἱεροῦ τά τε πρόβατα καὶ τοὺς βόας and τῶν κολλυβιστῶν ἐξέχεεν τὸ κέρμα and τὰς τραπέζας ἀνέστρεψεν and τοῖς τὰς περιστερὰς πωλοῦσιν εἶπεν ἄρατε ταῦτα ἐντεῦθεν . μὴ ποιεῖτε τὸν οἶκον τοῦ πατρός μου οἶκον ἐμπορίου

[343498] John 2.6
there were there, standing according to the purification (rites) of the Jews , λίθιναι ὑδρίαι ἓξ χωροῦσαι ἀνὰ μετρητὰς δύο ἢ τρεῖς

[343494] John 2.6
ἦσαν δὲ ἐκεῖ λίθιναι ὑδρίαι ἓξ κατὰ τὸν καθαρισμὸν τῶν Ἰουδαίων κείμεναι χωροῦσαι ἀνὰ μετρητὰς δύο ἢ τρεῖς

[343857] John 2.23
Ὡς δὲ ἦν ἐν τοῖς Ἱεροσολύμοις ἐν τῷ πάσχα ἐν τῇ ἑορτῇ πολλοὶ ἐπίστευσαν εἰς τὸ ὄνομα αὐτοῦ θεωροῦντες αὐτοῦ τὰ σημεῖα ἃ ἐποίει

All Subtrees Not Just Clauses

2010-04-14

A post to the graded-reader mailing list from April 14, 2010.

I just ran a quick experiment where I treated the targets to learn not just as the clauses but any subtree in the dependency tree that has more than one word.

This results in 8209 targets in John's gospel instead of 3206.

Obviously it means learning common noun phrases and prepositional phrases first.

In particular, these are the first things learnt when using the next-best algorithm:

ὁ Ἰησοῦς
ἐν αὐτῷ
τοῦ θεοῦ
ἐκ θεοῦ
λέγει αὐτῷ
λέγει αὐτῷ Ἰησοῦς
λέγει αὐτῷ ὁ Ἰησοῦς
καὶ λέγει αὐτῷ
εἰς αὐτόν
πρὸς αὐτόν
τὸν πατέρα
πρὸς τὸν πατέρα
τὸν πατέρα μου
καὶ τὸν πατέρα μου
ἐν αὐτοῖς
λέγει αὐτοῖς
καὶ λέγει αὐτοῖς
λέγει αὐτοῖς ὁ Ἰησοῦς
εἶπεν αὐτῷ
καὶ εἶπεν ὁ Ἰησοῦς

Compare this with the first things learnt when the targets are clauses only (i.e. only subtrees rooted on "pred"):

εἶπεν
εἶπεν αὐτῷ
ἀπεκρίθη Ἰησοῦς
ἀπεκρίθη αὐτῷ Ἰησοῦς
ἀπεκρίθη Ἰησοῦς αὐτῷ
λέγει
λέγει αὐτῷ
λέγει αὐτῷ Ἰησοῦς
λέγει αὐτῷ ὁ Ἰησοῦς
εἶπεν αὐτῷ ὁ Ἰησοῦς
ἀπεκρίθη ὁ Ἰησοῦς
λέγει αὐτοῖς
λέγει αὐτοῖς Ἰησοῦς
λέγει αὐτοῖς ὁ Ἰησοῦς
εἶπεν αὐτοῖς
ἀπεκρίθη αὐτοῖς
ἀπεκρίθη αὐτοῖς Ἰησοῦς
ἀπεκρίθη αὐτοῖς ὁ Ἰησοῦς
εἶπεν αὐτοῖς Ἰησοῦς
εἶπεν αὐτοῖς ὁ Ἰησοῦς

(note, these are just based on surface form in text with no reference to any other linguistic information)

While it's kind of nice seeing the noun phrases emerge in the first list, I worry about learning prepositional phrases in isolation from their verb. Thoughts? Of course, when combined with inline replacement into English, the verb will be shown, albeit in English.

I also realise now, the former list should include one-word subtrees if the word is a "pred".
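As a rough sketch of that target selection (every subtree of more than one word, plus one-word subtrees rooted on a "pred"), assuming a simplified (id, form, head, relation) token record rather than the real PROIEL data. The token list, the function name and the relation labels other than "pred" are purely illustrative:

```python
from collections import defaultdict

# Hypothetical simplified tokens: (id, form, head_id, relation), ids in text order.
TOKENS = [
    (1, "λέγει", 0, "pred"),
    (2, "αὐτῷ", 1, "obl"),
    (3, "ὁ", 4, "aux"),
    (4, "Ἰησοῦς", 1, "sub"),
]


def subtree_targets(tokens):
    """Return every subtree of more than one word, plus one-word 'pred' subtrees."""
    forms = {tid: form for tid, form, _, _ in tokens}
    rels = {tid: rel for tid, _, _, rel in tokens}
    children = defaultdict(list)
    for tid, _, head, _ in tokens:
        children[head].append(tid)

    def ids_under(root):
        # The root itself plus everything below it in the tree.
        ids = [root]
        for child in children[root]:
            ids.extend(ids_under(child))
        return ids

    targets = []
    for tid in sorted(forms):
        ids = sorted(ids_under(tid))  # back into surface order
        if len(ids) > 1 or rels[tid] == "pred":
            targets.append(" ".join(forms[i] for i in ids))
    return targets


print(subtree_targets(TOKENS))
```

On this toy sentence the noun phrase ὁ Ἰησοῦς comes out as a target alongside the full clause, which is the effect seen in the first list.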

James

Initial Code Based on PROIEL Dependency Analysis

2010-04-12

A post to the graded-reader mailing list from April 12, 2010.

Until this weekend, all the GNT graded reader work I'd done has used clause boundaries from OpenText.org.

With the availability of the PROIEL dependency tree analysis, I thought I'd give that a go.

I've uploaded to github code for extracting the clauses in John's Gospel and generating a very basic reading programme from that.

Clauses were extracted by looking at any 'pred' arc and linearizing all nodes from that point down. If there were embedded preds then clauses corresponding to both inner and outer preds were generated.
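A minimal sketch of that extraction, not the code uploaded to github. It assumes a simplified (id, form, head, relation) token record rather than the PROIEL XML format, and the sample tokens and labels other than "pred" are made up for illustration:

```python
from collections import defaultdict

# Hypothetical simplified tokens: (id, form, head_id, relation), ids in text order.
TOKENS = [
    (1, "λέγει", 0, "pred"),
    (2, "αὐτῷ", 1, "obl"),
    (3, "ὁ", 4, "aux"),
    (4, "Ἰησοῦς", 1, "sub"),
]


def subtree_ids(root, children):
    """Collect the id of root and of every node below it."""
    ids = [root]
    for child in children[root]:
        ids.extend(subtree_ids(child, children))
    return ids


def extract_clauses(tokens):
    """Linearize the subtree under every 'pred' arc, inner and outer alike."""
    forms = {tid: form for tid, form, _, _ in tokens}
    children = defaultdict(list)
    for tid, _, head, _ in tokens:
        children[head].append(tid)
    return [
        " ".join(forms[i] for i in sorted(subtree_ids(tid, children)))
        for tid, _, _, rel in tokens
        if rel == "pred"
    ]


print(extract_clauses(TOKENS))
```

Because every "pred" node is visited independently, an embedded pred yields its own clause in addition to being part of the outer one.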

Note that the current code is just based on forms, with no use made of syntactic or morphological information. I also can't do inline replacement into an English context because I don't have an English text mapped to the PROIEL analysis.

However, my initial impression is that the PROIEL analysis will be preferable to work with moving forward.

James

Then Patrick Narkinsky asked:

Could you clarify in what ways you see the PROIEL data being superior to the opentext data? One obvious one that leaps to mind is that OpenText seems to be a dead project...

It's actively maintained, is redistributable under a CC license, is based on a freely redistributable text and is a less idiosyncratic analysis.

Admittedly, I haven't spent THAT much time with it but it seems that it will be easier to extract the kind of syntactic information I'm interested in from it.

James

My BibleTech 2010 Talk

2010-03-28

Yesterday I gave a talk on the graded reader ideas at BibleTech.

Here is a video of my talk.

The abstract:

We will discuss a new approach to language learning based on texts, with a special focus on learning Greek from the New Testament.

We will be covering how various linguistic analyses of a text such as the Greek New Testament can help determine the order in which vocabulary and grammar is introduced and how each new word or grammatical concept can be shown in the context of the text.

Lastly, we will also discuss various algorithms that have been implemented as well as open source Python code for producing this new kind of graded reader.

The “Next-Best” Algorithm

2008-04-01

A post to the graded-reader mailing list from April 1, 2008.

In the last few posts, I've mentioned a simple algorithm I've used (one of a number) for ordering items.

The Input

This algorithm, like all the ordering algorithms I've tried, takes as input a list of target-item pairs. For example,

T1 I1
T1 I3
T1 I7
T2 I2
T2 I7
T3 I4
...

means that to read T1, you need to know I1, I3, I7; to read T2, you need to know I2, I7 and so on.
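As an illustration of this input format, here is a small sketch that reads such pairs into a mapping from each target to the set of items it needs. The function name and the whitespace-separated layout are assumptions for the example:

```python
from collections import defaultdict

# The visible pairs from the example above (the elided remainder is left out).
PAIRS = """\
T1 I1
T1 I3
T1 I7
T2 I2
T2 I7
T3 I4
"""


def load_prerequisites(text):
    """Map each target to the set of items needed to read it."""
    prereqs = defaultdict(set)
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            target, item = line.split()
            prereqs[target].add(item)
    return dict(prereqs)


print(load_prerequisites(PAIRS))
```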

The targets and items can be anything. For the various stats I've posted here I've used verses for the targets and either lemmas or inflected forms for the items. In the sample reader online, I use clauses as the targets and a combination of lemmas, inflected forms and a little bit of morphology (not much yet). If you want to model the fact that students can't read a target until they've learnt some syntactic point or even some cultural point, that can be modeled by including an appropriate item for this.

I make this point to emphasize that the ordering algorithm is independent of what we choose as targets and what items we include as prerequisites to being able to comprehend those targets.

The Output

What this algorithm (like my others) outputs is what I sometimes refer to, in comments and elsewhere, as a "learning programme" (yes, I tend to use that spelling when referring to any ordered list to be followed that isn't a computer program).

Such a programme looks like this:

learn I2
learn I5
learn I7
know T2
learn I1
learn I3
know T1

Note that this algorithm will sometimes (as it does in the example above) prematurely mention an item that could be delayed (in this case I5) so the optimize-order code I've mentioned previously and uploaded to Google Code is useful as a post-processing step.

The Algorithm Itself

The algorithm is very simple and follows an iterative process. At each step, each item not yet learnt is assigned a score. The item with the highest score is then learnt and the process repeats (with the scores being recalculated each time on the remaining items).

The score favours items that are the only remaining unlearnt item (or one of only a few remaining unlearnt items) in a lot of different targets.

At each step, each unlearnt item receives, for each target the item is a prerequisite for, an additional score of 1 / 2^num_unlearnt_items_in_target.

In other words, for each target the item is the only unlearnt item in, the score goes up by 1/2, for each target the item is one of two unlearnt items in, the score goes up by 1/4, for each target the item is one of three unlearnt items in, the score goes up by 1/8 and so on.

I haven't done much experimentation to see if this exponential decay is optimal but it seems to give good results.
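The scoring and selection loop described above can be sketched as follows. This is a minimal illustration, not the checked-in script. The function name, the toy input and the alphabetical tie-break are assumptions:

```python
# Toy input built from the visible target-item pairs; real input is much larger.
PREREQS = {
    "T1": {"I1", "I3", "I7"},
    "T2": {"I2", "I7"},
    "T3": {"I4"},
}


def next_best(prereqs):
    """Return a programme as a list of ("learn", item) / ("know", target) steps."""
    # Unlearnt items still blocking each target; targets with no items are skipped.
    remaining = {t: set(items) for t, items in prereqs.items() if items}
    programme = []
    while remaining:
        # Each unlearnt item gains 1 / 2**n for every target it blocks,
        # where n is the number of unlearnt items left in that target.
        scores = {}
        for items in remaining.values():
            for item in items:
                scores[item] = scores.get(item, 0.0) + 1 / 2 ** len(items)
        # Ties are broken alphabetically (an assumption; the post does not say).
        best = max(sorted(scores), key=scores.get)
        programme.append(("learn", best))
        for target in sorted(remaining):
            remaining[target].discard(best)
            if not remaining[target]:
                programme.append(("know", target))
                del remaining[target]
    return programme


for step, name in next_best(PREREQS):
    print(step, name)
```

On this toy input the sketch learns I4 first, since it alone unlocks T3 and so scores 1/2, then I7, which scores 1/8 + 1/4 across T1 and T2.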

Because this algorithm is iterative and picks a single item at each step rather than exploring multiple ordering possibilities, I'm tentatively calling this algorithm the "next-best" algorithm.

I've checked in the code as http://code.google.com/p/graded-reader/source/browse/trunk/code/next-best.py

It is important to note that this algorithm currently considers all items equally easy (or difficult!) to learn and assumes they are independent. However, it would be relatively easy to augment the algorithm with difficulty weightings and I plan to do that soon.

Another feature that I'm considering is being able to "pin down" certain items as not being available until a particular point. You may, for example, want to delay the introduction of participles but otherwise have the algorithm come up with its own ordering.

James

Vocab Coverage Table for a Better Ordering

2008-03-29T10:43:00

A post to the graded-reader mailing list from March 29, 2008.

I thought I'd calculate the vocabulary coverage table assuming the ordering generated for the post "just how much can frequency ordering be improved on?". To do this, I modified vocab-coverage.py to load in an arbitrary learning programme instead of assuming a frequency ordering. The code is now checked in as vocab-coverage-arbitrary.py.

Here's the original frequency ordering of forms in the Greek NT (using counts rather than percentages in the cells):

                   0%     50%     75%     90%     95%     100%

          100    7928    4585      88       1       0        0
          200    7931    6291     515      26       4        4
          500    7935    7388    2149     182      46       39
         1000    7937    7700    4085     631     184      141
         2000    7938    7838    5765    1736     628      456
         5000    7939    7920    7232    4161    2275     1711
         8000    7939    7935    7684    5691    3784     3004
        12000    7941    7939    7879    6858    5149     4310
        16000    7941    7941    7937    7777    7060     6549
        20000    7941    7941    7941    7941    7941     7941

And here's the table with the ordering produced in the "just how much can frequency ordering be improved on?" post:

                   0%     50%     75%     90%       95%      100%

          100    7896    1762      78     *37*      *36*      *36*
          200    7927    4590     339     *81*      *71*      *70*
          500    7933    6781    1572    *315*     *225*     *213*
         1000    7935    7455    3155    *802*     *526*     *491*
         2000    7936    7739    4872   *1820*    *1242*    *1144*
         5000    7939    7869    6400    3592     *3246*    *3244*
         8000    7939    7908    7156    5071     *4745*    *4742*
        12000    7939    7924    7501    6501     *6463*    *6463*
        16000    7940    7933    7791    7646     *7645*    *7645*
        20000    7941    7941    7941    7941      7941      7941

I've marked with asterisks those instances where the number is better than the frequency ordering.

Note that because the ordering algorithm was highly biased towards reading entire verses, it is actually worse for coverage levels of 75% or below. Even at 90%, it's only better for the first 2000 items.

But for the 100% familiarity level, you can see just how much better even the simple algorithm I used (which I will explain shortly) is than frequency ordering. For 200 forms, you get 70 verses instead of 4!

I'll repeat the caveats I mentioned in the other post, though: items are considered independent and equally easy to learn, there's no consideration of morphology, syntax, idiom and this is using verses as targets. We'll fix all that over time.

James

Ordering is Ultimately of Targets not Items

2008-03-29T10:42:00

A post to the graded-reader mailing list from March 29, 2008.

[this is based on a blog post from August 2005 but with the terminology changed]

Say you have written a program which lists an order in which to learn items along with an indication, every so often, of what new target has been reached. Running on the Greek lexemes of 1John, you might get something starting like this:

learn μαρτυρέω
learn θεός
learn ἐν
learn εἰμί
learn ὁ
learn τρεῖς
learn ὅτι
know 230507

This gives seven items to learn and then a target that has been reached (230507 = 1John 5.7). The problem is that two of those items are unnecessary. You only need to learn μαρτυρέω, εἰμί, ὁ, τρεῖς and ὅτι to be able to read 1John 5.7.

The problem is that the program is ordering items first and only then establishing at each point what goals (if any) have been achieved.

What you really want to do is not display an item until it is needed. So back in 2005, I wrote some code that optimizes the ordering of items by delaying any that are not yet needed.

I've now made that code more generic and will check it in shortly.

It can be used as a post-processor on ordering from any source, even a manually crafted list of items. It will optimize the ordering of items for the same ordering of targets.
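A minimal sketch of that kind of post-processing, assuming the programme is a list of ("learn", item) and ("know", target) steps and that the prerequisites of each target are available as a dict. The function name is hypothetical, the score is not computed here, and placing never-needed items at the end is an assumption:

```python
def delay_items(programme, prereqs):
    """Re-emit a programme so each item appears just before the first target needing it.

    The order of targets is preserved; only the items move.
    """
    item_order = [name for kind, name in programme if kind == "learn"]
    learnt = set()
    optimized = []
    for kind, name in programme:
        if kind != "know":
            continue
        # Emit, in their original relative order, the items this target still needs.
        for item in item_order:
            if item in prereqs[name] and item not in learnt:
                optimized.append(("learn", item))
                learnt.add(item)
        optimized.append(("know", name))
    # Items that no target needed are pushed to the end (an assumption).
    for item in item_order:
        if item not in learnt:
            optimized.append(("learn", item))
            learnt.add(item)
    return optimized


PROGRAMME = [
    ("learn", "μαρτυρέω"), ("learn", "θεός"), ("learn", "ἐν"),
    ("learn", "εἰμί"), ("learn", "ὁ"), ("learn", "τρεῖς"),
    ("learn", "ὅτι"), ("know", "230507"),
]
PREREQS = {"230507": {"μαρτυρέω", "εἰμί", "ὁ", "τρεῖς", "ὅτι"}}

for step, name in delay_items(PROGRAMME, PREREQS):
    print(step, name)
```

Applied to the 1John example, θεός and ἐν move to after the point where 230507 is known.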

Because the algorithm for doing such an optimization is nearly identical to what's necessary to calculate the "area under the curve" that I described in my video (and will write more about soon) my new code also outputs a score.

I'll be checking it in shortly.

James

It's available at:

http://code.google.com/p/graded-reader/source/browse/trunk/code/optimize-order.py

James

Just How Much Can Frequency Ordering Be Improved On?

2008-03-26T10:40:00

A post to the graded-reader mailing list from March 26, 2008.

Here's a quick demonstration. Recall that in my previous post, I pointed out that learning the top 100 inflected forms gives you 0 (zero, nada) target verses in the GNT. I showed that, for example, target 130528 (1 Thessalonians 5.28) gets excluded because of one form that is #235 while the other eight forms appear in the top 66.

Well, what if those 9 forms were learnt first? That is:

Χριστοῦ, κυρίου, Ἰησοῦ, ὑμῶν, μετά, τοῦ, χάρις, ἡ, ἡμῶν

Not only could 130528 be read but also 071623

Now if the reader learnt πάντων (just one more form) they could read three more verses: 140318, 191325 and 272221

Now introduce these six forms:

καί, ὑμῖν, ἀπό, εἰρήνη, πατρός, θεοῦ

and suddenly seven more verses are readable: 140102, 070103, 100102, 110102, 090103, 180103, 080102

This was just with one algorithm I'm experimenting with (which I'll explain and provide code for soon) and there are likely others that do better.

So instead of 100 forms giving 0 verses, we now have just 16 forms giving us 12 entire verses from an actual corpus.

The usual caveats apply: items are considered independent and equally easy to learn, there's no consideration of morphology, syntax, idiom and this is using verses as targets. We'll fix all that over time.

James

If Only They Knew That One Rare Word...

2008-03-26T10:41:00

A post to the graded-reader mailing list from March 26, 2008.

I'm going to talk in more detail about alternatives to frequency order in a different thread but I wanted to share the results of a quite striking little test I did.

In my last post, I showed the vocab/coverage table applied to fully inflected forms in the Greek NT rather than lexemes. You may have noticed that the 100% coverage column and even the 95% coverage column said 0.0% verses for the 100 most frequent forms.

If you did, you might then have wondered: is this just a rounding error? The answer is no. Even if you knew the 100 most frequent inflected forms in the GNT, there is not a single verse you would know all the forms in (of course assuming you couldn't guess).

I wanted to test if this was because of just one outlier. So I modified (added 4 extra lines) the code that produced the table to instead output a list of the top ten targets (i.e. verses) whose second least frequent item (i.e. form) is most frequent overall.

Here are the results:

032030      2         [1, 2, 1077]
030146     35         [1, 35, 524]
041135     46         [2, 46, 14597]
130528     66         [5, 19, 38, 45, 49, 59, 65, 66, 235]
071623     66         [5, 19, 38, 45, 59, 66, 235]
070323     68         [3, 3, 29, 65, 68, 131]
020940     72         [8, 18, 22, 22, 44, 49, 49, 72, 102]
012425     78         [36, 78, 2846]
060211     96         [8, 14, 18, 22, 79, 96, 4276]
130519     98         [7, 17, 98, 14731]

What this listing is showing is that, for example, target 032030 (Luke 20.30) consists of the 1st, 2nd and 1077th most frequent forms; target 030146 (Luke 1.46) consists of the 1st, 35th and 524th most frequent forms. So if the rarest word wasn't needed, they would jump from needing the top 1077 forms to just the top 2 and from needing the top 524 forms to the top 35.

Now you may argue that many of these are bad examples because the verse doesn't make sense in isolation (a good reason to be more careful about what to use as targets) or that the one rare word is actually the one carrying most of the semantic weight.

But this little test demonstrates that sometimes a single rare item can massively delay reading an otherwise quite readable target unit.
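A listing of this kind could be produced along the following lines. This is only a sketch under stated assumptions: the input is taken to be a dict from target id to its list of tokens, tied frequencies are ranked in first-seen order, and the function name and toy data are made up:

```python
from collections import Counter


def if_only(target_items, top=10):
    """Rank targets by the frequency rank of their second least frequent item."""
    counts = Counter(item for items in target_items.values() for item in items)
    # Rank 1 is the most frequent item; ties keep first-seen order.
    rank = {item: r for r, (item, _) in enumerate(counts.most_common(), start=1)}
    rows = []
    for target, items in target_items.items():
        ranks = sorted(rank[item] for item in items)
        if len(ranks) >= 2:
            # ranks[-1] is the rarest item, ranks[-2] the second rarest.
            rows.append((target, ranks[-2], ranks))
    rows.sort(key=lambda row: row[1])
    return rows[:top]


TOY = {
    "A": ["x", "y", "r1"],
    "B": ["x", "y", "x"],
    "C": ["q1", "q2"],
}

for target, second_rarest, ranks in if_only(TOY):
    print(target, second_rarest, ranks)
```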

By the way, here's the same listing based on lexemes rather than fully inflected forms:

032030      2           [1, 2, 346]
030146      9           [2, 9, 509]
011615      9           [3, 4, 5, 7, 8, 9, 9, 33]
032448     13           [4, 13, 415]
090124     14           [1, 2, 6, 7, 14, 267]
021337     16           [4, 5, 9, 9, 12, 16, 588]
040620     17           [1, 3, 5, 7, 8, 9, 17, 180]
041135     19           [1, 19, 4752]
040426     19           [1, 1, 3, 4, 7, 8, 9, 19, 56]
031934     24           [1, 1, 3, 5, 9, 15, 23, 24, 311]

I'll check in the code that produces this shortly.

James

It's now available at

http://code.google.com/p/graded-reader/source/browse/trunk/code/if-only.py

James

GNT Verse Coverage with Frequency Ordering

2008-03-25

A post to the graded-reader mailing list from March 25, 2008.

[if you'll indulge me, I'm trying to get all my thoughts and previous writing on these topics in one place and this list is a good place to do it]

[this is based on a post to b-greek[1] and my blog[2]. I hope the table comes out! ]

It is fairly common, in the context of learning vocabulary for a particular corpus like the Greek New Testament, to talk about what proportion of the text one could read if one learnt the top N words. I even produced such a table for the GNT back in 1996—see New Testament Vocabulary Count Statistics[3].

But these sorts of numbers are highly misleading because they don't tell you what proportion of sentences (or as a rough proxy in the GNT case: verses) you could read, only what proportion of words.

Reading theorists have suggested that you need to know 95% of the vocabulary of a sentence to comprehend it. So a more interesting statistic would be how many verses one could understand 95% of the vocab of if one knew a certain number of words. Of course, there's a lot more to reading comprehension than knowing the vocab. But it was enough for me to decide to write some code yesterday afternoon to run against my MorphGNT database.

To first of all give you a flavour in the specific before moving to the final numbers, consider John 3.16, which is, from a vocabulary point of view, a very easy verse to read.

To be able to read 50% of it, you only need to know the top 28 lexemes in the GNT. To read 75% you only need the top 85 (up to κόσμος). With the top 204 lexemes, you can read 90% of the verse and only a few more: up to 236 (αἰώνιος) gives you the 95%. The only word you would not have come across learning the top 236 words would be μονογενής but even that is in the top 1,200.

This example does highlight some of the shortcomings of this sort of analysis. There's no consideration of necessary knowledge of morphology, syntax, idioms, etc. Nor does it allow for the fact that the meaning of something like μονογενής is fairly easy to guess from knowledge of more common words. But I still think it's much more useful than the pure word coverage statistics I linked to above.

So let's actually run the numbers on the complete GNT. If you know the top N words, how many verses could you understand 50% of, 75%, 90% or 95% of...

vocab / coverage    any      50%        75%      90%      95%     100%

100                 99.9%    91.3%    24.4%     2.1%     0.6%     0.4%
200                 99.9%    96.9%    51.8%     9.8%     3.4%     2.5%
500                 99.9%    99.1%    82.3%    36.5%    18.0%    13.9%
1,000              100.0%    99.7%    93.6%    62.3%    37.3%    30.1%
1,500              100.0%    99.8%    97.2%    76.3%    53.5%    44.8%
2,000              100.0%    99.9%    98.4%    85.1%    65.5%    56.5%
3,000              100.0%   100.0%    99.4%    93.6%    81.0%    74.1%
4,000              100.0%   100.0%    99.7%    97.4%    90.0%    85.5%
5,000              100.0%   100.0%   100.0%    99.4%    96.5%    94.5%
all                100.0%   100.0%   100.0%   100.0%   100.0%   100.0%

What this means is that, purely from a vocabulary point of view, if you knew the top 1,000 lexemes, then 37.3% of verses in the GNT would be 95% familiar to you.
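For anyone wanting to try the same calculation on their own data, here is a minimal sketch of how such a table can be computed. It assumes a dict from verse id to the list of lexemes of its tokens, counts coverage per token rather than per distinct lexeme, and uses hypothetical names throughout. The "any" column would need a strict greater-than-zero test, which is left out here:

```python
from collections import Counter


def coverage_table(verses, vocab_sizes, thresholds):
    """For each vocab size, the fraction of verses covered at each threshold."""
    counts = Counter(lex for lexemes in verses.values() for lex in lexemes)
    ranked = [lex for lex, _ in counts.most_common()]  # frequency ordering
    table = {}
    for size in vocab_sizes:
        known = set(ranked[:size])
        # Fraction of each verse's tokens whose lexeme is among the top `size`.
        fractions = [
            sum(lex in known for lex in lexemes) / len(lexemes)
            for lexemes in verses.values()
        ]
        table[size] = {
            t: sum(f >= t for f in fractions) / len(fractions)
            for t in thresholds
        }
    return table


TOY_VERSES = {
    "v1": ["a", "b"],
    "v2": ["a", "c", "a", "d"],
}

print(coverage_table(TOY_VERSES, vocab_sizes=[1, 2], thresholds=[0.5, 1.0]))
```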

Note that this uses:

verses as the reading target
lexemes as the individual items to be learnt
frequency of lexemes as the ordering

It is possible to alter any of these variables and in subsequent posts I will do this.

James

[1] http://lists.ibiblio.org/pipermail/b-greek/2007-November/044685.html
[2] http://jtauber.com/blog/2007/11/04/gnt_verse_coverage_statistics/
[3] (via Internet Archive's Wayback Machine) http://web.archive.org/web/19961104033056/www.entmp.org/HGrk/grammar/lexicon/NTcount.shtml

I've checked in my Python code as: http://code.google.com/p/graded-reader/source/browse/trunk/code/vocab-coverage.py

If you're not comfortable running it yourself, I can run it on any data you provide.

(if you send data, I suggest you do it off-list and be careful because a "reply" will go to the entire mailing list)

Remember that, as I said in my post, there's no consideration of necessary knowledge of morphology, syntax, idioms, etc. Over time, we can incorporate that, but for now the results are limited to the somewhat naïve assumptions that:

comprehension is only at the level of the target (the verse in my example data)
learning the items (lexemes in the example table I gave) is all that matters to comprehending the target
all items are equally easy to learn
there is no dependency between items

and, of course, the table assumes a frequency ordering of items. Soon I'll be starting a separate thread on alternative orderings.

But all that said, the numbers produced are far more useful than misleading notions like "the top 10 words account for 37% of the text".

Incidentally, here is the table when applied to forms in the Greek NT rather than lexemes:

                0%       50%       75%       90%       95%      100%

   100       99.8%     57.7%      1.1%      0.0%      0.0%      0.0%
   200       99.8%     79.2%      6.4%      0.3%      0.0%      0.0%
   500       99.9%     93.0%     27.0%      2.2%      0.5%      0.4%
 1,000       99.9%     96.9%     51.4%      7.9%      2.3%      1.7%
 2,000       99.9%     98.7%     72.5%     21.8%      7.9%      5.7%
 5,000       99.9%     99.7%     91.0%     52.3%     28.6%     21.5%
 8,000       99.9%     99.9%     96.7%     71.6%     47.6%     37.8%
12,000      100.0%     99.9%     99.2%     86.3%     64.8%     54.2%
16,000      100.0%    100.0%     99.9%     97.9%     88.9%     82.4%
20,000      100.0%    100.0%    100.0%    100.0%    100.0%    100.0%

That it takes 1,000 forms just to get 2.3% of verses to 95% coverage indicates that frequency alone is not the way to go. Soon, I'll also produce similar tables using clauses (in the OpenText.org sense), rather than verses, as the target.

James

Welcome (and some files)

2008-03-23T10:36:00

A post to the graded-reader mailing list from March 23, 2008.

Welcome to the graded-reader mailing list.

I've been getting a lot of queries in response to my presentation so I thought I'd start a mailing list so we can all discuss questions and issues together.

I also plan to make available the code that I'm using to produce the graded reader. Because it's closely tied to the particular text and linguistic data I'm currently dealing with, it will take some time to make generic but I plan to release stuff incrementally based on your feedback.

I want to spend some time going through my current approach and explaining the different components and the ideas behind them. For the most part, these ideas can be used independently of one another so if you don't like one aspect of what I've done, you can still make use of other aspects. Also I'm still improving things in lots of different ways and, of course, I look forward to a lot of new ideas coming from this list.

Because the video presentation actually doesn't show much in terms of results, I've uploaded two files that will give you a flavour of the current state of my work.

You can get to these files at http://groups.google.com/group/graded-reader

example-reader.html shows the first 50 word forms output by the current version of my software when run on the Greek text of John's gospel.

greek_2.pdf shows lesson 2 of an informal course I'm running for a couple of friends which uses the graded reader approach.

You'll notice (1) there is a lot of extra information in the lesson given to students; (2) the order in which words are presented is different.

There are three reasons for the difference in order:

the ordering in lesson 2 was hand tweaked from what the software originally produced
the lesson 2 ordering was produced by an earlier version of the ordering algorithm than what was used for example-reader.html
example-reader.html used slightly more linguistic information (in particular, it knew about some verb endings) in the generation of ordering

Note that the goal is to eventually not do any tweaking, but rather to capture in both the software and input data the criteria that motivated the manual reordering in the first place.

I'll send separate posts discussing different aspects of what goes in to producing the automated output.

James

Throttle and Delay

2008-03-23T10:38:00

A post to the graded-reader mailing list from March 23, 2008.

When you look at example-reader.html[1] you see that as well as the normal verse pairs, there are pairs marked REVIEW.

This is another idea I'm experimenting with that is independent of other ordering and display choices.

Basically, when a particular clause such as καὶ εἶπεν is introduced, I never repeat more than 3 instances of it. Instead I store up any additional instances to show later as reminders.

This "throttle-and-delay" technique is a separate part of the overall pipeline that produces the text.

The ordering algorithm, before the throttle-and-delay step, produces something like this:

NT.John.18_c108
NT.John.20_c122
NT.John.11_c131
NT.John.9_c174
NT.John.3_c117
NT.John.12_c121
NT.John.12_c178
NT.John.6_c97
NT.John.7_c53
NT.John.13_c95
NT.John.11_c161
NT.John.21_c114
NT.John.3_c50
NT.John.9_c25
NT.John.3_c12
NT.John.4_c71
NT.John.13_c27
NT.John.1_c206
NT.John.3_c46
NT.John.3_c4

and then the penultimate step is taking this and turning it into the following. I'll explain later what the various parts of the "learn" lines are (I'm adding to them all the time), but for now the thing to note is that know_S means "show this new clause they now know", know_A means "they know this clause at this point but don't show it yet" and know_R means "show the previously introduced clause that was delayed due to throttling".

learn καί|καί|C-|---|-----|-
learn εἶπε(ν)|λέγω|V-|AAI|3-S--|-ε(ν):sa3S
know_S NT.John.3_c117
know_S NT.John.6_c97
know_S NT.John.7_c53
know_A NT.John.9_c174
know_A NT.John.11_c131
know_A NT.John.11_c161
know_A NT.John.12_c121
know_A NT.John.12_c178
know_A NT.John.13_c95
know_A NT.John.18_c108
know_A NT.John.20_c122
know_A NT.John.21_c114
learn αὐτῷ|αὐτός|RP|---|-DSM-|-
know_S NT.John.1_c198
know_S NT.John.1_c206
know_S NT.John.3_c4
know_A NT.John.3_c12
know_A NT.John.3_c46
know_A NT.John.3_c50
know_A NT.John.4_c71
know_A NT.John.5_c54
know_A NT.John.9_c25
know_A NT.John.13_c27
know_A NT.John.14_c101
know_A NT.John.18_c142
know_A NT.John.20_c132
learn αὐτοῖς|αὐτός|RP|---|-DPM-|-
know_S NT.John.2_c64
know_S NT.John.6_c113
know_S NT.John.6_c174
know_A NT.John.7_c78
know_A NT.John.8_c24
know_A NT.John.8_c54
know_A NT.John.9_c147
know_A NT.John.13_c54
know_A NT.John.16_c78
know_R NT.John.4_c71
know_R NT.John.3_c46
know_R NT.John.5_c54

This is actually the input to the final stage that produces the HTML.
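The throttle-and-delay idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the reviews_per_step parameter and the first-in-first-out choice of which delayed clauses to review are all hypothetical, and the sample output above suggests the real selection policy is different.

```python
def throttle_and_delay(steps, throttle=3, reviews_per_step=3):
    """Turn ordered learning steps into learn / know_S / know_A / know_R lines.

    steps: list of (item, clause_ids) pairs in learning order, where
    clause_ids are the clauses that become readable once item is learnt.
    Assumption: delayed clauses are reviewed oldest-first, a few per step.
    """
    delayed = []  # clauses known but held back by the throttle
    lines = []
    for item, clauses in steps:
        lines.append(f"learn {item}")
        # show at most `throttle` of the newly readable clauses
        for clause in clauses[:throttle]:
            lines.append(f"know_S {clause}")
        # the rest are known but not shown yet
        for clause in clauses[throttle:]:
            lines.append(f"know_A {clause}")
        # remind the reader of some clauses delayed at earlier steps
        for clause in delayed[:reviews_per_step]:
            lines.append(f"know_R {clause}")
        delayed = delayed[reviews_per_step:] + clauses[throttle:]
    return lines
```

Keeping this as its own stage means the ordering algorithm never needs to know about throttling, which matches the point that it is independent of the other ordering and display choices.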

James

[1] linked from http://groups.google.com/group/graded-reader/files

Embedding the Target Language in English

2008-03-23T10:37:00

A post to the graded-reader mailing list from March 23, 2008.

[this will be a bit of an experiment as to whether I can cut and paste formatted Greek and have it pass through Google Groups. I apologize in advance if it doesn't work]

One aspect of the reader that seems to have received a lot of interest is the embedding of the target language (in my case Greek) in English.

It is important to note that this is entirely independent of the 95% of the code and data which has to do with choosing the order in which to learn things.

I wanted to explain a little about how it's produced and what the variables are that could be tweaked or changed altogether.

First of all, consider the very first block of text introduced:

John 3.26:
So they came to John and said to him, “Rabbi, the one who was with you on the other side of the Jordan River, about whom you testified – see, he is baptizing, and everyone is flocking to him!”
John 3.27:
John replied καὶ εἶπεν, “No one can receive anything unless it has been given to him from heaven.

For those of you who don't know Greek, καὶ εἶπεν means "and (he) said".

This was generated because the ordering component of the software said that the first thing to be introduced is clause NT.John.3_c117. That's a clause reference from OpenText.org's clause analysis of the New Testament. Part of my database is a listing of all the clauses, as identified by OpenText.org along with this unique identifier and what chapter/verse the clause comes from:

NT.John.3_c117|3.27|καὶ εἶπεν,

So my code knows that the clause to show is from John 3.27. I decided to always include the previous verse for context as well. So I retrieve John 3.26 and John 3.27 from a database containing the NET translation but annotated with the OpenText.org clause boundaries:

3.26 [c108 So they came to John ] [c109 and said to him, ] “Rabbi, the one who was with you on the other side of the Jordan River, [c112 about whom you testified – ] [c113 see, ] [c114 he is baptizing, ] [c115 and everyone is flocking to him!” ]
3.27 [c116 John replied ] [c117 and said, ] “No one can receive anything unless it has been given to him from heaven.

Notice that I haven't annotated everything yet. It's a slow and laborious process so I tend to just mark clauses as they are needed.

In some cases, I slightly alter the NET translation so there is something to annotate. This becomes challenging when NET has altered clause order and even more so when the Greek breaks apart words from the one clause that have to be together in the English. I still want to do more work in this area as the key thing to note is I never use the actual translation of the clause when introducing the clause; rather I use everything except the translation of the clause and that might make the problem easier if thought about in those terms (rather than what my annotation above focuses on which is annotating what English text corresponds to what Greek clause).

But this annotated NET is then used to produce what you see in the example-reader.html extract shown at the start. If other clauses were known at this point, they would be replaced by the Greek as well. Any clauses already known are shown at normal weight and the new clause being introduced is shown in bold. Hence later on in example-reader.html (at step 13):

John 4.49:
The official said to him, “Sir, come down before my child dies.”
John 4.50:
λέγει αὐτῷ ὁ Ἰησοῦς, “Go home; your son will live.” The man believed the word ὃν εἶπεν αὐτῷ ὁ Ἰησοῦς and set off for home.

So, to summarize: the input to this part of the process is:

what clause to introduce (by reference number)
what verse this clause is in
what other clauses are already known (by reference numbers)
what the English text of the verse (from 2) and the one before are, annotated by clause references that can be replaced by Greek if known

The variables to this particular step are:

the unit of text being introduced (in this example, a clause)
the unit of text to show (in this example, the verse containing the clause and the verse before it)

There is no reason why the unit of text being introduced in Greek could not be smaller (a phrase or even a word) and the unit of text being shown in English larger (a paragraph, for example).
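The substitution step described above might look something like this minimal sketch. It assumes the annotation syntax shown in the John 3.26 and 3.27 lines (a clause number, the English text, then a closing bracket), no nested annotations, and bold rendered with an HTML b element; the function and parameter names are hypothetical.

```python
import re

# matches e.g. "[c117 and said, ]" and captures the number and the English
CLAUSE = re.compile(r"\[c(\d+) ([^\]]*?) ?\]")


def embed(annotated, prefix, greek, known, new):
    """Replace known clauses with Greek and strip the remaining markers.

    annotated: English verse text with [cNNN ... ] clause annotations.
    prefix: the part of the clause reference before "_c", e.g. "NT.John.3".
    greek: dict mapping full clause references to Greek text.
    known: set of clause references already learnt.
    new: the clause reference being introduced (shown in bold).
    """
    def substitute(match):
        ref = f"{prefix}_c{match.group(1)}"
        if ref == new:
            return f"<b>{greek[ref]}</b>"
        if ref in known:
            return greek[ref]
        return match.group(2)  # not yet known: keep the English

    return CLAUSE.sub(substitute, annotated)
```

Changing the unit being introduced or the unit being shown would only change what is annotated and how much annotated text is passed in.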

Note that the clauses I am currently dealing with include embedded clauses such as relative clauses and so in the John 4.50 example, we have the relative clause ὃν εἶπεν αὐτῷ ὁ Ἰησοῦς ("that Jesus said to him") even though it might have been better to wait until the containing noun phrase were readable (which would, of course, have required knowledge of phrase boundaries).

James

Graded Reader Discussion and Code

2008-03-22

Owing to the amount of interest I received about A New Kind of Graded Reader...

I have started a mailing list at

http://groups.google.com/group/graded-reader

and also I plan to make my code available at

http://code.google.com/p/graded-reader/

If you're interested in the idea applied to any language (not just NT Greek) please join us.

UPDATE: The code has moved to GitHub: https://github.com/jtauber/graded-reader

originally published on jtauber.com

A New Kind of Graded Reader

2008-02-10

Back in 2004, I talked about algorithms for optimal vocabulary ordering.

Then in 2006, I talked about using this and other techniques in teaching New Testament Greek (which I've resumed doing with this method, btw).

Earlier this year at BibleTech:2008 I briefly touched on my graded reader approach. It generated a lot of interest so I decided to record a separate presentation at home this weekend, explaining some of the ideas behind the graded reader.

After multiple failed attempts to upload it to Google Video, it's now on YouTube and embedded below. Sound was recorded and mixed in Logic Pro and then synchronized with a presentation in Keynote and output as Quicktime.

Running time is just shy of 9 minutes.

UPDATE 2008-03-22: Now see Graded Reader Discussion and Code

originally published on jtauber.com

BibleTech 2008

2008-01-14

I don't think I've mentioned it here before but next week, I'm one of the keynote speakers at the BibleTech 2008 conference in Seattle.

While I've given talks a number of times about my Greek linguistics research, this will be the first time that I'll get to talk about how I've used technology in that research.

I plan to give a history of the [MorphGNT] project and the various sub-projects I've worked on over the last fifteen years, covering the evolution of data models, text encoding, tool sets and more. I then want to talk about the opportunities that lie ahead and where I hope the work will go in the future, particularly given my collaboration with Ulrik Sandborg-Petersen.

originally published on jtauber.com

GNT Verse Coverage Statistics

2007-11-04

I even produced such a table for the GNT back in 1996—see New Testament Vocabulary Count Statistics (via Internet Archive's Wayback Machine).

But these sorts of numbers are highly misleading because they don't tell you what proportion of sentences (or as a rough proxy in the GNT case: verses) you could read, only what proportion of words.

To first of all give you a flavour in the specific before moving to the final numbers, consider John 3.16, which is, from a vocabulary point of view, a very easy verse to read.

So let's actually run the numbers on the complete GNT. If you know the top N words, how many verses could you understand 50% of, 75%, 90% or 95% of...

               coverage:    any     50%     75%     90%     95%    100%
        vocab
          100             99.9%   91.3%   24.4%    2.1%    0.6%    0.4%
          200             99.9%   96.9%   51.8%    9.8%    3.4%    2.5%
          500             99.9%   99.1%   82.3%   36.5%   18.0%   13.9%
        1,000            100.0%   99.7%   93.6%   62.3%   37.3%   30.1%
        1,500            100.0%   99.8%   97.2%   76.3%   53.5%   44.8%
        2,000            100.0%   99.9%   98.4%   85.1%   65.5%   56.5%
        3,000            100.0%  100.0%   99.4%   93.6%   81.0%   74.1%
        4,000            100.0%  100.0%   99.7%   97.4%   90.0%   85.5%
        5,000            100.0%  100.0%  100.0%   99.4%   96.5%   94.5%
        ALL              100.0%  100.0%  100.0%  100.0%  100.0%  100.0%

What this means is that, purely from a vocabulary point of view, if you knew the top 1,000 lexemes, then 37.3% of verses in the GNT would be at least 95% familiar to you.

I should emphasise that learning vocabulary in frequency order isn't necessarily the fastest way to get this proportion of readable verses up. I blogged about this fact three years ago, see Programmed Vocabulary Learning as a Travelling Salesman Problem, for example.

originally published on jtauber.com

Announcing MorphGNT.org

2006-03-12

I've hinted before about Ulrik Petersen and I collaborating on Greek New Testament linguistic endeavours.

I'm now delighted to announce the website that will be the home of our collaborative work:

http://morphgnt.org

I've transferred my [MorphGNT] files over there and Ulrik has done the same with his Tischendorf 8th and Strong's Dictionary.

We've been working on a bunch of other stuff for the last few months which will eventually find its way on to that site too.

originally published on jtauber.com

Bug Fix to Python Unicode Collation Algorithm

2006-02-13

See Python Unicode Collation Algorithm for background.

This version fixes a major bug that prevented the collation algorithm from working properly with any expansions:

http://jtauber.com/2006/02/13/pyuca.py

UPDATE (2012-06-21): Now see https://github.com/jtauber/pyuca

originally published on jtauber.com

Dynamic Interlinears with Javascript and CSS

2006-01-28

After the continuation of a permathread on the b-greek mailing list about the pros and cons of interlinears, I built some quick demonstrations of how CSS and Javascript could be used for dynamic interlinear glosses that would not be possible on the printed page.

Plain — show static glosses
Hover — show glosses when a word is hovered over
Toggle — toggle showing a gloss when a word is clicked
Frequency — filter appearance of gloss by frequency

They might be interesting as little Javascript tutorials too.

originally published on jtauber.com

Python Unicode Collation Algorithm

2006-01-27

My preliminary attempt at a Python implementation of the Unicode Collation Algorithm (UCA) is done and available at:

http://jtauber.com/2006/01/27/pyuca.py (old version—see UPDATE below)

This only implements the simple parts of the algorithm but I have successfully tested it using the Default Unicode Collation Element Table (DUCET) to collate Ancient Greek correctly.

The core of the algorithm, which is what I have implemented, basically just involves multi-level comparison. For example, café comes before caff because at the primary level, the accent is ignored and the first word is treated as if it were cafe. The secondary level (which considers accents) only applies then to words that are equivalent at the primary level.
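To make the multi-level idea concrete, here is a toy sketch. It is not how pyuca or the UCA actually derive weights (those come from the collation element table); it simply assumes that stripping combining marks gives a usable primary key and that the decomposed string can stand in as a secondary tie-breaker. The function name is hypothetical.

```python
import unicodedata


def toy_sort_key(word):
    """Two-level key: accent-free form first, accented form as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word)
    # primary level: ignore accents (combining marks) entirely
    primary = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # secondary level: only consulted when the primary keys are equal
    return (primary, decomposed)


words = ["caff", "café", "cafe"]
by_code_point = sorted(words)
by_levels = sorted(words, key=toy_sort_key)
```

A plain code point sort puts café after caff, whereas the two-level key keeps it next to cafe.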

The UCA (and my code) also support contraction and expansion. Contraction is where multiple letters are treated as a single unit. In Spanish, ch is treated as a letter coming between c and d so that, for example, words beginning with ch sort after all other words beginning with c. Expansion is where a single letter is treated as though it were multiple letters. In German, ä is sorted as if it were ae, i.e. after ad but before af.

Here is how to use the pyuca module.

Usage example:

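from pyuca import Collator

# allkeys.txt is the DUCET file (see below for where to get it)
c = Collator("allkeys.txt")

# words can be any list of Unicode strings to collate
words = ["λόγος", "ἀνήρ", "γυνή"]
sorted_words = sorted(words, key=c.sort_key)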

allkeys.txt (1 MB) is available at

http://www.unicode.org/Public/UCA/latest/allkeys.txt

but you can always subset this for just the characters you are dealing with (and you will need to do this if any language-specific tailoring is needed)

UPDATE (2006-02-13): Now see bug fix

UPDATE (2012-06-21): Now see https://github.com/jtauber/pyuca

originally published on jtauber.com

File System Archaeology for MorphGNT

2006-01-01

Some of you will be aware of Ulrik Petersen's work on augmenting Tischendorf's 8th edition with morphological tags and lemmata, based on work by Clint Yale and Maurice Robinson. Ulrik is also the developer of Emdros, an open-source text database engine for annotated text.

The overlap of Ulrik's interests and work with my own on [MorphGNT] is very exciting and so we've started talking about how we might be able to collaborate on some things together.

To help facilitate this, I've spent much of this long weekend so far going through the last 12 years of work on MorphGNT and putting things into Subversion. Because my work on MorphGNT has always been in fits and spurts and has spanned approximately five different desktop machines over the 12 years, it's required a fair bit of "file system archaeology".

The archaeology analogy seems apt because I'm essentially piecing together a history based on what "layer" I'm finding the files in. For example, a file on a backup of my website in 2002 probably dates later than those found in the tar balls from when I moved from one machine to another in 1997.

There's also an analogy with textual criticism as in some cases I have to look at two files and judge whether a change from A to B or B to A is more likely.

It's been a lot of fun, especially uncovering little scripts I wrote back in the nineties to do various analyses.

originally published on jtauber.com

MorphGNT 5.08 Released

2005-11-07

I'm pleased to announce the release of a new version of [MorphGNT], the morphologically parsed Greek New Testament database made available under a Creative Commons license.

I haven't put together the change log yet but will shortly.

UPDATE (2005-11-08): Change log is now available on [MorphGNT] page.

originally published on jtauber.com

MorphGNT 5.07 Released

2005-08-31

I'm pleased to announce the release of a new version of MorphGNT, the morphologically parsed Greek New Testament database made available under a Creative Commons license.

See the [MorphGNT] page for a list of changes (47 changes in 940 places).

originally published on jtauber.com

Upcoming new MorphGNT

2005-08-30

I'm just about to release [MorphGNT] 5.07 and, shortly after that, a major new release I'll designate 6.07.

I've decided not to reset the minor release number on a new major release to emphasise the fact that 5.07 and 6.07 are identical in the data they have in common; the 6-series just adds some extra data.

I haven't yet decided just how much extra data will make it in the 6-series releases, but one new addition will be a column containing the surface form / inflected form / reflex (take your pick of terminology) of each word taken in isolation.

What do I mean by "taken in isolation"? Well, a word like μετά could appear in the text as μετά, μεθ', μετ' or μετὰ depending on the text after it. This new column normalises that to μετά. This happens to also be the lemma so it might not be clear what the extra value is in this case. So consider the text in Matthew 1.20 which reads:

παραλαβεῖν Μαρίαν τὴν γυναῖκά σου

Note that τὴν has a grave accent and γυναῖκά has two accents. If you were to ask someone what the accusative singular feminine article is, they'd say τήν not τὴν. Similarly, if you asked someone what the accusative of γυνή is, they'd say γυναῖκα not γυναῖκά. The reason for the differing accentuation in the text is the context: final syllable acute becomes grave unless clause-final and enclitics like σου throw their accent back to the end of the previous word.

Sometimes you want to treat the variations these cause as distinct, sometimes you don't. By including the extra column, users of MorphGNT will have the best of both worlds.

Here is a list of possible differences between the existing text column and the new column:

existing text may exhibit elision (e.g. μετ' versus μετά)
existing text may exhibit movable ς or ν
final-acute may become grave
enclitics may lose an accent
word preceding an enclitic may gain an extra accent
the οὐ / οὐκ / οὐχ alternation

The new column normalises all these differences.
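Two of these differences (grave for acute, and the extra accent before an enclitic) are purely mechanical and could be handled along the lines of this sketch. It is an illustration only, not the code used for MorphGNT: the function name is hypothetical, and elision, movable ς or ν, enclitics that have lost their accent, and the οὐ / οὐκ / οὐχ alternation are deliberately left out since they need more than character-level rules.

```python
import unicodedata

GRAVE = "\u0300"       # combining grave accent
ACUTE = "\u0301"       # combining acute accent
CIRCUMFLEX = "\u0342"  # combining Greek perispomeni
ACCENTS = {ACUTE, CIRCUMFLEX}


def normalise_accents(form):
    """Undo context-dependent accentuation on a single word form."""
    chars = list(unicodedata.normalize("NFD", form))
    # final-acute that became grave in context: restore the acute
    chars = [ACUTE if ch == GRAVE else ch for ch in chars]
    # word preceding an enclitic: drop the extra (last) accent
    positions = [i for i, ch in enumerate(chars) if ch in ACCENTS]
    if len(positions) > 1:
        del chars[positions[-1]]
    return unicodedata.normalize("NFC", "".join(chars))
```

Applied to the Matthew 1.20 example, this would turn τὴν into τήν and γυναῖκά into γυναῖκα while leaving παραλαβεῖν unchanged.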

originally published on jtauber.com

Using Simulated Annealing to Order Goal Prerequisites

2005-08-03T13:50:57

Back in November, I wrote about programmed vocabulary learning as a travelling salesman problem.

I'm pleased to say I've finally cleaned up my Python code and made an initial version available at:

http://jtauber.com/2005/08/sa_prereq_ordering.py

UPDATE (2005-08-04): You probably don't want to use the above script. See Ordering Goals Rather Than Prerequisites for why, along with a much improved script.

originally published on jtauber.com

Ordering Goals Rather Than Prerequisites

2005-08-03T13:53:20

The outcome of my simulated annealing program is a list of prerequisites to learn along with an indication, every so often, of what new goal has been reached.

Running on the Greek lexemes of 1John, you might get something starting like this:

learn μαρτυρέω
learn θεός
learn ἐν
learn εἰμί
learn ὁ
learn τρεῖς
learn ὅτι
know 230507

This gives seven prerequisites to learn and then a goal that has been reached (230507 = 1John 5.7). The problem is that two of those words are unnecessary. You only need to learn μαρτυρέω, εἰμί, ὁ, τρεῖς and ὅτι to be able to read 1John 5.7.

This happens because the program is ordering prerequisites first and only then establishing at each point what goals (if any) have been achieved.

I can see two solutions:

write a post-processor that walks through and, at each goal, takes any "unused" prerequisites and postpones them to after that goal.
change the program to order goals rather than prerequisites and work out the latter from the former

The second is probably considerably more work but ultimately preferable.
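The second option can be sketched as follows: given goals already in some order, the prerequisite list falls out by emitting only what each goal still needs. This is a minimal illustration rather than the sa_goal_ordering.py script (it does no annealing and the function name is hypothetical); the prerequisites for 230507 are the five lexemes listed above, and the second goal is made up for the example.

```python
def learning_sequence(ordered_goals, prerequisites):
    """Derive learn/know lines from an ordering of goals.

    ordered_goals: goal identifiers in the order they should be reached.
    prerequisites: dict mapping each goal to a list of required items.
    """
    known = set()
    lines = []
    for goal in ordered_goals:
        for item in prerequisites[goal]:
            if item not in known:  # only learn what is still missing
                known.add(item)
                lines.append(f"learn {item}")
        lines.append(f"know {goal}")
    return lines


prerequisites = {
    "230507": ["μαρτυρέω", "εἰμί", "ὁ", "τρεῖς", "ὅτι"],
    "hypothetical-goal": ["θεός", "εἰμί"],  # invented for illustration
}
sequence = learning_sequence(["230507", "hypothetical-goal"], prerequisites)
```

Because items are only emitted when a goal needs them, no unnecessary prerequisite can appear before a goal.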

UPDATE: I'm almost embarrassed to report that not only was changing over to ordering goals not as hard to do as I thought, but the particular way I did it performs 200 times faster than my previous prerequisite ordering script. New script is at http://jtauber.com/2005/08/sa_goal_ordering.py

originally published on jtauber.com

Parts of Speech and Number of Accents

2005-07-16

I thought I'd write a quick Python script to check how many accents were on each of the lemmata in [MorphGNT] 5.06.

Here are the counts by part of speech and number of accents on lemma:

|     |  0      |  1      |  2  |
+-----+---------+---------+-----+
| A   |  -      |  9159   |  -  |
| C   |  924    |  17361  |  -  |
| D   |  1592   |  4606   |  -  |
| I   |  -      |  17     |  -  |
| N   |  30     |  28271  |  1  |
| P   |  5433   |  5488   |  -  |
| RA  |  19862  |  4      |  -  |
| RD  |  -      |  1744   |  -  |
| RI  |  -      |  1165   |  -  |
| RP  |  -      |  11584  |  -  |
| RR  |  -      |  1677   |  -  |
| V   |  8      |  28101  |  1  |
| X   |  147    |  844    |  -  |

Some of the low numbers are definitely errors in the database. Now to investigate...
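A count like this can be reproduced with something along these lines. It is a sketch, not the original script: it assumes the input is a sequence of (part of speech, lemma) pairs and counts acute, grave and circumflex as accents after Unicode decomposition. The function names are hypothetical.

```python
import unicodedata
from collections import Counter

# combining acute, grave and Greek perispomeni (circumflex)
ACCENTS = {"\u0301", "\u0300", "\u0342"}


def count_accents(lemma):
    """Number of accent marks on a lemma (breathings are not counted)."""
    decomposed = unicodedata.normalize("NFD", lemma)
    return sum(1 for ch in decomposed if ch in ACCENTS)


def tally(rows):
    """Count rows by (part of speech, number of accents)."""
    counts = Counter()
    for pos, lemma in rows:
        counts[(pos, count_accents(lemma))] += 1
    return counts
```

Any cell with an unexpectedly small count is then a list of candidates to check by hand.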

UPDATE (2005-07-16): both 2-accent cases were mistakes. The 30 0-accent nouns and 5 of the 0-accent verbs were foreign loan words that intentionally weren't accented but 3 of the 0-accent verbs were mistakes. The 4 accented articles were the result of crasis with the following noun and the word should probably be analyzed as a noun rather than an article. I guess there'll be a 5.07 release soon. NOTE: I haven't looked at the particles, adverbs, conjunctions or prepositions yet.

originally published on jtauber.com

MorphGNT 5.06 Released

2005-07-16

Well, it's been about a hundred hours work over the last six months, but I'm pleased to announce the release of a new version of [MorphGNT], the morphologically parsed Greek New Testament database made available under a Creative Commons license.

Besides some corrections to the text (mostly rho-breathing) and a couple of parsing code changes, this release has a huge number of corrections to the lemmata—160 lemma changes in 465 places. See this blog entry for how potential errors for this round of corrections were discovered.

You can download the new file at:

http://jtauber.com/2005/morphgnt/ccat-tauber-morphgnt-v5_06.zip

originally published on jtauber.com

MorphGNT Roadmap

2005-07-04

This month I should be doing another release of my morphologically-parsed Greek New Testament. This will be release 5.06. I thought I'd outline my future plans (as they currently stand).

At some point, I'll start doing 6.xx releases. This will involve a format change that includes some more information. I'll probably continue the 5-series releases for people used to the format. The 5-series data is just a subset of the 6-series data so it's always possible (and easy) for me to generate a 5 from a 6.

From Series-7, MorphGNT's format will likely change dramatically to adopt a graph structure rather than a simple tabular structure. This will enable much greater extensibility and annotation.

Series-7 will be the last that is based on the CCAT database. From Series-8 onwards, the data will hopefully be completely the results of my own parsing work.

First things first, though—getting 5.06 out. I'm down to 299 mismatches to resolve.

originally published on jtauber.com

MorphGNT Update

2005-06-10

A couple of months ago, I talked about the current process I'm going through to identify errors in my morphologically parsed Greek New Testament, [MorphGNT]. By the end of April, I was down to 400 mismatches I needed to check. At the time, I thought I'd be able to finish going through them by the time I left to go to Europe on holiday.

Unfortunately, I haven't actually worked on it at all the last month. I'm leaving tomorrow but still have 350 mismatches to check (an estimated 14 hours work).

Hopefully I'll get it done some time during July and then I'll be able to release another version of MorphGNT.

originally published on jtauber.com

DATR in Python

2005-04-19

I previously talked about wanting to implement the lexicon language DATR in Python. Well, I just received an email from Henrik Weber saying that (apparently inspired by my post) he has gone and done an implementation at http://pydatr.sourceforge.net/

Well done Henrik! I'm looking forward to trying it out and maybe contributing.

originally published on jtauber.com

Current MorphGNT Work

2005-04-19

For the last few months, I've been making corrections to [MorphGNT] by attempting to merge an English translation (NASB) marked with Strong's numbers with my database. Although it's a tedious process, it's revealing numerous errors.

When James Strong compiled his concordance, he assigned a number to every lemma in the underlying Greek text of the King James Version. Other translations are often made available annotated with these Strong's numbers. Zack Hubert provided me with an electronic text of the NASB translation with Strong's numbers which I converted to something looking like this:

010101 record 976
010101 genealogy 1078
010101 Jesus 2424
010101 Messiah 5547
010101 son 5207
010101 son 5207
010101 Abraham 11

The first column is the book, chapter and verse, the second column is the English word as it appears in the NASB translation and the third column is the Strong's number. Note that not all words are included.

I then found an electronic text of Strong's lexicon and stripped out the formatting and the definitions to just get a list of Strong's numbers with a transliteration of the Greek lemma:

1 a
2 Aaron
3 Abaddon
4 abares
5 Abba
6 Abel
7 Abia
8 Abiathar
9 Abilene
10 Abioud

Finally I took my [MorphGNT] database and extracted the lemmata:

010101 βίβλος
010101 γένεσις
010101 Ἰησοῦς
010101 Χριστός
010101 υἱός
010101 Δαυίδ
010101 υἱός
010101 Ἀβραάμ

I then wrote a Python program that attempts to merge the first and third files on the basis of the second. Note that the transliterations in Strong's lexicon don't have accents and there is ambiguity too (both epsilon and eta go to 'e'). That's a fairly straightforward part of the join, however, because it can be automated by the script.
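The automated part of the join might look roughly like this sketch. The transliteration table is an assumption (it ignores breathings and is not Strong's actual scheme), the function names are hypothetical, and the Strong's entries used in any test data would need to come from the real lexicon file.

```python
import unicodedata

# Rough Greek-to-Latin table; note epsilon and eta both go to "e".
LETTERS = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z", "η": "e",
    "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m", "ν": "n", "ξ": "x",
    "ο": "o", "π": "p", "ρ": "r", "σ": "s", "ς": "s", "τ": "t", "υ": "u",
    "φ": "ph", "χ": "ch", "ψ": "ps", "ω": "o",
}


def transliterate(lemma):
    """Unaccented, lower-case Latin rendering of a Greek lemma."""
    decomposed = unicodedata.normalize("NFD", lemma.lower())
    return "".join(
        LETTERS.get(ch, ch) for ch in decomposed
        if not unicodedata.combining(ch)
    )


def find_mismatches(nasb_rows, strongs, lemmata_by_verse):
    """Return NASB rows whose Strong's lemma is not found in the verse.

    nasb_rows: (verse, english_word, strongs_number) tuples.
    strongs: dict mapping Strong's number to transliterated lemma.
    lemmata_by_verse: dict mapping verse to a list of Greek lemmata.
    """
    mismatches = []
    for verse, english, number in nasb_rows:
        expected = strongs.get(number, "").lower()
        candidates = {
            transliterate(lemma) for lemma in lemmata_by_verse.get(verse, [])
        }
        if expected not in candidates:
            mismatches.append((verse, english, number))
    return mismatches
```

Everything that ends up in the mismatch list still has to be resolved by hand.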

The real challenge comes because:

NASB versification isn't the same as the MorphGNT Greek text
the text underlying the NASB is not the same critical text as that of MorphGNT
there are errors in each of the files
there are spelling differences
there are differences in the granularity of the lemmata

So my program simply indicates whenever it had trouble performing a match and I have to either:

correct my MorphGNT lemma
correct (or merely change to my lemma conventions) the Strong's lexicon file
correct the NASB-Strong file
change the verse numbering in the NASB-Strong file
comment out a particular word that appears in the text underlying the NASB but not the MorphGNT text

There were initially thousands of exceptions that each required one of these actions. After a number of months, I now have one thousand left. It takes me about 4 hours to make 100 corrections so I still have a little way to go.

When I'm done, I'll release a new version of [MorphGNT] with the lemma errors that this task revealed corrected.

originally published on jtauber.com

BetaCode to Unicode in Python

2005-01-27

BetaCode is a common ASCII transcription for Polytonic Greek. I've been dealing with it for around twelve years. (As an aside, back in 1994, I designed a METAFONT for Polytonic Greek that enabled one to use BetaCode in TeX—I typeset my self-published Index to the Greek New Testament with it).

For the last six years, my preference has been to use Unicode, so I wrote a program (initially in Java but then in Python) that used a Trie to represent the multiple BetaCode characters that can map to a single pre-composed Unicode character.
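To show why a Trie suits this job, here is a minimal sketch of longest-match conversion. It is not the code from beta2unicode.py: the `Trie` class, the `beta_to_unicode` function and the tiny `PAIRS` table are all made up for illustration, and a real table would have hundreds of entries.

```python
import unicodedata

# Illustrative subset of BetaCode-to-Unicode pairs (not a complete table).
PAIRS = {
    "A": "α", "A)": "ἀ", "A)/": "ἄ",
    "G": "γ", "L": "λ", "N": "ν",
    "O": "ο", "O/": "ό",
}


class Trie:
    def __init__(self):
        self.children = {}
        self.value = None

    def add(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value = value

    def longest_match(self, text, start):
        """Return (value, end) for the longest key at `start`, or None."""
        node, best = self, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.value is not None:
                best = (node.value, i + 1)
        return best


def beta_to_unicode(text, trie):
    out, i = [], 0
    while i < len(text):
        match = trie.longest_match(text, i)
        if match is None:
            out.append(text[i])  # pass unknown characters through unchanged
            i += 1
        else:
            value, i = match
            out.append(value)
    return unicodedata.normalize("NFC", "".join(out))


TRIE = Trie()
for beta, uni in PAIRS.items():
    TRIE.add(beta, uni)

print(beta_to_unicode("LO/GON", TRIE))
```

The point of the longest match is that `O/` is consumed as a single accented omicron rather than as a plain omicron followed by a stray slash.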

I've had a version available on this site since 2002, but I've now updated it to what I've been using for my most recent work. You can download it at http://jtauber.com/2004/11/beta2unicode.py

At some stage I'll better factor out the conversion pairs so the code is useful for other conversions. The Trie code might be useful for other contexts too.

(Also see Ricoblog's Converting Greek Beta Code into Normalized Unicode.)

originally published on jtauber.com

DATR, MorphGNT, RDF and Python

2005-01-19

I've been revisiting DATR, the lexical knowledge representation language, as a possible format for the next generation of [MorphGNT]. I was previously considering developing my own RDF/graph-based format but I suddenly remembered DATR from my student days and it makes a lot more sense to use it rather than try to build my own.

Looking at DATR material, I haven't seen anything more recent than 1998 so I'm not sure if it's still the state-of-the-art. It's a natural fit for some kind of RDFization, something I'm sure I'll eventually end up doing if someone hasn't already.

Of course, I'll have to write Python code to manipulate DATR. Again, unless some already exists. But I'm almost hoping not as I love implementing specs, especially using test-driven development.

UPDATE 2005-04-19: Now see DATR in Python

originally published on jtauber.com

Thoughts on GNT-NET Parallel Glossing Project

2004-12-14

Zack Hubert mentions that I'm thinking about using the NET Bible for a collaborative parallel glossing project.

Here is how it might work:

The user is presented with the Greek text and the NET text.

Consider Luke 1.1. The Greek reads:

Ἐπειδήπερ πολλοὶ ἐπεχείρησαν ἀνατάξασθαι διήγησιν περὶ τῶν πεπληροφορημένων ἐν ἡμῖν πραγμάτων,

The NET reads

Now many have undertaken to compile an account of the things that have been fulfilled among us,

It should be possible to select any number of words in the Greek and any number of words from the NET and assert that they correspond (or link) to one another. There is no need to link between the entire verse of Greek and the entire verse of the NET because that link has already been made automatically.

Say the user selects Ἐπειδήπερ. They should then be shown the part-of-speech and parse information for the word (in this case C) as well as the lexical form, ἐπειδήπερ. The user should also be shown all previous glosses for ἐπειδήπερ in other contexts.

The user is then instructed to select the word or words that directly translate ἐπειδήπερ. In this case, the user selects Now and submits.

The user need not progress in order. Say the next thing they select is the word πραγμάτων. As before, they are shown the part-of-speech and parse information (N-GPN) and the lexical form, πρᾶγμα. Again the user is shown previous glosses. These glosses should include those specifically for πραγμάτων as well as other forms of πρᾶγμα, perhaps displayed differently.

The user then selects things and submits.

It should be possible to select multiple Greek words and link them to just one word from NET. It should also be possible to select one Greek word and link it to multiple words in the NET. Many-to-many links should also be possible. For example, a user could select περὶ τῶν πεπληροφορημένων ἐν ἡμῖν πραγμάτων and of the things that have been fulfilled among us and submit that linkage.

It is also possible that some words won’t link to anything.

Many-to-many linkages should be encouraged where the particular sense of a word is entirely determined by its use in a sequence (such as an idiom).

Users should be discouraged from doing many-to-many linkages where the sequence isn't a grammatical unit such as a phrase. For example, a user shouldn't submit a link between περὶ τῶν and of the. This clearly can't be enforced.

Users should be required to log in before they can submit linkages. Each linkage will be stored with the email address of the person that made the linkage.
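One way the linkages described above might be stored is sketched here. Nothing in it is a settled design: the `Link` class, its field names, the use of token indices and the example email address are all assumptions made for illustration.

```python
from dataclasses import dataclass

GREEK = ("Ἐπειδήπερ πολλοὶ ἐπεχείρησαν ἀνατάξασθαι διήγησιν περὶ τῶν "
         "πεπληροφορημένων ἐν ἡμῖν πραγμάτων,")
NET = ("Now many have undertaken to compile an account of the things "
       "that have been fulfilled among us,")


def tokens(text):
    """Split on whitespace and drop surrounding punctuation."""
    return [word.strip(",.;·") for word in text.split()]


@dataclass(frozen=True)
class Link:
    verse: str          # e.g. "Luke 1.1"
    greek: tuple        # indices into the Greek tokens of the verse
    english: tuple      # indices into the NET tokens of the verse
    submitted_by: str   # email address of the person who made the link

    def text(self):
        g, e = tokens(GREEK), tokens(NET)
        return (" ".join(g[i] for i in self.greek),
                " ".join(e[i] for i in self.english))


links = [
    Link("Luke 1.1", (0,), (0,), "user@example.com"),      # one-to-one
    Link("Luke 1.1", (10,), (10,), "user@example.com"),    # one-to-one
    Link("Luke 1.1", tuple(range(5, 11)), tuple(range(8, 17)),
         "user@example.com"),                              # many-to-many
]
for link in links:
    print(link.text())
```

Storing indices rather than strings lets the same word occur twice in a verse without ambiguity, and a word that links to nothing simply never appears in any `Link`.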

While users may be encouraged to work on particular verses, they should be free to go to whatever verses interest them. Duplicate effort is not a problem and provides redundancy. The data can be checked later for inconsistencies.

originally published on jtauber.com

MorphGNT v5.05 Available

2004-12-14

Various corrections.

Corrected occurrence of ἐμβάλλω for lemma instead of ἐμβλέπω or ἐμβαίνω (thanks to Ted Blakley via Zack Hubert)
Denormalized variant spellings of Ναζαρά
Corrected parse codes of κἀκεῖνος, θρόνοι
Added comparative parse code for σπουδαιοτέρως
Changed lemmata for ἀκριβέστερον, περισσότερον, τολμηρότερον
Changed lemmata for οὕτως, εἵνεκεν, ἑλπίς
Corrected lemma for ζώνην and ζώνη

originally published on jtauber.com

Best Use of MorphGNT So Far

2004-12-14

Zack Hubert has taken my [MorphGNT] and built a GNT Browser that blew me away!

It displays the text in the browser; hover on a word and the lemma and parsing are shown in a pop-up; click on the word and you get a graph of word occurrence by book with the ability to list all occurrences.

I've toyed with web interfaces to the MorphGNT for years but nothing even remotely as slick as this.

originally published on jtauber.com

MorphGNT v5.04 and Beyond

2004-12-09

I've released a new version of my [MorphGNT].

Details of the changes are on the [MorphGNT] page but they all stem from a simple query performed via a Python script: in cases where there is no parse-code (i.e. the word is essentially uninflected), is the text form the same as the lexical form (other than accentuation)?
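A query of that kind could be sketched as follows. This is not the script itself: the row layout, the `NO_PARSE` placeholder and the sample rows are assumptions made up for illustration, and the comparison also ignores case so that a capitalised sentence-initial form still matches its lexical form.

```python
import unicodedata

# Combining acute, grave and Greek circumflex (perispomeni).
ACCENTS = {"\u0301", "\u0300", "\u0342"}
NO_PARSE = "--------"  # assumed placeholder meaning "no parse code"


def strip_accents(word):
    """Remove accents (but keep breathings) and fold case."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if c not in ACCENTS).lower()


def suspicious(rows):
    """Yield uninflected rows whose text and lemma differ beyond accents."""
    for parse, text, lemma in rows:
        if parse == NO_PARSE and strip_accents(text) != strip_accents(lemma):
            yield parse, text, lemma


# Made-up rows for illustration: (parse code, text form, lexical form).
rows = [
    ("--------", "Ἐπειδήπερ", "ἐπειδήπερ"),
    ("--------", "δὲ", "δέ"),
    ("--------", "εἵνεκεν", "ἕνεκα"),
]
for row in suspicious(rows):
    print(row)
```

Only the third row is reported, because its text form differs from the lexical form in spelling rather than merely in accentuation.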

In some cases this rule means that new lexical forms need to be provided to allow for spelling variation, rather than the lexical form normalising spelling. This is an editorial decision I've made that makes more sense in the larger picture of where I'm going with the MorphGNT.

The corrections I'm making to the CCAT database are really just a side-effect of my efforts to build an original database of New Testament Greek morphology. I'll say more about it as it develops but the idea is that surface forms, lexical forms, spelling variations, roots, stems, suppletion, morpho-phonological rules, etc. will all be catalogued with relationships between them expressed as a directed labelled graph.

Eventually, the MorphGNT will reference into this graph rather than merely give the lemma. There'll be a partial ordering of nodes in the graph (expressed by a subset of arc types) and so references will be to the node that is as general as can explain the specific surface form.

originally published on jtauber.com

MorphGNT v5.03 available

2004-12-07

More corrections now and more coming soon.

Version 5.03 contains a major correction to the lemma PRO; a correction to MYRA; some spelling distinctions ENEKEN/ENEKA, BETHSAIDA(N), GOLGOTHA(N); and case corrections in proper names GERASENOS, STEFANOS, FOROS, TREIS, TABERNE, DIABLOS.

See [MorphGNT].

originally published on jtauber.com

MorphGNT v5.02 Available

2004-12-05

Some breathing corrections on rho-initial words.

originally published on jtauber.com

Programmed Vocabulary Learning as a Travelling Salesman Problem

2004-11-26

For a while I've been interested in how you could select the order in which vocabulary is learnt in order to maximise one's ability to read a particular corpus of sentences. Or more generally, imagine you have a set of things you want to learn and each item has prerequisites drawn from a large set with items sharing a lot of common prerequisites.

As an abstract example, imagine you want to be able to read the "sentences":

{"a b", "b a", "h a b", "d a b e c", "d a g f"}

where we assume you must first learn each "word". Further assuming that all sentences are equally valuable to learn, how would you order the learning of words to maximise what you know at any given point in time?

One approach would be to learn the prerequisites in order of their frequency. So you might learn in an order like

<a, b, d, c, e, f, g, h>

However, had we put h before d, we could have had an overall learning programme that, although equal in length by the end, enabled the learner, at the half-way mark, to understand three sentences instead of just two.
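The comparison can be checked with a few lines of Python. This is only a sketch: the function names are mine, and breaking frequency ties alphabetically is an assumption chosen because it reproduces the ordering given above.

```python
from collections import Counter

SENTENCES = ["a b", "b a", "h a b", "d a b e c", "d a g f"]


def frequency_order(sentences):
    """Order words by descending frequency, breaking ties alphabetically."""
    counts = Counter(word for s in sentences for word in s.split())
    return sorted(counts, key=lambda w: (-counts[w], w))


def readable(sentences, known):
    """Count the sentences whose words are all in `known`."""
    known = set(known)
    return sum(1 for s in sentences if set(s.split()) <= known)


by_frequency = frequency_order(SENTENCES)
h_before_d = ["a", "b", "h", "d", "c", "e", "f", "g"]
half = len(by_frequency) // 2

print(by_frequency)                              # ['a', 'b', 'd', 'c', 'e', 'f', 'g', 'h']
print(readable(SENTENCES, by_frequency[:half]))  # 2
print(readable(SENTENCES, h_before_d[:half]))    # 3
```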

To investigate this further, I needed a way to score a particular learning programme and decided that one reasonable way to do so would be to sum, across each step, the fraction of the overall set of sentences understandable at that point.
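A minimal sketch of that scoring, applied to the contrived example (the function name and the exact form of the calculation are my reconstruction from the description, not the original script):

```python
SENTENCES = ["a b", "b a", "h a b", "d a b e c", "d a g f"]


def score(order, sentences):
    """Sum, over each step, the fraction of sentences readable so far."""
    needed = [set(s.split()) for s in sentences]
    known = set()
    readable_total = 0
    for word in order:
        known.add(word)
        readable_total += sum(1 for words in needed if words <= known)
    return readable_total / len(sentences)


print(score("abdcefgh", SENTENCES))  # 4.2
print(score("abhdcefg", SENTENCES))  # 4.8
```

The frequency ordering scores 4.2 while the ordering with h moved before d scores 4.8, so the score does reward being able to read more sentences earlier.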

I then needed an algorithm that would find the ordering that would maximise this score.

After the quick realisation that the number of possible learning programmes was factorial in the number of words, it dawned on me that this was essentially a travelling salesman problem.

So my sister Jenni and I wrote a Python script that implements a simulated annealing approach to the TSP. We then applied it to the above contrived example. Sure enough, it found a solution that was better than a straight prerequisite frequency ordering.

I then decided to try applying it to a small extract of the Greek New Testament (which, of course, [I have in electronic form], already stemmed). So I ran it on the first chapter of John's Gospel. 198 words and 51 verses. A straight frequency ordering on this text achieves a score of 48 so that was the score to beat.

My first attempt didn't even come close to that. What a disappointment! Jenni and I wondered if it was just the initial parameters to the annealing model. So we increased the number of iterations at a given temperature to 50 and lowered the final temperature to 0.001 (keeping the initial temperature at 1 and the alpha at 0.9).
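For readers unfamiliar with the parameters just mentioned, here is a rough sketch of simulated annealing over word orderings, run on the small contrived example rather than on John 1. It is not the script Jenni and I wrote: the swap-two-words neighbour move, the acceptance rule and all the names are assumptions, with the defaults set to the parameter values above.

```python
import math
import random

SENTENCES = ["a b", "b a", "h a b", "d a b e c", "d a g f"]


def score(order, sentences):
    """Sum, over each step, the fraction of sentences readable so far."""
    needed = [set(s.split()) for s in sentences]
    known, total = set(), 0
    for word in order:
        known.add(word)
        total += sum(1 for words in needed if words <= known)
    return total / len(sentences)


def anneal(sentences, t_initial=1.0, t_final=0.001, alpha=0.9,
           iterations=50, seed=0):
    rng = random.Random(seed)
    # Start from an arbitrary (alphabetical) ordering of the vocabulary.
    current = sorted({w for s in sentences for w in s.split()})
    current_score = score(current, sentences)
    best, best_score = list(current), current_score
    t = t_initial
    while t > t_final:
        for _ in range(iterations):
            # Neighbour: swap two positions in the ordering.
            i, j = rng.sample(range(len(current)), 2)
            candidate = list(current)
            candidate[i], candidate[j] = candidate[j], candidate[i]
            candidate_score = score(candidate, sentences)
            delta = candidate_score - current_score
            # Always accept improvements; accept worse orderings with
            # probability exp(delta / t), which shrinks as t cools.
            if delta >= 0 or rng.random() < math.exp(delta / t):
                current, current_score = candidate, candidate_score
                if current_score > best_score:
                    best, best_score = list(current), current_score
        t *= alpha  # geometric cooling
    return best, best_score


best, best_score = anneal(SENTENCES)
print(best, best_score)
```

More iterations per temperature and a lower final temperature both mean more candidate orderings are tried, which is why runs with those settings take longer and can reach higher scores.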

Success!! It found a solution that scored 82.94. The first verse readable (after 27 words) was John 1.34. John 1.20 was then readable after just 2 more words and John 1.4 after another 7.

I decided to try different parameters. With 100 iterations per temp, a final temp of 0.0001 and a few hours, it achieved a score of 91.59 (and was still increasing at the time). This time the first verse readable was John 1.24, after only 8 words; then John 1.4 after another 9; John 1.10 after 4; and both John 1.1 and John 1.6 after another 4 and John 1.2 just 1 word after that.

Overall a very promising approach. I doubt it's anything new but it was fun discovering the approach ourselves rather than just reading about it in some textbook. The example I tested it on was vocabulary learning, but it could apply to anything that can similarly be modelled as items to learn with prerequisites drawn from a large, shared set.

The next step (besides more optimised code and even more long-running parameters) would be to try to work out how to model layered prerequisites — i.e. where prerequisites themselves have prerequisites — to any number of levels. I haven't thought yet how (or even whether) that boils down (no pun intended) to a simulated annealing solution to the TSP.

UPDATE (2005-08-03): Now see Using Simulated Annealing to Order Goal Prerequisites.

originally published on jtauber.com

MorphGNT v5.01 Available

2004-11-21

Found an accent and breathing problem in both the text and lemma for ABEL, ANNA and ANNAS which is now corrected.

originally published on jtauber.com

MorphGNT v5.00 Available

2004-11-14

At wildly varying intensities over the last ten years, I’ve worked on correcting the UPenn CCAT Morphological Parsed Greek New Testament as a side-effect of larger linguistic analyses I’ve undertaken.

The last big burst of activity was in 2002 when I resumed work on my own morphological analysis (starting with the nouns).

The last couple of weekends, I’ve been working on preparing a new release of the corrected MorphGNT file, the first in probably seven or so years.

Prompted by a post to the b-greek mailing list, I’ve now made that release. MorphGNT v5.00 is now available at [MorphGNT].

originally published on jtauber.com

The Bible and the Semantic Web

2004-05-04

For many years I’ve been thinking about the application of Semantic Web technology to studying (and presenting the results of the study of) the Bible. However, I never really thought about the application of Bible study (and the tools and techniques developed for it) to the Semantic Web.

Then I came across this great blog entry, discussing the latter.

On the former, there is a wonderful site SemanticBible that I hope I can contribute to in some way.

I also really need to get back to my morphological analysis. I haven’t thought about it for a while, but I need to come up with URIs for each lemma and word form. I could even grandfather in Strong’s numbers and G/K numbers.

originally published on jtauber.com

	-θη-	-η-
α	13	1
ε	21	-
η	80	-
ι	17	-
ο	11	-
υ	26	-
ω	52	-
σ	108	-
ξ	-	-
ψ	-	-
ζ	-	-
τ	-	-
δ	-	-
θ	-	-
κ	-	-
γ	-	12
χ	37	-
β	-	2
π	-	3
φ	16	10
λ	-	6
ρ	6	8
μ	-	-
ν	16	3

	PM-1	PM-2	PM-3	PM-4	PM-5	PM-6a	PM-7	PM-8	PM-9	PM-10-C	PM-11	PM-11-C	PM-12
INF	89	15	4	8	-	4	-	-	12	2	-	3	-
1SG	85	17	-	3	-	1	4	-	9	1	1	-	1
2SG	19	1	-	5	-	-	-	-	7	-	-	-	-
3SG	228	7	-	8	-	-	-	-	74	2	7	11	-
1PL	20	4	-	3	1	3	-	-	9	-	1	-	-
2PL	24	9	-	3	-	-	1	-	32	-	-	-	-
3PL	58	4	1	3	1	1	-	-	13	-	-	1	-
Total	523	57	5	33	2	9	5	-	156	5	9	15	1
