Analysing the Verbs in Nestle 1904

Over the last couple of weeks, I've been getting my greek-inflexion code to work with Ulrik Sandborg-Petersen's analysis of the Nestle 1904. The first pass is now done.

The motivation for doing this work was (a) to expand the verb stem database and stemming rules; and (b) to be able to annotate the Nestle 1904 with additional morphological information for my adaptive reader and some similar work Jonathan Robie is doing.

My usual first step when dealing with a new text is to automatically generate as many new entries in the lexicon / stem-database as I can (see the first step in Update on LXX Progress).

In some cases, this is just a new stem for an already known verb, needed because a new form has turned up. But sometimes it's an entirely new verb.

I thought the Nestle 1904 would be considerably easier than the LXX because the text is so similar, but numerous challenges arose.

It became clear very quickly that there were considerable differences in lemma choice between the Nestle 1904 and the MorphGNT SBLGNT. This didn't completely surprise me: I've spent quite a bit of time cataloging lemma choice differences between lexical resources and there are considerable differences even between BDAG and Danker's Concise Lexicon.

But even these aside, there were 7,743 out of 28,352 verbs (roughly 27%) mismatching after my code had already done its best to automatically fill in missing lexical entries and stems.

A. The normalisation column in Nestle 1904 doesn't normalise capitalisation, clitic accentuation, or moveable nu, all of which greek-inflexion assumes has been done.

Capitalisation alone accounted for 1,042 mismatches. Clitic accentuation alone accounted for 1,008 mismatches. Moveable nu alone accounted for 4,153 mismatches.
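To make the three normalisation gaps concrete, here is a minimal sketch of how a generated form might be compared against an un-normalised attested form. It is not the greek-inflexion code: the function names are hypothetical, the "(ν)" notation for moveable nu in the generated form is an assumption, and the clitic rule (drop a final acute when the word already carries an earlier accent) is a simplification.

```python
import unicodedata

ACUTE = "\u0301"
# Combining grave, acute and circumflex (perispomeni) in decomposed Greek.
ACCENTS = {"\u0300", "\u0301", "\u0342"}


def nfc(text):
    """Compose to NFC so that precomposed and decomposed input compare equal."""
    return unicodedata.normalize("NFC", text)


def strip_clitic_accent(form):
    """Drop a second (final) acute, as added before an enclitic."""
    chars = list(unicodedata.normalize("NFD", form))
    positions = [i for i, c in enumerate(chars) if c in ACCENTS]
    if len(positions) > 1 and chars[positions[-1]] == ACUTE:
        del chars[positions[-1]]
    return nfc("".join(chars))


def normalise(form):
    """Lowercase, then remove any extra clitic accent."""
    return strip_clitic_accent(form.lower())


def matches(generated, attested):
    """Compare a generated form (assumed to mark moveable nu as "(ν)")
    with an attested form that may or may not show the nu."""
    generated = nfc(generated)
    candidates = {generated.replace("(ν)", "ν"), generated.replace("(ν)", "")}
    return normalise(attested) in candidates
```

With this, matches("ἔδωκε(ν)", "ἔδωκέν") and matches("λέγει", "Λέγει") both succeed, whereas a plain string comparison would report a mismatch in each case.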

B. Nestle 1904 systematically avoids assimilation of συν and ἐν preverbs.

Taken alone, these accounted for 91 mismatches. Mapping them to the assimilated spellings prior to analysis by greek-inflexion is something of a hack that I'll address in later passes.
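A mapping of this kind could be sketched roughly as follows. This is an illustration under stated assumptions, not the mapping I actually used: the function name is hypothetical, the example forms are illustrative, and only the ν-before-labial, ν-before-velar and ν-before-λ cases are covered (the loss of ν before σ or ζ in συν- compounds, and assimilation before ρ, are deliberately left out).

```python
import re
import unicodedata

# What the ν of the preverb becomes before each following consonant.
ASSIMILATED = {}
ASSIMILATED.update({c: "μ" for c in "πβφψμ"})  # labials
ASSIMILATED.update({c: "γ" for c in "κγχξ"})   # velars
ASSIMILATED["λ"] = "λ"

# Works on NFD text: συ or ἐ (ε + smooth breathing), an optional accent,
# then ν directly followed by one of the consonants above.
PREVERB = re.compile(
    "^(συ[\u0300\u0301\u0342]?|ε\u0313[\u0300\u0301\u0342]?)ν([πβφψμκγχξλ])"
)


def assimilate(form):
    """Rewrite an unassimilated συν-/ἐν- spelling to the assimilated one."""
    decomposed = unicodedata.normalize("NFD", form)
    rewritten = PREVERB.sub(
        lambda m: m.group(1) + ASSIMILATED[m.group(2)] + m.group(2),
        decomposed,
    )
    return unicodedata.normalize("NFC", rewritten)
```

Augmented forms such as συνελάλουν pass through unchanged because the ν is followed by a vowel, which is one reason the rule has to look at the surface form rather than the lemma.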

C. There were 8 spelling differences in the endings which required an update to stemming.yaml:

κατασκηνοῖν (PAN) in Matt 13:32
κατασκηνοῖν (PAN) in Mark 4:32
ἀποδεκατοῖν (PAN) in Heb 7:5
φυσιοῦσθε (PMS-2P) in 1Cor 4:6
εἴχαμεν (IAI.1P) in 2John 1:5
εἶχαν (IAI.3P) in Mark 8:7
εἶχαν (IAI.3P) in Rev 9:8
παρεῖχαν (IAI.3P) in Acts 28:2

D. The different parse code scheme (Robinson's vs CCAT) had to be mapped over.

This should have been straightforward, but voice in the formal morphology field sometimes seemed to be wrong (which I corrected as part of G. below).
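As a rough idea of what the mapping involves, here is a sketch that converts a Robinson-style verb tag into the TVM.PN style used in the lists in this post. The function name and the exact input format are assumptions: it expects tags shaped like V-PAI-3S, V-PAN or V-RAP-GSF, treats a leading 2 (second aorist and the like) as irrelevant, and passes the voice letter through untouched, since voice is exactly where the manual corrections were needed.

```python
import re

# Robinson tense letter -> CCAT-style tense letter (perfect and pluperfect differ).
TENSE = {"P": "P", "I": "I", "F": "F", "A": "A", "R": "X", "L": "Y"}
# Robinson mood letter -> CCAT-style mood letter (imperative differs).
MOOD = {"I": "I", "S": "S", "O": "O", "M": "D", "N": "N", "P": "P"}

TAG = re.compile(r"^V-2?([PIFARL])([A-Z])([ISOMNP])(?:-([A-Z0-9]+))?$")


def convert(tag):
    """Convert e.g. 'V-RAP-GSF' to 'XAP.GSF'; voice is copied as-is."""
    m = TAG.match(tag)
    if m is None:
        raise ValueError(f"unrecognised verb tag: {tag}")
    tense, voice, mood, rest = m.groups()
    code = TENSE[tense] + voice + MOOD[mood]
    return f"{code}.{rest}" if rest else code
```

Tags with extra suffixes, or anything that isn't a verb, raise ValueError rather than being silently guessed at.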

E. There were 182 differences (type not token) in lemma choice, mostly active vs middle forms.

See https://gist.github.com/jtauber/28ddfeee3175903026dade4ab965ac6c#file-lemma-differences-txt for the full list.

F. There were a small handful of per-form lemma corrections I made (form, parse, original lemma, corrected lemma):

ἐπεστείλαμεν AAI.1P ἀποστέλλω ἐπιστέλλω
ἀγαθουργῶν PAP.NSM ἀγαθοεργέω ἀγαθουργέω
συνειδυίης XAP.GSF συνοράω σύνοιδα
γαμίσκονται PMI.3P γαμίζω γαμίσκω

G. Finally, I made 69 (type not token) parse code changes.

See https://gist.github.com/jtauber/28ddfeee3175903026dade4ab965ac6c#file-parse-txt for the list.

With all this, the greek-inflexion code (on a branch not yet pushed at the time of writing) can correctly generate all the verbs in the Nestle 1904 morphology.

There are definitely improvements I need to make in a second pass and at least a small number of corrections that I think need to be made to the Nestle 1904 analysis.

But it's now possible for me to produce an initial verb stem annotation for the Nestle 1904 and I'm a step closer to a morphological lexicon with broader coverage.

UPDATE: I've added some more parse corrections but not yet updated the gist.