Update on LXX Progress

As mentioned in previous posts, I’ve been working through the LXX, initially making sure my greek-inflexion library can generate the same analysis of verbs as the CATSS LXX Morphology and adding to the verb stem database accordingly. This is a preliminary to running the code on alternative LXX editions such as Swete and providing a freely available, morphologically tagged LXX.

The general process has been, one book at a time:

1. programmatically expand the stem database with missing stems where the analysis given by CATSS fits what the greek-inflexion stemming rules expect
2. where the analysis from CATSS doesn’t fit what greek-inflexion expects, evaluate whether it’s
   - a parse error in the CATSS (at this stage by far the most common problem, but also the most time-consuming to identify and fix)
   - a missing stemming rule (very rare at this stage)
   - some temporary limitation of greek-inflexion (it could be smarter about some accentuation, for example)
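
As a rough illustration of the second step (not the actual greek-inflexion code), the triage can be thought of as a three-way check. The `triage` function, the `Outcome` names, the parse labels such as "AAI.1S", and the shape of the `generated` table are all assumptions made for this sketch. The table stands in for whatever a generator produces for one lemma.

```python
from enum import Enum


class Outcome(Enum):
    MATCH = "form generated under the given parse"
    OTHER_PARSE = "form generated, but only under a different parse"
    NOT_GENERATED = "form not generated under any parse"


def triage(form, catss_parse, generated):
    """Classify one (form, parse) pair against generated forms.

    `generated` is a hypothetical table for a single lemma:
    a dict mapping a parse label to the set of forms produced for it.
    """
    if form in generated.get(catss_parse, set()):
        return Outcome.MATCH
    if any(form in forms for forms in generated.values()):
        # Candidate parse error in the source analysis; still needs a human check.
        return Outcome.OTHER_PARSE
    # Candidate missing stem, missing rule, or generator limitation.
    return Outcome.NOT_GENERATED


# Toy table for λύω with made-up parse labels (illustration only).
generated = {
    "AAI.1S": {"ἔλυσα"},
    "API.1S": {"ἐλύθην"},
}

for form, parse in [("ἔλυσα", "AAI.1S"), ("ἐλύθην", "AAI.1S"), ("ἔλυσε", "AAI.3S")]:
    print(form, parse, triage(form, parse, generated).name)
```

The check only narrows down where to look. Deciding whether a mismatch really is a CATSS error, a missing rule, or a limitation of the generator still takes manual evaluation.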

Working a few hours a week, it took about a month to do 1 Kings (i.e. 1 Samuel), in part because it had close to 100 parsing errors in the CATSS, many of them quite inexplicable (like getting the voice wrong when the ending should make that very easy to determine).

The work up until this point covers about 35% of the LXX, but for the rest I decided to go broad rather than book by book.

In other words, I’ve expanded the stem database (per step one above) for the entire LXX in one go and will now work through the problem cases.

What is very encouraging is that expanding the verbs attempted from 35% to 100% only led to 731 analysis mismatches in 1,875 locations. Given the LXX has just over 100,000 verbs, that’s less than a 2% error rate.
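
The arithmetic behind that figure is simple enough to check directly. The variable names are just for this sketch, and the verb total is rounded down to 100,000, so the computed rate is, if anything, slightly too high.

```python
mismatch_locations = 1_875
total_verbs = 100_000  # "just over 100,000", so this slightly overstates the rate

rate = mismatch_locations / total_verbs
print(f"{rate:.3%} of verb locations have a mismatch")
print(f"{1 - rate:.3%} are explained by the existing rules and data")
```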

Let me be clear, however, about what I’m claiming. I’m NOT saying I can morphologically tag verbs with 98% accuracy. I’m merely saying that 98% of the CATSS LXX morphological analysis can be explained by the rules and data in greek-inflexion. The other 2% is likely to MOSTLY be errors in the CATSS analysis, with a few errors in my stem database, stemming rules, or accentuation rules.

At the rate I worked through 1 Kings, going through the rest of the mismatches might take the rest of the year, but I think I can speed things up by batching similar kinds of mismatches together. For example, there are 586 forms where greek-inflexion couldn’t generate the form with the morphosyntactic properties given in the CATSS analysis but could generate it with different morphosyntactic properties. In almost all cases that corresponds to a mistake in the CATSS analysis. These are the most time-consuming cases to deal with, but batching them together (especially dealing with the same mismatch across all remaining books at once) should speed things up.
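
One way to picture the batching is a minimal sketch like the one below. It assumes each mismatch is a record holding the location, the form, the CATSS parse, and the alternative parse under which the form could be generated. The field names, parse labels, and sample records are invented placeholders, not real data from the project.

```python
from collections import defaultdict


def batch_mismatches(mismatches):
    """Group mismatch records by (CATSS parse, alternative parse), largest batch first."""
    batches = defaultdict(list)
    for record in mismatches:
        key = (record["catss"], record["alternative"])
        batches[key].append(record)
    return sorted(batches.items(), key=lambda item: len(item[1]), reverse=True)


# Invented placeholder records, purely to show the shape of the data.
mismatches = [
    {"loc": "loc-1", "form": "form-a", "catss": "AAI.3S", "alternative": "AMI.3S"},
    {"loc": "loc-2", "form": "form-b", "catss": "PAI.2P", "alternative": "PAD.2P"},
    {"loc": "loc-3", "form": "form-c", "catss": "AAI.3S", "alternative": "AMI.3S"},
]

for (catss, alternative), records in batch_mismatches(mismatches):
    locations = ", ".join(r["loc"] for r in records)
    print(f"{catss} -> {alternative}: {len(records)} location(s): {locations}")
```

Reviewing one batch at a time means the same kind of decision (say, active versus middle for a given ending) gets made once and applied across every book where it occurs.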

It may also lend itself to crowd-sourcing. I could probably pretty easily whip up a little website that shows people the form and asks them to choose between the CATSS analysis and the greek-inflexion analysis (not telling them which is which).
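
The core of such a site would be presenting the two analyses in random order while keeping track of which is which behind the scenes. A minimal sketch follows, with hypothetical function names, source labels, and parse labels, and with no web framework involved.

```python
import random


def make_question(form, catss_parse, inflexion_parse, rng=random):
    """Build a blind question: two parses in random order, with sources kept separately."""
    options = [("catss", catss_parse), ("greek-inflexion", inflexion_parse)]
    rng.shuffle(options)
    return {
        "form": form,
        "choices": [parse for _, parse in options],    # shown to the participant
        "sources": [source for source, _ in options],  # kept hidden, used for scoring
    }


def record_answer(question, chosen_index):
    """Return which source the participant's choice came from."""
    return question["sources"][chosen_index]


question = make_question("ἐλύθην", "AAI.1S", "API.1S")
print(question["form"], question["choices"])

# Suppose the participant picks the passive parse.
picked = question["choices"].index("API.1S")
print("chosen analysis came from:", record_answer(question, picked))
```

Passing a seeded `random.Random` instance as `rng` makes the ordering reproducible for testing, while the default uses the module-level generator.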

It may be worth me spending a few hours setting that up!