Update on LXX Progress
The general process has been, one book at a time:
- programmatically expand the stem database with missing stems where the analysis given by CATSS fits what greek-inflexion stemming rules expect
- where the analysis from CATSS doesn’t fit what greek-inflexion expects, evaluate if it’s
  - a parse error in the CATSS (at this stage by far the most common problem, but also the most time consuming to identify and fix)
  - a missing stemming rule (very rare at this stage)
  - some temporary limitation of greek-inflexion (it could be smarter about some accentuation, for example)
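The triage above can be sketched roughly as follows. This is only an illustration, not the actual greek-inflexion API: `classify`, `Outcome`, and the parse codes are hypothetical names, and the set of generated parses is assumed to come from whatever the stem database and stemming rules produce for a given form.

```python
from enum import Enum


class Outcome(Enum):
    MATCH = "CATSS analysis is explained by the stems and rules"
    DIFFERENT_PARSE = "form is generated, but with other properties (likely a CATSS parse error)"
    NOT_GENERATED = "form is not generated (missing stem, missing rule, or accentuation limitation)"


def classify(catss_parse, generated_parses):
    """Compare one CATSS parse with the parses generated for the same form.

    `generated_parses` is a set of parse codes; how it is produced is
    outside this sketch (hypothetical stand-in for the real generator).
    """
    if catss_parse in generated_parses:
        return Outcome.MATCH
    if generated_parses:
        return Outcome.DIFFERENT_PARSE
    return Outcome.NOT_GENERATED


# Invented inputs; the parse codes are illustrative labels only.
print(classify("AAI.3S", {"AAI.3S"}).name)
print(classify("AMI.3S", {"API.3S"}).name)
print(classify("AAI.3S", set()).name)
```

Only the last two outcomes need a human to look at them; the first is what lets the stem database be expanded programmatically.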
Working a few hours a week, it took about a month to do 1 Kings (i.e. 1 Samuel), in part because it had close to 100 parsing errors in the CATSS, many of them quite inexplicable (like getting the voice wrong when the ending should make that very easy to determine).
The work up until this point covers about 35% of the LXX, but for the rest I decided to go broad rather than book by book.
In other words, I’ve expanded the stem database (per step one above) for the entire LXX in one go and will now work through the problem cases.
What is very encouraging is that expanding the verbs attempted from 35% to 100% only led to 731 analysis mismatches in 1,875 locations. Given the LXX has just over 100,000 verbs, that’s less than a 2% error rate.
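For anyone who wants to check the arithmetic, here it is spelled out, using 100,000 as a round figure for the verb count (the real count is a little higher, so the real rate is a little lower):

```python
mismatched_locations = 1_875
distinct_mismatches = 731
total_verbs = 100_000  # rounded down from "just over 100,000"

mismatch_rate = mismatched_locations / total_verbs
locations_per_mismatch = mismatched_locations / distinct_mismatches

print(f"mismatch rate: {mismatch_rate:.3%}")      # 1.875%
print(f"explained:     {1 - mismatch_rate:.3%}")  # 98.125%
print(f"locations per distinct mismatch: {locations_per_mismatch:.1f}")  # 2.6
```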
Let me be clear, however, what I’m claiming. I’m NOT saying I can morphologically tag verbs with 98% accuracy. I’m merely saying that 98% of the CATSS LXX morphological analysis can be explained by the rules and data in greek-inflexion. The other 2% is likely to MOSTLY be errors in the CATSS analysis with a few errors in my stem database, stemming rules, or accentuation rules.
At the rate I worked through 1 Kings, going through the rest of the mismatches might take the rest of the year, but I think I can speed things up by batching similar kinds of mismatches together. For example, there are 586 forms where greek-inflexion didn’t generate the form in the CATSS analysis with the morphosyntactic properties given but was able to generate the form with different morphosyntactic properties. In almost all cases that corresponds to a mistake in the CATSS analysis. It’s the most time consuming part to deal with but batching them up together (especially dealing with the same mismatch across all remaining books at once) should speed things up.
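One way the batching might look, as a minimal sketch: group the mismatch records by the pair (CATSS parse, generated parse) so that the same kind of disagreement is reviewed once across all books. The record layout, the function name `batch_mismatches`, and the sample records are all invented for illustration.

```python
from collections import defaultdict


def batch_mismatches(mismatches):
    """Group mismatch records by (CATSS parse, generated parse).

    Each record is assumed to be a dict with the hypothetical keys
    "book", "form", "catss_parse" and "generated_parse".
    """
    batches = defaultdict(list)
    for m in mismatches:
        key = (m["catss_parse"], m["generated_parse"])
        batches[key].append((m["book"], m["form"]))
    # Largest batches first, since they clear the most locations per review.
    return sorted(batches.items(), key=lambda item: len(item[1]), reverse=True)


# Invented sample records, not real CATSS data.
sample = [
    {"book": "Gen", "form": "ἐλύθη", "catss_parse": "AMI.3S", "generated_parse": "API.3S"},
    {"book": "Exod", "form": "ἐλύθη", "catss_parse": "AMI.3S", "generated_parse": "API.3S"},
    {"book": "Lev", "form": "ἔλυσεν", "catss_parse": "IAI.3S", "generated_parse": "AAI.3S"},
]

for (catss, generated), locations in batch_mismatches(sample):
    print(catss, "->", generated, len(locations), locations)
```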
It may also lend itself to crowd-sourcing. I could probably pretty easily whip up a little website that shows people the form and asks them to choose between the CATSS analysis and the greek-inflexion analysis (not telling them which is which).
It may be worth me spending a few hours setting that up!
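The core of such a site could be quite small. Here is a rough sketch of the blind-choice part only, with no web framework involved; `make_question` and `record_vote` are hypothetical names, and the idea is simply to shuffle the two analyses and keep the source labels out of what the volunteer sees.

```python
import random


def make_question(form, catss_parse, generated_parse, rng=random):
    """Build a blind two-way choice between the two analyses of a form."""
    options = [("catss", catss_parse), ("greek-inflexion", generated_parse)]
    rng.shuffle(options)  # random order, so position doesn't reveal the source
    return {
        "form": form,
        "choices": [parse for _, parse in options],    # shown to the volunteer
        "sources": [source for source, _ in options],  # kept on the server only
    }


def record_vote(question, chosen_index):
    """Map the volunteer's choice back to the source it came from."""
    return question["sources"][chosen_index]


# Invented example; the parse codes are illustrative labels only.
question = make_question("ἐλύθη", "AMI.3S", "API.3S", rng=random.Random(0))
print(question["form"], question["choices"])
print(record_vote(question, 0))
```

Tallying the votes per form would then show where people consistently side with one analysis over the other.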