Annotating the Normalization Column in MorphGNT: Part 1

Since the Series-6 release, MorphGNT has had a column that normalizes the word forms in the text for contextual things like accent changes, elision, movable nu and capitalization. I thought it would be useful to provide an annotation of exactly what normalization had been done for each word in the text and why.

I wrote a short Python script that runs some heuristics on each case where the "word" column and "norm" column differ to determine the nature of the in-context change.

In this post, I'll just report on some statistics. In later posts, I'll dive into further details that rely on actually looking at the surrounding context (rather than just the difference in one row).

There are 47,630 times where the word and norm columns differ.
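For readers who want to reproduce a count like this, here is a minimal sketch rather than the script I actually used. It assumes each row is whitespace-separated with the word in the fifth column and the norm in the sixth; the function name `count_differences` and the column indexes are assumptions for illustration, so adjust them to the file layout you have.

```python
def count_differences(lines, word_col=4, norm_col=5):
    """Count rows whose word and norm columns differ.

    The zero-based column positions are an assumption about the
    file layout, not something taken from the MorphGNT documentation.
    """
    total = 0
    for line in lines:
        cols = line.split()
        # Skip blank or short rows rather than guessing at their meaning.
        if len(cols) <= max(word_col, norm_col):
            continue
        if cols[word_col] != cols[norm_col]:
            total += 1
    return total
```

Because the function takes any iterable of lines, it can be handed an open file object or a list of strings.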

38,523 times there is a change of accent (clitics, oxytones taking graves, etc.).

3,721 times there is a change in capitalization.

1,221 times there is elision: 984 times a straight dropping of a final vowel, 237 times an additional aspiration of the preceding consonant.

5,223 times there is a movable nu. Note that both the presence and the absence of nu are normalized to (ν), so this count covers all cases where a nu could be dropped, not just the 142 times when it actually is.

226 times there is a movable sigma (20 times where it's actually dropped). This doesn't count ἐξ (another 234 times). There are also 825 times οὐκ appears and 105 times οὐχ appears.
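To give a flavour of the kind of heuristic involved, here is a hedged sketch of how a single word/norm pair might be sorted into the categories above. It is not the actual script: the function names, the label strings, and the set of characters treated as elision marks are all assumptions made for illustration, and it leaves out movable sigma, ἐξ, οὐκ and οὐχ entirely.

```python
import unicodedata

# Combining grave, acute and circumflex (perispomeni); breathings are kept.
ACCENTS = {"\u0300", "\u0301", "\u0342"}
# Assumption: characters that may mark elision at the end of a word.
ELISION_MARKS = {"\u2019", "\u02bc", "'", "\u1fbd"}


def strip_accents(s):
    """Remove accents but keep breathings and base letters."""
    decomposed = unicodedata.normalize("NFD", s)
    kept = "".join(ch for ch in decomposed if ch not in ACCENTS)
    return unicodedata.normalize("NFC", kept)


def classify(word, norm):
    """Return a list of labels describing how word differs from norm."""
    w = unicodedata.normalize("NFC", word)
    n = unicodedata.normalize("NFC", norm)
    labels = []
    if w == n or not w or not n:
        return labels
    # Capitalization: only the first letter differs in case.
    if w[0] != n[0] and w[0].lower() == n[0]:
        labels.append("capitalization")
        w = w[0].lower() + w[1:]
    # Movable nu: the norm always ends in (ν), present or not in the word.
    if n.endswith("(ν)"):
        labels.append("movable-nu")
        n = n[:-3]
        if w.endswith("ν"):
            w = w[:-1]
    # Elision: the word ends in an elision mark.
    if w and w[-1] in ELISION_MARKS:
        stem = strip_accents(w[:-1])
        if strip_accents(n).startswith(stem):
            labels.append("elision")
        else:
            # The consonant before the mark changed, e.g. π to φ.
            labels.append("elision-with-aspiration")
        return labels
    # Accent: the remaining difference disappears once accents are removed.
    if w != n:
        if strip_accents(w) == strip_accents(n):
            labels.append("accent")
        else:
            labels.append("other")
    return labels


print(classify("Καὶ", "καί"))      # ['capitalization', 'accent']
print(classify("εἶπεν", "εἶπε(ν)"))  # ['movable-nu']
```

Returning a list rather than a single label reflects the fact that one row can show more than one kind of change, which is also why the category counts above need not sum to the total.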

In addition to the 47,630 cases above, there are also 32 other instances of two types of discrepancy that need to be resolved. One is ἑλπίδι with a rough breathing in Romans. The other is the cases where Χριστός appears with lower case χ. I'm not sure what the solution to the former is, but the latter might just involve having two distinct lemmata for Χριστός vs χριστός.

All these statistics might seem of trivial interest but they are side effects of a more important task of both verifying the normalization and, as will be covered in subsequent posts, testing context-sensitive accentuation rules.