Eliran Wong asked for a more detailed description of the “normalisation” column in MorphGNT so I promised him I’d write a blog post about it.
I first outlined the objective of the column in a 2005 blog post but enough time has passed, and enough new work has been done, that I thought it was worthy of a new post.
The core idea of the normalised column is to give the inflected form as it would be stated in isolation.
To use the example from the 2005 post, consider the phrase in Matthew 1.20:
τὴν γυναῖκά σου
If you were to ask someone what the accusative singular feminine definite article is, you’d expect the answer τήν and not τὴν. Similarly, if you asked what the accusative singular of γυνή is, you’d expect the answer γυναῖκα and not γυναῖκά. The differences in Matthew 1.20 are contextual and, for many applications (particularly morphology), aren’t of much interest.
And so years ago, I went about adding a new column that normalised this sort of thing. Similarly μετά, μεθ’, μετ’, and μετὰ all get normalised to μετά in this separate column.
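The accentual part of this is mechanical: a final grave is just the in-context realisation of an acute, so it can be mapped back by manipulating the combining accents. Here is a minimal sketch in Python (not the actual MorphGNT code) using the standard library’s unicodedata module:

```python
import unicodedata

GRAVE = "\u0300"  # combining grave accent
ACUTE = "\u0301"  # combining acute accent

def grave_to_acute(token: str) -> str:
    """Restore the acute that a final grave replaces in context.

    A polytonic Greek word can only bear a grave on its final syllable,
    so a global replace after decomposition is safe.
    """
    decomposed = unicodedata.normalize("NFD", token)
    return unicodedata.normalize("NFC", decomposed.replace(GRAVE, ACUTE))

print(grave_to_acute("τὴν"))   # τήν
print(grave_to_acute("μετὰ"))  # μετά
```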
Back in the 2005 post, I enumerated the normalisations as:
- existing text may exhibit elision (e.g. μετ’ versus μετά)
- existing text may exhibit movable ς or ν
- final-acute may become grave
- enclitics may lose an accent
- word preceding an enclitic may gain an extra accent
- the οὐ / οὐκ / οὐχ alternation
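Several of these can be handled with simple lookup tables. As a minimal sketch (the tables here are illustrative samples, not the actual lists used in MorphGNT):

```python
# illustrative lookup tables — in practice these would be derived from
# full lists of elidable words and the οὐ / οὐκ / οὐχ alternation
ELIDED = {
    "μετ’": "μετά",
    "μεθ’": "μετά",
    "ἀπ’": "ἀπό",
    "ἀφ’": "ἀπό",
}
NEGATION = {"οὐκ": "οὐ", "οὐχ": "οὐ"}

def normalise_form(token: str) -> str:
    """Map a contextual form back to its isolated form, where a table covers it."""
    if token in ELIDED:
        return ELIDED[token]
    if token in NEGATION:
        return NEGATION[token]
    return token

print(normalise_form("μεθ’"))  # μετά
print(normalise_form("οὐκ"))   # οὐ
```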
When I published the SBLGNT analysis, another normalisation was added, namely the normalisation of capitalisation at the start of paragraphs or direct speech. The capitalisation is not an inherent part of the inflected form in isolation but a product of the particular context of the token, and so it is normalised.
In Analysing the Verbs in Nestle 1904 I covered some differences between the SBLGNT and Nestle 1904 analyses that normalisation would have smoothed over. Note that normalisation COULD go further (for example, spelling differences) but I chose not to do that in the normalisation column.
In brief, the things NOT normalised include:
- crasis (e.g. κἀγώ vs καὶ ἐγώ)
In Annotating the Normalization Column in MorphGNT: Part 1, I started talking about annotating WHY each token was normalised the way it was; there you can see counts of how many tokens underwent normalisation of accent or capitalisation, and how many exhibited elision or a movable nu or sigma.
In many cases, the normalisation can be automated without any need for human intervention (by having a list of elidable words, enclitics, etc). I’ll soon publish my latest Python code for doing this. In some cases, manual checking is needed (although lemmatisation generally resolves a lot of the ambiguities). In Direct Speech Capitalization and the First Preceding Head I talked about the start of some work to go through all capitalisation and identify the reason for it. Similarly New MorphGNT Releases and Accentuation Analysis discusses work on annotating the reason for all accentuation changes.
There is still a lot more work to do on this for the SBLGNT, but I did apply the idea when working on Seumas Macdonald’s Digital Nyssa project. For that, I produced a file whose first five lines are:
Ἦλθε ἦλθε capitalisation
καὶ καί grave
ἐφ’ ἐπί elision
ἡμᾶς ἡμᾶς
ἡ ἡ proclitic
Here each token is normalised in the second column, with the third column giving the reason for any difference between the token and the normalised form (and also indicating proclitics).
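The generation of such a file can be sketched as a small pipeline that applies each normalisation in turn and records the reason. This is a simplified illustration, not the actual code; in particular, the lookup tables are tiny stand-ins for the real lists of elidable words and proclitics:

```python
import unicodedata

GRAVE, ACUTE = "\u0300", "\u0301"

# illustrative stand-ins for the real lookup tables
ELIDED = {"ἐφ’": "ἐπί", "μετ’": "μετά", "μεθ’": "μετά"}
PROCLITICS = {"ὁ", "ἡ", "οἱ", "αἱ", "ἐν", "εἰς", "ἐκ", "οὐ"}

def normalise(token: str) -> tuple[str, list[str]]:
    """Return (normalised form, list of reason annotations) for one token."""
    reasons = []
    norm = token
    if norm in ELIDED:                       # undo elision
        norm = ELIDED[norm]
        reasons.append("elision")
    if norm[:1].isupper():                   # undo contextual capitalisation
        norm = norm[0].lower() + norm[1:]
        reasons.append("capitalisation")
    decomposed = unicodedata.normalize("NFD", norm)
    if GRAVE in decomposed:                  # restore final acute
        norm = unicodedata.normalize("NFC", decomposed.replace(GRAVE, ACUTE))
        reasons.append("grave")
    if norm in PROCLITICS:                   # not a change, just an indication
        reasons.append("proclitic")
    return norm, reasons

# reproduces the five lines shown above
for token in ["Ἦλθε", "καὶ", "ἐφ’", "ἡμᾶς", "ἡ"]:
    norm, reasons = normalise(token)
    print(token, norm, *reasons)
```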
The possible annotations (and there can be more than one on a token) are:

- capitalisation
- grave
- elision
- movable
- enclitic
- proclitic
I hope to eventually be able to provide the same for the entire SBLGNT (and other Greek texts).
Doing all this normalisation has a number of benefits. It makes it easier to extract forms for studying morphology; it allows searches to work more as expected (you don’t want to have to think up all the possible ways a form could actually be written in a text in order to search for it); and it makes it much easier to search for particular phenomena (for example, particular clitic accentuation).
It also allows for more rigorous validation of things like accentuation. Work in this area has already uncovered a number of accentuation errors in the SBLGNT text, for example, and could help with automated checking of OCR, etc.