New Draft Morphological Tags for MorphGNT

I've finally done the work of translating the MorphGNT tagging system into a new proposal, ready for initial feedback.

At least going back to my initial collaboration with Ulrik Sandborg-Petersen in 2005, I've been thinking about how I would do morphological tags in MorphGNT if I were starting from scratch.

Much later, in 2014, I had some discussions with Mike Aubrey at my first SBL conference and put together a straw proposal. It rethought some parts of speech, the handling of tense/aspect, the handling of voice, and the handling of syncretism and underspecification.

Even though some of the ideas were more drastic than others, a few things have remained consistent in my thinking:

there is value in a purely morphological analysis that doesn't disambiguate on syntactic or semantic grounds
this analysis does not need the notion of parts-of-speech beyond purely Morphological Parts of Speech
this analysis should not attempt to distinguish middles and passives in the present or perfect system

As part of the handling of syncretism and underspecification, I had originally suggested a need for a value for the case property that didn't distinguish nominative and accusative and a need for a value for the gender property like "non-neuter".

In the absence of feedback beyond a vague feeling that something like this should be done, I didn't immediately make further progress but, a year later, started gathering more notes on handling ambiguity. That then led to a more concrete proposal just around gender and case (although not without open questions).

I've now implemented those smaller-scale proposals as a first draft for the MorphGNT SBLGNT and plan to apply them to other GNT texts soon. The new-tags branch for MorphGNT SBLGNT is available at: https://github.com/morphgnt/sblgnt/tree/new-tags.

This adds a new column (the intention is not to replace existing analyses yet, just augment them) that:

makes voice formal not functional (while still using P in the aorist and future for what Carl Conrad would call MP2)
does not give morphosyntactic properties for uninflected words
implements basic nominative/accusative case syncretism in the neuter with a single value
implements basic non-neuter, non-feminine, and (in most genitive plurals) complete gender syncretism with a value for each

One immediate effect of this is that a list I have from Randall Tan of disagreements between the MorphGNT SBLGNT analysis and that of the Nestle 1904 largely goes away because many of them were merely different judgements of gender or case on non-morphological grounds. The new tag retains the uncertainty. Another benefit of the tagging scheme is that it provides a reasonable output for an automated morphological analysis system which can then, in a separate step, be disambiguated syntactically (or semantically), potentially with human input.

There are some important things to note, however, as just saying "this is a purely morphological analysis that doesn't disambiguate" oversimplifies things greatly.

Firstly, while punting distributional and semantic part-of-speech questions like "is this an adverb or a conjunction" or "what type of pronoun is this" is extremely helpful, there are still some questions that impact a purely morphological tagging such as whether to represent a fossilised verb acting as a particle as having morphological inflection.

Secondly, there are what I have called extended syncretisms, which are not modelled: cases where there can be uncertainty between properties taken as a pair. For example, 1st person singular vs 3rd person plural in -ον, or 1st declension genitive singular vs accusative plural in -ας. It may still be worth conveying this ambiguity, but just through disjunction, saying for example that a word is GSF^APF. These are almost always phonological coincidences rather than structural syncretism and so should be modelled differently.
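To make the disjunction idea concrete, here is a minimal sketch of how a tag like GSF^APF might be read by a consumer of the data. It assumes only that "^" separates fully formed alternative tags; the function names are hypothetical and nothing like this exists in the MorphGNT repository as far as this post is concerned.

```python
from typing import List


def parse_disjunction(tag: str) -> List[str]:
    """Split a disjunctive tag such as 'GSF^APF' into its alternatives.

    Assumption: '^' is used only as the separator between alternatives.
    A plain tag such as 'GSF' yields a single-item list.
    """
    return tag.split("^")


def is_compatible(tag: str, candidate: str) -> bool:
    """Return True if `candidate` is one of the alternatives listed in `tag`."""
    return candidate in parse_disjunction(tag)


alternatives = parse_disjunction("GSF^APF")
```

A later syntactic (or human) disambiguation step could then pick one of the alternatives without the morphological layer ever having to commit to it.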

Related to this is the "double" syncretism between accusative singular masculine and neuter on the one hand and nominative and accusative singular neuter on the other hand. If we model the latter as CSN then we've lost the former (which, by itself, could be modelled as ASY). So, in a sense, CSN and ASY are syncretic (but also share an overlapping cell). CSN^ASY doesn't quite seem right because of that overlap and because, as best I know, this isn't just a phonological coincidence.
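The overlap is easy to see if each underspecified tag is expanded into the fully specified cells it covers. The following is only an illustrative sketch: it assumes three-character case/number/gender tags, that C stands for "nominative or accusative", and that Y stands for "masculine or neuter", as the tags CSN and ASY above suggest. The dictionaries and the `cells` function are hypothetical names, not part of MorphGNT.

```python
from itertools import product

# Assumed expansions of the underspecified values (illustrative only).
CASE_VALUES = {"C": "NA"}    # C = nominative or accusative
GENDER_VALUES = {"Y": "MN"}  # Y = masculine or neuter (non-feminine)


def cells(tag):
    """Expand a case/number/gender tag into the set of fully specified cells."""
    case, number, gender = tag
    cases = CASE_VALUES.get(case, case)
    genders = GENDER_VALUES.get(gender, gender)
    return {c + number + g for c, g in product(cases, genders)}


csn = cells("CSN")   # nominative/accusative singular neuter
asy = cells("ASY")   # accusative singular masculine/neuter
overlap = csn & asy  # the cell the two tags share
```

Under these assumptions CSN covers NSN and ASN, ASY covers ASM and ASN, and the shared cell is ASN, which is why a simple disjunction CSN^ASY would count that cell twice.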

Thirdly, I have only modelled basic syncretism, not endings in wildly different parts of the paradigms that also happen to have converged by phonological change (and so would definitely not be called syncretism). For example, both -ου and -ον can be nominal endings or unrelated verbal endings (with quite a few interpretations, mind you, especially for -ου). No attempt has been made to capture this in a single tag (although a disjunctive representation might be possible).

And finally (although related to the previous point), a certain amount of lexical disambiguation is applied. There are many cases where not being familiar with the lexeme makes a form highly ambiguous but that ambiguity goes away if the lemma is known. A simple example is imperfects versus second aorists where the principal parts resolve the ambiguity. The draft new tags for MorphGNT SBLGNT effectively assume the lemmatisation has been done and is correct.

In light of this, some people might be surprised that υἱοῦ is tagged GSY and not GSM given it's lexically masculine. My current argument (at least in my own head) is that, regardless of the specific lexeme, GSM, as a morphological tag, doesn't really make sense in the Greek paradigmatic system because, by nature, genitive singulars have the same form in the non-feminines. I think there's definitely a difference, if subtle, between true ambiguity and underspecification. It's not that υἱοῦ is ambiguous as to gender, it's just that the cell doesn't distinguish masculine from neuter. Lexical knowledge is still being used, otherwise it could be feminine (or even a middle imperative!).

So, in short, syncretism inherent to the paradigmatic system is captured well but other forms of ambiguity will need to be handled other ways (potentially via a disjunctive list of possibilities). This seems a reasonable, practical compromise.

Let me know your thoughts. There's definitely still more to do and I do plan on expressing more ambiguity with some form of disjunction. I'll probably do a post soon with some more thoughts (and stats) on that.