Handling Morphological Ambiguity

On my now page, I currently list "finalising an improved set of morphology tags to use" under Medium Term. As I find myself sometimes having to clarify the motivation for and state of this, I thought I'd share what I just wrote in the Biblical Humanities Slack.

Firstly, some background on previous notes...

Back in 2014, after discussions with Mike Aubrey, I wrote down some notes: Proposal for a New Tagging Scheme. In 2015, after some discussions with Emma Ehrhardt, I wrote down Handling Ambiguity. Then in February 2017, after discussion on the Biblical Humanities Slack, I put forward a concrete Proposal for Gender Tagging.

Here's a slightly cleaned up version of what I wrote in Slack...

All I've done is propose a way of representing certain single-feature ambiguities (especially gender but also nom/acc in neuter). I have not proposed anything for multi-feature ambiguities nor have I actually DONE any work that uses these proposals.

Multi-feature ambiguities at the morphology level (1S vs 3P, GS vs AP, etc) are rarely ambiguous at the syntactic or semantic level for very good reason: the syntactic/semantic-level disambiguation is what allows one to tolerate the ambiguity at the morphology level (one reason that, as a cognitive scientist, I quite like discriminative models of morphology).

But if I continue with my goal to produce a purely morphological analysis, without "downward" disambiguation, then I want to be able to provide a way of representing form over function AND representing ambiguity.

I want to stress again that I think nom vs acc in neuter, or gender in genitive plurals, is a DIFFERENT kind of ambiguity than 1S vs 3P or GS vs AP. For these multi-feature ambiguities (or what my wiki page calls extended syncretism, although I'm not sure I really like that term) it may come down to just providing a disjunction of codes, e.g. GSF∨APF.
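To make the "disjunction of codes" idea concrete, here is a minimal sketch in Python of how such a tag might be held in data. Everything in it (the `MorphTag` class, its method names, and the choice of `∨` as the separator when parsing) is an assumption for illustration only, not part of any of the proposals mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MorphTag:
    """Hypothetical purely morphological tag: every code the form alone allows."""

    readings: frozenset

    @classmethod
    def parse(cls, text):
        # Assumes readings are joined with "∨", as in "GSF∨APF".
        return cls(frozenset(code.strip() for code in text.split("∨")))

    @property
    def is_ambiguous(self):
        # More than one reading means the form alone does not decide.
        return len(self.readings) > 1

    def disambiguate(self, reading):
        # "Downward" disambiguation may only pick a reading the form allows.
        if reading not in self.readings:
            raise ValueError(f"{reading} is not a possible reading of {self}")
        return MorphTag(frozenset([reading]))

    def __str__(self):
        # Sort so the same set of readings always prints the same way.
        return "∨".join(sorted(self.readings))


tag = MorphTag.parse("GSF∨APF")
print(tag, tag.is_ambiguous)
print(tag.disambiguate("GSF"))
```

Holding the readings as a set rather than a string means GSF∨APF and APF∨GSF compare equal, and a fully disambiguated code is simply the one-reading case of the same type.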

Also just in terms of motivation: clearly a morphological analysis that ignores downward disambiguation from syntax or semantics is unhelpful (and potentially even misleading) for exegesis and so a lot of use cases wouldn’t want to do it. HOWEVER, my goal is threefold:

(1) I want to have a way to model the output of automated morphological analysis systems prior to either automated or human downward disambiguation;
(2) as someone studying how morphology works from a cognitive point of view, I care about modelling how ambiguity is resolved at different levels and so want a model that can handle that;
(3) because a student is quite likely to be confronted with this ambiguity, it needs to be in my learning models. I want to be able to search for cases where 1S vs 3P ambiguity or GSF vs APF ambiguity or NSN vs ASN ambiguity is resolved by syntax or semantics so they can be illustrated to the student. I want to know, for a given passage, whether such ambiguity exists so learning can be appropriately scaffolded. And note that, for me, this extends to ambiguity resolved by just accentuation as well (which is another potentially useful thing to model for various applications).
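As a rough sketch of the kind of search described in (3), assume each token carries both a purely morphological column (with readings joined by `∨`) and an existing disambiguated column. The row layout, the sample rows, and the function names below are all hypothetical; no such dataset exists yet.

```python
# Hypothetical rows: (surface form, purely morphological tag, disambiguated tag).
TOKENS = [
    ("τῆς", "GSF", "GSF"),
    ("ἡμέρας", "GSF∨APF", "GSF"),
    ("λόγος", "NSM", "NSM"),
]


def resolved_ambiguities(tokens):
    """Return tokens whose form is ambiguous but whose final tag picks one reading."""
    found = []
    for form, morph_tag, resolved_tag in tokens:
        readings = morph_tag.split("∨")
        if len(readings) > 1 and resolved_tag in readings:
            found.append((form, readings, resolved_tag))
    return found


def has_ambiguity(tokens):
    """Report whether a passage contains any morphologically ambiguous form."""
    return any("∨" in morph_tag for _, morph_tag, _ in tokens)


for form, readings, resolved_tag in resolved_ambiguities(TOKENS):
    print(form, "could be", " or ".join(readings), "but is tagged", resolved_tag)
```

The first function covers finding examples to show a student; the second covers checking a given passage before deciding how to scaffold it.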

In conclusion, I want to again state I'm not at all against a functional, fully disambiguated parse code existing. I have NEVER proposed REPLACING the existing tagging schemes. I just want to add a new column useful for the reasons I've listed above in (1) – (3) and produce new resources that perhaps ONLY use that purely morphological parse code.

Finally I want to note there's an important difference between what we put in our data and how we present it to users. People should not assume that when I'm describing codes to use in data that I'm suggesting that's what end-users should see.

UPDATE: one topic I didn't discuss here is ambiguity in endings that is resolved by knowledge of the stems or principal parts. For example, without a lexicon, there are ambiguities between imperfect and aorist that are easily resolved with additional lexical-level information.