Lemmatization for the Morphological Lexicon

As I slowly expand my plans for a Morphological Lexicon of New Testament Greek to a Morphological Lexicon of Ancient Greek, I'm dealing with extra challenges in lemmatization.

One of the things I'm doing to verify my work is to take existing morphologically tagged and lemmatized Greek texts and see whether my code and (more importantly) data generate the same form given the lemma and morphosyntactic properties. In particular, I'm currently doing this with noun forms in Vanessa Gorman's Greek Dependency Trees.
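As a rough illustration of that round-trip check, here is a minimal Python sketch. It is not the actual code: the `TaggedForm` record, the `"ASF"` tag encoding, and the toy lookup-table generator are all assumptions made up for the example. Comparing under Unicode NFC is also an assumption, added because Greek text from different sources can encode the same accented letter in different ways.

```python
import unicodedata
from typing import Callable, Iterable, List, NamedTuple, Optional, Tuple


class TaggedForm(NamedTuple):
    form: str   # attested form in the annotated text
    lemma: str  # lemma assigned to that form
    tag: str    # morphosyntactic properties (hypothetical encoding)


# A generator maps (lemma, tag) to a form, or None if it has no data.
Generator = Callable[[str, str], Optional[str]]


def nfc(text: str) -> str:
    """Normalize to NFC so equivalent encodings compare equal."""
    return unicodedata.normalize("NFC", text)


def find_mismatches(
    records: Iterable[TaggedForm], generate: Generator
) -> List[Tuple[TaggedForm, Optional[str]]]:
    """Return every record whose attested form is not what was generated."""
    mismatches = []
    for record in records:
        produced = generate(record.lemma, record.tag)
        if produced is None or nfc(produced) != nfc(record.form):
            mismatches.append((record, produced))
    return mismatches


# Toy stand-in for the real generation code and data (hypothetical).
TOY_FORMS = {("μέλιττα", "ASF"): "μέλιτταν"}


def toy_generate(lemma: str, tag: str) -> Optional[str]:
    return TOY_FORMS.get((lemma, tag))


records = [
    TaggedForm("μέλιτταν", "μέλιττα", "ASF"),  # lemma matches the data
    TaggedForm("μέλιτταν", "μέλισσα", "ASF"),  # lemma the toy data lacks
]
mismatches = find_mismatches(records, toy_generate)
```

Each mismatch is either a gap in the data, a bug in the generation code, or a lemma that serves a different purpose from the one needed here.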

Along the way, I'm having to amend a number of Gorman's lemmas. Not because they are wrong per se, but because they are serving a different purpose than what I need. This is not a problem unique to the Gorman trees and I gave an entire talk at SBL 2017 about related issues.

As I said there, a lemma provides a link between a token in a text and an entry in a lexical resource. It acts as a key by which to retrieve a record in a lexical database (traditionally a print dictionary).

There are two problems with this, however:

you may wish to link to multiple independent lexical resources and the entries in each may not have a one-to-one mapping
data within the lexical entry may not be valid for all tokens linking to that lexical entry and if the lemma only identifies the lexical entry as a whole, there's no way to discern which specific properties apply in each case.

I give many examples in my SBL 2017 talk. One obvious example of a problem is words with multiple senses. Sometimes texts will be tagged with more sense information, λέγω3 rather than just λέγω for example. This typically assumes a single canonical lexical resource (like the LSJ for Greek).

But the problem exists with other information attached to a lexical entry. Notably relevant in my case is morphological information like stems or inflectional classes.

Now if your goal in lemmatizing a text is to link back to an entry in LSJ (or maybe a particular sense), that's fine, but the lemma is then not a precise enough reference to hang morphological information off. And this is why I'm having to augment the lemmatization in annotated texts like Vanessa Gorman's (and later the Diorisis corpus).

Much of this has to do with dialect variation. For this reason I didn't discuss many examples in my SBL 2017 talk (which was primarily focused on the New Testament). I did have an example of spelling variation, though, which is similar.

The example I gave there was ἀνάπειρος versus ἀνάπηρος. And, as I said at the time, you may not care to distinguish these if you're doing lexical semantics but if you're a textual critic or phonologist, you might. And I made the following point which is particularly relevant here:

why should ἀναπείρους be lemmatized ἀνάπηρος?

Now again, it seems innocuous if your goal in lemmatizing is just to link to a canonical dictionary. But if you're doing any kind of morphological modelling, then ἀνα-πείρ-ο- and ἀνά-πηρ-ο- are different stems. Because they are different stems, there are different objects needed to hang the "stem" property off and you want the token "ἀναπείρους" to point to the object with stem = ἀνα-πείρ-ο- not the object with stem = ἀνά-πηρ-ο-.
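To make the "different objects" point concrete, here is a small Python sketch. The `LemmaEntry` class and the two lookup tables are hypothetical names for illustration only, not the actual data model. It shows the same token retrieving a different stem depending on which lemma it is keyed to.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LemmaEntry:
    lemma: str  # the key a token points to
    stem: str   # morphological property hanging off that key


# One object per stem, so each has its own "stem" property.
ENTRIES: Dict[str, LemmaEntry] = {
    "ἀνάπειρος": LemmaEntry("ἀνάπειρος", "ἀνα-πείρ-ο-"),
    "ἀνάπηρος": LemmaEntry("ἀνάπηρος", "ἀνά-πηρ-ο-"),
}

# Two lemmatizations of the same token (both tables are hypothetical).
dictionary_lemma = {"ἀναπείρους": "ἀνάπηρος"}      # link to a canonical headword
morphological_lemma = {"ἀναπείρους": "ἀνάπειρος"}  # link to the matching stem


def stem_for(token: str, lemmatization: Dict[str, str]) -> str:
    """Follow token -> lemma -> entry and read off the stem."""
    return ENTRIES[lemmatization[token]].stem


token = "ἀναπείρους"
stem_via_dictionary = stem_for(token, dictionary_lemma)
stem_via_morphology = stem_for(token, morphological_lemma)
```

With the dictionary-style lemma the token ends up at the ἀνά-πηρ-ο- stem, which is not the stem it actually shows. Only the narrower lemma gets it to ἀνα-πείρ-ο-.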

My SBL 2017 talk briefly listed a few other examples:

ἀναλόω ~ ἀναλίσκω
ἀποκτείνω ~ ἀποκτέννω
ἑλκύω ~ ἕλκω
ἵστημι ~ ἱστάνω

as well as:

ἀλλάσσω ~ ἀλλάττω
ἁρμόττω ~ ἁρμόζω
κλαίω ~ κλάω
αὐξάνω ~ αὔξω
μείγνυμι ~ μίγνυμι
οἶμαι ~ οἴομαι

As I tried to emphasize in my talk, choosing to lump (or split) these isn't wrong per se. But for my specific morphological purposes, and to the extent that there is a difference in any morphological property, whether it be the stem or the distinguisher paradigm or the inflectional class, or whatever, there needs to be a separate object to point to.

For this reason I'm adding distinct lemmas for each dialect. So far that has meant changing the lemmas for about 5% of the forms in the Gorman trees (note that's unique forms, not tokens). And so μέλιτταν gets "lemmatized" by me as μέλιττα not as μέλισσα, μέγαθος gets lemmatized as μέγαθος not μέγεθος, μαντηίῳ gets lemmatized as μαντήιον not μαντεῖον, and so on.

That is not to do away with the lumping all together. I can collect variations together into groups and link the group to, say, the LSJ entry. This is then entirely appropriate to use for properties that are shared across dialect / spelling variations.

This is a key point about the lattice approach described in my SBL 2017 talk. You have an object to point to where you need to be specific AND an object to point to where you can be general.
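A two-level simplification of that idea can be sketched in Python as follows. This is only an illustration under assumptions: a real lattice could have more levels, and the class names, the property names (`stem`, `headword`) and the stem strings are placeholders rather than actual data.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Group:
    """General node: properties shared across dialect/spelling variants."""
    key: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class NarrowLemma:
    """Specific node: properties that differ between variants."""
    key: str
    group: Group
    properties: Dict[str, str] = field(default_factory=dict)

    def lookup(self, name: str) -> Optional[str]:
        """Prefer the specific value, fall back to the shared one."""
        if name in self.properties:
            return self.properties[name]
        return self.group.properties.get(name)


# Placeholder data: the property names and stem strings are assumptions.
bee = Group("μέλισσα", {"headword": "μέλισσα"})
melitta = NarrowLemma("μέλιττα", bee, {"stem": "μελιττ-"})
melissa = NarrowLemma("μέλισσα", bee, {"stem": "μελισσ-"})

specific = melitta.lookup("stem")     # answered by the narrow lemma
shared = melitta.lookup("headword")   # answered by the group
```

A token tagged μέλιττα can then answer stem questions from its own node and dictionary questions from the shared group, without either lemmatization replacing the other.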

Furthermore, sometimes inflected forms can be ambiguous as to their "narrow" lemma. A ᾱ~η alternation between dialects in an ending, for example, will be neutralized in endings with a short ᾰ. And so even in the morphologically-focused case, there is sometimes a need for lumping across dialects.
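One way to handle such forms is to record every narrow lemma a form could belong to and fall back to the shared group when there is more than one. The Python sketch below assumes a hand-written candidate table. The χώρα ~ χώρη pair is an illustrative example chosen for this sketch rather than taken from the Gorman data, and `resolve` is a hypothetical helper.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# Hypothetical analyses: each form maps to the narrow lemmas it could realize.
CANDIDATES: Dict[str, FrozenSet[str]] = {
    "χώραν": frozenset({"χώρα"}),         # only the ᾱ variant
    "χώρην": frozenset({"χώρη"}),         # only the η variant
    "χῶραι": frozenset({"χώρα", "χώρη"}), # same form under either variant
}

# Narrow lemma -> group that lumps the variants together (assumed grouping).
GROUP_OF: Dict[str, str] = {"χώρα": "χώρα", "χώρη": "χώρα"}


def resolve(form: str) -> Optional[Tuple[str, str]]:
    """Return ("narrow", lemma) if the form is unambiguous, ("group", key)
    if all candidates share one group, else None (unknown form, or
    candidates from different groups that need manual review)."""
    candidates = CANDIDATES.get(form)
    if not candidates:
        return None
    if len(candidates) == 1:
        return ("narrow", next(iter(candidates)))
    groups = {GROUP_OF[c] for c in candidates}
    if len(groups) == 1:
        return ("group", groups.pop())
    return None


unambiguous = resolve("χώρην")
ambiguous = resolve("χῶραι")
```

The ambiguous form is attached to the group rather than forced onto one dialect's lemma, so no information is invented that the form itself does not carry.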

It's not just a matter of dialects and spelling variation: suppletion and heteroclisis also come into play here and benefit from this approach.

This is all extra work but I think it's necessary for a more precise, corpus-based language description.

I want to finish working through the Gorman nouns before I share some of this data but that should happen in the coming months. And I want to emphasize that, in most cases, I'm not actually changing the lemmatization, I'm just adding to it (although I am finding the occasional error whose correction I will send upstream).

It's still early days and one thing I haven't settled on is good terminology. I'm inclined to go with the lexeme ~ flexeme distinction. But then do I call the key for the latter the "flemma"?