Over the years, when generating vocab coverage stats or orderings for graded readers, I’ve used either lemmas or inflected forms as the items being learnt.

The problem with using inflected forms is that it assumes knowing one form of a lexeme has nothing to do with knowing any other form of that lexeme. The problem with using lemmas is that it assumes knowing one form of a lexeme is enough to know all of them.

Of course, the path forward lies somewhere in between and one of the motivations for all my Morphological Lexicon work is to have the necessary data in machine-actionable form to take a much more intelligent approach to the relationship between knowing one form and knowing another.

This gets into some very deep areas of psycholinguistics and learnability but, for now, I’m mostly just looking for a better measure of the “cost” or “effort” of learning a new form, for purposes such as judging readability. Anything would be better than assuming either that all forms are equal or that learning a lemma gives you all the forms.

An initial improvement could be made by using themes and distinguishers. Consider λόγου, whose theme is λογ and distinguisher is ου. The theme identifies the lexeme (by definition it’s the part of the word shared by all cells in a paradigm for a particular lexeme). The distinguisher both identifies some morphosyntactic properties (the fact it’s a genitive singular, assuming we can tell it’s a nominal) and gives some hints as to inflectional class (i.e. it reduces the possible distinguishers other cells in the paradigm can take).

So a simple way of modeling things is to say that, in order to understand λόγου, you need to know λογ and ου. Breaking apart the themes and distinguishers is an improvement over just looking at lexemes or forms. Using the theme takes care of suppletive stems too. (Although it does raise the question: does learning that two suppletive stems belong to the same lexeme cost effort or save it?)
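A minimal sketch of this simple model, assuming each form has already been split into a theme and a distinguisher. The function name `incremental_costs` and the flat cost of 1 per new item are illustrative assumptions, not part of the Morphological Lexicon.

```python
def incremental_costs(forms):
    """Return (form, cost) pairs for forms met in reading order.

    Each form is given as (surface form, theme, distinguisher). The cost is
    the number of items (theme, distinguisher) not met before. Charging a
    flat 1 per new item is an assumption made purely for illustration.
    """
    known_themes = set()
    known_distinguishers = set()
    costs = []
    for form, theme, distinguisher in forms:
        cost = 0
        if theme not in known_themes:
            cost += 1
            known_themes.add(theme)
        if distinguisher not in known_distinguishers:
            cost += 1
            known_distinguishers.add(distinguisher)
        costs.append((form, cost))
    return costs


# Hand-segmented examples; the second lexeme is only there for contrast.
forms = [
    ("λόγου", "λογ", "ου"),
    ("λόγος", "λογ", "ος"),
    ("νόμου", "νομ", "ου"),
    ("νόμος", "νομ", "ος"),
]
costs = incremental_costs(forms)
```

Under these assumptions the first form costs both a theme and a distinguisher, the next two cost one new item each, and the last is free because both of its parts are already known. Counting whole forms would charge for all four; counting lemmas would charge for only two.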

There are a few situations that need more consideration though. Firstly, there are stems that aren’t truly suppletive but are systematically derived from one another (e.g. λαμβαν / λαβ). To a first approximation, you could just model this as full suppletion in terms of effort, but a more refined approach would be to give a “discount” on the effort of learning λαμβαν if you already know λαβ, or vice versa. Even then, you’d likely only want to provide that discount once learning the nu-infix pattern had been costed.
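One way the discount could be expressed, as a sketch only. The cost values, the `RELATED_STEMS` table and the pattern label `"nu-infix"` are hypothetical placeholders; real values would have to come from somewhere more principled.

```python
FULL_COST = 1.0        # assumed cost of a stem with no known relative
DISCOUNTED_COST = 0.25  # assumed cost once a related stem and its pattern are known

STEM_A = "λαμβαν"
STEM_B = "λαβ"

# Hypothetical table: sets of systematically related stems and the pattern linking them.
RELATED_STEMS = {frozenset([STEM_A, STEM_B]): "nu-infix"}


def stem_cost(stem, known_stems, known_patterns):
    """Cost of learning `stem` given what the reader already knows.

    The discount applies only if a related stem is known AND the pattern
    relating the two has itself already been learnt (and so costed).
    """
    for group, pattern in RELATED_STEMS.items():
        if stem in group:
            has_relative = bool((group - {stem}) & known_stems)
            if has_relative and pattern in known_patterns:
                return DISCOUNTED_COST
    return FULL_COST


with_pattern = stem_cost(STEM_A, {STEM_B}, {"nu-infix"})
without_pattern = stem_cost(STEM_A, {STEM_B}, set())
```

Here `with_pattern` gets the discount, while `without_pattern` is charged in full: knowing λαβ alone is treated as no better than suppletion until the pattern itself has been paid for.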

Secondly, consider families of distinguishers for the same properties that differ because of sandhi (either in that particular cell or in others, causing the theme to have less of the stem). For example, here are the 28 distinguishers for dative singular nominals according to my current analysis: -ᾳ, -αντι, -ατι, -γι, -δι, -ει, -ειρι, -ενι, -εντι, -ῃ, -ι, -ιδι, -κι, -κτι, -νι, -ντι, -οϊ, -ονι, -οντι, -οτι, -ουντι, -πι, -ρι, -τι, -τῳ, -υϊ, -ῳ, -ωντι. The reason 28 are needed is sandhi in other cells such as the nominative singular. The only ending is -ι, so you really only need to know that one thing (plus perhaps that iota is subscripted after a long alpha, eta or omega). The distinguisher analysis is still useful (particularly for its role in hinting at inflectional class) but the cost should be massively discounted once you recognize the -ι pattern.
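The claim that all 28 distinguishers share the single ending -ι can be checked mechanically. The sketch below uses Python’s standard `unicodedata` module to write each iota subscript as a full iota and to drop the diaeresis; the helper name `underlying` is an assumption made for illustration.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"  # combining iota subscript

DATIVE_SINGULAR = (
    "ᾳ αντι ατι γι δι ει ειρι ενι εντι ῃ ι ιδι κι κτι νι ντι οϊ ονι "
    "οντι οτι ουντι πι ρι τι τῳ υϊ ῳ ωντι"
).split()


def underlying(distinguisher):
    """Spell out iota subscripts as full iotas and strip other diacritics."""
    decomposed = unicodedata.normalize("NFD", distinguisher)
    # Replace the subscript first: it is itself a combining mark and would
    # otherwise be stripped along with the diaeresis on the next line.
    decomposed = decomposed.replace(YPOGEGRAMMENI, "ι")
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


all_end_in_iota = all(underlying(d).endswith("ι") for d in DATIVE_SINGULAR)
```

With a check like this in place, a cost model could charge once for the -ι pattern and then apply a heavy discount to each further member of the family, while still keeping the individual distinguishers around for what they say about inflectional class.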

Thirdly, I haven’t yet talked about costs and discounts for the actual sandhi rules. Should the -ους ending in the genitive singular (for stems ending in εσ or οσ) be discounted if you know both the genitive singular ending -ος and the εσ+ος → ους / οσ+ος → ους sandhi rules?
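If the answer turns out to be yes, one way to model it is sketched below. The rule table holds only the two rules just mentioned; the cost values and the function name `ending_cost` are illustrative assumptions.

```python
FULL_COST = 1.0        # assumed cost of an ending with nothing to derive it from
DISCOUNTED_COST = 0.25  # assumed cost when the ending is derivable from known pieces

GEN_SG = "ος"
CONTRACTED = "ους"

# (stem-final segment, underlying ending) -> surface ending
SANDHI_RULES = {
    ("εσ", GEN_SG): CONTRACTED,
    ("οσ", GEN_SG): CONTRACTED,
}


def ending_cost(surface, known_endings, known_rules):
    """Cost of a surface ending given the endings and sandhi rules already learnt."""
    if surface in known_endings:
        return 0.0
    for (_stem_final, ending), result in known_rules.items():
        # Derivable: some known rule yields this surface form from a known ending.
        if result == surface and ending in known_endings:
            return DISCOUNTED_COST
    return FULL_COST


derivable = ending_cost(CONTRACTED, {GEN_SG}, SANDHI_RULES)
no_rules = ending_cost(CONTRACTED, {GEN_SG}, {})
```

Here `derivable` is discounted because both the -ος ending and a rule producing -ους are known, while `no_rules` is charged in full. Whether the rules themselves should carry a cost when first learnt is left open, as in the text.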

And finally, while I’ve talked a couple of times here about the distinguisher hinting at the inflectional class, that information hasn’t yet been incorporated into any of the costing or discounting discussed above. It’s worth a little more digging in the psycholinguistics literature, but presumably seeing something like πίνακος primes you for recognizing πίναξ. It’s also potentially useful for disambiguation: if you know the nominative plural ends in -ες, for example, then you know that -ος is a genitive singular, not a nominative singular.

There’s clearly lots more to explore but it reinforces what I keep saying: having data like the distinguisher analysis opens us up to explore this sort of thing and potentially incorporate it in new learning tools.

In this post, I’ve just talked about morphology, but things can (and need to) be extended to constructions beyond the word. That requires richer analysis than what I’m doing with the Morphological Lexicon, but it is something I eventually want to tackle as well.