Functional Dependency in the MorphGNT Table

Often it’s useful to see whether certain columns in a table can be entirely determined by others. For example, can you unambiguously get the lemma from just the form? (The answer is no, so a more useful question is which forms are ambiguous as to lemma.) Does knowing the part-of-speech help? Here I provide some code and give some examples.

At the end I provide the script used.

Run in the same directory as the MorphGNT SBLGNT files, the script is invoked like this:

$ ./ 6 7

What this is telling us is that there are 45 cases where the value of column 6 (the normalized form) gives us multiple possible values for column 7 (the lemma). In relational database terms, we say that column 7 is not functionally dependent on (or not functionally determined by) column 6 because of those 45 cases.
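
To make the definition concrete, here is a minimal Python sketch of the check. It illustrates the idea and is not necessarily how the script itself does it: the function name `violations` and the sample pairs are made up for illustration, with the two ambiguous forms borrowed from the verbose listing in this post.

```python
from collections import defaultdict


def violations(pairs):
    """Group dependent values by determinant value and keep only the
    determinants that were seen with more than one dependent value."""
    seen = defaultdict(set)
    for determinant, dependent in pairs:
        seen[determinant].add(dependent)
    return {key: values for key, values in seen.items() if len(values) > 1}


# Illustrative (normalized form, lemma) pairs, not real MorphGNT lines.
pairs = [
    ("λόγος", "λόγος"),
    ("ἴδε", "ἴδε"),
    ("ἴδε", "ὁράω"),
    ("καλῶν", "καλός"),
    ("καλῶν", "καλέω"),
    ("λόγος", "λόγος"),  # a repeated pair adds nothing new to the set
]

ambiguous = violations(pairs)
print(len(ambiguous))  # 2 for this sample: ἴδε and καλῶν
for form, lemmas in ambiguous.items():
    print(form, lemmas)
```

If the returned dictionary is empty, the dependent column is functionally determined by the determinant column, at least for the rows examined.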

If you run:

$ ./ -v 6 7

it will actually list all 45, starting with something like:

ἄμωμον {'ἄμωμος', 'ἄμωμον'}
ἴδε {'ἴδε', 'ὁράω'}
ὑποταγῇ {'ὑποταγή', 'ὑποτάσσω'}
καλῶν {'καλός', 'καλέω'}
Ἰουδαίας {'Ἰουδαῖος', 'Ἰουδαία'}

You can also give more than one column for either the determinant or dependent.

For example, does knowing the form AND part-of-speech determine the lemma?

Turns out there are only 8 exceptions in the current MorphGNT/SBLGNT:

$ ./ -v 6,2 7
Ἅννα N- {'Ἅννα', 'Ἅννας'}
ἀνώτερον A- {'ἀνώτερος', 'ἀνώτερον'}
ἀλάβαστρον N- {'ἀλάβαστρος', 'ἀλάβαστρον'}
χρυσᾶ A- {'χρύσεος', 'χρυσοῦς'}
μακράν A- {'μακράν', 'μακρός'}
ὕστερον A- {'ὕστερον', 'ὕστερος'}
ταχύ A- {'ταχύ', 'ταχύς'}
ἤρχοντο V- {'ἄρχω', 'ἔρχομαι'}
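
One way to support comma-separated column lists like `6,2` is to turn each list into a tuple of values and use that tuple as the key. The sketch below assumes 1-based column numbers as used on the command line above. The names `parse_spec`, `violations` and `row` are hypothetical, and the rows are placeholders carrying only the part-of-speech, form and lemma of two examples from this post.

```python
from collections import defaultdict


def parse_spec(spec):
    """Turn a 1-based spec like "6,2" into 0-based indices (5, 1)."""
    return tuple(int(n) - 1 for n in spec.split(","))


def violations(rows, determinant_spec, dependent_spec):
    """Return {determinant tuple: set of dependent tuples} for every
    determinant that occurs with more than one dependent."""
    det = parse_spec(determinant_spec)
    dep = parse_spec(dependent_spec)
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in det)
        value = tuple(row[i] for i in dep)
        seen[key].add(value)
    return {k: v for k, v in seen.items() if len(v) > 1}


def row(pos, form, lemma):
    # Hypothetical helper: only columns 2, 6 and 7 matter here, so the
    # other four columns are filled with "-" placeholders.
    return ["-", pos, "-", "-", "-", form, lemma]


rows = [
    row("A-", "καλῶν", "καλός"),
    row("V-", "καλῶν", "καλέω"),
    row("V-", "ἤρχοντο", "ἄρχω"),
    row("V-", "ἤρχοντο", "ἔρχομαι"),
]

by_form = violations(rows, "6", "7")
by_form_and_pos = violations(rows, "6,2", "7")
print(len(by_form))          # 2: both forms are ambiguous on their own
print(len(by_form_and_pos))  # 1: only ἤρχοντο stays ambiguous
```

Adding the part-of-speech separates the adjective καλῶν from the participle καλῶν, while ἤρχοντο remains ambiguous because both of its lemmas are verbs.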

There are other things that can be explored with this. How many lemmas have more than one part-of-speech in the MorphGNT/SBLGNT?

$ ./ 7 2

How many forms have more than one parse analysis extant in the text, even if you know the lemma and part-of-speech:

$ ./ 6,7,2 3

Given a lemma, part-of-speech and parse analysis, how many cases are there where multiple alternative forms are seen:

$ ./ 7,2,3 6

Looking at these with the -v option, you can see some are unavoidable:

ὁράω V- 1AAI-P-- {'εἴδομεν', 'εἴδαμεν'}
κλείς N- ----APF- {'κλεῖς', 'κλεῖδας'}

whereas others are likely corrections that need to be made to the lemmatization:

τις RI ----GSM- {'τινος', 'τινός'}

The most recent set of corrections to MorphGNT/SBLGNT (which will be in release 6.07) stem from this sort of analysis.

There is still more to discuss and resolve, however. See the issues on GitHub for details and to help in the discussion.

The script
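
A minimal sketch of how a script along these lines could be put together follows. It is an approximation under stated assumptions, not necessarily the exact code behind the numbers above. It assumes the data files match `*-morphgnt.txt` and hold space-separated columns, that the plain run prints only the count, and that `-v` prints one line per exception. The function names are my own choices for illustration.

```python
import argparse
import glob
import os
from collections import defaultdict


def parse_spec(spec):
    """Turn a 1-based spec like "6,2" into 0-based indices (5, 1)."""
    return tuple(int(n) - 1 for n in spec.split(","))


def read_rows(directory="."):
    """Yield every non-blank line of the data files as a list of columns."""
    pattern = os.path.join(directory, "*-morphgnt.txt")  # assumed file naming
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield line.split()


def find_violations(rows, det, dep):
    """Map each determinant (columns joined by spaces) to its set of
    dependents, keeping only determinants with more than one dependent."""
    seen = defaultdict(set)
    for row in rows:
        key = " ".join(row[i] for i in det)
        value = " ".join(row[i] for i in dep)
        seen[key].add(value)
    return {k: v for k, v in seen.items() if len(v) > 1}


def main(argv, directory="."):
    parser = argparse.ArgumentParser(
        description="Check whether some columns functionally determine others.")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="list each exception instead of just the count")
    parser.add_argument("determinant", help="comma-separated columns, e.g. 6,2")
    parser.add_argument("dependent", help="comma-separated columns, e.g. 7")
    args = parser.parse_args(argv)

    found = find_violations(read_rows(directory),
                            parse_spec(args.determinant),
                            parse_spec(args.dependent))
    if args.verbose:
        for key, values in found.items():
            print(key, values)
    else:
        print(len(found))
    return len(found)
```

To use it from the shell, call `main(sys.argv[1:])` under the usual `if __name__ == "__main__":` guard. Collecting the dependents in a set means repeated rows are ignored, so only distinct alternatives count as exceptions.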
