Functional Dependency in the MorphGNT Table

Often it’s useful to see whether certain columns in a table can be entirely determined by others. For example, can you unambiguously get the lemma from just the form? (The answer is no, so a more useful question is which forms are ambiguous as to lemma.) Does knowing the part-of-speech help? Here are some examples, along with the code used to produce them.

At the end I provide the script used.

Run from the same directory as the MorphGNT SBLGNT files, the script works like this:

$ ./dep.py 6 7
45

This tells us that there are 45 values of column 6 (the normalized form) that occur with more than one value of column 7 (the lemma). In relational database terms, we say that column 7 is not functionally dependent on (or not functionally determined by) column 6 because of those 45 cases.
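The idea behind such a check can be sketched in a few lines of Python. This is only a minimal illustration, not the dep.py script itself: the function name find_violations and the toy rows are made up for the example, and only columns 6 and 7 carry real values.

```python
from collections import defaultdict


def find_violations(rows, determinant, dependent):
    """Return the determinant values that occur with more than one dependent value.

    Column numbers are 1-based, matching the command-line usage above.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant - 1]].add(row[dependent - 1])
    # Keep only the keys for which the functional dependency fails.
    return {key: values for key, values in seen.items() if len(values) > 1}


# Toy rows (not real MorphGNT lines): columns 1-5 are placeholders,
# column 6 is the normalized form and column 7 is the lemma.
rows = [
    ("-", "-", "-", "-", "-", "καλῶν", "καλός"),
    ("-", "-", "-", "-", "-", "καλῶν", "καλέω"),
    ("-", "-", "-", "-", "-", "λόγος", "λόγος"),
    ("-", "-", "-", "-", "-", "λόγος", "λόγος"),
]

ambiguous = find_violations(rows, 6, 7)
for form, lemmas in ambiguous.items():
    print(form, lemmas)
print(len(ambiguous))  # 1: only καλῶν has more than one lemma
```

Repeated rows with the same form and lemma do not count as violations, because the dependent values are collected in a set.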

If you run:

$ ./dep.py -v 6 7

it will actually list all 45, starting with something like:

ἄμωμον {'ἄμωμος', 'ἄμωμον'}
ἴδε {'ἴδε', 'ὁράω'}
ὑποταγῇ {'ὑποταγή', 'ὑποτάσσω'}
καλῶν {'καλός', 'καλέω'}
Ἰουδαίας {'Ἰουδαῖος', 'Ἰουδαία'}
...

You can also give more than one column for either the determinant or dependent.

For example, does knowing the form AND part-of-speech determine the lemma?

Turns out there are only 8 exceptions in the current MorphGNT/SBLGNT:

$ ./dep.py -v 6,2 7
Ἅννα N- {'Ἅννα', 'Ἅννας'}
ἀνώτερον A- {'ἀνώτερος', 'ἀνώτερον'}
ἀλάβαστρον N- {'ἀλάβαστρος', 'ἀλάβαστρον'}
χρυσᾶ A- {'χρύσεος', 'χρυσοῦς'}
μακράν A- {'μακράν', 'μακρός'}
ὕστερον A- {'ὕστερον', 'ὕστερος'}
ταχύ A- {'ταχύ', 'ταχύς'}
ἤρχοντο V- {'ἄρχω', 'ἔρχομαι'}
8
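To see why adding a column to the determinant can shrink the list of exceptions, here is a small sketch that accepts comma-separated column specs like the ones on the command line. It is an illustration under assumptions rather than the actual script: parse_columns and find_violations are hypothetical names, and the rows are toy data with placeholder values outside columns 2, 6 and 7.

```python
from collections import defaultdict


def parse_columns(spec):
    """Turn a spec such as "6,2" into zero-based indexes (5, 1)."""
    return tuple(int(n) - 1 for n in spec.split(","))


def find_violations(rows, determinant_spec, dependent_spec):
    """Group rows by the determinant columns and report keys with several dependent values."""
    det = parse_columns(determinant_spec)
    dep = parse_columns(dependent_spec)
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in det)
        value = tuple(row[i] for i in dep)
        seen[key].add(value)
    return {key: values for key, values in seen.items() if len(values) > 1}


# Toy rows (not real MorphGNT lines): column 2 is the part-of-speech,
# column 6 the normalized form, column 7 the lemma.
rows = [
    ("-", "A-", "-", "-", "-", "καλῶν", "καλός"),
    ("-", "V-", "-", "-", "-", "καλῶν", "καλέω"),
    ("-", "V-", "-", "-", "-", "ἤρχοντο", "ἄρχω"),
    ("-", "V-", "-", "-", "-", "ἤρχοντο", "ἔρχομαι"),
]

by_form = find_violations(rows, "6", "7")
by_form_and_pos = find_violations(rows, "6,2", "7")
print(len(by_form), len(by_form_and_pos))  # 2 1
```

In the toy data, part-of-speech separates the two lemmas for καλῶν, but ἤρχοντο stays ambiguous because both candidate lemmas are verbs.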

There are other things that can be explored with this. How many lemmas have more than one part-of-speech in the MorphGNT/SBLGNT?

$ ./dep.py 7 2
70

How many forms have more than one parse analysis extant in the text, even if you know the lemma and part-of-speech:

$ ./dep.py 6,7,2 3
903

Given a lemma, part-of-speech and parse analysis, how many cases are there where multiple alternative forms are seen:

$ ./dep.py 7,2,3 6
132

Looking at these with the -v option, you can see some are unavoidable:

ὁράω V- 1AAI-P-- {'εἴδομεν', 'εἴδαμεν'}
κλείς N- ----APF- {'κλεῖς', 'κλεῖδας'}

whereas others are likely corrections that need to be made to the lemmatization:

τις RI ----GSM- {'τινος', 'τινός'}
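One possible way to triage a long list like this, assuming that forms differing only in accents or breathings are the ones most worth a second look, is to compare the forms with their diacritics stripped. The helper names below are hypothetical and the heuristic is my own assumption, not part of the script; it only flags candidates and does not decide whether a correction is actually needed.

```python
import unicodedata


def strip_diacritics(word):
    """Decompose the word (NFD) and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def differs_only_in_diacritics(forms):
    """True if all forms collapse to the same string once diacritics are removed."""
    return len({strip_diacritics(form) for form in forms}) == 1


print(differs_only_in_diacritics({"τινος", "τινός"}))      # True
print(differs_only_in_diacritics({"εἴδομεν", "εἴδαμεν"}))  # False
print(differs_only_in_diacritics({"κλεῖς", "κλεῖδας"}))    # False
```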

The most recent set of corrections to MorphGNT/SBLGNT (which will be in release 6.07) stem from this sort of analysis.

There is still more to discuss and resolve, however. See https://github.com/morphgnt/sblgnt/issues/32 and other issues on GitHub for details and to help in the discussion.

The script