Vanessa Gorman’s Lemmatisation Now in vocabulary-tools

Last year I started the Python library vocabulary-tools to consolidate the various scripts I’ve written over the years to analyse vocabulary in texts, particularly the New Testament. I’ve just added support for the vocabulary in Vanessa Gorman’s treebanks.

As part of the greek-texts project, it’s important to have as many texts lemmatised as possible. For a while (for example in https://vocab.perseus.org) I used Giuseppe Celano’s automated lemmatisation of Perseus. Recently I started cleaning up the Diorisis Ancient Greek Corpus, which is also automatically lemmatised.

Automated lemmatisation is around 90% accurate, but that’s quite low for vocabulary work, especially because the lemmatisation errors are often systematic and so can skew the statistics significantly.

Ultimately, we need hand-curated lemmatisation, and one of the goals of the greek-texts project is to help facilitate that. I already have a lemmatisation of the Greek New Testament. There is also the Ancient Greek Dependency Treebank. But one of the most impressive efforts in this area (especially given that it is a solo effort) is Vanessa Gorman’s work.

There are now over 500,000 tokens of Greek prose treebanked by Professor Gorman. That means not only lemmas but morphosyntactic tagging and syntactic dependency tagging as well.

But for the short term, it’s the lemmas I’m interested in. I extracted them from the XML format produced by Arethusa and built lemma counts that could be loaded into vocabulary-tools.
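A minimal sketch of that kind of extraction, not the actual code used for vocabulary-tools. It assumes the treebank XML follows the familiar Ancient Greek Dependency Treebank layout, with `word` elements carrying `lemma` and `postag` attributes, and that a `postag` starting with `u` marks punctuation. The sample data and the `lemma_counts` function are made up for illustration.

```python
import unicodedata
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hand-made sample in the assumed treebank layout (not real Gorman data).
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="ὁ" lemma="ὁ" postag="l-s---mn-"/>
    <word id="2" form="ἄνθρωπος" lemma="ἄνθρωπος" postag="n-s---mn-"/>
    <word id="3" form="λέγει" lemma="λέγω" postag="v3spia---"/>
    <word id="4" form="." lemma="." postag="u--------"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="λέγει" lemma="λέγω" postag="v3spia---"/>
    <word id="2" form="ὁ" lemma="ὁ" postag="l-s---mn-"/>
    <word id="3" form="ἀνήρ" lemma="ἀνήρ" postag="n-s---mn-"/>
  </sentence>
</treebank>"""


def lemma_counts(xml_text):
    """Count lemmas across all <word> elements in a treebank XML string."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for word in root.iter("word"):
        lemma = word.get("lemma")
        postag = word.get("postag") or ""
        # Skip tokens with no lemma and (by assumption) punctuation tokens.
        if not lemma or postag.startswith("u"):
            continue
        # Normalise so precomposed and decomposed accents count as one lemma.
        counts[unicodedata.normalize("NFC", lemma)] += 1
    return counts


if __name__ == "__main__":
    for lemma, n in lemma_counts(SAMPLE).most_common():
        print(lemma, n)
```

Normalising to NFC matters in practice because Greek text from different sources often mixes precomposed and combining diacritics, which would otherwise split one lemma into several entries.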

There’s still a lot of work to do on my library but I can now do things like generate an 80% vocabulary list for the Gorman corpus. Or see what words you’d have to learn to read Plato’s Apology if you can read Lysias’s On the Murder of Eratosthenes and the New Testament at the 95% level.
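Both questions come down to cumulative coverage over lemma counts. The sketch below shows the idea with plain `collections.Counter` objects. It does not use the vocabulary-tools API, and the function names `coverage_list` and `words_to_learn`, as well as the counts, are hypothetical.

```python
from collections import Counter


def coverage_list(counts, target):
    """Most frequent lemmas needed to cover `target` (0 to 1) of all tokens."""
    total = sum(counts.values())
    covered = 0
    learned = []
    for lemma, n in counts.most_common():
        if covered >= target * total:
            break
        learned.append(lemma)
        covered += n
    return learned


def words_to_learn(target_counts, known, target):
    """Extra lemmas needed to reach `target` coverage of a text,
    given a set of lemmas the reader already knows."""
    total = sum(target_counts.values())
    # Tokens already covered by known vocabulary.
    covered = sum(n for lemma, n in target_counts.items() if lemma in known)
    new = []
    for lemma, n in target_counts.most_common():
        if covered >= target * total:
            break
        if lemma not in known:
            new.append(lemma)
            covered += n
    return new


if __name__ == "__main__":
    # Made-up counts standing in for texts already read and a new text.
    already_read = Counter({"καί": 50, "ὁ": 40, "λέγω": 8, "ἀνήρ": 2})
    new_text = Counter({"ὁ": 30, "καί": 20, "δίκη": 6, "ψυχή": 4})

    known = set(coverage_list(already_read, 0.95))
    print(words_to_learn(new_text, known, 0.95))
```

Taking the most frequent unknown lemmas first gives the shortest list that reaches the target, which is usually what a learner wants from a vocabulary list.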

I also took the opportunity to add more features to vocabulary-tools including incorporating the code I wrote for the Subcorpus Vocabulary Statistics post (which was based on the Celano lemmatisation).

Ultimately I’d like more post-beginner Greek prose and the LXX lemmatised. I’m currently working on Plato’s Crito and Epictetus’s Enchiridion. If you’re interested in any of this stuff, please check out the greek-texts project and join our Slack workspace.