Consolidating Vocabulary Coverage and Ordering Tools

One of my goals for 2019 is to bring more structure to various disperate Greek projects and, as part of that, I’ve started consolidating multiple one-off projects I’ve done around vocabulary coverage statistics and ordering experiments.

Going back at least 15 years (when I first started blogging about Programmed Vocabulary Learning) I’ve had little Python scripts all over the place to calculate various stats, or try out various approaches to ordering.

I’m bringing all of that together in a single repository and updating the code so:

it’s all in one place
it’s usable as a library in other projects or in things like Jupyter notebooks
it can be extended to arbitrary chunking beyond verses (e.g. books, chapters, sentences, paragraphs, pericopes)
it can be extended to other texts such as the Apostolic Fathers, Homer, etc (other languages too!)

I’m partly spurred on by a desire to explore more stuff Seumas Macdonald have been talking about and be more responsive to the occasional inquiries I get from Greek teachers. Also I have a poster Vocabulary Ordering in Text-Driven Historical Language Instruction: Sequencing the Ancient Greek Vocabulary of Homer and the New Testament that got accepted for EUROCALL 2019 in August and this code library helps me not only produce the poster but also make it more reproducible.

Ultimately I hope to write a paper or two out of it as well.

I’ve started the repo at:

https://github.com/jtauber/vocabulary-tools/

where I’ve basically rewritten half of my existing code from elsewhere so far. I’ve reproduced the code for generating core vocabulary lists and also the coverage tables I’ve used in multiple talks (including my BibleTech talks in 2010 and 2015).

I’ve taken the opportunity to generalise and decouple the code (especially with regard to the different chunking systems) and also make use of newer Python stuff like Counter and dictionary comprehensions which simplifies much of my earlier code.

There are a lot of little things you can do with just a couple of lines of Python and I’ve tried to avoid turning those into their own library of tiny functions. Instead, I’m compiling a little tutorial / cookbook as I go which you can read the beginnings of here:

https://github.com/jtauber/vocabulary-tools/blob/master/examples.rst

There’s still a fair bit more to move over (even going back 11 years to some stuff from 2008) but let me know if you have any feedback, questions, or suggestions. I’m generalising more and more as I go so expect some things to change dramatically.

If you’re interested in playing around with this stuff for corpora in other languages, let me know how I can help you get up and running. The main requirement is a tokenised and lemmatised corpus (assuming you want to work with lemmas, not surface forms, as vocabulary items) and also some form of chunking information. See https://github.com/jtauber/vocabulary-tools/tree/master/gnt_data for the GNT-specific stuff that would (at least partly) need to be replicated for another corpus.