I have made a minor update to greek-normalisation, a more significant update to vocabulary-tools, and have started a new project, postag-convert, for converting between various morphosyntactic tagging schemes.
The utils module in greek-normalisation previously had a function for converting U+02BC and U+1FBF to U+2019; as of the 0.4 release, it additionally provides this as a shell command.
Once the package is installed, you can type to2019 in > out in the shell and the file named in will be converted to a file named out with all the U+02BC and U+1FBF characters changed to U+2019.
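As a rough illustration of what such a conversion involves, here is a minimal sketch using Python's built-in str.translate. The name to_2019 and the mapping table are assumptions made for this example; the actual function in greek-normalisation's utils module may have a different name and signature.

```python
# Hypothetical sketch: not the real greek-normalisation API.
# Map the two apostrophe-like characters to U+2019.
APOSTROPHE_MAP = str.maketrans({
    "\u02BC": "\u2019",  # MODIFIER LETTER APOSTROPHE
    "\u1FBF": "\u2019",  # GREEK PSILI
})


def to_2019(text: str) -> str:
    """Return text with U+02BC and U+1FBF replaced by U+2019."""
    return text.translate(APOSTROPHE_MAP)


if __name__ == "__main__":
    sample = "\u1F00\u03BB\u03BB\u02BC"  # a word ending in U+02BC
    converted = to_2019(sample)
    print([f"U+{ord(ch):04X}" for ch in converted])
```

All other characters pass through unchanged, so the same function can be applied line by line to a whole file.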
I previously mentioned I’d incorporated lemma counts from Vanessa Gorman’s treebanks into vocabulary-tools. I didn’t check the Unicode normalisation first, though, and it turned out to be inconsistent, which led to bad numbers. That’s now been fixed and the data converted to NFC.
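To see why inconsistent normalisation distorts counts, here is a small sketch using the standard unicodedata module. The three-token list is made-up toy data, not drawn from the Gorman treebanks; it only shows that the same lemma in NFC and NFD form is counted as two different strings until everything is normalised.

```python
import unicodedata
from collections import Counter

lemma = "λέγω"
nfc = unicodedata.normalize("NFC", lemma)  # precomposed accented epsilon
nfd = unicodedata.normalize("NFD", lemma)  # epsilon + combining acute accent

# Toy data: the same lemma appearing in two normalisation forms.
tokens = [nfc, nfd, nfc]

raw_counts = Counter(tokens)
normalised_counts = Counter(unicodedata.normalize("NFC", t) for t in tokens)

print(len(raw_counts))         # distinct keys before normalising
print(len(normalised_counts))  # distinct keys after normalising to NFC
```

Before normalising, the counts for one lemma are split across two keys, which lowers its apparent frequency and therefore its rank.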
I’ve also added the Diorisis lemma counts and cleaned up the code to share more between the two data sets.
Thirdly, I took a pass at finding the intersection between the passages covered by Gorman and Diorisis and generated separate lemma counts for each version of the intersection. I’ll write a dedicated blog post about this later, but basically I’m trying to track down systemic problems with various lemmatisations, and having identical texts to compare (to make sure discrepancies aren’t just subcorpus bias) is very helpful.
Fourthly, I’ve implemented a calculation of log rank differences between lemmas in two subcorpora: in other words, a measure of how far the rank of a particular lemma differs between them. This has (at least) two applications. One is to find which lemmas are disproportionately more common in one text than in another. For example, in the Gorman texts from Thucydides and Xenophon, the lemma Κῦρος is ranked 885th in Thucydides vs 30th in Xenophon, a log rank difference of 4.9. The lemma δράω is ranked 196th in Thucydides but 2345th in Xenophon, a log rank difference of 3.6.
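The figures quoted above are consistent with taking the absolute difference of base-2 logarithms of the two ranks. The sketch below makes that assumption explicit. The function name log_rank_difference is hypothetical, and both the base and the use of an absolute value are inferred from the numbers in this post, not taken from the vocabulary-tools source.

```python
import math


def log_rank_difference(rank_a: int, rank_b: int) -> float:
    """Absolute difference of the base-2 logs of two ranks.

    Assumption: base 2 and the absolute value are inferred from the
    figures in the post; the library's own calculation may differ.
    """
    return abs(math.log2(rank_a) - math.log2(rank_b))


if __name__ == "__main__":
    # Ranks quoted in the post (Thucydides, Xenophon).
    print(round(log_rank_difference(885, 30), 1))    # Κῦρος → 4.9
    print(round(log_rank_difference(196, 2345), 1))  # δράω → 3.6
```

On this reading, a difference of 1.0 means one rank is double the other, and identical ranks give exactly zero.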
The second application is to compare two lemmatisations of the same subcorpus (e.g. Gorman vs Diorisis) to try to identify systemic problems. It turns out, for example, that the log rank difference of λέγω between Gorman and Diorisis is a whopping 5.974 (you’d expect it to be at or near zero for the same corpus). That’s because Gorman distinguishes λέγω, λέγω2 and λέγω3, and Diorisis doesn’t.
More on this in a future post!
You can see more examples of log_rank_differences in action at:
Over the years (actually decades), various projects of mine have had to convert between different schemes for morphosyntactic tagging. Whether the original CCAT scheme, my own variants of that scheme, the Robinson scheme, the Morpheus/Perseus scheme, or its Logeion variant, I’ve written code at various points to do conversions from one to another. I was well overdue writing a reusable library!
I also want to be able to support Leipzig Glossing Rules and the Universal Dependencies codes as well as fully spelling out properties and values in multiple natural languages (e.g. localising terms like “case” or “accusative”).
postag-convert is still in the early days but is intended to eventually be useful for all of the above (and potentially reusable beyond Ancient Greek too).