Working with the Diorisis Ancient Greek Corpus

I’ve recently started working on cleaning up the Diorisis Ancient Greek Corpus for my own vocabulary and morphology work as well as potential use in Scaife.

I can’t remember whether I simply didn’t know about the Diorisis corpus until recently or had put it on a list of things to look at one day and forgotten to get back to it. But it was Rodda et al.’s “Vector space models of Ancient Greek word meaning, and a case study on Homer” in TAL 60.3 that put me (back) on to it.

The Diorisis Ancient Greek Corpus was produced by Alessandro Vatri and Barbara McGillivray and is a 10-million word corpus of 820 texts from Perseus and some other sources (in TEI XML, non-TEI XML, HTML, and apparently Microsoft Word).

Vatri and McGillivray compiled it for studying semantic change but obviously it’s useful for a number of my own research interests. I’ve been working with Giuseppe Celano’s lemmatisation (also used in Scaife) but it has some problems and I’d always planned to clean it up a bit. Diorisis excited me as a potentially better curated (if smaller) corpus.

I also liked the fact that the Diorisis corpus has work-level metadata with dates and genre (which I’ve always wanted on my corpus for a variety of reasons). And of course, it is CC BY-SA licensed.

So I started a repo to begin processing:

https://github.com/jtauber/diorisis

The first thing I did was write a script to extract the work-level metadata, with this result:

https://github.com/jtauber/diorisis/blob/master/catalog.tsv

Then I started extracting the token-level data. Like the Celano data, the XML format used for the analysed corpus is huge (over 2GB on disk) and I wanted something a little more normalised along the lines of other work I’m doing.

But here’s where I hit my first problem. Vatri and McGillivray made the odd decision to use BetaCode for the corpus word forms (although not the lemmas). What is even more odd is that their paper (linked above) argues for the benefits of BetaCode over Unicode. The arguments, however, seem to stem from a misunderstanding of Unicode.

Section 3.3 of their paper begins:

All Greek characters have been converted to Beta Code, in order to adopt a uniform and consistent encoding and with a view to automatic parsing and lemmatization. For these purposes, Beta Code was chosen because of its flexibility and ease of use in the following look-up operations:

They then proceed with three arguments. The first is:

Word-forms to be automatically analysed and annotated may or may not start with a capital letter; in order to be matched to entries in a digital dictionary, forms should be converted to the formats corresponding to the entries. Greek lowercase and uppercase letters are encoded as different characters in the Unicode table (e.g. the lower-case letter α corresponds to utf-8 code 0391, the upper-case letter A corresponds to utf-8 code 03B1), which would require an ad-hoc conversion for each character between its lower-case and upper-case versions. Beta Code simply encodes capitalization through the juxtaposition of an asterisk (*) character (lower-case α is encoded as A, and upper-case A is encoded as *A), which can be easily added or removed in the look-up process.

Firstly, this confuses Unicode code points with UTF-8 encoding forms. Secondly, they get the Unicode code points for α and Α around the wrong way: α is U+03B1 (which in UTF-8 would be CE B1) and Α is U+0391 (CE 91). Thirdly, “ad-hoc conversion for each character between its lower-case and upper-case versions” is an overstatement of the problem. A simple .lower() in Python, for example, is arguably easier than removing * (and certainly not “ad-hoc”), especially when one considers the wide range of accent and breathing placements one finds in uppercase BetaCode characters in the wild.
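
To make the distinction concrete, here is a minimal Python sketch (my own illustration, not code from the Diorisis repo; the function names are made up for this example). It prints the code point and the UTF-8 bytes of a character separately, and does the case conversion with nothing more than the built-in str.lower().

```python
def code_point_and_utf8(ch):
    """Return the Unicode code point and the UTF-8 bytes of one character."""
    code_point = f"U+{ord(ch):04X}"
    utf8_bytes = ch.encode("utf-8").hex(" ").upper()  # hex(sep) needs Python 3.8+
    return code_point, utf8_bytes


def fold_case(word):
    """Lowercase a Greek word; breathings and accents are carried along."""
    return word.lower()


print(code_point_and_utf8("\u03b1"))  # α -> ('U+03B1', 'CE B1')
print(code_point_and_utf8("\u0391"))  # Α -> ('U+0391', 'CE 91')

# Ἀθῆναι, written with escapes so the exact code points are unambiguous
print(fold_case("\u1f08\u03b8\u1fc6\u03bd\u03b1\u03b9"))  # ἀθῆναι
```

The capital with smooth breathing (U+1F08) maps straight to its lowercase counterpart (U+1F00), so no per-character table has to be written by hand.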

Their second argument is:

Diacritics such as the Greek diaeresis ( ̈) may or may not appear in dictionary entries (for instance, editors may add them to Greek words to mark hiatuses in metrical texts). Greek characters containing the diaeresis (alone or in combination with other diacritic marks) all have different utf-8 codes (e.g. ϊ = 03ca, ΐ = 0390, ῒ = 1fd2, ῗ = 1fd7), whereas Beta Code encodes the diaeresis through the juxtaposition of a plus sign (+; e.g. ϊ = I+, ΐ = I/+, ῒ = I\+, ῗ = I=+). This makes it very easy to process diacritics in the look-up process.

Again there is a confusion between Unicode code points and UTF-8. “03ca” is a code point, not a UTF-8 byte sequence. But more significantly, the argument that BetaCode is superior because it encodes the diaeresis as a separate character completely ignores decomposed characters in Unicode. Ironically, copy-pasting from their paper, ϊ is already decomposed as two code points:

U+03B9 GREEK SMALL LETTER IOTA
U+0308 COMBINING DIAERESIS

In other words, Unicode provides exactly what they are asking for.
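
As a sketch of what that looks like in practice (again my own illustration, with a hypothetical function name), Python's standard unicodedata module can decompose a string, drop the combining diaeresis, and recompose whatever is left:

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"


def strip_diaeresis(text):
    """Remove any diaeresis while keeping accents and breathings."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = decomposed.replace(COMBINING_DIAERESIS, "")
    return unicodedata.normalize("NFC", stripped)


# The four characters listed in the paper: ϊ ΐ ῒ ῗ
for ch in ("\u03ca", "\u0390", "\u1fd2", "\u1fd7"):
    result = strip_diaeresis(ch)
    print(f"U+{ord(ch):04X} -> U+{ord(result):04X}")
```

The same three lines handle all four precomposed characters, because after NFD each of them is just an iota followed by combining marks.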

Finally, they argue:

In ag orthography, the grave accent (`) is only used to mark the alteration of the pitch normally marked by an acute accent in connected speech; thus, it never appears in dictionary entries (which only contain acute or circumflex accents). Whereas Unicode has different codes for Greek characters with an acute or a grave accent, Beta Code encodes such diacritics as forward (/) and backward (\) slashes, respectively; this makes grave accents easy to convert into acute accents in the look-up process.

Again this ignores Unicode normalisation and decomposed characters. It is really no harder to convert graves to acutes with Unicode than with BetaCode.
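
A minimal sketch of that conversion, using only the standard library (the function name is mine, chosen for illustration): decompose, swap the combining grave for the combining acute, recompose.

```python
import unicodedata

COMBINING_GRAVE = "\u0300"
COMBINING_ACUTE = "\u0301"


def grave_to_acute(text):
    """Replace every grave accent with an acute, leaving other marks alone."""
    decomposed = unicodedata.normalize("NFD", text)
    swapped = decomposed.replace(COMBINING_GRAVE, COMBINING_ACUTE)
    return unicodedata.normalize("NFC", swapped)


print(grave_to_acute("\u03ba\u03b1\u1f76"))  # καὶ -> καί
print(grave_to_acute("\u1f02"))              # ἂ -> ἄ
```

One thing to keep in mind is that NFC produces the tonos code points (for example U+03AF) rather than the oxia ones from the Greek Extended block, so the dictionary being looked up should be normalised the same way.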

These misunderstandings and misrepresentations of Unicode would be one thing if my argument was just that Unicode is no worse than BetaCode. But the choice of BetaCode is problematic for other reasons.

Most of these problems have to do with ( and especially ). BetaCode texts should use ' for apostrophes marking elision. The Diorisis corpus sometimes does. But it also (in around six thousand cases by my initial estimate) uses ) instead. And so we have KAT) for κατ’ instead of KAT'. Diorisis is hardly the only culprit here. In Perseus I still find cases where ) was used in BetaCode so the (incorrect) Unicode comes out κατ̓. To make matters worse, ( and ) are also used for actual parentheses.

And so in BetaCode in the wild, ) could mean smooth breathing or an apostrophe or a parenthesis. With BetaCode this has to be manually disambiguated. With Unicode it does not (unless incorrectly converted from ambiguous BetaCode).
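
To give a feel for the disambiguation involved, here is a rough heuristic sketch. It is not the conversion code in my repo; the function names, the categories, and the rules are all assumptions made for illustration, and it assumes uppercase BetaCode tokens like the KAT) example above. A ) right after an asterisk or after one or two word-initial vowels (or an initial rho) is taken as a breathing, a token-final ) after a consonant is guessed to be elision, and everything else is left for manual checking.

```python
VOWELS = set("AEHIOUW")
CONSONANTS = set("BGDZQKLMNCPRSTFXY")


def classify_close_paren(token, index):
    """Guess the role of the ')' at token[index]. Heuristic only."""
    prev = token[index - 1] if index > 0 else ""
    before = token[:index].lstrip("*")  # ignore a capital marker

    if prev == "*":
        return "breathing"  # capital form such as *)A
    if before and len(before) <= 2 and all(ch in VOWELS for ch in before):
        return "breathing"  # word-initial vowel or diphthong
    if before == "R":
        return "breathing"  # word-initial rho
    if index == len(token) - 1 and prev in CONSONANTS:
        return "elision"    # e.g. KAT) where KAT' was meant
    return "ambiguous"      # could be a parenthesis; check by hand


def flag_tokens(tokens):
    """List every ')' that is not confidently a breathing."""
    flagged = []
    for token in tokens:
        for index, ch in enumerate(token):
            if ch == ")":
                guess = classify_close_paren(token, index)
                if guess != "breathing":
                    flagged.append((token, index, guess))
    return flagged


print(flag_tokens(["KAT)", "A)NH/R", "OU)K", "LE/GEI)"]))
```

Even a heuristic like this only narrows the list; the "elision" guesses still need a human eye, which is exactly the cost that unambiguous Unicode would have avoided.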

And so my process of converting Diorisis to Unicode is not a trivial one. My initial conversion code flagged almost ten thousand tokens that need to be manually checked. The majority of these seem to be ) for elision but some are for parentheses. Eyeballing what I have so far, there are also cases where multiple words have not been properly split into separate tokens, and also some bad text (OCR or keying errors) that needs to be corrected.

Note that some of these issues were likely problems in the upstream text and so my task ahead is partly just doing that correction. But most of the work is addressing ambiguities in the BetaCode that would not exist if Unicode had been used everywhere. This makes the fact that BetaCode was chosen for unnecessary reasons even more frustrating.

One could argue I’m talking about at most 0.1% of the text, so I could just ignore the problematic tokens. Given that the automated lemmatisation is considerably less accurate than 99.9% (more like 90% at best), it might seem like a pedantic thing to focus on. But the problematic tokens tend to be of a particular type and aren’t uniformly distributed in the corpus. Depending on the task the corpus is being used for, they can become more prominent than one might think from a figure like “0.1%”.

My goal in the coming weeks is to have a slightly cleaned up Diorisis corpus completely in Unicode. This can then be used for some initial vocabulary stats work. My next goal after that is to improve the lemmatisation, initially using curated lemmatisations that did not exist when Diorisis was done. Longer term, I plan to continue to curate the lemmatisation.

This improved Diorisis can then form the basis for a lot of the work I previously used the Celano analysis for. There will definitely be blog posts reporting status along the way.

I am extremely grateful for the work that went into producing the Diorisis corpus. It is just a shame that a misunderstanding of Unicode led to a decision that is creating extra work now. But that will be overcome soon and hopefully my incremental improvements will turn out to be useful to others over time.