I started working on some Plato texts a while ago but now I’m back to it, integrating various information and hitting some more issues with the Diorisis corpus.
About a year ago, I wrote about some vocabulary statistics I’d put together around various texts. This included a subcorpus of Plato I wanted to put together for the Greek Learner Texts Project. Based on the “core” works list I’d put together for https://vocab.perseus.org, this included:
At the time, I used Giuseppe Celano’s experimental lemmatization as the basis for the vocabulary counts.
In the intervening period, I went further with Crito and restructured the citation scheme to be based on units of dialogue and sentences. You can see HTML generated from the result at:
but the underlying data is available at https://github.com/jtauber/plato-texts/blob/master/text/crito.txt.
I also took a first pass at aligning an English translation at the sentence level and the raw data for that is available at https://github.com/jtauber/plato-texts/blob/master/analysis/crito_aligned.txt.
My plan was always to return to the other four texts and last weekend I started on that, freshly bringing in the texts (in both Greek and English) from Perseus, the Diorisis corpus tagging, and the treebanks from AGLDT (Euthyphro) and Vanessa Gorman (the Apology).
I also added:
and may add others if they seem appropriate for the Greek Learner Texts Project.
The first thing I did was produce a stripped-down tokenized version of the Greek texts from Perseus with minimal markdown. In this process, I found a small number of issues with the Perseus XML which I’ll submit corrections for shortly (mostly some stray gammas).
I then wrote a script to extract similar tokens from Diorisis for alignment. As I’ve written about before, the Diorisis corpus made the odd choice to use betacode for the tokens so I had to do a conversion. Then the real fun began.
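To give a sense of what that conversion involves, here is a minimal sketch of a betacode-to-Unicode converter. It is not the script I actually used: the function name `beta_to_unicode` is made up for illustration, it only handles the basic letters, the common diacritics and `*` capitals, and it assumes that a sigma at the end of a word is always a final sigma.

```python
import re
import unicodedata

# Beta code letters and their lowercase Greek equivalents.
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta code diacritics mapped to Unicode combining characters.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def beta_to_unicode(token: str) -> str:
    """Convert one beta code token to precomposed (NFC) Unicode Greek."""
    out = []
    upper = False   # set by '*', applies to the next letter
    pending = []    # diacritics written between '*' and the capital letter
    for ch in token.lower():
        if ch == "*":
            upper = True
        elif ch in DIACRITICS:
            (pending if upper else out).append(DIACRITICS[ch])
        elif ch in LETTERS:
            letter = LETTERS[ch]
            out.append(letter.upper() if upper else letter)
            out.extend(pending)  # attach diacritics that preceded a capital
            pending.clear()
            upper = False
        else:
            out.append(ch)  # pass through anything not recognised
    text = unicodedata.normalize("NFC", "".join(out))
    # Assumption: a sigma not followed by another letter is a final sigma.
    return re.sub(r"σ(?!\w)", "ς", text)


print(beta_to_unicode("lo/gos"))      # λόγος
print(beta_to_unicode("*swkra/ths"))  # Σωκράτης
```

Emitting combining characters and letting NFC normalization compose them keeps the lookup tables small, at the cost of relying on the diacritics arriving in the usual breathing, accent, iota subscript order.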
Firstly, the Perseus text, based on the Burnet edition, has various editorial markup like <sic>. I quickly discovered that the Diorisis text drops the <sic> elements. That’s fine although I might seek the advice of people more familiar with Burnet and the text scholarship of Plato as to what the Greek Learner Texts edition should do.
Secondly, in Phaedo at least, named entities are marked up in the Perseus TEI XML. People and places are all appropriately tagged. I don’t happen to need that right now although it’s potentially useful information. But the Diorisis corpus drops those elements. I don’t just mean it drops the tags, it drops the elements along with their content. So if the sentence was <persName>John</persName> loves <persName>Mary</persName>, Diorisis would just give the sentence as loves (at least in Phaedo). Fairly easy to work around for alignment purposes, though.
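One way to work around it, sketched here with Python's standard `xml.etree.ElementTree`, is to produce a second token stream from the Perseus side that skips the content of name elements, so it can be compared directly with what Diorisis gives. The `<s>` wrapper, the `visible_text` helper and the inclusion of `placeName` in the tag set are assumptions for illustration, not a description of the actual Perseus files or of my script.

```python
import xml.etree.ElementTree as ET

# Assumed set of name elements to drop; only persName appears in the example.
NAME_TAGS = frozenset({"persName", "placeName"})


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def visible_text(elem, drop=frozenset()) -> str:
    """Concatenate the text under elem, skipping the whole content
    (not just the tags) of any element whose name is in drop."""
    parts = [elem.text or ""]
    for child in elem:
        if local_name(child.tag) not in drop:
            parts.append(visible_text(child, drop))
        # Text following a child belongs to the parent, so always keep it.
        parts.append(child.tail or "")
    return "".join(parts)


sentence = ET.fromstring(
    "<s><persName>John</persName> loves <persName>Mary</persName></s>"
)

perseus_tokens = visible_text(sentence).split()
diorisis_like_tokens = visible_text(sentence, NAME_TAGS).split()

print(perseus_tokens)        # ['John', 'loves', 'Mary']
print(diorisis_like_tokens)  # ['loves']
```

Aligning Diorisis against the second stream, while keeping the first as the real text, means the dropped names no longer show up as mismatches.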
The more time-consuming aspect is the odd way Diorisis handles quotations. It seems to repeat the tokens of each quotation, once in context and then once in a sentence of its own. Except sometimes the repetition is incorporated in an unrelated sentence.
For example, the Homeric quotation in 408a (Republic Book 3) is analyzed inline but then also repeated in another sentence where it’s part of the first sentence of 409a (“δικαστὴς δέ γε…”) which, unless I’m missing something considerable, is just completely wrong.
I’m manually correcting all this (it comes up as an alignment mismatch and I’m going in and editing the Diorisis XML to remove the duplication). But even without the bad sentence merges, this also means that the vocabulary counts I’ve previously generated from Diorisis (and in vocabulary-tools) may have doubled up on any words appearing in quotations.
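As a rough illustration of how a repeated quotation surfaces as an alignment mismatch, here is a sketch using `difflib.SequenceMatcher` from the Python standard library. The function name `alignment_mismatches` and the English placeholder tokens are invented for the example; my actual alignment code and the real token streams differ.

```python
from difflib import SequenceMatcher


def alignment_mismatches(reference, other):
    """Return (operation, reference_tokens, other_tokens) for every
    stretch where the two token lists disagree."""
    # autojunk=False stops frequent tokens being treated as noise.
    matcher = SequenceMatcher(a=reference, b=other, autojunk=False)
    return [
        (tag, reference[i1:i2], other[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Placeholder tokens: the quotation "sing goddess" appears once in context
# in both lists, and a second time in the Diorisis-style list.
perseus = ["he", "said", "sing", "goddess", "then", "left"]
diorisis = ["he", "said", "sing", "goddess", "then", "left", "sing", "goddess"]

for mismatch in alignment_mismatches(perseus, diorisis):
    print(mismatch)  # ('insert', [], ['sing', 'goddess'])
```

An `insert` whose tokens also occur a little earlier in the text is a good hint that it is one of these repeated quotations rather than a genuine textual difference, although each one still needs checking by hand.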
So there’s lots more to do with Plato, not least of all the manual curation of lemmatization. But the goal, like that of the Greek Learner Texts Project as a whole, is to have a set of openly-licensed, high-quality, lemmatized texts for extensive reading by language learners.
Collaboration always welcome. Just ping me on the Greek Learner Texts Project Slack workspace.