Dictionary Markup versus Lexical Modelling

This year I’ve been thinking about (and working on) the representation of lexical information quite a bit.

This is nothing new, but recently, thoughts and activity have been motivated on multiple fronts including:

work earlier this year starting to extract Homeric Greek information from Cunliffe
a new project to digitise Tolkien’s A Middle-English Vocabulary
work collaborating on the GreekWordNet
contributions to the Ontology-Lexica Community Group on modelling morphology (see a recent joint paper Challenges for the Representation of Morphology in Ontology Lexicons from eLex 2019)
gathering lexical information for the Greek Texts project.

Plus my long-term vision of a comprehensive, machine actionable description of Greek morphology.

One important distinction that comes up, though, and which I’ve ranted about a number of times on Twitter.

I guess Abbott-Smith for GNT too. I've long been interested in the relationship (and mismatches) between printed dictionaries and the modelling of the information therein (especially beyond just the glosses/definitions; e.g. morphology, etymology)
— James Tauber (@jtauber) August 16, 2019

In short, marking up a print dictionary is not the same as real modelling of lexical information.

Obviously the intention of a print dictionary is to convey that information but it is done so in a form ultimately only appropriate for human interpretation from the page (in print or an online facsimile or some sort). It’s not really machine actionable.

Now wait, you might think. All we need to do is markup the dictionary electronically using some format like TEI.

Just to randomly pick an early entry in the conversion of Abbott-Smith to TEI:

<entry lemma="ἀγαθοποιός" strong="G17">
  <note type="occurrencesNT">1</note>
  <form>**† <orth>ἀγαθοποιός</orth>, <foreign xml:lang="grc">-όν</foreign> = cl. <foreign xml:lang="grc">ἀγαθουργός</foreign>,</form>
  <etym>
    <seg type="septuagint">[in LXX, of a woman who deals pleasantly in order to corrupt, <ref osisRef="Sir.42.14">Si 42:14</ref>*;]</seg>
  </etym>
  <sense><gloss>doing well</gloss>, <gloss>acting rightly</gloss> (Plut.): <ref osisRef="1Pet.2.14">I Pe 2:14</ref> (Cremer, 8; MM, <emph>VGT</emph>, s.v.).†</sense>
</entry>

There’s some extractable information here, like the Strong’s number, a clear lemma, some biblical references, the number of occurrences in the NT and some glosses. But some of the information goes unanalysed and merely presented as in the print dictionary. Things like the initial **† in the entry are left unexplicated. The -όν termination indicating the inflectional class (and indirectly the part of speech) is merely marked up as a Greek word. The LXX reference is treated as etymology and yet the classical equivalent has to be decoded from the = cl..

This is not to pick specifically on the Abbott-Smith conversion. Marked-up versions of Cunliffe, LSJ, even a version I had of the Barclay Newman glossary in BetaCode in the mid-90s, are all primarily attempting to convey the typography of the printed work, sprinkling a little bit of descriptive rather than purely presentational markup over the original content (so you could at least use a stylesheet to decide to make headwords bold or something, rather than doing it inline).

It would still take a lot of work to extract morphological or etymological information from this type of markup.

A very different kind of approach is to focus on actually modelling the lexical information and only then worrying about mapping it to some visual presentation.

TEI somewhat recognises the distinction and actually offers a variety of approaches to dictionary markup, one that is focused mostly on the markup for display (like in a printed book) and one that is more focused on the underlying data (although there are other formats for that too).

Of course if you’re marking up an existing print dictionary you’re pretty much doing the former. A more abstract modelling of the lexical information in Abbott-Smith (or LSJ, or Cunliffe, or Tolkien) is no longer a marked up version of that dictionary.

It’s a challenging problem for sure (and I’m certainly experiencing it firsthand on the Tolkien project—automatic extraction of even just the etymological information from the Middle English vocabulary entries has required tens of regular expressions).

What we really need moving forward is more focus on the underlying lexical modelling and not assuming that marking up Abbott-Smith or the LSJ better is the solution. And it’s not like there isn’t a ton of work going on in good electronic representation of lexical information for modern languages.

I ranted a little bit about the more general issue in my BibleTech 2015 talk on Biblical Greek Databases where I talked about how many reference books should be thought of (and indeed produced) more as “UI on top of databases”. In other words, you focus on the data and then at some point have a largely automated process for generating printed works from that data. One of my examples in that talk was “Readers Editions” of texts with glossaries on each page. But I think it applies even more to dictionaries.

So I think it’s important that we recognise the distinction between:

the presentation of lexical information in a print dictionary (or online equivalent)
the descriptive markup of those dictionaries
the underlying linguistic information

and recognise that collaboration on and exchange of the last of those three is ultimately the most valuable.

UPDATE: Here’s an entry from the upcoming Cambridge lexicon. It’s about as close as you could get to machine-actionable descriptive markup that is still pretty much following the structure of a print lexicon:

<AE>
  <HG>
    <HL>ξανθό<hyph/>θριξ</HL>
    <Infl>τριχος</Infl>
    <PS>masc.fem.adj</PS>
    <Ety>
      <Ref>θρίξ</Ref>
    </Ety>
  </HG>
  <aS1>
    <Tr>fair<or/>golden-haired</Tr>
    <Au> Sol. Theoc.</Au>
  </aS1>
  <aS1>
    <Indic>of a horse</Indic>
    <Def>with light-coloured hair or mane</Def>
    <Tr>golden</Tr>
    <Au>B.</Au>
  </aS1>
</AE>

Notice that this XML doesn’t include the square brackets that will go around the etymology in the print version. They are treated entirely as presentation. The disjunction between the glosses ‘fair’ and ‘golden-haired’ is markup, not content. Even whitespace and punctuation have to largely come from the stylesheet. Definitions are distinguished from translation glosses and also applications like ‘of a horse’.

This still doesn’t go as far as I’m talking about in terms of truly modelling the underlying linguistic information but it’s a lot easier to do things with this computationally than just a markup of an existing print dictionary would be.