Parsing the DCC Principal Parts

This is part 6 of a series of blog posts about modelling stems and principal part lists and looks more precisely at the format of the principal parts list in the DCC verbs.

We’ve already discussed that the DCC principal parts are presented slightly differently than the Pratt or Morwood inasmuch as the latter two are in tabular form whereas the DCC list just has a string of comma-separated parts.

In Formatting of Principal Parts we touched on many of the properties of the DCC format but in the spirit of precise modeling, what I’ve done below is actually write a set of regular expressions that match and enable parsing of every entry in the DCC list.

In the regex patterns below, I’ve used {grk} for Greek words, optionally preceded by a hyphen. In my code this expands to the regex (-?[\u0370-\u03FF\u1F00-\u1FFF]+). I also have {grk2} which just allows an optional second Greek word separated with “or” or “and”. {grk2} hence expands to ({grk}( (or|and) {grk})?). And finally, in a couple of examples, I have {gloss} for glosses consisting of English words including a comma. This expands to ([a-z, ]+).
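As a rough illustration, the placeholder expansion could be sketched in Python along the following lines. The expansions themselves are the ones quoted above; the names `expand` and `matches` are hypothetical, and the NFC normalization step is an added assumption (it keeps accented letters as single precomposed code points that fall inside the two Greek ranges), not something the original code is known to do.

```python
import re
import unicodedata

# Expansions quoted from the text above; the names below are hypothetical.
GRK = r"(-?[\u0370-\u03FF\u1F00-\u1FFF]+)"
GRK2 = "(" + GRK + "( (or|and) " + GRK + ")?)"
GLOSS = r"([a-z, ]+)"


def expand(pattern):
    """Replace the {grk}, {grk2} and {gloss} placeholders with their regexes."""
    return (
        pattern.replace("{grk2}", GRK2)
        .replace("{gloss}", GLOSS)
        .replace("{grk}", GRK)
    )


def matches(pattern, entry):
    """True if the whole entry matches the expanded pattern."""
    # Assumption: normalize to NFC before matching against the Greek ranges.
    text = unicodedata.normalize("NFC", entry)
    return re.fullmatch(expand(pattern), text) is not None


# An entry and its pattern, both quoted later in this post.
entry = "ἥκω, ἥξω, pf. ἧκα"
print(matches(r"{grk}, {grk}, pf\. {grk}", entry))
print(matches(r"{grk}, {grk}", entry))
```

Using `re.fullmatch` rather than `re.match` means a pattern only counts as a match when it accounts for the entire string, which is what makes stray labels and missing commas show up as failures.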

The simplest cases are just a comma-separated list of Greek words. There may only be 1–5 parts rather than the full six, although in these cases the only gaps are at the end of the list.

"{grk}, {grk2}, {grk}, {grk}, {grk}, {grk2}"
"{grk}, {grk}, {grk}, {grk}, {grk}"
"{grk}, {grk}, {grk}, {grk}"
"{grk}, {grk}, {grk}"
"{grk}, {grk}"

As mentioned in the previous blog posts, when the third part is a 2nd aorist, that’s made explicit. Again, sometimes the 5th and 6th, or 4th, 5th and 6th parts are omitted.

"{grk}, {grk}, 2 aor\. {grk2}, {grk2}, {grk}, {grk}"
"{grk}, {grk}, 2 aor\. {grk}, {grk}"
"{grk}, {grk}, 2 aor\. {grk}"

One pattern skips the second part but this is clear because of the explicit labeling of the third part.

"{grk}, 2 aor\. {grk}, {grk}, {grk}"

However, in one case, “ἔρχομαι, fut. εἶμι or ἐλεύσομαι, 2 aor. ἦλθον, ἐλήλυθα”, the second part is explicitly labeled fut. because it is suppletive, even though it is unambiguously the second part by position.

"{grk}, fut\. {grk2}, 2 aor\. {grk}, {grk}"

Sometimes both a 1st and 2nd aorist are given as separate parts. In the sigmatic case, “ἁμαρτάνω, ἁμαρτήσομαι, ἡμάρτησα, 2 aor. ἥμαρτον, ἡμάρτηκα, ἡμάρτημαι, ἡμαρτήθην”, the 1st aorist is not explicitly labeled and so the 2nd aorist is actually in the fourth position, the fourth part in the fifth position and so on.

"{grk}, {grk}, {grk}, 2 aor\. {grk}, {grk}, {grk}, {grk}"

However, sometimes the 1st aorist in this case is labeled because it is not sigmatic and so at a glance could be confused for a perfect.

"{grk}, {grk}, 1 aor\. {grk}, 2 aor\. {grk}, {grk}, {grk}, {grk}"
"{grk}, {grk}, 1 aor\. {grk}, 2 aor\. {grk}, {grk}, {grk}"
"{grk}, {grk}, 1 aor\. {grk}"

One example of this (matching the first line above) is “φέρω, οἴσω, 1 aor. ἤνεγκα, 2 aor. ἤνεγκον, ἐνήνοχα, ἐνήνεγμαι, ἠνέχθην”.
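Once a pattern matches, the forms still have to be assigned to the right principal part, since position alone no longer tells you which part is which. One possible sketch of that step follows. It swaps in non-capturing versions of the expansions so that each placeholder becomes exactly one named group; the function names and the slot labels (treating both aorists as variants of the third part) are assumptions for illustration, as is the NFC normalization.

```python
import re
import unicodedata

# Non-capturing variants of the {grk} and {grk2} expansions, so that each
# placeholder contributes exactly one named group. Names are hypothetical.
GREEK = r"-?[\u0370-\u03FF\u1F00-\u1FFF]+"
GREEK2 = GREEK + "(?: (?:or|and) " + GREEK + ")?"


def compile_with_slots(pattern):
    """Compile a pattern, naming the placeholders p0, p1, ... in order."""
    counter = 0

    def numbered(match):
        nonlocal counter
        body = GREEK2 if match.group(0) == "{grk2}" else GREEK
        group = f"(?P<p{counter}>{body})"
        counter += 1
        return group

    return re.compile(re.sub(r"\{grk2?\}", numbered, pattern))


def parse(pattern, slots, entry):
    """Return {slot label: form}, or None if the entry does not match."""
    text = unicodedata.normalize("NFC", entry)  # assumed preprocessing
    match = compile_with_slots(pattern).fullmatch(text)
    if match is None:
        return None
    return {slot: match.group(f"p{i}") for i, slot in enumerate(slots)}


pattern = r"{grk}, {grk}, 1 aor\. {grk}, 2 aor\. {grk}, {grk}, {grk}, {grk}"
# Assumed labels: both aorists belong to the third principal part.
slots = ["1", "2", "3 (1 aor.)", "3 (2 aor.)", "4", "5", "6"]
entry = "φέρω, οἴσω, 1 aor. ἤνεγκα, 2 aor. ἤνεγκον, ἐνήνοχα, ἐνήνεγμαι, ἠνέχθην"
parts = parse(pattern, slots, entry)
print(parts)
```

Pairing each pattern with its own list of slot labels keeps the positional knowledge (for example, that the form after “2 aor.” is still a third part) next to the pattern that needs it.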

In one case, “μιμνήσκω, -μνήσω, -έμνησα, pf. μέμνημαι, ἐμνήσθην”, the fourth part is skipped and the fifth is labeled pf. It would probably be clearer if this were labeled pf. mid. or similar.

"{grk}, {grk}, {grk}, pf\. {grk}, {grk}"

In another case, “ἥκω, ἥξω, pf. ἧκα”, the perfect active is labeled explicitly because there’s no third part and the kappa in the imperfective stem makes the perfect form perhaps harder to identify.

"{grk}, {grk}, pf\. {grk}"

Sometimes an explicit imperfect is given. This is usually at the end, after the usual parts are given.

"{grk}, {grk}, impf\. {grk}"
"{grk}, {grk}, {grk}, {grk}, impf\. {grk}"
"{grk}, {grk2}, 2 aor\. {grk}, {grk}, impf\. {grk}"
"{grk}, {grk}, 2 aor\. {grk}, {grk2}, {grk}, impf\. {grk}"

In one case, “οἴομαι or οἶμαι, οἰήσομαι, impf. ᾤμην, aor. ᾠήθην”, (perhaps inconsistently) the imperfect is given before the aorist.

"{grk2}, {grk}, impf\. {grk}, aor\. {grk}"

In one case, “ἀκούω, ἀκούσομαι, ἤκουσα, ἀκήκοα, plup. ἠκηκόη or ἀκηκόη, ἠκούσθην”, where there is no fifth part, two forms of the pluperfect are given instead.

"{grk}, {grk}, {grk}, {grk}, plup\. {grk2}, {grk}"

In another case, however, “καθίστημι, καταστήσω, κατέστησα, κατέστην, καθέστηκα, plupf. καθειστήκη, κατεστάθην”, this turns out to be a little tricky because it has both a 1st and root aorist but that fact is not made explicit.

"{grk}, {grk}, {grk}, {grk}, {grk}, plupf\. {grk}, {grk}"

Also note the inconsistent use of “plup.” vs “plupf.”.
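One way to stop this kind of inconsistency from leaking into later processing is to map label variants to a single spelling as they are read. The sketch below is only an illustration: the label list is collected from the patterns in this post and is not exhaustive (it ignores voice labels such as “act.” and “mid.”), and treating “plup.” as a variant of “plupf.” is an assumption. The names `LABEL`, `CANONICAL` and `labels_in` are hypothetical.

```python
import re

# Labels seen in the patterns in this post. Longer alternatives come first
# so that "plupf." is not read as "plup." and "impf." is not read as "pf.".
LABEL = re.compile(
    r"(?<![a-z])"
    r"(plupf\.|plup\.|[12] aor\.|aor\.|impf\.|infin\.|imper\.|fut\.|ptc\.|pf\.)"
)

# Assumption: "plup." and "plupf." mean the same thing.
CANONICAL = {"plup.": "plupf."}


def labels_in(entry):
    """List the labels in an entry, in order, using the canonical spelling."""
    return [CANONICAL.get(label, label) for label in LABEL.findall(entry)]


print(labels_in("ἀκούω, ἀκούσομαι, ἤκουσα, ἀκήκοα, plup. ἠκηκόη or ἀκηκόη, ἠκούσθην"))
print(labels_in("οἴομαι or οἶμαι, οἰήσομαι, impf. ᾤμην, aor. ᾠήθην"))
```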

There are four cases that just provide various non-standard parts such as imperfects, infinitives or participles (in one case, three participle parts).

"{grk}, impf\. {grk2}, infin\. {grk}"
"{grk}, {grk}, impf\. {grk}, infin\. {grk}"
"{grk}, ptc\. {grk}"
"{grk}, infin\. {grk}, ptc\. {grk}, {grk}, {grk}"

In the case of εἶδον, the first part is actually the suppletive 2nd aorist of another verb.

"{grk}, 2 aor\. of {grk}, act\. infin\. {grk}, mid\.infin\. {grk}"

For our purposes this may end up getting treated differently.

There are five other cases where there is additional annotation:

"{grk}, infin\. {grk}, imper\. {grk}, plupf\. used as impf\. {grk}"
"{grk}, {grk}, {grk}, {grk}, {grk} \(but usu\. {grk} instead\), {grk}"
"{grk}, {grk}, {grk}, {grk}, {grk} \(but commonly {grk} instead\), {grk}"
"{grk} \(usually mid\. {grk}\), {grk}, {grk}, {grk}, {grk}"
"{grk}, {grk}, {grk}, 2 aor\. mid\. {grk}, pf\. {grk} \(“I have utterly destroyed”\) or {grk} \(“I am undone”\)"

And finally there are five cases that are clearly typos, where the crucial comma delimiter has been omitted or accidentally replaced with a period.

"{grk} {grk} {gloss}, {grk} {gloss}, 2 aor\. {grk} {gloss}, {grk} {gloss}, plup\. {grk} {gloss}, {grk} {gloss}"
"{grk}, {grk}, {grk}, {grk}\. {grk}, {grk}"
"{grk} {grk}"
"{grk} {grk}, 2 aor\. {grk}"
"{grk} {grk}, {grk}, {grk}, {grk}, {grk}"

These correspond to:

ἵστημι στήσω will set, ἔστησα set, caused to stand, 2 aor. ἔστην stood, ἕστηκα stand, plup. εἱστήκη stood, ἐστάθην stood
τυγχάνω, τεύξομαι, ἔτυχον, τετύχηκα. τέτυγμαι, ἐτύχθην
προσήκω προσήξω
ἕπομαι ἕψομαι, 2 aor. ἑσπόμην
βουλεύω βουλεύσω, ἐβούλευσα, βεβούλευκα, βεβούλευμαι, ἐβουλεύθην

These cases should probably just be fixed upstream.

Now, admittedly, it probably would have been quicker for me to just manually convert the 149 strings into some completely unambiguous format rather than write regular expressions that match them all, handling typos and idiosyncrasies. But the approach highlights both specific issues with the DCC list (which admittedly are quite minor; I don’t want to detract from the wonderful resource the DCC Core List is) and the value of precise modeling like this in identifying inconsistencies and potential ambiguities in the way this sort of information is presented.
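The kind of check this relies on can be sketched as a small coverage test: run every string against every pattern and report the strings that nothing matches. The version below is a minimal sketch under stated assumptions, not the code actually used. The function names are hypothetical, the NFC normalization is an assumed preprocessing step, and the extra report of strings matching more than one pattern is an addition that seemed useful for spotting ambiguity.

```python
import re
import unicodedata

GRK = r"(-?[\u0370-\u03FF\u1F00-\u1FFF]+)"
GRK2 = "(" + GRK + "( (or|and) " + GRK + ")?)"
GLOSS = r"([a-z, ]+)"


def expand(pattern):
    """Substitute the placeholder expansions described earlier in the post."""
    for name, regex in (("{grk2}", GRK2), ("{gloss}", GLOSS), ("{grk}", GRK)):
        pattern = pattern.replace(name, regex)
    return pattern


def check_coverage(entries, patterns):
    """Return (unmatched entries, entries matching more than one pattern)."""
    compiled = [(p, re.compile(expand(p))) for p in patterns]
    unmatched, ambiguous = [], []
    for entry in entries:
        text = unicodedata.normalize("NFC", entry)  # assumed preprocessing
        hits = [p for p, rx in compiled if rx.fullmatch(text)]
        if not hits:
            unmatched.append(entry)
        elif len(hits) > 1:
            ambiguous.append((entry, hits))
    return unmatched, ambiguous


# Three patterns and three entries quoted from this post, for illustration.
# The pattern for the last entry is deliberately left out of the list.
patterns = [
    r"{grk}, {grk}, pf\. {grk}",
    r"{grk}, {grk}, impf\. {grk}",
    r"{grk} {grk}",
]
entries = ["ἥκω, ἥξω, pf. ἧκα", "προσήκω προσήξω", "ἕπομαι ἕψομαι, 2 aor. ἑσπόμην"]
unmatched, ambiguous = check_coverage(entries, patterns)
print(unmatched)
print(ambiguous)
```

Run over the full list, an empty `unmatched` result is what shows that the set of patterns really does cover every entry, typos included.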

While it’s outside the scope of this blog series, I’ve been exploring similar tests on entire lexicon entries for a while. This pretty quickly exposes inconsistencies. Even in cases where a markup language such as XML is used, unless it’s very fine-grained markup (like the Cambridge Lexicon is/was using), lots of inconsistencies and ambiguities can creep in.

All of this comes back to what I talked about in my 2015 SBL and BibleTech talks under the heading of Technical Aspects of Openness and what’s involved in making linguistic data truly machine-actionable.
