As part of my explicit annotation of the normalization column in MorphGNT, I started down the rabbit hole of capitalization conventions, which led to an interesting experiment with direct speech and the GBI syntax trees.
Back in Annotating the Normalization Column in MorphGNT: Part 1, I talked about wanting to catalogue the reasons why a word in the text differs from the normalized form, and annotate the text on a per-case basis. One difference mentioned was capitalization.
In Greek texts printed nowadays, there are three reasons why a word might start with an uppercase letter:
- it’s a proper noun
- it’s the start of a paragraph
- it’s the start of direct speech
So I obviously want to be able to say explicitly, in each case, which reason applies (of course it could be more than one, or potentially even all three).
The heuristic for proper nouns is easy if you have actually tagged the proper nouns or lemmatized the text (although, as I’ve already mentioned, there are some inconsistencies in MorphGNT that need to be cleaned up).
The start-of-paragraph heuristic should be straightforward, as the electronic SBLGNT text has paragraphs indicated, but there are some oddities I’m looking at (including 30 cases where a word after a paragraph break is not capitalized, some of which are inconsistencies in SBLGNT itself).
Direct speech is the most interesting case. I started by assuming that, if a word is capitalized but its lemma isn’t and the word isn’t at the start of a paragraph, it must be the start of direct speech. There are 2,225 cases of this in the SBLGNT text underlying the MorphGNT.
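As a rough illustration of that process of elimination, here is a minimal Python sketch. The `Token` class and its fields (`text`, `lemma`, `paragraph_initial`) are hypothetical names made up for this example, not the actual MorphGNT column layout, and the function mirrors only the starting assumption described above: direct speech is inferred when neither of the other two reasons applies.

```python
from dataclasses import dataclass


@dataclass
class Token:
    # Hypothetical fields for illustration; not the real MorphGNT format.
    text: str                # the word as it appears in the text
    lemma: str               # the dictionary form
    paragraph_initial: bool  # True if the word follows a paragraph break


def capitalization_reasons(token):
    """Return the set of reasons a token starts with an uppercase letter.

    An empty set means the token is not capitalized at all.
    """
    if not token.text[:1].isupper():
        return set()
    reasons = set()
    if token.lemma[:1].isupper():
        reasons.add("proper-noun")
    if token.paragraph_initial:
        reasons.add("paragraph-start")
    # Starting assumption: if nothing else explains the capital,
    # treat it as the start of direct speech.
    if not reasons:
        reasons.add("direct-speech")
    return reasons
```

Note that this sketch can report proper noun and paragraph start together, but it cannot detect direct speech that coincides with either of them; that would need independent evidence.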
Then I implemented a little heuristic where I traversed up the heads from the start of the direct speech (using the dependency version of the GBI Syntax Trees) until hitting a word that preceded the direct speech. Let’s call that the first preceding head.
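A minimal sketch of that traversal in Python might look like the following. It assumes the dependency analysis has been reduced to a plain dictionary mapping each word's position in the text to the position of its head (with `None` for the root); that representation and the function name are assumptions for this example, not the actual format of the GBI Syntax Trees.

```python
def first_preceding_head(start, heads):
    """Walk up the heads from `start` until reaching a word that precedes it.

    `heads` maps a word's position in the text to the position of its head
    (None for the root). Returns the position of the first preceding head,
    or None if the root is reached without finding one.
    """
    seen = set()  # guard against cycles in malformed data
    current = heads.get(start)
    while current is not None and current not in seen:
        if current < start:
            return current
        seen.add(current)
        current = heads.get(current)
    return None


# Toy example with invented head assignments (not taken from the GBI trees):
# 0 λέγουσα, 1 Οὗτός, 2 ἐστιν
heads = {0: None, 1: 2, 2: 0}
result = first_preceding_head(1, heads)  # climbs 1 -> 2 -> 0
```

In the toy example, the direct speech starts at position 1, its head at position 2 follows it, and the next head up at position 0 precedes it, so position 0 is the first preceding head.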
My hypothesis was that the first preceding head would be some verb of communication (saying, writing, etc.). In theory one might also expect a complementizer, but the GBI Syntax Trees don’t treat complementizers as heads, so they don’t come up in practice.
In 1,641 instances, the first preceding head was a form of λέγω. Other verbs were much rarer (no lexeme with more than 64 instances): γράφω, ἀποκρίνομαι, φημί, ἐπερωτάω, or κράζω, for example.
In some cases the first preceding head was clearly not a verb of communication (and often not a verb at all). Going through the first half of Matthew so far, here are the explanations I’ve discovered:
- in Matt 6.31, three instances of direct speech are disjoined and the GBI Trees model disjunction in such a way that the second and third instances are linked to the first rather than to the actual verb of communication, λέγοντες
- in Matt 8.9, the verb of communication is elided in the second and third cases so the GBI Tree attaches the direct speech elsewhere
- Matt 9.13 has “μάθετε τί ἐστιν” and Matt 12.7 has “εἰ δὲ ἐγνώκειτε τί ἐστιν” and the GBI Trees end up hanging the direct speech (or “meaning”) off τί
There were 118 cases in the entire text where there was no first preceding head. Going through the first half of Matthew again, the majority of these are cases where there is no direct speech but a word has been capitalized without an actual paragraph break. However, there are a couple of other interesting scenarios:
- in Matt 11.21, we might expect ἤρξατο ὀνειδίζειν to be linked to the direct speech with a participle of saying but none is provided
- similarly in Matt 13.33, there is direct speech but no participle linking to ἐλάλησεν
My plan is to go through the rest of the text and describe all the scenarios, but as this is something of an unexpected rabbit hole, it might take me a while.
If anyone is interested in a raw dump of the data with my explanations (covered above) so far, see https://gist.github.com/jtauber/39d85cff34c71a2df169.