A post to the graded-reader mailing list from April 12, 2010.
Until this weekend, all the GNT graded reader work I’d done had used clause boundaries from OpenText.org.
With the availability of the PROIEL dependency tree analysis, I thought I’d give that a go.
I’ve uploaded code to GitHub for extracting the clauses in John’s Gospel and generating a very basic reading programme from them.
Clauses were extracted by looking at any ‘pred’ arc and linearizing all nodes from that point down. If there were embedded preds then clauses corresponding to both inner and outer preds were generated.
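As a rough illustration of the extraction just described (a minimal sketch, not the code uploaded to GitHub), the following assumes each token is a hypothetical `(id, form, head_id, relation)` tuple, where `id` is the token's linear position and `head_id` is `None` for the root. The toy English sentence and every relation label other than `pred` are invented for the example and are not taken from the PROIEL data.

```python
from collections import defaultdict

# Hypothetical toy sentence; ids give the linear order of the tokens.
TOKENS = [
    (1, "I", 2, "sub"),
    (2, "know", None, "pred"),
    (3, "that", 2, "comp"),
    (4, "he", 5, "sub"),
    (5, "came", 3, "pred"),
]


def extract_clauses(tokens):
    """Return one linearized clause per 'pred' node, outer and inner alike."""
    children = defaultdict(list)
    forms = {}
    for tok_id, form, head_id, _relation in tokens:
        children[head_id].append(tok_id)
        forms[tok_id] = form

    def subtree(node_id):
        # Collect the node itself plus everything below it in the tree.
        ids = [node_id]
        for child_id in children[node_id]:
            ids.extend(subtree(child_id))
        return ids

    clauses = []
    for tok_id, _form, _head_id, relation in tokens:
        if relation == "pred":
            # Linearize by sorting the subtree's nodes into surface order.
            ordered = sorted(subtree(tok_id))
            clauses.append(" ".join(forms[i] for i in ordered))
    return clauses


if __name__ == "__main__":
    for clause in extract_clauses(TOKENS):
        print(clause)
    # I know that he came
    # he came
```

Because every `pred` node is treated independently, the embedded clause appears twice: once on its own and once inside the outer clause, matching the behaviour described above.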
Note that the current code is based on forms alone, with no use made of syntactic or morphological information. I also can’t do inline replacement into an English context because I don’t have an English text mapped to the PROIEL analysis.
However, my initial impression is that the PROIEL analysis will be preferable to work with moving forward.
Then Patrick Narkinsky asked:
Could you clarify in what ways you see the PROIEL data being superior to the opentext data? One obvious one that leaps to mind is that OpenText seems to be a dead project…
PROIEL is actively maintained, is redistributable under a CC license, is based on a freely redistributable text and is a less idiosyncratic analysis.
Admittedly, I haven’t spent THAT much time with it but it seems that it will be easier to extract the kind of syntactic information I’m interested in from it.