All Subtrees Not Just Clauses

A post to the graded-reader mailing list from April 14, 2010.

I just ran a quick experiment in which the targets to learn are not just the clauses but any subtree of the dependency tree that has more than one word.

This results in 8209 targets in John's gospel instead of 3206.
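To make "any subtree with more than one word" concrete, here is a minimal sketch in Python. It is not the code used for the experiment: the token format, the function name subtree_targets, and every relation label other than "pred" are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical analysis of one four-word clause: (id, form, head_id, relation).
# head_id 0 marks the root; the labels other than "pred" are illustrative only.
TOKENS = [
    (1, "λέγει", 0, "pred"),
    (2, "αὐτῷ", 1, "obl"),
    (3, "ὁ", 4, "atr"),
    (4, "Ἰησοῦς", 1, "sub"),
]


def subtree_targets(tokens, min_words=2):
    """Return the surface string of every subtree with at least min_words words."""
    # Map each head id to the ids of its direct dependents.
    children = defaultdict(list)
    for token_id, _form, head_id, _relation in tokens:
        children[head_id].append(token_id)
    forms = {token_id: form for token_id, form, _head, _rel in tokens}

    def collect(token_id):
        # A subtree is a word plus everything that depends on it, recursively.
        ids = [token_id]
        for child_id in children[token_id]:
            ids.extend(collect(child_id))
        return ids

    targets = []
    for token_id, _form, _head, _relation in tokens:
        ids = sorted(collect(token_id))  # restore the word order of the text
        if len(ids) >= min_words:
            targets.append(" ".join(forms[i] for i in ids))
    return targets


targets = subtree_targets(TOKENS)
```

On this toy clause the function yields two targets: the whole clause, and the noun phrase ὁ Ἰησοῦς rooted at Ἰησοῦς. The one-word subtrees αὐτῷ and ὁ are dropped.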

Unsurprisingly, this means common noun phrases and prepositional phrases get learnt first.

In particular, these are the first things learnt when using the next-best algorithm:

ὁ Ἰησοῦς
ἐν αὐτῷ
τοῦ θεοῦ
ἐκ θεοῦ
λέγει αὐτῷ
λέγει αὐτῷ Ἰησοῦς
λέγει αὐτῷ ὁ Ἰησοῦς
καὶ λέγει αὐτῷ
εἰς αὐτόν
πρὸς αὐτόν
τὸν πατέρα
πρὸς τὸν πατέρα
τὸν πατέρα μου
καὶ τὸν πατέρα μου
ἐν αὐτοῖς
λέγει αὐτοῖς
καὶ λέγει αὐτοῖς
λέγει αὐτοῖς ὁ Ἰησοῦς
εἶπεν αὐτῷ
καὶ εἶπεν ὁ Ἰησοῦς

Compare this with the first things learnt when the targets are clauses only (i.e. only subtrees rooted at a "pred"):

εἶπεν
εἶπεν αὐτῷ
ἀπεκρίθη Ἰησοῦς
ἀπεκρίθη αὐτῷ Ἰησοῦς
ἀπεκρίθη Ἰησοῦς αὐτῷ
λέγει
λέγει αὐτῷ
λέγει αὐτῷ Ἰησοῦς
λέγει αὐτῷ ὁ Ἰησοῦς
εἶπεν αὐτῷ ὁ Ἰησοῦς
ἀπεκρίθη ὁ Ἰησοῦς
λέγει αὐτοῖς
λέγει αὐτοῖς Ἰησοῦς
λέγει αὐτοῖς ὁ Ἰησοῦς
εἶπεν αὐτοῖς
ἀπεκρίθη αὐτοῖς
ἀπεκρίθη αὐτοῖς Ἰησοῦς
ἀπεκρίθη αὐτοῖς ὁ Ἰησοῦς
εἶπεν αὐτοῖς Ἰησοῦς
εἶπεν αὐτοῖς ὁ Ἰησοῦς

(Note: these are based purely on the surface forms in the text, with no reference to any other linguistic information.)

While it's kind of nice seeing the noun phrases emerge in the first list, I worry about learning prepositional phrases in isolation from their verb. Thoughts? Of course, when combined with inline replacement into English, the verb will be shown, albeit in English.

I also realise now that the former list should include one-word subtrees if the word is a "pred".
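That selection rule could be written down roughly as follows. This is only a sketch of the intended rule, not existing code: the function name is_target, its parameters, and the "sub" and "obl" labels are hypothetical, and only "pred" comes from the data described above.

```python
def is_target(words, root_relation, clauses_only=False):
    """Decide whether a subtree counts as a target to learn.

    words is the list of surface forms in the subtree and root_relation
    is the dependency label on the subtree's root word.
    """
    if root_relation == "pred":
        # A clause is always a target, even when it is a single word.
        return True
    if clauses_only:
        # In clauses-only mode nothing else qualifies.
        return False
    # Otherwise require a multi-word subtree.
    return len(words) > 1


# Hypothetical subtrees: (words, label on the root word).
examples = [
    (["εἶπεν"], "pred"),
    (["ὁ", "Ἰησοῦς"], "sub"),
    (["αὐτῷ"], "obl"),
]
all_subtrees = [is_target(words, rel) for words, rel in examples]
clauses_only = [is_target(words, rel, clauses_only=True) for words, rel in examples]
```

With this rule a one-word clause such as εἶπεν is kept in both modes, the noun phrase is kept only when all subtrees are targets, and a lone non-pred word is never a target.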

James