In my previous post, I talked about the legal and licensing aspects of open linguistic data, but there are also technical aspects to linguistic data being open.

To illustrate, consider an out-of-copyright, printed lexicon. From a licensing point of view, it’s open—it can be redistributed with or without modifications, etc. But that doesn’t make it particularly usable for computational work.

A while ago I came across something Greg Crane had written where he talked about things being machine-actionable. I like this a lot more than “machine-readable” because it isn’t just about being able to “read” the work, but about being able to actually do interesting things with it.

There are various facets of this, so I thought I’d try to enumerate some of them:

  • correctable — can I make corrections if I find mistakes?
  • verifiable — can I write code to check for errors?
  • reproducible — can I reproduce the results others have found?
  • extensible — can I extend it with my own data or data from other sources?
  • queryable — can I search, filter, or sort the data to get subsets of interest?
  • reusable — can I use the same data for multiple applications?
  • repurposable — can I use the data for purposes not conceived of initially?
  • adaptable — can I produce different variants of the data applicable to different users?
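
To make a few of these facets concrete, here is a minimal sketch in Python of what "verifiable", "queryable" and "extensible" might look like once a lexicon is in a structured form. The toy entries, the field names (`lemma`, `pos`, `gloss`) and the helper functions are all assumptions made up for illustration; they don't come from any particular dataset or library.

```python
# Toy lexicon: the entries and field names are illustrative assumptions,
# not taken from any real dataset.
LEXICON = [
    {"lemma": "λόγος", "pos": "noun", "gloss": "word"},
    {"lemma": "λέγω", "pos": "verb", "gloss": "say"},
    {"lemma": "ἀγαθός", "pos": "adjective", "gloss": "good"},
]

REQUIRED_FIELDS = ("lemma", "pos", "gloss")
KNOWN_POS = {"noun", "verb", "adjective"}


def find_errors(entries):
    """Verifiable: return a list of (index, message) for malformed entries."""
    errors = []
    for i, entry in enumerate(entries):
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                errors.append((i, f"missing {field}"))
        if entry.get("pos") and entry["pos"] not in KNOWN_POS:
            errors.append((i, f"unknown pos: {entry['pos']}"))
    return errors


def by_pos(entries, pos):
    """Queryable: return the lemmas with the given part of speech."""
    return [e["lemma"] for e in entries if e.get("pos") == pos]


def extend(entries, new_entries):
    """Extensible: return a new list with entries for unseen lemmas added."""
    seen = {e["lemma"] for e in entries}
    return entries + [e for e in new_entries if e["lemma"] not in seen]
```

None of this is sophisticated, which is rather the point: with a printed lexicon, even questions this simple can't be asked until someone has transcribed and structured the data.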

My BibleTech 2015 talk touched on a number of these.

I should note that it’s entirely possible to have works that are proprietary from a licensing point of view but completely open technically. I may be able to purchase a database that I can’t redistribute but which is in a clean, consistent format I can write software to process. It has the disadvantage that I can’t make corrections available to others or redistribute derivative works, but it’s better than a closed-license work that’s also closed with regard to the facets discussed in this post.