In my previous post, I talked about the legal and licensing aspects of open linguistic data, but there are technical aspects to linguistic data being open too.
To illustrate, consider an out-of-copyright, printed lexicon. From a licensing point of view, it’s open—it can be redistributed with or without modifications, etc. But that doesn’t make it particularly usable for computational work.
A while ago I came across something Greg Crane had written where he talked about things being machine-actionable. I like this a lot more than “machine-readable” because it isn’t just about being able to “read” the work but about being able to actually do interesting things with it.
There are various facets to this, so I thought I’d try to enumerate some of them:
- correctable — can I make corrections if I find mistakes?
- verifiable — can I write code to check for errors?
- reproducible — can I reproduce the results others have found?
- extensible — can I extend it with my own data or data from other sources?
- queryable — can I search, filter, or sort the data to get subsets of interest?
- reusable — can I use the same data for multiple applications?
- repurposable — can I use the data for purposes not conceived of initially?
- adaptable — can I produce different variants of the data applicable to different users?
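
To make a few of these facets concrete, here is a minimal Python sketch of what "verifiable", "correctable", and "queryable" might look like for a lexicon stored as tab-separated text. Everything in it is an assumption for illustration: the sample entries are invented, the three-column layout (`lemma`, `pos`, `gloss`) and the tag set are hypothetical, and the function names are my own rather than part of any existing tool.

```python
import csv
import io

# Invented sample data: a tiny tab-separated lexicon (lemma, part of speech, gloss).
SAMPLE_TSV = (
    "lemma\tpos\tgloss\n"
    "λόγος\tnoun\tword\n"
    "λέγω\tverb\tsay\n"
    "καλός\tadj\tgood\n"
    "ἄνθρωπος\tnoum\tperson\n"  # deliberate typo in the pos column
)

# Assumed tag set for this example, not a standard.
ALLOWED_POS = {"noun", "verb", "adj"}


def load_lexicon(text):
    """Parse tab-separated text into a list of dicts, one per entry."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def find_errors(entries):
    """Verifiable: return (row_number, message) pairs for entries failing basic checks."""
    errors = []
    for number, entry in enumerate(entries, start=1):
        if entry["pos"] not in ALLOWED_POS:
            errors.append((number, f"unknown pos {entry['pos']!r} for {entry['lemma']}"))
        # A short row leaves the gloss as None, so treat that as empty too.
        if not (entry["gloss"] or "").strip():
            errors.append((number, f"missing gloss for {entry['lemma']}"))
    return errors


def apply_corrections(entries, corrections):
    """Correctable: return new entries with per-lemma field overrides applied."""
    return [{**entry, **corrections.get(entry["lemma"], {})} for entry in entries]


def with_pos(entries, pos):
    """Queryable: return the lemmas tagged with the given part of speech."""
    return [entry["lemma"] for entry in entries if entry["pos"] == pos]


entries = load_lexicon(SAMPLE_TSV)
problems = find_errors(entries)
fixed = apply_corrections(entries, {"ἄνθρωπος": {"pos": "noun"}})
nouns = with_pos(fixed, "noun")
```

Here `problems` flags the fourth entry for its unknown tag, and once the correction is applied `find_errors(fixed)` comes back empty. Keeping corrections as a separate overlay rather than editing the source in place is just one possible design; it has the advantage that the original data stays untouched and the corrections themselves can be shared.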
My BibleTech 2015 talk touched on a number of these.
I should note that it’s entirely possible to have works that are proprietary from a licensing point of view but completely open technically. I may be able to purchase a database that I can’t redistribute but which is in a clean, consistent format that I can write software to process. It has the disadvantage that I can’t make corrections available to others or redistribute derivative works, but it’s better than a closed-license work that’s also closed with regard to the facets discussed in this post.