Why the need for uniform data?
a) The web is currently converging around web applications and mobile devices, and a lot of focus is being placed on sensor networks, the internet of things, and augmented reality to display information. Simply put, how can these applications readily make use of data published by multiple sources if that data is not in a uniform standard?
b) The core web which people use on a daily basis is ever more silo focussed, and the size of those silos is ever increasing. The social sector is a great example of this: whilst there are core movements to create a more federated and distributed social web, a key blockage in the way is a lack of uniform data. New formats are often being developed, or poorly modelled application-specific (rather than domain-specific) models are making it out on to the web, and interoperability is several times harder than it would be given uniform data. This has significant social and economic repercussions.
c) Time. A significant amount of time is invested daily by thousands (if not millions) into re-solving the same old problems: creating a schema for this, a model for that, learning the same lessons countless people have learned before them, often with a learning curve that spans several years. A standard way to publish and share reusable model-specific schemas (/not/ format specific like XML Schema and JSON Schema) would save vast amounts of developer time per annum. In addition to having significant economic impact, this would also lead to far more innovation (since there would be more time free to innovate!) within an already important and innovative sector.
Why not "plain" RDF?
RDF has failed to be understood, adopted or loved by the general masses of the web; even many who use RDF often do not fully understand it and run into many issues. Adoption has been... let's just say not good.
There are 3196 APIs on ProgrammableWeb, out of those:
- 2152 produce XML
- 1255 produce JSON
- 36 produce RDF
Perhaps more indicative, though, is that those 36 are spread over 6 years, with only 1 updated so far this year; meanwhile there have been 58 new JSON-based APIs in the last month alone.
Over on Stack Overflow, 1,569,512 questions have been asked; 273 of them (that's 0.017%) are RDF related.
The numbers are pretty clear: for all RDF's merits, and the countless benefits of the uniformity of RDF, it's just not being adopted.
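For anyone who wants to check the proportions, here is a quick sketch of the arithmetic using only the figures quoted above. The helper name `share` is just an illustrative choice.

```javascript
// Figures quoted in the text above.
const totalApis = 3196;
const apisByFormat = { XML: 2152, JSON: 1255, RDF: 36 };

// Percentage of `part` in `whole`, as a plain number.
function share(part, whole) {
  return (part / whole) * 100;
}

const apiShares = {};
for (const [format, count] of Object.entries(apisByFormat)) {
  apiShares[format] = share(count, totalApis);
}

// The per-format counts add up to more than the total number of APIs,
// so some APIs must offer more than one format.
const formatSum = Object.values(apisByFormat).reduce((a, b) => a + b, 0);

const rdfQuestionShare = share(273, 1569512);

console.log(apiShares.RDF.toFixed(1) + "% of APIs produce RDF");
console.log(rdfQuestionShare.toFixed(3) + "% of questions are RDF related");
```

Roughly two thirds of the listed APIs produce XML and a little over a third produce JSON, while RDF sits at about one percent.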
To use RDF correctly requires RDF tooling, and not just tooling to parse the data (as with JSON and common usage of XML), but tooling to use the data: to handle triples and graphs and queries, all of which requires significant investment in skills, time, and deployable technologies.
Furthermore, RDF data published using multiple different ontologies is difficult for people to use. The infrastructure and tooling simply doesn't exist to follow one's nose around the web and make practical use of several thousand different ontologies; that level of understanding is a good generation away, and for now all it does is serve as a blockage to adoption, and primarily as a blockage to people actually using or presenting the data. Time and time again we have seen a rallying around core ontologies, with successful mixing and matching happening more at the ontology level than the data level. For now applications will be looking for mentions of Classes and Properties they "understand" (have a hard coded usage for).
Additionally, these difficulties in usage have led to a second layer of centralization on the web, one which was born from RDF, and rather ironically many of the architectural benefits of uniformity and universality are being lost. That layer is SPARQL: we are seeing a huge increase in SPARQL enabled datastores on the web, each of which holds a specific set of data, and each of which has key resource limitations. Practically this means that:
- clients are tightly coupled to servers
- all processing and storage weight is being handled by the servers
- data on the wire is non-uniform
- clients are not using the web of data, rather they are using a datasource on the web, a data silo.
This is a pattern which is not optimized for anybody: servers, clients, developers, data, the web, the network.
The core benefits of a web of linked data have not been realized; RDF has failed to deliver them, primarily due to complexity and tooling requirements. SPARQL (positioned on the server/silo) is only compounding matters. That's not to say they cannot deliver them, or that these technologies are bad, only that they have not delivered the core benefits, yet.
Perhaps another way to put it is that if you break things like RDBMS and Classes and Objects down you can get to triples of some sort (EAV, RDF, or atomic relations / predicate based logic). RDF did just this; however, it was done in such a way that the data format (RDF) required a full new stack of technologies to /use/ the data, rather than being a uniform data format acting as a bridge between, say, classes and objects and RDBMS: a webized data model. That is to say, you can't really use "it" (RDF, the model people don't really speak of) with 95% of the deployed technology out there. You can provide an RDF view of the data from that technology, map it to RDF, but you cannot easily pull it back in and use it, and unusable data isn't much use. There are many shades of grey in between, but it's certainly more at the unusable end of the spectrum.
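To make the "break it down to triples" point concrete, here is a minimal sketch, assuming flat objects that carry an `id` property. The function names `toTriples` and `fromTriples` and the example URIs are made up for illustration; this is not any RDF library's API.

```javascript
// Break a plain object into [subject, predicate, object] triples.
// Assumption: flat objects with an `id` property; nested objects are out of scope.
function toTriples(record) {
  const triples = [];
  for (const [key, value] of Object.entries(record)) {
    if (key === "id") continue;
    const values = Array.isArray(value) ? value : [value];
    for (const v of values) triples.push([record.id, key, v]);
  }
  return triples;
}

// Pull the triples about one subject back into a plain object.
// Repeated predicates are collected into an array.
function fromTriples(triples, subject) {
  const record = { id: subject };
  for (const [s, p, o] of triples) {
    if (s !== subject) continue;
    if (!Object.prototype.hasOwnProperty.call(record, p)) record[p] = o;
    else if (Array.isArray(record[p])) record[p].push(o);
    else record[p] = [record[p], o];
  }
  return record;
}

const bob = {
  id: "http://example.org/bob#me",
  name: "Bob",
  knows: ["http://example.org/alice#me", "http://example.org/carol#me"],
};
const triples = toTriples(bob);
const roundTripped = fromTriples(triples, bob.id);

console.log(triples.length);
console.log(JSON.stringify(roundTripped) === JSON.stringify(bob));
```

Going out to triples is the easy half. Coming back is where information gets lost: in this sketch a single-element array comes back as a bare value, because nothing in the triples says whether a property was meant to hold one value or many. That is exactly the kind of knowledge a shared schema would have to carry.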
What can we do?
If we look at what people already do, a large proportion of web developers (most) continue to publish data via web services as XML and JSON. The common process is simple: create a schema, document it somewhere out of band (perhaps call it API documentation), and publish data using that schema in some arbitrary way as XML and JSON. On the client side the same process continues: find a new API, get an XML or JSON parser, map the data as described by the API to some classes and start using it. Much of this work is repeated needlessly for every API, yet developers are showing us what works, what they can do, and how they can work with data easily. Tersely, they are missing the benefits of Uniform Data.
We can bring the benefits of uniform data to the current web 2.0, classes and objects, RDBMS, XML and JSON focussed web.
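The client-side half of that process can be sketched as follows. Both JSON payloads, their field names, and the `Person` class are invented for illustration; the point is only that each API needs its own hand-written mapper for what is the same underlying data.

```javascript
// Two made-up API responses describing the same person in different shapes.
const fromApiA = '{"full_name": "Bob Smith", "homepage": "http://example.org/bob"}';
const fromApiB =
  '{"person": {"name": {"first": "Bob", "last": "Smith"},' +
  ' "links": [{"rel": "homepage", "href": "http://example.org/bob"}]}}';

class Person {
  constructor(name, homepage) {
    this.name = name;
    this.homepage = homepage;
  }
}

// One hand-written mapper per API, each derived from reading that API's documentation.
function personFromApiA(json) {
  const data = JSON.parse(json);
  return new Person(data.full_name, data.homepage);
}

function personFromApiB(json) {
  const { person } = JSON.parse(json);
  const link = person.links.find((l) => l.rel === "homepage");
  return new Person(`${person.name.first} ${person.name.last}`, link ? link.href : null);
}

const a = personFromApiA(fromApiA);
const b = personFromApiB(fromApiB);
console.log(a.name === b.name && a.homepage === b.homepage);
```

Parsing is trivial in both cases; the cost sits entirely in the per-API mapping code, which has to be written again for every new source.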
We can not only address these core issues, and bring the benefits of linked data and the semantic web to the general developer population, but we can also:
- ensure it's RDF and traditional semantic web compatible (giving "us" mountains of useful every-day data)
- provide that clear migration path to the "full" semantic web that's missing now.
- increase semantic web adoption exponentially, bringing big benefits without the high cost.
Approaches
There are two key approaches I can personally see to this:
- Webize Classes and Objects (Java style POJOs, Data Objects, subset of UML)
- Provide a Classes and Objects view over RDF
The first of these approaches - providing an abstract syntax for classes and objects and then defining mappings for that to XML and JSON - would bring the benefits of OWL 2 and XSD to schemas, and the benefits of "linked data" to both the schemas (/class blueprints) and instance data. It would allow data validation rules to be layered on from sources external to the schema, it could be codified in libraries across multiple languages, and it could also serve as a translation layer between Classes and Objects, NoSQL, RDBMS, and other formats such as CSV. Additionally it would allow each schema to be openly mapped to vendor specific databases, as well as to vendor neutral schemas such as ANSI SQL.
Furthermore, it would also lead to innovation in each layer. For example, standardized queries for each kind of data could be created, with translations of those to each specific vendor or to well defined standardized languages, and even codified to work in memory in libraries (for example within instance methods, or to run on GPU enabled hardware and languages). Many benefits could come from webizing what the masses already do.
Other examples include providing an opportunity to refine the core datatypes on the web in a serialization agnostic way (think xsd types merged with webidl types), ensuring the correct entailments for equality are baked into the core, providing first level support for things like lists and sets, providing a foundation upon which diff, patch, and versioning can all be accomplished, providing canonicalized forms so that encryption and data signing can be accomplished... and more I'm sure.
The second of these approaches has fewer wide-scale benefits, but would provide a more usable abstraction layer on top of RDF, which is currently (dare I say painfully) missing. This would ultimately make working with data more familiar. A codified example could be:
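As a rough illustration of the first approach, here is a sketch in which one class description drives both validation of parsed JSON and a SQL table definition. The schema shape, the datatype names, and the helper functions are all assumptions made for this example, not a proposed syntax.

```javascript
// A hypothetical class description; the shape and the URI are assumptions.
const personSchema = {
  id: "http://example.org/schema#Person",
  name: "Person",
  properties: {
    name: { type: "string", required: true },
    age: { type: "integer", required: false },
  },
};

// Per-datatype checks and SQL column types, kept outside the schema itself.
const datatypes = {
  string: { check: (v) => typeof v === "string", sql: "VARCHAR(255)" },
  integer: { check: (v) => Number.isInteger(v), sql: "INTEGER" },
};

// Validate a plain object (for example, parsed JSON) against the schema.
function validate(schema, instance) {
  const errors = [];
  for (const [prop, rule] of Object.entries(schema.properties)) {
    const value = instance[prop];
    if (value === undefined) {
      if (rule.required) errors.push(`missing ${prop}`);
    } else if (!datatypes[rule.type].check(value)) {
      errors.push(`${prop} is not a valid ${rule.type}`);
    }
  }
  return errors;
}

// Derive a table definition from the same schema.
function toCreateTable(schema) {
  const columns = Object.entries(schema.properties).map(
    ([prop, rule]) => `${prop} ${datatypes[rule.type].sql}${rule.required ? " NOT NULL" : ""}`
  );
  return `CREATE TABLE ${schema.name} (${columns.join(", ")})`;
}

console.log(validate(personSchema, JSON.parse('{"name": "Bob", "age": 42}')));
console.log(validate(personSchema, JSON.parse('{"age": "forty-two"}')));
console.log(toCreateTable(personSchema));
```

The schema is plain data with its own identifier, so it could be published and fetched like any other resource, and the datatype table could be swapped for a vendor specific one without touching the schema.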
var person = new Class('foaf:Person');    // external class definitions loaded from the web
person.load('http://example.org/bob#me'); // instance data loaded
print(person.name);                       // simple access to pre-known properties
person.validate();                        // built-in validation from OWL 2 and XSD datatype restrictions

// work with one schema class at a time
var man = new Class('gender:Male');       // different class for different data
man.load('http://example.org/bob#me');    // same data
print(man.wife);                          // different, domain-specific properties
man.expand();                             // full entailment regime support, to get the most from schema definitions
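The pseudocode above leans on things that don't exist yet, so here is a runnable, in-memory sketch of just the `load` part, assuming the class definitions and the graph are already available locally. The `schemas` and `graph` stand-ins, and the `gender:wife` property, are invented for this example; validation and entailment are deliberately left out.

```javascript
// In-memory stand-ins for data that would normally be fetched from the web.
// The class definitions and the gender: terms here are made up.
const schemas = {
  "foaf:Person": { name: "foaf:name" },
  "gender:Male": { wife: "gender:wife" },
};
const graph = [
  ["http://example.org/bob#me", "foaf:name", "Bob"],
  ["http://example.org/bob#me", "gender:wife", "http://example.org/alice#me"],
];

class Class {
  constructor(className) {
    this.properties = schemas[className]; // local property name -> predicate
    if (!this.properties) throw new Error(`unknown class ${className}`);
  }

  // Copy the values this class knows about onto the instance.
  load(subject) {
    for (const [local, predicate] of Object.entries(this.properties)) {
      const match = graph.find(([s, p]) => s === subject && p === predicate);
      this[local] = match ? match[2] : undefined;
    }
    return this;
  }
}

const person = new Class("foaf:Person").load("http://example.org/bob#me");
const man = new Class("gender:Male").load("http://example.org/bob#me");

console.log(person.name);
console.log(man.wife);
console.log(man.name); // not part of the gender:Male view, so undefined
```

Each class acts as a lens over the same triples: the `foaf:Person` view exposes `name` and nothing else, while the `gender:Male` view exposes `wife`.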
The best approach will become clear as time progresses; for now I'm keen and happy to work on either or both.
Just some musings..