<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>webr3.org &#187; virtuoso</title>
	<atom:link href="http://webr3.org/blog/category/virtuoso/feed/" rel="self" type="application/rss+xml" />
	<link>http://webr3.org/blog</link>
	<description>brain&#039;s on fire!</description>
	<lastBuildDate>Mon, 30 Aug 2010 00:11:38 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>linked data extractor prototype details</title>
		<link>http://webr3.org/blog/experiments/linked-data-extractor-prototype-details/</link>
		<comments>http://webr3.org/blog/experiments/linked-data-extractor-prototype-details/#comments</comments>
		<pubDate>Tue, 13 Apr 2010 18:53:43 +0000</pubDate>
		<dc:creator>nathan</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[internet]]></category>
		<category><![CDATA[linked data]]></category>
		<category><![CDATA[semantic web]]></category>
		<category><![CDATA[virtuoso]]></category>
		<category><![CDATA[Computing]]></category>
		<category><![CDATA[DBpedia]]></category>
		<category><![CDATA[Education]]></category>
		<category><![CDATA[extractor]]></category>
		<category><![CDATA[Open access]]></category>
		<category><![CDATA[World Wide Web]]></category>

		<guid isPermaLink="false">http://webr3.org/blog/?p=308</guid>
		<description><![CDATA[I recently released a prototype linked data semantic extraction demo which combines OpenCalais, Zemanta and Openlink Virtuoso to effectively categorize and work out what a given piece of text / document is about.
OpenCalais and Zemanta usage details and service comparison.
The demo leverages OpenCalais in order to pick up references to things, which are returned in [...]]]></description>
			<content:encoded><![CDATA[<p>I recently released a <a href="http://extractor.data.fm/?test">prototype linked data semantic extraction</a> demo which combines <a href="http://www.opencalais.com/">OpenCalais</a>, <a href="http://developer.zemanta.com/">Zemanta</a> and <a href="http://virtuoso.openlinksw.com/">Openlink Virtuoso</a> to effectively categorize and work out what a given piece of text / document is about.</p>
<h3>OpenCalais and Zemanta usage details and service comparison.</h3>
<p>The demo leverages OpenCalais in order to pick up references to things, which are returned in most cases as string literals; OpenCalais can also be configured to return SocialTags, which give a broad-strokes idea of what the document is about, again as string literal "tags". With regard to the references (semantic metadata, Entities, Facts, Events etc.) which OpenCalais returns, whilst they are generally string literals, it also returns vital Type and Relevance information, so in the case of "London" it will also assert that London is a City. Even where it doesn't already know what a thing is, it can work out that, say, "Frank Neverbeenheardofbefore" is a Person.</p>
<p>Zemanta is also leveraged. The primary difference between Zemanta and OpenCalais (and thus the need for both services) is that Zemanta focuses more on accurate tagging of text. Zemanta tags (again string literals) are meaningful, commonly known tags which are linked both to existing Linked Data identifiers such as http://dbpedia.org/resource/London and to further information about the tag (or thing); in the case of the aforementioned London, it will often also provide links to the wikipedia page for London, the official homepage of the city of London and a link showing the position of London on google maps.</p>
<p>I should point out that OpenCalais increasingly returns Linked Data too; for instance, in the case of London they have given it an HTTP URI which can be dereferenced to retrieve more information about London. At a very crude estimate I would suggest that (depending on the subject matter) OpenCalais returns Linked Data URIs for about 15% of all references it finds to well known "things".</p>
<p>Weighing up the two services I couldn't say that one is better than the other; both have advantages and disadvantages, and the only way to get a decent overall picture is to use both. For the benefit of feedback to both of these great services though, here is a general comparison:</p>
<p>note: none of these figures are from exact tests, they are from extensive developer usage of both services as I've used them both since they were made public.</p>
<p>Zemanta is generally 2x as fast for average texts (the size of this post for instance) and as much as 5x as fast for longer texts. Average for Zemanta being 0.7 to 2 seconds. Average for OpenCalais being 1.5 to 10 seconds. It may also be worth noting that the availability of Zemanta is somewhat higher than that of OpenCalais, perhaps 1 in 250 calls to OpenCalais will fail.</p>
<p>OpenCalais does a lot more heavy work than Zemanta though, and *really* semantically analyzes the text to figure out a wealth of information. In this respect the tables are completely turned: Zemanta consistently provides a few high quality known tags, whereas OpenCalais often provides at least 10x as much information about a given text, including relevance and type as mentioned before. OpenCalais also extracts Facts / Events, and further it can figure out that "Jim" is also "Jim Bob", and that Jim said X about Y on date D.</p>
<p>Generally you can trust the data from Zemanta 99% of the time as it deals with "known" things; however, because of this, very new topics (such as the iPad for the first few days after its announcement) can remain unknown. Due to the nature of OpenCalais and its dealing with the unknown you need to take more time to verify what it has returned; however, when OpenCalais assigns a Linked Data identifier to something or provides more information you can be 99.99% sure that it is entirely accurate.</p>
<p>It's worth noting that these services do different things, and both do them extremely well: Zemanta "tags" and OpenCalais "semantically extracts information". In some respects I was hesitant about comparing the two, as in the context of what I'm doing both are needed and both are equal; in other contexts, however, they do different jobs and people will need to select one over the other.</p>
<p>Out of all the competition though, I would highly recommend both Zemanta and OpenCalais over their respective competitors, and do hope that neither of these great services ever decides to target the other's market (i.e. they complement each other well, and both do so well because they stick to what they are good at).</p>
<h3>extractor.data.fm details</h3>
<p>This demo deals primarily with figuring out what a document is about; in that it aims to provide back a list of:</p>
<ul>
<li><strong>Categories</strong><br />A list of 1-5 dbpedia (and therefore wikipedia) categories which the provided document would be categorized under if it were a wikipedia article and had been categorized by a human who was knowledgeable in the subject domain(s) of the text.</li>
<li><strong>General Topics</strong><br />A short list of the general and broad-strokes Subjects covered by the document; these are distinct from the primary specific subjects covered and the categories, and in many ways can be seen as the most common intersections between the primary specific subjects discussed.</li>
<li><strong>Primary Subjects</strong><br />These are the specific subjects covered in the document, not just the things mentioned, but the things actively discussed within the document, the primary subject matter as it were.</li>
<li><strong>Related or Mentioned Subjects</strong><br />Whilst I've termed them "related" as in dcterms:related, these are simply things which have been detected in the document or text and which are not primary subjects; in many ways "mentions" may be a more appropriate term.</li>
</ul>
<p>Out of the above list, the two services do the heavy lifting to give the demo its Primary Subjects and Related Subjects; in short, OpenCalais' SocialTags and Zemanta's Tags give us back our Primary Subjects, whilst OpenCalais, by way of the semantic extraction, provides us with the Related Subjects, namely all those extracted semantics which have the Type of a real thing (not an IndustryTerm or Event) and which are not already a Primary Subject; additionally, those extracted semantics which are not tags but have a relevance higher than a certain score are boosted up to be Primary Subjects too.</p>
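<p>As a rough illustration of the split described above (a minimal sketch, not the demo's actual code), the rules can be written as a small function. The <code>Entity</code> record, the <code>split_subjects</code> helper and the 0.6 relevance threshold are all assumptions made for this example; the post only says "a certain score".</p>

```python
from dataclasses import dataclass

# Types the post excludes from "real things".
EXCLUDED_TYPES = {"IndustryTerm", "Event"}
# Assumed cut-off; the post does not give the real value.
RELEVANCE_THRESHOLD = 0.6


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for one extracted OpenCalais semantic."""
    name: str
    type: str
    relevance: float


def split_subjects(tags, entities, threshold=RELEVANCE_THRESHOLD):
    """Return (primary, related) lists of subject names, preserving input order."""
    primary = list(dict.fromkeys(tags))  # tags from both services, de-duplicated
    related = []
    for entity in entities:
        # Skip non-things and anything that is already a Primary Subject.
        if entity.type in EXCLUDED_TYPES or entity.name in primary:
            continue
        if entity.relevance > threshold:
            primary.append(entity.name)  # boosted up to a Primary Subject
        elif entity.name not in related:
            related.append(entity.name)
    return primary, related


tags = ["London", "Linked Data"]
entities = [
    Entity("London", "City", 0.9),
    Entity("Tim Berners-Lee", "Person", 0.8),
    Entity("Oxford", "City", 0.2),
    Entity("web server", "IndustryTerm", 0.7),
]
primary, related = split_subjects(tags, entities)
print(primary)
print(related)
```

<p>In this made-up input "Tim Berners-Lee" is boosted because its relevance clears the threshold, "Oxford" stays a related mention, and the IndustryTerm is dropped entirely.</p>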
<p>A primary and initial function of the demo is to associate the tags returned by both services together, and figure out when each is talking about the same thing; this is covered first by dealing with the linked data they return: where both services are talking about the same thing you simply know this unambiguously due to the nature of HTTP URIs and them both being the "sameAs" each other. After this, two chunks of unhandled data remain: Zemanta tags which have not been determined to be the sameAs OpenCalais ones, and OpenCalais semantics for which we have a string literal name and a type.</p>
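<p>The URI-based matching step can be sketched roughly as follows. This is an assumption-laden illustration rather than the demo's code: the <code>same_as</code> mapping, the <code>canonical</code> and <code>group_by_thing</code> helpers, and the example.org identifier are all hypothetical.</p>

```python
def canonical(uri, same_as):
    """Follow sameAs links until a URI with no further mapping is reached."""
    seen = set()
    while uri in same_as and uri not in seen:
        seen.add(uri)  # guards against sameAs cycles
        uri = same_as[uri]
    return uri


def group_by_thing(items, same_as):
    """Group (service, label, uri) records by canonical URI.

    Records without a URI cannot be matched this way and are returned separately.
    """
    groups, unresolved = {}, []
    for service, label, uri in items:
        if uri is None:
            unresolved.append((service, label))
        else:
            groups.setdefault(canonical(uri, same_as), []).append((service, label))
    return groups, unresolved


# Hypothetical data: the example.org URI stands in for a service-specific identifier.
same_as = {"http://example.org/calais/london": "http://dbpedia.org/resource/London"}
items = [
    ("zemanta", "London", "http://dbpedia.org/resource/London"),
    ("opencalais", "London", "http://example.org/calais/london"),
    ("opencalais", "Frank Neverbeenheardofbefore", None),
]
groups, unresolved = group_by_thing(items, same_as)
print(groups)
print(unresolved)
```

<p>Both "London" records end up under the one dbpedia URI, while the record with no URI falls into the unresolved list, which corresponds to the leftover string literals that still need a lookup.</p>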
<p>In steps <a href="http://virtuoso.openlinksw.com/">Openlink Virtuoso</a> 6.1 (<a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSIndex">open source edition</a>!) with most of dbpedia 3.4 loaded in, to do the heavy lifting from here on. Virtuoso is a really powerful bit of kit and has replaced mysql/sql server/postgres, rdf store and web dav server in my typical server stack. The public lod and dbpedia endpoints really don't do justice to just how powerful and fast Virtuoso is; queries which take a few seconds on the public endpoint return in hundredths of a second on my local (low spec) server, and the comparative performance to the aforementioned RDBMS solutions is not to be sniffed at.</p>
<p>To handle the typed string literals from OpenCalais, I built a custom dbpedia lookup service (using sparql over the aforementioned Virtuoso + dbpedia setup) which tries to unambiguously determine the identifier for a string literal, if it is known; the results are pretty good and I'd safely say that it gets it right in 98% of cases. This essentially turns the remaining unknown string literals into known Linked Data URIs, and as a side benefit gives the correct full Name for the thing identified along with the correct casing and obviously much more linked data.</p>
<p>The demo is now left with a few OpenCalais semantics which are still unknown, but for which we know the Type and have a name; and as URIs are given to things that can be Named, I simply mint my own URIs for these and specify the OpenCalais identifier as a "seeAlso" (to be future compatible with a time when they do associate their own hash URIs through to linked data).</p>
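<p>Minting a URI for a named but otherwise unknown thing might look something like the sketch below. The namespace, the URI layout and the <code>mint</code> helper are invented for illustration; the post does not say how the demo builds its URIs.</p>

```python
from urllib.parse import quote

# Hypothetical namespace for locally minted identifiers.
BASE = "http://example.org/things/"


def mint(name, type_, calais_id):
    """Build a record with a locally minted URI and a seeAlso back to OpenCalais."""
    # Spaces become underscores; everything else unsafe is percent-encoded.
    slug = quote(name.strip().replace(" ", "_"), safe="_")
    return {
        "uri": BASE + type_.lower() + "/" + slug,
        "label": name,
        "type": type_,
        "seeAlso": calais_id,
    }


record = mint(
    "Frank Neverbeenheardofbefore",
    "Person",
    "http://example.org/calais/123",  # placeholder, not a real OpenCalais identifier
)
print(record["uri"])
```

<p>Keeping the original identifier as a seeAlso means the minted URI can later be linked to whatever Linked Data the service eventually publishes for the same thing.</p>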
<p>At this point the demo has all of the Primary Subjects and Related Subjects determined and where possible linked through to additional LinkedData and human readable web documents about the subjects.</p>
<h4>Categorization</h4>
<p>This is where the script comes into its own and really leverages virtuoso; up until this point it's all been about cleaning, validating, looking up, associating and suchlike.</p>
<p>Given that we now have linked data HTTP URIs for all the subjects we are dealing with, and in all Primary Subject cases we also have dbpedia.org URIs, the demo can start to use some of Virtuoso's more powerful features. First port of call is to get the Category intersection of all primary subjects (including the inferred categories!) via a slightly complex transitive sparql query over the dbpedia dataset. From here the demo calculates a set of primary categories which the text is related to, then it finds the general category intersection (again including inferred categories) between the primary categories and the primary subjects. With the results returned comes a wealth of numerical information which the demo duly considers, and from which it can then infer the General Subjects and the Categories for the text.</p>
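<p>The idea of a category intersection that includes inferred (broader) categories can be illustrated in plain Python. This is only a sketch of the concept: the real work is done by a transitive SPARQL query inside Virtuoso, and the category names, the <code>broader</code> map and both helper functions below are made up.</p>

```python
from collections import Counter


def ancestors(category, broader):
    """All categories reachable from `category` via broader links, itself included."""
    found, stack = set(), [category]
    while stack:
        current = stack.pop()
        if current in found:
            continue
        found.add(current)
        stack.extend(broader.get(current, ()))
    return found


def rank_categories(subject_categories, broader):
    """Count, per category, how many subjects fall under it directly or by inference."""
    counts = Counter()
    for categories in subject_categories.values():
        inferred = set()
        for category in categories:
            inferred |= ancestors(category, broader)
        counts.update(inferred)  # each subject counts once per category
    return counts


# Illustrative data only; these are not real dbpedia category assignments.
subject_categories = {
    "London": ["Capitals_in_Europe"],
    "Edinburgh": ["Capitals_in_Europe"],
    "Oxford": ["Cities_in_England"],
}
broader = {
    "Capitals_in_Europe": ["Cities_in_Europe"],
    "Cities_in_England": ["Cities_in_Europe"],
}
counts = rank_categories(subject_categories, broader)
shared = [c for c, n in counts.items() if n == len(subject_categories)]
print(counts.most_common())
print(shared)
```

<p>Only the inferred category is shared by all three subjects here, which is why the inferred categories matter: a direct intersection of the assigned categories would be empty.</p>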
<p>At some point I'll cover this part of the script in more detail and give some virtuoso specific transitive SPARQL queries for you to use in your own such creations, but for now the above will have to do.</p>
<h3>Conclusion</h3>
<p>This extractor demo is something I've been working on and trying to achieve for about 5 years, and whilst it is still early days it's the first time the technologies have been available to both make it possible, and to utilize the results correctly to achieve what I'm aiming for overall.</p>
<p>The overall goal is to create a system which allows users to simply drop in content, and the system "files" it in the correct categories, lists it under the correct subjects and interlinks it with other resources via typed links such as "related resources" and looser resource lists of "also mentioned here". Further benefits of such a system are that you can accurately figure out what readers are interested in and promote new content to them, and you can give users the option of content streams where they can watch specific subjects or combinations of subjects to be notified of their "ideal" reading. On the flip side you can also identify users' and contributors' interests and expertise, and correlate these together (with geo-location) to suggest other users who they may wish to collaborate with, other organisations doing the same work in the same fields and many such uses. In reality I have much of this implemented in a site I've been working on for the last year, which is just being rolled out again, and the system works extremely well with huge benefits to all involved. The site deals with climate adaptation and both provides a service to the general adaptation community where they can share and find knowledge, and more importantly serves organisations working on critical issues by letting them see which people / organisations / projects are doing what and where, allowing them to co-ordinate efforts and, perhaps more importantly, not duplicate efforts and waste resources where it counts most. This has a positive impact on the world's poorest nations and the suffering people who these organisations are trying to work with and help.</p>
<p>Back to the demo, and with the context described, the extractor.data.fm demo is a quick UI around an API which is in many ways the backbone of the aforementioned system. The API is used in a semi-automated way, where the data returned by it is verified in a UI by the content author / admins who remove any ambiguous data and then hit save; from there everything is automated again and the system functions as above.</p>
<p>I'm unsure whether this kind of system will ever be able to be fully automated (or whether it's wise to allow this), as certain scenarios just can't be covered yet; a real life example of this is an initiative called "TEA". Ambiguity at this level, and with entities which are unknown to systems or even the web of data, will always be an issue at some point. As things progress it may be that they are only ambiguous once, on their first discovery, but that is still once; hence this may always have to be a semi-automated process.</p>
]]></content:encoded>
			<wfw:commentRss>http://webr3.org/blog/experiments/linked-data-extractor-prototype-details/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Virtuoso 6, SPARQL + GEO, Sample Queries</title>
		<link>http://webr3.org/blog/linked-data/virtuoso-6-sparqlgeo-and-linked-data/</link>
		<comments>http://webr3.org/blog/linked-data/virtuoso-6-sparqlgeo-and-linked-data/#comments</comments>
		<pubDate>Wed, 03 Feb 2010 23:24:40 +0000</pubDate>
		<dc:creator>nathan</dc:creator>
				<category><![CDATA[linked data]]></category>
		<category><![CDATA[virtuoso]]></category>
		<category><![CDATA[Computing]]></category>
		<category><![CDATA[Edinburgh]]></category>
		<category><![CDATA[Filter]]></category>
		<category><![CDATA[FOAF]]></category>
		<category><![CDATA[Group action]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[London]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[New York City]]></category>
		<category><![CDATA[Oxford]]></category>
		<category><![CDATA[RDBMS]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[RDF Schema]]></category>
		<category><![CDATA[semantic web]]></category>
		<category><![CDATA[SPARQL]]></category>
		<category><![CDATA[text search]]></category>
		<category><![CDATA[United Kingdom]]></category>
		<category><![CDATA[World Wide Web]]></category>
		<category><![CDATA[XML]]></category>
		<category><![CDATA[York]]></category>

		<guid isPermaLink="false">http://webr3.org/blog/?p=183</guid>
		<description><![CDATA[Alongside a whole host of improvements, the latest version of Virtuoso (Virtuoso 6) has added support for Geo data! One small sentence, one huge leap for mankind; it's vastly important IMHO because it brings a new kind of link to Linked Data; a location based one.
Very brief intro: SPARQL is a fantastic query language [...]]]></description>
			<content:encoded><![CDATA[<p>Alongside a whole host of improvements, the latest version of Virtuoso (<a href="http://bit.ly/dgbAXS">Virtuoso 6</a>) has added support for Geo data! One small sentence, one huge leap for mankind; it's vastly important IMHO because it brings a new kind of link to Linked Data; a location based one.</p>
<p>Very brief intro: SPARQL is a fantastic query language which works over RDF and thus Linked Data. Virtuoso, amongst other things, has a powerful QuadStore which can be queried via SPARQL, and Virtuoso's implementation of SPARQL plus the extensive suite of extensions they have implemented makes it the most usable and powerful query language available (again, in my honest opinion). In short this combination was enough to make me drop normal RDBMSs and never look back.</p>
<p>Rather than rambling on about how fantastic it is though, here are some Virtuoso specific sample SPARQL (+GEO) queries, which should hopefully whet your appetite and give you some indication of what can be done.</p>
<h2>Basic Geo Lookups</h2>
<p><strong>Things within 20km of New York City : <a href="http://bit.ly/9IBiVW" target="_blank">RESULTS</a></strong><br />
<code>  SELECT DISTINCT ?resource ?label ?location<br />
  WHERE<br />
  {<br />
    &lt;http://dbpedia.org/resource/New_York_City> geo:geometry ?sourceGEO .<br />
    ?resource geo:geometry ?location ; rdfs:label ?label .<br />
    FILTER( bif:st_intersects( ?location, ?sourceGEO, 20 ) ) .<br />
    FILTER( lang(?label) = "en" )<br />
  }</code></p>
<p><strong>Distance between New York City and London, England : <a href="http://bit.ly/bYNfWO" target="_blank">RESULTS</a></strong><br />
<code>  SELECT (bif:st_distance(?nyl,?ll)) as ?distanceBetweenNewYorkCityAndLondon<br />
  WHERE<br />
  {<br />
    &lt;http://dbpedia.org/resource/New_York_City> geo:geometry ?nyl .<br />
    &lt;http://dbpedia.org/resource/London> geo:geometry ?ll .<br />
  }<br />
 </code></p>
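<p>For a feel of what a point-to-point distance like this involves, here is a small great-circle (haversine) calculation. It is offered as a sanity check only: how Virtuoso computes <code>bif:st_distance</code> internally is not covered here, and the coordinates below are approximate city-centre values assumed for the example rather than read from dbpedia.</p>

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate coordinates, assumed for illustration.
new_york = (40.7128, -74.0060)
london = (51.5074, -0.1278)
distance = haversine_km(*new_york, *london)
print(round(distance), "km")
```

<p>With these inputs the result comes out a little under 5,600 km, which gives a ballpark to compare against whatever the query returns.</p>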
<h2>Querying Time and Space</h2>
<p><strong>All Educational Institutions within 10km of Oxford, UK; ordered by date of establishment : <a href="http://bit.ly/biZEHA" target="_blank">RESULTS</a></strong><br />
<code>SELECT DISTINCT ?thing as ?uri ?thingLabel as ?name ?date as ?established ?matchGEO as ?location<br />
WHERE<br />
{<br />
&lt;http://dbpedia.org/resource/Oxford&gt; geo:geometry ?sourceGEO .<br />
?resource geo:geometry ?matchGEO .<br />
FILTER( bif:st_intersects( ?matchGEO, ?sourceGEO, 5 ) ) .<br />
?thing ?somelink ?resource ; &lt;http://dbpedia.org/ontology/established&gt; ?date ; rdfs:label ?thingLabel . FILTER( lang(?thingLabel) = "en" )<br />
} ORDER BY asc( ?date )<br />
</code><br />
<strong>Historical cross section of events related to Edinburgh and the surrounding area (within 30km) during the 19th century : <a href="http://bit.ly/dfZU43" target="_blank">RESULTS</a></strong><br />
<code>SELECT DISTINCT ?thing ?thingLabel ?dateMeaningLabel ?date ?matchGEO WHERE {<br />
{<br />
SELECT DISTINCT ?thing ?matchGEO<br />
WHERE<br />
{<br />
&lt;http://dbpedia.org/resource/Edinburgh&gt; geo:geometry ?sourceGEO .<br />
?resource geo:geometry ?matchGEO .<br />
FILTER( bif:st_intersects( ?matchGEO, ?sourceGEO, 30 ) ) .<br />
?thing ?somelink ?resource<br />
}<br />
}<br />
{?property rdf:type owl:DatatypeProperty ; rdfs:range xsd:date } .<br />
?thing ?dateMeaning ?date . FILTER( ?dateMeaning in( ?property ) ) . FILTER( ?date &gt;= xsd:gYear("1800") &amp;&amp; ?date &lt;= xsd:gYear("1900") )<br />
?dateMeaning rdfs:label ?dateMeaningLabel . FILTER( lang(?dateMeaningLabel) = "en" ) .<br />
?thing rdfs:label ?thingLabel . FILTER( lang(?thingLabel) = "en" )<br />
} ORDER BY asc( ?date )</code></p>
<h2>Transitivity and Inference (v5 compatible)</h2>
<p><strong>Finding the shortest route between two "things" (HTML and XML in the example) : <a href="http://bit.ly/cJjsBL" target="_blank">RESULTS</a></strong><br />
<code>SELECT ?route ?jump WHERE<br />
{<br />
 { SELECT ?x ?y WHERE { ?x foaf:page ?xpage ; ?predicate ?y . filter( isURI(?y) ) } }<br />
 OPTION ( TRANSITIVE, T_DISTINCT, T_SHORTEST_ONLY, t_in(?x), t_out(?y), t_max(10), t_step('path_id') as ?path, t_step(?x) as ?route, t_step('step_no') AS ?jump )<br />
 . FILTER ( ?y = &lt;http://dbpedia.org/resource/HTML&gt; &amp;&amp; ?x = &lt;http://dbpedia.org/resource/XML&gt; )<br />
}<br />
</code></p>
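<p>Conceptually, asking for only the shortest route with a cap on the number of steps resembles a breadth-first search. The sketch below shows that idea on a tiny invented graph; it makes no claim about how Virtuoso evaluates its transitive options, and the <code>shortest_path</code> helper and the graph's edges are hypothetical.</p>

```python
from collections import deque


def shortest_path(graph, start, goal, max_steps=10):
    """Breadth-first search; returns the first (and so shortest) path found, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) > max_steps:
            continue  # extending this path would exceed the step limit
        for neighbour in graph.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Invented links between resources, for illustration only.
graph = {
    "XML": ["SGML", "W3C"],
    "SGML": ["HTML"],
    "W3C": ["HTML", "XML"],
    "HTML": [],
}
route = shortest_path(graph, "XML", "HTML")
for jump, resource in enumerate(route):
    print(jump, resource)
```

<p>Each printed row pairs a step number with a resource on the route, loosely mirroring the <code>?jump</code> and <code>?route</code> columns in the query above.</p>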
<p><strong>..and all routes between the two "things" : <a href="http://bit.ly/cQV4AW" target="_blank">RESULTS</a></strong><br />
<code>SELECT ?route ?path ?jump WHERE<br />
{<br />
 { SELECT ?x ?y WHERE { ?x foaf:page ?xpage ; ?predicate ?y . filter( isURI(?y) ) } }<br />
 OPTION ( TRANSITIVE, T_NO_CYCLES, t_in(?x), t_out(?y), t_max(5), t_step('path_id') as ?path, t_step(?x) as ?route, t_step('step_no') AS ?jump )<br />
 . FILTER ( ?y = &lt;http://dbpedia.org/resource/HTML&gt; &amp;&amp; ?x = &lt;http://dbpedia.org/resource/XML&gt; )<br />
}</code></p>
<p><strong>Traversing Ontologies and (Sub)Classes; all subclasses of Person down the hierarchy  : <a href="http://bit.ly/aZ0oOM">RESULTS</a></strong><br />
<code>SELECT DISTINCT ?x WHERE<br />
{<br />
 { SELECT ?x ?y WHERE { ?x rdfs:subClassOf ?y } }<br />
 OPTION ( TRANSITIVE, T_DISTINCT, t_in(?x), t_out(?y), t_step('path_id') as ?path, t_step(?x) as ?route, t_step('step_no') AS ?jump, T_DIRECTION 2 )<br />
 FILTER ( ?y = &lt;http://dbpedia.org/ontology/Person> )<br />
}</code></p>
<h2>Free text search, scores and IRI Ranks (v5 compatible)</h2>
<p><strong>Searching over labels, with text match scores and additional ranks for each iri / resource  : <a href="http://bit.ly/bMNweO">RESULTS</a></strong><br />
<code>SELECT ?s ?page ?label ?textScore (&lt;LONG::IRI_RANK&gt;(?s)) as ?iriRank WHERE {<br />
  ?s foaf:page ?page ; rdfs:label ?label . FILTER( lang(?label) = "en" ) .<br />
  ?label bif:contains 'adobe and flash' option (score ?textScore ) .<br />
}</code></p>
<p><strong>Virtuoso 6.1 (Open Source Edition) released. For features &#038; bug fix details see: <a href="http://bit.ly/dgbAXS">link</a></strong></p>
<p><img src="http://webr3.org/blog/wp-content/uploads/2010/02/spo.jpg" alt="spo" title="spo" width="600" height="250" class="alignnone size-full wp-image-226" /></p>
]]></content:encoded>
			<wfw:commentRss>http://webr3.org/blog/linked-data/virtuoso-6-sparqlgeo-and-linked-data/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
