Greg Boutin wrote a fairly in-depth piece on SemanticProxy. In this article Greg reviews SemanticProxy’s performance and asks a number of questions about whether it’s truly “Semantic”. So – second in a series of cheating by republishing responses I’ve written… here we go.
Greg’s original article is located here
I thought I had responded to this post – but it appears it was one of those many responses I’ve composed in my head while driving or whatever and never actually gotten down in writing.
First, a couple of things that may need clarification.
SemanticProxy is Calais. What SemanticProxy does is take the burden of fetching a web page, cleaning the HTML, calling Calais and all the rest off the developer. It does all of that for you and returns the results as RDF – or as HTML for demonstration purposes. So – any functionality in Calais is automatically reflected in SemanticProxy. The main technical challenge with SemanticProxy, other than engineering for scalability, is simply HTML cleaning. One thing we’re thinking of is the creation of a simple tag publishers can embed to indicate the start/stop of the “core” content on a page.
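To make the fetch-clean-call flow concrete, here is a minimal sketch of the HTML-cleaning step – the part SemanticProxy takes off the developer’s plate. This is purely illustrative (Calais’s actual cleaning is more sophisticated, and the class name here is my own invention), using only Python’s standard library:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Illustrative HTML cleaner: collects visible text,
    skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(html: str) -> str:
    """Return the visible text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts).strip()
```

The cleaned text would then be handed to the extraction engine; a publisher-embedded “core content” tag like the one mentioned above would simply narrow which part of the page gets this treatment.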
The second area is the engine underlying Calais. In your post you mention that you assume it’s a statistical engine – it isn’t. The Calais engine is built on core Natural Language Processing (NLP) technology augmented by lexicons and statistical methods. It works by parsing the parts of speech into core elements and then applying a three-tiered set of pattern-recognition and rule-based approaches, wrapping up with a voting and scoring system that selects from the candidate entities, facts and events. The rules and pattern-recognition techniques are tuned to identify specific types of entities (people, places, organizations, etc.), facts (Person:JobPosition, Person:PoliticalAffiliation, etc.) and events (NaturalDisaster, SportingGame, EarningsAnnouncement, etc.). The specific elements that Calais understands are documented on our site, and the list grows by 5-15 each month.
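The shape of that pipeline – multiple rule and lexicon passes proposing typed candidates, followed by a scoring step that picks winners – can be sketched in a toy form. Everything below (the rules, the lexicon, the scores) is invented for illustration; the real engine is far deeper:

```python
import re

# Hypothetical lexicon of known company names.
COMPANY_LEXICON = {"ACME", "IBM"}

def rule_capitalized_pair(text):
    """Toy pattern rule: two capitalized words -> candidate Person."""
    for m in re.finditer(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b", text):
        yield (m.group(1), "Person", 0.5)

def rule_lexicon(text):
    """Toy lexicon rule: known names -> candidate Company, high score."""
    for name in COMPANY_LEXICON:
        if name in text:
            yield (name, "Company", 0.9)

def extract_entities(text):
    """Run all rules, then 'vote': keep the best-scoring type per name."""
    best = {}  # name -> (type, score)
    for rule in (rule_capitalized_pair, rule_lexicon):
        for name, etype, score in rule(text):
            if score > best.get(name, (None, 0.0))[1]:
                best[name] = (etype, score)
    return {name: etype for name, (etype, _) in best.items()}
```

The point is only structural: candidates come from several independent mechanisms, and a final scoring pass arbitrates among them.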
Calais also supports “Semi-Exhaustive Extraction” (SEE) for those who want to dive into the deep end of the semantics pool. In SEE we extract all relationships between Thing1 and Thing2 if we can type at least one of the things.
Entity recognition will always be an “IS A” type predicate. “John Doe” “IS A” “Entity Type Person” – so all of our entity recognition automatically falls into this category.
Facts and events are a little more complicated. For example, let’s take something simple like Calais extracting that a person has a particular job title at a particular company. I’m not even going to attempt to write out the RDF – but the basics of that type of relationship would look like:
“John Doe” “IS A” “Person”
“John Doe” “Has the Title” “Chief Wrangler” “AFFILIATED WITH” “ACME”
“ACME” “IS A” “Company”
That’s not even close to RDF – but you get the idea.
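One way to picture what real output captures is as a set of subject–predicate–object triples, with the job fact reified as its own node that ties person, title and company together. The predicate and type names below are made up for illustration and are not Calais’s actual vocabulary:

```python
# Hypothetical triple rendering of the example above. The "fact1"
# node is the reified fact: the "smart predicate" linking the
# person, the title and the company.
triples = [
    ("John Doe", "is-a", "Person"),
    ("ACME", "is-a", "Company"),
    ("fact1", "is-a", "PersonCareerFact"),
    ("fact1", "person", "John Doe"),
    ("fact1", "title", "Chief Wrangler"),
    ("fact1", "company", "ACME"),
]

def objects_of(subject, predicate):
    """Look up all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

In genuine RDF each of those strings would be a URI in a defined ontology, but the graph shape – entities plus a fact node relating them – is the same idea.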
So – are we using “smart” predicates? I think so. Everything we identify (other than simple entity recognition – which is the easy part) is represented in RDF as a series of relationships and attributes. Every fact we identify is, in essence, its own smart predicate. Every event is built up out of facts and entities.
What we don’t do is deliver any level of analysis beyond what’s presented to us. We don’t dip into the global linked data brain or DBpedia or other assets to find and deliver more information about what we’ve extracted. If we tell you someone is a “Person” – we don’t tell you that people are mammals. As far as I’m concerned – that’s where linked data and large-scale “describe the world” ontologies come in.
So – in summary. Entity recognition (the relatively easy part of what we do) is always about “IS A” type relationships. The harder (and cooler in the long run) stuff is much more sophisticated.
Also – one (well, two) exceptions to the “we don’t augment with external data” statement above. In our current technology preview release we’ve rolled out disambiguation around companies and geographies. What this means is that if an article says IBM, IBM Research, IBM Limited or IBM Labs – we’ll tell you it’s really “IBM” and give you the appropriate identifying information (ticker, web site, etc.). We do this using a BIG table – but we also go beyond that and look for contextual clues like industries and geographies that help us narrow things down.
Geographies are similar – “Longhorns” are more likely to be associated with Paris, TX than with Paris, France.
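The “big table plus contextual clues” idea can be sketched as an alias lookup for companies and a clue-overlap score for places. The tables, aliases and clue words here are all invented for illustration:

```python
# Hypothetical alias table mapping name variants to a canonical company.
COMPANY_ALIASES = {
    "IBM": "IBM",
    "IBM Research": "IBM",
    "IBM Limited": "IBM",
    "IBM Labs": "IBM",
}
# Hypothetical identifying info attached to the canonical entry.
COMPANY_INFO = {"IBM": {"ticker": "IBM", "website": "www.ibm.com"}}

# Hypothetical ambiguous place names with context-clue words.
PLACES = {
    "Paris": [
        ("Paris, TX", {"longhorns", "texas", "rodeo"}),
        ("Paris, France", {"louvre", "seine", "baguette"}),
    ],
}

def resolve_company(mention):
    """Collapse a name variant to its canonical company + info."""
    canonical = COMPANY_ALIASES.get(mention)
    if canonical is None:
        return None, None
    return canonical, COMPANY_INFO.get(canonical)

def resolve_place(mention, context_words):
    """Pick the candidate place whose clues best overlap the context."""
    best, best_score = None, -1
    for name, clues in PLACES.get(mention, []):
        score = len(clues & context_words)
        if score > best_score:
            best, best_score = name, score
    return best
```

A document mentioning “Paris” alongside “longhorns” resolves to the Texas town; swap in Seine-flavored context and the French capital wins instead.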
Long response – but I felt a few of these things were worth clarifying. We’re really enjoying the widespread adoption of Calais (almost 1.5M transactions per day and climbing) – but at this point most of the use cases are barely scratching the surface of what Calais provides. Once people have gotten over the current focus on entity recognition (tag clouds anyone?) we hope they’ll step back and explore some of the more powerful semantic capabilities Calais has to offer.