context:forge

improving the signal to noise ratio. information in context. web as knowledge.

Greg Boutin @ Semantics Incorporated

Greg Boutin wrote a fairly in-depth piece on SemanticProxy. In this article Greg reviews SemanticProxy’s performance and asks a number of questions about whether it’s truly “Semantic”. So – second in a series of cheating by republishing responses I’ve written… here we go.

Greg’s original article is located here

Greg:

I thought I had responded to this post – but it appears it was one of those many responses I’ve composed in my head while driving or whatever and never actually gotten down in writing.

First, a couple of things that may need clarification.

SemanticProxy is Calais. SemanticProxy takes the burden of fetching a web page, cleaning the HTML, calling Calais and all that off the developer. It does all of that for you and returns the results as RDF – or as HTML for demonstration purposes. So – any functionality in Calais is automatically reflected in SemanticProxy. Other than engineering for scalability, the main technical challenge with SemanticProxy is simply HTML cleaning. One thing we’re thinking of is creating a simple tag that publishers can embed to indicate the start and stop of the “core” content on a page.
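To make the fetch, clean, extract flow concrete, here is a minimal Python sketch. It is not SemanticProxy's code: `clean_html`, `proxy`, and the `fetch` and `extract` callables are hypothetical names, and the cleaning step only strips tags, scripts and styles, which is far cruder than what a production cleaner has to do.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def clean_html(html):
    """Reduce an HTML page to plain text (a deliberately naive cleaner)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def proxy(url, fetch, extract):
    """Hypothetical pipeline: fetch(url) returns HTML, extract(text) returns metadata.

    Both callables are stand-ins supplied by the caller; nothing here
    talks to the real Calais service.
    """
    return extract(clean_html(fetch(url)))
```

Passing `fetch` and `extract` in as arguments keeps the sketch runnable without a network connection and makes it obvious that the cleaning step is the only part implemented here.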

The second area is around the engine underlying Calais. In your post you mention that you assume it’s a statistical engine – it isn’t. The Calais engine is built on core natural language processing (NLP) technology augmented by lexicons and statistical methods. It works by parsing out the parts of speech into core elements and then applying a three-tiered set of pattern recognition and rule-based approaches, wrapping up with a voting and scoring system that selects from the candidate entities, facts and events. The rules and pattern recognition techniques are tuned to identify specific types of entities (people, places, organizations, etc), facts (Person:JobPosition, Person:PoliticalAffiliation, etc) and events (NaturalDisaster, SportingGame, EarningsAnnouncement, etc). The specific elements that Calais understands are documented on our site and expand by 5-15 each month.

Calais also supports “Semi-Exhaustive Extraction” (SEE) for those that want to dive into the deep end of the semantics pool. In SEE we extract all relationships between Thing1 and Thing2 if we can type at least one of the things.
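The SEE rule can be sketched in a few lines of Python. This only illustrates the "at least one thing is typed" filter; the function name, the tuple layout and the sample data are all assumptions, not how Calais represents candidate relationships internally.

```python
def semi_exhaustive_extract(candidate_relations, entity_types):
    """Keep (thing1, relation, thing2) tuples where at least one thing has a known type."""
    kept = []
    for thing1, relation, thing2 in candidate_relations:
        if thing1 in entity_types or thing2 in entity_types:
            kept.append((thing1, relation, thing2))
    return kept


# Hypothetical typed entities and candidate relationships.
entity_types = {"John Doe": "Person", "ACME": "Company"}
candidates = [
    ("John Doe", "visited", "the old warehouse"),   # thing1 is typed: kept
    ("the committee", "praised", "ACME"),            # thing2 is typed: kept
    ("the committee", "reviewed", "the proposal"),   # neither is typed: dropped
]

for relation in semi_exhaustive_extract(candidates, entity_types):
    print(relation)
```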

Entity recognition will always be an “IS A” type predicate. “John Doe” “IS A” “Entity Type Person” – so all of our entity recognition will automatically fall into this category.

Facts and events are a little more complicated. For example let’s take something simple like Calais extracting that a person has a particular job title at a particular company. I’m not going to even attempt to write out the RDF – but the basics of that type of relationship would look like:

“John Doe” “IS A” “Person”
“John Doe” “Has the Title” “Chief Wrangler” “AFFILIATED WITH” “ACME”
“ACME” “IS A” “Company”

That’s not even close to RDF – but you get the idea.
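For anyone who thinks better in code, here is the same idea as plain Python tuples. It is still not RDF and not the Calais output schema; the predicate names and the `fact:1` identifier are made up for illustration. The point is that the job-title fact connects three things, so it gets a node of its own that the other statements hang off.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate object")

triples = [
    # Simple entity recognition: "IS A" statements.
    Triple("John Doe", "is_a", "Person"),
    Triple("ACME", "is_a", "Company"),
    # The job-title fact links a person, a title and a company,
    # so it is modelled as its own node (identifier is hypothetical).
    Triple("fact:1", "is_a", "JobPosition"),
    Triple("fact:1", "person", "John Doe"),
    Triple("fact:1", "title", "Chief Wrangler"),
    Triple("fact:1", "company", "ACME"),
]


def describe(subject, triples):
    """Return every predicate/object pair recorded for one subject."""
    return {t.predicate: t.object for t in triples if t.subject == subject}


print(describe("fact:1", triples))
```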

So – are we using “Smart” predicates? I think so. Everything we identify (other than simple entity recognition – which is the easy part) is represented in RDF as a series of relationships and attributes. Every fact we identify is, in essence, its own smart predicate. Every event is built out of facts and entities.

What we don’t do is deliver any level of analysis beyond what’s presented to us. We don’t dip into the global linked data brain or Dbpedia or other assets to find and deliver more information about what we’ve extracted. If we tell you someone is a “Person” – we don’t tell you that people are mammals. As far as I’m concerned – that’s where linked data and large scale “describe the world” ontologies come in.

So – in summary. Entity recognition (the relatively easy part of what we do) is always about “IS A” type relationships. The harder (and cooler in the long run) stuff is much more sophisticated.

Also – one (well, two) exceptions to the “we don’t augment with external data” statement above. In our current technology preview release we’ve rolled out disambiguation around companies and geographies. What this means is that if an article says IBM, IBM Research, IBM Limited or IBM Labs – we’ll tell you it’s really “IBM” and give you the appropriate identifying information (ticker, web site, etc). We do this using a BIG table – but we also go beyond that and look for contextual clues like industries and geographies that will help us narrow things down.

Geographies are similar – “Longhorns” are more likely to be associated with Paris, TX than with Paris, France.
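A toy version of both ideas, the lookup table and the contextual clues, might look like the Python below. Everything in it is an assumption made for illustration: the table contents, the clue words and the function names are invented, and the real tables and scoring are much larger and more involved.

```python
# Hypothetical alias table: lower-cased mention -> canonical company name.
COMPANY_ALIASES = {
    "ibm": "IBM",
    "ibm research": "IBM",
    "ibm limited": "IBM",
}

# Hypothetical place candidates, each with a few context clue words.
PLACE_CANDIDATES = {
    "paris": [
        {"name": "Paris, France", "clues": {"france", "seine", "louvre"}},
        {"name": "Paris, TX", "clues": {"texas", "longhorns", "dallas"}},
    ],
}


def resolve_company(mention):
    """Table lookup only; returns None for mentions that are not in the table."""
    return COMPANY_ALIASES.get(mention.strip().lower())


def resolve_place(mention, context_words):
    """Pick the candidate whose clue words overlap most with the surrounding text."""
    candidates = PLACE_CANDIDATES.get(mention.strip().lower(), [])
    if not candidates:
        return None
    context = {word.lower() for word in context_words}
    # max() returns the first candidate on a tie, so list order acts as the default.
    best = max(candidates, key=lambda c: len(c["clues"] & context))
    return best["name"]


print(resolve_company("IBM Research"))
print(resolve_place("Paris", "The Longhorns played in Paris".split()))
```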

Long response – but I felt a few of these things were worth clarifying. We’re really enjoying the widespread adoption of Calais (almost 1.5M transactions per day and climbing) – but at this point most of the use cases are barely scratching the surface of what Calais provides. Once people have gotten over the current focus on entity recognition (tag clouds anyone?) we hope they’ll step back and explore some of the more powerful semantic capabilities Calais has to offer.

Regards,

Tom

Mark Gould @ Brand 3.0

Mark Gould wrote a nice overview of Calais and SemanticProxy.com here http://bit.ly/gL1Aq. Because this was an introduction to Calais for a new audience oriented toward brand and marketing – I thought it was worthwhile to respond with a basic overview of what Calais is about and why we’re doing it. Given that the response ended up being fairly lengthy – I thought I’d share it here as well. Some general thoughts on the Semantic Web vs. the Semantic Stack, barriers to adoption, getting to critical mass and reality vs. philosophy.

First, thanks for taking note of Calais. We’re still deep in the learning curve and the more that different people with different needs think about it, try it out and give us feedback the better.

If you’re just starting to look into this area – a word of warning. It’s very important to distinguish between the vision of the Semantic Web and the stack – the defined set of standards – that will enable the Semantic Web. In my view the Semantic Web is an aspiration composed of 1) use of the semantic stack and 2) a critical mass of adoption across the web. While we’re seeing many instances of adoption of the technologies – we have a long way to go before we reach critical mass.

So – how do we move toward critical mass? What Calais is trying to do is address what we see as the central rate-limiting factor for adoption: the generation of high quality semantic metadata for unstructured content such as news, reports, novels – whatever. While the standards are well defined for how to represent this metadata we’re still left with one simple issue: it takes time and it costs money. Given that the “semantic consumer” end of the story is still relatively undeveloped, few writers and publishers can afford to invest that time and money.

Calais doesn’t solve this problem – but it does throw some fuel on the fire. By automating the generation of semantic metadata with a very high degree of accuracy we hope to jumpstart the adoption curve. If there’s lots of semantic content out there people will build great semantically enabled applications. If there are great applications people will invest in semantically enabled content.

The best way to take it for an initial spin is with the Calais viewer application at http://bit.ly/4DXfKw . Copy a news article or such, paste it in and see how we do. In general you’ll see better results with the viewer than with SemanticProxy.com because the proxy has additional work to do such as cleaning HTML pages. This work can create noise that reduces accuracy.

One last point. You don’t have to believe in or even agree with all of the philosophy around the Semantic Web to take advantage of it. There is a well-defined set of standards, from RDF to SPARQL, and capabilities such as Calais that can add value to what you’re doing today. Grab a piece of that stack and make something cool happen.

Regards,

Semantic Search Means ….?

We’re in the year of the Semantic Web. Or maybe it’s the year when the semantic stack starts to add value to real users’ experiences. Or maybe it’s the year before the year when ….

We’ve all been to the conferences, we’ve all had the meetings, whether we’re builders or consumers – it’s clear that something is in the air around this topic.

We’re also impatient. The Semantic Web (stack, apps, whatever) has been right around the corner for a little while now. That impatience is causing us to spend an inordinate amount of time casting around for the application that’s going to prove the naysayers wrong, change the game, change the world.

And because we’re humans, tool users and pattern matchers – we end up landing at an answer that feels safe, that we know works, that people understand, that’s generated a bunch of billions of dollars: Search. And then we tie a bow on it so it feels new and ….. we have Semantic Search.

Let’s put aside the whole issue of whether semantic search is the killer app for the moment. I personally think it may be one of the functions that sees dramatic improvement through semantic technologies – but it doesn’t feel, today, like the application that’s going to knock our socks off.

I’d also like to take off the table the applicability of semantic search to tightly constrained, well defined, rigidly controlled knowledge domains. We all know it can do some great stuff when applied to questions about gene expression in the nasal epithelial cells of the South African Tree Frog under ultraviolet stimulation – but I think it might be a little more interesting to concentrate on searches that the other 99% of the bell curve care about.

Part of the problem may be that we’re using the term Semantic Search. I have no idea what it means. When I’m talking with someone about it we have no shared understanding. I absolutely cannot explain it to non-semantageeks. So, let’s deconstruct semantic search into its constituent components and talk a bit about how and whether semantic technologies might actually make it better. The results of the dissection are here on the table….

  1. What kinds of questions can we ask? Can we embed logic in our questions? Do we expect inference in our results?
  2. How can we ask them – keywords, natural language and all that jazz.
  3. Generating the “right” result set for the query.
  4. Displaying the result set in the most effective manner
  5. Making money from doing all that

So – my challenge to myself is to write a brief (well, maybe not too brief) post about each of these subtopics and talk about how semantics can – or cannot – make it better. Until we get down to this level of granularity “semantic search” is just a catchphrase without, well … semantics.

Bookmarks for August 30th through August 31st

Links for August 30th through August 31st:

Bookmarks for August 27th through August 29th

Links for August 27th through August 29th:

Calais Ecosystem: Gnosis for Firefox & IE

One in a series of posts on cool tools that have been built using the Calais service from Thomson Reuters. I promise a big post on what Calais is, what it does, why we’re doing it and all that jazz in the near future. In the meantime feel free to visit the site (above) or my really quick Calais overview in my last post on Drupal.

If you’re like me you spend a lot of your time on the web reading the news, reviews and blog postings. It’s great – but sometimes I wish I had my own research assistant to highlight the important stuff and do a little research for me. If I’m reading about a person or place or company I’m interested in I find myself doing a lot of copying, going to Google or Wikipedia, pasting, searching, finding the tab I was originally on, finding my place in the article, etc, etc. And I’m mostly just reading because I’m interested – researchers, bloggers and journalists spend many hours at a stretch doing this.

Gnosis isn’t quite as good as your own personal research assistant – but it’s a step in the right direction.

Built as a plugin for both Firefox and IE, Gnosis sits in the background and analyzes what you’re reading. Using the Calais web service it finds the people, companies, organizations, locations and quite a few other things in the text and marks them with a fairly subtle underline.

When you hover over one of those items Gnosis pops up a smart and contextually relevant information box that lets you search for companies in places that know about companies, people in places that know about people, locations in things that know about locations. You get the idea.

You can do this on demand when you’re reading something – or you can update the Gnosis preferences and tell it to do it automatically on specific sites. I’ve set mine for automatic tagging on most of the major news sites, a few blogs and Wikipedia. A small warning – Gnosis sometimes breaks on Ajax-heavy sites like the Google RSS reader. We’re working on that.

Speaking of Wikipedia – Gnosis is a great tool for use there. While the individuals creating Wikipedia articles try to do a good job hyperlinking items in the article to other relevant Wikipedia articles – they often miss the boat. Many of the items in the article that should be hyperlinked are not – forcing you once again into a cycle of cut, paste, search, etc. Gnosis solves that by automatically hyperlinking relevant items and allowing you to navigate directly to the appropriate Wikipedia page.

If you want a quick snapshot of all of the people, places, things, etc mentioned in an article then just open the Gnosis sidebar. It will give you a quick overview of everything it has found and allow you to navigate directly to the things you’re interested in.

That’s the description: here’s what’s cool. Gnosis lets you apply the power of high-end natural language processing and semantic analysis in a simple way to an everyday task – reading on the web. You don’t need to understand RDF triples or the semantic stack – it just helps you get something done. And – the current version of Gnosis is just the start. Future releases will draw on the expanded capabilities of Calais to tell you what the most relevant items are in what you’re reading and to link those items to the growing linked data ecosystem. Stay tuned.

The Gnosis homepage is located here.

Sitting right in the middle

Bookmarks for August 26th from 19:05 to 20:51

Links for August 26th from 19:05 to 20:51:

Bookmarks for August 24th through August 25th

Links for August 24th through August 25th:

Calais Ecosystem: Calais for Drupal

Time to start talking about great tools that have been built on top of Calais.

Calais is an initiative by Thomson Reuters to provide one of the core building blocks of the Semantic Web: semantic metadata generation. At the core of Calais is a web service that ingests text content, analyzes it using natural language processing, machine learning, lexicons and statistical analysis to extract semantic data, and returns it as structured information – primarily as RDF. Enough about Calais – I’ll write a big long post about it in the near future.

One of our biggest goals with Calais is to develop – or help others develop – tools that translate this from geekdom to real world usability. One of the areas of focus for that is to integrate Calais within a variety of content presentation and management platforms. There’s a wide range of those platforms – but Drupal stands out as being one of the fastest growing ones in the mid-tier publishing space.

Shortly after Calais was released, two members of the Phase2Technology team – Frank Febbraro and Irakli Nadareishvili – just stepped up and made it happen by building the Calais Modules for Drupal.

These modules provide a strong building block for constructing semantically-enabled Calais applications. The modules provide seamless integration between a range of Drupal node types and the Calais service.

From their description…

The Calais module lets you configure which Content Types you want to request Calais metadata on update. The entities returned can then be automatically assigned to vocabularies related to the Content Types, or it can only suggest terms based on the Calais metadata and allow the user to select the terms you want to associate (think of del.icio.us recommending tags). A flexible set of hooks allows 3rd party modules to make modifications before or after Calais terms have been applied. There are many levels of configuration and integration and this is just the beginning.

The Calais Tag Modifier module allows for basic blacklisting of tags, so that you never get terms suggested that you don’t care about. The term substitution mechanism also allows you to modify returned metadata before it gets assigned or suggested.
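The blacklist and substitution ideas are easy to picture in code. The sketch below is Python rather than the module's actual Drupal PHP, and it is only an illustration: the function name, the case-insensitive blacklist, and the choice to apply substitutions before the blacklist are all assumptions, not a description of how the Tag Modifier module is implemented.

```python
def modify_tags(tags, blacklist=(), substitutions=None):
    """Apply term substitutions, drop blacklisted terms, and remove duplicates.

    Order of operations (substitute first, then blacklist) is an assumption.
    """
    substitutions = substitutions or {}
    blocked = {term.lower() for term in blacklist}
    result = []
    for tag in tags:
        tag = substitutions.get(tag, tag)   # rewrite the term if a substitution exists
        if tag.lower() in blocked:          # never suggest blacklisted terms
            continue
        if tag in result:                   # substitutions can create duplicates
            continue
        result.append(tag)
    return result


suggested = modify_tags(
    ["IBM Corp.", "Monday", "IBM", "Drupal"],
    blacklist=["monday"],
    substitutions={"IBM Corp.": "IBM"},
)
print(suggested)
```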

Beyond what Phase2 has developed to date, the Calais Initiative and Phase2 have agreed to work together over the coming six months to release a series of significant enhancements built on the Calais modules. These enhancements will be oriented toward even tighter integration of Calais with Drupal and providing a comprehensive Calais-powered set of capabilities such as topic hubs and other publisher-oriented features.

So – that’s the description: here’s what’s cool. One of the hottest publishing platforms in the world is integrated with Calais. Users can get access to Calais’ capabilities with essentially zero effort. And – all of this was built by two highly motivated guys who saw a need and just moved in and got it done.
