Goodbye OpenCalais, Thanks and Stay in Touch


OpenCalais community members:

I’ll be leaving Thomson Reuters and hence ending my involvement with OpenCalais in the next few weeks. I wanted to take a moment to thank everyone – from users to journalists to convention organizers – for making OpenCalais the success it’s been.

The initial years of OpenCalais were among the most amazing of my career. The level of passion, the number of smart people I met and the number of interesting ideas I heard were remarkable. While OpenCalais was never a full-time job for me – it was certainly where the majority of my passion lay.

Please stay in touch via LinkedIn. I have no idea where I'll be heading next – but I'll keep LinkedIn up to date. Recently I've been running all Product and Engineering for Reuters Media, and I'm looking for a similar level of challenge.

OpenCalais is being left in good hands. We’ve transferred ownership of the initiative to Philip Kardos –  one of my most senior product managers – and I am absolutely confident in the continued stability and growth of the system. And – our fantastic and responsive community manager Fran Sansalone’s role remains unchanged.

Thanks again. It’s been a wild ride and I’ve loved every minute of it.


Posted in Uncategorized | Leave a comment

Spring Cleaning and Some Touch-ups (OpenCalais)

A New OpenCalais Release On the Way

In just about one month we’re going to open up the next release of OpenCalais for beta testing. While the upgrade should be 100% backwards compatible – it’s worth setting aside a little time for testing as well as exploring some new features.

What’s Coming?

Under the covers there are a number of improvements to our processing pipeline. As an end user you won’t see these – but they set the stage for greater flexibility in the future.

On the user-facing side of the equation you'll see a number of new entities, facts and events related primarily to politics and intra- and international conflict. It doesn't look like either of those will be going away soon – so we thought they were worth implementing. You'll see new information in Candidates, Party Affiliations, Arms Purchases and a number of others.

In addition to these new items, we’ve also enhanced our SocialTags feature for greater accuracy – in fact, you’ll see a number of accuracy improvements across the board.

Next Steps

So – set aside a little time for a quick test in about a month. If you care about elections and conflict – take a look at the new features. We’ll run the beta for approximately a month to gather any issues and will roll out into production following that.


“News Ninjas” and Cryptic Twitter Posts

A few days ago I tweeted that I was building a team of News Ninjas and was looking for candidates with a good mix of news (newspaper or broadcast) and technology capabilities. That was sufficiently cryptic to generate a few questions – now I’ll try to give a few answers.

We want to hire several great people! Now!

I'm in the market to hire a small team of news-passionate people to help us with our mission to transform the Reuters News Agency into the next-generation partner for our clients. (Yes, in response to *many* inquiries, this comes with actual real dollars and desks and benefits and all that stuff. And in response to other inquiries – yes, OpenCalais is involved, but we're going way beyond just that.)

For well over 150 years Reuters has been a leading provider of news to the world. Our customers cover the globe and include broadcasters and newspapers from gigantic to mid-sized. We provide them with a full range of media including text, images, video and still photography. That’s our core and our heritage – and we’ll be working hard over the next few years to expand and improve on it.

But – there are more opportunities to deliver value to our clients. The news industry is changing dramatically – and we believe we have the opportunity to offer an increased range of services and products to help make our clients successful. In some cases these new capabilities will benefit just our clients – in others we hope to leverage our investments to benefit the news industry as a whole. We know some of what we need to get built and we’re engaging with our clients, prospects, advisers and the community as a whole to understand where there might be additional opportunities.

We're Getting Ready to Do New & Cool Stuff

Obviously I can't talk about specifics in a public forum (at least not yet) – but we're looking at solutions that range from archive monetization to more flexible content syndication to better newsroom workflow capabilities to tools that enable investigative journalism – basically anything that helps improve our customers' business.

What do we need to get all of that done? The answer is pretty simple: we need great people who understand the business and the technology that we can bring to bear. That's what I'm looking for. In these positions you'll be a member of a team inventing and responding to potential business propositions and capabilities, evaluating them in the marketplace, and building, delivering and evangelizing them – basically from concept to execution. Our list of requirements is pretty straightforward:

- You need experience in the news industry – broadcast or newspapers – and online/digital is a big plus.
- You need to have some technical background. It's important our team members bridge the business / technology gulf themselves.
- You have a diversity of experience. You've played different roles in different projects.
- You can work as a member of a team. Really. I mean it.
- You're comfortable with and have a track record in public-facing evangelism of your ideas.
- You're (probably) already located in NYC.
- You know how to get things done – by managing, leading, pitching in and working yourself.

That’s it.

If you'd like to be considered (and I promise to carefully read every CV submitted), please go here and drop in an application. This is one place where we need to follow the process – I'm happy to answer questions via email – but applications need to come through the machine. If you don't think you're a candidate yourself but would like to pass along a name, please drop me a note. Feel free to drop questions that way as well.

Big changes are coming. Come make them happen with us.



Why OpenCalais?

(Re-purposed from an earlier post of mine.)

Over the last few months you’ve probably seen a number of announcements about how OpenCalais has been chosen by one organization or another to support its business.

In a number of recent meetings I've been asked the (very fair) question: why OpenCalais and not one of the other entity extraction services out there?

Given that the question seems to come up more often as the number of extraction services increases, I thought I'd write up my best understanding of why many of the major players we've announced (and an equal number we haven't) have chosen to go with OpenCalais. And, at the end, I'll mention a few reasons why others haven't chosen OpenCalais.

So, in no particular order, why do organizations choose Calais?

Thomson Reuters

OpenCalais is provided by Thomson Reuters – the largest professional information organization in the world.

If you’re interested in kicking around some semantic technologies in your spare time this doesn’t really matter. If you’re incorporating those technologies deep within your business – or, as is true with many users – actually building a new business on top of them, this becomes pretty important. Basically – you need to know that the service is going to be there for you.

Facts & Events

With the increase in structured content assets like Wikipedia / DBpedia, it’s become pretty easy to knock out a basic entity extraction tool. And – while we like entity extraction as much as anyone else – it’s really just the tiniest starting point in what you can and will need to do.

OpenCalais extracts a wide range of facts and events from unstructured content and lets you know what’s happening in your content – not just tags for things.

  • Facts are things like “John Doe is CEO of XYZ Corporation.”
  • Events are things like “XYZ Corporation today announced that it would acquire ACME Corporation.”

OpenCalais is the only service that does this in a production-strength manner.


Uptime

OpenCalais stays up. It's hosted in mirrored data centers thousands of miles apart from each other. It's monitored 24×7. It basically doesn't go down – even during system upgrades and maintenance. We stopped adding 9s after we got beyond 99.99% uptime.


Accuracy

We've been building the tools underneath OpenCalais for over a decade. They've been used by hundreds of organizations and many, many thousands of end users. One of the things we've learned is that accuracy matters. While no NLP system is perfect, we're convinced ours is the best, and we have some ideas in the pipeline to increase accuracy even more.


Integrations

We basically focus on providing great semantic plumbing. But we know that not everyone wants to be a plumber. We've worked to integrate (or motivate others to integrate) OpenCalais with a wide range of tools including Drupal, WordPress, WordPress Multiuser, Oracle, Lucene, ColdFusion, Flash, Firefox, Prolog, Lisp, Django, Java, PHP, Python, Alfresco, Perl, .NET, Ruby, TopBraid and a few others.

From content management systems to language-specific libraries – there are lots of ways to get started quickly.

Linked Data

We’re serious about Linked Data. We’re also worried about the proliferation of incorrect links and the effects of link rot. So, rather than just pointing to Linked Data assets out on the cloud and risking that they’ll go stale, we host our own Linked Data cloud, which is kept up to date with both Thomson Reuters contributed content as well as regularly validated links to other sources such as DBpedia, Freebase and others.


SocialTags

Pure semantic extraction is great – but sometimes you need more. If you're writing about Porsches and Ferraris you'd probably like to have categorization concepts like "sports cars" and "automobiles" returned to you with your semantic metadata. OpenCalais does this via our ever-improving SocialTags concept tagging capability. It's good now, and it's going to get a lot better soon.


Focus

OpenCalais is here to provide great semantic plumbing. We're not trying to sell ads. We're not trying to provide the prettiest decorations for blogs. We build the plumbing – you architect the solutions.

Now, in a spirit of transparency, here’s why some people don’t choose OpenCalais:


Language Support

We're great in English and okay in French and Spanish (we extract entities, but neither facts nor events, in those two languages). We intend to implement more languages in the future – but for the time being we're concentrating our efforts on improved functionality and accuracy in English.


RDF

OpenCalais isn't a simple tagging tool. What it returns to the calling application is a reasonably complex RDF construct. It takes a little time to get up to speed on RDF and how to use it in your applications. We think it's worth it, because it's the most flexible and powerful format we know of.
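To make "getting up to speed" a little more concrete, here's a minimal sketch of pulling entities out of an RDF/XML document with Python's standard library. The namespace and property names below are invented stand-ins, not the actual OpenCalais schema – the real response is richer, and dedicated RDF libraries make this easier – but it shows the basic shape of the work.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A minimal, made-up RDF/XML fragment in the general shape of an
# entity-extraction response. The "calais-like" namespace is illustrative.
sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:c="http://example.org/calais-like/">
  <rdf:Description rdf:about="http://example.org/entity/abc123">
    <c:name>IBM</c:name>
    <c:entityType>Company</c:entityType>
  </rdf:Description>
</rdf:RDF>"""

def extract_entities(rdf_xml):
    """Pull (uri, name, type) tuples out of rdf:Description nodes."""
    root = ET.fromstring(rdf_xml)
    entities = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        uri = desc.get(f"{{{RDF_NS}}}about")
        name = desc.findtext("{http://example.org/calais-like/}name")
        etype = desc.findtext("{http://example.org/calais-like/}entityType")
        entities.append((uri, name, etype))
    return entities

print(extract_entities(sample))
# → [('http://example.org/entity/abc123', 'IBM', 'Company')]
```

Once the namespaces stop looking scary, the rest is ordinary tree-walking.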

Performance in Knowledge Domain ‘x’

Where ‘x’ is fashion or square dancing or rugby. OpenCalais is optimized for performance in the general world of business – that’s where we excel.

We have extended OpenCalais to take steps in other areas (such as sports, media, etc.) – but if you need deep semantic extraction capabilities related to protein binding – there are better places to look.


Life in the Linked Data Cloud: Calais Release 4

(Re-purposed from an earlier blog post.)

The Gist: Release 4 of Calais will be a big deal. In that release we’ll go beyond the ability to extract semantic data from your content. We will link that extracted semantic data to datasets from dozens of other information sources, from Wikipedia to Freebase to the CIA World Fact Book. In short – instead of being limited to the contents of the document you’re processing, you’ll be able to develop solutions that leverage a large and rapidly growing information asset: the Linked Data Cloud.

The goal of this post is just to give our community a heads-up to start thinking and planning.

During the course of 2008 we've had three significant releases of Calais, with additional point releases nearly every month along the way. We've added new knowledge domains, improved performance, delivered integration with a range of tools and developed new user-facing applications. It's been a year of amazing growth in our developer community and in the capabilities of the Calais service.

While every previous release has accomplished something significant, Release 4 is going to introduce something that we think is game changing – and that’s life in the Linked Data cloud. It’s important enough that we want to give all the members of our community time to think about it, prepare for it and get your brains in gear on how you might use it.

Every release of Calais up to this point has focused on meeting the need to extract semantic information from text. Release 4 builds on this by creating the ability to harvest the Linked Data cloud using that semantic data.

For this all to make sense we need to introduce a few things. If you already know about de-referenceable URIs and the Linked Data cloud – skim ahead. If not – please take a moment to ingest the background you need.

When you send text to Calais it returns several things: entities, facts, events and categories. For purposes of today’s discussion we’re going to focus in on entities. Entities are just what they sound like – they are things. Some specific examples are people, companies, organizations, geographies, sports teams and music albums.

When Calais extracts an entity from your text it returns (at least) a few things. It tells you the name of the entity and it tells you what type of entity it is. Unlike other extraction services we don’t just return a list of things – Calais tells you it found a thing of type=Company and a value=IBM or type=Person and value=Jane Doe. But – there’s something else Calais returns that hasn’t meant very much up until now: it returns a Uniform Resource Identifier (URI) for that entity. There’s nothing magic about URIs – they are simply a unique identifier for every entity that Calais discovers. Here’s an example (it’s not pretty) of what the URI for the Company IBM looks like:

Well, that URI doesn't look very useful, does it? If you were to pull it up (once Release 4 is out) all you'd see is RDF with links to places called DBpedia and Freebase and Reuters. But keep those links in mind: they're the key to a whole new world.
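As a rough sketch of the idea (the URI, field names and link targets below are all invented for illustration – the real service returns RDF, not Python dictionaries), an extracted entity carries a URI, and dereferencing that URI yields links out into the Linked Data cloud:

```python
# An extracted entity: type, name, and an opaque unique identifier (URI).
# All values here are made up for illustration.
entity = {
    "type": "Company",
    "name": "IBM",
    "uri": "http://example.org/calais/comp/123abc",
}

# Dereferencing the URI would yield pointers into other Linked Data sets.
linked_data = {
    "http://example.org/calais/comp/123abc": {
        "sameAs": [
            "http://dbpedia.org/resource/IBM",
            "http://example.org/freebase/ibm",
        ],
    },
}

def outbound_links(entity, cloud):
    """Follow an entity's URI to its links into other Linked Data sets."""
    return cloud.get(entity["uri"], {}).get("sameAs", [])

print(outbound_links(entity, linked_data))
```

The dictionaries are toys, but the shape is the point: the URI is the hop-off point from your document into everyone else's data.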

Linked Data is the name of a movement underway (not too surprisingly, initiated by Sir Tim Berners-Lee) that sets a standard and expected behavior for publishing and connecting data on the web. This isn’t about publishing web pages – this is about turning those web pages into data that’s accessible to programs to work with. We’ll give you a quick example to make it real: Wikipedia is one of the single largest sets of information across a broad range of topics in the world. It’s really great if I’m a person who’s casually looking for information on a particular topic – but it’s not so great if I’m a computer program that wants to use that data. Why? Because it’s formatted and organized for people – not computers – to read.

But Wikipedia has a twin – in fact a Linked Data twin – called DBpedia. DBpedia has the same structured information as Wikipedia – but translated into a machine-readable format called RDF and accessible via the Linked Data standards. And Wikipedia is not alone. A growing cloud of information sets, from DBpedia to the CIA World Fact Book to U.S. Census data to MusicBrainz – and many others – is becoming available. What's important is that this cloud is 1) growing, and 2) interoperable. There are "pointers" from entries in DBpedia to entries in MusicBrainz and back to entries in GeoNames – it's another big Web – but this time it's a Web of Data.

So – lots of words and arcane concepts. Let’s try to bring it all together into something that makes sense. We’ll put one sentence out there – and then we’ll give a few examples.

Beginning with Calais Release 4 you and the programs you develop will be able to go from many of the entities Calais extracts directly to the Linked Data Cloud.

A simple example:

I want to process today’s business news. For each article I want to extract all of the companies mentioned – but only if the article also mentions a merger or acquisition. I am only interested in companies whose headquarters (or those of their subsidiaries) are located in New York State. Do all of that and give me a widget for my news site titled “Merger Activity for NY Consulting Companies”. And oh, by the way, this isn’t a research project – I want you to do it real time for the 10,000 pieces of news I process every day.

How would you do that? Option 1 is to hire a bunch of researchers, give them a fast internet connection and teach them to type very very fast.  Option 2 is to write some code that looks like this:

For each Article:
    Submit to Calais, get response
    If MergerAcquisition exists:
        For each Company:
            Retrieve Calais Company URI, extract DBpedia link
            Send Linked Data query to DBpedia, get response
            If CompanyIndustry contains "Consulting":
                If CompanyHeadquarters = "New York":
                    Put it on the list
                For each Subsidiary:
                    Send Linked Data query to DBpedia, get result
                    If CompanyHeadquarters = "New York":
                        Put it on the list
Print the list
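For the curious, here's roughly what that pseudocode could look like in Python. The Calais responses and DBpedia records are canned stand-ins with invented field names – in real life both would come back as RDF from live services:

```python
def ny_consulting_merger_companies(articles, dbpedia):
    """Names of NY-headquartered consulting companies (or parents of NY
    subsidiaries) mentioned in articles that discuss an M&A event."""
    hits = []
    for article in articles:
        response = article["calais_response"]       # stand-in for a live Calais call
        if not response.get("merger_acquisition"):  # skip articles with no M&A event
            continue
        for uri in response["companies"]:
            company = dbpedia[uri]                  # stand-in for a Linked Data lookup
            if "Consulting" in company["industry"]:
                if company["headquarters"] == "New York":
                    hits.append(company["name"])
                for sub_uri in company.get("subsidiaries", []):
                    if dbpedia[sub_uri]["headquarters"] == "New York":
                        hits.append(company["name"])
    return hits

# Canned "Linked Data" records, keyed by (made-up) entity URI.
dbpedia = {
    "uri:acme": {"name": "ACME Consulting", "industry": "Consulting",
                 "headquarters": "New York", "subsidiaries": []},
    "uri:globex": {"name": "Globex", "industry": "Consulting",
                   "headquarters": "Chicago", "subsidiaries": ["uri:globex-ny"]},
    "uri:globex-ny": {"name": "Globex NY", "industry": "Consulting",
                      "headquarters": "New York"},
    "uri:initech": {"name": "Initech", "industry": "Software",
                    "headquarters": "New York", "subsidiaries": []},
}

articles = [
    {"calais_response": {"merger_acquisition": True,
                         "companies": ["uri:acme", "uri:globex", "uri:initech"]}},
    {"calais_response": {"merger_acquisition": False,
                         "companies": ["uri:acme"]}},
]

print(ny_consulting_merger_companies(articles, dbpedia))
# → ['ACME Consulting', 'Globex']
```

Swap the canned dictionaries for real Calais and DBpedia calls and you have the widget's back end.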

That really is a pretty straightforward example. How about companies in the news with at least one subsidiary doing business in an area that the CIA World Fact Book considers dangerous? Or books released by authors who attended Harvard and live in Ohio? Or … we think you get the idea.

So. The summary. The combination of semantic data extraction (generic extraction, tags, keywords won’t do the trick) + de-referenceable URIs (entity identifiers you and your programs can retrieve) + the Linked Data Cloud = amazing stuff.

We’d like you to start thinking about it.


Developers! Developers! Developers!

One of the really fun parts of working on the Calais Initiative is our community of developers. They toil in quiet and then – surprise! – they release something really cool and interesting. So – I wanted to take just a moment to highlight two new Calais R3.1 applications that popped up this weekend.

iPlayerist by Geography

iPlayerlist is an interesting application that takes shows available via the BBC iPlayer and lets you find them by topic, time and other attributes. Andy has just rolled out an enhancement that uses the new Calais geo-location capabilities to find shows based on the locations mentioned in their descriptions. I think it's a great example of a simple, clean way to improve the user experience using semantic metadata extraction. Unfortunately, viewing many of the resulting videos won't work unless you're in the U.K. This isn't iPlayerlist's fault – it's a limitation the BBC has put in place.

Calais Geo Location Tutorial and Demo App

Guilhem Vellut has put together a nice demonstration app that shows the Calais geo-location features in action. While I really like the application (you can see it here), it's the blog post he wrote giving the details of exactly how he built it – including code samples – that's really great. By investing the time to document what he did and how he got everything working together, he's provided a great jumpstart for anyone else wanting to experiment with Calais geo-location. Thanks!


What is Web 3.0?

After participating in yet another "What is Web 3.0" panel I decided to strip my answer down to Twitterable size. Here it is:

Web 2.0 created a problem – overwhelming content overload. Web 3.0's job is to solve that problem. That's it.

Maybe later on I’ll write a few thousand more words around the details. But that’s what they are: details. Figure out how to decrease content overload in publishing, in user generated content, in social networks and in search. Stop worrying about the killer app. Just make things better.


Greg Boutin @ Semantics Incorporated

Greg Boutin wrote a fairly in-depth piece on SemanticProxy. In this article Greg reviews SemanticProxy’s performance and asks a number of questions about whether it’s truly “Semantic”. So – second in a series of cheating by republishing responses I’ve written… here we go.

Greg's original article is located here.


I thought I had responded to this post – but it appears it was one of those many responses I’ve composed in my head while driving or whatever and never actually gotten down in writing.

First, a couple of things that may need clarification.

SemanticProxy is Calais. What SemanticProxy does is to take the burden of fetching a web page, cleaning HTML, calling Calais and all that off the developer. It does all of that for you and returns the results as RDF – or as HTML for demonstration purposes. So – any functionality in Calais is automatically reflected in SemanticProxy. The main technical challenge with SemanticProxy other than engineering for scalability is simply HTML cleaning. One thing we’re thinking of is the creation of a simple tag publishers can embed to indicate the start/stop of the “core” content on a page.

The second area is around the engine underlying Calais. In your post you mention that you assume it's a statistical engine – it isn't. The Calais engine is built on core Natural Language Processing (NLP) technology augmented by lexicons and statistical methods. It works by parsing the parts of speech into core elements and then applying a three-tiered set of pattern-recognition and rule-based approaches, wrapping up with a voting and scoring system that selects from the candidate entities, facts and events. The rules and pattern-recognition techniques are tuned to identify specific types of entities (people, places, organizations, etc.), facts (Person:JobPosition, Person:PoliticalAffiliation, etc.) and events (NaturalDisaster, SportingGame, EarningsAnnouncement, etc.). The specific elements that Calais understands are documented on our site and expand by 5-15 each month.

Calais also supports “Semi-Exhaustive Extraction” (SEE) for those that want to dive into the deep end of the semantics pool. In SEE we extract all relationships between Thing1 and Thing2 if we can type at least one of the things.

Entity recognition will always be an "IS A" type predicate. "John Doe" "IS A" "Entity Type Person" – so all of our entity recognition will automatically fall into this category.

Facts and events are a little more complicated. For example let’s take something simple like Calais extracting that a person has a particular job title at a particular company. I’m not going to even attempt to write out the RDF – but the basics of that type of relationship would look like:

“John Doe” “IS A” “Person”
“John Doe” “Has the Title” “Chief Wrangler” “AFFILIATED WITH” “ACME”
“ACME” “IS A” “Company”

That’s not even close to RDF – but you get the idea.
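Still not RDF, but if you want to play with the idea, the relationships above can be modeled as plain subject/predicate/object triples (the predicate names here are made up for illustration, not the Calais vocabulary):

```python
# The job-title example as a toy triple store: each fact is a
# (subject, predicate, object) tuple, loosely mimicking RDF statements.
triples = [
    ("John Doe", "is_a", "Person"),
    ("John Doe", "has_title", "Chief Wrangler"),
    ("John Doe", "affiliated_with", "ACME"),
    ("ACME", "is_a", "Company"),
]

def objects(triples, subject, predicate):
    """All objects asserted for a given subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects(triples, "John Doe", "has_title"))  # → ['Chief Wrangler']
print(objects(triples, "ACME", "is_a"))           # → ['Company']
```

Real RDF adds URIs, namespaces and a query language on top – but the statement shape is the same.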

So – are we using "smart" predicates? I think so. Everything we identify (other than simple entity recognition, which is the easy part) is represented in RDF as a series of relationships and attributes. Every fact we identify is, in essence, its own smart predicate. Every event is built up of facts and entities.

What we don't do is deliver any level of analysis beyond what's presented to us. We don't dip into the global linked data brain or DBpedia or other assets to find and deliver more information about what we've extracted. If we tell you someone is a "Person" – we don't tell you that people are mammals. As far as I'm concerned, that's where linked data and large-scale "describe the world" ontologies come in.

So – in summary. Entity recognition (the relatively easy part of what we do) is always about “IS A” type relationships. The harder (and cooler in the long run) stuff is much more sophisticated.

Also – one (well, two) exceptions to the "we don't augment with external data" statement above. In our current technology preview release we've rolled out disambiguation around companies and geographies. What this means is that if an article says IBM, IBM Research, IBM Limited or IBM Labs – we'll tell you it's really "IBM" and give you the appropriate identifying information (ticker, web site, etc.). We do this using a BIG table – but we also go beyond that and look for contextual clues like industries and geographies that will help us narrow things down.

Geographies are similar – "Longhorns" are more likely to be associated with Paris, TX than with Paris, France.
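A toy sketch of that kind of contextual narrowing (the candidate cities and their cue words are invented for illustration – the real system uses far richer signals than word overlap):

```python
# Score each candidate reading of "Paris" by how many of its context
# cues appear in the surrounding text; pick the best-supported one.
CANDIDATES = {
    "Paris, TX": {"longhorns", "texas", "ranch"},
    "Paris, France": {"eiffel", "louvre", "seine"},
}

def disambiguate(text, candidates=CANDIDATES):
    """Return the candidate whose cue words best match the text."""
    words = set(text.lower().split())
    return max(candidates, key=lambda name: len(candidates[name] & words))

print(disambiguate("The Longhorns played a home game near Paris"))
# → Paris, TX
```

Industry and geography clues in a real document play the role of the cue sets here, just with much bigger tables and smarter scoring.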

Long response – but I felt a few of these things were worth clarifying. We’re really enjoying the widespread adoption of Calais (almost 1.5M transactions per day and climbing) – but at this point most of the use cases are barely scratching the surface of what Calais provides. Once people have gotten over the current focus on entity recognition (tag clouds anyone?) we hope they’ll step back and explore some of the more powerful semantic capabilities Calais has to offer.




Mark Gould @ Brand 3.0

Mark Gould wrote a nice overview of Calais. Because this was an introduction to Calais for a new audience oriented toward brand and marketing, I thought it was worthwhile to respond with a basic overview of what Calais is about and why we're doing it. Given that the response ended up being fairly lengthy, I thought I'd share it here as well. Some general thoughts on the Semantic Web vs. the semantic stack, barriers to adoption, getting to critical mass and reality vs. philosophy.

First, thanks for taking note of Calais. We’re still deep in the learning curve and the more that different people with different needs think about it, try it out and give us feedback the better.

If you're just starting to look into this area – a word of warning. It's very important to distinguish between the vision of the Semantic Web and the stack – the defined set of standards – that will enable the Semantic Web. In my view the Semantic Web is an aspiration composed of 1) use of the semantic stack and 2) a critical mass of adoption across the web. While we're seeing many instances of adoption of the technologies, we have a long way to go before we reach critical mass.

So – how do we move toward critical mass? What Calais is trying to do is address what we see as the central rate-limiting factor for adoption: the generation of high quality semantic metadata for unstructured content such as news, reports, novels – whatever. While the standards are well defined for how to represent this metadata we’re still left with one simple issue: it takes time and it costs money. Given that the “semantic consumer” end of the story is still relatively undeveloped, few writers and publishers can afford to invest that time and money.

Calais doesn’t solve this problem – but it does throw some fuel on the fire. By automating the generation of semantic metadata with a very high degree of accuracy we hope to jumpstart the adoption curve. If there’s lots of semantic content out there people will build great semantically enabled applications. If there are great applications people will invest in semantically enabled content.

The best way to take it for an initial spin is with the Calais viewer application. Copy a news article or such, paste it in and see how we do. In general you'll see better results with the viewer than with SemanticProxy, because the proxy has additional work to do, such as cleaning HTML pages. That work can create noise that reduces accuracy.

One last point. You don't have to believe in or even agree with all of the philosophy around the Semantic Web to take advantage of it. There is a well-defined set of standards, from RDF to SPARQL, and capabilities such as Calais that can add value to what you're doing today. Grab a piece of that stack and make something cool happen.



Semantic Search Means ….?

We're in the year of the Semantic Web. Or maybe it's the year when the semantic stack starts to add value to real users' experiences. Or maybe it's the year before the year when ….

We’ve all been to the conferences, we’ve all had the meetings, whether we’re builders or consumers – it’s clear that something is in the air around this topic.

We’re also impatient. The Semantic Web (stack, apps, whatever) has been right around the corner for a little while now. That impatience is causing us to spend an inordinate amount of time casting around for the application that’s going to prove the naysayers wrong, change the game, change the world.

And because we're humans, tool users and pattern matchers – we end up landing at an answer that feels safe, that we know works, that people understand, that's generated a bunch of billions of dollars: Search. And then we tie a bow on it so it feels new and … we have Semantic Search.

Let’s put aside the whole issue of whether semantic search is the killer app for the moment.  I personally think it may be one of the functions that see dramatic improvement through semantic technologies – but it doesn’t feel, today, like the application that’s going to knock our socks off.

I’d also like to take off the table the applicability of semantic search to tightly constrained, well defined, rigidly controlled knowledge domains. We all know it can do some great stuff when applied to questions about gene expression in the nasal epithelial cells of the South African Tree Frog under ultraviolet stimulation – but I think it might be a little more interesting to concentrate on searches that the other 99% of the bell curve care about.

Part of the problem may be that we're using the term Semantic Search. I have no idea what it means. When I'm talking with someone about it we have no shared understanding. I absolutely cannot explain it to non-semantageeks. So, let's deconstruct semantic search into its constituent components and talk a bit about how and whether semantic technologies might actually make it better. The results of the dissection are here on the table….

  1. What kinds of questions can we ask? Can we embed logic in our questions? Do we expect inference in our results?
  2. How can we ask them – keywords, natural language and all that jazz.
  3. Generating the "right" result set for the query.
  4. Displaying the result set in the most effective manner.
  5. Making money from doing all that.

So – my challenge to myself is to write a brief (well, maybe not too brief) post about each of these subtopics and talk about how semantics can – or cannot – make it better. Until we get down to this level of granularity “semantic search” is just a catchphrase without, well … semantics.
