window office

sporadic ramblings of a comp sci grad student studying information retrieval
Me @ CMU

Tue, Apr 15th

Powerset & Natural Language Search

I had the pleasure of hearing Powerset’s Ronald Kaplan speak this afternoon.  Kaplan is the CTO of Powerset and has an impressive vitae as an NLP researcher.  I’ve had access to Powerset’s private beta (alpha?) for a few months and I was looking forward to hearing what he had to say about the company and where he sees this type of search fitting into the web at large.

Powerset provides “natural language search”, currently over Wikipedia documents, but plans to extend this to the web at large sometime “within this year”. This entails deep natural language processing of web page text to enable query matching at a semantic level rather than simple keyword+proximity matching. The goal, as Matt Hurst said, is to provide “information retrieval” rather than “document retrieval”. Dig around Barney Pell’s (Powerset founder) and Tim Converse’s weblogs for a few posts on why proximity features in IR are a poor approximation of semantics, along with a good description of natural language search and why it might be the future of web search.

My first impressions of Powerset were mixed. Their initial demos were interesting, but awkward to use. I had a hard time formulating queries that Powerset would understand beyond the simplest factoid-type questions: “when did X die” or “who wrote Y”. The private beta demos have greatly improved over the past few weeks and no longer suffer from this awkwardness to such a degree. The interface is slick and they’ve integrated a bit of vertical search into their system, a recognition that Wikipedia doesn’t hold the answer for everything, and sometimes hitting a structured database is really all you need.

Powerset is clearly built on a best-in-class natural language processing pipeline (based on PARC’s XLE). When the system works, it’s impressive and fast: the queries “how many times was Elizabeth Taylor married”, “what is prilosec” and “who played Travis in Taxi Driver” all produce reasonable results in a second or less.

But it is still clear that NLP is far from perfect: when mistakes are made, they really pop out. Kaplan was adamant that Powerset is not a question answering system, primarily because of the factual reliability that implies.

When comparing Powerset-style search with the dominant keyword search paradigm on the web now, it is hard to know whether NL search can really take hold. On the surface, it seems that Powerset may be able to tap into a huge unexploited market: with deep semantic understanding of documents and a natural language query, a searcher may be able to satisfy their information needs more easily. But, I think this is only the tip of the iceberg. A few of the (non-technical) issues I see in this direction:

  1. Powerset might be underestimating the power of document retrieval.  Passages, snippets or factoids may in fact contain a specific answer to your question, but most of the time what I’m really looking for is the document or web site.  Even when I need a specific fact, the effort in finding that fact in a list of retrieved documents is often pretty low.  By bypassing a document retrieval step, I would miss an enormous amount of background information, context and serendipitous discovery.
  2. Powerset might be underestimating the power of keyword search.  One case for natural language search is that it’s easier for users to formulate natural language questions or expressions than to construct a keyword query.  Paraphrasing from Kaplan: our thoughts are formed of natural language statements and it requires some amount of cognitive load to convert that to a succinct set of keywords.  Is this really true?  I found it at least as difficult to formulate a “natural language” query on Powerset’s demo as an analogous query on a keyword-based search engine.  Query reformulation with a natural language search engine is also much more tedious than adding/removing keywords or adding quotes around your query. I found that often the sentence must be rewritten, from passive to active voice or made into a question, which is a much harder task than adding a couple of keywords.
  3. Powerset might be underestimating the effect of “user training” or the effort of “user retraining”.  We’ve all been trained by Google to provide 2-3 keywords and then add or remove terms in subsequent queries to tailor the results to our liking.  How can this be undone?  Some brief directions are provided by Powerset: “express your query as a question or well-formed phrase”, “use connector words” and “don’t over-specify your query”.  But, this is a huge departure from what we’re all used to.  Can web searchers be taught these new tricks?

I in no way intend to predict the failure of Powerset and I am fully in support of shaking up the status quo.  I’m eager to see how the service will evolve and how their upcoming public release will be received.  But, for me anyway, it’s hard to see how this type of service will play an ongoing role in my day-to-day web searching.