window office

sporadic ramblings of a comp sci grad student studying information retrieval Me @ CMU

Wed, Mar 10th

Relevance assessment with MTurk & statAP

As many of you know, I’ve been building a test collection with Amazon Mechanical Turk. This involves these deceptively simple-sounding steps:

  1. Simulate a diverse set of systems and run queries
  2. Sample retrieved documents
  3. Collect relevance assessments on the sampled documents
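
As a rough illustration of steps 1 and 2, here is a minimal sketch that pools the top-ranked documents from several runs for one query and draws a sample to judge. The function names, the run format, and the toy run names are all made up for illustration. The uniform sample is only a stand-in: statAP uses its own non-uniform sampling scheme, which is not reproduced here.

```python
import random


def build_pool(runs, depth=100):
    """Union of the top-`depth` documents across all runs for one query.

    `runs` maps a (hypothetical) system name to its ranked list of doc ids.
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


def sample_for_judging(pool, k, seed=0):
    """Uniform sample without replacement.

    Placeholder only: this is NOT the statAP sampling method.
    """
    rng = random.Random(seed)
    docs = sorted(pool)  # fixed order so the seeded sample is reproducible
    return rng.sample(docs, min(k, len(docs)))


# Toy runs for a single query (made-up system names and doc ids).
runs = {
    "baseline": ["d1", "d2", "d3"],
    "exact_match": ["d2", "d4", "d1"],
    "prf": ["d5", "d1", "d6"],
}

pool = build_pool(runs, depth=2)
to_judge = sample_for_judging(pool, k=3)
```

With real runs, `depth` would be 100 and `k` would be the number of documents that fit in one HIT.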

Although it sounds straightforward, many questions have come up over the course of building this collection: How many systems? How do we know they’re diverse enough? How many queries? How many documents to judge per query? Which sampling method to use (MTC, statAP or traditional pooling)? How to control for assessor quality in the task design? How to identify poor-quality assessment after collection? How will a test collection built like this generalize to unseen systems?

We’ve based a lot of our decisions on how to build this test collection on our own intuition and conversations with various people who’ve been involved in test collection creation at TREC (thanks Ian and Virgil!).  Here’s a summary of some of the semi-informed decisions we’ve made.

  • We simulated systems with Indri and Terrier, with several different retrieval models in each. These models range from strong baseline retrieval models, to exact-match, to pseudo-relevance feedback, to oddball document expansion methods. We simulated 19 different systems in total, retrieving to depth 100, resulting in about 500 unique documents per query to sample. My hunch is that the number of systems is a bit high, and it would have been nice to use at least one other underlying retrieval engine (e.g., Lucene).
  • We use NEU’s statAP sampling method (v4). We chose this over MTC primarily because a batch evaluation fit into the AMT model better than an active document selection. An active document selection would have required us to host our own HITs. That’s not necessarily a big deal, but it is one more level of complexity to deal with.
  • We had 60 documents per query annotated at $0.65 per query. This was about the most documents we could get loaded into a single HIT and have assessed. It’s also a good level for distinguishing systems, according to those involved in recent TREC evaluations. This payment level is higher than average for a single HIT, but our HITs are a bit more time consuming. The payment comes out to about $6.40 / hour, which works out to roughly six minutes per query.
  • We collected graded relevance levels (0=bad … 3=perfect) with the intention of using something like NDCG for evaluation.  I’ll probably be collapsing into binary relevance (0-1 vs. 2-3) for my evaluations because the statAP tools don’t compute variance of evaluation measures other than MAP.
  • We had one assessor evaluate all the documents for one query, and didn’t collect multiple assessments per query. There’s generally a high level of disagreement across assessors (about 30% on relevant documents at TREC, and likely more with MTurk assessors). I don’t know how to resolve disagreements across relevance assessors, especially when there is ambiguity in the query.
  • We didn’t do anything special in the HIT design to control for assessment quality.  We probably should have done something like planting obviously non-relevant documents.  I’ve got some ideas on how to evaluate assessor quality without gold standard judgements, but that will have to be another post.
  • We didn’t attempt to limit the number of queries one assessor could annotate, but we should have.  We ended up having one assessor annotating about 25% of the queries.  Unfortunately, AMT doesn’t provide any way to automatically control this.  You either need to host your own HITs and keep track of it yourself or manually block Turkers if they’ve done too many queries.
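
Two of the points above are easy to sketch in code: collapsing the graded judgments into binary relevance (0-1 vs. 2-3), and checking after the fact whether any one assessor did too large a share of the queries. This is only a minimal sketch. The function names, the data layout, and the `cap` value are assumptions for illustration, not part of the statAP tools or the AMT API.

```python
from collections import Counter


def collapse_to_binary(grade, threshold=2):
    """Map a 0-3 graded judgment to binary: 0-1 -> 0, 2-3 -> 1."""
    return 1 if grade >= threshold else 0


def assessor_shares(query_to_assessor):
    """Fraction of all queries judged by each assessor."""
    counts = Counter(query_to_assessor.values())
    total = len(query_to_assessor)
    return {assessor: n / total for assessor, n in counts.items()}


def over_cap(query_to_assessor, cap):
    """Assessors whose share of queries exceeds `cap` (a hypothetical limit)."""
    shares = assessor_shares(query_to_assessor)
    return sorted(a for a, share in shares.items() if share > cap)


# Toy data: graded judgments keyed by (query, doc), and who judged each query.
graded = {("q1", "d1"): 3, ("q1", "d2"): 1, ("q2", "d7"): 2, ("q2", "d9"): 0}
binary = {key: collapse_to_binary(g) for key, g in graded.items()}

query_to_assessor = {"q1": "A", "q2": "A", "q3": "B", "q4": "C"}
shares = assessor_shares(query_to_assessor)
heavy = over_cap(query_to_assessor, cap=0.3)
```

Since AMT doesn’t enforce a limit like this, a check along these lines would have to run outside AMT, either periodically while the HITs are live or once at the end.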

You can see one of the HITs here:

I’m still evaluating the quality of the collection, both with regard to assessor quality and to the generalizability to retrieval systems that weren’t used in the original pool.  I’m planning on writing posts on both of those as I do more analysis.  My guess is that I’ll probably throw out about 10% of the queries because of poor assessor quality.  I also see a clear correlation between collecting assessments on more queries and reducing variance of the statMAP estimate on systems not used in the pool.

A couple notes on existing work along these lines:

A recent paper at WSDM looks at the reliability of evaluation metrics on systems not used for the original pool of documents to be judged.

I should also note that I’m definitely not the first person doing this with AMT (see, for example, Omar Alonso’s paper in SIGIR Forum). That paper and others of his give some good guidelines on using MTurk for relevance assessment. But his work specifically seeks out multiple assessments per document and multiple assessors per query. This is something that is typically avoided in document relevance assessment, and that I’ve tried to avoid here.
