window office

sporadic ramblings of a comp sci grad student studying information retrieval Me @ CMU


Fri, Apr 23rd

Yahoo LETOR Challenge upload format confusion

A few of us here at CMU have been playing with the Yahoo LETOR challenge data and have uploaded a couple of submissions.  Our performance on the hold-out set has been fairly dismal, barely outperforming a random ranking of documents.  Just today we realized that our script to format the submission was to blame.

The upload format is a text file with one line per query.  Each line contains a space-delimited list of numbers giving the predicted ranks of that query’s documents, listed in the order the documents appear in the input file.  Their instructions give the example:

 

suppose that a query has 4 documents and that your predicted relevance scores for these documents are:

  • 3.8
  • 1.7
  • 4.9
  • 2.5

Then the corresponding line in your submission should be:

  • 2 4 1 3

As it turns out, the list of numbers in this example is not only the predicted ranks of the documents (rank 1 being the highest score), but also the documents’ line numbers in ascending score order.  These two things are not always the same.  For example, swapping the scores of the first and second documents, giving [1.7, 3.8, 4.9, 2.5], yields the predicted ranks [4, 2, 1, 3] but the line-number order [1, 4, 2, 3].  Our formatting code produced the line-number order, not the predicted ranks.
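
To make the difference concrete, here is a minimal Python sketch of the two computations.  The function names are made up for illustration, and breaking ties by input order is my own assumption, since the example in the instructions has no tied scores.

```python
def predicted_ranks(scores):
    """Return the 1-based rank of each document; rank 1 is the highest score."""
    # Document indices from highest to lowest score.
    # sorted() is stable, so tied scores keep their input order (an assumption).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, doc in enumerate(order, start=1):
        ranks[doc] = rank
    return ranks


def line_numbers_by_ascending_score(scores):
    """Return 1-based document line numbers, lowest score first."""
    return [i + 1 for i in sorted(range(len(scores)), key=lambda i: scores[i])]


# The example from the instructions: both computations happen to agree.
print(predicted_ranks([3.8, 1.7, 4.9, 2.5]))                  # [2, 4, 1, 3]
print(line_numbers_by_ascending_score([3.8, 1.7, 4.9, 2.5]))  # [2, 4, 1, 3]

# With the first two scores swapped, they disagree.
print(predicted_ranks([1.7, 3.8, 4.9, 2.5]))                  # [4, 2, 1, 3]
print(line_numbers_by_ascending_score([1.7, 3.8, 4.9, 2.5]))  # [1, 4, 2, 3]
```

Only the first function produces what the submission format asks for.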

I’ve posted my Python code to format a submission given the score predictions of some ranker.
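
For readers who just want the general shape, the following is a rough sketch of such a formatter, not the posted code itself.  It assumes the predictions arrive as two parallel lists (a query ID and a score per document, in input-file order) and that each query’s documents are contiguous in the input.  The names `format_submission` and `ranks_for_query` and the query IDs are hypothetical.

```python
from itertools import groupby


def ranks_for_query(scores):
    """Return the 1-based rank of each document; rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, doc in enumerate(order, start=1):
        ranks[doc] = rank
    return ranks


def format_submission(query_ids, scores):
    """Build one space-delimited line of predicted ranks per query."""
    lines = []
    rows = zip(query_ids, scores)
    # groupby only merges adjacent rows, hence the contiguity assumption above.
    for _, group in groupby(rows, key=lambda row: row[0]):
        query_scores = [score for _, score in group]
        lines.append(" ".join(str(r) for r in ranks_for_query(query_scores)))
    return lines


# Hypothetical input: four documents for query "q1", two for query "q2".
example_lines = format_submission(
    ["q1", "q1", "q1", "q1", "q2", "q2"],
    [1.7, 3.8, 4.9, 2.5, 0.2, 0.9],
)
print("\n".join(example_lines))
# 4 2 1 3
# 2 1
```

Writing the returned lines to a text file, one per line, gives the upload format described above.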
