I got back from Singapore a few days ago and have been working on a couple of posts. Here’s the first, on the learning to rank sessions & workshop.
In the LETOR 1 session:
Seems like a reasonable framework for studying what is becoming the “holy grail” in LETOR for the web. Their framework seems to be a useful way to think about loss functions & bounds while learning to rank. Unfortunately, from the talk it wasn’t clear whether they had a novel approach for learning; they focused on a different loss function that seemed to be just a re-scaling of the RankSVM-style hinge loss. But, from a quick glance at the paper, it appears there’s more to the algorithm than that.
Query Dependent Ranking Using K-Nearest Neighbor by Geng et al.
A clever idea, and in my opinion a good step towards understanding query-level adaptability in LETOR. The authors use KNN in the “query space” to find related queries to a given test query. They then use a ranking model trained from those neighbor queries to rank documents for the given query.
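To make the two-stage idea concrete, here’s a minimal sketch of how I read it. This is my reconstruction, not the authors’ code: `train_ranker`, `train_data`, and the Euclidean distance are placeholders I’ve assumed, and the paper itself trains a ranking model per test query from the pooled neighbor data.

```python
import numpy as np

def knn_query_dependent_rank(test_query_feat, test_doc_feats,
                             train_query_feats, train_data, train_ranker, k=5):
    """Rank one test query's documents with a model trained only on the
    k training queries nearest to it in query-feature space."""
    # 1) Find the k nearest training queries (Euclidean distance here; the
    #    choice of query features and metric is exactly the sensitive part
    #    discussed below).
    dists = np.linalg.norm(train_query_feats - test_query_feat, axis=1)
    neighbors = np.argsort(dists)[:k]

    # 2) Pool the labeled documents of those neighbor queries and fit a
    #    ranker on them. train_ranker is any "fit on (X, y), return a
    #    scoring function" routine (e.g. a RankSVM wrapper).
    X = np.vstack([train_data[i][0] for i in neighbors])
    y = np.concatenate([train_data[i][1] for i in neighbors])
    score = train_ranker(X, y)

    # 3) Sort the test query's documents by descending predicted score.
    return np.argsort(-score(test_doc_feats))
```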
Of course, this method is extremely sensitive to the feature space (and distance measure). The authors used some features that had previously been applied to query classification and didn’t really go beyond that. It would certainly be nice to see some deeper analysis of the effect of different features on this task; it seems like a ripe area for research.
Their method is simple, and nice for that reason. But the performance gains are smaller than I would’ve hoped: in the third decimal place of NDCG. This level of flexibility in the ranking model has huge potential, and this really seems like just a first step. Other areas I’d like to see explored are ways to combine previously learned models (e.g. smoothing a query-specific model with a general one) and unsupervised approaches to adjusting feature weights. I’m also curious to what degree the more complex nonlinear ranking models (RankNet, LambdaRank) are doing this implicitly. With query features in their input, these models could be adapting per query & doing it more intelligently than this simple two-stage (hard) KNN approach.
(One pet peeve with both of these papers: there are loads of bar charts showing performance, but very few actual numbers. Please show the data! Charts are fine for showing general trends, but a full page of charts without the data behind them is somewhat annoying.)
I unfortunately missed the LETOR 2 session, and it looks like there were some great papers there. Two of the papers (Duh & Kirchhoff and Veloso et al.) address exactly my comments on query-time adaptability. The Duh paper looks particularly interesting, and I haven’t had time to give it the attention it needs.
LR4IR workshop keynote from Stephen Robertson
Stephen Robertson, predictably, was terrific at the learning to rank workshop. His keynote focused on two “misconceptions” in current rank learning research. The first is the assumption that directly optimizing your target metric (like NDCG or MAP) on the training set is always the right thing to do. He implied that it may be better to somehow model the underlying “ranking distribution” from the training set and then make predictions that maximize your target metric under that model, rather than on the training set itself. Very similar reasoning to the generative vs. discriminative learning debate. He gave some compelling (toy) examples where, if you know the form of the underlying data distribution, predicting the statistic that minimizes the loss function on the training data doesn’t yield the most accurate predictions on unseen data.
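I don’t have his exact examples, but here’s a toy simulation in the same spirit (entirely my own construction, not his): with Gaussian data, the sample median minimizes absolute error on the training set, yet the sample mean, which is the model-based estimate of the population median once you assume the Gaussian form, predicts held-out data better.

```python
# Toy illustration (mine, not Robertson's): the empirical minimizer of training
# absolute-error loss (the sample median) loses to the model-based estimate
# (the sample mean, assuming a Gaussian) on unseen data.
import numpy as np

rng = np.random.default_rng(0)
trials, n_train, n_test = 2000, 10, 10000
err_empirical, err_model = 0.0, 0.0

for _ in range(trials):
    train = rng.normal(loc=0.0, scale=1.0, size=n_train)
    test = rng.normal(loc=0.0, scale=1.0, size=n_test)

    empirical = np.median(train)  # minimizes absolute error on the training set
    model = np.mean(train)        # Gaussian MLE of the center = model's median

    err_empirical += np.abs(test - empirical).mean()
    err_model += np.abs(test - model).mean()

print("mean test |error|, sample median:", err_empirical / trials)
print("mean test |error|, sample mean:  ", err_model / trials)
```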
The second misconception is that metrics like NDCG@k carry enough information to train your model parameters against; in his view they don’t. A similar argument has been echoed by Taylor et al. at WSDM 2008, their point being that you’re throwing away too much information by strictly optimizing on the top ranks when learning. Recent work by Yilmaz and Aslam also sheds some light in this direction.
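For reference, here’s NDCG@k as it’s commonly computed (one of several gain/discount variants): anything below rank k contributes nothing to the score, which is exactly the information you give up if it’s all you optimize during training.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG@k with the common (2^rel - 1) / log2(rank + 1) gain and discount."""
    rel = np.asarray(relevances, dtype=float)[:k]
    return np.sum((2.0 ** rel - 1.0) / np.log2(np.arange(2, rel.size + 2)))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG@k normalized by the DCG@k of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Two rankings that differ only below rank 3 get identical NDCG@3, so a
# learner optimizing NDCG@3 sees no difference between them.
print(ndcg_at_k([3, 2, 0, 0, 1], k=3))
print(ndcg_at_k([3, 2, 0, 1, 0], k=3))
```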
Although somewhat vague, both of these points echo some suspicions I’ve had while working in this area. Again and again, selecting the model that maximizes a target metric on the training set doesn’t seem to generalize to the test set. I’ve blamed this on the small-ish datasets I’ve been using, but I do believe there’s more to it.
I’ve posted my slides from SIGIR on slideshare. Enjoy!