I occasionally get questions from aspiring IR students asking for advice on getting started as an IR researcher.* Here’s an attempt at some pointers for foundation material and resources:
There’s a lot of background material that any IR researcher should be familiar with:
- A good textbook to get your head around the fundamentals of the field is essential. There’s a long history of finely tuned mathematical models of information retrieval, which still strongly influence most modern IR research. Understand these models and become familiar with the issues they address. Recently at CMU, the graduate-level introductory IR classes have used Introduction to Information Retrieval by Manning, Raghavan & Schütze, which I highly recommend. But there are other good books out there, too: Search Engines: Information Retrieval in Practice by Croft, Metzler & Strohman is geared more towards the undergraduate audience. When I took the IR class at CMU, we used Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto (which despite the name may be a little dated now). I still refer back to Managing Gigabytes by Witten, Moffat & Bell when digging into the guts of an indexing problem.
- Read some classic IR papers. There’s some gems listed in the 2005 SIGIR Forum article “Recommended reading for IR research students”. Lots of these are somewhat dated (eg. the IR world has pretty much moved beyond LSI), but some are still very heavily cited (eg. the PageRank paper, Lavrenko’s relevance models paper).
- A solid foundation in machine learning is becoming increasingly important. Tom Mitchell’s classic book is great. I’m also a big fan of Andrew Moore’s tutorial slides. (There’s clearly a CMU bias here — I was fortunate enough to take a machine learning class taught by both Tom and Andrew.)
- Pay attention to what’s hot. Follow the top conferences (SIGIR, CIKM, ECIR, WSDM, WWW, ASIST) and journals (Information Retrieval, TOIS, JASIST), read all the papers that get best paper awards, and attend the conferences if you can. (I know I’m missing some great conferences in this list.)
- IR as a field has always strongly valued solid evaluation methodologies. The Text REtrieval Conference, run by NIST, has provided many researchers with invaluable datasets and a forum for testing retrieval algorithms. Familiarize yourself with the tasks, datasets and tools used at TREC.
- Don’t forget that IR is not just a CS research topic — it has its origins in Library Science. There is a strong (and IMO under-appreciated) Information & Library Science IR research community. IR isn’t just about the mathematics behind retrieval models — we need to understand the searchers and the user interfaces.
- Check out VideoLectures.net’s archive of IR talks. Not really reading, but you can see what kind of research is presented at IR conferences, as well as some IR tutorials by well-known IR researchers.
Effective presentation of your ideas is essential in any field. As a reviewer, I’ve seen a lot of papers that haven’t been anywhere close to publication quality. As a conference attendee, I’ve heard a lot of terrible talks. This advice applies to any technical field, not just IR.
- If English is not your first language, find (befriend, hire) a native speaker who can edit your work and/or tutor you. For better or worse, you need to write fluently in English to get published.
- Writing well takes practice. Try to write some each week. Trevor Strohman recommends writing some every day, along with giving a lot of other good general research tips. Find other researchers to swap paper drafts with, so you can act as reviewers for each other. Read Writing for Computer Science by Zobel (via @ssn).
- Giving good presentations takes practice. If you can give a good conference presentation, people will remember you and your work. Find some advice on good presentation skills and follow it. I like this guy’s advice. I also like to keep words and equations to a minimum on my slides. But, everyone has their own presentation style. Most importantly, practice your talks. And please, don’t just read your slides.
You aren’t going to get far in IR research without getting your hands dirty. This is an applied field, and it’s rare for someone to get a PhD without actually creating some software; dealing with some large, messy datasets; or *shudder* users!
- Learn how to perform a retrieval experiment, and do it. You’ll need a document collection, a set of queries, and a set of relevance judgements. Terrier’s “Quick Start” guide gives a good overview of the process.
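To make the experiment concrete, here’s a minimal sketch (in Python, with made-up document IDs and judgements) of scoring a single query once you have a ranked run and a set of relevance judgements. This is just the textbook arithmetic for P@k and average precision, not any particular toolkit’s implementation:

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Mean of P@k over the ranks k where a relevant document appears,
    divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

# Hypothetical system output for one query, plus its judged-relevant set.
ranking = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d3", "d1", "d5"}

print(precision_at_k(ranking, relevant, 3))   # 2 of the top 3 are relevant
print(average_precision(ranking, relevant))
```

In practice you’d compute these over every query in the topic set and report the mean (hence “mean average precision”), which is exactly what tools like trec_eval do for you.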
- As mentioned above, TREC is the premier forum for IR evaluation. Participate in a TREC track if possible, and go to the conference. TREC is an excellent venue for hands-on experimentation and is very low-risk — you can’t get rejected, and you’ll have a (non-refereed) publication on your CV.
- You don’t need to write a search engine from scratch, although this can be a great learning exercise. There are quite a few very good open source research search engines. See Jeff Dalton’s reasonably up-to-date list. Many of these have been used over the years at TREC and support TREC-formatted document collections.
- Understanding the searcher is still a wide-open area of IR research. Perform a user study to explore how people formulate queries, interpret results, etc.
- There are many unsolved real-world IR problems all around us. Although we see a lot of publications from large web search companies, web search isn’t the only search problem. Find a tractable problem and attempt a solution. For example, build a search engine for your department’s publications; build a better search interface for Wikipedia; tackle people-search in Twitter.
Hope this helps answer some of the new IR researcher’s questions. And, seasoned researchers, please leave a comment if you spot any omissions.
* I’m probably not the best person to ask this question, but I do get a request every couple of months. Why they ask me is a mystery — maybe all grad students at top US universities get these requests?
It’s probably no surprise to anyone who’s paid attention to the work on statAP sampling, but it is somewhat disconcerting that AP estimates produced in this way can be greater than 1.0.
For example, consider a document which is sampled with probability 0.1 and is found to be relevant. A system ranking this document at position 1, which should get a precision @ 1 value of 1.0, gets 10.0 instead. See the StatAP paper for details on estimating precision at cutoffs. In this example, we can compute an exact P@1 value since the only document of interest has actually been judged.
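The arithmetic behind that example can be sketched as an inverse-probability (Horvitz-Thompson style) estimate: each sampled relevant document is weighted by the reciprocal of its inclusion probability. This is a simplification of what statAP actually does (see the paper for the real estimator); the doc IDs and probabilities below are made up to reproduce the P@1 = 10.0 case:

```python
def estimated_precision_at_k(ranking, sampled_judgements, k):
    """Inverse-probability precision estimate at cutoff k.

    ranking: list of doc IDs in ranked order.
    sampled_judgements: doc ID -> (is_relevant, inclusion_probability)
        for documents that were sampled and judged; unsampled docs
        contribute nothing to the estimate.
    """
    total = 0.0
    for doc in ranking[:k]:
        if doc in sampled_judgements:
            relevant, p = sampled_judgements[doc]
            if relevant:
                # Unbiased in expectation, but unbounded for any one sample:
                # a relevant doc with small p contributes 1/p >> 1.
                total += 1.0 / p
    return total / k

# The example from the text: one relevant document, sampled with
# probability 0.1, ranked at position 1.
print(estimated_precision_at_k(["d1"], {"d1": (True, 0.1)}, 1))  # 10.0
```

The estimate is unbiased only on average over many samples; for any single sample it can blow well past 1.0, which is exactly the behavior described above.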
I’m a little uneasy basing my analysis on AP estimates that don’t really resemble what I’m used to seeing as AP values. In some cases, particularly for “easy” queries with lots of relevant documents, I’m commonly seeing statAP estimates well over 5. Is statAP broken?