October 2011
1 post
Announcing: Ancestry.com Online Forum Test... →
The Ancestry.com Forum Dataset was created with the cooperation of Ancestry.com in an effort to promote research on information retrieval, language technologies, and social network analysis. It contains a full snapshot of the Ancestry.com online forum, boards.ancestry.com, from July 2010. This message board is large, with over 22 million messages, over 3.5 million authors, and active...
August 2010
3 posts
Thomson Reuters Releases TRC2 News Corpus Through... →
So you want to study IR?
I occasionally get questions from aspiring IR students asking for advice on getting started as an IR researcher.* Here’s an attempt at some pointers for foundation material and resources:
Reading
There’s a lot of background material that any IR researcher should be familiar with:
A good textbook to get your head around the fundamentals of the field is essential. There’s a...
July 2010
4 posts
Djoerd Hiemstra on Keith van Rijsbergen's... →
Click through to download the classic 1976 Information Retrieval book in .epub format.
New Microsoft Learning to Rank Datasets →
(via JeffD)
Top authors in Information Retrieval - Microsoft... →
I’m 142 (tied with many others). What’s your rank?
Google's Amit Singhal tells us about the dreams... →
June 2010
3 posts
statAP can be > 1.0
It’s probably no surprise to anyone who’s paid attention to the work on statAP sampling, but it is somewhat disconcerting that AP estimates produced in this way can be greater than 1.0.
For example, consider a document which is sampled with probability 0.1 and is found to be relevant. A system ranking this document at position 1, which should get a precision @ 1 value of 1.0, gets 10.0...
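The arithmetic in that example can be sketched in a few lines of Python. This is a simplified inverse-probability-weighted precision estimate, not the full statAP estimator, and the function name, argument names, and data layout are hypothetical, chosen only for illustration:

```python
def estimated_precision_at_k(ranking, sampled, k):
    """Estimate precision@k from a sample of judged documents.

    ranking: list of document ids in ranked order.
    sampled: dict mapping doc id -> (inclusion_probability, is_relevant)
             for the documents that were sampled and judged (assumed layout).
    Each sampled relevant document counts as 1 / inclusion_probability
    relevant documents, which is why the estimate can exceed 1.0.
    """
    estimated_relevant = 0.0
    for doc_id in ranking[:k]:
        if doc_id in sampled:
            prob, is_relevant = sampled[doc_id]
            if is_relevant:
                estimated_relevant += 1.0 / prob
    return estimated_relevant / k


# One relevant document, sampled with probability 0.1, ranked first.
ranking = ["d1"]
sampled = {"d1": (0.1, True)}
print(estimated_precision_at_k(ranking, sampled, k=1))
```

With a single relevant document sampled at probability 0.1 and ranked first, the weighted count is 1 / 0.1, so the estimated precision @ 1 comes out at 10.0 rather than 1.0.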
Smarter Than You Think - I.B.M.'s Supercomputer... →
IBM’s QA efforts on Jeopardy.
Probably Irrelevant » Query logs and information... →
Latest post on Probably Irrelevant by Fernando Diaz
May 2010
6 posts
Given that we have gathered the equivalent of less than 6 seconds of Google...
– Community Query Log Project Results
FXPAL Blog » Blog Archive » Impossible to find →
Google, Yahoo, others work to make search engines... →
Some quotes from Jamie Callan & Liz Liddy in this Washington Post article.
TREC-BLOG - 2010 Guidelines →
Probably Irrelevant » Vintage Cornell/SMART Tech... →
Home : ClueWeb09 Wiki →
If you use the ClueWeb data, check out the wiki: PageRank scores, the Web Graph, Spam scores.
April 2010
8 posts
TREC 2010 Web Track Guidelines →
Web track guidelines have been posted. Interesting changes this year: a spam filtering task, and ERR as the primary evaluation metric.
Microsoft Web N-gram Services Now in Public Beta... →
MIREX: MapReduce Information Retrieval Experiments →
Djoerd Hiemstra has also posted a MapReduce library for IR experiments.
Anchor text for ClueWeb09 Category A →
Djoerd Hiemstra’s group has posted anchor text resources for the ClueWeb dataset.
Yahoo LETOR Challenge upload format confusion
A few of us here at CMU have been playing with the Yahoo LETOR challenge data and have uploaded a couple of submissions. Our performance on the hold-out set has been fairly dismal, barely outperforming a random ranking of documents. Just today we realized that our script to format the submission was to blame.
The upload format is a text file, one line per query. Each line contains a...
SIGIR2010 Accepted Papers List →
JavaItertools →
I’ve been working on open-sourcing some of my code. This first release is a library for decorating, composing & manipulating Java Iterators that I’ve developed over the past couple years at CMU. Nothing particularly research-y in here, but this code might be useful to others. See the examples for a few use cases.
It’s a work in progress, and not an official 1.0 (or 0.1) release...
Title: Graduate Student Grooming Habits: Automatic Annotation
Date: Thursday,...
– Talk announcement in my inbox today.
March 2010
6 posts
SIGIR Meta | Google Groups →
Google group hosting ongoing discussion on the SIGIR review process. There are already loads of excellent contributions from many seasoned researchers who’ve been heavily involved in SIGIR’s organization for years.
SIGSIGIR « IREvalEtAl →
Will Webber nails it.
Not Relevant →
The birth of a new IR publication venue, primarily targeted at disgruntled rejects. Brilliance or folly?
Java & the ClueWeb09 Webgraph
This post has been sitting in my drafts queue since before TREC 2009 submissions were due. Although I’m not currently working with this data, someone somewhere might be interested in various problems dealing with the ClueWeb link graph in Java.
We’ve been wrestling with the Web graph for the ClueWeb09 dataset the past couple weeks, in hopes of using this data in our TREC Relevance...
Relevance assessment with MTurk & statAP
As many of you know, I’ve been building a test collection with Amazon Mechanical Turk. This involves these deceptively simple-sounding steps:
Simulate a diverse set of systems and run queries
Sample retrieved documents
Collect relevance assessments on the sampled documents
Although it sounds straightforward, many, many questions have come up over the course of building this collection:...
February 2010
4 posts
Yahoo! Learning to Rank Challenge →
(via Jeff D)
Take the TREC Survey →
Homage papers
Aardvark recently announced the acceptance of a paper describing their system at WWW: “The Anatomy of a Large-Scale Social Search Engine”. This is a clear (and somewhat brash) shout-out to the original Google paper, “The Anatomy of a Large-scale Hypertextual Web Search Engine”.
This got me thinking — are there other notable examples of an academic paper paying...
Google Buzz vs. Twitter & why Buzz might be a huge...
I’m digging Google Buzz. Here’s why:
1) Threaded discussion. The #1 thing I dislike about Twitter is the lack of structure inherent in the system to support threaded discussion. Sure, you have @-replies and re-tweets, but they’re really an afterthought. And they take up precious space in your tweet, which brings me to…
2) No character limits. The #2 thing I dislike...
January 2010
1 post
CFP [ACM SIGIR 2010] →
SIGIR 2010 CFP is up.
December 2009
3 posts
FXPAL Blog » Blog Archive » The mystery of the... →
A seriously funny review of the Nook’s interface design.
Massively Collaborative Mathematics in the NYTimes... →
You probably need to scroll down the page to see it, but there’s a fascinating blurb on collaborative theorem proving in this year’s “Year in Ideas”.
WSDM 2010 - Accepted Papers - Third ACM... →
November 2009
4 posts
Le Zhao's research tricks →
One of my colleagues at CMU has posted quite a few nice tips & tricks for conducting CS & IR research.
Yisong Yue on Self-Improving Systems that Learn... →
(via hunch.net)
another good InfoVis at xkcd →
October 2009
9 posts
M45 Enables Web-Scale Information Extraction... →
A post by a couple of my CMU colleagues on the Yahoo! Developer Network blog.
CFP ACM SIGIR 2010 →
Jan. 22 paper deadline.
The On-Line Encyclopedia of Integer Sequences →
A clever retrieval system over some very interesting data.
Binary marble adding machine →
Got the wrong Bob? →
A new Gmail Labs feature that looks very similar to a paper written by a good friend and CMU grad.
(via William Cohen)
United States Gross National Happiness on Facebook →
Large scale sentiment analysis. (via FlowingData)
They label the spikes in happiness, but I really wish they’d labeled the dips, too.