October 2011
1 post
Announcing: Ancestry.com Online Forum Test... →
The Ancestry.com Forum Dataset was created with the cooperation of Ancestry.com in an effort to promote research on information retrieval, language technologies, and social network analysis. It contains a full snapshot of the Ancestry.com online forum, boards.ancestry.com, from July 2010. This message board is large, with over 22 million messages, over 3.5 million authors, and active...
Oct 5th
10 notes
August 2010
3 posts
Aug 5th
11 notes
Thomson Reuters Releases TRC2 News Corpus Through... →
Aug 4th
10 notes
So you want to study IR?
I occasionally get questions from aspiring IR students asking for advice on getting started as an IR researcher.*  Here’s an attempt at some pointers for foundation material and resources. Reading: There’s a lot of background material that any IR researcher should be familiar with. A good textbook to get your head around the fundamentals of the field is essential.  There’s a...
Aug 3rd
60 notes
July 2010
4 posts
Djoerd Hiemstra on Keith van Rijsbergen's... →
Click through to download the classic 1976 Information Retrieval book in .epub format.
Jul 22nd
4 notes
New Microsoft Learning to Rank Datasets →
(via JeffD)
Jul 22nd
10 notes
Top authors in Information Retrieval - Microsoft... →
I’m 142 (tied with many others).  What’s your rank?
Jul 18th
3 notes
Google's Amit Singhal tells us about the dreams... →
Jul 16th
4 notes
June 2010
3 posts
statAP can be > 1.0
It’s probably no surprise to anyone who’s paid attention to the work on statAP sampling, but it is somewhat disconcerting that AP estimates produced in this way can be greater than 1.0. For example, consider a document which is sampled with probability 0.1 and is found to be relevant. A system ranking this document at position 1, which should get a precision @ 1 value of 1.0, gets 10.0...
Jun 17th
3 notes
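The arithmetic in the statAP excerpt above can be reproduced with a minimal sketch. This is not the full statAP estimator, only a simplified inverse-probability-weighted precision@k in which each sampled relevant document counts as 1/p, where p is its sampling probability. The function name `estimated_precision_at_k` and the `judged` dictionary layout are hypothetical, chosen only for illustration.

```python
def estimated_precision_at_k(ranking, judged, k):
    """Simplified inverse-probability-weighted estimate of precision@k.

    ranking: list of document ids in ranked order.
    judged:  dict mapping a sampled document id to a tuple
             (sampling_probability, is_relevant). Unsampled documents
             are absent and contribute nothing.
    Assumption: this is an illustrative simplification, not the
    complete statAP procedure.
    """
    weighted_relevant = 0.0
    for doc_id in ranking[:k]:
        if doc_id in judged:
            probability, is_relevant = judged[doc_id]
            if is_relevant:
                # Each sampled relevant document stands in for
                # 1/probability documents of the full collection.
                weighted_relevant += 1.0 / probability
    return weighted_relevant / k


# The case from the post: one relevant document sampled with p = 0.1,
# ranked first by the system.
judged = {"d1": (0.1, True)}
ranking = ["d1", "d2", "d3"]
print(estimated_precision_at_k(ranking, judged, 1))
```

With a single relevant document sampled at probability 0.1 and ranked first, the weighted count at rank 1 is 1/0.1, so the estimate exceeds the true maximum of 1.0, which is the effect the post describes.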
Smarter Than You Think - I.B.M.'s Supercomputer... →
IBM’s QA efforts on Jeopardy.
Jun 16th
3 notes
Probably Irrelevant » Query logs and information... →
Latest post on Probably Irrelevant by Fernando Diaz
Jun 2nd
3 notes
May 2010
6 posts
“Given that we have gathered the equivalent of less than 6 seconds of Google...”
– Community Query Log Project Results
May 26th
3 notes
FXPAL Blog » Blog Archive » Impossible to find →
May 25th
3 notes
Google, Yahoo, others work to make search engines... →
Some quotes from Jamie Callan & Liz Liddy in this Washington Post article.
May 25th
3 notes
TREC-BLOG - 2010 Guidelines →
May 18th
3 notes
Probably Irrelevant » Vintage Cornell/SMART Tech... →
May 5th
3 notes
Home : ClueWeb09 Wiki →
If you use the ClueWeb data, check out the wiki: PageRank scores, the Web Graph, Spam scores.
May 4th
3 notes
April 2010
8 posts
TREC 2010 Web Track Guidelines →
Web track guidelines have been posted.  Interesting changes this year: a spam filtering task, and ERR as the primary evaluation metric.
Apr 30th
3 notes
Microsoft Web N-gram Services Now in Public Beta... →
Apr 28th
3 notes
MIREX: MapReduce Information Retrieval Experiments →
Djoerd Hiemstra has also posted a MapReduce library for IR experiments.
Apr 28th
3 notes
Anchor text for ClueWeb09 Category A →
Djoerd Hiemstra’s group has posted anchor text resources for the ClueWeb dataset.
Apr 28th
3 notes
Yahoo LETOR Challenge upload format confusion
A few of us here at CMU have been playing with the Yahoo LETOR challenge data and uploaded a couple submissions.  Our performance on the hold-out set has been fairly dismal — barely outperforming a random ranking of documents.  Just today we realized that our script to format the submission was to blame. The upload format is a text file, one line per query.  Each line contains a...
Apr 23rd
3 notes
SIGIR2010 Accepted Papers List →
Apr 13th
3 notes
JavaItertools →
I’ve been working on open-sourcing some of my code.  This first release is a library for decorating, composing & manipulating Java Iterators that I’ve developed over the past couple years at CMU.  Nothing particularly research-y in here, but this code might be useful to others.  See the examples for a few use cases. It’s a work in progress, and not an official 1.0 (or 0.1) release...
Apr 9th
3 notes
“Title: Graduate Student Grooming Habits: Automatic Annotation Date: Thursday,...”
– Talk announcement in my inbox today.
Apr 1st
3 notes
March 2010
6 posts
SIGIR Meta | Google Groups →
Google group hosting ongoing discussion on the SIGIR review process.  There are already loads of excellent contributions from many seasoned researchers who’ve been heavily involved in SIGIR’s organization for years.
Mar 29th
3 notes
SIGSIGIR « IREvalEtAl →
Will Webber nails it.
Mar 27th
3 notes
Not Relevant →
The birth of a new IR publication venue, primarily targeted at disgruntled rejects.  Brilliance or folly?
Mar 26th
3 notes
Java & the ClueWeb09 Webgraph
This post has been sitting in my drafts queue since before TREC 2009 submissions were due.   Although I’m not currently working with this data, someone somewhere might be interested in various problems dealing with the ClueWeb link graph in Java.  We’ve been wrestling with the Web graph for the ClueWeb09 dataset the past couple weeks, in hopes of using this data in our TREC Relevance...
Mar 15th
11 notes
Relevance assessment with MTurk & statAP
As many of you know, I’ve been building a test collection with Amazon Mechanical Turk.  This involves these deceptively simple-sounding steps: (1) simulate a diverse set of systems and run queries; (2) sample retrieved documents; (3) collect relevance assessments on the sampled documents. Although it sounds straightforward, many, many questions have come up over the course of building this collection:...
Mar 10th
3 notes
Mar 4th
3 notes
February 2010
4 posts
Yahoo! Learning to Rank Challenge →
(via Jeff D)
Feb 25th
Take the TREC Survey →
Feb 22nd
Homage papers
Aardvark recently announced the acceptance of a paper describing their system at WWW: “The Anatomy of a Large-Scale Social Search Engine”.  This is a clear (and somewhat brash) shout-out to the original Google paper, “The Anatomy of a Large-scale Hypertextual Web Search Engine”. This got me thinking — are there other notable examples of an academic paper paying...
Feb 17th
Google Buzz vs. Twitter & why Buzz might be a huge...
I’m digging Google Buzz.  Here’s why: 1) Threaded discussion.  The #1 thing I dislike about Twitter is the lack of structure inherent in the system to support threaded discussion.  Sure, you have @-replies and re-tweets, but they’re really an afterthought.  And they take up precious space in your tweet, which brings me to… 2) No character limits.  The #2 thing I dislike...
Feb 11th
January 2010
1 post
CFP [ACM SIGIR 2010]  →
SIGIR 2010 CFP is up.
Jan 2nd
December 2009
3 posts
FXPAL Blog » Blog Archive » The mystery of the... →
A seriously funny review of the Nook’s interface design.
Dec 15th
Massively Collaborative Mathematics in the NYTimes... →
You probably need to scroll down the page to see it, but there’s a fascinating blurb on collaborative theorem proving in this year’s “Year in Ideas”.
Dec 14th
WSDM 2010 - Accepted Papers - Third ACM... →
Dec 7th
November 2009
4 posts
Nov 25th
Le Zhao's research tricks →
One of my colleagues at CMU has posted quite a few nice tips & tricks for conducting CS & IR research.
Nov 8th
Yisong Yue on Self-Improving Systems that Learn... →
(via hunch.net)
Nov 7th
another good InfoVis at xkcd →
Nov 2nd
October 2009
9 posts
M45 Enables Web-Scale Information Extraction... →
A post by a couple of my CMU colleagues on the Yahoo! Developer Network blog.
Oct 28th
CFP ACM SIGIR 2010  →
Jan. 22 paper deadline.
Oct 24th
Oct 24th
The On-Line Encyclopedia of Integer Sequences →
A clever retrieval system over some very interesting data.
Oct 24th
Binary marble adding machine →
Oct 17th
Got the wrong Bob? →
New Gmail Labs feature; it looks very similar to a paper written by a good friend and CMU grad. (via William Cohen)
Oct 14th
United States Gross National Happiness on Facebook →
Large scale sentiment analysis. (via FlowingData) They label the spikes in happiness, but I really wish they’d labeled the dips, too.
Oct 5th