<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description>sporadic ramblings of a comp sci grad student studying information retrieval
Me @ CMU



</description><title>window office</title><generator>Tumblr (3.0; @windowoffice)</generator><link>http://windowoffice.tumblr.com/</link><item><title>Announcing: Ancestry.com Online Forum Test Collection</title><description>&lt;a href="http://New Test Collection: Ancestry.com Online Forum"&gt;Announcing: Ancestry.com Online Forum Test Collection&lt;/a&gt;: &lt;blockquote&gt;
&lt;p&gt;The Ancestry.com Forum Dataset was created with the cooperation of &lt;a href="http://ancestry.com/"&gt;Ancestry.com&lt;/a&gt; in an effort to promote research on information retrieval, language technologies, and social network analysis. It contains a full snapshot of the Ancestry.com online forum, &lt;a href="http://boards.ancestry.com/"&gt;boards.ancestry.com&lt;/a&gt;, from July 2010. This message board is large, with over 22 million messages, over 3.5 million authors, and active participation for over ten years.&lt;/p&gt;
&lt;/blockquote&gt;</description><link>http://windowoffice.tumblr.com/post/11077203774</link><guid>http://windowoffice.tumblr.com/post/11077203774</guid><pubDate>Wed, 05 Oct 2011 19:06:00 -0400</pubDate></item><item><title>2010 Google Faculty Summit: The Anatomy of a Large Scale Social...</title><description>&lt;iframe width="400" height="245" src="http://www.youtube.com/embed/bsuGQHAteN8?wmode=transparent&amp;autohide=1&amp;egm=0&amp;hd=1&amp;iv_load_policy=3&amp;modestbranding=1&amp;rel=0&amp;showinfo=0&amp;showsearch=0" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;&lt;br/&gt;&lt;br/&gt;&lt;p&gt;&lt;a href="http://www.youtube.com/watch?v=bsuGQHAteN8&amp;feature=youtube_gdata"&gt;2010 Google Faculty Summit: The Anatomy of a Large Scale Social Search Engine&lt;/a&gt; (Aardvark)&lt;/p&gt;
&lt;p&gt;(via &lt;a href="http://youtube.com/user/GoogleTechTalks"&gt;GoogleTechTalks&lt;/a&gt;)&lt;/p&gt;</description><link>http://windowoffice.tumblr.com/post/905272595</link><guid>http://windowoffice.tumblr.com/post/905272595</guid><pubDate>Wed, 04 Aug 2010 20:52:43 -0400</pubDate></item><item><title>Thomson Reuters Releases TRC2 News Corpus Through NIST - Dr. Jochen L. Leidner's Blog</title><description>&lt;a href="http://jochenleidner.posterous.com/thomson-reuters-releases-research-collection"&gt;Thomson Reuters Releases TRC2 News Corpus Through NIST - Dr. Jochen L. Leidner's Blog&lt;/a&gt;</description><link>http://windowoffice.tumblr.com/post/902873634</link><guid>http://windowoffice.tumblr.com/post/902873634</guid><pubDate>Wed, 04 Aug 2010 09:15:41 -0400</pubDate></item><item><title>So you want to study IR?</title><description>&lt;p&gt;I occasionally get questions from aspiring IR students asking for advice on getting started as an IR researcher.*  Here&amp;#8217;s an attempt at some pointers for foundation material and resources:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reading&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There&amp;#8217;s a lot of background material that any IR researcher should be familiar with:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;A good textbook to get your head around the fundamentals of the field is essential.  There&amp;#8217;s a long history of finely tuned mathematical models of information retrieval, which still strongly influence most modern IR research.  Understand these models and become familiar with the issues they address.  Recently at CMU, the graduate-level introductory IR classes have used &lt;a href="http://nlp.stanford.edu/IR-book/information-retrieval-book.html"&gt;Introduction to Information Retrieval by Manning, Raghavan &amp;amp; Schütze&lt;/a&gt;, which I highly recommend.  There are other good books out there, too: &lt;a href="http://www.search-engines-book.com/"&gt;Search Engines: Information Retrieval in Practice by Croft, Metzler &amp;amp; Strohman&lt;/a&gt; is geared more towards the undergraduate audience.  When I took the IR class at CMU, we used &lt;a href="http://people.ischool.berkeley.edu/~hearst/irbook/"&gt;Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto&lt;/a&gt; (which despite the name may be a little dated now).  I still refer back to &lt;a href="http://ww2.cs.mu.oz.au/mg/"&gt;Managing Gigabytes by Witten, Moffat &amp;amp; Bell&lt;/a&gt; when digging into the guts of an indexing problem.&lt;/li&gt;
&lt;li&gt;Read some classic IR papers.  There are some gems listed in the &lt;a href="http://portal.acm.org/citation.cfm?id=1113344"&gt;2005 SIGIR Forum article &amp;#8220;Recommended reading for IR research students&amp;#8221;&lt;/a&gt;.  Lots of these are somewhat dated (e.g. the IR world has pretty much moved beyond LSI), but some are still very heavily cited (e.g. the &lt;a href="http://dx.doi.org/10.1016/S0169-7552(98)00110-X"&gt;PageRank paper&lt;/a&gt;, &lt;a href="http://doi.acm.org/10.1145/383952.383972"&gt;Lavrenko&amp;#8217;s relevance models paper&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;A solid foundation in machine learning is becoming increasingly important.  &lt;a href="http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html"&gt;Tom Mitchell&amp;#8217;s classic book&lt;/a&gt; is great.  I&amp;#8217;m also a big fan of &lt;a href="http://www.autonlab.org/tutorials/"&gt;Andrew Moore&amp;#8217;s tutorial slides&lt;/a&gt;.  (There&amp;#8217;s clearly a CMU bias here &amp;#8212; I was fortunate enough to take a machine learning class taught by both Tom and Andrew.)&lt;/li&gt;
&lt;li&gt;Pay attention to what&amp;#8217;s hot.  Follow the top conferences (&lt;a href="http://sigir.org/"&gt;SIGIR&lt;/a&gt;, &lt;a href="http://www.cikmconference.org/"&gt;CIKM&lt;/a&gt;, &lt;a href="http://kmi.open.ac.uk/events/ecir2010/"&gt;ECIR&lt;/a&gt;, &lt;a href="http://www.wsdm-conference.org/"&gt;WSDM&lt;/a&gt;, &lt;a href="http://www.iw3c2.org/conferences/"&gt;WWW&lt;/a&gt;, &lt;a href="http://www.asis.org/conferences.html"&gt;ASIST&lt;/a&gt;) and journals (&lt;a href="http://www.springer.com/computer/database+management+%26+information+retrieval/journal/10791"&gt;Information Retrieval&lt;/a&gt;, &lt;a href="http://tois.acm.org/"&gt;TOIS&lt;/a&gt;, &lt;a href="http://www.asis.org/jasist.html"&gt;JASIST&lt;/a&gt;), read all the papers that get best paper awards, and attend the conferences if you can.  (I know I&amp;#8217;m missing some great conferences in this list.)  &lt;/li&gt;
&lt;li&gt;IR as a field has always strongly valued solid evaluation methodologies.  The &lt;a href="http://trec.nist.gov/"&gt;Text REtrieval Conference, run by NIST&lt;/a&gt;, has provided many researchers with invaluable datasets and a forum for testing retrieval algorithms.  Familiarize yourself with the tasks, datasets and tools used at TREC.  &lt;/li&gt;
&lt;li&gt;Don&amp;#8217;t forget that IR is not just a CS research topic &amp;#8212; it has its origins in Library Science.  There is a strong (and IMO under-appreciated) Information &amp;amp; Library Science IR research community.  IR isn&amp;#8217;t just about the mathematics behind retrieval models &amp;#8212; we need to understand the searchers and the user interfaces.&lt;/li&gt;
&lt;li&gt;Check out &lt;a href="http://videolectures.net/Top/Computer_Science/Information_Retrieval/"&gt;Video Lectures&amp;#8217;s archive of IR talks&lt;/a&gt;.  Not really reading, but you can see what kind of research is presented at IR conferences, as well as some IR tutorials by well-known IR researchers.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;&lt;strong&gt;Communicating&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Effective presentation of your ideas is essential in any field.  As a reviewer, I&amp;#8217;ve seen a lot of papers that haven&amp;#8217;t been anywhere close to publication quality.  As a conference attendee, I&amp;#8217;ve heard a lot of terrible talks.  This advice applies to any technical field, not just IR.&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;If English is not your first language, find (befriend, hire) a native speaker who can edit your work and/or tutor you.  For better or worse, you need to write fluently in English to get published.&lt;/li&gt;
&lt;li&gt;Writing well takes practice.  Try to write some each week.  Trevor Strohman &lt;a href="http://ciir.cs.umass.edu/~strohman/ciir-research-guide"&gt;recommends writing some every day&lt;/a&gt;, along with giving a lot of other good general research tips.  Find other researchers to swap paper drafts with, so that you can review each other&amp;#8217;s work.  Read &lt;a href="http://www.justinzobel.com/"&gt;Writing for Computer Science by Zobel&lt;/a&gt; (via &lt;a href="http://twitter.com/ssn"&gt;@ssn&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Giving good presentations takes practice.  If you can give a good conference presentation, people will remember you and your work.  Find some advice on good presentation skills and follow it.  I like &lt;a href="http://www.randsinrepose.com/archives/2008/02/03/out_loud.html"&gt;this guy&amp;#8217;s advice&lt;/a&gt;.  I also like to keep words and equations to a minimum on my slides.  But, everyone has their own presentation style.  Most importantly, practice your talks.  And please, don&amp;#8217;t just read your slides. &lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;&lt;strong&gt;Doing&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You aren&amp;#8217;t going to get far in IR research without getting your hands dirty.  This is an applied field and it&amp;#8217;s rare for someone to get a PhD without actually creating some software; dealing with some large, messy datasets; or *shudder* users!&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Learn how to perform a retrieval experiment, and do it.  You&amp;#8217;ll need a document collection, a set of queries, and a set of relevance judgements.  &lt;a href="http://terrier.org/docs/v3.0/quickstart.html"&gt;Terrier&amp;#8217;s &amp;#8220;Quick Start&amp;#8221; guide&lt;/a&gt; gives a good overview of the process.&lt;/li&gt;
&lt;li&gt;As mentioned above, &lt;a href="http://trec.nist.gov"&gt;TREC&lt;/a&gt; is the premier forum for IR evaluation.  Participate in a TREC track if possible, and go to the conference.  TREC is an excellent venue for hands-on experimentation and is very low-risk &amp;#8212; you can&amp;#8217;t get rejected, and you&amp;#8217;ll have a (non-refereed) publication on your CV.&lt;/li&gt;
&lt;li&gt;You don&amp;#8217;t &lt;em&gt;need&lt;/em&gt; to write a search engine from scratch, although this can be a great learning exercise.  There are quite a few very good open source research search engines.  See &lt;a href="http://www.searchenginecaffe.com/2007/03/open-source-search-engines-in-java-and.html"&gt;Jeff Dalton&amp;#8217;s reasonably up-to-date list&lt;/a&gt;.  Many of these have been used over the years at TREC and support TREC-formatted document collections.&lt;/li&gt;
&lt;li&gt;Understanding the searcher is still a wide-open area of IR research.  Perform a user study to explore how people formulate queries, interpret results, etc.&lt;/li&gt;
&lt;li&gt;There are many unsolved real-world IR problems all around us.  Although we see a lot of publications from large web search companies, web search isn&amp;#8217;t the only search problem.  Find a tractable problem and attempt a solution.  For example, build a search engine for your department&amp;#8217;s publications; build a better search interface for Wikipedia; tackle people-search in Twitter. &lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;Hope this helps answer some of the new IR researcher&amp;#8217;s questions.  And, seasoned researchers, please leave a comment if you spot any omissions.&lt;/p&gt;
&lt;p&gt;* I&amp;#8217;m probably not the best person to ask this question, but I do get a request every couple of months.  Why they ask me is a mystery &amp;#8212; maybe all grad students at top US universities get these requests?&lt;/p&gt;</description><link>http://windowoffice.tumblr.com/post/898277337</link><guid>http://windowoffice.tumblr.com/post/898277337</guid><pubDate>Tue, 03 Aug 2010 10:58:00 -0400</pubDate></item><item><title>Djoerd Hiemstra on Keith van Rijsbergen's retirement</title><description>&lt;a href="http://wwwhome.cs.utwente.nl/~hiemstra/2010/keith-van-rijsbergen-retired.html"&gt;Djoerd Hiemstra on Keith van Rijsbergen's retirement&lt;/a&gt;: &lt;p&gt;Click through to download the classic 1976 Information Retrieval book in .epub format.&lt;/p&gt;</description><link>http://windowoffice.tumblr.com/post/844916485</link><guid>http://windowoffice.tumblr.com/post/844916485</guid><pubDate>Thu, 22 Jul 2010 08:08:55 -0400</pubDate></item><item><title>New Microsoft Learning to Rank Datasets</title><description>&lt;a href="http://research.microsoft.com/en-us/projects/mslr/default.aspx"&gt;New Microsoft Learning to Rank Datasets&lt;/a&gt;: &lt;p&gt;(via &lt;a href="http://www.searchenginecaffe.com/2010/07/microsoft-releases-learning-to-rank.html"&gt;JeffD&lt;/a&gt;)&lt;/p&gt;</description><link>http://windowoffice.tumblr.com/post/844874559</link><guid>http://windowoffice.tumblr.com/post/844874559</guid><pubDate>Thu, 22 Jul 2010 07:52:40 -0400</pubDate></item><item><title>Top authors in Information Retrieval - Microsoft Academic Search</title><description>&lt;a href="http://academic.research.microsoft.com/CSDirectory/author_category_8_last5.htm"&gt;Top authors in Information Retrieval - Microsoft Academic Search&lt;/a&gt;: &lt;p&gt;I’m 142 (tied with many others).  
What’s your rank?&lt;/p&gt;</description><link>http://windowoffice.tumblr.com/post/828988655</link><guid>http://windowoffice.tumblr.com/post/828988655</guid><pubDate>Sun, 18 Jul 2010 16:39:42 -0400</pubDate></item><item><title>Google's Amit Singhal tells us about the dreams search engines are made of</title><description>&lt;a href="http://www.engadget.com/2010/07/16/googles-amit-singhal-tells-us-about-the-dreams-search-engines-a/"&gt;Google's Amit Singhal tells us about the dreams search engines are made of&lt;/a&gt;</description><link>http://windowoffice.tumblr.com/post/820444228</link><guid>http://windowoffice.tumblr.com/post/820444228</guid><pubDate>Fri, 16 Jul 2010 15:21:04 -0400</pubDate></item><item><title>statAP can be &gt; 1.0</title><description>&lt;p&gt;It&amp;#8217;s probably no surprise to anyone who&amp;#8217;s paid attention to the work on statAP sampling, but it is somewhat disconcerting that AP estimates produced in this way can be greater than 1.0.  &lt;/p&gt;
&lt;p&gt;For example, consider a document which is sampled with probability 0.1 and is found to be relevant.  A system ranking this document at position 1, which should get a precision @ 1 value of 1.0, gets 10.0 instead.  See &lt;a href="http://www.ccs.neu.edu/home/jaa/tmp/statAP.pdf"&gt;the StatAP paper&lt;/a&gt; for details on estimating precision at cutoffs.  In this example, we can compute an exact P@1 value since the only document of interest has actually been judged.&lt;/p&gt;
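&lt;p&gt;To make the arithmetic concrete, here is a minimal Python sketch of an inverse-probability-weighted estimate of precision at a cutoff.  It assumes a simplified form of the estimator (each sampled relevant document counts as 1 divided by its sampling probability), not the exact procedure from the StatAP paper; the function name and the toy data are hypothetical.&lt;/p&gt;

```python
def estimated_precision_at_k(ranking, judged, k):
    """Inverse-probability-weighted estimate of precision at cutoff k.

    ranking: list of document ids in ranked order.
    judged:  dict mapping each sampled document id to a pair
             (inclusion_probability, is_relevant).
    Assumption: a simplified estimator, not the exact StatAP procedure.
    """
    weighted_relevant = 0.0
    for doc_id in ranking[:k]:
        if doc_id in judged:
            inclusion_probability, is_relevant = judged[doc_id]
            if is_relevant:
                # Each sampled relevant document stands in for
                # 1 / probability documents of the full collection.
                weighted_relevant += 1.0 / inclusion_probability
    return weighted_relevant / k


# Hypothetical data matching the example: one relevant document,
# sampled with probability 0.1 and ranked first.
judged = {"d1": (0.1, True)}
ranking = ["d1", "d2", "d3"]

estimate = estimated_precision_at_k(ranking, judged, 1)
print(estimate)
```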
&lt;p&gt;I&amp;#8217;m a little uneasy basing my analysis on AP estimates that don&amp;#8217;t really resemble what I&amp;#8217;m used to seeing as AP values.  In some cases, particularly for &amp;#8220;easy&amp;#8221; queries with lots of relevant documents, I&amp;#8217;m commonly seeing statAP estimates well over 5.  Is statAP broken?&lt;/p&gt;</description><link>http://windowoffice.tumblr.com/post/708407763</link><guid>http://windowoffice.tumblr.com/post/708407763</guid><pubDate>Thu, 17 Jun 2010 12:34:42 -0400</pubDate></item><item><title>Smarter Than You Think - I.B.M.'s Supercomputer Challenges 'Jeopardy!' Champions - NYTimes.com</title><description>&lt;a href="http://www.nytimes.com/2010/06/20/magazine/20Computer-t.html?hp"&gt;Smarter Than You Think - I.B.M.'s Supercomputer Challenges 'Jeopardy!' Champions - NYTimes.com&lt;/a&gt;: &lt;p&gt;IBM’s QA efforts on Jeopardy.&lt;/p&gt;</description><link>http://windowoffice.tumblr.com/post/705385488</link><guid>http://windowoffice.tumblr.com/post/705385488</guid><pubDate>Wed, 16 Jun 2010 15:40:24 -0400</pubDate></item><item><title>Probably Irrelevant » Query logs and information retrieval research</title><description>&lt;a href="http://probablyirrelevant.org/2010/06/query-logs-and-information-retrieval-research/"&gt;Probably Irrelevant » Query logs and information retrieval research&lt;/a&gt;: &lt;p&gt;Latest post on Probably Irrelevant by Fernando Diaz&lt;/p&gt;</description><link>http://windowoffice.tumblr.com/post/655244584</link><guid>http://windowoffice.tumblr.com/post/655244584</guid><pubDate>Tue, 01 Jun 2010 22:57:19 -0400</pubDate></item><item><title>"Given that we have gathered the equivalent of less than 6 seconds of Google traffic (assuming 500..."</title><description>“Given that we have gathered the equivalent of less than 6 seconds of Google traffic (assuming 500 million queries per day) in one year, we have decided to terminate the project.”&lt;br/&gt;&lt;br/&gt; - &lt;em&gt;&lt;a 
href="http://lemurstudy.cs.umass.edu/"&gt;Community Query Log Project Results&lt;/a&gt;&lt;/em&gt;</description><link>http://windowoffice.tumblr.com/post/634308427</link><guid>http://windowoffice.tumblr.com/post/634308427</guid><pubDate>Wed, 26 May 2010 08:22:38 -0400</pubDate></item><item><title>FXPAL Blog » Blog Archive » Impossible to find</title><description>&lt;a href="http://palblog.fxpal.com/?p=3776"&gt;FXPAL Blog » Blog Archive » Impossible to find&lt;/a&gt;</description><link>http://windowoffice.tumblr.com/post/631482753</link><guid>http://windowoffice.tumblr.com/post/631482753</guid><pubDate>Tue, 25 May 2010 11:12:08 -0400</pubDate></item><item><title>Google, Yahoo, others work to make search engines better at scanning the Web</title><description>&lt;a href="http://www.washingtonpost.com/wp-dyn/content/article/2010/05/24/AR2010052402609.html"&gt;Google, Yahoo, others work to make search engines better at scanning the Web&lt;/a&gt;: &lt;p&gt;Some quotes from Jamie Callan &amp; Liz Liddy in this Washington Post article.&lt;/p&gt;</description><link>http://windowoffice.tumblr.com/post/631436013</link><guid>http://windowoffice.tumblr.com/post/631436013</guid><pubDate>Tue, 25 May 2010 10:49:11 -0400</pubDate></item><item><title>TREC-BLOG - 2010 Guidelines</title><description>&lt;a href="http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG#head-bebcc8f6d640b285d8ae2eab67bc2196788440f1"&gt;TREC-BLOG - 2010 Guidelines&lt;/a&gt;</description><link>http://windowoffice.tumblr.com/post/610844046</link><guid>http://windowoffice.tumblr.com/post/610844046</guid><pubDate>Tue, 18 May 2010 16:02:00 -0400</pubDate></item><item><title>Probably Irrelevant » Vintage Cornell/SMART Tech Reports?</title><description>&lt;a href="http://probablyirrelevant.org/2010/05/vintage-cornellsmart-tech-reports/"&gt;Probably Irrelevant » Vintage Cornell/SMART Tech 
Reports?&lt;/a&gt;</description><link>http://windowoffice.tumblr.com/post/574258520</link><guid>http://windowoffice.tumblr.com/post/574258520</guid><pubDate>Wed, 05 May 2010 16:41:04 -0400</pubDate></item><item><title>Home : ClueWeb09 Wiki</title><description>&lt;a href="http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=ClueWeb09%20Wiki"&gt;Home : ClueWeb09 Wiki&lt;/a&gt;: &lt;p&gt;If you use the ClueWeb data, check out the wiki: PageRank scores, the Web Graph, Spam scores.&lt;/p&gt;</description><link>http://windowoffice.tumblr.com/post/571198452</link><guid>http://windowoffice.tumblr.com/post/571198452</guid><pubDate>Tue, 04 May 2010 12:54:00 -0400</pubDate></item><item><title>TREC 2010 Web Track Guidelines</title><description>&lt;a href="http://plg.uwaterloo.ca/~trecweb/2010.html"&gt;TREC 2010 Web Track Guidelines&lt;/a&gt;: &lt;p&gt;Web track guidelines have been posted.  Interesting changes this year: spam filtering task, and using ERR as  the primary evaluation metric.&lt;/p&gt;</description><link>http://windowoffice.tumblr.com/post/560695591</link><guid>http://windowoffice.tumblr.com/post/560695591</guid><pubDate>Fri, 30 Apr 2010 07:18:36 -0400</pubDate></item><item><title>Microsoft Web N-gram Services Now in Public Beta Worldwide</title><description>&lt;a href="http://blogs.msdn.com/msr_er/archive/2010/04/28/microsoft-web-n-gram-services-now-in-public-beta-worldwide.aspx"&gt;Microsoft Web N-gram Services Now in Public Beta Worldwide&lt;/a&gt;</description><link>http://windowoffice.tumblr.com/post/556295266</link><guid>http://windowoffice.tumblr.com/post/556295266</guid><pubDate>Wed, 28 Apr 2010 13:36:07 -0400</pubDate></item><item><title>MIREX: MapReduce Information Retrieval Experiments</title><description>&lt;a href="http://mirex.sourceforge.net/"&gt;MIREX: MapReduce Information Retrieval Experiments&lt;/a&gt;: &lt;p&gt;&lt;span&gt;Djoerd Hiemstra has also posted a MapReduce library for IR 
experiments.&lt;/span&gt;&lt;/p&gt;</description><link>http://windowoffice.tumblr.com/post/554589222</link><guid>http://windowoffice.tumblr.com/post/554589222</guid><pubDate>Tue, 27 Apr 2010 20:29:39 -0400</pubDate></item></channel></rss>
