15th
Java & the ClueWeb09 Webgraph
This post has been sitting in my drafts queue since before TREC 2009 submissions were due. Although I’m not currently working with this data, someone somewhere might be interested in various problems dealing with the ClueWeb link graph in Java.
We’ve been wrestling with the Web graph for the ClueWeb09 dataset the past couple weeks, in hopes of using this data in our TREC Relevance Feedback Track submission. Jamie’s group has extracted & compressed the webgraph for the Category B subset using the excellent Java graph compression package from Sebastiano Vigna.
[Update: the webgraph data has been posted.]
(on a side note — his research group has released some excellent Java software for dealing with large amounts of data, including a Java port of the MG search engine, MG4J)
The graph compression package seems exceptionally well written and documented, is blazingly fast and easy to use, and worked quite well with the GOV2 collection. But, it seems that this package relies heavily on storing the graph nodes in int arrays (at least in intermediate stages), and there are some related limitations of Java that are really slowing us down.
Most importantly, in Java, you can’t create an int array longer than 536,870,912 elements (a 2 Gig int array), even when allocating obscene amounts of memory to the JVM (32G running as root on linux). The Category B subset contains 428,136,613 pages, and when including linked-to pages from those, we run into this limit.
A second limitation is that Java only indexes arrays by int’s, and the MAX_INT in Java is 2,147,483,647 (231-1). The whole ClueWeb09 dataset contains 4,780,950,903 web pages, more than double the possible addressable Java array size. So, even if we were able to get around the first issue, it seems that we might need a different solution for the full dataset.