Thursday 15 April 2010

BitTorrenting biology, getting the big picture in search

The biosciences, like other branches of research, are being dragged into the digital era. This is in part because traditional media of communication, including journal articles, are migrating online, and in part because high-throughput approaches to biological research are producing staggering amounts of data that can only be stored in digital form. A couple of papers released by PLoS ONE have presented new approaches to both aspects of digitization that, in essence, simply involve adapting tools that are common outside the field for use by biologists.

BitTorrenting genomes

The paper that describes the BioTorrents project lays out some staggering figures on the scale of the problem, based on a query posed to someone at the National Center for Biotechnology Information. In a recent month, the NCBI served the following data sets: 1000 Genomes (9TB, served 100,000 times), bacterial genomes (52GB, 30,000 downloads), and GenBank (233GB, 10,000 downloads), in addition to tens of thousands of retrievals of smaller datasets. That's a lot of bandwidth by anyone's standards, all of it served by a relatively small portion of the NIH.

As the paper points out, some of this is almost certainly redundant, as some labs are probably grabbing data that another group at the same institute—possibly in the same building—has already obtained. Fortunately, the larger community of Internet users has already figured out a system for sharing the burden of handling large files: BitTorrent.

Although it would be possible to start dumping files onto existing networks, there are two drawbacks to that, the authors argue: those networks are at risk of getting shut down due to copyright violations, and finding biological databases among the other content (read: porn, movies, and TV shows) is a needle-in-a-haystack issue. So, they've modified a GPLed client and set up their own server at UC Davis, where they work. Anyone can download, but people have to register to upload data, allowing the administrators to police things for appropriate content. The server also mirrors the data immediately, in order to ensure there's at least one copy available at all times.

The BioTorrents site enables people to find data based on metadata like categories and license type, and a basic text search is also available. Versioning options and RSS feeds are available for datasets that are frequently updated. Overall, it seems like a well-designed project, and I'm sure the NCBI appreciates having someone else shoulder the bandwidth load.
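
To make the metadata search concrete, here is a minimal Python sketch of filtering by category and license, combined with a plain text match. Everything in it is an assumption made for illustration: the record fields, the sample entries, and the `find_datasets` function are hypothetical and do not reflect the site's actual schema or code.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # Hypothetical record layout; the article does not describe the real schema.
    name: str
    category: str
    license: str
    description: str


def find_datasets(datasets, category=None, license=None, text=None):
    """Return the datasets that match every filter that was supplied."""
    results = []
    for d in datasets:
        if category is not None and d.category != category:
            continue
        if license is not None and d.license != license:
            continue
        if text is not None:
            # Basic case-insensitive substring search over name and description.
            haystack = f"{d.name} {d.description}".lower()
            if text.lower() not in haystack:
                continue
        results.append(d)
    return results


# Made-up catalog entries, for illustration only.
CATALOG = [
    Dataset("Example bacterial genomes", "genomics", "public-domain",
            "Assembled bacterial genome sequences"),
    Dataset("Example expression set", "transcriptomics", "cc-by",
            "Microarray expression measurements"),
    Dataset("Example human variants", "genomics", "cc-by",
            "Variant calls from human genome sequencing"),
]

matches = find_datasets(CATALOG, category="genomics", text="genome")
```

Filters that are left as `None` are simply skipped, so the same function covers browsing by category, narrowing by license, or a text-only search.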

Making search visual

NCBI also happens to host PubMed, a database that contains the abstracts of articles from every significant biomedical journal (and a large portion of the less significant ones, too). Since relatively few of those journals were open access, at least until recent years, it doesn't allow full searching of an article's contents. A team just a bit down Route 80 from the Davis group (at UC Berkeley) has been doing some testing to find out whether biologists are missing out when they use PubMed.

Their project, online at BioText, uses a search engine that's indexed the full contents of a few hundred open access journals. Not only does it handle full text, but it also identifies when terms appear in figure legends, which describe the contents of images. This group's PLoS ONE paper focuses on user testing with a system that identifies relevant images based on the search terms, and displays those. It's a nice system, and the twenty or so biologists they tested it on seemed to like it.
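
One way to picture a search that knows whether a term appeared in a figure legend is an inverted index that records the field each term came from. The sketch below is a toy illustration under that assumption; the field names (`body`, `caption`), the sample articles, and the functions are hypothetical and are not taken from the BioText system.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(articles):
    """Map each term to the set of (article_id, field) pairs where it occurs."""
    index = defaultdict(set)
    for article_id, fields in articles.items():
        for field, text in fields.items():
            for term in tokenize(text):
                index[term].add((article_id, field))
    return index


def search(index, query, field=None):
    """Return sorted ids of articles containing every query term.

    If `field` is given, each term must occur in that field of the article.
    """
    hits = None
    for term in tokenize(query):
        ids = {aid for aid, f in index.get(term, set())
               if field is None or f == field}
        hits = ids if hits is None else hits & ids
    return sorted(hits or [])


# Made-up articles, for illustration only.
ARTICLES = {
    "a1": {"body": "We measured kinase activity in yeast cells.",
           "caption": "Figure 1. Western blot of kinase levels."},
    "a2": {"body": "A western blot confirmed the result.",
           "caption": "Figure 2. Growth curves for yeast."},
}

index = build_index(ARTICLES)
anywhere = search(index, "western blot")
in_captions = search(index, "western blot", field="caption")
```

Restricting the query to the `caption` field is what would let an interface show only the articles whose figures are likely to be relevant, rather than every article that mentions the terms somewhere in its text.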

Of course, user satisfaction may not be the right metric, if other studies are to be believed. The paper cites one that showed that blank squares improve the use of search results as much as images do, and that people tend to believe a result is more relevant simply if it comes with an image. So, although the users may be happier with the thumbnails, they are likely to be working less effectively.

Should a service of this sort actually prove more useful, it would be tempting to conclude that open access journals would end up having a greater impact on the scientific discourse, simply because it'll be easier to find things in them. Still, a couple of things may limit the impact. Google Scholar is one of them, although its results won't be as up to date, since the company's scanning operations deal with hardcover issues in university libraries.

There's also a commercial company, PubGet, that wraps PubMed searches in a user interface that will inline any PDFs you have access to. Since most scientists work at institutions with extensive journal access, that means they can easily see the full text of a paper to decide if a search result is relevant. That still doesn't overcome the inability of PubMed to index the full text of a paper, however.

The end result (for now, at least) is that researchers will probably need to use several search tools in order to do an exhaustive check for relevant content. Unlike the open question of whether image thumbnails help or hurt, there's little doubt that this isn't great for productivity.