Published: Thu 11 September 2014 In Programming. tags: text mining bionlp pubmed

If you are a computational linguist, data analyst, or bioinformatician working with biological text corpora (on medicine, neuroscience, molecular biology, etc.), you will rather sooner than later need access to MEDLINE. Right now, the MEDLINE subset ("baseline") of PubMed contains nearly 23 million records, all with titles, author names, etc.. The majority of those records (officially called citations) also have an abstract (on average about 5-8 sentences long). This means you are looking at a significantly sized text collection with plenty of metadata (links, author names, MeSH terms, chemicals, etc.) you will need to handle if you want to make use of all this information your own data mining application.

In my now about eight years of BioNLP (natural language processing for biology) work, I have not been able to locate a simple, up-to-date set of command-line tools to manage a MEDLINE DB mirror with all its metadata. As I am an innately lazy guy, I have worked out a number of useful shell scripts I regularly use to work with MEDLINE data that I am documenting here. To make this all work locally, and during a less lazy week, I wrote a tool called medic to create and manage an up-to-date (i.e., running daily, automatic updates) mirror of MEDLINE with the option of storing the citations either in a Postgres or SQLite database.

Synchronizing the MEDLINE archives First you will most likely need to actually download the MEDLINE archives, given that your institute has a (free) subscription to the archives. If you are on a Mac, you are already provided with the necessary tools: mount_ftp and rsync should be installed on every Mac. If you are on Linux, the curlftpfs package simulates a file system for a FTP site accessed with with the cURL library, doing the same thing as mount_ftp on a Mac. In other words, we will mount the MEDLINE baseline and updates directories from the FTP site as a file system and then synchronize them with our local copy of each directory. To do the synchronization step, we will be using rsync . On a GNU/Linux machine, you can use your package manager to install curlftpfs and rsync , if they are not already present. Once you have the packages installed, create a directory that acts as mount point, e.g., ftpfs , as well as the baseline directory, and the updates directory. Then, the relevant commands are: To synchronize the baseline directory (replace curlftpfs with mount_ftp on a Mac): curlftpfs ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz/ ftp_mount/ rsync -r -t -v --progress ftp_mount/* baseline/ To synchronize the updates directory (replace curlftpfs with mount_ftp on a Mac): curlftpfs ftp://ftp.nlm.nih.gov/nlmdata/.medlease/gz/ ftp_mount/ rsync -r -t -v --progress ftp_mount/* updates/ To unmount the FTP directories once rsync is done, you can use fusermount -u ftp_mount on Linux and umount ftp_mount on a Mac. If you want to do the latter process regularly (MEDLINE sports daily updates), you might consider placing the update command series into a script and add it to your daily cron jobs. With this, you now have created the foundations to easily maintain a 24-hourly updated copy of all of MEDLINE on your site. And because of using rsync, you do not have to worry about broken connections or communication errors - if the process breaks halfway, rsync will restart exactly where it left off.

Creating a local MEDLINE database from the archives Once the MEDLINE files are installed, it would be possible to parse (or grep...) the XML and manually extract whatever you need from them. However, working this way will become cumbersome very fast, while placing the data into a well-structured database schema would help immensely. To do this on the fly, I have created medic, a Python command-line tool to bootstrap and manage a local MEDLINE repository. All you need to have installed to get this tool to work is Python (3.x); You can install medic with Python's own package manager: pip install medic (possibly with the option --user if you are not allowed to administer the machine you are working on). Second, you need to decide which database you want to use for MEDLINE. You can use either Postgres or SQLite as the back-end for medic (medic uses SQL Alchemy as its ORM, so in theory at least, it should be possible to use medic with other DBs, too.) As soon as you have the database installed and running (and CREATEd a DATABASE with UTF-8 text encoding, in the case of Postgres), you are ready to load the baseline files. As loading all of MEDLINE through the ORM can be very slow for Postgres, it is better to parse the data into text files and then load them in one go: medic parse baseline/medline1?n*.xml.gz for table in citations abstracts authors chemicals databases descriptors \ identifiers keywords publication_types qualifiers sections ; do psql medline -c "COPY $table FROM '`pwd`/ ${ table } .tab';" ; done If you are loading the files into SQLite, you can load the data directly with medic insert , although it will be considerably slower than the Postgres parse-and-dump method: medic --url sqlite:///MEDLINE.db insert baseline/medline1?n*.xml.gz Finally, to update the Postgres database to the latest state of MEDLINE, you can parse the updates directory: for file in updates/medline1?n*.xml.gz ; do medic --update parse $file ; # NB: the above command created the file "delete.txt" (a list of PMIDs to delete) medic delete delete.txt for table in citations abstracts authors chemicals databases descriptors \ identifiers keywords publication_types qualifiers sections ; do psql medline -c "COPY $table FROM '`pwd`/ ${ table } .tab';" ; done ; done ; With SQLite you can again take the direct, but slower medic update route: medic --url sqlite:///MEDLINE.db update updates/medline1?n*.xml.gz In theory, the simpler insert/update commands can be used for Postgres, too, but that is only recommended if you are loading citations in the thousands, not millions. If you whish to cron-job this, you should only medic update the latest file(s) - no need to parse-and-dump for a single file, not even for Postgres. In other words, make sure you are not working through all the files in the updates directory every day...

Quickly bootstrapping a subset of MEDLINE Often I find myself only needing a tiny subset of MEDLINE that I am interested in analyzing. In this case, you do not want to actually download, parse, and/or load all of PubMed into a heavy-weight Postgres DB, but rather have a small, single-file SQLite DB with the relevant citations. To do this, medic provides an interface to effortlessly download and bootstrap a database right into the current directory. Assuming you have the list of PubMed IDs (PMIDs) in a file called pmid_list.txt (one ID per line), and you want to bootstrap a SQLite DB file in the current directory called MEDLINE.db , you call medic like this: medic --url sqlite:///MEDLINE.db --pmid-lists insert pmid_list.txt With this, you have now quickly sampled that subset of MEDLINE citaitons relevant to your work, but still have them properly structured, stored in a single file, and easy to access as we will see next.

Extracting MEDLINE citations with medic Right now, medic has no interface to query the abstracts. You can add a Postgres full-text index, but according to my own experience that is not particularly efficient if you have millions of records to index, as in the case of MEDLINE. The right way would be to index the abstracts with a "real" search engine, for example, Lucene, but so far I have not gotten around to write an indexer for medic. The best way right now is to query eUtils directly, using the standard PubMed query syntax, which is pretty powerful, anyways; Note that eSearch queries to the eUtils API are capped, at most 100,000 IDs can be returned at once. To fetch more, you need to page results with retmax and retmin ; Also by default (without setting retmax ) only the first 20 results are returned by eUtils: QUERY="p53+AND+cancer" URL="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi" wget "$URL?db=PubMed&retmax=99&term=$QUERY" -O - 2> /dev/null \ | grep "^<Id>" \ | sed -E 's|</?Id>||g' \ | cut -f3 \ > pmids.txt Again, if you do this often, you might want to stick this into a little script, for example: #!/usr/bin/env bash # for a given query (one term per argument), retrieve (up to retmax) matching PMIDs QUERY = ` echo " $@ " | tr " " + ` URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi" echo " $QUERY " 1 > & 2 wget " $URL ?db=PubMed&retmax=99999&term= $QUERY " -O - 2 > /dev/null \ | grep "^<Id>" \ | sed -E 's|</?Id>||g' \ | cut -f3 The query argument to this script now can contain space characters which are replaced with "+" characters, e.g., " search_pubmed.sh p53 AND cancer " produces the same output as before (with far more PMIDs, however, so please do not try this particular query too often; and using search_pubmed.sh as the name of the above script). Given one or a list of PMIDs, however, medic allows you to quickly pull the citations in a number of formats: medline : write the citations, one per file, in the official MEDLINE format. tiab : only put the title and abstract, one pair per citations, into the files. html : write all citations into one large HTML file ("corpus"). tsv : write the PMID, title, and abstract into one large TSV file, one citation per line. For example: medic --format tiab --pmid-lists selected_pmids.txt , where selected_pmids.txt is a file with one PMID per line, will create one file per citation, named <PMID>.txt . If you do not care too much about the actual IDs and just need a few random citations to work with, here is an easy way to select 999 random PMIDs from MEDLINE; on a Mac or FreeBSD machine: jot 999 100000 25000000 > pmids.rnd_test.txt And when running a Linux or Cygwin OS: shuf -i 100000-25000000 | head -999 > pmids.rnd_test.txt However, this approach is a bit like cheating: if the PMID does not exist, you have a non-existing ID in your list. From a mathematical perspective, if the PMIDs are not evenly distributed over the range you are drawing integers from, you will not have the perfect random sample. Ideally, you should select random IDs from your collection, not the whole numeric range. Note that I chose 999 PMIDs not just by chance - SQLite has 999 set as a hard limit for the number of arguments for a "prepared statement". This means that if you want to fetch more than 999 PMIDs from a SQLite database, you will have to do that in several rounds.