Using EDirect to create a local copy of PubMed
As new Insider's Guide classes are no longer being offered, this site is not currently being updated. Please refer to NCBI's E-utilities documentation for more up-to-date information.
EDirect is designed to help you get the PubMed data you need, and only the PubMed data you need, in the exact format you specify. You can use esearch
to search for PubMed records, efetch
to download records in XML, and xtract
to output the specific data elements you need.
But what if you need a lot of data?
If you are trying to download tens or hundreds of thousands of PubMed records, you may find that the downloading process takes an impractically long time (especially during peak hours). Additionally, if your job is very large, you may run afoul of the E-utilities Usage Guidelines and Requirements.
For users who routinely use EDirect to retrieve very large sets of PubMed records, NCBI has introduced a new tool and technique that lets you create your own local copy of PubMed, which may speed up the process of bulk retrieval substantially. Newer versions of EDirect (starting with version 8.00) include a suite of scripts and commands that help you:
- download the entirety of PubMed through the NLM Data Distribution program,
- extract and de-duplicate individual PubMed records,
- create a local archive of PubMed, and
- retrieve large sets of PubMed records from your local archive for use by
xtract
.
Note: This technique is only recommended for advanced EDirect users, who have some prior experience working in Unix. Creating and using a local PubMed archive does require certain minimum hardware specifications and technical expertise, and initial setup can take some time. However, once the archive is created, you will be able to retrieve records from the archive much faster than you can using traditional efetch
. Exact speeds will vary based on your hardware and software configuration, but tests of retrieving records from a properly configured archive have shown speeds of 3,000 or more records per second (compared to about 50 records per second using traditional efetch
).
System Requirements
A current version of EDirect
The archive creation process relies on scripts and commands included in EDirect version 8.00 and later. If you do not yet have EDirect installed, please see our installation page. If you already have EDirect installed, you can check your current version using our testing script. If your version is earlier than 8.00, you can update your installation by re-installing EDirect using our installation script.
A Solid State Drive (SSD)
Creating and retrieving from a PubMed archive is only practical on a Solid State Drive (SSD). The techniques described below will technically work on a spinning-disk hard drive, but at speeds so slow as to render the process impractical. An external SSD will work fine, though you will see even greater speed increases with an internal SSD. Currently, an external SSD of 250 GB or more should be more than sufficient to hold a complete PubMed archive (though larger drives may be necessary in the future, as PubMed continues to grow).
Recommended: MacOS 10.13
When building a local PubMed archive, the EDirect scripts build a hierarchy of 1 million folders to organize the 28 million records. While you can create and use a PubMed archive in any Unix-like environment that can run EDirect (including the MacOS terminal, a Cygwin emulator on a Windows computer, a Linux machine, etc.), the process is designed to take advantage of the new APFS file system, available in MacOS version 10.13 and later, which is specifically built to work with gigantic folder hierarchies. Using other file systems (like NTFS) will certainly work, but the speed benefits may not be as great.
Before you begin
Format your archive drive
Starting with a freshly formatted SSD will help ensure the archive creation process goes smoothly. If you are using MacOS 10.13 or later, format your drive using the new APFS file system. If you are using another operating system, make sure you are using the NTFS file system, with a cluster size (also sometimes called an “allocation unit size”) of 4 KB. Formatting with a different cluster size will have an impact on performance, and on the space required to hold the completed archive: increasing the cluster size increases speed and required disk space, while decreasing the cluster size decreases speed and required disk space.
Set aside some time
Depending on the configuration of your system, building the archive may take quite a while. Downloading the baseline and update files for the first time could take several hours, while the creation of the archive could take anywhere from two to more than thirty hours, depending on your hardware. Fortunately, the full archive creation process only needs to be executed once a year, and can be interrupted and restarted at any time.
Building the archive
Building a local archive of PubMed takes several steps, but the archive-pubmed
command, included in EDirect versions 8.00 and later, automatically performs each of the necessary steps in the correct order. From your Unix terminal, execute the following command:
archive-pubmed -path /Volumes/myssd
This starts the process of building the local archive in the directory /Volumes/myssd
(replace the /Volumes/myssd
with the directory on your SSD where you would like to hold your archive). The archive-pubmed
command then performs the following steps:
- Creates subdirectories of
/Volumes/myssd
to store the archive, as well as the raw baseline and update files which will be used to build the archive. - Analyzes your system and displays messages to suggest steps you can take to improve the archive creation process.
- Downloads all of the baseline and update files from the NLM Data Distribution FTP server and saves them in the
/Volumes/myssd/Pubmed
subdirectory. - Opens each baseline file in turn, decompresses the XML file contained within, and extracts each individual PubMed record into its own, individual XML file. These individual record XML files are then re-compressed and saved into a new directory structure (within the
/Volumes/myssd/Archive
subdirectory) which is built to facilitate rapid access. - Opens each update file in turn, processing them like the baseline files, overwriting previous versions of a PubMed record with newer ones, and deleting records from your archive which have been deleted from PubMed.
As this command finishes processing each baseline and update file, it creates a sentinel file as a flag to indicate which files have already been processed. If the archive-pubmed
command is interrupted in the middle, simply execute the command again to restart it; the sentinel files will automatically indicate where to start back up again.
Note: while the archive-pubmed
command is processing update files, you may occasionally see error messages that look like this:
rm: cannot remove 'XX/XX/XX/XXXXXXXX.xml.gz': No such file or directory
These messages occur when an update file contains a delete message for a record that is not yet in your archive. This will happen when a PubMed record was created in error, then deleted on the same day. These messages can be ignored.
Maintaining the archive
Updating an archive with the latest files
New update files are released to the FTP server on a daily basis. To make sure your local archive mirrors the current state of the PubMed database as closely as possible, you will want to update your local archive before attempting any major projects. To do this, simply re-run the archive-pubmed
command (substituting /Volumes/myssd
with the directory that contains your archive):
archive-pubmed -path /Volumes/myssd
This should only take a few minutes, as it will only need to download the new update files and add those records to your archive.
Re-archive for the new baseline
Once a year, usually around late November, NLM releases a new set of baseline files. These files are updated to reflect changes in the PubMed DTD, and any changes to the MeSH terminology for the new year. Once the new baseline is released, you will need to rebuild your archive from the ground up in order to make sure your archive reflects the current state of the PubMed database.
Before rebuilding your archive, make sure you delete the entire contents of the directory that contains your archive and the downloaded baseline and update files (i.e. /Volumes/myssd
). The easiest way to accomplish this is to reformat your SSD as described above, which will delete the entire archive in minutes instead of hours. Even if you do delete your archive in a different way, it is probably a good idea to reformat your SSD as described above before trying to build your new archive. Once your drive is ready, follow the steps above to build the archive from the new baseline files.
Retrieving from the archive
Retrieve records based on a list of PMIDs
If you have a list of PMIDs saved in a file, you can retrieve the full XML for all of those records from your archive by using the fetch-pubmed
command. Remember to replace Volumes/myssd/Archive
with the directory that contains your archive.
cat lycopene.txt | \
fetch-pubmed -path /Volumes/myssd/Archive > lycopene.xml
This will save the full XML records for all of the PMIDs listed in the file lycopene.txt
to the file lycopene.xml
.
Retrieve records based on a PubMed search
Other EDirect commands are completely interoperable with the archive. You can use esearch
to find relevant records in the live PubMed database using the normal search algorithm, but retrieve the records from your local archive using fetch-pubmed
. Remember to replace Volumes/myssd/Archive
with the directory that contains your archive.
esearch -db pubmed -query "breast cancer" | \
efetch -format uid | \
fetch-pubmed -path /Volumes/myssd/Archive > breastcancer.xml
The first line of this script uses esearch
to search the PubMed database for the query “breast cancer.” The esearch
command uses the E-utilities API to search the live PubMed database.
The second line uses efetch -format uid
to retrieve the PMIDs for all of the records that matched the search criteria from our initial search. Rather than retrieving full XML records for all of these records, we are only retrieving the PMIDs, which takes much less time.
Our third line receives the PMIDs piped in from the second line and retrieves full XML records for each of the PMIDs from the local archive, and saves the resulting XML to a file.
Extract specific elements from a group of archived records
You can pipe the output of a fetch-pubmed
command into xtract
just as you would with efetch
. Remember to replace Volumes/myssd/Pubmed
with the directory that contains your archive.
esearch -db pubmed -query "breast cancer" | \
efetch -format uid | \
fetch-pubmed -path /Volumes/myssd/Archive | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID ISOAbbreviation Volume Issue PubDate/Year
Additional Information
Differences between your local archive and PubMed
As explained above, the local archive is based on the export files from the NLM Data Distribution program. While export files closely mirror the contents of the live PubMed database, there are a few categories of records which are not included in the exports. As a result, while your local PubMed archive should be a very close match to the live PubMed database, there are likely to be a few discrepancies:
- The local archive will not contain the small number of records in PubMed (approximately 13,000) for books and book chapters.
- For the small number of versioned citations in PubMed (approximately 800), the local archive will contain only the most recent version.
You can find a good representation of the contents of your local archive by searching PubMed (using either the web version of PubMed or esearch
) for the following string:
all[sb] NOT pmcbook NOT ispreviousversion
(In fact, if you are retrieving records based on a PubMed search using esearch
, you may find it useful to add “NOT pmcbook NOT ispreviousversion” to the end of your search string. This will minimize the number of absent records fetch-pubmed
attempts to retrieve.)
Additionally, you may see some further discrepancies, depending on when you update your local archive and when you search. New records are being added to PubMed all the time by publishers submitting citation data. However, the indexes that power PubMed search are updated once a day (usually around 6 AM ET), and new export files are only generated once a day (usually around 2 PM ET). As a result, even a completely up-to-date local archive may include a handful of records that are not yet included in the PubMed search indexes, or vice versa.
Set an environment variable to save time
You can make using archive-pubmed
and fetch-pubmed
commands a little easier by setting up your local archive directory as an environment variable. Add the following line to each user’s .bash_profile configuration file (substituting /Volumes/myssd
with the directory that contains your archive):
export EDIRECT_PUBMED_MASTER=/Volumes/myssd
After you have set this variable, you can omit the -path
argument from both archive-pubmed
and fetch-pubmed
: These commands will check the environment variable, and know exactly where your local archive is located.
For more information
Additional information and documentation on the local archiving process can be found on NCBI’s EDirect documentation page, “Entrez Direct: E-utilities on the UNIX Command Line” (look for the section labeled “Local Data Cache”).