"EDirect for PubMed: Part 2: Extracting Data from XML" Sample Code
As new Insider's Guide classes are no longer being offered, this site is not currently being updated. Please refer to NCBI's E-utilities documentation for more up-to-date information.
Below you will find sample code for the examples, in-class exercises and homework presented in the second session of the “EDirect for PubMed” Insider’s Guide class. These examples are written for use with EDirect in a Unix environment. If you need help installing and setting up EDirect, please see our “Installing EDirect” page.
For more examples, please see the sample code from the other parts of “EDirect for PubMed”:
- Part 1: Getting PubMed Data
- Part 3: Formatting Results and Unix Tools
- Part 4: xtract Conditional Arguments
- Part 5: Developing and Building Scripts
The code below is lightly annotated to explain how it works, but if you are looking for more information, we suggest you review our EDirect documentation.
There are many different ways to answer the questions discussed in class. The sample code below provides some options, but by no means the only options. Feel free to modify, adapt, edit, re-use or completely discard any of the suggestions below when trying to find a solution that works best for you.
xtract Basics
For an introduction to the xtract command, see the xtract section of our EDirect documentation.
Retrieve the article titles for a list of PubMed records
efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element ArticleTitle
The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 24102982,21171099,17150207) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).
The second line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row in our output table for every PubMed record (-pattern PubmedArticle). The -element argument indicates that the table should include a single column, containing the article title for the given record (-element ArticleTitle).
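The pipe and the trailing backslash are ordinary Unix shell features rather than anything specific to EDirect. As a minimal sketch that runs without EDirect or a network connection, the two functions below (the names one_line and continued, and the three-word input, are made up for illustration) show that a pipeline written on one line and the same pipeline split with a backslash produce the same result:

```shell
# Pipeline written on a single line.
one_line() {
  printf 'xtract\nefetch\nesearch\n' | sort
}

# The same pipeline, split after the pipe with a trailing backslash.
# Nothing (not even a space) may follow the backslash on that line.
continued() {
  printf 'xtract\nefetch\nesearch\n' | \
  sort
}

one_line
continued
```

Both functions print the same three words in sorted order, which is why the examples on this page can be entered either on one long line or across several lines.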
Retrieve the list of authors for a series of PubMed records
efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern Author -element LastName
The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 24102982,21171099,17150207) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).
The second line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row in our output table for every author (-pattern Author). The -element argument indicates that the table should include a single column, containing the last name for the given author (-element LastName).
Retrieve the PMID and year of publication for a PubMed record
In order to retrieve the PMID and the year of publication for a PubMed record, we might try to use code such as the following:
efetch -db pubmed -id 27101380 -format xml | \
xtract -pattern PubmedArticle -element PMID Year
The first line of this code uses the efetch command to retrieve a record from PubMed (-db pubmed -id 27101380) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).
The second line uses the xtract command to create a table, with one row for every PubMed record in our XML (xtract -pattern PubmedArticle; in this case, the table will only have a single row). The line then uses the -element argument to create two columns, one for PMID and one for Year (-element PMID Year). However, the output of this series of commands is not what we expect:
27101380 27619336 27619799 27746956 27747057 2016 2016 2016 2016 2015 2016 2016 2016 2016
Rather than getting a single PMID and a single year, we get 5 PMIDs and 9 Years. This is because, while the -element argument is designed to create a new column for each element or attribute specified, it populates each column with the contents of every occurrence of the specified element or attribute in the -pattern. This means that if there are multiple occurrences of the <PMID> or <Year> elements in a PubMed record, the contents of all occurrences will be displayed. As a result, we see not only the PMID for the record, but also the PMIDs used to link it to other records which contain related comments or corrections. Furthermore, in addition to the publication year, we also see the year for each of the other eight dates associated with the PubMed record.
We can avoid this by using Parent/Child construction to specify that we only want the contents of the <PMID> element that is a direct child of the <MedlineCitation> element, and that we only want the <Year> element that is a child of the <PubDate> element:
efetch -db pubmed -id 27101380 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID PubDate/Year
This version of the code gives us the output we expect:
27101380 2016
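When an unqualified element name returns more values than expected, it can help to check how many times that element occurs in the XML. The sketch below does this with grep and wc on a small, hand-written record. The record is hypothetical and heavily trimmed (it is not real PubMed data), and sample_record and count_tag are made-up helper names. It counts opening tags as plain text and does not parse the XML the way xtract does:

```shell
# Hypothetical, heavily trimmed record used only for illustration.
sample_record() {
  cat <<'EOF'
<PubmedArticle>
  <MedlineCitation Status="MEDLINE">
    <PMID Version="1">11111111</PMID>
    <DateCompleted><Year>2017</Year></DateCompleted>
    <Article>
      <Journal><JournalIssue><PubDate><Year>2016</Year></PubDate></JournalIssue></Journal>
    </Article>
    <CommentsCorrectionsList>
      <CommentsCorrections RefType="CommentIn"><PMID Version="1">22222222</PMID></CommentsCorrections>
    </CommentsCorrectionsList>
  </MedlineCitation>
</PubmedArticle>
EOF
}

# Count the opening tags for the element named in $1.
# grep -o prints each match on its own line; wc -l counts those lines;
# tr strips the padding that some versions of wc add.
count_tag() {
  grep -o "<$1[ >]" | wc -l | tr -d ' '
}

sample_record | count_tag PMID
sample_record | count_tag Year
```

In this sample both counts are 2, so an unqualified PMID or Year would match two elements in the record, while MedlineCitation/PMID and PubDate/Year each describe only one of them.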
Retrieve three data elements for a list of PubMed records
efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID ISOAbbreviation ArticleTitle
The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 24102982,21171099,17150207) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).
The second line uses the xtract command to create a table, with one row for every PubMed record in our XML (xtract -pattern PubmedArticle), and with three columns: one for PMID (specifically, the contents of the <PMID> element that is a child of the <MedlineCitation> element), one for the journal title abbreviation, and one for the article title (-element MedlineCitation/PMID ISOAbbreviation ArticleTitle).
sort-uniq-count-rank and head
Sort a list of authors by the frequency they appear in your results set
esearch -db pubmed -query "traumatic brain injury athletes" -datetype PDAT -mindate 2016 -maxdate 2017 | \
efetch -format xml | \
xtract -pattern Author -element LastName,Initials | \
sort-uniq-count-rank | \
head -n 10
This series of commands searches PubMed for the string “traumatic brain injury athletes”, restricts results to those published in 2016 and 2017, retrieves the full XML records for each of the search results, extracts the last name and initials of every author on every record, sorts the authors by frequency of occurrence in the results set, and presents the top ten most frequently-occurring authors, along with the number of times that author appeared.
esearch -db pubmed -query "traumatic brain injury athletes" -datetype PDAT -mindate 2016 -maxdate 2017 | \
The first line of this command uses esearch to search PubMed (-db pubmed) for our search query (-query "traumatic brain injury athletes"). The line also restricts the search results to articles that were published in 2016 or 2017 (-datetype PDAT -mindate 2016 -maxdate 2017).
The “|” character pipes the results of our esearch into our next command, and the “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.
efetch -format xml | \
The second line takes the esearch results from our first line and uses efetch to retrieve the full records for each of our results in the XML format (-format xml), and pipes the XML output to the next line.
xtract -pattern Author -element LastName,Initials | \
The third line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for every author (-pattern Author). Even if there are multiple authors on a single citation, each author will be on a new line, rather than putting all authors for the same citation on the same line. The command then extracts each author’s last name and initials (-element LastName,Initials). This will output a list of authors’ last names and initials, one author per line, and will pipe the list to the next line.
sort-uniq-count-rank | \
The fourth line uses a special EDirect function (sort-uniq-count-rank) to sort the list of authors received from the previous line, grouping together the duplicates. The function then counts how many occurrences there are of each unique author, removes the duplicate authors, and then sorts the list of unique authors by how frequently they occur, with the most frequent authors at the top. The function also returns the numerical count, making it easier to quantify how frequently each author occurs in the data set.
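The general idea behind this kind of frequency ranking can be sketched with standard Unix tools. The function below is only an approximation for illustration: rank_by_frequency and sample_authors are made-up names, the author list is sample input, and this is not the EDirect implementation, so details such as case handling and the ordering of ties may differ from sort-uniq-count-rank.

```shell
# Sample input: one author per line, with repeats.
sample_authors() {
  printf '%s\n' "Iverson GL" "Kerr ZY" "Iverson GL" \
                "Meehan WP" "Kerr ZY" "Iverson GL"
}

# Approximate a frequency ranking with standard tools.
rank_by_frequency() {
  LC_ALL=C sort |                # bring identical lines together
    LC_ALL=C uniq -c |           # collapse duplicates, prefixing each line with its count
    LC_ALL=C sort -k1,1nr -k2 |  # highest count first; ties ordered by name
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }'  # emit count, tab, name
}

sample_authors | rank_by_frequency
```

With the sample input above, Iverson GL is listed first with a count of 3, followed by Kerr ZY with 2 and Meehan WP with 1.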
head -n 10
The fifth line, which is optional, shows us only the first ten rows from the output of the sort-uniq-count-rank function (head -n 10). Because this function puts the most frequently occurring authors first, this will show us only the ten most frequently occurring authors in our search results set:
14 Iverson GL
11 Guskiewicz KM
10 Meehan WP
9 Kerr ZY
9 Kontos AP
9 Solomon GS
9 Zuckerman SL
8 Zafonte R
7 Broglio SP
7 Covassin T
(Note: Your output may vary slightly, as additional citations are added to PubMed and the “most frequent” authors change.)
To show more or fewer rows, adjust the “10” up or down. If you want to see all of the authors, regardless of how frequently they appear, remove this line entirely. (If you do choose to remove this line, make sure you also remove the “|” and “\” characters from the previous line. Otherwise, the system will wait for you to finish entering your command.)
In-class exercise solutions
Note: The first three exercises ask for an xtract command. The solutions below start with efetch commands that retrieve a sample set of PubMed records in XML, which are then piped into the xtract command. This allows us to test and verify the solutions using appropriate sample data.
Exercise 1
Write an xtract command that creates a table with one row per PubMed article. Each row should have two columns: volume number and issue number.
Solution:
efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element Volume Issue
This xtract command creates a table, with each PubMed record in our input populating its own row (xtract -pattern PubmedArticle), and with columns for volume number and issue number (-element Volume Issue).
Exercise 2
Write an xtract command that creates a table with one row per PubMed record. Each row should have three columns: PMID, journal ISSN, and citation status.
Solution:
efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID Journal/ISSN MedlineCitation@Status
This xtract command begins the same as the solution for Exercise 1 (xtract -pattern PubmedArticle). When creating the first column, this command uses Parent/Child construction to specify that we want the <PMID> element that is the child of the <MedlineCitation> element, and not another <PMID> element elsewhere in the PubMed record, such as one that is a child of a <CommentsCorrections> element (-element MedlineCitation/PMID).
Similarly, the second column is also created using Parent/Child construction (Journal/ISSN). This is probably not strictly necessary, as the <ISSN> element only appears in one location in the PubMed XML structure. However, this demonstrates that there may be multiple valid EDirect solutions to a given question.
Finally, the citation status, which is found in the “Status” attribute of the <MedlineCitation> element, is placed in the third column (MedlineCitation@Status).
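To make the difference between an element and an attribute concrete, the sketch below shows where a Status value sits inside an opening tag. The tag is hand-written for illustration rather than copied from a real record, sample_tag and citation_status are made-up names, and the sed expression is plain text matching, not the way xtract reads attributes:

```shell
# Hypothetical opening tag, modeled on PubMed XML.
sample_tag() {
  printf '%s\n' '<MedlineCitation Status="MEDLINE" Owner="NLM">'
}

# Print the value of the Status attribute found in a MedlineCitation opening tag.
# The value lives inside the tag itself, not between an opening and closing tag.
citation_status() {
  sed -n 's/.*<MedlineCitation[^>]* Status="\([^"]*\)".*/\1/p'
}

sample_tag | citation_status
```

For this sample tag the function prints MEDLINE, which is the kind of value that MedlineCitation@Status places in the third column.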
Exercise 3
Find out which authors have been writing about traumatic brain injuries in athletes, with publications in 2016 and 2017. The output should be a list of author names, one per line, with each author’s last name and initials.
Solution:
esearch -db pubmed -query "traumatic brain injury athletes" -datetype PDAT -mindate 2016 -maxdate 2017 | \
efetch -format xml | \
xtract -pattern Author -element LastName,Initials
This series of commands searches PubMed for the string “traumatic brain injury athletes”, restricts results to those published in 2016 and 2017, retrieves the full XML records for each of the search results, and extracts the last name and initials of every author on every record.
esearch -db pubmed -query "traumatic brain injury athletes" -datetype PDAT -mindate 2016 -maxdate 2017 | \
The first line of code uses esearch to search PubMed (-db pubmed) for our search query (-query "traumatic brain injury athletes"). The line also restricts the search results to articles that were published in 2016 or 2017 (-datetype PDAT -mindate 2016 -maxdate 2017).
The “|” character pipes the results of our esearch into our next command, and the “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.
efetch -format xml | \
The second line takes the esearch results from our first line and uses efetch to retrieve the full records for each of our results in the XML format (-format xml), and pipes the XML output to the next line.
xtract -pattern Author -element LastName,Initials
The third line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for every author (-pattern Author). Even if there are multiple authors on a single citation, each author will be on a new line, rather than putting all authors for the same citation on the same line.
The command then extracts each author’s last name and initials (-element LastName,Initials).
Homework solutions
Question 1
Using the efetch command below to retrieve PubMed XML, write an xtract command to extract specific elements and arrange them into a table. The table should have one PubMed record per row, with columns for PMID, Journal Title Abbreviation, Publication Year, Volume, Issue and Page Numbers.
efetch -db pubmed -id 12312644,12262899,11630826,22074095,22077608,21279770,22084910 -format xml
Solution:
efetch -db pubmed -id 12312644,12262899,11630826,22074095,22077608,21279770,22084910 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
This xtract command creates a table, with each PubMed record in our input populating its own row (xtract -pattern PubmedArticle). When creating the first column, this command uses Parent/Child construction to specify that we want the <PMID> element that is the child of the <MedlineCitation> element, and not another <PMID> element elsewhere in the PubMed record, such as one that is a child of a <CommentsCorrections> element (-element MedlineCitation/PMID).
The second column is created without Parent/Child construction, as the <ISOAbbreviation> element is not repeated in a single PubMed XML record (ISOAbbreviation).
The third column also uses Parent/Child construction, this time to retrieve the publication year as opposed to other <Year> elements (PubDate/Year). The remaining elements only appear in one location in the PubMed XML structure, so Parent/Child construction is unnecessary (Volume Issue MedlinePgn).
Question 2
Create a table of the authors attached to PubMed record 28341696. The table should include each author’s last name, initials, and affiliation information (if listed).
Solution:
efetch -db pubmed -id 28341696 -format xml | \
xtract -pattern Author -element LastName Initials Affiliation
The first line of this solution uses the efetch command to retrieve a record from PubMed (-db pubmed). We specify that we will retrieve the record for PMID 28341696 (-id 28341696) and that we want the results in XML (-format xml).
xtract -pattern Author -element LastName Initials Affiliation
The second line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for every author (-pattern Author). Even if there are multiple authors on a single citation, each author will be on a new line, rather than putting all authors for the same citation on the same line. The command then extracts each author’s last name, initials, and affiliation information (-element LastName Initials Affiliation).
Question 3
Write a series of commands to generate a table of PubMed records for review articles about the Paleolithic diet. The table should have one row per citation, and should include columns for the PMID, the citation status, and the article title.
Solution:
esearch -db pubmed -query "review[pt] paleolithic diet" | \
efetch -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID MedlineCitation@Status ArticleTitle
This series of commands searches PubMed for the string “review[pt] paleolithic diet”, retrieves the full XML records for each of the search results, and extracts the PMID, citation status, and article title of every record.
esearch -db pubmed -query "review[pt] paleolithic diet" | \
The first line of code uses esearch to search PubMed (-db pubmed) for our search query (-query "review[pt] paleolithic diet"). Note that the search query can include search field tags ([pt]) to help focus our search, just as we can in the web version of PubMed.
The “|” character pipes the results of our esearch into our next command, and the “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.
efetch -format xml | \
The second line takes the esearch results from our first line and uses efetch to retrieve the full records for each of our results in the XML format (-format xml), and pipes the XML output to the next line.
xtract -pattern PubmedArticle -element MedlineCitation/PMID MedlineCitation@Status ArticleTitle
The third line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for every PubMed record (-pattern PubmedArticle).
The command then extracts each record’s PMID (using Parent/Child construction; -element MedlineCitation/PMID), citation status (using “@” to retrieve the attribute value for “Status”; MedlineCitation@Status), and article title (ArticleTitle).