xtract: Storing and retrieving information with variables
As new Insider's Guide classes are no longer being offered, this site is not currently being updated. Please refer to NCBI's E-utilities documentation for more up-to-date information.
When using the xtract command, you can only use the -element argument to display data that is available in the -pattern or -block you are currently exploring. For example, consider the following xtract command:
xtract -pattern MeshHeading -element LastName Initials
This command will not output anything, because the <MeshHeading> element does not contain descendant <LastName> or <Initials> elements. The xtract command cannot find a <LastName> or <Initials> within the -pattern MeshHeading, so it provides no output. Similarly, consider the xtract command:
xtract -pattern PubmedArticle -block Author -element PMID LastName Initials
This command will output a new row for each PubMed record, and will output a list of the authors’ last names and initials for each record, but will not output the PMID, because <PMID> is not a descendant element of <Author>. While exploring a -block Author, the -element argument can only display elements and attributes that are contained within that -block.
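The scoping rule above can be modeled with a short Python sketch. This is only a rough analogue built on the standard library, not how xtract is implemented; the sample record is hypothetical and heavily trimmed, and the helper name values_in_scope is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed record; real PubMed XML has many more elements.
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <AuthorList>
        <Author><LastName>Smith</LastName><Initials>BH</Initials></Author>
        <Author><LastName>Jones</LastName><Initials>A</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""


def values_in_scope(scope, names):
    """Return the text of every element under `scope` matching each name, in argument order."""
    values = []
    for name in names:
        values.extend(el.text for el in scope.iter(name))
    return values


record = ET.fromstring(SAMPLE)
row = []
# Rough analogue of: -block Author -element PMID LastName Initials
for author in record.iter("Author"):
    # <PMID> is not inside <Author>, so the lookup finds nothing for it.
    row.extend(values_in_scope(author, ["PMID", "LastName", "Initials"]))

print("\t".join(row))
```

The printed row contains the last names and initials but no PMID, because the lookup never leaves the current <Author> element.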
However, by storing a value to a variable outside of a -block, then recalling that value from the variable inside the -block, you gain much greater flexibility in your output.
Declaring variables
An xtract variable name can be any combination of digits and capital letters. To store a value to a variable, rather than using an -element argument, use the variable’s name preceded by a dash:
xtract -pattern PubmedArticle -VAR1 MedlineCitation/PMID
The above command will store the contents of the <PMID> element (which is a direct child of the <MedlineCitation> element) into the new variable “VAR1”.
You can store multiple elements into the same variable, for convenient retrieval later:
xtract -pattern PubmedArticle -sep "/" -DATE PubDate/Year,PubDate/Month
The above command will store both the <Year> and <Month> child elements of <PubDate> into the variable “DATE”, separated by a “/” (thanks to the -sep "/" argument).
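As a rough illustration of what ends up in “DATE”, the following Python sketch joins the two values with the separator. It is an analogue under stated assumptions, not xtract's own logic: the one-line sample record and its date values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment containing only the elements needed for this sketch.
SAMPLE = (
    "<PubmedArticle><PubDate>"
    "<Year>2016</Year><Month>Mar</Month>"
    "</PubDate></PubmedArticle>"
)

record = ET.fromstring(SAMPLE)

# Rough analogue of: -sep "/" -DATE PubDate/Year,PubDate/Month
parts = [
    record.findtext(".//PubDate/Year"),
    record.findtext(".//PubDate/Month"),
]
# Skip any part that is missing, then join the rest with the separator.
date = "/".join(p for p in parts if p)

print(date)
```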
You can store strings of characters into variables instead of elements or attributes:
xtract -pattern PubmedArticle -LABEL1 "Author: "
The above command will store the string “Author: ” into the variable “LABEL1”. This technique can be used to have additional control over output formatting (see below for an example).
Retrieving data from a variable
To display the contents of an xtract variable with an -element argument, use the variable’s name preceded by an ampersand, enclosed in quotes:
xtract -pattern PubmedArticle -VAR1 MedlineCitation/PMID \
-block Author -element LastName Initials "&VAR1"
The above command will create a new row for each PubMed record, storing the PMID into the variable “VAR1”. The command then uses -block Author to loop through each <Author> element on the record. For each <Author>, the command will output the author’s last name and initials, followed by the contents of the variable “VAR1” (which, in this case, is the PMID for the record in question).
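The store-then-recall pattern can be sketched in Python as follows. This is a minimal analogue of the behavior described above, not xtract's implementation; the sample record is hypothetical, and the Python name var1 simply stands in for the xtract variable “VAR1”.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed record used only for illustration.
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <AuthorList>
        <Author><LastName>Smith</LastName><Initials>BH</Initials></Author>
        <Author><LastName>Jones</LastName><Initials>A</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""

record = ET.fromstring(SAMPLE)

# Rough analogue of: -VAR1 MedlineCitation/PMID
# The value is captured at the record level, before entering any author.
var1 = record.findtext("MedlineCitation/PMID")

row = []
# Rough analogue of: -block Author -element LastName Initials "&VAR1"
for author in record.iter("Author"):
    fields = [author.findtext("LastName"), author.findtext("Initials"), var1]
    # Drop fields that are absent for this author.
    row.extend(f for f in fields if f)

print("\t".join(row))
```

Because the PMID is captured before the loop starts, it remains available for every author even though <PMID> sits outside each <Author> element.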
Example with variables
The best way to understand the value of variables is to look at an EDirect script that makes effective use of them.
The following code is designed to output affiliation data for an author, to help analyze the different ways an author’s affiliation is represented in PubMed. This could be useful for author disambiguation (especially in the case of authors with common names), or could provide data to analyze an author’s output as a function of the institution with which they are affiliated.
esearch -db pubmed -query "smith bh[Author]" \
-datetype PDAT -mindate 2014 -maxdate 2017 | \
efetch -format xml | \
xtract -pattern PubmedArticle -VAR1 MedlineCitation/PMID \
-block Author -if LastName -equals Smith \
-and Initials -equals BH -and Affiliation \
-element "&VAR1" Affiliation
The above code searches for all of the author BH Smith’s articles in a given date range and outputs the PMID and the affiliation data listed for BH Smith on each record. Affiliation data for all authors other than BH Smith is suppressed, as are all records where BH Smith is an author, but has no listed affiliation data.
esearch -db pubmed -query "smith bh[Author]" \
-datetype PDAT -mindate 2014 -maxdate 2017 | \
The first two lines of the above code find records for articles authored by BH Smith and published between 2014 and 2017.
efetch -format xml | \
The third line retrieves all matching records in PubMed XML.
xtract -pattern PubmedArticle -VAR1 MedlineCitation/PMID \
The remaining lines of code, starting with the fourth, form an xtract command which extracts specific data from the PubMed XML retrieved on the previous line. The fourth line uses -pattern PubmedArticle to create a new row for each PubMed record, but does not immediately output any data. Instead, it saves the PMID for each record to a variable “VAR1” (-VAR1 MedlineCitation/PMID).
-block Author -if LastName -equals Smith \
-and Initials -equals BH -and Affiliation \
The next two lines use -block Author to check through each author on the record, but to only display information for authors with a last name of “Smith” (-if LastName -equals Smith), with initials of “BH” (-and Initials -equals BH) and with an <Affiliation> element (-and Affiliation). If any of these conditions is not true (e.g. if the author’s name doesn’t match, or the author does not have an <Affiliation> element), the author will be skipped. Based on our search, each record should have at least one author whose name is BH Smith, but not all of those authors will have affiliation data (as <Affiliation> is an optional element in PubMed XML). These conditions will limit our output, selecting only the author BH Smith from each record, and only if that author has affiliation data listed.
-element "&VAR1" Affiliation
The last line indicates what data should be output for each row. In this case, we will output the PMID of the record, and the affiliation for the author BH Smith on that record. In the event that BH Smith has no affiliation data on a given record, the conditions in our previous lines will prevent anything from being output for that record.
Our end result will be a list of citations where BH Smith was an author, and where BH Smith’s affiliation data was listed. For each citation, the PMID and BH Smith’s affiliation data will be listed.
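The filtering logic of the xtract portion can be approximated in Python to make the conditions easier to follow. This is a sketch under stated assumptions rather than a replacement for the EDirect pipeline: the two sample records, their PMIDs and their affiliation strings are hypothetical, the function name affiliation_rows is invented for illustration, and names are compared with exact string matching for simplicity.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one record where BH Smith has an affiliation,
# and one record where BH Smith is listed without one.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111</PMID>
      <Article><AuthorList>
        <Author>
          <LastName>Smith</LastName><Initials>BH</Initials>
          <AffiliationInfo><Affiliation>Example University</Affiliation></AffiliationInfo>
        </Author>
        <Author>
          <LastName>Jones</LastName><Initials>A</Initials>
          <AffiliationInfo><Affiliation>Another Institute</Affiliation></AffiliationInfo>
        </Author>
      </AuthorList></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>22222</PMID>
      <Article><AuthorList>
        <Author><LastName>Smith</LastName><Initials>BH</Initials></Author>
      </AuthorList></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def affiliation_rows(xml_text, last_name, initials):
    """Return one [PMID, affiliation, ...] row per record with a matching, affiliated author."""
    rows = []
    for record in ET.fromstring(xml_text).iter("PubmedArticle"):
        # Analogue of -VAR1 MedlineCitation/PMID: store the PMID per record.
        pmid = record.findtext("MedlineCitation/PMID")
        row = []
        for author in record.iter("Author"):
            affiliations = [a.text for a in author.iter("Affiliation")]
            # Analogue of the -if / -and conditions on the Author block.
            if (author.findtext("LastName") == last_name
                    and author.findtext("Initials") == initials
                    and affiliations):
                row.append(pmid)
                row.extend(affiliations)
        # Records with no qualifying author contribute no row at all.
        if row:
            rows.append(row)
    return rows


rows = affiliation_rows(SAMPLE, "Smith", "BH")
for row in rows:
    print("\t".join(row))
```

Only the first sample record yields a row: the other author's affiliation is ignored, and the second record is dropped because BH Smith has no affiliation there.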