xtract: Exploration arguments
As new Insider's Guide classes are no longer being offered, this site is not currently being updated. Please refer to NCBI's E-utilities documentation for more up-to-date information.
When using xtract, there may be times when you want to group multiple elements together in your output. This can be especially useful to link multiple child elements of the same parent together. The xtract command includes a series of arguments that help with this. These arguments, including -group, -block, and -subset, are referred to as Exploration arguments, because they help you explore subsections of your XML document.
When to use Exploration arguments
For example, if you are trying to write an xtract command that outputs article PMIDs and author names (including last names and initials), with each row representing a different PubMed article, and with a “|” separating the columns, you might use the following command:
xtract -pattern PubmedArticle -tab "|" -sep "|" -element MedlineCitation/PMID LastName Initials
This command would work if each article only had one author. However, if an article has more than one author, the output may not be what you expect:
PMID1|LastName1.1|LastName1.2|LastName1.3|Initials1.1|Initials1.2|Initials1.3
PMID2|LastName2.1|LastName2.2|Initials2.1|Initials2.2
[...]
Because -element creates a column populated with each instance of the element or attribute in the -pattern, xtract will create two columns: one with every <LastName> element in the -pattern, and one with every <Initials> element in the -pattern. If you wanted to group together each individual <LastName> element with the individual <Initials> element that shares a parent <Author> element, you could use an Exploration argument.
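The ungrouped behavior described above can be modeled in a few lines of Python. This is only an illustrative sketch, not how xtract is implemented: the sample record, the author names, and the function name flat_rows are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample record; the PMID and names are placeholders, not real data.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <AuthorList>
          <Author><LastName>Smith</LastName><Initials>JA</Initials></Author>
          <Author><LastName>Jones</LastName><Initials>RB</Initials></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def flat_rows(xml_text, tab="|", sep="|"):
    """Hypothetical model: one row per PubmedArticle, one column per element name."""
    rows = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        columns = [article.findtext("MedlineCitation/PMID")]
        for name in ("LastName", "Initials"):
            # Every instance found anywhere in the article lands in the same column,
            # so all last names come out before any initials.
            columns.append(sep.join(el.text for el in article.iter(name)))
        rows.append(tab.join(columns))
    return rows

print("\n".join(flat_rows(SAMPLE)))  # prints 1|Smith|Jones|JA|RB
```

Because each element name is collected across the whole article, the link between a given last name and its initials is lost.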
How to use Exploration arguments
Continuing the previous example, we could modify our command to connect each individual <LastName> element with its corresponding <Initials> element by using the -block argument:
xtract -pattern PubmedArticle -tab "|" -sep "|" -element MedlineCitation/PMID -block Author -element LastName Initials
As with most xtract commands, this command scans through the XML input from the beginning. When it encounters an occurrence of the element specified in the -pattern argument, xtract will start a new row of output. The command will then scan through the -pattern until it encounters the first occurrence of the XML element specified in the -block argument (in this case, <Author>).
The command will then scan through the first instance of the Author -block, looking for every instance of the first element or attribute specified in the -element argument. xtract will populate the first column with each instance of the element or attribute encountered within that first Author -block.
When xtract reaches the end of the first instance of the Author -block, it goes back to the beginning of that first Author -block and begins looking for every instance of the second element or attribute specified in the -element argument (if there is one). This process repeats, creating new columns for each -element until all of the elements specified in -element have been returned.
The command will then look for the next occurrence of the element specified in the -block argument. If another occurrence of the -block element is found, the command repeats the above process, retrieving occurrences of each element within the second -block, before moving on to look for a new -block.
The result of this process is an output that resembles the following:
PMID1|LastName1.1|Initials1.1|LastName1.2|Initials1.2|LastName1.3|Initials1.3
PMID2|LastName2.1|Initials2.1|LastName2.2|Initials2.2
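The scanning process walked through above amounts to a nested loop: one pass per article, one pass per Author inside it, and one column per requested element inside each Author. The Python sketch below models that logic under stated assumptions. The sample data and the name block_rows are invented for illustration, and skipping element names with no match is a choice made for this sketch.

```python
import xml.etree.ElementTree as ET

# Made-up sample records; PMIDs and names are placeholders, not real data.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article><AuthorList>
        <Author><LastName>Smith</LastName><Initials>JA</Initials></Author>
        <Author><LastName>Jones</LastName><Initials>RB</Initials></Author>
      </AuthorList></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article><AuthorList>
        <Author><LastName>Lee</LastName><Initials>K</Initials></Author>
      </AuthorList></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def block_rows(xml_text, tab="|", sep="|"):
    """Hypothetical model of grouping LastName and Initials per Author."""
    rows = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):  # new row per article
        columns = [article.findtext("MedlineCitation/PMID")]
        for author in article.iter("Author"):          # one pass per Author block
            for name in ("LastName", "Initials"):      # one column per element name
                values = [el.text for el in author.iter(name)]
                if values:  # assumption for this sketch: skip names with no match
                    columns.append(sep.join(values))
        rows.append(tab.join(columns))
    return rows

print("\n".join(block_rows(SAMPLE)))
# prints:
# 1|Smith|JA|Jones|RB
# 2|Lee|K
```

The only change from collecting elements across the whole article is the extra loop over Author, which is what keeps each last name next to its own initials.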
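To make this command even more effective, you can combine multiple elements or attributes into a single column by separating their names with a comma. The -sep value is then placed between the combined values, while the -tab value continues to separate the columns. By modifying the command slightly: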
xtract -pattern PubmedArticle -tab "|" -sep " " -element MedlineCitation/PMID -block Author -element LastName,Initials
the output will change to:
PMID1|LastName1.1 Initials1.1|LastName1.2 Initials1.2|LastName1.3 Initials1.3
PMID2|LastName2.1 Initials2.1|LastName2.2 Initials2.2
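As a rough model of the comma form, the sketch below joins the values found in one Author with the sep string and then joins the resulting columns with the tab string. The sample record and the function name combined_rows are hypothetical, and the code only imitates the output shape shown above.

```python
import xml.etree.ElementTree as ET

# Made-up sample record; the PMID and names are placeholders, not real data.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article><AuthorList>
        <Author><LastName>Smith</LastName><Initials>JA</Initials></Author>
        <Author><LastName>Jones</LastName><Initials>RB</Initials></Author>
      </AuthorList></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def combined_rows(xml_text, tab="|", sep=" "):
    """Hypothetical model: LastName and Initials share one column per Author."""
    rows = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        columns = [article.findtext("MedlineCitation/PMID")]
        for author in article.iter("Author"):
            parts = [author.findtext(name) for name in ("LastName", "Initials")]
            # sep joins values inside a column; tab joins the columns themselves.
            columns.append(sep.join(p for p in parts if p))
        rows.append(tab.join(columns))
    return rows

print("\n".join(combined_rows(SAMPLE)))  # prints 1|Smith JA|Jones RB
```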
It is important to note that, when using an -element argument inside a -block (as demonstrated above), only elements and attributes that appear within that -block element can be retrieved. For example, if you used -block Author, you could not use -element PMID within that -block, as the <Author> element does not contain any descendant elements named <PMID>. If you need information to be output within a -block that cannot be found in that -block in the input data, you will need to store that information into a variable.
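The scoping rule can be illustrated with a Python analogue. In the sketch below, a lookup confined to each Author finds no PMID, while a value saved at the article level remains available inside the Author loop. The sample record and both function names are hypothetical, and the code makes no claim about xtract's own variable syntax.

```python
import xml.etree.ElementTree as ET

# Made-up sample record; the PMID and names are placeholders, not real data.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article><AuthorList>
        <Author><LastName>Smith</LastName><Initials>JA</Initials></Author>
        <Author><LastName>Jones</LastName><Initials>RB</Initials></Author>
      </AuthorList></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def lookup_in_block(xml_text):
    """Search each Author for a PMID descendant; Author has none, so each is None."""
    root = ET.fromstring(xml_text)
    return [author.findtext(".//PMID") for author in root.iter("Author")]

def rows_with_stored_pmid(xml_text, tab="|", sep=" "):
    """Save the PMID at the article level, then reuse it inside each Author."""
    rows = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")  # stored before entering any block
        columns = [sep.join([pmid, author.findtext("LastName")])
                   for author in article.iter("Author")]
        rows.append(tab.join(columns))
    return rows

print(lookup_in_block(SAMPLE))        # prints [None, None]
print(rows_with_stored_pmid(SAMPLE))  # prints ['1 Smith|1 Jones']
```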
The Exploration hierarchy
All of the previous examples use only the argument -block. However, for advanced uses, there is a multi-level hierarchy of Exploration arguments, allowing you to explore sections, subsections, and sub-subsections of an XML document in the same xtract command. Technically, -pattern is an Exploration argument, as it subdivides the input XML into smaller sections, and connects the elements within each section (by placing them in the same row). From largest to smallest, the Exploration arguments are:
-pattern
-division
-group
-branch
-block
-section
-subset
-unit
For most use cases, only -pattern and -block will be necessary. To explore deeply-nested XML, you may wish to use -group and -subset as well. However, the remaining Exploration arguments are available for unusual cases. For more about using Exploration arguments, please visit NCBI’s EDirect documentation page, “Entrez Direct: E-utilities on the UNIX Command Line”.
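One way to picture the hierarchy is as nested loops, where each smaller Exploration level iterates inside the one above it. The Python sketch below uses three levels over a made-up record: an outer loop standing in for -pattern, a middle loop for -block, and an inner loop for -subset. The sample data, the choice of AffiliationInfo as the innermost level, and the function name explore are assumptions for illustration. The function returns a nested structure rather than imitating xtract's output format.

```python
import xml.etree.ElementTree as ET

# Made-up sample record; the PMID, names, and affiliations are placeholders.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article><AuthorList>
        <Author>
          <LastName>Smith</LastName>
          <AffiliationInfo><Affiliation>University A</Affiliation></AffiliationInfo>
          <AffiliationInfo><Affiliation>Institute B</Affiliation></AffiliationInfo>
        </Author>
        <Author>
          <LastName>Jones</LastName>
          <AffiliationInfo><Affiliation>University C</Affiliation></AffiliationInfo>
        </Author>
      </AuthorList></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def explore(xml_text):
    """Hypothetical three-level walk: article, then author, then affiliation."""
    result = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):  # outer level
        authors = []
        for author in article.iter("Author"):                      # middle level
            affiliations = [info.findtext("Affiliation")           # inner level
                            for info in author.iter("AffiliationInfo")]
            authors.append((author.findtext("LastName"), affiliations))
        result.append((article.findtext("MedlineCitation/PMID"), authors))
    return result

print(explore(SAMPLE))
```

Each level only sees what sits inside the current element of the level above it, which is why affiliations stay attached to the right author.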