xtract: Creating rows and columns
As new Insider's Guide classes are no longer being offered, this site is not currently being updated. Please refer to NCBI's E-utilities documentation for more up-to-date information.
The xtract command is designed to output data in tabular format, based on your custom specifications. At the most basic level, when using xtract, you need to specify when to start a new row, how many columns to include, and what data should be included in each column.
Creating rows
The -pattern argument specifies when to start a new row. You provide -pattern with the name of an XML element (e.g. -pattern PubmedArticle). The xtract command scans through the XML input from the beginning. When it encounters an occurrence of the element specified in the -pattern argument, xtract will start a new row of output. All of the data included in this output row will come from descendants (child elements, children of children, etc.) of the element specified in the -pattern argument. When xtract reaches the end of the element, it ends the row of output and continues scanning for the next occurrence of the -pattern element.
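The row logic described above can be sketched in Python. This is only an emulation built on the standard library's xml.etree.ElementTree, not xtract itself; the trimmed sample XML and the helper name rows_for_pattern are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed input; real PubMed XML has many more elements.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1</PMID></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>2</PMID></MedlineCitation></PubmedArticle>
  <PubmedBookArticle><BookDocument><PMID>3</PMID></BookDocument></PubmedBookArticle>
</PubmedArticleSet>
"""

def rows_for_pattern(xml_text, pattern):
    """Return one subtree per occurrence of `pattern`, in document order."""
    root = ET.fromstring(xml_text.strip())
    # Each matching element marks where one output row starts and ends;
    # everything in that row must come from inside the matched subtree.
    return list(root.iter(pattern))

rows = rows_for_pattern(SAMPLE_XML, "PubmedArticle")
print(len(rows))  # 2: the PubmedBookArticle element does not match the pattern
```

In this sketch the third record is skipped because its element name differs from the pattern, which mirrors the idea that only occurrences of the named element start a row.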
Creating a column
To create a column that contains data from an XML element or attribute, specify it in the -element argument (see the xtract overview page for information about specifying elements and attributes).
Once xtract has encountered an occurrence of -pattern (see above), it will scan within the -pattern, looking for every instance of the element or attribute specified in the -element argument. xtract will populate the column with each instance of the element or attribute encountered.
For example, the command:
xtract -pattern PubmedArticle -element ArticleTitle
will create an output table with a new row for each PubMed record (-pattern PubmedArticle) in the XML input. The table will have a single column, which will contain each record’s article title (-element ArticleTitle):
ArticleTitle1
ArticleTitle2
ArticleTitle3
[...]
A column may contain multiple values if the element or attribute specified in the -element argument is repeated within the -pattern. For example, the command:
xtract -pattern PubmedArticle -element Author/LastName
will create an output table with a new row for each PubMed record (-pattern PubmedArticle) in the XML input. The table will again have a single column, but the column will contain the last name of each author on the record (-element Author/LastName; for more information on Parent/Child construction, see the overview page):
Author/LastName1.1 Author/LastName1.2 Author/LastName1.3
Author/LastName2.1 Author/LastName2.2
Author/LastName3.1
Author/LastName4.1 Author/LastName4.2 Author/LastName4.3 Author/LastName4.4
[...]
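To make the repeated-value case concrete, here is a rough Python emulation of a single column built from a Parent/Child specification. It is a sketch, not xtract's actual implementation: the sample records, the author names, and the helper column_values are invented, and joining the values with a tab is an assumption made to match the output shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one record with three authors, one with a single author.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <AuthorList>
      <Author><LastName>Smith</LastName></Author>
      <Author><LastName>Jones</LastName></Author>
      <Author><LastName>Lee</LastName></Author>
    </AuthorList>
  </PubmedArticle>
  <PubmedArticle>
    <AuthorList>
      <Author><LastName>Garcia</LastName></Author>
    </AuthorList>
  </PubmedArticle>
</PubmedArticleSet>
"""

def column_values(record, element_path):
    """Collect the text of every matching element inside one record."""
    parent, _, child = element_path.rpartition("/")
    if parent:
        # Parent/Child form: keep only `child` elements directly under a `parent`.
        matches = [c for p in record.iter(parent) for c in p.findall(child)]
    else:
        matches = list(record.iter(child))
    return [m.text or "" for m in matches]

root = ET.fromstring(SAMPLE_XML.strip())
lines = ["\t".join(column_values(record, "Author/LastName"))
         for record in root.iter("PubmedArticle")]
for line in lines:
    print(line)
```

Each record still produces exactly one line; the number of values on that line simply follows the number of matching elements in the record.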
Creating multiple columns
You can create multiple columns with a single -element argument. To create multiple columns, type -element followed by multiple elements or attributes, separated by spaces:
xtract -pattern PubmedArticle -element Volume Year
Once xtract has encountered an occurrence of -pattern (see above), it will scan within the -pattern, looking for every instance of the first element or attribute specified in the -element argument. xtract will populate the first column with each instance of the element or attribute encountered.
When xtract reaches the end of the -pattern, it goes back to the beginning of the -pattern and begins looking for every instance of the second element or attribute specified in the -element argument. xtract will start a new column, populated with each instance of the element or attribute encountered.
This process repeats, creating a new column for each element or attribute listed after -element, until all of them have been returned. At that point the row is ended and xtract begins looking for the next occurrence of the -pattern.
By default, the columns are separated by a tab (denoted in Unix as “\t”), but this separator can be adjusted by using the -tab argument (see Formatting arguments for more information).
For example, the command:
xtract -pattern PubmedArticle -element MedlineCitation/PMID Journal/ISOAbbreviation ArticleTitle
will create an output table with a new row for each PubMed record (-pattern PubmedArticle) in the XML input. The table will have three columns: one for the record’s PMID (-element MedlineCitation/PMID), one for the title abbreviation of the record’s journal (Journal/ISOAbbreviation), and one for the record’s article title (ArticleTitle):
PMID1 ISOAbbreviation1 ArticleTitle1
PMID2 ISOAbbreviation2 ArticleTitle2
PMID3 ISOAbbreviation3 ArticleTitle3
[...]
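The multi-column process can be emulated in a few lines of Python. The sketch below is an illustration under stated assumptions rather than a description of xtract's internals: build_table and find_values are hypothetical helpers, the sample XML is invented and much simpler than real PubMed records, and the behaviour when an element is missing from a record is not modelled.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with exactly one value per requested element.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <Article>
        <Journal><ISOAbbreviation>J Example</ISOAbbreviation></Journal>
        <ArticleTitle>First title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>222</PMID>
      <Article>
        <Journal><ISOAbbreviation>Ex Lett</ISOAbbreviation></Journal>
        <ArticleTitle>Second title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def find_values(record, path):
    """Text of every element in `record` matching `path` (Name or Parent/Child)."""
    parent, _, child = path.rpartition("/")
    if parent:
        found = [c for p in record.iter(parent) for c in p.findall(child)]
    else:
        found = list(record.iter(child))
    return [e.text or "" for e in found]

def build_table(xml_text, pattern, elements, tab="\t"):
    """Emulate the described passes: one row per pattern, one column per element."""
    root = ET.fromstring(xml_text.strip())
    table = []
    for record in root.iter(pattern):
        # Each requested element triggers a fresh scan of the same record.
        columns = [tab.join(find_values(record, path)) for path in elements]
        table.append(tab.join(columns))
    return table

table = build_table(
    SAMPLE_XML,
    "PubmedArticle",
    ["MedlineCitation/PMID", "Journal/ISOAbbreviation", "ArticleTitle"],
)
print("\n".join(table))
```

Passing a different tab value (for example tab="|") changes only the column separator, which is the role the text assigns to the -tab argument.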
Putting multiple elements or attributes in a single column
You can group multiple elements or attributes together in the same output column. This can be useful when grouping together subsections of an XML document using the Exploration arguments. To put multiple elements or attributes in the same column, specify multiple elements or attributes, separated by a comma instead of a space. While the columns will still be separated by a tab (unless the default is changed by the -tab argument), multiple elements or attributes within the same column will be separated by the separator defined in the -sep argument. For more information on -tab and -sep, see Formatting arguments.
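As a rough illustration of the difference between the two separators, the Python sketch below treats each space-separated item as a column and each comma-separated name inside an item as a value within that column. Everything here is an assumption made for the example: format_row is a hypothetical helper, the sample XML is invented, and the sample deliberately has one value per element, so the order in which xtract would interleave repeated elements is not addressed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: each record has one PMID, one Volume and one Issue.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <PMID>111</PMID>
    <JournalIssue><Volume>12</Volume><Issue>3</Issue></JournalIssue>
  </PubmedArticle>
  <PubmedArticle>
    <PMID>222</PMID>
    <JournalIssue><Volume>7</Volume><Issue>10</Issue></JournalIssue>
  </PubmedArticle>
</PubmedArticleSet>
"""

def format_row(record, element_args, tab, sep):
    """Join comma-grouped names with `sep`, and the resulting columns with `tab`."""
    columns = []
    for arg in element_args:
        names = arg.split(",")  # "Volume,Issue" -> two names sharing one column
        values = [e.text or "" for name in names for e in record.iter(name)]
        columns.append(sep.join(values))
    return tab.join(columns)

root = ET.fromstring(SAMPLE_XML.strip())
grouped = [format_row(record, ["PMID", "Volume,Issue"], tab="\t", sep=":")
           for record in root.iter("PubmedArticle")]
for line in grouped:
    print(line)
```

With these inputs each line has two tab-separated columns, and the second column holds the volume and issue joined by the chosen sep value.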
Creating advanced xtract tables
Note that the explanation above describes the xtract process for simple tables. Some advanced xtract arguments like -block, -if and -unless alter this process. For more about these advanced arguments, see Exploration arguments and Conditional arguments.