xtract: Filtering output with Conditional arguments
As new Insider's Guide classes are no longer being offered, this site is not currently being updated. Please refer to NCBI's E-utilities documentation for more up-to-date information.
When using xtract to process XML data at the end of a data pipeline that includes an esearch or efilter command, you can use PubMed’s suite of search tools to limit your output by adding further search criteria to your esearch or efilter. However, there may be cases where you wish to restrict your output based on data that is not traditionally searchable in PubMed. To accommodate this, xtract offers a group of arguments that include or exclude data from the output table, based on whether or not specific conditions are met.
(Note that the xtract Conditional arguments have been redesigned and improved substantially in recent EDirect releases, starting with EDirect Version 5.20 in October 2016.)
Limiting output based on the presence or absence of an element or attribute
By default, the xtract command creates a new row for each occurrence of the XML element identified in the -pattern argument. You can use the -if argument to instruct xtract to only create a row for a given pattern if the pattern contains the element or attribute identified in the -if argument. Otherwise, xtract will skip the pattern and move on to the next pattern.
For example, the command
xtract -pattern Author -if Identifier -element LastName Initials Identifier
will output the last name, initials and <Identifier> (usually the author’s ORCID) of each <Author> in the XML input, if the <Author> contains an <Identifier> element. If the <Author> does not contain an <Identifier>, the <Author> is skipped.
The -unless argument functions exactly like the -if argument, but reversed: xtract will create a row for every pattern, unless the pattern contains the element or attribute identified in the -unless argument. For example, the command
xtract -pattern PubmedArticle -unless MeshHeading -element MedlineCitation/PMID
will output the PMID of each <PubmedArticle>, unless the <PubmedArticle> contains a <MeshHeading> element.
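As a rough illustration of the keep-or-skip logic described above, the following Python sketch mimics -if and -unless with the standard library's xml.etree.ElementTree. It is not how xtract is implemented; the sample XML, the placeholder ORCID value, and the helper name select_rows are all made up for this example.

```python
import xml.etree.ElementTree as ET

# Made-up sample input: only the first author has an <Identifier>.
SAMPLE = """
<AuthorList>
  <Author>
    <LastName>Smith</LastName>
    <Initials>BH</Initials>
    <Identifier Source="ORCID">0000-0000-0000-0000</Identifier>
  </Author>
  <Author>
    <LastName>Jones</LastName>
    <Initials>A</Initials>
  </Author>
</AuthorList>
"""

def select_rows(xml_text, pattern, condition, fields, negate=False):
    """Hypothetical helper: one row per <pattern> that passes the condition."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter(pattern):
        present = node.find(f".//{condition}") is not None
        # Analogue of -if: keep the pattern when the element is present.
        # Analogue of -unless (negate=True): keep it when the element is absent.
        if present == negate:
            continue
        rows.append([el.text or "" for name in fields for el in node.iter(name)])
    return rows

if_rows = select_rows(SAMPLE, "Author", "Identifier",
                      ["LastName", "Initials", "Identifier"])
unless_rows = select_rows(SAMPLE, "Author", "Identifier",
                          ["LastName", "Initials"], negate=True)

for row in if_rows + unless_rows:
    print("\t".join(row))
```

With this sample, the -if analogue keeps only the Smith row and the -unless analogue keeps only the Jones row, which shows the two arguments acting as mirror images.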
Limiting output based on the value of an element or attribute
The previous example showed one way of excluding records that have been indexed with MeSH from your output. However, as in many cases with EDirect, there are several ways to accomplish this goal. All PubMed records that have been indexed with MeSH also have a citation status of “MEDLINE”.
Using -if or -unless, we can also include or exclude data based on the value/contents of a particular element or attribute by using the -equals argument.
-if Element -equals String
-unless Element -equals String
With this syntax, we could re-write our previous command to exclude all records that have a citation status of “MEDLINE”, rather than to exclude all records that contain a <MeshHeading> element:
xtract -pattern PubmedArticle -unless MedlineCitation@Status -equals MEDLINE -element MedlineCitation/PMID
We could also use this method to only include certain elements, based on the value of one of the element’s attributes:
xtract -pattern DescriptorName -if DescriptorName@MajorTopicYN -equals Y -element DescriptorName
This command creates a new row for each <DescriptorName> element that has a “MajorTopicYN” attribute with a value of “Y”, and outputs the <DescriptorName>. This essentially provides a list of the MeSH Headings in the input XML that are flagged as Major Topics, and excludes all other MeSH Headings.
In general, strings specified in an -equals argument do not need to be enclosed in quotes. However, if the string includes a space, enclose the entire string in double quotes to ensure that xtract matches the entire string.
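To illustrate filtering on an attribute value, here is a small Python sketch of the same idea as the DescriptorName command above. It is only an analogue written with xml.etree.ElementTree, and the sample XML is invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample input: two major topics and one non-major heading.
SAMPLE = """
<MeshHeadingList>
  <MeshHeading><DescriptorName MajorTopicYN="Y">Asthma</DescriptorName></MeshHeading>
  <MeshHeading><DescriptorName MajorTopicYN="N">Humans</DescriptorName></MeshHeading>
  <MeshHeading><DescriptorName MajorTopicYN="Y">Air Pollution</DescriptorName></MeshHeading>
</MeshHeadingList>
"""

root = ET.fromstring(SAMPLE)

# Analogue of "-if DescriptorName@MajorTopicYN -equals Y":
# keep a heading only when the attribute value matches exactly.
major_topics = [
    descriptor.text
    for descriptor in root.iter("DescriptorName")
    if descriptor.get("MajorTopicYN") == "Y"
]

for heading in major_topics:
    print(heading)
```

Note that "Air Pollution" contains a space; on the xtract command line, a value like that would need to be enclosed in double quotes if it were used as the string to match.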
Alternatives to -equals
While -equals is useful if we know the exact value we want to match, -if and -unless also work with a series of other arguments that allow more flexibility in creating conditions:
-equals: The element/attribute must exactly match the string you specify.
-contains: The element/attribute must contain the string you specify.
-starts-with: The element/attribute must start with the string you specify.
-ends-with: The element/attribute must end with the string you specify.
-is-not: The element/attribute must not match the string you specify.
If the element or attribute specified in your -if or -unless argument contains numeric data (like Year, Volume, Issue, etc.), you may prefer to use Conditional arguments that treat data as numbers, not as strings:
-gt: The value of the element/attribute is greater than the number you specify.
-ge: The value of the element/attribute is greater than or equal to the number you specify.
-lt: The value of the element/attribute is less than the number you specify.
-le: The value of the element/attribute is less than or equal to the number you specify.
-eq: The value of the element/attribute is equal to the number you specify.
-ne: The value of the element/attribute is not equal to the number you specify.
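The difference between treating a value as a string and treating it as a number can be sketched in Python. The values below are made up, and the comparisons show Python's behavior only; they are meant to illustrate why the distinction matters, not to document exactly how xtract parses a given value.

```python
# Made-up <Issue> values, as they might appear as text in the XML.
issues = ["7", "07", "17"]

# String-style conditions compare characters.
exact_match = [v for v in issues if v == "7"]        # like -equals 7
contains = [v for v in issues if "7" in v]           # like -contains 7

# Number-style conditions convert the text to a number first.
numeric_equal = [v for v in issues if int(v) == 7]   # like -eq 7
greater_than = [v for v in issues if int(v) > 7]     # like -gt 7

print(exact_match)
print(contains)
print(numeric_equal)
print(greater_than)
```

In this sketch, the string "07" fails the exact string match but passes the numeric equality test, which is the kind of case where a numeric condition is the better fit.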
Using -if and -unless with Exploration arguments
The xtract Conditional arguments can be especially powerful when combined with Exploration arguments. By applying an -if or -unless argument to a -block argument, data from an individual -block can be included or excluded, based on whether the -block meets the conditions specified by -if or -unless.
For example, if we wanted a list of the PMID of every PubMed record in our XML file, along with each article’s DOI (if one is provided), we could use the following command:
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block ArticleId -if ArticleId@IdType -equals doi -element ArticleId
This command creates a new row for each PubMed record, and prints the record’s PMID in the first column. The command then looks at each occurrence of the <ArticleId> element within the “PubmedArticle” -pattern individually. If an occurrence of the <ArticleId> element has an “IdType” attribute of “doi”, the second -element argument will display the <ArticleId>. This will result in a two-column table, where the first column is always a record’s PMID and the second column is either a record’s DOI (if one is provided) or blank (if no DOI is provided).
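The per-block check described above can be sketched in Python as follows. This is an illustrative analogue only, not xtract's implementation; the sample records, the example DOI, and the helper name pmid_and_ids are invented for the sketch.

```python
import xml.etree.ElementTree as ET

# Made-up sample input: the first record has a DOI, the second does not.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID>111</PMID></MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">111</ArticleId>
        <ArticleId IdType="doi">10.1000/example.1</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID>222</PMID></MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">222</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>
"""

def pmid_and_ids(xml_text, wanted_types):
    """Hypothetical helper: PMID first, then any ArticleId of a wanted type."""
    rows = []
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):      # one row per pattern
        row = [article.findtext("MedlineCitation/PMID", default="")]
        for art_id in article.iter("ArticleId"):    # each block is checked on its own
            if art_id.get("IdType") in wanted_types:
                row.append(art_id.text or "")
        rows.append(row)
    return rows

doi_rows = pmid_and_ids(SAMPLE, {"doi"})
for row in doi_rows:
    print("\t".join(row))
```

Every record still produces a row, because the condition is applied to each block rather than to the pattern as a whole; a record without a DOI simply has nothing after its PMID.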
Combining conditions with -and/-or
The xtract command allows you to restrict output based on multiple conditions, using Boolean AND/OR, with the -and and -or arguments.
When trying to impose multiple conditions in the same xtract command, the first condition is written as always, using -if or -unless, with an optional -equals, -contains, etc. The syntax for subsequent conditions is almost identical, though we replace the -if or -unless with -or or -and.
We can include or exclude data if any one of multiple conditions is met by using -or. Modifying our previous example, we could use the -or argument to populate the second column of our output if the <ArticleId> element has an “IdType” attribute of “doi” or of “pmc”:
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block ArticleId -if ArticleId@IdType -equals doi -or ArticleId@IdType -equals pmc -element ArticleId
This will result in a two-column table, where the first column is always a record’s PMID and the second column is either a record’s DOI (if one is provided), a record’s PMCID (if one is provided), both a DOI and a PMCID (if both are provided) or blank (if neither is provided).
We can also include or exclude data only if every one of multiple conditions is met by using -and:
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block Author -if LastName -equals Smith -and Initials -equals BH -element Affiliation
This command creates a new row for each PubMed record, and prints the record’s PMID in the first column. The command then looks at each occurrence of the <Author> element within the “PubmedArticle” -pattern individually. If an occurrence of the <Author> element has a <LastName> of “Smith” and an <Initials> of “BH”, the second -element argument will display the <Affiliation>. This will result in a two-column table, where the first column is always a record’s PMID and the second column is either affiliation data for the Author BH Smith (if BH Smith is one of the authors on the record) or blank (if BH Smith is not one of the record’s authors).
As with -if and -unless, -and and -or can be used with any of the alternatives to -equals (e.g. -contains, -starts-with, etc.) described above.
(Note that there is no -not argument for Boolean NOT, as -if and -unless are functional opposites, and make an additional -not argument unnecessary.)
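The difference between -and and -or can be illustrated with a short Python sketch that applies the same two conditions to a handful of made-up authors. The sample XML and the helper name child_equals are assumptions for this example, and the code only mirrors the Boolean logic rather than xtract's own behavior.

```python
import xml.etree.ElementTree as ET

# Made-up sample input: only the first author matches both conditions.
SAMPLE = """
<AuthorList>
  <Author><LastName>Smith</LastName><Initials>BH</Initials></Author>
  <Author><LastName>Smith</LastName><Initials>J</Initials></Author>
  <Author><LastName>Jones</LastName><Initials>BH</Initials></Author>
  <Author><LastName>Lee</LastName><Initials>K</Initials></Author>
</AuthorList>
"""

def child_equals(node, name, value):
    """Hypothetical helper: True when the child element's text equals value."""
    return node.findtext(name) == value

authors = list(ET.fromstring(SAMPLE).iter("Author"))

# Analogue of "-if LastName -equals Smith -and Initials -equals BH".
both = [
    (a.findtext("LastName"), a.findtext("Initials"))
    for a in authors
    if child_equals(a, "LastName", "Smith") and child_equals(a, "Initials", "BH")
]

# Analogue of "-if LastName -equals Smith -or Initials -equals BH".
either = [
    (a.findtext("LastName"), a.findtext("Initials"))
    for a in authors
    if child_equals(a, "LastName", "Smith") or child_equals(a, "Initials", "BH")
]

print(both)
print(either)
```

With -and, only the author who satisfies both conditions is kept; with -or, any author who satisfies at least one of them is kept.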
Limiting output based on position
The -if and -unless arguments include data based on the presence, absence or value of an element or attribute. The -position argument can include data based on its position in an XML document.
The -position argument is used with one of the Exploration arguments, such as -block, and tells xtract to include only data from a single occurrence of a block element that is in a specific position relative to other occurrences of that block. For example:
xtract -pattern PubmedArticle -block Author -position 3 -element LastName
would output the LastName element from the third Author in a given PubMed record.
When using -position, you can specify an integer, as demonstrated above. However, you can also use -position first to include only the first occurrence of the block (equivalent to -position 1), or -position last to include only the last occurrence of the block. Using -position last can be especially useful when you are unsure how many occurrences of the block there will be in each pattern.
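The selection rules for -position can be sketched as a small Python function. The helper name pick_position and the author names are made up, and returning nothing for an out-of-range position is an assumption of this sketch rather than documented xtract behavior.

```python
def pick_position(items, position):
    """Hypothetical helper: select one item by 1-based position, 'first' or 'last'."""
    if not items:
        return []
    if position == "first":
        return [items[0]]
    if position == "last":
        return [items[-1]]
    index = int(position)
    # Positions count from 1, so position 3 is the third occurrence.
    if 1 <= index <= len(items):
        return [items[index - 1]]
    # Assumption for this sketch: an out-of-range position selects nothing.
    return []

# Made-up last names standing in for the <Author> blocks of one record.
last_names = ["Adams", "Baker", "Clark", "Davis"]

print(pick_position(last_names, 3))
print(pick_position(last_names, "first"))
print(pick_position(last_names, "last"))
```

The "last" case works without knowing the length of the list in advance, which is the same convenience that -position last offers when the number of blocks varies from record to record.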