Performing Batch MetamorphoSys Runs
This document is a guide to running MetamorphoSys programmatically rather than through the standard GUI. This is useful if you need to make several runs to produce multiple subsets and want to reuse standard configuration files.
Getting Started
To perform a scripted run of MetamorphoSys you will need five things.
- Input data files in a directory. The data files can either be an installed RRF or ORF subset, or could be the raw .nlm files that come with a UMLS distribution.
- Destination directory where the scripted subset is to be created
- MetamorphoSys installation (like an unpacked mmsys.zip file).
- JRE matching the version of the MetamorphoSys distribution.
  - An unpacked mmsys.zip file will contain a JRE directory you can use for Linux, Solaris, or Windows.
- MetamorphoSys configuration file.
The easiest way to obtain all five pieces is to start with a UMLS distribution, that is, the files downloaded from the Knowledge Sources Server. Either the Metathesaurus .nlm files or an installed .RRF subset can serve as the input data. You can choose any directory to hold output data. Unpacking the mmsys.zip file to a known location on your machine provides both the MetamorphoSys installation and the JRE.
For the final piece, the easiest way to obtain a configuration file is to run the MetamorphoSys application and work your way to the configuration screens in the GUI. After selecting your source list and other configuration options, save your configuration to a file in a known place. The resulting configuration file may contain path information. This is not a problem: the switches passed to the Java call in the scripts below override any path information in the file.
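Before writing the run script, it can help to confirm that all five pieces are in place. The following is a minimal sketch and not part of MetamorphoSys: the function name check_mmsys_inputs is hypothetical, and it assumes that the output directory should already exist and that the JRE directory contains bin/java.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with MetamorphoSys): report which of the
# five required pieces are missing. Returns 0 only if all five are present.
check_mmsys_inputs() {
    # $1 = input dir, $2 = output dir, $3 = MetamorphoSys dir,
    # $4 = JRE dir, $5 = configuration file
    rc=0
    [ -d "$1" ]          || { echo "missing input directory: $1" >&2; rc=1; }
    # Assumption: the output directory is created ahead of time.
    [ -d "$2" ]          || { echo "missing output directory: $2" >&2; rc=1; }
    [ -d "$3" ]          || { echo "missing MetamorphoSys directory: $3" >&2; rc=1; }
    [ -x "$4/bin/java" ] || { echo "no executable java under: $4" >&2; rc=1; }
    [ -f "$5" ]          || { echo "missing configuration file: $5" >&2; rc=1; }
    return $rc
}
```

A run script could call this with its own directory and file settings and stop early if the function returns a non-zero status.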
Configuring and Running a Script
In the sections below, you can follow a sample series of steps for putting together the pieces above into a script and actually generating a subset.
Windows
@echo off
REM
REM Specify directory containing .RRF or .nlm files
REM
set METADIR=C:\UMLS
REM
REM Specify output directory
REM
set DESTDIR=C:\UMLS\METASUBSET
REM
REM Specify MetamorphoSys directory
REM
set MMSYS_HOME=C:\UMLS\MMSYS
REM
REM Specify CLASSPATH
REM
set CLASSPATH=%MMSYS_HOME%;%MMSYS_HOME%\lib\jpf-boot.jar
REM
REM Specify JAVA_HOME
REM
set JAVA_HOME="%MMSYS_HOME%\jre\windows"
REM
REM Specify configuration file
REM
set CONFIG_FILE=C:\config.properties
REM
REM Call Batch MetamorphoSys
REM
cd %MMSYS_HOME%
%JAVA_HOME%\bin\java -Djava.awt.headless=true -Djpf.boot.config=%MMSYS_HOME%\etc\subset.boot.properties -Dlog4j.configuration=%MMSYS_HOME%\etc\subset.log4j.properties -Dinput.uri=%METADIR% -Doutput.uri=%DESTDIR% -Dmmsys.config.uri=%CONFIG_FILE% -Xms300M -Xmx1000M org.java.plugin.boot.Boot
The script defines the five needed pieces: input directory, output directory, MetamorphoSys installation, JRE, and configuration file. JAVA_HOME and the CLASSPATH are configured. The script makes the required java call from the MMSYS_HOME directory.
This will produce an output subset based on the input data specified. The subset will contain either ORF or RRF files, depending upon which you indicated in the configuration file. The style of input data must be correctly specified in the configuration file (choose from .nlm files, RRF, or ORF files). A log of the progress is also generated as the process runs.
Linux, Macintosh, or Solaris
Consider the following script:
#!/bin/sh -f
#
# Specify directory containing .RRF or .nlm files
#
METADIR=/d1/UMLS
#
# Specify output directory
#
DESTDIR=/d1/UMLS/METASUBSET
#
# Specify MetamorphoSys directory
#
MMSYS_HOME=/d1/UMLS/MMSYS
#
# Specify CLASSPATH
#
CLASSPATH=${MMSYS_HOME}:$MMSYS_HOME/lib/jpf-boot.jar
#
# Specify JAVA_HOME
#
JAVA_HOME=$MMSYS_HOME/jre/linux
#
# Specify configuration file
#
CONFIG_FILE=/d1/umls/config.properties
#
# Run Batch MetamorphoSys
#
export METADIR
export DESTDIR
export MMSYS_HOME
export CLASSPATH
export JAVA_HOME
cd $MMSYS_HOME
$JAVA_HOME/bin/java -Djava.awt.headless=true -Djpf.boot.config=$MMSYS_HOME/etc/subset.boot.properties \
-Dlog4j.configuration=$MMSYS_HOME/etc/subset.log4j.properties -Dinput.uri=$METADIR \
-Doutput.uri=$DESTDIR -Dmmsys.config.uri=$CONFIG_FILE -Xms300M -Xmx1000M org.java.plugin.boot.Boot
The script defines the five needed pieces: input directory, output directory, MetamorphoSys installation, JRE, and configuration file. JAVA_HOME and the CLASSPATH are configured. The script makes the required java call from the MMSYS_HOME directory.
This will produce an output subset based on the input data specified. The subset will contain either ORF or RRF files, depending upon which you indicated in the configuration file. The style of input data must be correctly specified in the configuration file (choose from .nlm files, RRF, or ORF files). A log of the progress is also generated as the process runs.
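When several subsets are needed, the same Java call can be repeated once per configuration file. The sketch below only prints the command line so it can be reviewed before anything runs. The function name mmsys_command and the paths in the commented loop are hypothetical; the switches themselves are copied from the script above. Paths containing spaces are not handled.

```shell
#!/bin/sh
# Hypothetical helper: print the Java command line for one batch run.
# $1 = MMSYS_HOME, $2 = JAVA_HOME, $3 = input dir, $4 = output dir,
# $5 = configuration file
mmsys_command() {
    printf '%s' "$2/bin/java -Djava.awt.headless=true"
    # printf reuses the format once per remaining argument.
    printf ' %s' \
        "-Djpf.boot.config=$1/etc/subset.boot.properties" \
        "-Dlog4j.configuration=$1/etc/subset.log4j.properties" \
        "-Dinput.uri=$3" "-Doutput.uri=$4" "-Dmmsys.config.uri=$5" \
        -Xms300M -Xmx1000M org.java.plugin.boot.Boot
    printf '\n'
}

# Example loop (illustrative paths; one output directory per config file):
# for cfg in /d1/UMLS/configs/*.prop; do
#     out=/d1/UMLS/subsets/$(basename "$cfg" .prop)
#     mmsys_command /d1/UMLS/MMSYS /d1/UMLS/MMSYS/jre/linux /d1/UMLS "$out" "$cfg"
# done
```

To actually run a printed command, CLASSPATH must still be exported and the call made from the MMSYS_HOME directory, as in the script above.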
Configuration File Notes
As indicated above, the configuration file you use is best generated using the MMSYS GUI. There are a couple of things you may want to consider when reusing a configuration file.
- Input data to a batch MetamorphoSys process can take the form of an installed RRF subset or the raw .nlm files. If you create your configuration file using the GUI, this will be managed for you. If you change your mind after the fact, you can edit a few settings in the properties file.
  - First, choose one of the following two settings:
    - mmsys_input_stream=gov.nih.nlm.umls.mmsys.io.RRFMetamorphoSysInputStream
    - mmsys_input_stream=gov.nih.nlm.umls.mmsys.io.NLMFileMetamorphoSysInputStream
  - Then make sure the properties for the chosen input stream are set. For example, if you chose RRFMetamorphoSysInputStream (in this case we assume the path /d1/UMLS contains RRF files):
    - gov.nih.nlm.umls.mmsys.io.RRFMetamorphoSysInputStream.meta_source_uri=/d1/UMLS/
- Output data can take the form of either RRF or ORF data. If you create your configuration file using the GUI, this will be managed for you. If you change your mind after the fact, you can edit a setting in the properties file. Choose one of the following two settings:
  - mmsys_output_stream=gov.nih.nlm.umls.mmsys.io.RRFMetamorphoSysOutputStream
  - mmsys_output_stream=gov.nih.nlm.umls.mmsys.io.ORFMetamorphoSysOutputStream
- Each time the UMLS Metathesaurus is updated, some of the "default" data sets may change, for example the source list, the default SAB,TTY list, and the list of suppressible (CUI,AUI). If your configurations rely on these properties (e.g. the sources or termgroups properties), make sure you compare the previous version's value list to the current version's value list. To avoid this kind of problem, it is often better to express your configurations in terms of things to include instead of things to exclude. Furthermore, you can always re-open your configuration file in the GUI for the latest release of MetamorphoSys and see a report of changes that may affect your configuration. You can then make the desired changes, save the file again, and reuse it in your batch environment.
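The release-to-release comparison of source lists suggested in the list above can be sketched in sh. This is an illustration under stated assumptions: the function names list_sabs and compare_sabs are hypothetical, and the "sources" property is assumed to hold ";"-separated entries whose first "|"-separated field is the SAB.

```shell
#!/bin/sh
# Hypothetical helper: print the SABs named in the "sources" property of a
# configuration file, one per line, sorted and de-duplicated.
list_sabs() {
    grep '^sources=' "$1" | sed 's/^sources=//' | tr ';' '\n' |
        awk -F'|' 'NF && $1 != "" { print $1 }' | sort -u
}

# Hypothetical helper: show SABs present in only one of two config files.
# $1 = previous release's file, $2 = current release's file
compare_sabs() {
    old_list=$(mktemp) && new_list=$(mktemp) || return 1
    list_sabs "$1" > "$old_list"
    list_sabs "$2" > "$new_list"
    comm -23 "$old_list" "$new_list" | sed 's/^/only in old: /'
    comm -13 "$old_list" "$new_list" | sed 's/^/only in new: /'
    rm -f "$old_list" "$new_list"
}
```

Called with the default configuration files of two releases, compare_sabs prints nothing when the source lists match.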
Instead of using the MetamorphoSys GUI to create your configuration file, you may want to consider a programmatic approach: editing the default configuration file, user.a.prop (in the config/ directory of the distribution), that comes with a MetamorphoSys distribution. Consider this code snippet (csh syntax; % is the shell prompt and >! overwrites an existing file):
% grep ^sources $MMSYS_HOME/config/2009AB/user.a.prop | /usr/local/bin/perl -pe 's/sources=//; s/;/\n/g' | \
awk -F\| '{print $1"|"$1}' | /usr/local/bin/perl -pe 's/\n/;/g' >! /tmp/sab_list.txt
% /usr/local/bin/perl -pe 'open(SOURCES,"/tmp/sab_list.txt"); \
$sources = <SOURCES>; \
chop($sources); \
s/(gov.nih.nlm.umls.mmsys.filter.SourceListFilter.selected_sources).*/$1=$sources/; \
s/^(mmsys_input_stream)=.*/$1=gov.nih.nlm.umls.mmsys.io.NLMFileMetamorphoSysInputStream/; \
s/^(mmsys_output_stream)=.*/$1=gov.nih.nlm.umls.mmsys.io.RRFMetamorphoSysOutputStream/; \
s/^(.*)\.remove_selected_sources=true/$1.remove_selected_sources=false/; ' \
$INIT_CONFIG_FILE >! my_config.prop
In this example, the first command looks up the complete list of sources in the "sources" property of the default configuration file and compiles a SAB list. The second command makes four modifications to the default configuration file (referenced by $INIT_CONFIG_FILE, which must be set beforehand) and writes a new configuration file (my_config.prop).
- The selected_sources property of the source list filter is set to the complete source list (taken from the prior command).
- The mmsys_input_stream property is set to NLMFileMetamorphoSysInputStream (.nlm Files).
- The mmsys_output_stream property is set to RRFMetamorphoSysOutputStream (RRF Files).
- The remove_selected_sources property (of the source list filter) is set to false (causing the source list filter to operate in include mode).
The result is an output file, my_config.prop, that is configured (correctly for this version of the data) to be a "keep everything" subset of the NLM data files. This config file can then be passed, along with the other parameters, to the scripted MetamorphoSys call to make the desired subset.
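The two stream edits listed above can also be made without Perl. The following is a minimal sh sketch, not the method used in the snippet above: set_property is a hypothetical helper that rewrites one "name=value" line and writes the result to standard output. It matches the property name literally, and it assumes values without backslashes, since awk interprets escape sequences in -v assignments.

```shell
#!/bin/sh
# Hypothetical helper: replace the value of one property in a config file.
# $1 = property name, $2 = new value, $3 = config file; result on stdout.
set_property() {
    awk -v key="$1" -v val="$2" '
        BEGIN { prefix = key "=" }
        # Literal prefix match at the start of the line (no regex).
        index($0, prefix) == 1 { print prefix val; next }
        { print }
    ' "$3"
}

# Example (illustrative file names):
# set_property mmsys_output_stream \
#     gov.nih.nlm.umls.mmsys.io.RRFMetamorphoSysOutputStream \
#     user.a.prop > my_config.prop
```

Lines that do not start with the given property name pass through unchanged, so the calls can be chained through temporary files to change several properties.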
Last Reviewed: July 29, 2016