David Sherertz, Apelon, Inc.
Executive Summary
The NLM received 9 SNOMED CT source files in the January 2005 release, and 3 Spanish translation SNOMED CT source files in the October 2004 release, from the College of American Pathologists (CAP) for inclusion in the 2005AB release of the UMLS Metathesaurus. The 2005AB release is the fifth Metathesaurus release to appear in the Rich Release Format (RRF), and the second to include SNOMED Spanish translations. An important principle of the release file structure is “source transparency”. Source transparency implies that, independent of the Metathesaurus value-added attributes, the original SNOMED CT source files can be recreated ‘exactly’ from the information in the RRF distribution files. For the January 2005 / October 2004 SNOMED CT source files and the 2005AB Metathesaurus RRF distribution files, this recreation has been demonstrated successfully.
Abstract
One central principle for the incorporation of a source vocabulary into the UMLS is source transparency. This implies that there is no loss of information in the editing and release process. Every element of information contained in the source is included in the release files, even though it may be represented in a format different from that of the original source files, and some of the information may not yet be released in the UMLS (for example, inactive SNOMED CT concepts / descriptions, and relationships that include them). The data structure of SNOMED CT is complex, and some of its information pertains to particular atoms (specific occurrences of a string or concept name), rather than to a SNOMED CT concept as a whole or to all instances of a string in the vocabulary. The need to represent all of this information influenced the development of a new release file format for the UMLS Metathesaurus – the Rich Release Format (RRF). The 2005AB release is the fifth time that RRF files are being distributed. With each release, it is important to re-verify that source transparency has been achieved, especially for a large and complex vocabulary like SNOMED CT.
Starting with the January 2005 and October 2004 SNOMED CT source files received from the CAP, all information was extracted into source-derived files. These files formed the basis for comparisons. All SNOMED CT information from the 2005AB RRF distribution files was extracted into UMLS-derived files and converted to a format identical to the source-derived files. Direct row-by-row comparisons between the source-derived and UMLS-derived files showed that they are identical. This verified that all information contained in the source can be retrieved from the UMLS files, and that source transparency has been achieved again in the incorporation of SNOMED CT into the UMLS Metathesaurus.
Introduction
The RRF extensions to the Metathesaurus release files are an enhancement of the original release format (ORF). These extensions enable all Metathesaurus sources, including SNOMED CT and other new sources such as the NCI Thesaurus, to be represented transparently. Thus, a critical quality assurance check is to see whether the SNOMED CT information can be extracted from the 2005AB RRF Metathesaurus distribution files and used exclusively to reproduce the original 12 SNOMED CT files as received from the CAP. Appendix A shows the file names, field names and sizes of each of these 12 SNOMED CT files. (They are the ones in the list that begin with ‘../sct’ and end with ‘.txt’.) Appendix B shows the successful result of the process described in the remaining sections, which elaborate on the recreation details. The 12 SNOMED CT files are compared to versions derived from the RRF distribution files. In this comparison, for the original files, multiple blanks and vertical bars are removed, and the original field delimiter (a TAB) is converted to the distribution file delimiter (a vertical bar). Each pair of original source-derived and RRF distribution UMLS-derived files is sorted and compared. There are no differences in any of the files; this defines what is meant by the files being ‘exactly’ identical.
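The normalization applied to the original files before comparison can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the csh/nawk scripts actually used: the function name normalize_row and the targetcodes_index parameter are hypothetical, and the comma substitution for the TARGETCODES subfield delimiter and the trailing vertical bar are taken from the description in Methods.

```python
import re

def normalize_row(line, targetcodes_index=None):
    """Convert one TAB-delimited source row to the bar-delimited QA form.

    targetcodes_index (hypothetical parameter) is the 0-based position of the
    TARGETCODES field, or None for files that do not have one.
    """
    fields = line.rstrip("\r\n").split("\t")
    cleaned = []
    for i, field in enumerate(fields):
        # A vertical bar inside a field would collide with the new delimiter:
        # turn it into a comma in TARGETCODES, drop it everywhere else.
        field = field.replace("|", "," if i == targetcodes_index else "")
        # Collapse runs of blanks and trim leading and trailing blanks.
        field = re.sub(r" +", " ", field).strip(" ")
        cleaned.append(field)
    # Join with vertical bars and add one more at the end of the row.
    return "|".join(cleaned) + "|"
```

Working field by field, before the TAB is replaced, keeps bars that were part of the data distinct from bars that become delimiters.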
Methods
The only UNIX commands used in any of the scripts are nawk, sed, sort, set, diff, wc, rm, touch and echo. All of the scripts are csh and are called from a single shell script, called the Master Script, which in turn runs a sequence of scripts in the order described in the steps below.
- The Master Script first makes the ‘original’ SNOMED CT files. It does this by extracting only the active status (0 and 6) concepts and descriptions, and only relationships where there are active concepts on both ends of the relationship. This results in all of the historical relationships dropping out, because by definition, the first concept in a historical relationship is inactive. For future releases of the Metathesaurus, since all of the inactive SNOMED CT concepts and descriptions will be released, nothing should drop out. As shown on the lines annotated with a ‘>’ in Appendix A, 17% of SNOMED CT concepts and 24% of descriptions are inactive. The number of relationships involving an inactive concept on one end or the other is only 9% of all relationships, and slightly more than 28% of component history lines involve inactive descriptions. None of the subset or cross mapping files involve any inactive concepts, so these files contain the same number of rows and fields as received from the CAP, except for the Spanish subset members file, where a very few rows drop out because the October 2004 Spanish files are slightly out of synchronization with the January 2005 SNOMED CT files.
The remaining information in the 12 SNOMED CT files is put into 12 files with descriptive names starting with ‘QA’ and ending with ‘*.original’, as shown in Appendix A. The SNOMED CT files as received from the CAP use a TAB character as a field delimiter. This is converted to a vertical bar, the distribution field delimiter, in the QA*.original files. Also, in the Cross Mapping TARGETCODES field, the SNOMED CT file uses a vertical bar as a subfield delimiter. This is changed to a comma ‘,’. The only other change done to these 12 QA*.original files from SNOMED CT as received from the CAP is to remove any vertical bars, as well as multiple, leading and trailing blanks, and to add a vertical bar at the end of each row. These minor changes are made to facilitate comparisons with the files as recreated from the 2005AB RRF Metathesaurus distribution files.
- The Master Script next extracts all of the SNOMED CT rows from the RRF distribution files. It does this by pulling all the rows that have an SAB of ‘SNOMEDCT’ from the MRCONSO, MRSAT, MRREL, MRHIST, and MRMAP files. For the MRREL rows, it also pulls only the rows where both the STYPE1 and STYPE2 fields are SCUI. In MRMAP, the SAB field is the MAPSETSAB field. The Spanish rows are extracted into separate files from the rows in MRCONSO and MRSAT that have an SAB of ‘SCTSPA’.
- The Master Script then runs in sequence 6 scripts that recreate from the RRF files pulled in Step 2) the 12 SNOMED CT files corresponding to the QA*.original files created in Step 1). The files created in this step have descriptive names starting with ‘QA’ and ending with ‘*.final’, and their descriptive name is identical to the corresponding QA*.original file made in Step 1). In each of these 6 scripts, after making its QA*.final file(s) (some scripts make two or three QA*.final files), the Unix wc command is run on the QA*.original and QA*.final files, and then a diff command piped into a wc -l. A successful recreation is achieved when the wc counts are identical and the diff with wc -l returns 0.
The approach in creating most of the QA*.final files in the 6 scripts of this step is to make a series of ‘triples’ for the fields that will become the QA*.final file. The triples have the operative ID (concept, description, relationship, subset, cross mapping) as the first field, the field number in the QA*.final file as the second field, and the field value as the third field. Then a sort and simple nawk script reassembles the QA*.final rows from these triples. For some of the QA*.final files, this approach can be simplified, and the QA*.final file can be made by just extracting and reordering the appropriate fields directly from the pertinent RRF file (for example, the Component History file).
In debugging the 6 scripts in Step 3), it is helpful to create and look over intermediate files during the process of making the QA*.final to isolate and fix whatever problem is occurring. For the final run, each script removes all of the intermediate files, and only leaves the QA*.original file when there is a discrepancy with the QA*.final file; otherwise only the QA*.final file is left for use in other release QA. One other file is left during the creation of the QA*.final file for the Descriptions and the Spanish Descriptions. This is a file mapping the Meta-assigned AUIs to the SNOMED CT Description IDs, which are needed to make the corresponding Subset Members file. Except for these two carry-over files, all 6 of the scripts in this step are completely self-contained. One assumption made in making the triples is that the field names for the SNOMED CT files in the RRF files (MRSAT principally) are unique if the name is shortened to its first letter and last five letters. That is true currently, but must be re-verified in the future when new field names are added to SNOMED CT.
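The uniqueness assumption about shortened field names could be re-verified with a check along the following lines. This is a hedged sketch: the function names are hypothetical, the field list is limited to the Concepts, Descriptions and Relationships names shown in this report, and how the actual scripts treat names shorter than six letters (such as TERM) is not stated, so the slicing behavior here is an assumption.

```python
from collections import defaultdict

def abbreviate(name):
    """First letter plus last five letters of a field name.

    For names shorter than five letters the slice returns the whole name,
    which is an assumption about how short names are handled.
    """
    return name[0] + name[-5:]

def find_collisions(names):
    """Return {abbreviation: sorted names} for abbreviations shared by 2+ names."""
    groups = defaultdict(set)
    for name in names:
        groups[abbreviate(name)].add(name)
    return {abbr: sorted(group) for abbr, group in groups.items() if len(group) > 1}

# Field names copied from the Concepts, Descriptions and Relationships files.
FIELDS = [
    "CONCEPTID", "CONCEPTSTATUS", "FULLYSPECIFIEDNAME", "CTV3ID", "SNOMEDID",
    "ISPRIMITIVE", "DESCRIPTIONID", "DESCRIPTIONSTATUS", "TERM",
    "INITIALCAPITALSTATUS", "DESCRIPTIONTYPE", "LANGUAGECODE",
    "RELATIONSHIPID", "CONCEPTID1", "RELATIONSHIPTYPE", "CONCEPTID2",
    "CHARACTERISTICTYPE", "REFINABILITY", "RELATIONSHIPGROUP",
]

if __name__ == "__main__":
    # An empty dictionary means no two field names share an abbreviation.
    print(find_collisions(FIELDS))
```

Running the check whenever SNOMED CT adds field names would flag a collision before it silently merges two attributes into one triple.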
Results
Appendix A has the details of running the script that extracts the QA*.original files from the SNOMED CT files as received from the CAP, Step 1) of the Master Script described above. For each of the 12 files, it shows the wc counts for the file as received, and then the corresponding QA*.original file of active rows only. In the 6 files where some of the rows of the file were removed, a line annotation beginning with a ‘>’ is added showing the count of the number of lines dropped in the QA*.original file, and the percentage this represents of the file as received. These percentages and the reasons for their removal were described in Methods.
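The active-row filtering and the percentages annotated in Appendix A can be illustrated with a short sketch. It assumes rows have already been split into tuples in the field order shown in this report; the function names are hypothetical, and the real extraction is done with nawk rather than Python.

```python
ACTIVE_STATUSES = {"0", "6"}  # active status values named in Methods

def active_concept_ids(concept_rows):
    """concept_rows: iterable of tuples starting (CONCEPTID, CONCEPTSTATUS, ...)."""
    return {row[0] for row in concept_rows if row[1] in ACTIVE_STATUSES}

def active_relationships(relationship_rows, active_ids):
    """Keep rows (RELATIONSHIPID, CONCEPTID1, RELATIONSHIPTYPE, CONCEPTID2, ...)
    whose two concepts are both active."""
    return [r for r in relationship_rows
            if r[1] in active_ids and r[3] in active_ids]

def percent_dropped(received, kept):
    """Percentage of the received rows that do not appear in the QA*.original file."""
    return 100.0 * (received - kept) / received

if __name__ == "__main__":
    # Line counts for the Concepts file taken from Appendix A.
    print(round(percent_dropped(364462, 303140), 1))
```

With the Concepts counts from Appendix A (364462 lines received, 303140 kept), the computed figure rounds to the 16.8% shown there.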
Appendix B shows the output of a single, complete run of the Master Script described in Methods. As can be seen, the wc (record/line, word, character) counts are identical within each pair of files, and the diff shows that the files are identical, with the exception of the Spanish subset members file, which has a known synchronization and update problem. The entire Master Script, starting with just the SNOMED CT files as received and the 2005AB RRF distribution files, takes under 50 minutes to complete successfully on a medium-sized Solaris machine.
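The success criterion (identical wc counts and an empty diff) can be approximated in a few lines. This sketch is an assumption-laden stand-in for the wc and diff commands: it counts characters rather than bytes, and it counts differing rows as a multiset difference instead of counting lines of diff output, so the numbers it reports for a mismatch would not equal those printed by diff piped into wc -l.

```python
from collections import Counter

def wc_counts(text):
    """Approximate `wc`: (lines, words, characters) for a string."""
    return (text.count("\n"), len(text.split()), len(text))

def differing_rows(original_text, final_text):
    """Number of rows present in one text but not the other, ignoring order."""
    a = Counter(original_text.splitlines())
    b = Counter(final_text.splitlines())
    return sum(((a - b) + (b - a)).values())

def recreated(original_text, final_text):
    """True when the counts match and no row differs."""
    return (wc_counts(original_text) == wc_counts(final_text)
            and differing_rows(original_text, final_text) == 0)
```

Comparing rows as multisets plays the role of sorting both files before the diff, so row order does not affect the result.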
Discussion
The 6 scripts that make the 12 QA*.final files vary in their degree of complexity. Interesting details about fine points will be elaborated in this section for each QA*.final file. One general point is that the SNOMED CT files as received from the CAP use a TAB as a field delimiter, while the QA* files and the distribution MR*.RRF files use a vertical bar ‘|’. In the sections below, whenever a reference is made to a specific MR*.RRF distribution file, it is understood that this reference is to the version of that RRF file extracted from the full release file, containing only the rows with a SNOMEDCT or SCTSPA SAB, and that the .RRF suffix is omitted. Below the descriptive name of each file are the row(s) of the SNOMED CT field names for that file, as some of these field names are mentioned in the discussion. For brevity, the field delimiter is shown as a vertical bar; in the actual SNOMED CT source files, the delimiter is a TAB character.
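The extraction of the SNOMED CT rows from a full RRF file, on which all of the sections below rely, amounts to a filter on the SAB field. The sketch below is illustrative only: extract_rows is a hypothetical name, and the position of the SAB field is passed in by the caller because it differs by file and should be checked against the release documentation. The value 11 used in the example is an assumption about the MRCONSO column order, and the sample rows are made up.

```python
def extract_rows(lines, sab_index, sabs):
    """Yield bar-delimited rows whose field at sab_index is one of sabs.

    sab_index is the 0-based position of the SAB (or MAPSETSAB) field;
    it must be supplied per file and is an assumption of the caller.
    """
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) > sab_index and fields[sab_index] in sabs:
            yield line

if __name__ == "__main__":
    # Made-up rows laid out like MRCONSO, with SAB assumed at index 11.
    sample = [
        "C1|ENG|P|L1|PF|S1|Y|A1|111|222||SNOMEDCT|FN|222|Example (finding)|4|N||\n",
        "C1|SPA|P|L2|PF|S2|Y|A2|333|222||SCTSPA|PT|222|Ejemplo|4|N||\n",
        "C2|ENG|P|L3|PF|S3|Y|A3||||MSH|MH|D000001|Other|0|N||\n",
    ]
    english = list(extract_rows(sample, 11, {"SNOMEDCT"}))
    spanish = list(extract_rows(sample, 11, {"SCTSPA"}))
    print(len(english), len(spanish))
```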
Concepts
CONCEPTID|CONCEPTSTATUS|FULLYSPECIFIEDNAME|CTV3ID|SNOMEDID|ISPRIMITIVE|
This QA*.final file is very straightforward, as it involves simply pulling all of the rows from the MRCONSO file with a TTY field of “FN”, and then getting the SCUI attributes from the MRSAT file. With these triples, the QA*.final file is made and compared to the QA*.original file.
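The reassembly of rows from triples, as used for the Concepts file, can be sketched as follows. This is a minimal Python illustration rather than the sort and nawk script actually used; the function name is hypothetical, field numbers are assumed to be 1-based with the ID itself supplied as field 1, and the concept values in the example are made up.

```python
from collections import defaultdict

def rows_from_triples(triples, field_count):
    """Reassemble bar-delimited rows from (id, field_number, value) triples.

    field_number is 1-based; a field with no triple is left empty.
    """
    by_id = defaultdict(dict)
    for op_id, field_number, value in triples:
        by_id[op_id][field_number] = value
    rows = []
    for op_id in sorted(by_id):  # stands in for the sort step
        values = [by_id[op_id].get(n, "") for n in range(1, field_count + 1)]
        rows.append("|".join(values) + "|")
    return rows

if __name__ == "__main__":
    # Made-up triples for one concept; the six fields follow the Concepts header.
    triples = [
        ("100", 3, "Example (finding)"),  # from the MRCONSO row with TTY "FN"
        ("100", 1, "100"),
        ("100", 2, "0"),
        ("100", 4, "X0001"),
        ("100", 5, "F-00001"),
        ("100", 6, "1"),
    ]
    print(rows_from_triples(triples, 6))
```

Because each triple carries its own field number, the attributes can be collected from MRCONSO and MRSAT in any order and still land in the right column.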
Descriptions (2)
DESCRIPTIONID|DESCRIPTIONSTATUS|CONCEPTID|TERM|INITIALCAPITALSTATUS|
DESCRIPTIONTYPE|LANGUAGECODE|
This QA*.final file is also relatively simple. A separate QA*.final file is made for the U.S. and Spanish descriptions. Note that for each language, there is a separate MRCONSO and MRSAT file made by extracting the rows from the RRF files with the appropriate SAB for each language. The one exception is the ‘description’ record made for the Cross Mappings and for the Subsets; these are ignored. In future Metathesaurus releases, the SAB for these records should be changed to MTHSCT. Then, the SAUI attributes are pulled from each MRSAT file. One wrinkle with the Description attributes is that they are linked by their AUI to the DESCRIPTIONID for their triple. So in pulling all the MRCONSO rows, a mapping file is also made connecting the Description’s AUI to the DESCRIPTIONID to allow linking the MRSAT attribute to its appropriate triple. With these triples, each QA*.final file is made and compared to the corresponding QA*.original file.
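The AUI-to-DESCRIPTIONID linkage described above can be illustrated with a small sketch. The input shapes are simplified assumptions, not the real MRCONSO and MRSAT layouts: MRCONSO is reduced to (AUI, DESCRIPTIONID) pairs and MRSAT to (AUI, attribute name, attribute value) tuples, and the function names are hypothetical.

```python
def aui_to_description_id(mrconso_pairs):
    """mrconso_pairs: iterable of (AUI, DESCRIPTIONID) pairs taken from MRCONSO."""
    return dict(mrconso_pairs)

def description_triples(mrsat_rows, aui_map, field_numbers):
    """Turn MRSAT attribute rows into (DESCRIPTIONID, field_number, value) triples.

    mrsat_rows: iterable of (AUI, attribute_name, attribute_value).
    field_numbers: attribute name -> column number in the Descriptions file.
    Rows whose AUI or attribute name is unknown are skipped.
    """
    triples = []
    for aui, name, value in mrsat_rows:
        if aui in aui_map and name in field_numbers:
            triples.append((aui_map[aui], field_numbers[name], value))
    return triples
```

The same mapping is the carry-over file that the Subset Members step later uses to obtain the MEMBERID.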
Relationships
RELATIONSHIPID|CONCEPTID1|RELATIONSHIPTYPE|CONCEPTID2|CHARACTERISTICTYPE|
REFINABILITY|RELATIONSHIPGROUP|
This QA*.final file is a little more complex. First of all, the only rows needed in the MRREL file are the rows with a DIR (Directionality) field = “Y”. Secondly, the MRREL file uses AUIs for both the ‘to’ and ‘from’ concepts in the relationship, so these AUIs have to be looked up in MRCONSO to get their CONCEPTID. Then, the sense of the relationship in the MRREL file is reversed from the sense in the SNOMED CT file (right-to-left versus left-to-right respectively), so the two concepts involved in the relationship have to be switched in making the QA*.final file. The SNOMED CT field RELATIONSHIPTYPE is recovered by getting all the UMLSRELA ATVs with their type from MRSAT, matching the UMLSRELA string from the MRREL rows, and then replacing the string with the corresponding RELATIONSHIPTYPE. The remaining fields are pulled from the SRUI rows in MRSAT. With these triples, the QA*.final file is made and compared to the QA*.original file.
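The three adjustments described above (keeping only DIR = “Y” rows, looking up the CONCEPTID for each AUI, and switching the two concepts) can be sketched as follows. The record layout is a simplified assumption: each MRREL row is represented as a dictionary with the keys SRUI, AUI1, AUI2, RELA and DIR, the two lookup tables are assumed to have been built beforehand from MRCONSO and from the UMLSRELA attributes in MRSAT, and the function names are hypothetical. Which MRREL atom becomes CONCEPTID1 follows the right-to-left reading described in the text.

```python
def relationship_row(rel, aui_to_concept, rela_to_type):
    """Return (RELATIONSHIPID, CONCEPTID1, RELATIONSHIPTYPE, CONCEPTID2).

    rel is a simplified MRREL row (a dict); the concepts are switched because
    MRREL reads right-to-left while SNOMED CT reads left-to-right.
    """
    return (
        rel["SRUI"],
        aui_to_concept[rel["AUI2"]],  # second MRREL atom becomes CONCEPTID1
        rela_to_type[rel["RELA"]],    # RELA string back to RELATIONSHIPTYPE
        aui_to_concept[rel["AUI1"]],  # first MRREL atom becomes CONCEPTID2
    )

def recreate_relationships(mrrel_rows, aui_to_concept, rela_to_type):
    """Keep only the rows asserted in the source direction (DIR == "Y")."""
    return [relationship_row(r, aui_to_concept, rela_to_type)
            for r in mrrel_rows if r["DIR"] == "Y"]
```

The remaining fields (CHARACTERISTICTYPE, REFINABILITY, RELATIONSHIPGROUP) would be added as triples keyed by the same relationship identifier.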
Note that beginning with the July 2004 SNOMED CT release, the historical relationships are included in a single relationships file; prior to this, the historical relationships were distributed in a separate file with the same format as the relationships file. The only way of distinguishing active relationships from historical relationships is that historical relationships have a CHARACTERISTICTYPE value equal to 2. In the 2005AB release, no historical relationships were released, as the first concept in the relationship is always inactive. Thus, in making the QA*.original file for relationships, only the relationships involving active concepts are included.
Also note that the AQ (Allowable Qualifiers) relationships are in both the MRSAT file and the MRREL file. They are made from both places, and a sort -u is done to ensure that they are identically represented in the two distribution file places in the QA*.final file.
Component History
COMPONENTID|RELEASEVERSION|CHANGETYPE|STATUS|REASON|
This QA*.final file simply involves getting all of the rows in MRHIST with a CHANGEKEY of CONCEPTSTATUS or DESCRIPTIONSTATUS, and printing out the appropriate fields from those rows in the order of the QA*.original file. In this way, the QA*.final file is made and compared to the QA*.original file.
Subsets (2)
SUBSETID|SUBSETORIGINALID|SUBSETVERSION|SUBSETNAME|SUBSETTYPE|LANGUAGECODE|
REALMID|CONTEXTID|
A separate QA*.final file is made for the U.S. and Spanish subsets. Note that for each language, Step 2) makes a separate MRCONSO and MRSAT file. The QA*.final files are each mostly made from triples extracted from the MRSAT file. However, there is one row in MRCONSO with a TTY of ‘SB’ that contains the SUBSETNAME field. With these triples, each QA*.final file is made and compared to the QA*.original file.
Subset Members (2)
SUBSETID|MEMBERID|MEMBERSTATUS|LINKEDID|
A separate QA*.final file is made for the U.S. and Spanish subset members. Note that for each language, Step 2) makes a separate MRSAT file. Each QA*.final file is entirely made from the MRSAT file, as all of the fields except the MEMBERID are in the ATV field of the SUBSETMEMBER ATN rows. The MEMBERID is obtained from the mapping of the AUI in the MRSAT row to the DESCRIPTIONID made when the Descriptions file is built. Each QA*.final file is made and compared to the QA*.original file.
Cross Mappings Sets
MAPSETID|MAPSETNAME|MAPSETTYPE|MAPSETSCHEMEID|MAPSETSCHEMENAME|
MAPSETSCHEMEVERSION|MAPSETREALMID|MAPSETSEPARATOR|MAPSETRULETYPE|
This QA*.final file is made entirely from the MRSAT file, as all of the fields are separate rows in MRSAT; these are made into triples. With these triples, the QA*.final file is made and compared to the QA*.original file. The MAPSETXRTARGETID, MAPSETID and MAPSETSCHEMEID values are saved with their CUI for use in making the Cross Mapping Targets file and Cross Mappings file below.
Cross Mapping Targets
TARGETID|TARGETSCHEMEID|TARGETCODES|TARGETRULE|TARGETADVICE|
This QA*.final file is made entirely from the MRMAP file. One ‘fake’ row must also be written out with just the TARGETID and TARGETSCHEMEID populated from the saved values in making the Cross Mappings Sets file above. All of the other rows are from the non ‘XR’ REL rows of MRMAP. It should be noted that the initial QA*.final file made in this way has many duplicate rows. To get it to compare to the QA*.original file, a sort -u is done to remove the duplicates. The QA*.final file is made and compared to the QA*.original file.
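The construction of the targets file, including the ‘fake’ row and the removal of duplicates, can be sketched as below. The record layout is a simplified assumption: each MRMAP row is represented as a dictionary whose keys are named after the SNOMED CT target fields plus REL, which is not how MRMAP itself names its columns, and the function name and example identifiers are hypothetical.

```python
def target_rows(mrmap_records, fake_target_id, fake_scheme_id):
    """Build the sorted, de-duplicated Cross Mapping Targets rows.

    mrmap_records: dicts with keys REL, TARGETID, TARGETSCHEMEID,
    TARGETCODES, TARGETRULE and TARGETADVICE (a simplified view of MRMAP).
    """
    # The one 'fake' row has only TARGETID and TARGETSCHEMEID populated.
    rows = {"|".join([fake_target_id, fake_scheme_id, "", "", ""]) + "|"}
    for rec in mrmap_records:
        if rec["REL"] != "XR":
            rows.add("|".join([rec["TARGETID"], rec["TARGETSCHEMEID"],
                               rec["TARGETCODES"], rec["TARGETRULE"],
                               rec["TARGETADVICE"]]) + "|")
    # A set followed by sorted() plays the role of sort -u.
    return sorted(rows)
```

Duplicates arise because many MRMAP rows point at the same target, so each target would otherwise be written once per mapping.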
Cross Mappings
MAPSETID|MAPCONCEPTID|MAPOPTION|MAPPRIORITY|MAPTARGETID|MAPRULE|MAPADVICE|
This QA*.final file is made from all of the rows of MRMAP. The MAPSETID is from the saved value above for the CUI of the MRMAP row. The MAPTARGETID for the XR REL rows is the saved MAPSETXRTARGETID value above for the CUI of the MRMAP row. That is, the XR REL rows of MRMAP do not have their own MAPTARGETID value, but are all mapped to that one value (which is supposed to mean ‘cannot be mapped’). Otherwise, the appropriate fields from MRMAP are printed out in the order of the QA*.original file. The QA*.final file is made and compared to the QA*.original file.
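The handling of the XR rows can be illustrated with a short sketch. As before, the record layout is a simplified assumption: each MRMAP row is a dictionary whose keys are named after the SNOMED CT cross mapping fields plus CUI and REL, the two lookup tables hold the values saved per CUI when the Cross Mappings Sets file was made, and the function name is hypothetical.

```python
def cross_map_row(rec, mapset_id_by_cui, xr_target_by_cui):
    """Return one bar-delimited Cross Mappings row for a simplified MRMAP record.

    XR rows have no MAPTARGETID of their own, so the saved MAPSETXRTARGETID
    for the row's CUI is used instead.
    """
    if rec["REL"] == "XR":
        target_id = xr_target_by_cui[rec["CUI"]]
    else:
        target_id = rec["MAPTARGETID"]
    fields = [
        mapset_id_by_cui[rec["CUI"]],
        rec["MAPCONCEPTID"],
        rec["MAPOPTION"],
        rec["MAPPRIORITY"],
        target_id,
        rec["MAPRULE"],
        rec["MAPADVICE"],
    ]
    return "|".join(fields) + "|"
```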
Conclusion
It is possible to recreate the 12 SNOMED CT files as received from the CAP using only the information in the Metathesaurus 2005AB RRF distribution files. In fact, only 5 of the RRF files are needed. While some of the steps and links require a sophisticated understanding of both the SNOMED CT fields and the UMLS Metathesaurus fields, and what each means, the process can be implemented simply and tested efficiently. This successful test of the recreation of the SNOMED CT files is a strong endorsement that the RRF structure does indeed support true source transparency. And, if SNOMED CT in the UMLS Metathesaurus is transparently complete, that is a good indication that any other present or future sources will be demonstrably transparent when represented in the RRF file structure.
Appendix A – SNOMED CT Release and QA*.original Files
Content Directory
Concepts
CONCEPTID|CONCEPTSTATUS|FULLYSPECIFIEDNAME|CTV3ID|SNOMEDID|ISPRIMITIVE|
364462 3718918 26094772 Content/sct_concepts_20050131.txt
303140 1630579 22598785 QAcons.original
> 61322 inactive concepts (16.8%)
Descriptions
DESCRIPTIONID|DESCRIPTIONSTATUS|CONCEPTID|TERM|INITIALCAPITALSTATUS|
DESCRIPTIONTYPE|LANGUAGECODE|
984537 10287244 63879918 Content/sct_descriptions_20050131.txt
753037 3429948 50413969 QAdesc.original
> 231500 inactive descriptions (23.5%)
Spanish Descriptions
DESCRIPTIONID|DESCRIPTIONSTATUS|CONCEPTID|TERM|INITIALCAPITALSTATUS|
DESCRIPTIONTYPE|LANGUAGECODE|
893235 10225713 64725618 Spanish Descriptions/sct_descriptions_20041031.txt
669710 3846491 50963665 QAdesces.original
> 223525 inactive Spanish descriptions (25.0%)
Relationships
RELATIONSHIPID|CONCEPTID1|RELATIONSHIPTYPE|CONCEPTID2|CHARACTERISTICTYPE|
REFINABILITY|RELATIONSHIPGROUP|
1453500 10174500 66793761 Content/sct_relationships_20050131.txt
1322856 1322856 62059156 QArels.original
> 130644 relationships involving inactive concepts/historical (8.99%)
History Directory
Componenthistory
COMPONENTID|RELEASEVERSION|CHANGETYPE|STATUS|REASON|
1631538 6912491 43627416 History/sct_componenthistory_20050131.txt
1169216 1279892 31819186 QAhist.original
> 462322 inactive components (28.3%)
Subset Directory
US Subset
SUBSETID|SUBSETORIGINALID|SUBSETVERSION|SUBSETNAME|SUBSETTYPE|LANGUAGECODE|
REALMID|CONTEXTID|
2 19 147 Subsets/sct_subsets_us_20050131.txt
2 5 149 QAenssets.original
Spanish Subset
2 18 143 Subsets/sct_subsets_20041031.txt
2 4 145 QAesssets.original
US Subset Members
SUBSETID|MEMBERID|MEMBERSTATUS|LINKEDID|
725206 2175619 14558868 Subsets/sct_subsetmembers_us_20050131.txt
725206 725206 15284074 QAenssetmems.original
Spanish Subset Members
688393 2065180 14303724 Subsets/sct_subsetmembers_20041031.txt
688319 688319 14990548 QAesssetmems.original
> 74 members (0.01%) not synchronized with January 2005 SNOMED CT
Cross Mappings Directory
Sets
MAPSETID|MAPSETNAME|MAPSETTYPE|MAPSETSCHEMEID|MAPSETSCHEMENAME|
MAPSETSCHEMEVERSION|MAPSETREALMID|MAPSETSEPARATOR|MAPSETRULETYPE|
2 27 289 Cross Mappings/sct_crossmapsets_icd9_20050131.txt
2 13 295 QAxmapssets.original
Maps
MAPSETID|MAPCONCEPTID|MAPOPTION|MAPPRIORITY|MAPTARGETID|MAPRULE|MAPADVICE|
92785 556711 2965579 Cross Mappings/sct_crossmaps_icd9_20050131.txt
92785 92785 3058364 QAxmaps.original
Targets
TARGETID|TARGETSCHEMEID|TARGETCODES|TARGETRULE|TARGETADVICE|
14099 42298 642945 Cross Mappings/sct_crossmaptargets_icd9_20050131.txt
14099 14099 657044 QAxmaptargets.original
Appendix B – Successful 2005AB Run of Master-QA Script
Wed May 18 17:44:02 EDT 2005 master-QA-make.s start Master QA
Wed May 18 17:44:02 EDT 2005 extract-SNOMED-from-Meta.s start
Wed May 18 17:52:02 EDT 2005 extract-SNOMED-from-Meta.s end
Wed May 18 17:52:02 EDT 2005 make-snomed-original.s start
Wed May 18 17:58:01 EDT 2005 make-snomed-original.s end
Wed May 18 17:58:01 EDT 2005 create-concepts.s start
303140 1630579 22598785 QAcons.original
303140 1630579 22598785 QAcons.final
606280 3261158 45197570 total
0
Wed May 18 18:00:28 EDT 2005 create-concepts.s end
Wed May 18 18:00:28 EDT 2005 create-descriptions.s start
753037 3429948 50413969 QAdesc.original
753037 3429948 50413969 QAdesc.final
1506074 6859896 100827938 total
0
Wed May 18 18:05:54 EDT 2005 create-descriptions.s end
Wed May 18 18:05:54 EDT 2005 create-descriptions-es.s start
669710 3846491 50963665 QAdesces.original
669710 3846491 50963665 QAdesces.final
1339420 7692982 101927330 total
0
Wed May 18 18:10:25 EDT 2005 create-descriptions-es.s end
Wed May 18 18:10:25 EDT 2005 create-relationships-history.s start
1322856 1322856 62059156 QArels.original
1322856 1322856 62059156 QArels.final
2645712 2645712 124118312 total
0
1169216 1279892 31819186 QAhist.original
1169216 1279892 31819186 QAhist.final
2338432 2559784 63638372 total
0
Wed May 18 18:23:58 EDT 2005 create-relationships-history.s end
Wed May 18 18:23:58 EDT 2005 create-subsets.s start
2 4 145 QAesssets.original
2 4 145 QAesssets.final
4 8 290 total
0
688319 688319 14990548 QAesssetmems.original
688238 688238 14988853 QAesssetmems.final
1376557 1376557 29979401 total
> 153* Known discrepancy in safe replacement synchronization
2 5 149 QAenssets.original
2 5 149 QAenssets.final
4 10 298 total
0
725206 725206 15284074 QAenssetmems.original
725206 725206 15284074 QAenssetmems.final
1450412 1450412 30568148 total
0
Wed May 18 18:30:20 EDT 2005 create-subsets.s end
Wed May 18 18:30:20 EDT 2005 create-cross-mappings.s start
2 13 295 QAxmapssets.original
2 13 295 QAxmapssets.final
4 26 590 total
0
14099 14099 657044 QAxmaptargets.original
14099 14099 657044 QAxmaptargets.final
28198 28198 1314088 total
0
92785 92785 3058364 QAxmaps.original
92785 92785 3058364 QAxmaps.final
185570 185570 6116728 total
0
Wed May 18 18:32:32 EDT 2005 create-cross-mappings.s end
Wed May 18 18:32:32 EDT 2005 master-QA-make.s end Master QA
Last Reviewed: January 29, 2008