David Sherertz, Apelon, Inc.
Executive Summary
The NLM received 9 SNOMED CT source files in the January 2005 release, and 3 Spanish translation SNOMED CT source files in the October 2004 release, from the College of American Pathologists (CAP) for inclusion in the 2005AB release of the UMLS Metathesaurus. The 2005AB release is the fifth Metathesaurus release to appear in the Rich Release Format (RRF), and the second to include SNOMED Spanish translations. An important principle of the inversion file structure is “source transparency”. Source transparency implies that, independent of the Metathesaurus value-added attributes, the original SNOMED CT source files can be recreated ‘exactly’ from the information in the *.src inversion files. For the January 2005 / October 2004 SNOMED CT source files and the 2005AB *.src inversion files, this recreation has been demonstrated successfully.
Abstract
One central principle for the incorporation of a source vocabulary into the UMLS is source transparency. This implies that there is no loss of information in the inversion and insertion process. Every element of information contained in the source is included in the inversion files, even though it may be represented in a format different from that of the original source files, and some of the information may not yet be released in the UMLS (for example, inactive SNOMED CT concepts / descriptions, and relationships that include them). The data structure of SNOMED CT is complex, and some of its information pertains to particular atoms (specific occurrences of a string or concept name), rather than to a SNOMED CT concept as a whole or to all instances of a string in the vocabulary. Each release of SNOMED CT is inverted fully, although only the changes are marked as needing review in the insertion of the inversion files. All unchanged information is considered as a safe replacement, and only the source version label of the release is updated in the unchanged information.
Starting with the January 2005 and October 2004 SNOMED CT source files received from the CAP, all information was extracted into source-derived files. These files formed the basis for comparison. All SNOMED CT information from the 2005AB *.src inversion files was extracted into UMLS-derived files and converted to a format identical to the source-derived files. Direct row-by-row comparisons between the source-derived and UMLS-derived files proved they are identical. This verified that all information contained in the source can be retrieved from the inversion files, and that source transparency has been achieved again in the incorporation of SNOMED CT into the UMLS Metathesaurus.
Introduction
The RRF extensions to the Metathesaurus release files are an enhancement of the original release format (ORF). These extensions enable all Metathesaurus sources, including SNOMED CT and other new sources such as the NCI Thesaurus, to be represented transparently. Thus, a critical quality assurance check is to see if the SNOMED CT information can be extracted from the 2005AB source inversion files and used exclusively to reproduce the original SNOMED CT files as received from the CAP. Appendix A shows the file names, field names, and sizes of each of these 12 SNOMED CT files. (They are the ones in the list that begin with ‘sct’ and end with ‘.txt’.) Appendix A also shows the successful result of the process described in the remaining sections, which elaborate on the recreation details. The 12 SNOMED CT files are compared to versions derived from the *.src inversion files. In each comparison, multiple blanks and vertical bars are removed and the files are sorted; this is done for both the original and inversion-derived files. There are no differences in any of the files; this defines what is meant by the files being ‘exactly’ identical.
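The comparison rule described above can be illustrated with a minimal Python sketch. It is not the csh/nawk code used in the actual scripts, and it assumes one reading of the rule: vertical bars are deleted, runs of blanks are collapsed to a single blank, and the rows are sorted before comparison. The function names normalize and identical are hypothetical.

```python
import re

def normalize(lines):
    """Delete vertical bars, collapse runs of blanks, and sort the rows.

    Assumed interpretation of "multiple blanks and vertical bars are removed".
    """
    cleaned = []
    for line in lines:
        line = line.rstrip("\n").replace("|", "")
        line = re.sub(r" {2,}", " ", line)
        cleaned.append(line)
    return sorted(cleaned)

def identical(original_lines, derived_lines):
    """True when both files contain the same rows after normalization."""
    return normalize(original_lines) == normalize(derived_lines)

# Row order and blank runs differ, but the content is the same.
print(identical(["b  c|d\n", "a|x\n"], ["a|x\n", "b c|d\n"]))
```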
Methods
The only UNIX commands used in any of the scripts are nawk, join, sed, sort, set, diff, wc, rm, touch and echo. All of the scripts are csh and are called from a single shell script, called the Master Script, which in turn runs a sequence of scripts in the order described in the steps below.
1) The Master Script first makes the ‘original’ SNOMED CT files. It does this by extracting all the concepts, descriptions, relationships, subsets, history, and cross mappings. The information in the 12 SNOMED CT files is put into 12 files with descriptive names starting with ‘QA’ and ending with ‘*.original’, as shown in Appendix A. The SNOMED CT files as received from the CAP use a TAB character as a field delimiter. This is preserved in the QA*.original files. Also, in the Cross Mapping TARGETCODES field, the SNOMED CT files use a vertical bar as a subfield delimiter. This is changed to a comma ‘,’. These slight changes are made to facilitate comparisons with the files as recreated from the 2005AB *.src inversion files.
2) The Master Script next extracts all of the pertinent rows from the *.src inversion files.
3) The Master Script then runs in sequence 6 scripts that recreate, from the *.src inversion files pulled in Step 2), the 12 SNOMED CT files corresponding to the QA*.original files created in Step 1). The files created in this step have descriptive names starting with ‘QA’ and ending with ‘*.final’, and each descriptive name is identical to that of the corresponding QA*.original file made in Step 1). In each of these 6 scripts, after making its QA*.final file(s) (some scripts make more than one QA*.final file), the UNIX wc command is run on the QA*.original and QA*.final files, followed by a diff command piped into wc -l. A successful recreation is achieved when the wc counts are identical and the diff with wc -l returns 0.
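The success test in Step 3) can be sketched in Python as follows. This is an illustration only, not the actual scripts: wc_counts mimics the three numbers that wc reports, and differing_rows is a simple row-by-row stand-in for diff piped into wc -l (it is zero exactly when the files match, but it does not reproduce the line count that diff would print for files that differ). All function names are hypothetical.

```python
from itertools import zip_longest

def wc_counts(text):
    """Return (lines, words, bytes), the three counts printed by wc."""
    data = text.encode("utf-8")
    return (data.count(b"\n"), len(data.split()), len(data))

def differing_rows(original_text, final_text):
    """Count row positions at which the two files disagree."""
    pairs = zip_longest(original_text.splitlines(), final_text.splitlines())
    return sum(1 for left, right in pairs if left != right)

def recreated(original_text, final_text):
    """Success criterion: identical wc counts and no differing rows."""
    return (wc_counts(original_text) == wc_counts(final_text)
            and differing_rows(original_text, final_text) == 0)

print(wc_counts("a b\nc\n"))
print(recreated("1\tx\n2\ty\n", "1\tx\n2\ty\n"))
```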
The approach in creating most of the QA*.final files in the 6 scripts of this step is to make a series of ‘triples’ for the fields that will become the QA*.final file. The triples have the operative ID (concept, description, relationship, subset, cross mapping) as the first field, the field number in the QA*.final file as the second field, and the field value as the third field. Then a sort and simple nawk script reassembles the QA*.final rows from these triples. For some of the QA*.final files, this approach can be simplified, and the QA*.final file can be made by just extracting and reordering the appropriate fields directly from the pertinent *.src inversion file (for example, the Cross Mappings file).
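A minimal Python sketch of the triples approach is given below. It stands in for the sort and nawk step and is not the actual script. It assumes that the ID is itself emitted as a triple for its own field position and that missing fields become empty strings; the IDs and values in the example are made up.

```python
from collections import defaultdict

def assemble_rows(triples, field_count, delimiter="\t"):
    """Rebuild rows from (id, field_number, value) triples.

    Field numbers are 1-based; fields with no triple are left empty
    (an assumption made for this sketch).
    """
    fields_by_id = defaultdict(dict)
    for component_id, field_number, value in triples:
        fields_by_id[component_id][field_number] = value
    rows = []
    for component_id in sorted(fields_by_id):
        fields = fields_by_id[component_id]
        rows.append(delimiter.join(
            fields.get(number, "") for number in range(1, field_count + 1)))
    return rows

# Made-up triples for two three-field rows, supplied out of order.
triples = [
    ("1001", 3, "Example finding (finding)"),
    ("1001", 1, "1001"),
    ("1002", 2, "1"),
    ("1001", 2, "0"),
    ("1002", 1, "1002"),
    ("1002", 3, "Other example (disorder)"),
]
for row in assemble_rows(triples, 3):
    print(row)
```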
In debugging the 6 scripts in Step 3), it is helpful to create and look over intermediate files during the process of making the QA*.final file to isolate and fix whatever problem is occurring. For the final run, each script removes all of the intermediate files, and only leaves the QA*.original file and QA*.final file when there is a discrepancy between them; a clean run of the Master Script will leave no files. All 6 of the scripts in this step are completely self-contained. One assumption made in making the triples is that the field names for the SNOMED CT files in the *.src files are unique if the name is shortened to its first letter and last five letters. That is currently true, but must be re-verified in the future when new field names are added to SNOMED CT.
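The uniqueness assumption about shortened field names could be re-verified with a check along the following lines. This Python sketch is illustrative; the shortening rule is taken from the text (first letter plus last five letters), while the function names and the colliding field name MEMBERSETTYPE are hypothetical.

```python
from collections import defaultdict

def shorten(name):
    """First letter followed by the last five letters of the field name."""
    return name[0] + name[-5:]

def find_collisions(field_names):
    """Map each shortened name to the field names that share it, if more than one."""
    groups = defaultdict(set)
    for name in field_names:
        groups[shorten(name)].add(name)
    return {short: sorted(names)
            for short, names in groups.items() if len(names) > 1}

# Four SNOMED CT field names that all end in STATUS remain distinct.
status_fields = ["CONCEPTSTATUS", "DESCRIPTIONSTATUS",
                 "INITIALCAPITALSTATUS", "MEMBERSTATUS"]
print(find_collisions(status_fields))

# A made-up future field name that would collide with MAPSETTYPE.
print(find_collisions(["MAPSETTYPE", "MEMBERSETTYPE"]))
```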
Results
Appendix A shows a single run of the Master Script described in Methods that first extracts the QA*.original files from the SNOMED CT files as received from the CAP; for each of the 12 files, it shows the wc counts for the QA*.original file. It also shows the output from running the rest of the Master Script. As can be seen, the wc (record/line, word, character) counts are all identical for each of the 12 files, and the diff shows the files are identical. The entire Master Script, starting with just the SNOMED CT files as received, and the 2005AB *.src inversion files, takes around 90 minutes to complete successfully on a small-sized Solaris machine.
Discussion
The 6 scripts that make the 12 QA*.final files vary in their degree of complexity. Interesting details about fine points will be elaborated in this section for each QA*.final file. In the sections below, whenever a reference is made to a specific *.src inversion file, the .src suffix is omitted, but the file name is shown in italics. Below the descriptive name of each file are the row(s) of the SNOMED CT field names for that file, as some of these names are mentioned in the discussion. For brevity, the field delimiter is shown as a vertical bar; in the actual files, the delimiter is a TAB character.
Concepts
CONCEPTID|CONCEPTSTATUS|FULLYSPECIFIEDNAME|CTV3ID|SNOMEDID|ISPRIMITIVE
This QA*.final file is made by first getting all of the rows in the Classes_Atoms file with a term group of FN or OF. Next, the DESCRIPTIONSTATUS field is used to determine which of the FN or OF rows is the FULLYSPECIFIEDNAME, as some concepts can have more than one. All the other fields are then extracted from the Attributes file. With these triples, the QA*.final file is made and compared to the QA*.original file.
Descriptions (2)
DESCRIPTIONID|DESCRIPTIONSTATUS|CONCEPTID|TERM|INITIALCAPITALSTATUS|
DESCRIPTIONTYPE|LANGUAGECODE
A separate QA*.final file is made for the U.S. and Spanish descriptions. Note that for each language, there is a separate Classes_Atoms and Attributes file. Each QA*.final file is also relatively simple to make: all of the rows are taken from the Classes_Atoms file for each language, with two exceptions. The ‘description’ records made for the Cross Mappings and Subsets (term types XM and SB, respectively) are ignored. Then, all of the other fields are pulled from the Attributes file for each language. With these triples, each QA*.final file is made and compared to the QA*.original file.
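The exclusion of the XM and SB records can be sketched in Python as below. The row layout (description ID, term type, term) is a hypothetical simplification and not the actual Classes_Atoms format; the IDs and terms are made up.

```python
# Term types whose 'description' records are not real SNOMED CT descriptions.
IGNORED_TERM_TYPES = {"XM", "SB"}

def description_rows(atom_rows):
    """Keep only rows whose term type is not XM or SB.

    atom_rows is an iterable of (description_id, term_type, term) tuples,
    a hypothetical layout used only for this sketch.
    """
    return [row for row in atom_rows if row[1] not in IGNORED_TERM_TYPES]

atoms = [
    ("D1", "FN", "Example finding (finding)"),
    ("D2", "XM", "Example cross map set"),
    ("D3", "SB", "Example subset"),
    ("D4", "OF", "Older example name"),
]
print(description_rows(atoms))
```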
Relationships
RELATIONSHIPID|CONCEPTID1|RELATIONSHIPTYPE|CONCEPTID2|CHARACTERISTICTYPE|
REFINABILITY|RELATIONSHIPGROUP
This QA*.final file is a little more complex. First of all, the ISA relationships are not in the Relationships inversion file; they have to be retrieved from the Treepos.dat file used to build SNOMED CT contexts as part of the inversion. Secondly, the inversion Atom IDs must be converted back to SNOMED CT CONCEPTIDs, as this is what is used in the SNOMED CT Relationships file. Finally, getting the SNOMED CT field RELATIONSHIPTYPE involves first getting the UMLSRELAs with their type from the Attributes file, matching the UMLSRELA string from the Relationship rows, and finally replacing the string with the corresponding RELATIONSHIPTYPE. The remaining fields are pulled from the Attributes file. With these triples, the QA*.final file is made and compared to the QA*.original file.
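The two lookups described above (Atom ID back to CONCEPTID, and UMLSRELA string back to RELATIONSHIPTYPE) can be sketched in Python as follows. The row layout, the function name, and every identifier in the example are made up for illustration; the actual inversion files carry more fields.

```python
def restore_relationship(row, atom_to_concept, rela_to_type):
    """Convert one inversion relationship row back to SNOMED CT identifiers.

    row is (relationship_id, atom_id_1, umls_rela, atom_id_2), a hypothetical
    layout. The two dictionaries are assumed to have been built beforehand
    from the Classes_Atoms and Attributes rows.
    """
    relationship_id, atom_1, rela, atom_2 = row
    return (relationship_id,
            atom_to_concept[atom_1],
            rela_to_type[rela],
            atom_to_concept[atom_2])

# Made-up lookup tables and a made-up relationship row.
atom_to_concept = {"A1": "C100", "A2": "C200"}
rela_to_type = {"example_rela": "TYPE-1"}
print(restore_relationship(("R1", "A1", "example_rela", "A2"),
                           atom_to_concept, rela_to_type))
```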
Note that beginning with the July 2004 SNOMED CT release, the historical relationships are included in the single relationships file; prior to this, the historical relationships were distributed in a separate file with the same format as the relationships file. The only way of separating active relationships from historical relationships is that the historical relationships always have a CHARACTERISTICTYPE value equal to 2.
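The separation rule can be sketched in Python as below. It relies only on the fact stated above that historical relationships have a CHARACTERISTICTYPE of 2, and on CHARACTERISTICTYPE being the fifth TAB-delimited field of a relationships row. The function name and the sample rows are made up.

```python
CHARACTERISTICTYPE_INDEX = 4  # fifth field of a relationships row, 0-based
HISTORICAL = "2"

def split_relationships(rows):
    """Split TAB-delimited relationship rows into (active, historical)."""
    active, historical = [], []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if fields[CHARACTERISTICTYPE_INDEX] == HISTORICAL:
            historical.append(row)
        else:
            active.append(row)
    return active, historical

# Made-up rows: RELATIONSHIPID, CONCEPTID1, RELATIONSHIPTYPE, CONCEPTID2,
# CHARACTERISTICTYPE, REFINABILITY, RELATIONSHIPGROUP.
rows = [
    "R1\tC100\tTYPE-1\tC200\t0\t0\t0",
    "R2\tC300\tTYPE-2\tC100\t2\t0\t0",
]
active, historical = split_relationships(rows)
print(len(active), len(historical))
```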
Also note that the AQ (Allowable Qualifiers) relationships are in both the Attributes file and the Relationships file. They are made from both places, and a sort -u is done to ensure that they are identically represented in the two places.
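The effect of the sort -u step can be illustrated with a short Python sketch. It assumes the AQ rows from both files have already been rendered in the same row format; the function names and sample rows are hypothetical.

```python
def merge_unique(rows_from_attributes, rows_from_relationships):
    """Combine both row lists, drop duplicates, and sort (like sort -u)."""
    return sorted(set(rows_from_attributes) | set(rows_from_relationships))

def represented_identically(rows_from_attributes, rows_from_relationships):
    """True when neither source contributes a row the other lacks."""
    return set(rows_from_attributes) == set(rows_from_relationships)

from_attributes = ["R2\tC300\tAQ\tC100", "R1\tC100\tAQ\tC200"]
from_relationships = ["R1\tC100\tAQ\tC200", "R2\tC300\tAQ\tC100"]
print(merge_unique(from_attributes, from_relationships))
print(represented_identically(from_attributes, from_relationships))
```

When the two sources agree, the merged list has the same number of rows as either source, so any extra rows point to a discrepancy.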
Component History
COMPONENTID|RELEASEVERSION|CHANGETYPE|STATUS|REASON
This QA*.final file simply involves getting all of the rows in the Attributes file with a COMPONENTHISTORY attribute name, and printing out the appropriate fields from those rows in the order of the QA*.original file. The QA*.final file is then compared to the QA*.original file.
Subsets (2)
SUBSETID|SUBSETORIGINALID|SUBSETVERSION|SUBSETNAME|SUBSETTYPE|LANGUAGECODE|
REALMID|CONTEXTID
A separate QA*.final file is made for the U.S. and Spanish subsets. Note that for each language, there is a separate Classes_Atoms and Attributes file. The QA*.final files are each mostly made from triples extracted from the Attributes file. However, there is one row in Classes_Atoms with a TTY of ‘SB’ that contains the SUBSETNAME field. With these triples, each QA*.final file is made and compared to the QA*.original file.
Subset Members (2)
SUBSETID|MEMBERID|MEMBERSTATUS|LINKEDID
A separate QA*.final file is made for the U.S. and Spanish subset members. Note that for each language, there is a separate Attributes file. Each QA*.final file is made entirely from the Attributes file, as all of the fields appear there. Each QA*.final file is made and compared to the corresponding QA*.original file.
Cross Mappings Sets
MAPSETID|MAPSETNAME|MAPSETTYPE|MAPSETSCHEMEID|MAPSETSCHEMENAME|
MAPSETSCHEMEVERSION|MAPSETREALMID|MAPSETSEPARATOR|MAPSETRULETYPE
This QA*.final file is made almost entirely from the Attributes file, as most of the fields are separate rows in it; these are made into triples. However, there is one row in Classes_Atoms with a TTY of ‘XM’ that contains the MAPSETNAME field. With these triples, the QA*.final file is made and compared to the QA*.original file.
Cross Mapping Targets
TARGETID|TARGETSCHEMEID|TARGETCODES|TARGETRULE|TARGETADVICE
This QA*.final file is derived entirely from the XMAPTO rows in the Attributes file. The QA*.final file is made and compared to the QA*.original file.
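As noted in Methods, the vertical-bar subfield delimiter inside TARGETCODES is changed to a comma before comparison. A minimal Python sketch of that conversion is shown below; it is not the actual script, the function name is hypothetical, and the sample row is made up. TARGETCODES is the third TAB-delimited field of a Cross Mapping Targets row.

```python
TARGETCODES_INDEX = 2  # third field of a Cross Mapping Targets row, 0-based

def convert_targetcodes(row, old="|", new=","):
    """Replace the subfield delimiter inside TARGETCODES only."""
    fields = row.rstrip("\n").split("\t")
    fields[TARGETCODES_INDEX] = fields[TARGETCODES_INDEX].replace(old, new)
    return "\t".join(fields)

# Made-up row: TARGETID, TARGETSCHEMEID, TARGETCODES, TARGETRULE, TARGETADVICE.
print(convert_targetcodes("T1\tS1\tA|B\t\t"))
```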
Cross Mappings
MAPSETID|MAPCONCEPTID|MAPOPTION|MAPPRIORITY|MAPTARGETID|MAPRULE|MAPADVICE
This QA*.final file is derived entirely from the XMAP rows in the Attributes file. The QA*.final file is made and compared to the QA*.original file.
Conclusion
It is possible to recreate the 12 SNOMED CT files as received from the CAP using only the information in the Metathesaurus 2005AB *.src inversion files. While some of the steps and links require a sophisticated understanding of both the SNOMED CT fields and the UMLS Metathesaurus source inversion fields, and what each means, the process can be implemented simply and tested efficiently. This successful test of the recreation of the SNOMED CT files is a strong endorsement of the *.src and RRF structures, and proves that they do indeed support true source transparency. And, if SNOMED CT in the UMLS Metathesaurus is transparently complete, that is a good indication that any other present or future source will be demonstrably transparent when represented in the *.src file inversion structure, and the RRF distribution structure.
Appendix A – SNOMED CT / 2005AB Inversion Files
Wed Feb 23 10:53:05 PST 2005 master-create.s start
Content: sct_concepts_20050131.txt
Wed Feb 23 11:00:05 PST 2005 create-concepts.s start
303140 3146279 22295645 QAcons.original
303140 3146279 22295645 QAcons.final
606280 6292558 44591290 total
0
Wed Feb 23 11:07:51 PST 2005 create-concepts.s end
Content: sct_descriptions_20050131.txt
Wed Feb 23 11:07:51 PST 2005 create-descriptions.s start
753037 7948170 49660932 QAdesc.original
753037 7948170 49660932 QAdesc.final
1506074 15896340 99321864 total
0
Wed Feb 23 11:18:55 PST 2005 create-descriptions.s end
Content: sct_relationships_20050131.txt
History: sct_componenthistory_20050131.txt
Wed Feb 23 11:18:55 PST 2005 create-relationships-history.s start
1322856 9259992 60736300 QArels.original
1322856 9259992 60736300 QArels.final
2645712 18519984 121472600 total
0
1169216 4897999 30649970 QAhist.original
1169216 4897999 30649970 QAhist.final
2338432 9795998 61299940 total
0
Wed Feb 23 11:53:50 PST 2005 create-relationships-history.s end
Cross Mappings: sct_crossmaps_icd9_20050131.txt
Cross Mappings: sct_crossmapsets_icd9_20050131.txt
Cross Mappings: sct_crossmaptargets_icd9_20050131.txt
Wed Feb 23 11:53:50 PST 2005 create-cross-mappings.s start
2 27 293 QAxmapssets.original
2 27 293 QAxmapssets.final
4 54 586 total
0
14099 42298 642945 QAxmaptargets.original
14099 42298 642945 QAxmaptargets.final
28198 84596 1285890 total
0
92785 556711 2965579 QAxmaps.original
92785 556711 2965579 QAxmaps.final
185570 1113422 5931158 total
0
Wed Feb 23 11:59:35 PST 2005 create-cross-mappings.s end
Subsets/U.S: sct_subsetmembers_us_20050131.txt
Subsets/U.S: sct_subsets_us_20050131.txt
Subsets/Spanish: /d1/project/snmct/20041031/SNOMED CT October 2004/Subsets/sct_subsetmembers_20041031.txt
Subsets/Spanish: /d1/project/snmct/20041031/SNOMED CT October 2004/Subsets/sct_subsets_20041031.txt
Wed Feb 23 11:59:35 PST 2005 create-subsets.s start
2 18 143 QAesssets.original
2 18 143 QAesssets.final
4 36 286 total
0
688393 2065180 14303724 QAesssetmems.original
688393 2065180 14303724 QAesssetmems.final
1376786 4130360 28607448 total
0
2 19 147 QAenssets.original
2 19 147 QAenssets.final
4 38 294 total
0
725206 2175619 14558868 QAenssetmems.original
725206 2175619 14558868 QAenssetmems.final
1450412 4351238 29117736 total
0
Wed Feb 23 12:09:10 PST 2005 create-subsets.s end
Subsets/Spanish: /d1/project/snmct/20041031/SNOMED CT October 2004/Spanish Descriptions/sct_descriptions_20041031.txt
Wed Feb 23 12:09:10 PST 2005 create-es-descriptions.s start
669710 7864750 50293955 QAdesces.original
669710 7864750 50293955 QAdesces.final
1339420 15729500 100587910 total
0
Wed Feb 23 12:16:25 PST 2005 create-es-descriptions.s end
Wed Feb 23 12:16:25 PST 2005 master-create.s end
Last Reviewed: January 29, 2008