Character Set

U.S. National Library of Medicine
NLMCatalogRecord data in XML Format

About the Character Set used in NLMCatalogRecord:

XML-formatted NLM catalog records issued in the NLMCatalogRecordSet chiefly contain standard Latin characters but may also contain spacing or non-spacing diacritical marks, subscripts, superscripts and other special characters defined for use in MARC 21 records. These may occur in any non-numeric field where called for to accurately record the data. However, the Chinese, Japanese and Korean characters entered using the East Asian Character Code (EACC) in the MARC-8 environment are not currently included in NLMCatalogRecordSet records.

The XML file uses UTF-8 encoding (from ISO/IEC 10646 and the Unicode Standard; see Unicode for more information on Unicode and UTF-8 encoding). The UTF-8 encoded data is in Unicode Normalization Form C (see Unicode Technical Report #15), which uses Unicode precomposed (composite) characters. This approach is consistent with the direction of the World Wide Web Consortium as described in Character Model for the World Wide Web.
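As a rough illustration of what this means for a consumer of the data, the sketch below checks whether a byte string is valid UTF-8 and already in Form C. It uses only Python's standard `unicodedata` module; the function name `is_utf8_nfc` and the sample strings are hypothetical and are not part of any NLM tooling.

```python
import unicodedata


def is_utf8_nfc(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 and its text is already in Form C (NFC).

    Hypothetical helper for illustration only.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Text is in NFC exactly when normalizing it to NFC changes nothing.
    return unicodedata.normalize("NFC", text) == text


# Sample values (assumptions, not taken from an actual record):
# the same word with a precomposed e-acute and with e + combining acute.
composed = "Universit\u00e9".encode("utf-8")
decomposed = "Universite\u0301".encode("utf-8")

print(is_utf8_nfc(composed))
print(is_utf8_nfc(decomposed))
```

A check of this kind could be run on element text after parsing, assuming the parser returns the decoded character data unchanged.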

Normalization Form C was adopted for NLMCatalogRecordSet in order to conform with NLM XML records distributed from MEDLINE. Form C differs from the "decomposed" Form D, which is currently defined for expression of MARC 21-formatted records in UTF-8/Unicode. In Form C, a letter and its diacritic are combined into a single precomposed character wherever Unicode defines one. In Form D, the diacritic is encoded AFTER the letter it modifies; for more information see: //www.loc.gov/marc/specifications/speccharintro.html
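The difference between the two forms can be seen by listing code points. The following is a minimal sketch using Python's standard `unicodedata` module; the helper names `to_form_c`, `to_form_d` and `code_points` are assumptions made for this example, and the sample character is illustrative rather than drawn from a record.

```python
import unicodedata


def to_form_c(text: str) -> str:
    """Compose letter + combining mark sequences where Unicode allows (NFC)."""
    return unicodedata.normalize("NFC", text)


def to_form_d(text: str) -> str:
    """Decompose precomposed characters into letter + combining marks (NFD)."""
    return unicodedata.normalize("NFD", text)


def code_points(text: str) -> list[str]:
    """List each character as a U+XXXX code point label."""
    return [f"U+{ord(ch):04X}" for ch in text]


composed = "\u00fc"               # u with diaeresis as one precomposed character
decomposed = to_form_d(composed)  # base letter followed by combining diaeresis

print(code_points(composed))      # ['U+00FC']
print(code_points(decomposed))    # ['U+0075', 'U+0308']
print(to_form_c(decomposed) == composed)  # True
```

Software that exchanges data between NLMCatalogRecordSet XML and MARC 21 Unicode records might apply a conversion like this at the boundary, so that string comparisons are made in a single form.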

Because of the large number of characters which could conceivably occur, NLM will not attempt to provide a complete list of characters possible in NLMCatalogRecordSet.

Last Reviewed: May 11, 2020