Precomposed character

A precomposed character (alternatively composite character or decomposable character) is a Unicode entity that can also be defined as a sequence of two or more other characters. A precomposed character typically represents a letter with a diacritical mark, such as é (Latin small letter e with acute accent). Technically, é (U+00E9) is a character that can be decomposed into an equivalent string consisting of the base letter e (U+0065) followed by the combining acute accent (U+0301). Similarly, ligatures can be regarded as precompositions of their constituent letters or graphemes.
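As a minimal sketch of the decomposition described above, the following Python snippet uses the standard unicodedata module to split é into its base letter and combining accent. The variable names are illustrative only. The ligature line assumes U+FB01 (ﬁ) as an example; note that it has a compatibility decomposition rather than a canonical one, so it needs NFKD instead of NFD.

```python
import unicodedata

precomposed = "\u00e9"  # é, LATIN SMALL LETTER E WITH ACUTE

# Canonical decomposition (NFD) splits the letter into base + combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(code_points)  # ['U+0065', 'U+0301']

# The raw decomposition mapping stored in the Unicode character database.
mapping = unicodedata.decomposition(precomposed)
print(mapping)  # 0065 0301

# Illustrative ligature: U+FB01 only has a compatibility decomposition,
# so NFD leaves it untouched while NFKD expands it to "fi".
ligature = "\ufb01"
ligature_nfd = unicodedata.normalize("NFD", ligature)
ligature_nfkd = unicodedata.normalize("NFKD", ligature)
print(ligature_nfd == ligature, ligature_nfkd)  # True fi
```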

Precomposed characters are the legacy solution for representing many special letters in various character sets. In Unicode they are included primarily to aid computer systems with incomplete Unicode support, where equivalent decomposed characters may render incorrectly.

Contents

1 Comparing precomposed and decomposed characters

2 Chinese characters

3 See also

4 Sources

5 External links

Comparing precomposed and decomposed characters

In the following example, the common Swedish surname Åström is written in two alternative ways: first with a precomposed Å (U+00C5) and ö (U+00F6), and then with a decomposed base letter A (U+0041) followed by a combining ring above (U+030A) and an o (U+006F) followed by a combining diaeresis (U+0308).

Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D)

Åström (U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D)

The two forms are canonically equivalent and should render identically. In practice, however, some Unicode implementations still have difficulties with decomposed characters. In the worst case, combining diacritics may be disregarded or rendered as unrecognized characters after their base letters, because not all fonts include them. To work around these problems, some applications simply attempt to replace decomposed characters with the equivalent precomposed characters.
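The equivalence of the two spellings, and the replacement of decomposed sequences by precomposed characters, can be sketched with Python's standard unicodedata module. This is only an illustration using NFC normalization; it is not a claim about how any particular application performs the replacement.

```python
import unicodedata

# The two spellings of the surname, written as explicit code points.
precomposed = "\u00c5str\u00f6m"      # U+00C5 s t r U+00F6 m
decomposed = "A\u030astro\u0308m"     # A U+030A s t r o U+0308 m

# As raw code point sequences the strings differ.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 6 8

# NFC composes base letters with their combining marks where a
# precomposed character exists, yielding the first spelling.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed)          # True

# NFD goes the other way and yields the second spelling.
redecomposed = unicodedata.normalize("NFD", precomposed)
print(redecomposed == decomposed)         # True
```

Comparing strings only after normalizing both sides to the same form avoids treating the two spellings as different names.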

With an incomplete font, however, precomposed characters may also be problematic – especially if they are more exotic, as in the following example (showing the reconstructed Proto-Indo-European word for 'dog'):

ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)

ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)

In some situations, the precomposed k, u and o with diacritics on the first line may render as unrecognized characters, or their typographical appearance may differ markedly from that of the final letter n, which has no diacritic. On the second line, the base letters should at least render correctly even if the combining diacritics are not recognized.
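The fallback described above, where the base letters survive even if the diacritics are lost, can be approximated in code. The sketch below assumes Python's standard unicodedata module: it decomposes the word and separates base letters from combining marks. Dropping the marks is shown purely as an illustration of what a renderer without the diacritics would still have available, not as a recommended way to process text.

```python
import unicodedata

word = "\u1e31\u1e77\u1e53n"  # precomposed spelling: U+1E31 U+1E77 U+1E53 U+006E

# Canonical decomposition reproduces the second spelling from the example.
decomposed = unicodedata.normalize("NFD", word)
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(code_points)

# unicodedata.combining() returns a non-zero combining class for marks,
# so it can be used to tell base letters and diacritics apart.
base_letters = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
marks = [f"U+{ord(ch):04X}" for ch in decomposed if unicodedata.combining(ch)]
print(base_letters)  # kuon
print(marks)         # ['U+0301', 'U+032D', 'U+0304', 'U+0301']
```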

OpenType provides the ccmp feature tag (glyph composition/decomposition), which lets a font define glyphs that are compositions or decompositions involving combining characters.

Chinese characters

In theory, most Chinese characters as encoded by Han unification and similar schemes could be treated as precomposed characters, since they can be reduced (decomposed) to their constituent strokes and ideograph descriptions. Unicode does not take this approach, which would be a radical departure in text storage and layout. Such an approach could potentially reduce the number of characters in the character set from tens of thousands to just a few hundred. On the other hand, a character set encoded in this way would produce documents roughly ten times larger in bytes than Unicode needs to represent the same characters.
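The storage trade-off can be illustrated on a small scale with an ideographic description sequence. The sketch below is an assumption-laden example: it hand-pairs the ideograph 好 (U+597D) with a description built from the left-to-right description character (U+2FF0) and the components 女 (U+5973) and 子 (U+5B50). Unicode does not treat the two as equivalent, no normalization form converts between them, and this is a component-level description rather than a stroke-level one, so it only hints at the size growth and does not measure the tenfold figure.

```python
# A single encoded ideograph versus a hand-written description of it.
ideograph = "\u597d"                 # 好
description = "\u2ff0\u5973\u5b50"   # ⿰ + 女 + 子 (illustrative pairing)

# Compare the storage cost in UTF-8.
ideograph_bytes = len(ideograph.encode("utf-8"))
description_bytes = len(description.encode("utf-8"))
print(ideograph_bytes, description_bytes)  # 3 9

# The strings are distinct: nothing in Unicode maps one to the other.
print(ideograph == description)  # False
```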

See also

Dead key

Compose key

Combining character

Unicode equivalence

Complex text layout

Unicode compatibility characters

Sources

The Unicode Standard, Version 5.2: Conformance (see Section 3.7 for Decomposition). The Unicode Consortium, December 2009.

Aaron Weiss: Composite and Precomposed Characters. Web Developer's Virtual Library. February 20, 2001.

MSDN: Defining a Character Set. April 8, 2010.

External links

Free Idg Serif, a derivative of the FreeSerif font with added declarations of precomposed characters.

