ISO/IEC 2022

ISO 2022, more formally ISO/IEC 2022 "Information Technology—Character code structure and extension techniques", is an ISO standard (equivalent to the ECMA standard ECMA-35) specifying
* a technique for including multiple character sets in a single character encoding, and
* a technique for representing character sets which cannot be represented in 7 bits.

Unlike ISO 8859 character encodings, which use 8 bits for every character, the ISO 2022 encodings are variable-size encodings, typically using either 8 or 16 bits per character. Several character encodings use ISO 2022 mechanisms. For example, ISO-2022-JP is a widely used character encoding for the Japanese language.

Introduction

Many languages or language families not based on the Latin alphabet such as Greek, Russian, Arabic, or Hebrew have historically been represented on computers with 8-bit extended ASCII encodings including the ISO 8859 family of character sets. Written East Asian languages, specifically Chinese, Japanese, and Korean, use far more characters than can be represented in an 8-bit computer byte and were first represented on computers with language-specific double byte encodings.

ISO 2022 was developed as a technique to attack both of these problems: to represent characters in multiple character sets within a single character encoding, and to represent large character sets.

Being based on ISO 646, ISO 2022 exhibits many of ISO 646's properties. For example, the most significant bit of each byte does not carry any meaning; this allows ISO 2022 (like ISO 646) to be easily transmitted through 7-bit communication channels. (This 7-bit property also forms the basis of the EUC code.)

To represent multiple character sets, the ISO 2022 character encodings include escape sequences which indicate the character set for characters which follow. The escape sequences are registered with ISO and are often three characters long starting with the ASCII ESCAPE character (hexadecimal 1B, octal 33). These character encodings require data to be processed sequentially in a forward direction since the correct interpretation of the data depends on the most recently encountered escape sequence.
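As a rough illustration of how such escape sequences appear in a byte stream, the following sketch uses Python's built-in `iso2022_jp` codec (ISO-2022-JP is one of the encodings built on ISO 2022). The sample string is an arbitrary choice made for this example.

```python
# Encode mixed ASCII and Japanese text with Python's standard "iso2022_jp" codec.
text = "abc日本語"
encoded = text.encode("iso2022_jp")

# Every byte stays within 7 bits, so the data can cross 7-bit channels.
assert all(b < 0x80 for b in encoded)

# The stream contains ESC $ B (switch to a two-byte set) before the Japanese
# characters and ESC ( B (switch back to ASCII) after them.
print(encoded)

# The same six bytes mean different things depending on the most recent
# escape sequence, which is why decoding has to proceed front to back.
two_byte_run = b"F|K\\8l"
print((b"\x1b$B" + two_byte_run + b"\x1b(B").decode("iso2022_jp"))  # three kanji
print(two_byte_run.decode("iso2022_jp"))  # six ASCII characters
```

Without the leading escape sequence, the decoder has no way of knowing that the bytes were meant as three two-byte characters.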

To represent large character sets, ISO 2022 builds on ISO 646's property that 1 byte can define 94 graphic (printable) characters (in addition to space and 33 control characters). Using two bytes, it is thus possible to represent up to 8836 (94×94) characters; and, using three bytes, up to 830584 (94×94×94) characters. For the two-byte character sets, the code point of each character is normally specified in so-called "kuten" form (sometimes called "quwei", especially when dealing with GB2312 and related standards), which specifies a zone ("ku" or "qu"), and the point ("ten") or position ("wei") of that character within the zone.
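The arithmetic above, and the relationship between a kuten pair and the two bytes that encode it, can be sketched as follows. The function names are hypothetical; the sketch assumes the usual convention that zone and point numbers run from 1 to 94 and are offset by 0x20 to land in the graphic range 0x21 to 0x7E. The sample character (zone 38, point 92 of JIS X 0208) is chosen only for illustration.

```python
def kuten_to_gl_bytes(ku: int, ten: int) -> bytes:
    """Map a 1-based (zone, point) pair to two bytes in the range 0x21-0x7E."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    return bytes([0x20 + ku, 0x20 + ten])


def gl_bytes_to_kuten(pair: bytes) -> tuple:
    """Inverse mapping: two graphic bytes back to a (zone, point) pair."""
    if len(pair) != 2 or not all(0x21 <= b <= 0x7E for b in pair):
        raise ValueError("expected two bytes in the range 0x21-0x7E")
    return pair[0] - 0x20, pair[1] - 0x20


# Capacity of two-byte and three-byte 94-character sets.
print(94 ** 2, 94 ** 3)

# Zone 38, point 92 becomes the byte pair 0x46 0x7C.
pair = kuten_to_gl_bytes(38, 92)
print(pair.hex())
```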

The escape sequences therefore declare not only which character set is in use but also, through the known properties of that character set, whether a 94-, 8836-, or 830584-character (or some other sized) encoding is being dealt with.

In practice, the escape sequences declaring the national character sets may be absent if context or convention dictates that a certain national character set is to be used. For example, RFC 1922, which defines ISO-2022-CN, allows ASCII SHIFT characters to be used instead of escape sequences.

Although the ISO 2022 character sets are still in common use, particularly ISO-2022-JP, most modern e-mail applications are moving to the simpler Unicode character encodings such as UTF-8.

Code structure

ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. Escape sequences allow any of a large registry of graphic character sets to be "designated" into one of four working sets, named G0 through G3, and shorter control sequences specify the working set that is "invoked" to interpret bytes in the stream.

Character codes from the 7-bit ASCII graphic range (0x20–0x7F) are referred to as "GL" codes, being on the left side of a character code table, while codes from the "high ASCII" range (0xA0–0xFF), if available, are referred to as the "GR" codes.

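By default, GL codes specify G0 characters and GR codes specify G1 characters, but this may be modified with shift control codes: locking shifts (such as SI and SO) change the invoked set until the next shift, while the single shifts SS2 and SS3 invoke G2 or G3 for the following character only.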
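The shift functions that change these invocations can be summarized in a small table. It is written here as a Python dictionary purely for illustration: the dictionary name and layout are assumptions of this sketch, while the codings are those given in ECMA-35 as far as this summary is aware.

```python
# Shift functions: name -> (coding, effect).
# SS2 and SS3 are shown in their 7-bit forms; in an 8-bit code they may
# instead be sent as the single C1 bytes 0x8E and 0x8F.
SHIFT_FUNCTIONS = {
    "LS0 (SI)": (b"\x0f", "GL invokes G0 until the next locking shift"),
    "LS1 (SO)": (b"\x0e", "GL invokes G1 until the next locking shift"),
    "LS2": (b"\x1bn", "GL invokes G2 until the next locking shift"),
    "LS3": (b"\x1bo", "GL invokes G3 until the next locking shift"),
    "LS1R": (b"\x1b~", "GR invokes G1 until the next locking shift"),
    "LS2R": (b"\x1b}", "GR invokes G2 until the next locking shift"),
    "LS3R": (b"\x1b|", "GR invokes G3 until the next locking shift"),
    "SS2": (b"\x1bN", "next character only is taken from G2"),
    "SS3": (b"\x1bO", "next character only is taken from G3"),
}

for name, (coding, effect) in SHIFT_FUNCTIONS.items():
    print(f"{name:9} {coding.hex(' '):6} {effect}")
```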

Each of the four working sets may be a 94-character set or a 94^n-character set. Additionally, G1 through G3 may be a 96- or 96^n-character set. When one of the latter is invoked in the GL region, the space and delete characters (codes 0x20 and 0x7F) are not available.

There are additional (rarely used) features for switching control character sets, but this is a single-level lookup: the 0x00–0x1F range is the C0 control character set, the 0x80–0x9F range is the C1 control character set, and there are escape sequences which switch in various alternatives. It is required that any C0 character set include the ESC character at position 0x1B, so that further changes are possible.
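The four byte ranges described in this section can be captured in a short classifier. This is only a sketch of the partition; the function name is hypothetical.

```python
def region(byte: int) -> str:
    """Classify a byte value into the C0, GL, C1 or GR area of the code table."""
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("byte value out of range")
    if byte <= 0x1F:
        return "C0"  # control characters, including ESC at 0x1B
    if byte <= 0x7F:
        return "GL"  # left-hand graphic area (7-bit)
    if byte <= 0x9F:
        return "C1"  # second control area (8-bit codes only)
    return "GR"      # right-hand graphic area (8-bit codes only)


for value in (0x1B, 0x20, 0x7F, 0x8E, 0xA0, 0xFF):
    print(f"0x{value:02X} -> {region(value)}")
```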

Single control characters from the C1 control character set may be invoked using only 7 bits using the sequences ESC 0x40 (@) through ESC 0x5F (_); for example, SS2 (0x8E) may be sent as ESC N and SS3 (0x8F) as ESC O. Additional control functions are assigned in the range ESC 0x60 (`) through ESC 0x7E (~). While this article describes escape sequences using the corresponding ASCII characters, they are actually defined in terms of byte values, and the graphic assigned to that byte value may be altered without affecting the control sequence.
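The correspondence between an 8-bit C1 control byte and its 7-bit escape sequence is a fixed offset of 0x40, which can be sketched as follows (the function names are hypothetical):

```python
ESC = 0x1B


def c1_to_escape(c1: int) -> bytes:
    """Return the 7-bit two-byte form (ESC + 0x40..0x5F) of a C1 control byte."""
    if not 0x80 <= c1 <= 0x9F:
        raise ValueError("not a C1 control byte")
    return bytes([ESC, c1 - 0x40])


def escape_to_c1(seq: bytes) -> int:
    """Return the 8-bit C1 control byte represented by an ESC 0x40..0x5F pair."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a 7-bit C1 escape sequence")
    return seq[1] + 0x40


print(c1_to_escape(0x8E))            # SS2 as ESC N
print(c1_to_escape(0x8F))            # SS3 as ESC O
print(hex(escape_to_c1(b"\x1bA")))   # ESC A stands for the C1 code 0x81
```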

Escape sequences to designate character sets take the form ESC "I" ["I"...] "F", where there are one or more intermediate "I" bytes from the range 0x20–0x2F, and a final "F" byte from the range 0x40–0x7E. (The range 0x30–0x3F is reserved for private-use "F" bytes.) The "I" bytes identify the type of character set and the working set it is to be designated to, while the "F" byte identifies the character set itself.

Note that the registry of "F" bytes is independent for the different types. The 94-character graphic set designated by ESC ( A through ESC + A is not related in any way to the 96-character set designated by ESC - A through ESC / A. And neither of those is related to the 94^n-character set designated by ESC $ ( A through ESC $ + A, and so on; the final bytes must be interpreted in context. (Indeed, without any intermediate bytes, ESC A is a way of specifying the C1 control code 0x81.)

Also note that C0 and C1 control character sets are independent; the C0 control character set designated by ESC ! A (which happens to be the NATS control set for newspaper text transmission) is not the same as the C1 control character set designated by ESC " A (the CCITT attribute control set for Videotex).

Additional "I" bytes may be added before the "F" byte to extend the "F" byte range. This is currently only used with 94-character sets, where codes of the form ESC ( ! "F" have been assigned. At the other extreme, no multibyte 96-sets have been registered, so the sequences for designating them (ESC $ - "F" through ESC $ / "F") are strictly theoretical.

ISO 2022 character sets

Character encodings using the ISO 2022 mechanism include:
* ISO-2022-JP. A widely used encoding for Japanese. Starts in ASCII and includes the following escape sequences
** ESC ( B to switch to ASCII (1 byte per character)
** ESC ( J to switch to JIS X 0201-1976 (ISO 646:JP) Roman set (1 byte per character)
** ESC $ @ to switch to JIS X 0208-1978 (2 bytes per character)
** ESC $ B to switch to JIS X 0208-1983 (2 bytes per character)
* ISO-2022-JP-1. The same as ISO-2022-JP with one additional escape sequence
** ESC $ ( D to switch to JIS X 0212-1990 (2 bytes per character)
* ISO-2022-JP-2. A multilingual extension of ISO-2022-JP. The same as ISO-2022-JP-1 with the following additional escape sequences
** ESC $ A to switch to GB 2312-1980 (2 bytes per character)
** ESC $ ( C to switch to KS X 1001-1992 (2 bytes per character)
** ESC - A to switch to ISO 8859-1 high part, Extended Latin 1 set (1 byte per character)
** ESC - F to switch to ISO 8859-7 high part, Basic Greek set (1 byte per character)
* ISO-2022-JP-3. The same as ISO-2022-JP with three additional escape sequences
** ESC ( I to switch to JIS X 0201-1976 Kana set (1 byte per character)
** ESC $ ( O to switch to JIS X 0213-2000 Plane 1 (2 bytes per character)
** ESC $ ( P to switch to JIS X 0213-2000 Plane 2 (2 bytes per character)
* ISO-2022-JP-2004. The same as ISO-2022-JP-3 with one additional escape sequence
** ESC $ ( Q to switch to JIS X 0213-2004 Plane 1 (2 bytes per character)
* ISO-2022-KR. An encoding for Korean.
** ESC $ ( C to switch to KS X 1001-1992 [http://examples.oreilly.com/cjkvinfo/AppL/ksx1001.pdf KS X 1001:1992], previously named KS C 5601-1987 [http://www.itscj.ipsj.or.jp/ISO-IR/149.pdf KS C 5601:1987, 1988-10-01] (2 bytes per character)
* ISO-2022-CN. An encoding for Chinese.
** ESC $ ( A to switch to GB 2312-1980 (2 bytes per character)
** ESC $ ( G to switch to CNS 11643-1992 Plane 1 (2 bytes per character)
** ESC $ ( H to switch to CNS 11643-1992 Plane 2 (2 bytes per character)
* ISO-2022-CN-EXT. The same as ISO-2022-CN with six additional escape sequences
** ESC $ ( E to switch to ISO-IR-165 (2 bytes per character)
** ESC $ ( I to switch to CNS 11643-1992 Plane 3 (2 bytes per character)
** ESC $ ( J to switch to CNS 11643-1992 Plane 4 (2 bytes per character)
** ESC $ ( K to switch to CNS 11643-1992 Plane 5 (2 bytes per character)
** ESC $ ( L to switch to CNS 11643-1992 Plane 6 (2 bytes per character)
** ESC $ ( M to switch to CNS 11643-1992 Plane 7 (2 bytes per character)
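To show how a decoder might track these escape sequences, here is a minimal sketch that splits a basic ISO-2022-JP byte stream into runs of characters, using only the four escape sequences listed for ISO-2022-JP above. The function and table names are hypothetical, and the sketch ignores details a real decoder must handle (such as control characters inside a two-byte run); it does not map bytes to actual characters.

```python
# Escape sequence -> (character set name, bytes per character).
ESCAPES = {
    b"\x1b(B": ("ASCII", 1),
    b"\x1b(J": ("JIS X 0201-1976 Roman", 1),
    b"\x1b$@": ("JIS X 0208-1978", 2),
    b"\x1b$B": ("JIS X 0208-1983", 2),
}


def split_runs(data: bytes) -> list:
    """Return a list of (character set name, [character bytes, ...]) runs."""
    runs = []
    name, width = "ASCII", 1  # ISO-2022-JP starts in ASCII
    current = []
    i = 0
    while i < len(data):
        if data[i] == 0x1B:
            seq = data[i:i + 3]
            if seq not in ESCAPES:
                raise ValueError(f"unsupported escape sequence {seq!r}")
            if current:
                runs.append((name, current))
                current = []
            name, width = ESCAPES[seq]
            i += 3
        else:
            chunk = data[i:i + width]
            if len(chunk) < width:
                raise ValueError("truncated multi-byte character")
            current.append(chunk)
            i += width
    if current:
        runs.append((name, current))
    return runs


for charset, chars in split_runs(b"abc\x1b$BF|K\\8l\x1b(B"):
    print(charset, chars)
```

The character width is part of the decoder's state, so the stream cannot be split correctly starting from an arbitrary offset.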

The character after ESC (for single-byte character sets) or ESC $ (for multi-byte character sets) specifies the type of character set and the working set to which it is designated. In the above examples, the character ( (0x28) designates a 94-character set to the G0 character set. This may be replaced by ), * or + (0x29–0x2B) to designate to the G1–G3 character sets.

Two of the codes above are 96-character codes, and in the above examples, the character - (0x2D) designates to the G1 character set. This may be replaced with . or / (0x2E or 0x2F) to designate to the G2 or G3 character sets. As mentioned earlier, a 96-character set may not be designated to the G0 set.

There are three special cases for multi-byte codes. The code sequences ESC $ @, ESC $ A, and ESC $ B were all registered before the ISO 2022 standard was finalized, so they must be accepted as synonyms for the sequences ESC $ ( @ through ESC $ ( B to designate to the G0 character set. The latter form may also be used, and may be adapted by changing the ( character to designate to the G1 through G3 character sets.
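The designation rules described above can be sketched as a small parser. The function name and the shape of its return value are assumptions of this example, and the sketch does not validate the range of the final byte.

```python
# Intermediate byte -> working set, for 94- and 96-character sets.
G94 = {0x28: "G0", 0x29: "G1", 0x2A: "G2", 0x2B: "G3"}  # ( ) * +
G96 = {0x2D: "G1", 0x2E: "G2", 0x2F: "G3"}              # - . /


def parse_designation(seq: bytes) -> tuple:
    """Return (set type, working set, identifying bytes) for a designation."""
    if len(seq) < 3 or seq[0] != 0x1B:
        raise ValueError("not a designation escape sequence")
    body = seq[1:]
    multibyte = body[0] == 0x24  # "$" marks a multi-byte set
    if multibyte:
        body = body[1:]
    suffix = "^n" if multibyte else ""
    # Legacy forms ESC $ @, ESC $ A and ESC $ B designate to G0.
    if multibyte and body in (b"@", b"A", b"B"):
        return ("94^n", "G0", body.decode("ascii"))
    if len(body) < 2:
        raise ValueError("missing final byte")
    selector, ident = body[0], body[1:].decode("ascii")
    if selector in G94:
        return ("94" + suffix, G94[selector], ident)
    if selector in G96:
        return ("96" + suffix, G96[selector], ident)
    raise ValueError("not a graphic set designation")


for sample in (b"\x1b(B", b"\x1b$B", b"\x1b$(D", b"\x1b-A", b"\x1b$)C"):
    print(sample, parse_designation(sample))
```

Any extra intermediate bytes (as in ESC ( ! "F") are returned together with the final byte, since together they identify the character set.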

The standard also defines a way to specify coding systems that do not follow its own structure. Of particular interest, the sequence ESC % G designates the UTF-8 coding system, which does not reserve the range 0x80–0x9F for control characters.

See also

*ISO/IEC 646
*C0 and C1 control codes
*CJK
*Mojibake

References

*Lunde, Ken. "CJKV Information Processing". Cambridge, Massachusetts: O'Reilly & Associates, 1998. ISBN 1-56592-224-7.

External links

* [http://www.iso.org/ International Organization for Standardization]
* [http://www.ecma-international.org/publications/standards/Ecma-035.htm ECMA-35] , equivalent to ISO/IEC 2022 and freely downloadable.
* [http://www.itscj.ipsj.or.jp/ISO-IR/ International Register of Coded Character Sets to be Used with Escape Sequences] , a full list of assigned character sets and their escape sequences
* [http://tronweb.super-nova.co.jp/characcodehist.html History of Character Codes in North America, Europe, and East Asia]
* [ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf CJK.INF: a document on encoding Chinese, Japanese, and Korean (CJK) languages, including a discussion of the various variants of ISO 2022]. Also [http://examples.oreilly.com/cjkvinfo/doc/cjk.inf available by HTTP].

RFCs

* RFC 1468: description of ISO-2022-JP
* RFC 2237: description of ISO-2022-JP-1
* RFC 1554: description of ISO-2022-JP-2
* RFC 1922: description of ISO-2022-CN and ISO-2022-CN-EXT
* RFC 1557: description of ISO-2022-KR


Wikimedia Foundation. 2010.
