Common Desktop Environment: Internationalization Programmer's Guide > Chapter 3 Internationalization and Distributed Networks

Encodings and Code Sets


To understand code sets, it is first necessary to understand character sets. A character set is a collection of predefined characters based on the specific needs of one or more languages, without regard to the encoding values used to represent the characters. For example, the ASCII character set defines the set of characters found in the English language, and the Japanese Industrial Standard (JIS) character set defines the set of characters used in the Japanese language. A particular character set can be encoded using different encoding schemes, so both the English and Japanese character sets can be encoded using different code sets. The choice of which code set to use depends on the user's data processing requirements.
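As a rough illustration of one character set being encoded by several code sets, the sketch below encodes a single Japanese character three ways. It assumes Python's standard codecs `euc_jp`, `shift_jis`, and `iso2022_jp` as stand-ins for the code sets; they are not part of this guide's environment.

```python
# One character from the JIS character set, encoded under three code sets.
ch = "あ"  # HIRAGANA LETTER A

# Map each code set name (Python codec names, an assumption of this sketch)
# to the bytes it produces for the same character.
encodings = {cs: ch.encode(cs) for cs in ("euc_jp", "shift_jis", "iso2022_jp")}

for code_set, encoded in encodings.items():
    print(f"{code_set:12} {encoded.hex(' ')}")
```

The character is the same in every case; only the bit patterns differ, which is the distinction between a character set and a code set.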

The ISO2022 standard defines a coded character set as a group of precise rules that defines a character set and the one-to-one relationship between each character and its bit pattern. A code set defines the bit patterns that the system uses to identify characters.

A code page is similar to a code set with the limitation that a code-page specification is based on a 16-column by 16-row matrix. The intersection of each column and row defines a coded character.

Code Set Strategy

The common open software environment code set support is based on International Organization for Standardization (ISO) and industry-standard code sets that satisfy the data processing needs of users.

Each locale in the system defines which code set it uses and how the characters within the code set are manipulated. Because multiple locales can be installed on the system, multiple code sets can be used by different users on the system. While the system can be configured with locales using different code sets, all system utilities assume that the system is running under a single code set.

Most commands have no knowledge of the underlying code set being used by the locale. The knowledge of code sets is hidden by the code-set-independent library subroutines (Internationalization libraries), which pass information to the code-set-dependent subroutines.

Because many programs rely on ASCII, all code sets include the 7-bit ASCII code set as a proper subset. Because the 7-bit ASCII code set is common to all supported code sets, its characters are sometimes referred to as the portable character set.

The 7-bit ASCII code set is based on the ISO646 definition and contains the control characters, punctuation characters, digits (0-9), and the English alphabet in uppercase and lowercase.
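The claim that 7-bit ASCII is common to the supported code sets can be spot-checked with a short sketch. The codec names below are Python's, chosen here as approximations of the ISO8859 and EUC code sets discussed in this chapter.

```python
# Every 7-bit value, decoded once as plain ASCII for reference.
ascii_bytes = bytes(range(0x80))
reference = ascii_bytes.decode("ascii")

# Python codec names standing in for the code sets (an assumption of this sketch).
code_sets = ["iso8859_1", "iso8859_2", "iso8859_5", "euc_jp", "euc_kr"]

# True where the code set decodes all 128 values exactly as ASCII does.
agrees = {cs: ascii_bytes.decode(cs) == reference for cs in code_sets}
print(agrees)
```

A program that restricts itself to these 128 values behaves the same regardless of which of these code sets the locale selects.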

Code Set Structure

Each code set is divided into two principal areas:

  • Graphic Left (GL) Columns 0-7

  • Graphic Right (GR) Columns 8-F

The first two columns of each area are reserved by ISO standards for control characters. The terms C0 and C1 are used to denote the control characters for the Graphic Left and Graphic Right areas, respectively.

NOTE: The PC code sets use the C1 control area to encode graphic characters.

The remaining six columns of each area are used to encode graphic characters (see Figure 3-1 “Code Set Overview”). Graphic characters are considered to be printable characters, while the control characters are used by devices and applications to indicate some special function.

Figure 3-1 Code Set Overview
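The column layout can be expressed as a small classifier. This is a minimal sketch of the simplified model described in this section: the helper name `area` is hypothetical, and PC code sets that place graphic characters in the C1 columns do not follow it.

```python
def area(byte: int) -> str:
    """Return C0, GL, C1, or GR for an 8-bit value, by column."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0x00..0xFF")
    column = byte >> 4  # the high 4 bits select one of 16 columns
    if column <= 0x1:
        return "C0"     # columns 0-1: control characters of the left area
    if column <= 0x7:
        return "GL"     # columns 2-7: graphic characters of the left area
    if column <= 0x9:
        return "C1"     # columns 8-9: control characters of the right area
    return "GR"         # columns A-F: graphic characters of the right area

for b in (0x0A, 0x41, 0x8E, 0xE9):
    print(f"0x{b:02X} column {b >> 4:X} row {b & 0xF:X} -> {area(b)}")
```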


Control Characters

Based on the ISO definition, a control character initiates, modifies, or stops a control operation. A control character is not a graphic character, but can have graphic representation in some instances. The control characters in the ISO646-IRV character set are present in all supported code sets, and the encoded values of the C0 control characters are consistent throughout the code sets.

Graphic Characters

Each code set can be considered to be divided into one or more character sets, such that each character is given a unique coded value. The ISO standard reserves six columns for encoding characters and does not allow graphic characters to be encoded in the control character columns.

Single-Byte Code Sets

Code sets that use all 8 bits of a byte can support European, Middle Eastern, and other alphabetic languages. Such code sets are called single-byte code sets. A single-byte code set can encode at most 191 graphic characters: the 256 possible values, less the 64 positions in the C0 and C1 control columns and the DELETE position (0x7F).

Multibyte Code Sets

The term multibyte code sets is used to refer to all possible code sets regardless of the number of bytes needed to encode any specific character. Because the operating system should be capable of supporting any number of bits to encode a character, a multibyte code set may contain characters that are encoded with 8, 16, 32, or more bits. Even single-byte code sets are considered to be multibyte code sets.

Extended UNIX Code (EUC) Code Set

The EUC code set uses control characters to identify, and so to separate, the characters in some of its character sets. The encoding rules are based on the ISO2022 definition for the encoding of 7-bit and 8-bit data.

The term EUC denotes these general encoding rules. A code set based on EUC conforms to the EUC encoding rules but also identifies the specific character sets associated with the specific instances. For example, eucJP for Japanese refers to the encoding of the JIS characters according to the EUC encoding rules.

The first set (CS0) always contains an ISO646 character set. All of the other sets must have the most-significant bit (MSB) set to 1, and they can use any number of bytes to encode the characters. In addition, all characters within a set must have:

  • Same number of bytes to encode all characters

  • Same column display width (number of columns on a fixed-width terminal)

Each character in the third set (CS2) is always preceded with the control character SS2 (single-shift 2, 0x8e). Code sets that conform to EUC do not use the SS2 control character other than to identify the third set.

Each character in the fourth set (CS3) is always preceded with the control character SS3 (single-shift 3, 0x8f). Code sets that conform to EUC do not use the SS3 control character other than to identify the fourth set.

ISO EUC Code Sets

The following code sets are based on definitions set by the International Organization for Standardization (ISO).

  • ISO646-IRV

  • ISO8859-1

  • ISO8859-x

  • eucJP

  • eucTW

  • eucKR

ISO646-IRV

The ISO646-IRV code set defines the code set used for information processing based on a 7-bit encoding. The character set associated with this code set is derived from the ASCII characters.

ISO8859-1

ISO8859-1 encoding is a single-byte encoding that is based on and is compatible with other ISO, American National Standards Institute (ANSI), and European Computer Manufacturer's Association (ECMA) code extension techniques. The ISO8859 encoding defines a family of code sets with each member containing its own unique character sets. The 7-bit ASCII code set is a proper subset of each of the code sets in the ISO8859 family.

The ISO8859-1 code set is called the ISO Latin-1 code set and consists of two character sets:

  • ISO646-IRV Graphic Left, 7-bit ASCII character set

  • ISO8859-1 Graphic Right (Latin) character set

These character sets combined include the characters necessary for Western European languages such as Danish, Dutch, English, Finnish, French, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish.

While the ASCII code set defines an order for the English alphabet, the Graphic Right (GR) characters are not ordered according to any specific language. The language-specific ordering is defined by the locale.

Other ISO8859 Code Sets

This section lists the other significant ISO8859 code sets. Each code set includes the ASCII character set plus its own unique characters.
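To illustrate how the ISO8859 code sets share the ASCII half but differ in the Graphic Right half, the sketch below decodes one GR byte under three members of the family. It assumes Python's `iso8859_1`, `iso8859_2`, and `iso8859_5` codecs.

```python
import unicodedata

byte = b"\xa3"  # a single value from the Graphic Right area

# Decode the same byte under three ISO8859 code sets (Python codec names).
decoded = {part: byte.decode(part) for part in ("iso8859_1", "iso8859_2", "iso8859_5")}
for part, ch in decoded.items():
    print(f"{part}: U+{ord(ch):04X} {unicodedata.name(ch)}")

# A value from the ASCII half decodes to the same character in every member.
ascii_shared = all(b"A".decode(part) == "A" for part in decoded)
print("ASCII half shared:", ascii_shared)
```

The same bit pattern names a different character in each code set, which is why data must be interpreted under the code set it was written in.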

ISO8859-2

Latin alphabet, No. 2, Eastern Europe

  • Albanian

  • Czechoslovakian

  • English

  • German

  • Hungarian

  • Polish

  • Rumanian

  • Serbo-Croatian

  • Slovak

  • Slovene

ISO8859-5

Latin/Cyrillic alphabet

  • Bulgarian

  • Byelorussian

  • English

  • Macedonian

  • Russian

  • Ukrainian

ISO8859-6

Latin/Arabic alphabet

  • English

  • Arabic

ISO8859-7

Latin/Greek alphabet

  • English

  • Greek

ISO8859-8

Latin/Hebrew alphabet

  • English

  • Hebrew

ISO8859-9

Latin/Turkish alphabet

  • Danish

  • Dutch

  • English

  • Finnish

  • French

  • German

  • Irish

  • Italian

  • Norwegian

  • Portuguese

  • Spanish

  • Swedish

  • Turkish

eucJP

The EUC for Japanese consists of single-byte and multibyte characters (2 and 3 bytes). The encoding conforms to ISO2022 and is based on JIS and EUC definitions (see Table 3-2 “Encoding for eucJP”).

Table 3-2 Encoding for eucJP

CS     Encoding                      Character Set
cs0    0xxxxxxx                      ASCII
cs1    1xxxxxxx 1xxxxxxx             JIS X0208-1990
cs2    0x8E 1xxxxxxx                 JIS X0201-1976
cs3    0x8F 1xxxxxxx 1xxxxxxx        JIS X0212-1990

 

JIS X0208-1990

A code of the Japanese graphic character set for information interchange (1990 version) that contains 147 special characters, 10 numeric digits, 83 Hiragana characters, 86 Katakana characters, 52 Latin characters, 48 Greek characters, 66 Cyrillic characters, 32 line-drawing elements, and 6355 Kanji characters.

JIS X0201

A code for information interchange that contains 63 Katakana characters.

JIS X0212-1990

A code of the supplementary Japanese graphic character set for information interchange (1990 version) that contains 21 additional special characters, 21 additional Greek characters, 26 additional Cyrillic characters, 27 additional Latin characters, 171 Latin characters with diacritical marks, and 5801 additional Kanji characters.
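The eucJP layout lends itself to a simple scanner that walks a byte string and reports which set each character belongs to. The following is a sketch only: `split_eucjp` is a hypothetical helper, it checks just the byte patterns from Table 3-2, and it does not validate that a sequence maps to an assigned character.

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift control characters

def split_eucjp(data: bytes):
    """Yield (code_set, raw_bytes) for each character in an eucJP byte string."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            name, width = "cs0", 1   # 0xxxxxxx: ASCII
        elif lead == SS2:
            name, width = "cs2", 2   # SS2 + 1 byte: JIS X0201 Katakana
        elif lead == SS3:
            name, width = "cs3", 3   # SS3 + 2 bytes: JIS X0212
        else:
            name, width = "cs1", 2   # 2 bytes with MSB set: JIS X0208
        chunk = data[i:i + width]
        # Every byte after the lead byte must have its MSB set.
        if len(chunk) < width or any(b < 0x80 for b in chunk[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        yield name, chunk
        i += width

# Hand-built sample: one sequence per set, chosen to fit the table's patterns.
sample = b"A" + b"\xa4\xa2" + b"\x8e\xb1" + b"\x8f\xb0\xa1"
for name, chunk in split_eucjp(sample):
    print(name, chunk.hex(" "))
```

Because every character in a set uses the same number of bytes, the lead byte alone determines how far to advance.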

eucTW

The EUC for Traditional Chinese is an encoding consisting of single-byte and multibyte (2 and 4 bytes) characters. The encoding conforms to ISO2022 and is based on the Chinese National Standard (CNS) as defined by the Republic of China and on the EUC definition (see Table 3-3 “Encoding for eucTW”).

Table 3-3 Encoding for eucTW

CS     Encoding                      Character Set
cs0    0xxxxxxx                      ASCII
cs1    1xxxxxxx 1xxxxxxx             CNS 11643-1992 - plane 1
cs2    0x8EA2 1xxxxxxx 1xxxxxxx      CNS 11643-1992 - plane 2
cs3    0x8EA3 1xxxxxxx 1xxxxxxx      CNS 11643-1992 - plane 3
       0x8EB0 1xxxxxxx 1xxxxxxx      CNS 11643-1992 - plane 16

 

CNS 11643-1992 defines 16 planes for the Chinese Standard Interchange Code. Each plane can support up to 8836 characters (94x94). Currently, only planes 1 through 7 have characters assigned. Table 3-4 “16 Planes of the CNS 11643-1992 Standard” shows the 16 planes of the CNS 11643-1992 standard.

Table 3-4 16 Planes of the CNS 11643-1992 Standard

Plane  Definition                    Number of Characters  EUC Encoding
1      Most frequently used          6085                  A1A1 - FDCB
2      Secondary frequently used     7650                  8EA2 A1A1 - 8EA2 F2C4
3      Exec. Yuen EDP center [1]     6148                  8EA3 A1A1 - 8EA3 E2C6
4      RIS [2], Vendor defined       7298                  8EA4 A1A1 - 8EA4 EEDC
5      Rarely used by MOE [3]        8603                  8EA5 A1A1 - 8EA5 FCD1
6      Variation char set 1 by MOE   6388                  8EA6 A1A1 - 8EA6 E4FA
7      Variation char set 2 by MOE   6539                  8EA7 A1A1 - 8EA7 E6D5
8      Undefined                     0                     8EA8 A1A1 - 8EA8 FEFE
9      Undefined                     0                     8EA9 A1A1 - 8EA9 FEFE
10     Undefined                     0                     8EAA A1A1 - 8EAA FEFE
11     Undefined                     0                     8EAB A1A1 - 8EAB FEFE
12     User Defined Character (UDC)  0                     8EAC A1A1 - 8EAC FEFE
13     UDC                           0                     8EAD A1A1 - 8EAD FEFE
14     UDC                           0                     8EAE A1A1 - 8EAE FEFE
15     UDC                           0                     8EAF A1A1 - 8EAF FEFE
16     UDC                           0                     8EB0 A1A1 - 8EB0 FEFE

[1] EDP: Center of Directorate, General of Budget, Accounting, and Statistics

[2] RIS: Residence Information System

[3] MOE: Ministry of Education
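The relationship between the EUC prefix and the plane number in Table 3-4 is regular enough to compute. The sketch below assumes only the forms listed in that table (a 2-byte form for plane 1, and `0x8E` followed by `0xA2` through `0xB0` for planes 2 through 16); the function name `euctw_plane` is hypothetical.

```python
SS2 = 0x8E
PLANE_CAPACITY = 94 * 94  # rows x cells available in each plane

def euctw_plane(seq: bytes) -> int:
    """Return the CNS 11643-1992 plane for one eucTW multibyte sequence."""
    if len(seq) == 2 and all(0xA1 <= b <= 0xFE for b in seq):
        return 1  # 2-byte form: plane 1
    if (len(seq) == 4 and seq[0] == SS2 and 0xA2 <= seq[1] <= 0xB0
            and all(0xA1 <= b <= 0xFE for b in seq[2:])):
        return seq[1] - 0xA0  # second byte 0xA2..0xB0 selects planes 2..16
    raise ValueError("not a multibyte sequence listed in Table 3-4")

print(PLANE_CAPACITY)
print(euctw_plane(bytes.fromhex("a1a1")))
print(euctw_plane(bytes.fromhex("8ea2a1a1")))
print(euctw_plane(bytes.fromhex("8eb0fefe")))
```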

 

eucKR

The EUC for Korean is an encoding consisting of single-byte and multibyte characters (shown in Table 3-5 “Encoding for eucKR”). The encoding conforms to ISO2022 and is based on the Korean Standard Code (KSC) set and EUC definitions.

Table 3-5 Encoding for eucKR

CS     Encoding                      Character Set
cs0    0xxxxxxx                      ASCII
cs1    1xxxxxxx 1xxxxxxx             KS C 5601-1992
cs2                                  Not used
cs3                                  Not used

 

KS C 5601-1992 (code of the Korean character set for information interchange, 1992 version) contains 432 special characters, 30 Arabic and Roman numeral characters, 94 Hangul alphabet characters, 52 Roman characters, 48 Greek characters, 27 Latin characters, 169 Japanese characters, 66 Russian characters, 68 line-drawing elements, 2344 precomposed Hangul characters, and 4888 Hanja characters.

The Hangul characters represent the sounds of Korean words. Each Hangul character is composed of one to three of the Hangul elementary phonetic signs: an initial consonant (if any), a vowel, and a final consonant (if any). Many Korean words can also be written with Traditional Chinese characters (called Hanja in Korean). Traditionally, Korean texts were written in a mixture of Hangul and Hanja: Hanja for the main words (nouns, verbs, modifiers) and Hangul for the particles and grammatical inflections. Today, most Korean texts are written purely in Hangul, although personal names may still appear in Hanja.
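As a brief illustration of the eucKR layout, the sketch below encodes a mixed Hangul and ASCII string and reports the byte width of each character. It assumes Python's `euc_kr` codec as an approximation of the eucKR code set.

```python
text = "한글 ABC"  # two precomposed Hangul characters, a space, three ASCII letters

encoded = text.encode("euc_kr")

# Byte width of each character: 2 for cs1 (KS C 5601), 1 for cs0 (ASCII).
widths = [len(ch.encode("euc_kr")) for ch in text]
print(widths)

# Both bytes of a cs1 character have the most-significant bit set.
hangul_bytes = "한".encode("euc_kr")
print(all(b & 0x80 for b in hangul_bytes))
```

Since cs2 and cs3 are not used, a scanner for eucKR needs only the most-significant bit of the lead byte to tell a one-byte character from a two-byte character.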

© Hewlett-Packard Development Company, L.P.