CCSID is an abbreviation used by IBM to mean "Coded Character Set Identifier". It is a 16-bit number that represents a specific encoding of a specific code page. For example, Unicode is a code page that has several encoding forms, like UTF-8, UTF-16 and UTF-32.
The terms code page and CCSID are often used interchangeably even though they are not synonymous. A code page may be only part of what makes up a CCSID. The following definitions help to illustrate this point, from glyph to CCSID and everything in between.
A glyph is the actual physical pattern of pixels or ink that shows up on a display or printout.
A character is a concept that covers all glyphs associated with a certain symbol. For instance, the letter "F" rendered in bold, in italics, underlined, in a different color, or in a different font produces a different glyph each time, but every one of those glyphs represents the same character. The various modifiers (bold, italic, underline, color, and font) do not change the F's essential F-ness.
A character set contains the characters necessary to allow a particular human to carry on a meaningful interaction with the computer. Here's where we start to separate characters into various alphabets (Latin, Arabic, Hebrew, Cyrillic, and so on) or ideographic groups (Chinese, Korean, and so on).
A code page represents a particular assignment of code point values to characters. The code point is the logical representation of the computer's internal byte representation of that character. Many characters are represented by different code points in different code pages. It is important to note that all code points in a code page contain the same number of bytes. Certain character sets can be adequately represented with single-byte code pages (256 characters), but many require more than that; JIS X 0208 and Unicode are examples.
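To make the point about differing code points concrete, here is a minimal sketch using Python's built-in codecs. It assumes that Python's `cp037` and `cp437` codecs are acceptable stand-ins for IBM code pages 037 (EBCDIC) and 437 (PC); the variable names are illustrative only.

```python
# The same character encoded under two single-byte code pages.
# Assumption: Python's "cp037" and "cp437" codecs approximate
# IBM code pages 037 (EBCDIC) and 437 (IBM PC).
ch = "A"

ebcdic_bytes = ch.encode("cp037")  # EBCDIC code page
pc_bytes = ch.encode("cp437")      # ASCII-based PC code page

# Both code pages are single-byte: one byte per character.
ebcdic_code_point = ebcdic_bytes[0]
pc_code_point = pc_bytes[0]

print(f"cp037: 0x{ebcdic_code_point:02X}")  # 0xC1
print(f"cp437: 0x{pc_code_point:02X}")      # 0x41
```

The character is the same in both cases; only the code point assigned to it differs between the two code pages.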
An encoding scheme is the byte format of a code page. It maps code point values to byte values in a computer. For example, UTF-8 and UTF-16BE are two encodings of the same Unicode code page. In IBM's CDRA, this is typically represented with an ESID (Encoding Scheme IDentifier). EUC and ISO-2022 are other examples of encoding schemes.
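As a small illustration of one code point mapping to different byte sequences under different encoding schemes, the sketch below encodes the euro sign (U+20AC, chosen arbitrarily for this example) with Python's standard Unicode codecs.

```python
# One Unicode code point, several byte representations.
ch = "\u20ac"  # EURO SIGN, code point U+20AC

utf8 = ch.encode("utf-8")         # variable-width, 1 to 4 bytes per code point
utf16be = ch.encode("utf-16-be")  # 16-bit code units, big-endian
utf32be = ch.encode("utf-32-be")  # 32-bit code units, big-endian

print(utf8.hex())     # e282ac
print(utf16be.hex())  # 20ac
print(utf32be.hex())  # 000020ac
```

The code point value stays U+20AC throughout; only the byte format chosen by the encoding scheme changes.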
A coded character set identifier (CCSID) contains all of the information necessary to assign and preserve the meaning and rendering of characters through various stages of processing and interchange. This information always includes at least one code page, but may include multiple code pages of differing byte-lengths. The CCSID also has an associated encoding scheme that governs how various code points are to be handled. This mechanism allows a program to recognize bidirectional orientation, character shaping (mainly of Arabic characters), and other complex encoding information.
The following examples show how some CCSIDs are made up of other CCSIDs.

CCSID 932

| Character Set | Code Page | CCSID | Encoding Scheme |
|---|---|---|---|
| 1122 | 897 | 897 | SBCS |
| 370 | 301 | 301 | DBCS |

CCSID 942

| Character Set | Code Page | CCSID | Encoding Scheme |
|---|---|---|---|
| 1172 | 1041 | 1041 | SBCS |
| 370 | 301 | 301 | DBCS |

CCSID 5028

| Character Set | Code Page | CCSID | Encoding Scheme |
|---|---|---|---|
| 1170 | 897 | 4993 | SBCS |
| 370 | 301 | 301 | DBCS |

All three of these Shift-JIS variant CCSIDs are MBCS (multi-byte character sets). The SBCS (single-byte character set) portion differs in each CCSID, while the DBCS (double-byte character set) portion, CCSID 301, is the same in all three. CCSID 932 uses the original code page 897, which is CCSID 897. CCSID 5028 uses an updated code page 897, identified as CCSID 4993. CCSID 942 uses a different SBCS from the other two CCSIDs, namely 1041.
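The composition described above can be sketched as a small data structure. This is only an illustration: the `Component` class, the `MBCS_CCSIDS` table, and the `part` helper are hypothetical names, not part of CDRA or any IBM API, and the values are copied from the three tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One row of a composition table (hypothetical helper type)."""
    character_set: int
    code_page: int
    ccsid: int
    encoding_scheme: str  # "SBCS" or "DBCS"


# Mixed-byte CCSID -> its single-byte and double-byte components.
MBCS_CCSIDS = {
    932: (Component(1122, 897, 897, "SBCS"), Component(370, 301, 301, "DBCS")),
    942: (Component(1172, 1041, 1041, "SBCS"), Component(370, 301, 301, "DBCS")),
    5028: (Component(1170, 897, 4993, "SBCS"), Component(370, 301, 301, "DBCS")),
}


def part(ccsid: int, scheme: str) -> Component:
    """Return the component of `ccsid` that has the given encoding scheme."""
    return next(c for c in MBCS_CCSIDS[ccsid] if c.encoding_scheme == scheme)


# The DBCS portion is shared; the SBCS portion differs per CCSID.
dbcs_ccsids = {part(c, "DBCS").ccsid for c in MBCS_CCSIDS}
sbcs_ccsids = {c: part(c, "SBCS").ccsid for c in MBCS_CCSIDS}

print(dbcs_ccsids)  # {301}
print(sbcs_ccsids)  # {932: 897, 942: 1041, 5028: 4993}
```

Note that CCSIDs 932 and 5028 share code page 897 for their SBCS portion but refer to it through different CCSIDs (897 and 4993).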
Also notice that CCSIDs 5028 and 4993 each differ by 4096 (0x1000 in hexadecimal) from their predecessor CCSIDs, 932 and 897 respectively. This is a common way that CDRA denotes an upgraded CCSID.
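The arithmetic can be checked directly. In the sketch below, `UPGRADE_OFFSET` and `upgraded` are hypothetical names used only for illustration; the text describes this offset as a common convention, not a rule that holds for every CCSID.

```python
# Offset commonly separating an upgraded CCSID from its predecessor.
UPGRADE_OFFSET = 0x1000  # 4096 in decimal


def upgraded(ccsid: int) -> int:
    """Return the CCSID obtained by adding the 0x1000 offset (illustrative only)."""
    result = ccsid + UPGRADE_OFFSET
    # A CCSID is a 16-bit number, so the result must still fit in 16 bits.
    if not 0 <= result <= 0xFFFF:
        raise ValueError(f"{result} does not fit in 16 bits")
    return result


print(upgraded(932))  # 5028
print(upgraded(897))  # 4993
```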
There are a few reasons for this amount of complexity.