The Full Wiki

More info on UTF-32/UCS-4

UTF-32/UCS-4: Wikis

Advertisements
  

Note: Many of our articles have direct quotes from sources you can cite, within the Wikipedia article! This article doesn't yet, but we're working on it! See more info or our list of citable articles.

Encyclopedia

From Wikipedia, the free encyclopedia

Unicode
Character encodings
UCS
Mapping
Bi-directional text
BOM
Han unification
Unicode and HTML
Unicode and E-mail
Unicode typefaces

UTF-32 (or UCS-4) is a protocol for encoding Unicode characters that uses exactly 32 bits for each Unicode code point. All other Unicode transformation formats use variable-length encodings.

Because UTF-32 uses 4 bytes for every character it is quite space inefficient. Specifically, non-BMP characters are so rare in most texts, they may as well be considered non-existent for sizing discussions, making UTF-32 between two and four times the size of other encodings.

Though a fixed number of bytes per code point seems convenient, it is not used as much as the other Unicode encodings. It makes truncation slightly easier but not significantly so compared to UTF-8 and UTF-16. It does not make calculating the displayed width of a string any easier except in very limited cases, since even with a “fixed width” font there may be more than one code point per character position (combining marks) or more than one character position per code point (for example CJK ideographs). Combining marks also mean editors cannot treat one code point as being the same as one unit for editing.

Contents

History

The original ISO 10646 standard defines a 31-bit encoding form called UCS-4, in which each encoded character in the Universal Character Set (UCS) is represented by a 32-bit friendly code value in the code space of integers between 0 and hexadecimal 7FFFFFFF.

Because only 17 planes are actually in use, all current code points are between 0 and 0x10FFFF. UTF-32 is a subset of UCS-4 that uses only this range. Since the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the BMP or the first 14 supplementary planes, UTF-32 will be able to represent all Unicode points except those in deprecated private-use areas (groups 60 to 7F and planes E0 to FF). Accordingly, UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.

Non-usage in HTML5

HTML5 states that "authors should not use UTF-32, as the encoding detection algorithms described in this specification intentionally do not distinguish it from UTF-16." [1]

See also

External links

Advertisements

Advertisements






Got something to say? Make a comment.
Your name
Your email address
Message