UTF-8

From Wikipedia, the free encyclopedia

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages,[1] and other places where characters are stored or streamed.

UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single octet encoding used only for the 128 US-ASCII characters.

The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8.[2] The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8.[3]

History

By early 1992 the search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF that provided a byte-stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, but did introduce the notion that bytes in the ASCII range of 0–127 represent themselves in UTF, thereby providing backward compatibility.

In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multibyte sequences would include only bytes where the high bit was set. In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. Ken Thompson of the Plan 9 operating system group at Bell Labs then made a crucial modification to the encoding to allow it to be self-synchronizing, meaning that it was not necessary to read from the beginning of the string to find code point boundaries. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open.[4]

UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25–29, 1993.

Description

The UTF-8 encoding is variable-width, with each character represented by 1 to 4 bytes. Each byte has 0–4 leading consecutive 1 bits followed by a zero bit to indicate its type. A byte with N leading 1 bits is the first byte of an N-byte sequence, with two exceptions: a byte with no leading 1 bits is a one-byte sequence, and a byte with exactly one leading 1 bit is a continuation byte in a multi-byte sequence (this was done for ASCII compatibility). The scalar value of the Unicode code point is the concatenation of the non-control bits. In the following table, zeroes and ones represent control bits, x-s represent the lowest 8 bits of the Unicode value, y-s represent the next higher 8 bits, and z-s represent the bits higher than that.

Unicode           Decimal                Byte1     Byte2     Byte3     Byte4     Example
U+0000–U+007F     (0 to 127)             0xxxxxxx                                '$' U+0024: 00100100 (0x24)
U+0080–U+07FF     (128 to 2,047)         110yyyxx  10xxxxxx                      '¢' U+00A2: 11000010,10100010 (0xC2,0xA2)
U+0800–U+FFFF     (2,048 to 65,535)      1110yyyy  10yyyyxx  10xxxxxx            '€' U+20AC: 11100010,10000010,10101100 (0xE2,0x82,0xAC)
U+10000–U+10FFFF  (65,536 to 1,114,111)  11110zzz  10zzyyyy  10yyyyxx  10xxxxxx  '𤭢' U+024B62: 11110000,10100100,10101101,10100010 (0xF0,0xA4,0xAD,0xA2)
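
The bit layout in the table can be sketched as a small encoder. This is a minimal illustration in Python, not a reference implementation: the function name `encode_utf8` is hypothetical, and the rejection of surrogate code points (U+D800 to U+DFFF) is an assumption taken from RFC 3629 rather than from the text above.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point using the table's bit layout (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption from RFC 3629: surrogates are not encodable scalar values.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The four examples from the table.
for cp in (0x24, 0xA2, 0x20AC, 0x24B62):
    print(f"U+{cp:04X}", " ".join(f"{b:02X}" for b in encode_utf8(cp)))
# U+0024 24
# U+00A2 C2 A2
# U+20AC E2 82 AC
# U+24B62 F0 A4 AD A2
```

Each continuation byte carries six payload bits behind the fixed `10` prefix, which is why every shift is a multiple of six.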

So the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.

By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes covering numbers up to 31 bits (the original limit of the Universal Character Set).
However, UTF-8 was restricted by RFC 3629 to use only the area covered by the formal Unicode definition, U+0000 to U+10FFFF, in November 2003.

With these restrictions, bytes in a UTF-8 sequence have the following meanings. Bytes noted as overlong, restricted, or invalid (C0–C1 and F5–FF) can never appear in a legal UTF-8 sequence. Bytes 00–7F are each represented in a single byte. Bytes C2–F4 must only appear as the first byte in a multi-byte sequence, and bytes 80–BF can only appear as the second or later byte in a multi-byte sequence:
Binary             Hex    Decimal  Notes
00000000-01111111  00-7F  0-127    US-ASCII (single byte)
10000000-10111111  80-BF  128-191  Second, third, or fourth byte of a multi-byte sequence
11000000-11000001  C0-C1  192-193  Overlong encoding: start of a 2-byte sequence, but code point ≤ 127
11000010-11011111  C2-DF  194-223  Start of 2-byte sequence
11100000-11101111  E0-EF  224-239  Start of 3-byte sequence
11110000-11110100  F0-F4  240-244  Start of 4-byte sequence
11110101-11110111  F5-F7  245-247  Restricted by RFC 3629: start of 4-byte sequence for code point above U+10FFFF
11111000-11111011  F8-FB  248-251  Restricted by RFC 3629: start of 5-byte sequence
11111100-11111101  FC-FD  252-253  Restricted by RFC 3629: start of 6-byte sequence
11111110-11111111  FE-FF  254-255  Invalid: not defined by original UTF-8 specification
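
The byte ranges in this table translate directly into a lookup by value. The sketch below is illustrative only: the names `classify_byte` and `code_point_start` and the returned labels are hypothetical. The second helper shows how the continuation-byte range makes it possible to find a code point boundary from an arbitrary offset, assuming the input is valid UTF-8.

```python
def classify_byte(b: int) -> str:
    """Map a byte value to its role according to the table (hypothetical labels)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"          # 00-7F: single byte
    if b <= 0xBF:
        return "continuation"   # 80-BF: second or later byte
    if b <= 0xC1:
        return "invalid"        # C0-C1: overlong 2-byte lead
    if b <= 0xDF:
        return "lead2"          # C2-DF
    if b <= 0xEF:
        return "lead3"          # E0-EF
    if b <= 0xF4:
        return "lead4"          # F0-F4
    return "invalid"            # F5-FF: restricted or undefined


def code_point_start(data: bytes, index: int) -> int:
    """Back up from any offset to the first byte of the sequence containing it.

    Assumes `data` is valid UTF-8, so at most three steps back are needed.
    """
    while index > 0 and classify_byte(data[index]) == "continuation":
        index -= 1
    return index


sample = "a€b".encode("utf-8")          # 61 E2 82 AC 62
print([classify_byte(b) for b in sample])
print(code_point_start(sample, 3))      # offset 3 is inside the euro sign, which starts at 1
```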

Invalid byte sequences

Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:
  • the invalid bytes in the above table (C0–C1 and F5–FF)
  • an unexpected continuation byte
  • a start byte not followed by enough continuation bytes
  • a sequence that decodes to a value that should use a shorter sequence (an "overlong form").
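
The four cases listed above can be sketched as explicit checks in a strict decoder. This is a minimal illustration, not a production decoder: `InvalidUTF8` and `decode_utf8_strict` are hypothetical names, and the final check for surrogates and values above U+10FFFF is an assumption taken from RFC 3629 rather than from the list.

```python
from typing import List


class InvalidUTF8(ValueError):
    """Raised by this sketch for any ill-formed input."""


def decode_utf8_strict(data: bytes) -> List[int]:
    """Return the code points in `data`, rejecting every invalid form."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        if first <= 0x7F:
            code_points.append(first)
            i += 1
            continue
        if first <= 0xBF:
            raise InvalidUTF8(f"unexpected continuation byte at offset {i}")
        if 0xC2 <= first <= 0xDF:
            length, value, minimum = 2, first & 0x1F, 0x80
        elif 0xE0 <= first <= 0xEF:
            length, value, minimum = 3, first & 0x0F, 0x800
        elif 0xF0 <= first <= 0xF4:
            length, value, minimum = 4, first & 0x07, 0x10000
        else:
            # C0-C1 and F5-FF can never appear.
            raise InvalidUTF8(f"invalid byte 0x{first:02X} at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise InvalidUTF8(f"start byte at offset {i} lacks continuation bytes")
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        if value < minimum:
            raise InvalidUTF8(f"overlong form at offset {i}")
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            # Assumption from RFC 3629, not from the list above.
            raise InvalidUTF8(f"code point out of range at offset {i}")
        code_points.append(value)
        i += length
    return code_points


print([f"U+{cp:04X}" for cp in decode_utf8_strict("$¢€".encode("utf-8"))])
```

Tracking the smallest value each sequence length may legitimately encode (`minimum`) is what catches overlong forms such as `E0 80 AF`, whose lead byte is otherwise acceptable.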
Many earlier decoders would happily try to decode these. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as NUL, slash, or quotes. Invalid UTF-8 has been used to bypass security validations in high profile products including Microsoft's IIS web server.[5]
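
As an illustration of how an overlong form can create an ASCII character, consider a deliberately flawed two-byte decoder. The function `naive_decode_two_byte` is hypothetical and stands in for the kind of lenient decoder described above; it is not taken from any real product.

```python
def naive_decode_two_byte(data: bytes) -> int:
    """Deliberately flawed: joins the payload bits without rejecting overlong forms."""
    first, second = data
    return ((first & 0x1F) << 6) | (second & 0x3F)


overlong_slash = b"\xC0\xAF"

# A byte-level filter looking for "/" sees nothing suspicious...
print(b"/" in overlong_slash)                      # False
# ...but the lenient decoder produces a slash afterwards.
print(chr(naive_decode_two_byte(overlong_slash)))  # /

# Python's built-in decoder rejects the sequence.
try:
    overlong_slash.decode("utf-8")
except UnicodeDecodeError:
    print("rejected by a strict decoder")
```

The mismatch between the check (performed on bytes) and the interpretation (performed after lenient decoding) is what makes this class of input dangerous.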
RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."[6] The Unicode Standard requires decoders to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

Many UTF-8 decoders throw an exception if a string has an error in it. In recent times this has been found to be impractical: being unable to work with data means you cannot even try to fix it. One example was Python 3.0, which would exit immediately if the command line had invalid UTF-8 in it.[7] A more useful solution is to translate the first byte of the invalid sequence to a replacement and continue parsing with the next byte.
Popular replacements are:
  • The replacement character '�' (U+FFFD)
  • The '?' or '¿' character (U+003F or U+00BF)
  • The invalid Unicode code points U+DC80..U+DCFF where the low 8 bits are the byte's value.
  • Interpret the bytes according to another encoding (often ISO-8859-1 or CP1252).
.Replacing errors is "lossy": more than one UTF-8 string converts to the same Unicode result.^ UTF-8 encoded string to convert.
  • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]
  • CodeProject: UTF-8 With C++ in a Portable Way. Free source code and programming help 12 January 2010 6:43 UTC www.codeproject.com [Source type: Reference]

^ Converts a UTF-16 encoded string to UTF-8.
  • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]
  • CodeProject: UTF-8 With C++ in a Portable Way. Free source code and programming help 12 January 2010 6:43 UTC www.codeproject.com [Source type: Reference]

^ Converts a UTF-32 encoded string to UTF-8.
  • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]
  • CodeProject: UTF-8 With C++ in a Portable Way. Free source code and programming help 12 January 2010 6:43 UTC www.codeproject.com [Source type: Reference]

.Therefore the original UTF-8 should be stored, and translation should only be used when displaying the text to the user.^ Localization includes the translation of text such as user interface labels, error messages, and online help.
  • Java Internationalization FAQ 12 January 2010 6:43 UTC java.sun.com [Source type: Reference]

^ User-PI SHOULD be capable of supporting UTF-8 encoding for the language negotiated.
  • UTF-8 - Internationalization of the File Transfer Protocol [RFC-Ref] 12 January 2010 6:43 UTC rfc-ref.org [Source type: Reference]

^ If a client detects that a server is non UTF-8 , it SHOULD change its display appropriately.
  • UTF-8 - Internationalization of the File Transfer Protocol [RFC-Ref] 12 January 2010 6:43 UTC rfc-ref.org [Source type: Reference]
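
The replacement strategies listed above can be sketched with Python's built-in codec error handlers. This is an illustration under stated assumptions, not a prescribed implementation: the sample bytes are invented, and the handler name "latin1-fallback" is hypothetical and registered only for this example.

```python
import codecs

raw = b"caf\xc3\xa9 \xff ok"  # valid "café", then a stray 0xFF byte

# Strategy 1: substitute U+FFFD. Readable, but lossy.
replaced = raw.decode("utf-8", errors="replace")

# Lossy means two different byte strings give the same text.
lossy_a = b"a\xfe".decode("utf-8", errors="replace")
lossy_b = b"a\xff".decode("utf-8", errors="replace")

# Strategy 3: map each bad byte to U+DC80..U+DCFF, keeping the byte's
# value in the low 8 bits. The original bytes can be recovered.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")


def _latin1_handler(err):
    """Strategy 4: interpret only the offending bytes as ISO-8859-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end


# "latin1-fallback" is a made-up name for this sketch.
codecs.register_error("latin1-fallback", _latin1_handler)
mixed = b"caf\xe9!".decode("utf-8", errors="latin1-fallback")
```

Because `restored` equals the original bytes, the surrogate-escape form is the only one of these that could be stored in place of the original UTF-8 without losing information.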

Invalid code points

.UTF-8 may only legally be used to encode valid Unicode scalar values.^ The term Unicode character is used to mean a Unicode scalar value (i.e.
  • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

^ UTF-8 may be parsed as valid UTF-8 .
  • UTF-8 - Internationalization of the File Transfer Protocol [RFC-Ref] 12 January 2010 6:43 UTC rfc-ref.org [Source type: Reference]

^ However, this is only an argument for using Unicode.
  • Encode your XML documents in UTF-8 12 January 2010 6:43 UTC www-128.ibm.com [Source type: General]
  • Encode your XML documents in UTF-8 12 January 2010 6:43 UTC www.ibm.com [Source type: General]

.According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above.^ UTF-8 is a compact, efficient Unicode encoding.
  • What Is UTF-8 And Why Is It Important? 12 January 2010 6:43 UTC developers.sun.com [Source type: Reference]

^ Each byte of a UTF-8 stream or sequence is unambiguous.
  • Encode your XML documents in UTF-8 12 January 2010 6:43 UTC www-128.ibm.com [Source type: General]
  • Encode your XML documents in UTF-8 12 January 2010 6:43 UTC www.ibm.com [Source type: General]

^ UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard.
  • Java Internationalization FAQ 12 January 2010 6:43 UTC java.sun.com [Source type: Reference]

.Whether an actual application should treat these as invalid is questionable.^ Subsequently, all applications running on top of these toolkits should be UTF-8-aware out of the box.
  • Gentoo Linux Documentation-- Using UTF-8 with Gentoo 12 January 2010 6:43 UTC www.gentoo.org [Source type: FILTERED WITH BAYES]

^ Unicode consortium says applications should treat canonical equivalences like this as the same, but everyone else (including serious standards bodies, e.g.
  • Lair Of The Multimedia Guru » UTF-8 12 January 2010 6:43 UTC guru.multimedia.cx [Source type: FILTERED WITH BAYES]

.Allowing them allows lossless conversion of an invalid UTF-16 string and allows CESU encoding (described below) to be decoded.^ UTF-8 string to look for invalid UTF-8 sequences.
  • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]
  • CodeProject: UTF-8 With C++ in a Portable Way. Free source code and programming help 12 January 2010 6:43 UTC www.codeproject.com [Source type: Reference]

^ UTF-32 encoded string to convert.
  • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]
  • CodeProject: UTF-8 With C++ in a Portable Way. Free source code and programming help 12 January 2010 6:43 UTC www.codeproject.com [Source type: Reference]

^ Converts a UTF-8 encoded string to UTF-32.
  • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]
  • CodeProject: UTF-8 With C++ in a Portable Way. Free source code and programming help 12 January 2010 6:43 UTC www.codeproject.com [Source type: Reference]

There are other code points that are far more important to detect and reject, such as the reversed-BOM U+FFFE, or the codes U+0080..U+00AF which may indicate improperly translated CP1252 or double-encoded UTF-8.
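
As a rough illustration of these rules, the sketch below shows a strict decoder rejecting the UTF-8-style encodings of a surrogate half and of a value above U+10FFFF, and a lenient mode that lets surrogates through. The helper names `is_scalar_value`, `strict_decode_ok` and `suspicious_code_points` are hypothetical and exist only for this example.

```python
def is_scalar_value(cp: int) -> bool:
    """True for code points that UTF-8 may legally encode."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def strict_decode_ok(data: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


surrogate_bytes = b"\xed\xa0\x80"      # would-be encoding of U+D800
too_large_bytes = b"\xf4\x90\x80\x80"  # would-be encoding of U+110000

# Lenient decoding that allows surrogate halves through unchanged.
lenient = surrogate_bytes.decode("utf-8", errors="surrogatepass")


def suspicious_code_points(text: str) -> list:
    """Code points the text above suggests detecting: U+FFFE and U+0080..U+00AF."""
    return [ch for ch in text if ch == "\ufffe" or "\u0080" <= ch <= "\u00af"]


# UTF-8 bytes for "é" misread as ISO-8859-1 yield U+00C3 U+00A9.
double_encoded = "\u00e9".encode("utf-8").decode("latin-1")
```

Here `suspicious_code_points(double_encoded)` flags the U+00A9 produced by the misread continuation byte, which is the kind of signal the paragraph above describes.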

Official name and incorrect variants

.The official name is "UTF-8". All letters are upper-case, and the name is hyphenated.^ Note that the output from locale -a on the Linux box shown above shows " utf8 " in lower case without a hyphen: this is a BUG. When you set the LANG variable, be sure to type UTF-8 in UPPER CASE and with a hyphen: .
  • A Quick Primer On Unicode and Software Internationalization Under Linux and UNIX 12 January 2010 6:43 UTC eyegene.ophthy.med.umich.edu [Source type: Reference]

^ In the HTML syntax, attribute names, even those for foreign elements , may be written with any mix of lower- and uppercase letters that are an ASCII case-insensitive match for the attribute's name.
  • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

^ The server should attempt to convert all pathnames to UTF-8 , but if it can't then it should leave that name in its raw form.
  • UTF-8 - Internationalization of the File Transfer Protocol [RFC-Ref] 12 January 2010 6:43 UTC rfc-ref.org [Source type: Reference]

.This spelling is used in all the documents relating to the encoding.^ Nor can you rely on all users limiting their use of characters to the ASCII subset which is common for the majority of encodings.
  • TextMate Blog » Handling encodings (UTF-8) 12 January 2010 6:43 UTC blog.macromates.com [Source type: General]

^ Vim can support all of these encodings, but always uses UTF-8 internally.

^ If the encoding supported by the terminal doesn't include all the characters that Vim uses, this leads to lost characters.

.Alternatively, the name "utf-8" may be used by all standards conforming to the Internet Assigned Numbers Authority (IANA) list[8] (which include CSS, HTML, XML, and HTTP headers)[9], as the declaration is case insensitive.^ In particular, it does not apply to conformance checkers; conformance checkers must use the requirements given in the next section ("parsing HTML documents").
  • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

^ Try to use UTF-8 encoding of content in CSS in order to bypass validation routines.
  • CAPEC - CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic (Release 1.4) 12 January 2010 6:43 UTC capec.mitre.org [Source type: Reference]

^ Status: Last call for comments This table lists the character reference names that are supported by HTML, and the code points to which they refer.
  • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

.Other descriptions that omit the hyphen or replace it with a space, such as "utf8" or "UTF 8", are incorrect.^ If a client attempts to SELECT or EXAMINE such mailboxes without the "UTF8" parameter, the server MUST reject the command with a [ UTF-8-ONLY ] response code.
  • draft-ietf-eai-imap-utf8-09 - IMAP Support for UTF-8 12 January 2010 6:43 UTC tools.ietf.org [Source type: Reference]

^ The script performs a real UTF-8 encoding/decoding, so unlike a simple 'find and replace' approach, characters which do not fit into the current codepage are indicated as such.
  • UTF-8 conversion support for mIRC | Steven Wittens - Acko.net 12 January 2010 6:43 UTC acko.net [Source type: Original source]

^ Characters in the ASCII range occupy only half the space in UTF-8 that they do in some other encodings of Unicode, particularly UTF-16.
  • Encode your XML documents in UTF-8 12 January 2010 6:43 UTC www-128.ibm.com [Source type: General]
  • Encode your XML documents in UTF-8 12 January 2010 6:43 UTC www.ibm.com [Source type: General]

Despite this, most agents such as browsers can understand them.

UTF-8 derivations

.The following implementations differ slightly from the UTF-8 specification.^ This specification does not make any attempt to support EBCDIC-based encodings and UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior in implementations of this specification.
  • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

^ Implementations, however, are required to follow specific rules when populating a cache based on a cache manifest, to ensure that certain origin-based restrictions are honored.
  • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

^ Specific cases where UTF-8 characters are permitted or not permitted are described in the following paragraphs.
  • draft-ietf-eai-imap-utf8-09 - IMAP Support for UTF-8 12 January 2010 6:43 UTC tools.ietf.org [Source type: Reference]

.They are incompatible with the UTF-8 specification.^ The current RFC says they must not be decoded but older specifications for UTF-8 only gave a warning and many simpler decoders will happily decode them.
  • CAPEC - CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic (Release 1.4) 12 January 2010 6:43 UTC capec.mitre.org [Source type: Reference]

CESU-8

.Many pieces of software added UTF-8 conversions for UCS-2 data and did not alter their UTF-8 conversion when UCS-2 was replaced with the surrogate-pair supporting UTF-16.^ Is conversation from UTF-8 to UTF-16 considerable overhead?
  • Ask Tom "Multilingual Database and UTF-8" 12 January 2010 6:43 UTC asktom.oracle.com [Source type: FILTERED WITH BAYES]

^ UTF-16 encoded data to decode.
  • PHP: utf8_decode - Manual 12 January 2010 6:43 UTC kr.php.net [Source type: FILTERED WITH BAYES]

^ It has to be very careful not to split an encoding (just like a good UTF-16 citizen has to know not to split high/low surrogates...
  • Sorting it all Out : Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!) 12 January 2010 6:43 UTC blogs.msdn.com [Source type: FILTERED WITH BAYES]

.The result is that each half of a UTF-16 surrogate pair is encoded as its own 3-byte UTF-8 encoding, resulting in 6 bytes rather than 4 for characters outside the Basic Multilingual Plane.^ UTF-16 encoded data to decode.
  • PHP: utf8_decode - Manual 12 January 2010 6:43 UTC kr.php.net [Source type: FILTERED WITH BAYES]

^ Converts a UTF-16 encoded string to UTF-8.
  • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]
  • CodeProject: UTF-8 With C++ in a Portable Way. Free source code and programming help 12 January 2010 6:43 UTC www.codeproject.com [Source type: Reference]

^ Decode UTF-16 encoded strings.
  • PHP: utf8_decode - Manual 12 January 2010 6:43 UTC kr.php.net [Source type: FILTERED WITH BAYES]

.Oracle databases use this, as well as Java and Tcl as described below, and probably a great deal of other Windows software where the programmers were unaware of the complexities of UTF-16. Although most usage is by accident, a supposed benefit is that this preserves UTF-16 binary sorting order when CESU-8 is binary sorted.^ But you can use UTF-8 in xchat as well.
  • UTF-8 conversion support for mIRC | Steven Wittens - Acko.net 12 January 2010 6:43 UTC acko.net [Source type: Original source]

^ Usage of these utilities is described below.
  • A Quick Primer On Unicode and Software Internationalization Under Linux and UNIX 12 January 2010 6:43 UTC eyegene.ophthy.med.umich.edu [Source type: Reference]

^ Faster random access wouldn't assist XML processing in any meaningful way; so although this might be a good reason to use a different encoding in a database or other system, it doesn't apply to XML. .
  • Encode your XML documents in UTF-8 12 January 2010 6:43 UTC www-128.ibm.com [Source type: General]
  • Encode your XML documents in UTF-8 12 January 2010 6:43 UTC www.ibm.com [Source type: General]
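
A minimal sketch of the CESU-8 behaviour described above, assuming the input contains no lone surrogates. The function name `cesu8_encode` is hypothetical; Python has no built-in CESU-8 codec, so the surrogate halves are computed by hand and encoded with the `surrogatepass` error handler.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode each UTF-16 code unit as its own UTF-8-style sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, then encode each half
            # separately as a 3-byte sequence.
            cp -= 0x10000
            for unit in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += chr(unit).encode("utf-8", errors="surrogatepass")
        else:
            out += ch.encode("utf-8", errors="surrogatepass")
    return bytes(out)


emoji = "\U0001F600"                 # a character outside the BMP
utf8_bytes = emoji.encode("utf-8")   # 4 bytes
cesu8_bytes = cesu8_encode(emoji)    # 6 bytes

# Sort order: one BMP character above the surrogate range, one non-BMP.
samples = ["\uff5e", "\U00010000"]
by_utf8 = sorted(samples, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(samples, key=lambda s: s.encode("utf-16-be"))
by_cesu8 = sorted(samples, key=cesu8_encode)
```

In this sample, sorting by UTF-8 bytes follows code point order, while sorting by CESU-8 bytes matches the UTF-16 binary order.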

Modified UTF-8

.In Modified UTF-8[10] the null character (U+0000) is encoded as 0xC0,0x80 rather than 0x00. The two-byte form is not valid UTF-8[11] because it is not the shortest possible representation.^ UTF-8 encoded character sequence .
  • UTF-8 - Internationalization of the File Transfer Protocol [RFC-Ref] 12 January 2010 6:43 UTC rfc-ref.org [Source type: Reference]

^ This is a test page to see how well your browser supports UTF-8 character encoding.

^ Because code points could otherwise be coded in more than one way using UTF-8, the Standard stipulates that the shortest possible representation for a character should be used.
  • PHP: utf8_decode - Manual 12 January 2010 6:43 UTC kr.php.net [Source type: FILTERED WITH BAYES]

.Modified UTF-8 strings will never contain any null bytes,[12] which allows them (with a null byte added to the end) to be processed by the traditional ASCIIZ string functions, yet allows all Unicode values including U+0000 to be in the string.^ Supports all western languages, Chinese, Japanese, Korean, Hebrew, Arabic, Thai, Vietnamese, Russian, Cyrillic, Greek, Unicode, utf-8 and more.
  • UTF-8 downloads at VicMan 12 January 2010 6:43 UTC www.vicman.net [Source type: FILTERED WITH BAYES]

^ May be a future version of PHP could make strings using double-byte UTF-16 encoding, so that Unicode could be natively supported, but this would require adding new functions to define the behavior of I/O functions to provide external encode/decode capabilities when sending Unicode strings to a file stream, or with echo and print builtins for the standard input and output streams: - What would be the semantic of fwrite($fp, $string) if $string can contain any UTF-16 characters ?
  • PHP: utf8_decode - Manual 12 January 2010 6:43 UTC kr.php.net [Source type: FILTERED WITH BAYES]

^ If the string being parsed does not contain a U+0023 NUMBER SIGN character, or if the first such character in the string is the last character in the string, then return null and abort these steps.
  • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8.
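
The two rules above (U+0000 as 0xC0,0x80 and surrogate pairs handled as in CESU-8) can be sketched as follows. The function name `modified_utf8_encode` is hypothetical and this is not taken from any particular implementation.

```python
def modified_utf8_encode(text: str) -> bytes:
    """Encode text so that the output never contains a 0x00 byte."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"  # overlong two-byte form of U+0000
        elif cp > 0xFFFF:
            # Surrogate pair, each half encoded separately (as in CESU-8).
            cp -= 0x10000
            for unit in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += chr(unit).encode("utf-8", errors="surrogatepass")
        else:
            out += ch.encode("utf-8", errors="surrogatepass")
    return bytes(out)


encoded = modified_utf8_encode("a\x00b")

# With a terminating null appended, the first 0x00 byte is the terminator,
# so null-terminated string functions see the whole string.
c_string = encoded + b"\x00"
terminator_index = c_string.index(b"\x00")

# A strict UTF-8 decoder rejects the overlong form.
try:
    b"\xc0\x80".decode("utf-8")
    strict_accepts_overlong_nul = True
except UnicodeDecodeError:
    strict_accepts_overlong_nul = False
```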
.In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter.^ Supports all western languages, Chinese, Japanese, Korean, Hebrew, Arabic, Thai, Vietnamese, Russian, Cyrillic, Greek, Unicode, utf-8 and more.
  • UTF-8 downloads at VicMan 12 January 2010 6:43 UTC www.vicman.net [Source type: FILTERED WITH BAYES]

^ UTF-8 is transparent to plain ASCII characters, is self-synchronized (meaning it is possible for a program to figure out where in the bytestream characters start) and can be used with normal string comparison functions for sorting and such.
  • PHP: utf8_encode - Manual 12 January 2010 6:43 UTC us2.php.net [Source type: FILTERED WITH BAYES]
  • PHP: utf8_encode - Manual 12 January 2010 6:43 UTC kr.php.net [Source type: FILTERED WITH BAYES]

^ Therefore any > program source using the standard string handling functions (like strlen) is > broken.
  • Lair Of The Multimedia Guru » UTF-8 12 January 2010 6:43 UTC guru.multimedia.cx [Source type: FILTERED WITH BAYES]

.However it uses Modified UTF-8 for object serialization,[13] for the Java Native Interface,[14] and for embedding constant strings in class files.^ I have also used XML File Changes IS option to modify web.config file according to the user input, and my application started without any problem afterwards.
  • XML File Changes Encoding problems ... UTF-8 becomes UTF-16 - Flexera Software Community 12 January 2010 6:43 UTC community.flexerasoftware.com [Source type: General]

^ Note however that editing these files is a modification of the JRE, and Sun does not support modified JREs.
  • Java Internationalization FAQ 12 January 2010 6:43 UTC java.sun.com [Source type: Reference]

^ Introduction Examples of Use Introductionary Sample Checking if a file contains valid UTF-8 text Ensure that a string contains valid UTF-8 text Reference Functions From utf8 Namespace Types From utf8 Namespace Functions From utf8::unchecked Namespace Types From utf8::unchecked Namespace Points of Interest Conclusion Links .
  • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]

[15] .Tcl also uses the same modified UTF-8[16] as Java for internal representation of Unicode data but uses strict CESU-8 for external data.^ The internal representation (Rune) of a character now differs from its external representation (UTF).
  • Hello World or Καλημέρα κόσμε or こんにちは 世界 12 January 2010 6:43 UTC plan9.bell-labs.com [Source type: FILTERED WITH BAYES]

^ UTF-16 encoded data to decode.
  • PHP: utf8_decode - Manual 12 January 2010 6:43 UTC kr.php.net [Source type: FILTERED WITH BAYES]

^ Note that this improved approach was included in Unicode, albeit too late, when UTF-16 was introduced.
  • Lair Of The Multimedia Guru » UTF-8 12 January 2010 6:43 UTC guru.multimedia.cx [Source type: FILTERED WITH BAYES]

Byte order mark

.Many Windows programs (including Windows Notepad) add the bytes 0xEF,0xBB,0xBF at the start of any document saved as UTF-8. This is the UTF-8 encoding of the Unicode byte-order mark (BOM), and is commonly referred to as a UTF-8 BOM even though it is not relevant to byte order.^ It is a byte encoding and is therefore byte-order independent.
  • Hello World or Καλημέρα κόσμε or こんにちは 世界 12 January 2010 6:43 UTC plan9.bell-labs.com [Source type: FILTERED WITH BAYES]

^ Checks whether a sequence of three octets is a UTF-8 byte order mark (BOM) .
  • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]

^ No byte order marks when using encodings in StreamWriters?
  • Sorting it all Out : Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!) 12 January 2010 6:43 UTC blogs.msdn.com [Source type: FILTERED WITH BAYES]

.The BOM can also appear if another encoding with a BOM is translated to UTF-8 without stripping it.^ The attacker's UTF-8 encoded payload is processed and acted on by the application without filtering or transcoding .
  • CAPEC - CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic (Release 1.4) 12 January 2010 6:43 UTC capec.mitre.org [Source type: Reference]

^ Convmv is a utility for converting file names in directory trees from one encoding to another (for example, from a legacy encoding into a UTF-8 encoding).
  • A Quick Primer On Unicode and Software Internationalization Under Linux and UNIX 12 January 2010 6:43 UTC eyegene.ophthy.med.umich.edu [Source type: Reference]

^ Then Notepad and many applications designed for ANSI would automatically support UTF-8 (without BOM).
  • Sorting it all Out : Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!) 12 January 2010 6:43 UTC blogs.msdn.com [Source type: FILTERED WITH BAYES]

The presence of the UTF-8 BOM may cause interoperability problems with existing software that could otherwise handle UTF-8, for example:
.
  • Older text editors may display the BOM as "ï»¿" at the start of the document, even if the UTF-8 file contains only ASCII and would otherwise display correctly.
  • Programming language parsers can often handle UTF-8 in string constants and comments, but cannot parse the BOM at the start of the file.
  • Programs that identify file types by leading characters may fail to identify the file if a BOM is present even if the user of the file could skip the BOM. Or conversely they will identify the file when the user cannot handle the BOM. An example is the Unix shebang syntax.
  • Programs that insert information at the start of a file will result in a file with the BOM somewhere in the middle of it (this is also a problem with the UTF-16 BOM).^ Works, but typing non-ASCII characters might be a problem.

    ^ (A user agent may want to truncate the string to 1024 characters for display, for instance.
    • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

    ^ UTF-32 string where to append the result of conversion.
    • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]
    • CodeProject: UTF-8 With C++ in a Portable Way. Free source code and programming help 12 January 2010 6:43 UTC www.codeproject.com [Source type: Reference]

    One example is offline browsers that add the originating URL to the start of the file.
.If compatibility with existing programs is not important, the BOM could be used to identify if a file is UTF-8 versus a legacy encoding, but this is still problematic due to many instances where the BOM is added or removed without actually changing the encoding, or various encodings are concatenated together.^ If the new encoding is a UTF-16 encoding, change it to UTF-8.
  • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

^ If the default system locale does not support UTF-8, in theory your application could change the locale “on the fly” using setlocale but in practice that requires two things; that there is a locale available on the system which supports UTF-8 (not guaranteed) and that the correct locale identifier string can be found (there a definately differences between Windows and *Nix locale identifiers and even amongst the Unixes believe there are variations e.g.
  • Handling UTF-8 with PHP [Web Application Component Toolkit] 12 January 2010 6:43 UTC www.phpwact.org [Source type: FILTERED WITH BAYES]

^ Then Notepad and many applications designed for ANSI would automatically support UTF-8 (without BOM).
  • Sorting it all Out : Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!) 12 January 2010 6:43 UTC blogs.msdn.com [Source type: FILTERED WITH BAYES]

Checking whether the text is valid UTF-8 is more reliable than relying on a BOM.
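
A small sketch of the BOM issues discussed in this section, using Python's standard `codecs` module and its `utf-8-sig` codec. The sample script content and the helper name `is_valid_utf8` are assumptions made for illustration.

```python
import codecs

# A shell script saved with a UTF-8 BOM in front of the shebang line.
doc = codecs.BOM_UTF8 + b"#!/bin/sh\necho hi\n"

# A program that identifies the file by its leading bytes no longer sees "#!".
starts_with_shebang = doc.startswith(b"#!")

# What a legacy ISO-8859-1 editor would show for the three BOM bytes.
bom_as_latin1 = doc[:3].decode("latin-1")

# Plain "utf-8" keeps the BOM as U+FEFF; "utf-8-sig" strips it if present.
kept = doc.decode("utf-8")
stripped = doc.decode("utf-8-sig")


def is_valid_utf8(data: bytes) -> bool:
    """Validity check that works with or without a BOM."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

The validity check gives the same answer for a file whether or not a BOM was added or removed along the way, which is why it is the more dependable signal.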

Advantages and disadvantages

General

Advantages

.
  • The ASCII characters are represented by themselves as single bytes that do not appear anywhere else, which makes UTF-8 work with the majority of existing APIs that take byte strings but only treat a small number of ASCII codes specially.^ Every ASCII character is represented as an ASCII character in UTF-8.
    • TextMate Blog » Handling encodings (UTF-8) 12 January 2010 6:43 UTC blog.macromates.com [Source type: General]

    ^ Instead, every character is represented by a number.
    • Gentoo Linux Documentation-- Using UTF-8 with Gentoo 12 January 2010 6:43 UTC www.gentoo.org [Source type: FILTERED WITH BAYES]

    ^ Since 7-bit ASCII characters can represent only themselves in UTF, the compiler does not have to be careful while looking for the termination of a string or comment.
    • Hello World or Καλημέρα κόσμε or こんにちは 世界 12 January 2010 6:43 UTC plan9.bell-labs.com [Source type: FILTERED WITH BAYES]

    .This removes the need to write a new Unicode version of every API, and makes it much easier to convert existing systems to UTF-8 than any other Unicode encoding.
  • UTF-8 is the only encoding for XML entities that does not require a BOM or an indication of the encoding.^ UTF-8 is a compact, efficient Unicode encoding.
    • What Is UTF-8 And Why Is It Important? 12 January 2010 6:43 UTC developers.sun.com [Source type: Reference]

    ^ Why choose UTF-8 instead of UTF-16 or other Unicode encodings?
    • Encode your XML documents in UTF-8 12 January 2010 6:43 UTC www-128.ibm.com [Source type: General]
    • Encode your XML documents in UTF-8 12 January 2010 6:43 UTC www.ibm.com [Source type: General]

    ^ UTF-8 encoded string to convert.
    • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]
    • CodeProject: UTF-8 With C++ in a Portable Way. Free source code and programming help 12 January 2010 6:43 UTC www.codeproject.com [Source type: Reference]

    [17]
  • UTF-8 and UTF-16 are the standard encodings for Unicode text in HTML documents, with UTF-8 as the preferred and most used encoding.
  • UTF-8 strings can be fairly reliably recognized as such by a simple heuristic algorithm.^ If enctype is text/plain Use the text/plain encoding algorithm .
    • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

    ^ J2SE 1.4 uses version 3.0 of the Unicode standard, and J2SE 1.3 uses version 2.1.
    • Java Internationalization FAQ 12 January 2010 6:43 UTC java.sun.com [Source type: Reference]

    ^ Why choose UTF-8 instead of UTF-16 or other Unicode encodings?
    • Encode your XML documents in UTF-8 12 January 2010 6:43 UTC www-128.ibm.com [Source type: General]
    • Encode your XML documents in UTF-8 12 January 2010 6:43 UTC www.ibm.com [Source type: General]

    [18] .The chance of a random string of bytes being valid UTF-8 and not pure ASCII is 3.9% for a two-byte sequence, 0.41% for a three-byte sequence and 0.026% for a four-byte sequence.^ Checks whether a sequence of octets is a valid UTF-8 string.
    • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]

    ^ Return value : true if the sequence is a valid UTF-8 string; false if not.
    • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]

    ^ UTF-8 string to test for validity.
    • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]

    [19] .ISO/IEC 8859-1 is even less likely to be mis-recognized as UTF-8: the only non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol.^ ASCII that add a potpourri of useful, non-standard characters like é and æ.
    • UTF-8: The Secret of Character Encoding - HTML Purifier 12 January 2010 6:43 UTC htmlpurifier.org [Source type: FILTERED WITH BAYES]

    ^ Encodes an ISO-8859-1 string to UTF-8 .
    • PHP: utf8_encode - Manual 12 January 2010 6:43 UTC us2.php.net [Source type: FILTERED WITH BAYES]
    • PHP: utf8_encode - Manual 12 January 2010 6:43 UTC kr.php.net [Source type: FILTERED WITH BAYES]

    ^ UTF-8 will not work with software designed for ASCII only either .
    • Lair Of The Multimedia Guru » UTF-8 12 January 2010 6:43 UTC guru.multimedia.cx [Source type: FILTERED WITH BAYES]

    .This is an advantage that most other encodings do not have, causing errors (mojibake) if the encoding is not stated in the file and wrongly guessed.
  • Sorting of UTF-8 strings as arrays of unsigned bytes will produce the same results as sorting them based on Unicode code points.
  • One UTF-8 advantage is that other byte-based encodings can pass through the same API. This means however that the encoding must be identified.^ UTF-8 encoded code point and returns the current one.
    • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]

    ^ This function has two purposes: one is to iterate backwards through a UTF-8 encoded string.
    • CodeProject: UTF-8 With C++ in a Portable Way. Free source code and programming help 12 January 2010 6:43 UTC www.codeproject.com [Source type: Reference]

    ^ The code-point length of a string is the number of Unicode code points in that string.
    • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

    Because the other encodings are unlikely to be valid UTF-8, a reliable way to implement this is to assume UTF-8 and switch to a legacy encoding only if several invalid UTF-8 byte sequences are encountered.
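
The "assume UTF-8, fall back to a legacy encoding" approach from the last item can be sketched as below. The function name, the default legacy encoding (ISO-8859-1) and the threshold of two invalid sequences are all assumptions chosen for illustration; the text above only says "several".

```python
def decode_assuming_utf8(data: bytes, legacy: str = "latin-1",
                         max_errors: int = 2) -> str:
    """Decode as UTF-8 unless too many invalid sequences are found."""
    text = data.decode("utf-8", errors="replace")
    # Each invalid sequence becomes one U+FFFD. Subtract the U+FFFD
    # characters that were genuinely present in the input.
    genuine = data.count(b"\xef\xbf\xbd")
    invalid = text.count("\ufffd") - genuine
    if invalid >= max_errors:
        return data.decode(legacy, errors="replace")
    return text


utf8_input = "caf\u00e9".encode("utf-8")
legacy_input = "caf\u00e9 na\u00efve r\u00e9sum\u00e9".encode("latin-1")

from_utf8 = decode_assuming_utf8(utf8_input)
from_legacy = decode_assuming_utf8(legacy_input)
one_bad_byte = decode_assuming_utf8(b"ok\xff")
```

A single bad byte stays below the threshold and is shown as U+FFFD, while text that is consistently ISO-8859-1 trips the threshold and is decoded with the legacy encoding instead.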

Disadvantages

.
  • A UTF-8 parser that is not compliant with current versions of the standard might accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output.^ J2SE 1.4 uses version 3.0 of the Unicode standard, and J2SE 1.3 uses version 2.1.
    • Java Internationalization FAQ 12 January 2010 6:43 UTC java.sun.com [Source type: Reference]

    ^ Client detection should always be limited to detecting known current versions; future versions and unknown versions should always be assumed to be fully compliant.
    • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

    ^ The Unicode Standard requires a Unicode-compliant decoder to "...treat any ill-formed code unit sequence as an error condition.
    • CAPEC - CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic (Release 1.4) 12 January 2010 6:43 UTC capec.mitre.org [Source type: Reference]

    .This provides a way for information to leak past validation routines designed to process data in its eight-bit representation.^ Return value: the 32 bit representation of the processed UTF-8 code point.
    • CodeProject: UTF-8 With C++ in a Portable Way. Free source code and programming help 12 January 2010 6:43 UTC www.codeproject.com [Source type: Reference]

    ^ Return value : the 32 bit representation of the processed UTF-8 code point.
    • UTF8-CPP: UTF-8 with C++ in a Portable Way 12 January 2010 6:43 UTC utfcpp.sourceforge.net [Source type: Reference]

    ^ ASCII is strictly seven-bit, meaning that it uses bit patterns representable with seven binary digits, which provides a range of 0 to 127 in decimal.
    • Gentoo Linux Documentation-- Using UTF-8 with Gentoo 12 January 2010 6:43 UTC www.gentoo.org [Source type: FILTERED WITH BAYES]

    [20]
  • The introduction of UTF-8 added a new actively used encoding on top of the locally established one.^ If the new encoding is a UTF-16 encoding, change it to UTF-8.
    • HTML5 12 January 2010 6:43 UTC dev.w3.org [Source type: Reference]

    ^ UTF-8 is only one of the supported I/O encodings .
    • Unicode Transformation Formats 12 January 2010 6:43 UTC czyborra.com [Source type: FILTERED WITH BAYES]

    ^ Top Reviewer: A reader Please tell us which one has to be used - either utf-8 or some other - explicity please...
    • Ask Tom "Multilingual Database and UTF-8" 12 January 2010 6:43 UTC asktom.oracle.com [Source type: FILTERED WITH BAYES]

    Having two encodings in active use led to bugs and confusion, and UTF-8 was blamed for this in countries where there had been no encoding troubles for some years.
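
To make the first disadvantage concrete, the sketch below shows how an overlong encoding of "/" could slip past a byte-level filter when the decoder skips the shortest-form check. The decoder `naive_decode_two_byte` is deliberately non-compliant and hypothetical; it exists only to show the failure mode.

```python
def naive_decode_two_byte(seq: bytes) -> str:
    """Decode a 2-byte sequence WITHOUT the shortest-form check (unsafe)."""
    lead, cont = seq
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


overlong_slash = b"\xc0\xaf"  # pseudo-UTF-8 for "/" (U+002F)

# A validation routine that inspects the raw bytes sees no "/" ...
passes_byte_filter = b"/" not in overlong_slash

# ... but the non-compliant decoder produces one anyway.
leaked = naive_decode_two_byte(overlong_slash)


def strict_rejects(data: bytes) -> bool:
    """True if a compliant decoder refuses the input."""
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True
```

A compliant decoder closes the gap because it rejects the overlong form outright instead of mapping it to the same output as the one-byte encoding.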

Compared to single-byte encodings

Advantages

  • UTF-8 can encode any Unicode character, avoiding the need to figure out and set a "code page" or otherwise indicate what character set is in use, and allowing output in multiple languages at the same time. For many languages there has been more than one single-byte encoding in use, so even knowing the language was insufficient information to display it correctly.
  • The bytes 0xFE and 0xFF do not appear, so a valid UTF-8 stream never matches the UTF-16 byte order mark and thus cannot be confused with it. The absence of 0xFF (\377) also eliminates the need to escape this byte in Telnet (and FTP control connection).
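The claim about 0xFE and 0xFF can be checked by brute force. The following is a minimal sketch in Python (the function name `bytes_used_by_utf8` is made up for this illustration); it encodes every code point that Python's built-in UTF-8 codec accepts and records which byte values occur.

```python
import sys


def bytes_used_by_utf8():
    """Collect every byte value that occurs in the UTF-8 form of any code point."""
    seen = set()
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        seen.update(chr(cp).encode("utf-8"))
    return seen


used = bytes_used_by_utf8()
# Neither byte of the UTF-16 byte order mark (FF FE or FE FF) is ever produced.
print("0xFE used:", 0xFE in used)
print("0xFF used:", 0xFF in used)
```

The loop covers a little over a million code points, so it may take a second or so to run.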

Disadvantages

  • UTF-8 encoded text is larger than the appropriate single-byte encoding except for plain ASCII characters. In the case of languages which used 8-bit character sets with non-Latin alphabets encoded in the upper half (such as most Cyrillic and Greek alphabet code pages), letters in UTF-8 will be double the size. For some languages such as Hindi's Devanagari and Thai, letters will be triple the size (this has caused objections in India and other countries).
  • A vocal group of computer users feel that their current 8-bit encoding (such as Windows-1252) includes all the necessary characters for them and all users they communicate with. They therefore see no benefit in Unicode, though often their hostility is misdirected at UTF-8 specifically and not Unicode in general.
  • It is possible in UTF-8 (or any other multi-byte encoding) to split a string in the middle of a character, which may result in an invalid string if the pieces are not concatenated later.
  • If the code points are all the same size, measuring a fixed number of them is easy. Due to ASCII-era documentation where "character" is used as a synonym for "byte" this is often considered important. However, by measuring string positions using bytes instead of "characters" most algorithms can be easily and efficiently[citation needed] adapted for UTF-8.
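The size difference and the mid-character split described in the list above can be illustrated with a short Python sketch. The sample words and the helper name `size_ratio` are assumptions chosen for the example; the codecs (`koi8_r`, `tis_620`, `latin_1`) are standard Python codecs.

```python
def size_ratio(text, legacy_codec):
    """Return how many times larger the UTF-8 form is than the legacy form."""
    legacy_size = len(text.encode(legacy_codec))
    utf8_size = len(text.encode("utf-8"))
    return utf8_size / legacy_size


# Cyrillic letters take one byte in KOI8-R and two in UTF-8;
# Thai letters take one byte in TIS-620 and three in UTF-8.
print(size_ratio("привет", "koi8_r"))
print(size_ratio("สวัสดี", "tis_620"))
print(size_ratio("hello", "latin_1"))

# Splitting the byte string at an arbitrary offset can cut a character in half.
data = "привет".encode("utf-8")
head, tail = data[:3], data[3:]  # offset 3 falls inside the second letter
try:
    head.decode("utf-8")
    head_is_valid = True
except UnicodeDecodeError:
    head_is_valid = False

# Each piece on its own is invalid, but concatenating them restores the text.
rejoined = (head + tail).decode("utf-8")
print(head_is_valid, rejoined)
```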

Compared to other multi-byte encodings

Advantages

  • UTF-8 uses the codes 0-127 only for the ASCII characters.
  • UTF-8 can encode any Unicode character. Files in different languages can be displayed correctly without having to choose the correct code page or font. For instance Chinese and Arabic can be in the same text without special codes inserted to switch the encoding.
  • UTF-8 is "self-synchronizing": character boundaries are easily found when searching either forwards or backwards. If bytes are lost due to error or corruption, one can always locate the beginning of the next character and thus limit the damage. Many multi-byte encodings are much harder to resynchronize.
  • Any byte-oriented string searching algorithm can be used with UTF-8 data, since the sequence of bytes for a character cannot occur anywhere else. Some older variable-length encodings (such as Shift JIS) did not have this property and thus made string-matching algorithms rather complicated.
  • UTF-8 is efficient to encode using simple bit operations. It does not require slower mathematical operations such as multiplication or division (unlike the obsolete UTF-1 encoding).
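Two of the points above, encoding with simple bit operations and resynchronizing after damage, can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation; the names `encode_code_point` and `next_boundary` are hypothetical, and the code point limits follow the RFC 3629 range (up to U+10FFFF, excluding surrogates).

```python
def encode_code_point(cp):
    """Encode one code point as UTF-8 using only shifts, masks and ORs."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def next_boundary(data, pos):
    """Return the first index >= pos that is not a continuation byte (10xxxxxx)."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


euro = encode_code_point(0x20AC)  # U+20AC EURO SIGN
print(euro.hex())

# Simulate corruption: the lead byte of the euro sign is lost in transit.
damaged = ("a€b".encode("utf-8"))[2:]
start = next_boundary(damaged, 0)
print(damaged[start:].decode("utf-8"))
```

Because continuation bytes always have the bit pattern 10xxxxxx, a reader that lands in the middle of a character only needs to skip at most three bytes to find the next boundary.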

Disadvantages

  • UTF-8 often takes more space than an encoding made for one or a few languages. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate multi-byte encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.

Compared to UTF-16

Advantages

  • Converting to UTF-16 while maintaining compatibility with existing programs (such as was done with Windows) requires every API and data structure that takes a string to be duplicated. Handling of invalid encodings makes this much more difficult than it may first appear.
  • Byte streams containing invalid UTF-8 cannot be losslessly converted to UTF-16. Invalid UTF-16 however can be losslessly converted to UTF-8. This turns out to be surprisingly important[citation needed] in practice.
  • Characters outside the basic multilingual plane are not a special case. UTF-16 is often mistaken for the obsolete constant-length UCS-2 encoding, leading to code that works for most text but suddenly fails for non-BMP characters.
  • ASCII characters take 1 byte in UTF-8 and 2 in UTF-16. Text in all languages using codepoints below U+0800 (which includes all modern European languages) will be smaller in UTF-8 due to the presence of ASCII spaces, newlines, numbers, punctuation, and Latin letters.
  • Most communication and storage was designed for a stream of bytes. A UTF-16 string must use a pair of bytes for each code unit, which introduces a couple of potential problems:
    • The order of those two bytes becomes an issue and must be specified in the protocol, such as with a byte order mark.
    • If a byte is missing from UTF-16, the whole rest of the string will be meaningless text.
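The effect of a missing byte can be seen by deleting one byte from the same text in both encodings. The sketch below is only an illustration; the sample string and the helper `drop_byte` are assumptions, and decoding uses Python's `errors="replace"` mode so that invalid input shows up as U+FFFD instead of raising an exception.

```python
def drop_byte(data, index):
    """Return a copy of data with the byte at the given index removed."""
    return data[:index] + data[index + 1:]


text = "héllo wörld"

# UTF-8: losing one byte of the two-byte "é" damages only that character.
utf8_result = drop_byte(text.encode("utf-8"), 2).decode("utf-8", errors="replace")

# UTF-16 (little-endian, no BOM): losing one byte shifts every later
# 16-bit unit by one byte, so the rest of the string decodes as unrelated text.
utf16_result = drop_byte(text.encode("utf-16-le"), 2).decode("utf-16-le", errors="replace")

print(repr(utf8_result))
print(repr(utf16_result))

# Byte order also matters for UTF-16: the same letter has two serializations.
print("A".encode("utf-16-le").hex(), "A".encode("utf-16-be").hex())
```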

Disadvantages

  • A simplistic parser for UTF-16 is unlikely[citation needed] to convert invalid sequences to ASCII. Since the dangerous characters in most situations are ASCII, a simplistic UTF-16 parser is much less dangerous than a simplistic UTF-8 parser.
  • Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi could take more space in UTF-8 if there are more of these characters than there are ASCII characters. Since ASCII includes the space, Arabic numerals, newline characters, most characters used in programming and markup languages, as well as punctuation marks used in some of these languages, this rarely happens. For example, both the Japanese and the Korean UTF-8 articles on Wikipedia take more space if saved as UTF-16 than the original UTF-8 version.[21]
  • In UCS-2 (but not UTF-16) Unicode code points are all the same size, making measurements of a fixed number of them easy. Due to ASCII-era documentation where "character" is used as a synonym for "byte" this is often considered important. Most UTF-16 implementations, including Windows, measure UTF-16 non-BMP characters as 2 units, as this is the only practical way to handle the strings. The same applies to UTF-8.
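The trade-offs in this list come down to counting bytes and code units. The following Python sketch (the helper name `sizes` and the sample strings are assumptions for illustration) reports the number of code points, UTF-8 bytes, UTF-16 bytes and UTF-16 code units for a piece of text.

```python
def sizes(text):
    """Count code points, UTF-8 bytes, UTF-16 bytes and UTF-16 code units."""
    utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(utf16),
        "utf16_units": len(utf16) // 2,
    }


# Pure CJK text is smaller in UTF-16 (2 bytes per character versus 3).
print(sizes("日本語"))

# Once ASCII markup surrounds it, UTF-8 can come out ahead.
print(sizes("<p>日本語</p>"))

# A non-BMP character (U+1D11E MUSICAL SYMBOL G CLEF) is one code point
# but is measured as two UTF-16 units or four UTF-8 bytes.
print(sizes("\U0001D11E"))
```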

See also

References

  1. ^ "Moving to Unicode 5.1". Official Google Blog. 2008-5-5. http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html. Retrieved 2008-05-08. 
  2. ^ Alvestrand, H. (1998), "IETF Policy on Character Sets and Languages", RFC 2277, Internet Engineering Task Force 
  3. ^ "Using International Characters in Internet Mail". Internet Mail Consortium. August 1, 1998. http://www.imc.org/mail-i18n.html. Retrieved 2007-11-08. 
  4. ^ Pike, Rob (2003-04-03). "UTF-8 history". http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt. 
  5. ^ Marin, Marvin (2000-10-17). "Web Server Folder Traversal MS00-078". http://www.sans.org/resources/malwarefaq/wnt-unicode.php. 
  6. ^ Yergeau, F. (2003), "UTF-8, a transformation format of ISO 10646", RFC 3629, Internet Engineering Task Force 
  7. ^ "Non-decodable Bytes in System Character Interfaces". http://www.python.org/dev/peps/pep-0383/. 
  8. ^ Internet Assigned Numbers Authority Character Sets
  9. ^ W3C: Setting the HTTP charset parameter notes that the IANA list is used for HTTP
  10. ^ "Java SE 6 documentation for Interface java.io.DataInput, subsection on Modified UTF-8". Sun Microsystems. 2008. http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8. Retrieved 2009-05-22. 
  11. ^ "[...] the overlong UTF-8 sequence C0 80 [...]", "[...] the illegal two-octet sequence C0 80 [...]" "Request for Comments 3629: "UTF-8, a transformation format of ISO 10646"". 2003. http://www.apps.ietf.org/rfc/rfc3629.html#page-5. Retrieved 2009-05-22. 
  12. ^ "[...] Java virtual machine UTF-8 strings never have embedded nulls." "The Java Virtual Machine Specification, 2nd Edition, section 4.4.7: "The CONSTANT_Utf8_info Structure"". Sun Microsystems. 1999. http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html#7963. Retrieved 2009-05-24. 
  13. ^ "[...] encoded in modified UTF-8." "Java Object Serialization Specification, chapter 6: Object Serialization Stream Protocol, section 2: Stream Elements". Sun Microsystems. 2005. http://java.sun.com/javase/6/docs/platform/serialization/spec/protocol.html#8299. Retrieved 2009-05-22. 
  14. ^ "The JNI uses modified UTF-8 strings to represent various string types." "Java Native Interface Specification, chapter 3: JNI Types and Data Structures, section: Modified UTF-8 Strings". Sun Microsystems. 2003. http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/types.html#wp16542. Retrieved 2009-05-22. 
  15. ^ "[...] differences between this format and the "standard" UTF-8 format." "The Java Virtual Machine Specification, 2nd Edition, section 4.4.7: "The CONSTANT_Utf8_info Structure"". Sun Microsystems. 1999. http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html#7963. Retrieved 2009-05-23. 
  16. ^ "In orthodox UTF-8, a NUL byte(\x00) is represented by a NUL byte. [...] But [...] we [...] want NUL bytes inside [...] strings [...]" "Tcler's Wiki: UTF-8 bit by bit (Revision 6)". 2009-04-25. http://wiki.tcl.tk/_/revision?N=1211&V=6. Retrieved 2009-05-22. 
  17. ^ W3.org
  18. ^ W3 FAQ: Multilingual Forms (a Perl regular expression to validate a UTF-8 string)
  19. ^ There are 256 × 256 − 128 × 128 not-pure-ASCII two-byte sequences, and of those, only 1920 encode valid UTF-8 characters (the range U+0080 to U+07FF), so the proportion of valid not-pure-ASCII two-byte sequences is 3.9%. Similarly, there are 256 × 256 × 256 − 128 × 128 × 128 not-pure-ASCII three-byte sequences, and 61,406 valid three-byte UTF-8 sequences (U+000800 to U+00FFFF minus surrogate pairs and non-characters), so the proportion is 0.41%; finally, there are 256^4 − 128^4 non-ASCII four-byte sequences, and 1,048,544 valid four-byte UTF-8 sequences (U+010000 to U+10FFFF minus non-characters), so the proportion is 0.026%. (Note that this assumes that control characters pass as ASCII; without the control characters, the percentage proportions drop somewhat.)
  20. ^ Tools.ietf.org
  21. ^ The version from 2009-04-27 of ja:UTF-8 needed 50 kb when saved (as UTF-8), but when converted to UTF-16 (with notepad) it took 81 kb, with a similar result for the Korean article

External links

There are several current definitions of UTF-8 in various standards documents:
  • RFC 3629 / STD 63 (2003), which establishes UTF-8 as a standard Internet protocol element
  • The Unicode Standard, Version 5.0, §3.9 D92, §3.10 D95 (2007)
  • The Unicode Standard, Version 4.0, §3.9–§3.10 (2003)
  • ISO/IEC 10646:2003 Annex D (2003)
They supersede the definitions given in the following obsolete works:
  • ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
  • The Unicode Standard, Version 2.0, Appendix A (1996)
  • RFC 2044 (1996)
  • RFC 2279 (1998)
  • The Unicode Standard, Version 3.0, §2.3 (2000) plus Corrigendum #1 : UTF-8 Shortest Form (2000)
  • Unicode Standard Annex #27: Unicode 3.1 (2001)
They are all the same in their general mechanics, with the main differences being on issues such as allowed range of code point values and safe handling of invalid input.


Wiktionary

Up to date as of January 15, 2010

Definition from Wiktionary, a free dictionary

English

Initialism

  1. The Unicode Transformation Format, a variable width encoding scheme for Unicode characters.

See also

  • UTF-7 mostly obsolete older encoding scheme
  • UTF-16
