Charset detection

From Wikipedia, the free encyclopedia

Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. Detection usually involves statistical analysis of byte patterns, which can require frequency distributions of trigraphs for each language as encoded in each code page to be detected. The process is not foolproof because it depends on statistical data; for example, some versions of the Windows operating system would misdetect the ASCII phrase "Bush hid the facts" as UTF-16LE text and display it as Chinese characters.
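The ambiguity behind such misdetections can be illustrated with a minimal sketch. It is not the algorithm any particular detector uses: it only checks which candidate encodings decode a byte string without error, and the helper names candidate_decodings and is_cjk are hypothetical. The 18 bytes of the phrase are valid in several encodings at once, so validity alone cannot settle the question and a detector has to fall back on statistics.

```python
def candidate_decodings(data, encodings=("utf-8", "utf-16-le", "latin-1")):
    """Return {encoding: text} for every candidate that decodes without error."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return results


def is_cjk(ch):
    """True if ch lies in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


phrase = b"Bush hid the facts"
decoded = candidate_decodings(phrase)
for name, text in decoded.items():
    # Every candidate succeeds; only the UTF-16LE reading consists of ideographs.
    print(name, repr(text), all(is_cjk(c) for c in text))
```

Read as UTF-16LE, each pair of ASCII bytes forms one 16-bit code unit that happens to fall in the CJK range, which is why the text can plausibly pass for Chinese.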

Due to the unreliability of charset detection, it is usually better to properly label datasets with the correct encoding. For example, HTML documents encoded in a superset of ASCII, such as UTF-8 or Latin-1, can declare their encoding in a meta element, thus:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
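A reader of such a document can look for this declaration before falling back to guessing. The following is a rough sketch using Python's standard html.parser module; the class MetaCharsetFinder and the function declared_charset are hypothetical names, and the regular expression is a simplification rather than the full algorithm browsers apply. It relies on the point made above: because the encoding is a superset of ASCII, the markup can be scanned as ASCII before the real encoding is known.

```python
import re
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first <meta http-equiv="Content-Type"> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        # Simplified pattern: charset=NAME, optionally quoted.
        match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)",
                          attributes.get("content") or "", re.IGNORECASE)
        if match:
            self.charset = match.group(1)


def declared_charset(raw_html):
    """Return the charset declared in a meta element, or None if absent."""
    finder = MetaCharsetFinder()
    # Non-ASCII bytes are replaced; the tags themselves survive intact.
    finder.feed(raw_html.decode("ascii", errors="replace"))
    return finder.charset


page = b'<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
print(declared_charset(page))
```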

Alternatively, when documents are served over HTTP, the same metadata can be sent out-of-band in the Content-Type header.
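On the receiving side, the charset parameter can be read from the header value before the body is decoded. The sketch below uses the standard library's email.message.Message class, whose get_content_charset method parses MIME-style parameters and returns the name in lower case; the function name charset_from_content_type and its default argument are assumptions made for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() lower-cases the name and returns the
    # fallback when no charset parameter is present.
    return message.get_content_charset(default)


body = "na\u00efve".encode("utf-8")
encoding = charset_from_content_type("text/html;charset=UTF-8", default="latin-1")
print(encoding, body.decode(encoding))
```

Only when neither the header nor the document itself declares an encoding does a client need to fall back on heuristic detection.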
