Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. Detection usually relies on statistical analysis of byte patterns, which can require frequency distributions of trigraphs for each language, as encoded in each code page to be detected. The process is not foolproof, because a statistical guess can be wrong, especially on short inputs; for example, some versions of the Windows operating system would mis-detect the ASCII phrase "Bush hid the facts" as UTF-16LE text and display it as Chinese characters.
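The ambiguity behind the "Bush hid the facts" example can be reproduced directly. The following is a minimal Python sketch, assuming the misdetection amounts to reading the ASCII bytes as UTF-16LE (the usual explanation); the helper `is_cjk` is a hypothetical name introduced here, and it only tests the basic CJK Unified Ideographs block.

```python
# The phrase is 18 ASCII bytes with no trailing newline.
data = b"Bush hid the facts"

# Interpretation 1: one byte per character (ASCII).
as_ascii = data.decode("ascii")

# Interpretation 2: two bytes per character, little-endian (UTF-16LE).
# Both decodes succeed, so the bytes alone do not settle the question.
as_utf16 = data.decode("utf-16-le")


def is_cjk(ch: str) -> bool:
    """Hypothetical helper: True if ch is in the CJK Unified Ideographs block."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


print(as_ascii)
print(len(as_utf16), all(is_cjk(c) for c in as_utf16))
```

Because every byte pair happens to land in the CJK range, a detector that weighs "valid UTF-16 with plausible characters" too heavily has no byte-level evidence against the wrong answer.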
Due to the unreliability of charset detection, it is usually better to properly label datasets with the correct encoding. For example, HTML documents encoded in a superset of ASCII, such as UTF-8 or Latin-1, can declare their encoding in a meta element, thus:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
Alternatively, when documents are conveyed through HTTP, the same metadata can be sent out-of-band in the Content-Type header.
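As an illustration of reading such labels instead of guessing, the sketch below extracts a declared charset from a Content-Type header value or, failing that, from an `http-equiv` meta element. It uses only the Python standard library; the function names `charset_from_content_type` and `declared_charset` and the 1024-byte scan limit are assumptions made for this example, not part of any standard API. It checks the HTTP header first, on the assumption that the transport-level label takes precedence over the in-document one.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


class _MetaCharset(HTMLParser):
    """Records the charset of the first <meta http-equiv="Content-Type"> seen."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            self.charset = charset_from_content_type(attributes.get("content") or "")


def declared_charset(http_header: Optional[str], body: bytes) -> Optional[str]:
    """Prefer the HTTP header; otherwise scan the start of the HTML body."""
    from_header = charset_from_content_type(http_header or "")
    if from_header:
        return from_header
    # The meta element is ASCII-compatible markup, so a lossy ASCII decode
    # of the first bytes is enough to find it (1024 is an arbitrary limit).
    parser = _MetaCharset()
    parser.feed(body[:1024].decode("ascii", errors="replace"))
    return parser.charset


html = b'<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8"></head></html>'
print(declared_charset(None, html))
print(declared_charset("text/html; charset=ISO-8859-1", html))
```

When neither label is present the function returns `None`, which is the point at which a caller would have to fall back on heuristic detection.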