Encoding

An encoding (or code page) is a rule that defines how characters are encoded to bytes and how bytes are converted back to characters. It has nothing to do with fonts or with how symbols are drawn. It concerns data only.
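A minimal sketch of this rule in Python, assuming the standard `cp1251` codec (Windows code page 1251) and a sample string chosen only for illustration: the same characters become different bytes under different code pages, and decoding with the same code page restores them.

```python
# Sample text, chosen only for illustration.
text = "Привет"

# Characters -> bytes, using two different code pages.
as_cp1251 = text.encode("cp1251")   # single-byte Cyrillic code page
as_utf8 = text.encode("utf-8")      # Unicode code page

# Bytes -> characters, using the same code page that produced them.
restored = as_cp1251.decode("cp1251")

print(as_cp1251.hex(" "))   # one byte per letter
print(as_utf8.hex(" "))     # two bytes per letter for Cyrillic
print(restored == text)
```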

A code page can define encoding and decoding rules for the full set of symbols from all languages (the Unicode code pages) or for a limited set of symbols from some languages (1251, 1250, etc.).

Depending on the code page selected for decoding the bytes, you get different characters as the result. If the result does not meet your expectations, it does not mean that something is corrupted. It only means that the wrong code page (the key) was selected.
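As an illustration, the following Python sketch decodes one byte sequence with two code pages. The byte values and the choice of `cp1251` and `cp1252` are assumptions made for the example. The bytes themselves never change, so the "wrong" result can always be undone.

```python
# Six bytes as they might be read from a file (illustrative values).
data = b"\xcf\xf0\xe8\xe2\xe5\xf2"

right = data.decode("cp1251")   # Cyrillic code page: readable text
wrong = data.decode("cp1252")   # Western European code page: unexpected letters

print(right)
print(wrong)

# Nothing was corrupted: re-encoding the wrong result with the same
# code page gives back the original bytes, which can then be decoded
# with the correct one.
recovered = wrong.encode("cp1252").decode("cp1251")
print(recovered == right)
```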

But there are cases when data really can be corrupted by a conversion. This can happen if you convert from a Unicode code page to a non-Unicode code page and the text contains symbols that cannot be mapped to symbols in the target, non-Unicode code page.
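A short Python sketch of such a lossy conversion, assuming `cp1251` as the target code page and a sample text that mixes Cyrillic and Japanese characters. Python's `errors="replace"` mode stands in here for whatever substitution a particular converter applies.

```python
# Text with symbols from two scripts (illustrative sample).
text = "Код 日本"

# Strict conversion refuses to encode symbols missing from cp1251.
try:
    text.encode("cp1251")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Lenient conversion substitutes "?" for every unmappable symbol.
lossy = text.encode("cp1251", errors="replace")
round_trip = lossy.decode("cp1251")

print(strict_failed)
print(round_trip)   # the Japanese characters are gone for good
```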

Strictly speaking, you never convert bytes to symbols. It is always a conversion from bytes representing symbols in one code page to bytes representing the same symbols in another code page. Some code pages are simply native to the operating system (such as UTF-16 LE, code page 1200), and you can pass a byte sequence encoded in such a code page directly to the OS drawing routines.
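The bytes-to-bytes view can be sketched in Python as below. The helper name `transcode` is hypothetical, and Python's intermediate `str` value is itself only an internal encoded representation of the text, not "symbols" in any drawable sense.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes in the source code page to bytes in the target code page."""
    # The intermediate str is Python's internal representation of the text.
    return data.decode(source).encode(target)


# Illustrative input: six bytes in code page 1251.
cp1251_bytes = b"\xcf\xf0\xe8\xe2\xe5\xf2"
utf16_bytes = transcode(cp1251_bytes, "cp1251", "utf-16-le")

print(utf16_bytes.hex(" "))                       # two bytes per character
print(len(cp1251_bytes), len(utf16_bytes))
```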

HippoEDIT internally uses the Unicode UTF-16 LE code page to keep texts. It converts bytes to this code page on reading and back on saving. Because of that, it can display and edit texts in all languages in a single document. Only HippoEDIT NU (non-Unicode) is not able to handle mixed-language documents, because it does not keep texts as Unicode but uses the system single-byte code page to minimize memory consumption (the Unicode version of HippoEDIT uses twice as much memory to keep texts).

Normally all data in a file is encoded using the same code page. If you mix blocks encoded with different code pages, without a predefined unique marker that defines the start of each block and its code page, the result is not data but garbage bytes, because there is no way to decode these bytes back to characters.
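A small Python sketch of the problem, with two blocks and code pages chosen only for illustration: once the blocks are joined without markers, no single code page decodes the whole sequence correctly.

```python
# Two blocks written with different code pages (illustrative sample).
block1 = "Привет ".encode("cp1251")
block2 = "мир".encode("utf-8")
mixed = block1 + block2   # no marker says where block2 starts

# Decoding everything as UTF-8 fails on the first block.
try:
    mixed.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding everything as cp1251 "works" but garbles the second block.
as_cp1251 = mixed.decode("cp1251")

print(utf8_ok)
print(as_cp1251)
```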

To help readers recognize the code page of encoded bytes, writers use a BOM (byte order mark) sequence. If no BOM is found, readers can look for pattern strings that declare the code page (such as charset= in HTML or encoding= in XML), use statistical analysis of the data, etc.
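BOM detection can be sketched in Python as follows. This is not how HippoEDIT implements it. The function names `detect_bom` and `decode_with_bom` and the `cp1252` fallback are assumptions for the example, while the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# UTF-32 LE must be checked before UTF-16 LE, because its BOM
# (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, fallback: str = "cp1252") -> str:
    """Decode using the BOM if present, else the (assumed) fallback code page."""
    name, skip = detect_bom(data)
    return data[skip:].decode(name or fallback)


sample = codecs.BOM_UTF16_LE + "Привет".encode("utf-16-le")
print(detect_bom(sample))
print(decode_with_bom(sample))
```

A reader without a BOM would continue with the other checks described above. The fallback here only marks the place where such logic would go.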

More info about character encoding on Wikipedia.