How encoding detection works

  1. A file has a single encoding (the same thing as a code page). The encoding can be Unicode (UTF-16 LE or BE (1200, 1201), UTF-8 (65001)) or non-Unicode (for example, 1252 (Western European)).
  2. There are several places where encoding conversion can be applied to a document: Open, Save As, New, and Search and Replace.
  3. The encoding can be selected or changed in the File Open/Save dialog, via the context menu or status bar, in Project Settings, in Tools→Options→Document settings, and in the syntax specification (where you can set it either as a preferred encoding or as a forced encoding). In addition, HippoEDIT auto-detects the encoding using several algorithms: a check of the BOM bytes, a statistical test for UTF-16 LE/BE, a statistical test for UTF-8, a check of encoding strings, and the same checks that IE uses.
  4. Once the user has changed the encoding for a document, this preference has priority over all other settings. Preferences are machine-specific and can be reset if the HippoEDIT temp files are deleted or their format changes in a new version.
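
The BOM check mentioned in item 3 can be illustrated with a minimal Python sketch. It is not HippoEDIT's implementation; the function name detect_bom and the returned codec names are hypothetical and chosen only for illustration. The byte signatures themselves are the standard Unicode BOMs.

```python
import codecs
from typing import Optional

# Standard BOM signatures for the three Unicode encodings the text names.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"), # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"), # FE FF
)


def detect_bom(data: bytes) -> Optional[str]:
    """Return a codec name if the data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: other detection methods would have to decide
```

A file without a BOM yields None here, which is the case where the statistical tests and encoding strings come into play.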

So, how does all this work together (or how is it designed to work)?

New File encoding selection (if a setting is not defined or is set to Automatic, the next one is taken):

  • Syntax force encoding
  • Current Project settings encoding
  • Document settings encoding
  • Syntax preferred encoding
  • System local encoding
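
The "first defined setting wins" rule behind this list can be sketched in a few lines of Python. This is only an illustration under assumptions: the function resolve_encoding, the source labels, and the sample values are hypothetical, and None stands for "not defined or set to Automatic".

```python
from typing import Optional, Sequence, Tuple

# None models a setting that is not defined or is set to Automatic.
Candidate = Tuple[str, Optional[str]]


def resolve_encoding(candidates: Sequence[Candidate]) -> Tuple[str, str]:
    """Return (source, encoding) for the first candidate that is defined."""
    for source, encoding in candidates:
        if encoding is not None:
            return source, encoding
    raise ValueError("no encoding defined anywhere in the chain")


# Hypothetical sample values, listed in the New File order given above.
new_file_chain = [
    ("syntax forced", None),
    ("project settings", None),
    ("document settings", "utf-8"),
    ("syntax preferred", "cp1251"),
    ("system locale", "cp1252"),
]

source, encoding = resolve_encoding(new_file_chain)
```

With these sample values the document settings entry is chosen, because the two entries before it are left on Automatic. The same function would serve the other lists in this article if given their order.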

Open File encoding selection (if a setting is not defined or is set to Automatic, the next one is taken):

  • Syntax force encoding
  • Encoding selected in File Open dialog
  • Auto-detected encoding, using all of the algorithms mentioned above. The set of applied algorithms can be changed in settings.xml
  • Syntax preferred encoding
  • Project settings encoding
  • Document setting encoding
  • System local encoding
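
The article does not describe how the UTF-8 test works internally. As a rough illustration of what such a test can look like, the sketch below uses a common heuristic: the bytes must decode strictly as UTF-8 and must contain at least one non-ASCII byte. This is an assumption, not HippoEDIT's actual algorithm, and looks_like_utf8 is a hypothetical name.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: valid UTF-8 that contains at least one non-ASCII byte."""
    if all(byte < 0x80 for byte in data):
        # Pure ASCII is valid UTF-8 but gives no evidence for it,
        # since most legacy code pages share the same ASCII range.
        return False
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

A heuristic like this can fail on a buffer that is cut off in the middle of a multi-byte character, which is one reason why detection on too little data may return the wrong result.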

Save File encoding selection (if a setting is not defined or is set to Automatic, the next one is taken):

  • Encoding selected in File Save dialog
  • Current document encoding
  • During save, HippoEDIT checks the current document encoding against the encoding found in encoding strings (XML, HTML, etc.). If the encodings do not match, the user is asked to select which encoding to use
  • Because HippoEDIT internally works with a Unicode representation of the text (UTF-16 LE), it can happen on saving that the current text cannot be saved in the currently selected encoding without loss of information. In this case, HippoEDIT should pop up a warning informing the user about possible data loss and suggesting to save the document as Unicode or in another encoding. This behavior is controlled by the flag Check encoding accuracy in Tools→Options→Formatting
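
The two save-time checks above can be approximated in Python as follows. This is a sketch under assumptions, not HippoEDIT code: all three function names are hypothetical, only the XML declaration is handled as an encoding string, and Python codec names stand in for code pages.

```python
import codecs
import re
from typing import Optional

# Matches the encoding pseudo-attribute of an XML declaration at the
# very start of the text, e.g. <?xml version="1.0" encoding="UTF-8"?>
_XML_DECL = re.compile(r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def declared_xml_encoding(text: str) -> Optional[str]:
    """Return the encoding named in the XML declaration, lowercased, or None."""
    match = _XML_DECL.match(text)
    return match.group(1).lower() if match else None


def same_encoding(a: str, b: str) -> bool:
    """Compare two encoding names after resolving aliases (utf8 vs UTF-8)."""
    return codecs.lookup(a).name == codecs.lookup(b).name


def can_save_losslessly(text: str, encoding: str) -> bool:
    """True if every character of the text is representable in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

For example, can_save_losslessly("Привет", "cp1252") is False because code page 1252 has no Cyrillic letters, which corresponds to the situation where the data loss warning would be shown.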

Search and Replace uses the same encoding logic as Open/Save file, except that interactive selection of the encoding through the Open/Save dialog is not available.

So, if you see that documents are opened with the wrong encoding, you have several ways to solve this:

  • Explicitly select the correct encoding in the File Open dialog
  • Set a forced encoding for the syntax you are using, in the SPECIFICATION section of the schema spec file:
    <Encoding default="852" force="true"/>
  • Disable extended auto-detection (the IE algorithms). It can return the wrong result if there is not enough data to analyze.

This can be done with XML flags in settings.xml, section General:

<EncodingDetection extended="false"/> 

Also, from now on, extended encoding detection is enabled by default only for syntaxes inherited from deftext (such as Plain Text, XML, and HTML).

You can control encoding detection at a finer level by disabling individual detection methods, including those that in most cases do not produce false positives. The attributes are:

  • extended - heuristically based detection of encoding
  • min_confidence - minimal confidence level for extended encoding detection, default is 90, may be higher than 100
  • bom (default true) - use BOM signs for encoding detection
  • unicode (default true) - use UTF16 (LE/BE) statistic detection logic in addition to BOM detection (if BOM is not defined)
  • enc_strings (default true) - use “magic” encoding strings for detection (check EncodingDetection in syntax definition)
  • utf8 (default true) - use extended algorithms for UTF8 detection in addition to BOM detection (if BOM is not defined)
<EncodingDetection extended="false" bom="true" unicode="false" enc_string="true" utf8="true" min_confidence="90"/> 
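
To make the defaults concrete, the sketch below reads such an element with Python's standard XML parser and fills in the documented defaults for missing attributes. It is only an illustration: parse_detection_flags is a hypothetical helper, not part of HippoEDIT. Two assumptions are made: the encoding strings flag is accepted under both spellings that appear above (enc_strings in the list, enc_string in the sample line), and extended is returned as None when absent because no single default is stated for it.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def _to_bool(raw: Optional[str], default: Optional[bool]) -> Optional[bool]:
    """Interpret "true"/"false" attribute text; fall back when missing."""
    if raw is None:
        return default
    return raw.strip().lower() == "true"


def parse_detection_flags(xml_line: str) -> dict:
    """Parse one <EncodingDetection .../> element into a flag dictionary."""
    elem = ET.fromstring(xml_line)
    # Assumption: accept either spelling of the encoding strings attribute.
    enc_raw = elem.get("enc_strings", elem.get("enc_string"))
    return {
        "extended": _to_bool(elem.get("extended"), None),  # default depends on syntax
        "bom": _to_bool(elem.get("bom"), True),
        "unicode": _to_bool(elem.get("unicode"), True),
        "enc_strings": _to_bool(enc_raw, True),
        "utf8": _to_bool(elem.get("utf8"), True),
        "min_confidence": int(elem.get("min_confidence", "90")),
    }


flags = parse_detection_flags(
    '<EncodingDetection extended="false" bom="true" unicode="false" '
    'enc_string="true" utf8="true" min_confidence="90"/>'
)
```

For the sample line, flags["unicode"] is False and every other boolean method flag except extended is True.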

Suppose you would like documents with a specific syntax (html, js, css, clipper) to always be opened using a predefined code page (encoding).

Here is an example from xml_spec.xml:

<Encoding default="utf-8" force="false" bom="false">
...
</Encoding>

More information can be found in the SPECIFICATION definition.

There are two parts of the logic you can influence by changing the syntax schema:

Encoding for Syntax

If you set a default encoding for a syntax (attribute default), you give HippoEDIT (HE) a hint: if none of the detection algorithms succeeds, HE should use this encoding instead of the default encoding defined in Options→Document→Defaults (or in the project, if a project is active and redefines the default encoding). In the example, the default encoding for XML documents (and all inherited syntaxes) is set to “utf-8”. If you want to change the default encoding for some language to 1251, for example, just add this line:

<Encoding default="1251"/> 

inside the SPECIFICATION part of the syntax schema, unless it is already defined there. Take into account that encoding settings are inherited, so you can also place the line in a base schema to make it available in all inherited child schemas, as is done for xml_spec.xml.

Force default encoding

If automatic detection often makes mistakes in your case and determines the wrong encoding, you can either disable encoding detection globally or instruct HE not to use automatic detection for a specific syntax. To disable HE encoding auto-detection for a syntax, extend the previously described default encoding definition with the force flag:

<Encoding default="1251" force="true"/>

If such a setting exists in the schema, HE will never auto-detect the encoding and will always use the specified encoding for documents with this syntax schema, independently of any default settings for the document or project, but still respecting an encoding explicitly selected through the menu or the File Open dialog.

When making changes to syntax schema files, please keep in mind that default syntax schemas can be overwritten on update (although a modified syntax schema will be copied to a *.old name). The safest way is to create your own syntax schema that inherits from the default one and overrides the settings you do not like.