#============================================================================
# Enca v1.13 (2010-02-09) guess and convert encoding of text files
# Copyright (C) 2000-2003 David Necas (Yeti)
# Copyright (C) 2009-2010 Michal Cihar
#============================================================================

Frequently Asked Questions about Enca

* Q: What obscure encoding do the FAQ, THANKS, and other files use?
* A: Run Enca on them and you'll see.

* Q: How do I specify the input encoding?
* A: You can't.  If you know the encoding, you don't need Enca; use some
  fully-fledged converter instead.

* Q: Why can't Enca detect both language and encoding?
* A: Because this is impossible.  Well, it's possible for natural texts that
  are long enough.  But Enca can also detect the encoding of nonsense, and
  people are used to that.  No program can tell you “ěčřýáíéú” is both Czech
  and e.g. ISO-8859-2.  Incidentally, interpreting the same bytes as Win-1251
  one gets “мишэбнйъ”, which is equally good Russian nonsense.

* Q: Why can't Enca recognise the encoding of all-uppercase texts?
* A: Again, mostly because there's a trade-off between nonsense detection and
  the detection of low-probability natural text.  It is possible to detect
  both; I'm working on that.

* Q: Why does Enca need LC_CTYPE to be set?
* A: It doesn't.  But then you need to use “-L language-code”.  You can put
  the option into the ENCAOPT environment variable if you want to make your
  life easier (see the examples at the end of this file).  (Newer versions of
  Enca also try to guess your language from other locale settings.)

* Q: Why doesn't “enca -x ascii” work?
* A: Unfortunately, there are several different things people call
  “conversion to ASCII”.  Consider the following characters: ě (Latin small
  letter e with caron) and ≫ (much greater-than).  By conversion to ASCII you
  may mean:

  1a. Omitting these characters in the output, because they are not
      representable in ASCII.
  1b. Keeping these characters intact in the output, because they are not
      representable in ASCII.
  1c. Failing (possibly damaging the file), because they are not
      representable in ASCII.
  2.  Approximating them with single, close ASCII characters; in this case
      probably plain “e” and “>”.
  3.  Expanding them to sequences of ASCII characters which may be less
      readable than the approximations above, but generally allow
      reconstruction of the original characters.  Such sequences could be
      e.g. the RFC-1345 mnemonics “e<” and “>>”.

  What happens when you run “enca -x ascii” on something depends on the
  converter used.  The usual scenario is: Enca uses librecode or libiconv,
  which do (1), and you get upset.  The only tool I know of that does (2) is
  cstocs, and it does so only for Latin-2 characters (install cstocs and
  specify it as the converter if you want (2)).  AFAIK there's no tool that
  does (3) reasonably, though recode can expand to mnemonics (use rfc1345 as
  the target charset instead of ascii).  See the examples at the end of this
  file.

* Q: Why doesn't “enca -E cstocs” work on my RedHat/Fedora?
* A: This is a cstocs problem.  Cstocs is broken for Perl ≥ 5.8 and UTF-8
  locales.  Perl's Unicode handling changes with every version, so an
  advanced charset converter that works in multiple Perl versions is
  somewhere between impossible and a big mess.  Either set your locales to
  non-UTF-8 (e.g. ISO-8859-2), as shown at the end of this file, or don't use
  cstocs until this is resolved somehow.
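
A minimal shell sketch of the “-L language-code” and ENCAOPT answer above.
The file name document.txt and the language code cs (Czech) are only
placeholders; check the manual page for the language names your Enca
actually accepts.

    # Tell Enca the language explicitly when LC_CTYPE gives no hint
    # ("cs" is assumed to mean Czech here; substitute your own code).
    enca -L cs document.txt

    # Or store the option in ENCAOPT once (Bourne-shell syntax), so that
    # a plain "enca document.txt" picks it up automatically.
    ENCAOPT="-L cs"
    export ENCAOPT
    enca document.txt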
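
The different “conversion to ASCII” behaviours discussed above, sketched as
commands.  Whether each one is available depends on how your Enca was built
and which converters are installed; document.txt is again a placeholder.

    # (1) With the librecode or libiconv converters, non-representable
    #     characters are typically dropped, left alone, or cause an error.
    enca -x ascii document.txt

    # (2) Approximate with close ASCII characters via cstocs
    #     (Latin-2 input only; cstocs must be installed).
    enca -E cstocs -x ascii document.txt

    # (3) Expand to RFC-1345 mnemonics instead of plain ASCII; this works
    #     when the conversion is done by (lib)recode, which knows the
    #     rfc1345 charset.
    enca -x rfc1345 document.txt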
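
One way to apply the “set your locales to non-UTF-8” workaround for the
cstocs problem above.  The locale name cs_CZ.ISO-8859-2 is only an example;
pick one that “locale -a” actually lists on your system.

    # Run the conversion under a non-UTF-8 locale so that cstocs is not
    # hit by the Perl >= 5.8 UTF-8 breakage.
    LC_ALL=cs_CZ.ISO-8859-2 enca -E cstocs -x ascii document.txt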