ENCODINGS

The traditional behavior was that John would assume ISO-8859-1 when converting
plaintexts or salts to UTF-16 (this happens to be very fast), and assume ASCII
in most other cases. The rules engine would accept 8-bit candidates as-is, but
it would not upper/lower-case them or recognize them as letters. Some truncation
or insert operations could also split a multi-byte UTF-8 character in the
middle, resulting in meaningless garbage. Nearly all other password crackers
share these limitations.
To temporarily run "the old way", just use --enc=raw. Screen and log output
will still be written as UTF-8, possibly incorrectly encoded, but the .pot file
entries will not be converted in any way (except for Unicode formats, which
always end up correctly encoded in UTF-8).
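For instance, a minimal "raw" invocation might look like this (the wordlist and
hash file names are just placeholders):

    ./john --enc=raw --wordlist=words.lst hashes.txt
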
For proper operation, it is imperative that you let John know which encodings
are involved whenever they differ from the defaults. For example, if your
wordlist is encoded in ISO-8859-1, you need to use the "--encoding=iso-8859-1"
option (unless you have set that as the default in john.conf). But you also need
to know what encoding the hashes were made from - for example, LM hashes are
always made from a legacy MS-DOS codepage such as CP437 or CP850. This can be
specified with an option like "--target-encoding=CP437". John will convert
to/from Unicode as needed.
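For example, a wordlist attack against LM hashes with an ISO-8859-1 wordlist
might be run like this (file names are placeholders):

    ./john --format=LM --encoding=iso-8859-1 --target-encoding=CP437 \
        --wordlist=words.lst hashes.txt
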
Finally, there is the special case where both input (wordlist) and output (e.g.
hashes from a website) are UTF-8 but you want to use rules that e.g. upper- or
lower-case non-ASCII characters. In this case you can use an intermediate
encoding (the --internal-codepage option) and pick a codepage that covers as
much of the needed input as possible. For US/Western Europe, any "Latin-1"
codepage will do fine, e.g. CP850, ISO-8859-1 or CP1252 (if you do this with a
Unicode format like NT, it will silently be handled differently internally for
performance reasons, but the outcome will be the same).
Example: A rule that replaces all instances of "c" within a word with "ç" could
be written as "-U scç", where the "-U" prefix means the rule is rejected unless
we are running with a legacy codepage that can handle it.
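Such a run could look like this (the rules section name "MyRules", the wordlist
and the hash file are placeholders; a matching john.conf snippet is shown
further down):

    ./john --encoding=UTF-8 --internal-codepage=CP1252 --rules=MyRules \
        --wordlist=words.lst hashes.txt
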
Some new reject rules and character classes are implemented; see doc/RULES.
If you use rules without --internal-codepage, some wordlist rules may cut UTF-8
multibyte sequences in the middle, resulting in garbage. You can reject such
rules with -U so that they are only used when an internal codepage is in effect.
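For example (a hypothetical rule line), a rule that truncates words at length 6
could be guarded like this so that it never splits a UTF-8 sequence:

    -U '6
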
Please note that for convenience, any rule, mask or placeholder given in UTF-8
on the command line or in a config file can be silently converted to whatever
internal codepage is in use at the time. For this to happen with rules, you must
prepend them with "-U". If such conversion fails for a rule, that rule will be
rejected silently (but logged). Thus you can have rules for various codepages
such as "-U sRЯ" and "-U sc©" in place, and only one of them, both, or neither
will be used, depending on what internal codepage is active.
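A hypothetical john.conf section holding such rules could look like this (the
section name "MyRules" is made up; the substitutions are the ones used as
examples above):

    [List.Rules:MyRules]
    -U scç
    -U sRЯ
    -U sc©

It could then be selected with "--rules=MyRules", as in the earlier command-line
example.
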
Caveats:
Beware of UTF-8 BOMs (byte order marks). They will cripple the first word in
your wordlist or the first entry in your hash file. Try to avoid Windows tools
that add them; this version does try to discard them, though.
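If you need to strip a leading BOM from a file yourself, something like this
should work (assuming GNU sed; the file name is a placeholder):

    sed -i '1s/^\xEF\xBB\xBF//' words.lst
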
Unicode beyond U+FFFF (4-byte UTF-8) is not supported by default for the NT
formats, because it hurts performance and because the chance of it being used
in the wild is pretty slim. Pass --enable-nt-full-unicode to configure when
building if you need that support.
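The configure step of such a build might look like this (the usual jumbo build
commands, with the extra flag added):

    ./configure --enable-nt-full-unicode
    make -s clean && make -sj4
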
New encodings can be added with ease using automated tools that rely on the
Unicode Database (see the Openwall wiki, or just post a request on the
john-users mailing list).