You are on page 1of 3

This version of John is UTF-8 and codepage aware.

This means that unlike


"core John", this version can recognize national vowels, lower or upper
case characters, etc. in most common encodings.

The traditional behavior was that John would assume ISO-8859-1 when converting
plaintexts or salts to UTF-16 (this happens to be very fast), and assume ASCII
in most other cases. The rules engine would accept 8-bit candidates as-is, but
it would not upper/lower-case them or recognise letters etc. And some truncation
or insert operations could split a multi-byte UTF-8 character in the middle,
resulting in meaningless garbage. Nearly all other password crackers have these
limitations.

The new defaults (which can be changed in john.conf) are:


* Input (eg. wordlists, usernames etc) are assumed to be UTF-8.
* Output to screen, log and .pot file is UTF-8.
* Target encoding for LM is CP850 (and input will be converted accordingly).
* Internal encoding (eg. for rules processing) is disabled. For Western
Europe, ISO-8859-1 is recommended but since this has a slight performance
impact you'll need to use the command-line option or un-comment the line in
john.conf.

For temporarily running "the old way", just give --enc=raw. You will still
get output to screen and log in UTF-8, possibly incorrectly encoded. But the
.pot file entries will not be converted in any way (except for Unicode formats
which will always end up correctly, in UTF-8).

For proper function, it's imperative that you let John know about what
encodings are involved if they differ from defaults. For example, if your
wordlist is encoded in ISO-8859-1, you need to use the "--encoding=iso-8859-1"
option (unless you have set that as default in john.conf). But you also need to
know what encoding the hashes were made from - for example, LM hashes are
always made from a legacy MS-DOS codepage like CP437 or CP850. This can be
specified by using the option e.g. "--target-encoding=CP437". John will convert
to/from Unicode as needed.

Finally, there's the special case where both input (wordlist) and output (e.g.
hashes from a website) are UTF-8 but you want to use rules including e.g. upper
or lower-casing non-ASCII characters. In this case you can use an intermediate
encoding (--internal-codepage option) and pick some codepage that can
shelter as much of the needed input as possible. For US/Western Europe, any
"Latin-1" codepage will do fine, e.g. CP850, ISO-8859-1 or CP1252 (if you do
this with a Unicode format like NT, it will silently be treated in another
way internally for performance reasons but the outcome will be the same).

Example: A rule that replaces all instances of "c" within a word with "ç" could
be written as "-U scç" where "-U" means the rule is rejected unless we're
running with some legacy codepage configured to handle it.

Mask mode also honors --internal-codepage (or plain --encoding). For


example, the mask ?L is a placeholder for all lowercase Greek letters if you
use CP737. If you instead use CP850, it'll be western-european ones.

The limitation is if you use --target-encoding or --internal-codepage,


the input encoding must be UTF-8. The recommended, and easiest, use is to
keep all wordlists encoded as UTF-8. This will work for most cases without
too much impact on cracking speed and you will almost never have to give
any command-line options.

Some new reject rules and character classes are implemented, see doc/RULES.
If you use rules without --internal-codepage, some wordlist rules may cut
UTF-8 multibyte sequences in the middle, resulting in garbage. You can reject
such rules with -U to have them in use only when an internal codepage is used.

Please note that for convenience any rule, mask or placeholder given in UTF-8
on command-line or in a config file can be silently converted to whatever
internal codepage is in use at the time:

For this to happen with RULES, you must prepend them with "-U". If such
conversion fails for a rule, that rule will be rejected (silently, but logged).
Thus you can have rules for various codepages such as "-U sRЯ" and "-U sc©" in
place - and only one of them, both, or none will be used, depending on what
internal codepage is active.

For MASKS, such conversion is required: If no internal codepage was selected,


or if the selected one can't hold all of the characters, you'll get an error
describing the problem.

If you want to create an incremental charset for a certain codepage, such as


for use with the LM format, you need to use the --target-encoding option with
--make-charset. The fact that we default to storing all passwords to the pot
file in UTF-8 means this will work fine regardless of what formats it contains.
So for example, "--make-charset=custom.chr --target-encoding=cp850" will use
every password in the pot file that can be encoded in CP850, and build the
charset from that.

Caveats:
Beware of UTF-8 BOM's. They will cripple the first word in your wordlist or
the first entry in your hashfile. Try to avoid using Windows tools that add
them. This version does try to discard them though.

Unicode beyond U+FFFF (4-byte UTF-8) is not supported by default for the NT
formats because it hits performance and because the chance of it being used
in the wild is pretty slim. Supply --enable-nt-full-unicode to configure when
building if you need that support.

Example using the now default john.conf settings:

$ ../run/john hashfile --format=lm --single


Using default input encoding: UTF-8
Target encoding: CP850
Loaded 3 password hashes with no different salts (LM [DES 128/128 AVX-16])
Press 'q' or Ctrl-C to abort, almost any other key for status
GEN (Kübelwagen:2)
KÜBELWA (Kübelwagen:1)
MÜLLER (Müller)
3g 0:00:00:00 DONE (2014-03-28 20:48) 300.0g/s 12800p/s 12800c/s 38400C/s
Warning: passwords printed above might be partial
Use the "--show" option to display all of the cracked passwords reliably
Session completed

$ ../run/john hashfile --format=nt --loopback --internal-codepage=cp850


Rules engine using CP850 for Unicode
Loaded 2 password hashes with no different salts (NT [MD4 128/128 SSE2-16])
Assembling cracked LM halves for loopback
Loop-back mode: Reading candidates from pot file $JOHN/john.pot
Press 'q' or Ctrl-C to abort, almost any other key for status
Kübelwagen (Kübelwagen)
müller (Müller)
2g 0:00:00:00 DONE (2014-03-28 20:48) 200.0g/s 3200p/s 3200c/s 6400C/s
Use the "--show" option to display all of the cracked passwords reliably
Session completed

Currently supported encodings:


ASCII (or RAW), UTF-8, ISO-8859-1 (or Latin1 or ANSI),
ISO-8859-2, ISO-8859-7, ISO-8859-15, KOI8-R,
CP437, CP720, CP737, CP850, CP852, CP858, CP866, CP868,
CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256

New encodings can be added with ease, using automated tools that rely on the
Unicode Database (see Openwall wiki, or just post a request on john-users
mailing list).

You might also like