
APPROACH ONE

1. If the ciphertext is long enough, you can use letter frequencies to help
guess parts of the key; for example, we know that “e” is the most common
letter in English text. To do that, brute-force the possible key lengths (e.g.
if the key is LEMON, you would look at every 5th letter of the ciphertext,
since those letters were all encrypted with the same key letter). For each
candidate length (say 1–20), run the frequency analysis on each group of
letters, find the most common ciphertext letter, assume it stands for “e”, and
subtract “e” from it to get the corresponding letter of the key.
2. Write a brute-force algorithm that tries different keys from a dictionary
(because the key is likely to be an English word) and simply checks which
output resembles English the most, using a statistical model of letter
sequences (a Markov model).
To know how “English-like” a string is, you would take a lot of English text and
build a simple statistical model (a second-order Markov model over letters) that
stores the probability of seeing a letter given the two letters before it. For
example, since “THE” is much more common than “THJ” in English, we can say that
the probability of seeing E given that the last two letters were TH is, say, 0.9:
P(Xn = E | Xn-2 = T, Xn-1 = H) = 0.9.

To build this model, take all triplets of consecutive letters (with spaces removed,
because that’s how the text is encrypted) and do a frequency count over the English
text. Then, for each potentially decrypted text, look up the probability of each of
its triplets and average them, and choose the candidate with the highest average as
the most English-like one.
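
The dictionary attack and the triplet model described above can be sketched in Python as follows. This is only an illustration under stated assumptions: all function names are hypothetical, the text is assumed to be reduced to the letters A–Z, and the add-one smoothing and the averaging of log-probabilities (rather than raw probabilities) are choices made here so that unseen triplets do not produce a probability of zero. They are not prescribed by the notes.

```python
import math
from collections import Counter
from itertools import cycle


def letters_only(text):
    """Keep only A-Z (uppercased), dropping spaces and punctuation."""
    return "".join(ch for ch in text.upper() if "A" <= ch <= "Z")


def train_trigram_model(corpus):
    """Count letter triplets and the two-letter contexts that precede a letter."""
    text = letters_only(corpus)
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    contexts = Counter(text[i:i + 2] for i in range(len(text) - 2))
    return trigrams, contexts


def english_score(candidate, model):
    """Average log P(letter | previous two letters); higher is more English-like."""
    trigrams, contexts = model
    text = letters_only(candidate)
    if len(text) < 3:
        return float("-inf")
    total = 0.0
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        # Assumption: add-one smoothing over the 26 possible next letters.
        p = (trigrams[tri] + 1) / (contexts[tri[:2]] + 26)
        total += math.log(p)
    return total / (len(text) - 2)


def vigenere_decrypt(ciphertext, key):
    """Subtract the repeating key from uppercase A-Z ciphertext (mod 26)."""
    return "".join(
        chr((ord(c) - ord(k)) % 26 + ord("A"))
        for c, k in zip(ciphertext, cycle(key))
    )


def dictionary_attack(ciphertext, candidate_keys, model):
    """Return the candidate key whose decryption scores as most English-like."""
    return max(
        candidate_keys,
        key=lambda k: english_score(vigenere_decrypt(ciphertext, k), model),
    )
```

In practice the model would be trained on a large English corpus and `candidate_keys` would be a full word list; with a tiny corpus the scores are only meaningful for text that closely resembles it.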

APPROACH TWO
Without the keyword, the primary method of breaking the Vigenère cipher is known as
the Kasiski test, after the Prussian major who first published it. The first stage is
determining the length of the keyword.

Determining the key length


Given an enciphered message such as:

Plaintext:  TOBEORNOTTOBE
Keyword:    KEYKEYKEYKEYK
Ciphertext: DSZOSPXSRDSZO
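
The example above can be reproduced with a short Python sketch. The function names are hypothetical, and the code assumes the message and keyword consist only of uppercase letters A–Z, with A counted as a shift of 0.

```python
from itertools import cycle


def vigenere_encrypt(plaintext, keyword):
    """Add the repeating keyword to the plaintext, letter by letter, mod 26."""
    out = []
    for p, k in zip(plaintext, cycle(keyword)):
        shift = ord(k) - ord("A")  # A -> 0, B -> 1, ..., Z -> 25
        out.append(chr((ord(p) - ord("A") + shift) % 26 + ord("A")))
    return "".join(out)


def vigenere_decrypt(ciphertext, keyword):
    """Subtract the repeating keyword from the ciphertext, letter by letter, mod 26."""
    out = []
    for c, k in zip(ciphertext, cycle(keyword)):
        shift = ord(k) - ord("A")
        out.append(chr((ord(c) - ord("A") - shift) % 26 + ord("A")))
    return "".join(out)


print(vigenere_encrypt("TOBEORNOTTOBE", "KEY"))  # DSZOSPXSRDSZO
```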

Upon inspection of the cipher text, we see that there are a few digraphs repeated,
namely DS, SZ, and ZO. It is statistically unlikely that all of these would arise by
random chance; the odds are that repeated digraphs in the cipher text correspond to
repetitions in the plaintext. If that is the case, the same section of the key must encode
the digraphs both times. Therefore, the length of the key is a factor of the distance in
the text between the repetitions.

Digraph   First Position   Second Position   Distance   Factors

DS        1                10                9          3, 9
SZ        2                11                9          3, 9
ZO        3                12                9          3, 9
The common factors (indeed, the only factors in this simple example) are 3 and 9.
This narrows down the possibilities significantly, and the effect is even more
pronounced with longer texts and keys.
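
The search for repeated digraphs and the factoring of their distances can be sketched in Python as below. The function names are hypothetical. Counting how often each factor occurs across all distances (instead of, say, taking a single greatest common divisor) is a choice made here, because in longer texts a few repetitions arise by chance and would otherwise spoil the result.

```python
from collections import Counter, defaultdict


def repeated_ngram_distances(ciphertext, n=2):
    """Map each repeated n-gram to the distances between its consecutive occurrences."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - n + 1):
        positions[ciphertext[i:i + n]].append(i + 1)  # 1-based positions
    return {
        gram: [later - earlier for earlier, later in zip(pos, pos[1:])]
        for gram, pos in positions.items()
        if len(pos) > 1
    }


def factor_counts(distances):
    """Count how many distances each factor (2 or larger) divides."""
    counts = Counter()
    for gaps in distances.values():
        for gap in gaps:
            for k in range(2, gap + 1):
                if gap % k == 0:
                    counts[k] += 1
    return counts


distances = repeated_ngram_distances("DSZOSPXSRDSZO")
print(distances)                 # {'DS': [9], 'SZ': [9], 'ZO': [9]}
print(factor_counts(distances))  # 3 and 9 each divide all three distances
```

The factors that divide the most distances are the most plausible key lengths; here both 3 and 9 remain candidates, as the text notes.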

Frequency analysis
Once the length of the key is known, a slightly modified frequency analysis technique
can be applied. Suppose the length of the key is known to be three. Then every third
letter will be encrypted with the same letter of the key. The cipher text can be split
into three segments, one for each key letter, and the procedure described for the
Caesar cipher can be applied to each segment separately.
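
A minimal Python sketch of this splitting step follows. The function names are hypothetical, and because the Caesar procedure itself is not reproduced in these notes, the sketch assumes the simplest version of it: the most common letter in each segment is taken to stand for “E”. It also assumes uppercase A–Z ciphertext that is at least as long as the key.

```python
from collections import Counter


def split_columns(ciphertext, key_length):
    """Group the letters that were encrypted with the same key letter."""
    return [ciphertext[i::key_length] for i in range(key_length)]


def guess_key_letter(column):
    """Assume the most common ciphertext letter in the column stands for 'E'."""
    most_common = Counter(column).most_common(1)[0][0]
    shift = (ord(most_common) - ord("E")) % 26
    return chr(shift + ord("A"))


def recover_key(ciphertext, key_length):
    """Guess one key letter per column and join them into a candidate key."""
    return "".join(
        guess_key_letter(col) for col in split_columns(ciphertext, key_length)
    )


print(split_columns("DSZOSPXSRDSZO", 3))  # ['DOXDO', 'SSSS', 'ZPRZ']
```

On a message as short as the 13-letter example, the most common letter in a segment is often not “E”, so the guess is unreliable; the heuristic only becomes useful when each segment contains enough letters for English frequencies to show through.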
