
10/22/2014

CS161 Labs - Lab 3: Breaking Ciphers


Lab 3: Breaking Ciphers


Posted on October 20, 2014

Introduction
In lecture you have written a couple of basic programs for encrypting text. In this lab we will
break those ciphers and crack the codes.
The message before encoding will be called the plaintext and the encrypted message
will be called the ciphertext. Encryption is simply a function from plaintext messages
to ciphertext messages. In this lab you will invert an unknown encryption function to recover the
plaintext given only the ciphertext.

Dictionary-Based Attacks
The first class of attacks we will consider will be dictionary-based.

A Key Assumption
The basic hypothesis underlying such an attack is that the plaintext consists mostly of
words from a known dictionary (in our case the file /usr/share/dict/words
on the Linux machines). Given that assumption, your task is to construct attacks that
successfully decrypt the messages provided in this lab.
We exclude special characters other than spaces, and spaces are not encrypted. In some
cases special characters will remain in the ciphertext, but the encryption algorithm
ignores them. Not all of the words will be in the dictionary, but most will.
For each ciphertext we give you to decrypt we will let you know the set of ciphers that were
potentially used to encrypt it.

Breaking ROT-n
In class the rot-n cipher was introduced. In this part of the lab your goal is to break it and
decode the file c1.txt: given a ciphertext, infer what n is.
A function that will be useful for cracking the code is:
hamming :: [String] -> [String] -> Int
which is a function to measure the distance between two strings based on how many
characters they share. You will also want to make use of the word list mentioned earlier.
Implement this section as a program
http://people.cs.uchicago.edu/~stoehr/cs161/posts/2014-10-20-lab-3.html


./derot [file containing ciphertext]


where derot prints to standard output
n=[rotation factor]
[plaintext]
I will provide a ciphertext to decode, p1.txt, among the lab files. It is highly recommended
that you create your own plaintexts, encode them using the rot program from class, and then
check that you can decode them.
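
One possible shape for the search is sketched below. It is only a sketch under stated assumptions: the names rotChar, score and crackRot are hypothetical, the dictionary is assumed to be already loaded into a Data.Set of lower-case words, and candidates are scored by counting dictionary hits rather than by the hamming function mentioned above.

```haskell
import Data.Char (chr, isAsciiLower, isAsciiUpper, ord, toLower)
import Data.List (maximumBy)
import Data.Ord (comparing)
import qualified Data.Set as Set

-- Rotate one letter by n places; spaces and other characters are left alone.
rotChar :: Int -> Char -> Char
rotChar n c
  | isAsciiUpper c = shift 'A'
  | isAsciiLower c = shift 'a'
  | otherwise      = c
  where
    shift base = chr (ord base + (ord c - ord base + n) `mod` 26)

-- Number of words in a candidate decoding that appear in the dictionary.
-- Assumption: the dictionary set holds lower-case words.
score :: Set.Set String -> String -> Int
score dict = length . filter (`Set.member` dict) . words . map toLower

-- Try all 26 rotations and keep the candidate with the most dictionary hits.
-- The result is (n, plaintext), where n is the rotation used to encrypt.
crackRot :: Set.Set String -> String -> (Int, String)
crackRot dict cipher =
  maximumBy (comparing (score dict . snd))
    [ (n, map (rotChar (negate n)) cipher) | n <- [0 .. 25] ]
```

The dictionary set could be built with something like Set.fromList . map (map toLower) . lines applied to the contents of /usr/share/dict/words. Words with punctuation attached will not match in this sketch, so a real solution may want to strip it before scoring.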

Breaking a Vigenère Cipher of known password length


This is essentially a generalization of the Caesar rotation cipher. The ciphertext is contained in
c2.txt and the password used is of length 16. This means that you have 16 rotation ciphers
running simultaneously, so if s is your ciphertext then s !! 0, s !! 16, s !! 32, etc.
all share the same rotation. The same is true for s !! 1, s !! 17, etc.
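
The grouping described above can be sketched as follows. The names chunksOf and columns are hypothetical, and the sketch assumes that spaces are removed before indexing, since they are not part of the encryption.

```haskell
import Data.List (transpose)

-- Split a list into consecutive chunks of length k (the last may be shorter).
-- k must be positive, otherwise this does not terminate.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf k xs = let (h, t) = splitAt k xs in h : chunksOf k t

-- Column i holds s !! i, s !! (i + k), s !! (i + 2 * k), ...
-- Assumption: spaces are dropped first because they are not encrypted.
columns :: Int -> String -> [String]
columns k = transpose . chunksOf k . filter (/= ' ')
```

Each column can then be attacked as an ordinary rotation cipher, and concat . transpose puts the decoded columns back into text order.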
Write a command-line tool
./vindecoden [ciphertext file] [N]
which takes a ciphertext file and a password length, then outputs the password and the plaintext:
password=[estimated password]
[plaintext]
For this problem all of the letters are upper case, the spaces are not encoded and are ignored by
the encryption, and there are no special characters: just [A-Z]. Again, write your own
Vigenère encryption and produce a test ciphertext to make sure that your code works before
embarking on the more challenging lab assignment of p2.txt.
You may also assume that the password comes from words in the dictionary (recalling that
everything is upper case). Use that fact in your code.

Using Trees for N-gram counts


In order to crack this code it is very likely you will make use of n-gram counts. A letter n-gram
count is the number of times a particular sequence of characters occurs in a text.
Consider the letter trigram VVV, which does not occur in any English word, whereas ORM does.
One may wish to check how many of the letter n-grams in a candidate decoding of the
ciphertext actually occur in the dictionary: i.e. do the candidate's letter n-gram statistics match
the dictionary letter n-gram statistics? This is generally a very hard question to work on, but
it is made easier by the fact that the letter n-grams which occur in English are sparse and
highly concentrated. That is, most letter n-grams never occur in English, so an unseen n-gram can be a big
tip-off that a particular decoding is not possible.
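
To make the idea concrete before building the tree, here is a minimal sketch that uses Data.Map from the containers package as a stand-in for the tree structure the lab asks for. The names ngrams, ngramCounts and unseen are hypothetical.

```haskell
import Data.List (tails)
import qualified Data.Map.Strict as Map

-- All contiguous letter sequences of length n inside a single word.
ngrams :: Int -> String -> [String]
ngrams n = takeWhile ((== n) . length) . map (take n) . tails

-- Count every n-gram across a word list (for example the dictionary).
ngramCounts :: Int -> [String] -> Map.Map String Int
ngramCounts n ws = Map.fromListWith (+) [ (g, 1) | w <- ws, g <- ngrams n w ]

-- Number of n-grams in a candidate decoding that never occur in the counts.
unseen :: Map.Map String Int -> Int -> String -> Int
unseen counts n candidate =
  length [ g | w <- words candidate, g <- ngrams n w, not (Map.member g counts) ]
```

A candidate decoding with many unseen n-grams is unlikely to be English, which is the tip-off described above.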
To use this idea for decoding we need to build a data structure to efficiently hold and retrieve
letter n-gram counts. A list could work for this purpose but it would be very slow to use.
Since we are always counting letters A through Z we can encode them (using ord and
some simple arithmetic) as the numbers 0 through 25, which means that they can be

expressed with a 5-bit binary number. Our efficient data structure will map 5-bit binary
numbers to counts.
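
The letter encoding can be sketched as below. The names letterCode, codeLetter and bits5 are hypothetical, and upper-case ASCII input is assumed. Note that 26 letters give the codes 0 through 25, all of which fit in five bits.

```haskell
import Data.Char (chr, ord)

-- Map 'A'..'Z' to 0..25. Assumes upper-case ASCII input.
letterCode :: Char -> Int
letterCode c = ord c - ord 'A'

-- Inverse of letterCode for codes in 0..25.
codeLetter :: Int -> Char
codeLetter k = chr (k + ord 'A')

-- The five bits of a code, least significant bit first.
bits5 :: Int -> [Int]
bits5 k = take 5 (map (`mod` 2) (iterate (`div` 2) k))
```

For example, 'Z' has code 25, which is 11001 in binary, so bits5 lists its bits from the low end as [1,0,0,1,1].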
We create a tree data-structure that is constrained to have a particular height and with leaf
nodes that record counts.
data Tree = Leaf Int | Node Int Tree Tree deriving (Show, Eq)
Some example trees are here:
l0 = Leaf 10
l3 = Leaf 3
l1 = Leaf 4
l2 = Leaf 0
t0 = Node 1 l2 l0
t1 = Node 1 l3 l1
t2 = Node 2 t1 t0

These have been named suggestively to indicate how the tree works. We need to have a
constructor
initTree n
which outputs a tree for a binary representation with n bits. Include a type signature and a
definition for that function. We will also want a getter function
treeCount t w
which, given a tree t and a number w, returns the tree's count for w. We also want a setter function
treeInsert t w
which updates the count that tree t has for number w. For both of the functions above
write a type signature and a recursive definition. A hint for writing them is that the base case
(where the tree is just a Leaf Int) is obvious: for treeCount you should just return the count. Next try
to handle the case for a tree formed with one root node and two leaves and think about how
the algorithm should recurse. Then handle the case where you have two levels and hence four
leaves, etc. The definition for each function shouldn't be longer than five lines. mod and div
are your friends here: review them if you don't know what they do.
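
One way the three functions might look is sketched here. It is not the only valid answer and rests on two assumptions read off the example trees: the Int in a Node is its height, and the lowest bit of w selects the first subtree when it is 1 and the second when it is 0. It also assumes that treeInsert increases the count by one.

```haskell
data Tree = Leaf Int | Node Int Tree Tree deriving (Show, Eq)

-- A tree of height n with every count set to zero.
initTree :: Int -> Tree
initTree n
  | n <= 0    = Leaf 0
  | otherwise = Node n sub sub
  where
    sub = initTree (n - 1)

-- Look up the count for w by consuming its bits, lowest bit first.
treeCount :: Tree -> Int -> Int
treeCount (Leaf c) _ = c
treeCount (Node _ one zero) w
  | odd w     = treeCount one  (w `div` 2)
  | otherwise = treeCount zero (w `div` 2)

-- Return a copy of the tree with the count for w increased by one.
treeInsert :: Tree -> Int -> Tree
treeInsert (Leaf c) _ = Leaf (c + 1)
treeInsert (Node h one zero) w
  | odd w     = Node h (treeInsert one (w `div` 2)) zero
  | otherwise = Node h one (treeInsert zero (w `div` 2))
```

With the example trees, treeCount t2 3 follows the odd branch twice and lands on l3, giving 3. Bits of w beyond the height of the tree are ignored in this sketch, so callers should keep w within range.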
You will want to write an interface for this tree that handles the abstraction from character to
integer. It is up to you to decide how to handle that abstraction. asciiTreeInsert, for
instance, is one way to go. You will also want to generalize to the case where you have
multiple characters, since we care about n-gram statistics. It's difficult to make the tree
structure handle any particular length of n-gram, so just define a tree for the shorter n-grams
and use those to generate counts. You will want to think about the underlying binary
representation when doing this.
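
One way to think about the binary representation of an n-gram is sketched below: each letter contributes five bits, so a k-letter n-gram becomes a single number of 5 * k bits. The names letterCode and ngramCode are hypothetical, and putting the first letter in the lowest bits is an arbitrary choice made for this sketch.

```haskell
import Data.Char (ord)

-- Assumed encoding: 'A'..'Z' become 0..25, so each letter fits in 5 bits.
letterCode :: Char -> Int
letterCode c = ord c - ord 'A'

-- Pack an n-gram into one number, 5 bits per letter, with the first letter
-- in the lowest bits. A k-letter n-gram needs a tree of height 5 * k.
ngramCode :: String -> Int
ngramCode = foldr (\c acc -> letterCode c + 32 * acc) 0
```

The resulting number can be used as the w argument of treeInsert and treeCount on a tree of the matching height, for example height 10 for bigrams.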

Estimating the length of the Vigenère password (Extra Credit)


The most general problem is where you do not know the length of the password. You will

write a third program


./vindecode [ciphertext file]
which outputs the password and the plaintext. To get this program to work you will want to
use some trick to reduce the search space of possible password lengths, or you can just try to
brute-force it. To cut down the size of the search space you may want to think about what is
going to happen if the same word appears in the plaintext multiple times in the same location
modulo the password length (i.e. same place relative to the password). The double hint is that
you may want to look at the gcd of the distances in the text between multiple occurrences. There
is noise in calculating the gcd, so you'll want to focus on the gcd of subsets of the repetition
periods (and how long the repetition is). The file to decode is here: c3.txt.
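
The hint can be explored with a sketch like the following. The names positions, repeatGaps and gcdTally are hypothetical, spaces are assumed to have been removed from the ciphertext first, and tallying pairwise gcds is just one simple way to cope with the noise.

```haskell
import Data.List (sortBy, tails)
import Data.Ord (Down (..), comparing)
import qualified Data.Map.Strict as Map

-- Start positions of every length-m substring, in ascending order.
positions :: Int -> String -> Map.Map String [Int]
positions m s =
  Map.fromListWith (flip (++))
    [ (g, [i])
    | (i, g) <- zip [0 ..] (takeWhile ((== m) . length) (map (take m) (tails s)))
    ]

-- Gaps between consecutive occurrences of the same substring.
repeatGaps :: Int -> String -> [Int]
repeatGaps m s =
  concat [ zipWith (-) (drop 1 ps) ps | ps <- Map.elems (positions m s) ]

-- Pairwise gcds of the gaps with how often each occurs, most frequent first.
-- Frequent values are candidates for the password length or a multiple of it.
gcdTally :: [Int] -> [(Int, Int)]
gcdTally gaps =
  sortBy (comparing (Down . snd)) . Map.toList $
    Map.fromListWith (+) [ (gcd a b, 1) | (a : rest) <- tails gaps, b <- rest ]
```

Longer repeated substrings are less likely to repeat by chance, so trying several values of m and comparing the tallies is a reasonable way to pick the lengths worth testing.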

What to turn in
You should have written two programs, derot.hs and vindecoden.hs, which perform
the first two tasks. For the extra-credit task you will turn in vindecode.hs. Grading will
be based on whether your files can decode the ciphertexts and whether the code clearly
demonstrates how you did it. Save your programs into a folder lab2 within your subversion
repository.
Your submission should also include your n-gram counting functions. Make sure that
you have the functions treeInsert and treeCount implemented with the appropriate
type signatures.
It is a good idea to try to decode the passages first without restricting yourself to automatic
algorithms. The code you hand in may take a while to decode the passages. If your code
is too inefficient to decrypt them in reasonable time, then you should submit an example
demonstrating that your code does work on a simpler input. Make a note in your
README file along with your submission to discuss practicality.
