practicalcryptography.com Open in urlscan Pro
72.14.185.92  Public Scan

URL: http://practicalcryptography.com/cryptanalysis/stochastic-searching/cryptanalysis-simple-substitution-cipher/
Submission: On December 04 via api from CZ — Scanned from DE

Form analysis 0 forms found in the DOM

Text Content

CRYPTO

 * Home
 * Ciphers
 * Cryptanalysis
 * Hashes
 * Miscellaneous
 * Resources

 1. Home
 2. /
 3. Cryptanalysis
 4. /
 5. Specific Examples/
 6. Cryptanalysis of the Simple Substitution Cipher


CRYPTANALYSIS OF THE SIMPLE SUBSTITUTION CIPHER

For a recap of how substitution ciphers work, see here.

The Simple substitution cipher is one of the simplest ciphers, simple enough
that it can usually be broken with pen and paper in a few minutes. On this page
we will focus on automatic cryptanalysis of substitution ciphers, i.e. writing
programs to solve these ciphers for us.

The substitution cipher is more complicated than the Caesar and Affine ciphers.
In those cases, the number of keys were 25 and 311 respectively. This allowed a
brute force solution of trying all possible keys. The number of keys possible
with the substitution cipher is much higher, around 2^88 possible keys. This
means we cannot test them all, we have to 'search' for good keys.

We will be using a 'hill-climbing' algorithm to find the correct key. For this
approach, we need a way of determining how similar a piece of text is to english
text. This is called rating the 'fitness' of the text. A piece of text very
similar to english will get a high score (a high fitness), while a jumble of
random characters will get a low score (a low fitness). For this we will use a
fitness measure based on quadgram statistics. For a guide on how to generate
quadgram statistics, and some python code for rating the fitness of text, see
this tutorial. This method works by first determining the statistics of english
text, then calculating the probability that the ciphertext comes from the same
distribution. An incorrectly deciphered (i.e. using the wrong key) message will
probably contain sequences e.g. 'QKPC' which are very rare in normal english. In
this way we can rank different decryption keys, the decryption key we want is
the one that produces deciphered text with the highest likelyhood.

The hill-climbing algorithm looks like this:

 1. Generate a random key, called the 'parent', decipher the ciphertext using
    this key. Rate the fitness of the deciphered text, store the result.
 2. Change the key slightly (swap two characters in the key at random), measure
    the fitness of the deciphered text using the new key.
 3. If the fitness is higher with the modified key, discard our old parent and
    store the modified key as the new parent.
 4. Go back to 2, unless no improvement in fitness occurred in the last 1000
    iterations.

As this cycle proceeds, the deciphered text gets fitter and fitter, the key
becomes better until either the solution appears, or, the solution is not found.
In this case the run has failed and must be repeated with a different starting
key. This means the hill-climbing algorithm is stuck in a 'local maximum', where
there are no simple changes that can be made to the key to improve fitness, and
yet it is not at the true solution. If this happens you can run the algorithm
again with a different parent in the hope it may reach the true solution this
time. In the implementation below, we may restart the algorithm 100's of times
in the search for the best key.

The algorithm depends on the fitness function correctly distinguishing whether
the plaintext from one key is better than plaintext from another. This is done,
as explained earlier, by comparing quadgram statistics from the plaintext to
quadgram statistics of english text. If the statisitics match closely, we say
that the fitness is high. However, this system fails when the true plaintext
does not have statistics similar to english, this example comes from Simon
Singhs "The Code Book":

From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags

This is full of unusual quadgrams, so we expect it to have a fairly low score.
The hill-climbing algorithm will most likely find a key that gives a piece of
garbled plaintext which scores much higher than the true plaintext. This is a
limitation of any algorithm based on statistical properties of text, including
single letter frequencies, bigrams, trigrams etc. One possible way to overcome
this problem, at the expense of algorithm speed, is to try to find words in the
plaintext and base the fitness on the presence of these.

A final mention should be made about breaking short ciphers. In general, you
will have trouble breaking ciphers less than 100 characters in length. This is
because the statistics of short messages can deviate significantly from the long
term statistics of english. In addition, there is a theoretical limit, given by
the unicity distance which says that 28 characters are required to get a unique
decryption, any cipher shorter is not breakable unless more information is
available such as a crib.


AN EXAMPLE §

We have the following ciphertext:

SOWFBRKAWFCZFSBSCSBQITBKOWLBFXTBKOWLSOXSOXFZWWIBICFWUQLRXINOCIJLWJFQUNWXLFBSZXFBT
XAANTQIFBFSFQUFCZFSBSCSBIMWHWLNKAXBISWGSTOXLXTSWLUQLXJBUUWLWISTBKOWLSWGSTOXLXTSWL
BSJBUUWLFULQRTXWFXLTBKOWLBISOXSSOWTBKOWLXAKOXZWSBFIQSFBRKANSOWXAKOXZWSFOBUSWJBSBF
TQRKAWSWANECRZAWJ

To begin the algorithm, we generate a random key, e.g.

plain alphabet:  ABCDEFGHIJKLMNOPQRSTUVWXYZ
cipher alphabet: YBXONGSWKCPZFMTDHRQUJVELIA

and decipher the ciphertext using this key to get:

GDHMBRIZHMJLMGBGJGBSYOBIDHXBMCOBIDHXGDCGDCMLHHYBYJMHTSXRCYEDJYUXHUMSTEHCXMBGLCMBO
CZZEOSYMBMGMSTMJLMGBGJGBYNHQHXEIZCBYGHFGODCXCOGHXTSXCUBTTHXHYGOBIDHXGHFGODCXCOGHX
BGUBTTHXMTXSROCHMCXOBIDHXBYGDCGGDHOBIDHXCZIDCLHGBMYSGMBRIZEGDHCZIDCLHGMDBTGHUBGBM
OSRIZHGHZEWJRLZHU

The fitness of our first plaintext attempt is -2304.04, something we should be
able to improve on. By swapping 'Y' and 'B' in the key, and deciphering again we
get a fitness of -2200.78, an improvement! But the text is still a long way off
being readable. We must continue with this procedure until there are no two
letters we can swap in the key that will result in an improvement of the
fitness.

After many iterations of this approach, the final key that was found was

XZTJWUMOBEPARIQKDLFSCHYGNV

resulting in the quite readable plaintext:

THESIMPLESUBSTITUTIONCIPHERISACIPHERTHATHASBEENINUSEFORMANYHUNDREDSOFYEARSITBASIC
ALLYCONSISTSOFSUBSTITUTINGEVERYPLAINTEXTCHARACTERFORADIFFERENTCIPHERTEXTCHARACTER
ITDIFFERSFROMCAESARCIPHERINTHATTHECIPHERALPHABETISNOTSIMPLYTHEALPHABETSHIFTEDITIS
COMPLETELYJUMBLED


PYTHON CODE §

Provided here is python code for breaking the Substitution cipher. The code here
uses pycipher for the cipher itself. It implements the steps described above,
using the ngram_score.py file available on the quadgram statistics page.

 * break_simplesub.py



Please enable JavaScript to view the comments powered by Disqus.



CONTENTS

 * An Example
 * Python Code




Y NGP'I ZPGO AVCE GE LGM AVCE VJ OSCC VJ Y JAGMCN CYZS; VPN Y CYZS CSJJ IAVP
AVCE GE LGM AVCE VJ OSCC VJ LGM NSJSUDS - Q.U.U. IGCZYSP. (IAS ESCCGOJAYK GE IAS
UYPH)


COPYRIGHT & USAGE

Copyright James Lyons © 2009-2012
No reproduction without permission.


QUESTIONS/FEEDBACK

Notice a problem? We'd like to fix it!
Leave a comment on the page and we'll take a look.

Sitemap