4.3 Recognizing Unscrambled Frequencies

Next: 4.3.1 Assumptions and Conventions Up: 4. Using Frequencies for Previous: 4.2 Decryption and Frequencies:

4.3 Recognizing Unscrambled Frequencies

To use the idea ``unscramble frequencies to crack a cryptosystem'', we need to study unscrambled frequencies (frequencies of plaintext) to look for recognizable structures/patterns that identify frequencies as being unscrambled.

So, let us look at plaintexts from a wide range of areas: If we look at a narrow range, then we might draw conclusions that apply to only a narrow range. So, let us consider the following three plaintexts:

Genesis from the 1970 edition of The New English Bible, available at
http://etext.library.cornell.edu/cgi-bin/bie-idx?type=HTML&byte=116542360&rgn=book
An edition from 1597 of William Shakespeare's Romeo and Juliet, available at
http://etext.library.cornell.edu/cgi-bin/shakeEd-idx?type=HTML&byte=186196093&rgn=play
A slightly old version of the Announcements page for CS100, available at
http://courses.cs.cornell.edu/cs100/2000sp/oldannounce.txt

Each of these documents contains a corpus or large plaintext. To compute frequency tables, we must deal with both upper- and lower-case letters, punctuation, and possibly accents. We defer the details of what we do to Section 4.3.1, and move immediately to Figure 20 for the unigram frequencies of these corpuses.

**Figure 20:** Unigram Frequencies for Three English Corpuses
$\begin{figure} \begin{center} \texttt{\begin{tabular}{l@{ }r\vert*{27}{r@{ }}\ve... ...5& 6& 7& 3& 1& 2& 0& 1& 0\\ \cline{3-29} \end{tabular}} \end{center}\end{figure}$

Amazing! These wildly different plaintexts all have very similar frequencies! (Although not shown for space reasons, the bigram frequencies are also very similar.) This might lead us to postulate that English has frequencies that are intrinsic or inherent to its very nature, and indeed this is the case: English has intrinsic frequencies that large plaintext is very likely to ``closely'' approximate. Therefore, a good approximation to intrinsic frequencies can be obtained from just about any corpus.

If you try other natural languages, you will discover that they, too, have intrinsic frequencies. This means that as long as we continue to use nothing specific to English we will develop an algorithm that works for natural languages besides English.

Recall that we studied plaintext frequencies to find a way to recognize ``unscrambled frequencies''. Since we have seen that frequencies of large plaintext are ``close'' to intrinsic frequencies, we can use that as our criterion: If frequencies are ``close'' to intrinsic frequencies, then we will consider/assume them to be ``unscrambled''. Thus, unscramble frequencies means bring ``close'' to intrinsic frequencies. This has three immediate consequences:

We need to figure out how to define what ``close'' means precisely enough for use in a computer program.
We need a way to compute intrinsic frequencies. But as we've already observed, we can simply use the frequencies from just about any large plaintext, since it is very likely to have frequencies that closely approximate intrinsic frequencies. A large plaintext used in this fashion is called training text because it ``teaches'' our program what the right answer is.
We need medium to large size ciphertext so that its unscrambled frequencies are close to intrinsic frequencies: Small plaintext, like words or brief sentences, tends to not contain enough letter combinations to be representative of intrinsic frequencies.⁴

Roadmap
Section 3.7	Encipher plaintext $\Rightarrow$ *scramble* frequencies.
Section 3.7	Decipher ciphertext $\Rightarrow$ *unscramble* frequencies.
Section 4.2	(Hope) Unscramble frequencies $\Rightarrow$ decipher ciphertext
Section 4.3	Unscramble = Bring ``close'' to *intrinsic* frequencies
	Approximate *intrinsic* frequencies with *training text*
	Assume ciphertext is medium to large so that unscrambled frequencies resemble intrinsic frequencies
$(\rightarrow)$ Section 4.3.1	Q: How do we compute frequencies for real texts, e.g. deal with punctuation?
Section 4.4	Q: How do we measure ``closeness'' or *distance*?
Section 5	Q: What are legal and effective ways to rearrange frequencies?

4.3.1 Assumptions and Conventions

Next: 4.3.1 Assumptions and Conventions Up: 4. Using Frequencies for Previous: 4.2 Decryption and Frequencies:

Thomas Yan
2000-05-01