next up previous
Next: 4.3.1 Assumptions and Conventions Up: 4. Using Frequencies for Previous: 4.2 Decryption and Frequencies:

   
4.3 Recognizing Unscrambled Frequencies

To use the idea ``unscramble frequencies to crack a cryptosystem'', we need to study unscrambled frequencies (frequencies of plaintext) to look for recognizable structures/patterns that identify frequencies as being unscrambled.

So, let us look at plaintexts from a wide range of areas: If we look at a narrow range, then we might draw conclusions that apply to only a narrow range. So, let us consider the following three plaintexts:

Each of these documents contains a corpus or large plaintext. To compute frequency tables, we must deal with both upper- and lower-case letters, punctuation, and possibly accents. We defer the details of what we do to Section 4.3.1, and move immediately to Figure 20 for the unigram frequencies of these corpuses.
  
Figure 20: Unigram Frequencies for Three English Corpuses
\begin{figure}
\begin{center}
\texttt{\begin{tabular}{l@{ }r\vert*{27}{r@{ }}\ve...
...5& 6& 7& 3& 1& 2& 0& 1& 0\\ \cline{3-29}
\end{tabular}}
\end{center}\end{figure}

Amazing! These wildly different plaintexts all have very similar frequencies! (Although not shown for space reasons, the bigram frequencies are also very similar.) This might lead us to postulate that English has frequencies that are intrinsic or inherent to its very nature, and indeed this is the case: English has intrinsic frequencies that large plaintext is very likely to ``closely'' approximate. Therefore, a good approximation to intrinsic frequencies can be obtained from just about any corpus.

If you try other natural languages, you will discover that they, too, have intrinsic frequencies. This means that as long as we continue to use nothing specific to English we will develop an algorithm that works for natural languages besides English.

Recall that we studied plaintext frequencies to find a way to recognize ``unscrambled frequencies''. Since we have seen that frequencies of large plaintext are ``close'' to intrinsic frequencies, we can use that as our criterion: If frequencies are ``close'' to intrinsic frequencies, then we will consider/assume them to be ``unscrambled''. Thus, unscramble frequencies means bring ``close'' to intrinsic frequencies. This has three immediate consequences:

Roadmap
Section 3.7 Encipher plaintext $\Rightarrow$ scramble frequencies.
Section 3.7 Decipher ciphertext $\Rightarrow$ unscramble frequencies.
Section 4.2 (Hope) Unscramble frequencies $\Rightarrow$ decipher ciphertext
Section 4.3 Unscramble = Bring ``close'' to intrinsic frequencies
  Approximate intrinsic frequencies with training text
  Assume ciphertext is medium to large so that unscrambled frequencies resemble intrinsic frequencies
$(\rightarrow)$     Section 4.3.1 Q: How do we compute frequencies for real texts, e.g. deal with punctuation?
Section 4.4 Q: How do we measure ``closeness'' or distance?
Section 5 Q: What are legal and effective ways to rearrange frequencies?



 
next up previous
Next: 4.3.1 Assumptions and Conventions Up: 4. Using Frequencies for Previous: 4.2 Decryption and Frequencies:
Thomas Yan
2000-05-01