5.1 Sort Unigram Frequencies

Here is an ``obvious'' approach that you might think of, as we did. Although the discussion in Section 4.2 already warns us that it might not work, it is still worth trying. Form the encryption key as follows:

1. Sort unigrams on the top line by their frequency in training text.
2. Sort unigrams on the bottom line by their frequency in the ciphertext.
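The two steps above can be sketched in Python as follows. This is only a minimal sketch: the function names are ours, only the letters a-z are counted, and ties (including letters that never occur) are broken alphabetically, which is an assumption the method itself does not specify.

```python
from collections import Counter
import string

def letters_by_frequency(text):
    """Return the 26 lowercase letters, most frequent in `text` first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    return "".join(sorted(string.ascii_lowercase, key=lambda c: (-counts[c], c)))

def key_by_sorting(training_text, ciphertext):
    """Step 1 gives the top line, step 2 the bottom line of the key."""
    return letters_by_frequency(training_text), letters_by_frequency(ciphertext)

def decipher(ciphertext, key):
    """Rename each bottom-line letter to the top-line letter above it."""
    top, bottom = key
    return ciphertext.lower().translate(str.maketrans(bottom, top))
```

For instance, `decipher(ciphertext, key_by_sorting(training_text, ciphertext))` applies the candidate key to the ciphertext; whether the result is readable is exactly what the rest of this section examines.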

Let's try this approach on Enciphered Announcements and also apply the sanity check mentioned earlier of trying to ``decrypt'' plaintext. Figure 24 shows the alphabet for each corpus sorted in decreasing unigram frequency. Pairing the two lines of letters from any two corpuses yields a key.

**Figure 24:** Sorting Unigram Frequencies
[Table: for each corpus, the letters of the alphabet in decreasing order of unigram frequency.]

Figure 25 shows the resulting distances between pairs of corpuses after their unigram frequencies have been sorted.

**Figure 25:** Distances after Sorting Unigram Frequencies
[Diagram: unigram and bigram distances between each pair of corpuses after sorting; for Announcements and Enciphered Announcements, unigram = 0 and bigram = 0.]

Although it might look like we are partly successful, in fact these figures show us a disaster!

At first, it looks like we are partly successful if we look only at the sorted unigram frequencies for Announcements and Enciphered Announcements. The unigram and bigram distances are 0, which is as close as you can get. Furthermore, you can read off the encryption key (top line "etosian...", bottom line "hwrvld...") from the labels! That is, we have discovered the following important potential fact:

If the frequency tables for plaintext and ciphertext match, then the labels from the frequency tables form the top and bottom lines of the encryption key.
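To see why this holds, here is a small sketch on a toy example. The four-letter key and the plaintext are hypothetical, chosen by us so that no two letters have the same count; enciphering renames the labels but leaves the counts alone, so the sorted tables line up.

```python
from collections import Counter

def frequency_table(text):
    """List of (letter, count) pairs, most frequent first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(c for c in text if c.isalpha())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical toy key: plaintext letters on top, ciphertext letters below.
top, bottom = "abcd", "hwrv"
plaintext = "aaaabbbccd"  # counts 4, 3, 2, 1, so there are no ties
ciphertext = plaintext.translate(str.maketrans(top, bottom))

plain_table = frequency_table(plaintext)
cipher_table = frequency_table(ciphertext)

# The counts agree position by position, so the labels pair up into the key.
counts_match = [n for _, n in plain_table] == [n for _, n in cipher_table]
recovered = {p: c for (p, _), (c, _) in zip(plain_table, cipher_table)}
```

Here `recovered` maps each plaintext letter to its ciphertext letter. Note that if two letters had equal counts, sorting alone could not tell which label goes with which, so the pairing would only be reliable for letters with distinct frequencies.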

However, the fact that pairing ciphertext with its own plaintext works tells us very little: in general, we do not have the plaintext for the ciphertext. Therefore, to evaluate our attempted approach, we must look at other pairings of our texts. These other pairings constitute the sanity check from Section 4.5 of trying to ``decrypt'' plaintext.

Observe that for all other pairs of plaintexts, letters are not properly matched up, e.g. 't' in Romeo and Juliet is matched in Genesis to 'a' instead of 't'. Also observe that although the distance between unigram frequencies (percentages) went down as a result of sorting (indeed, sorting makes unigram tables as close as possible), we got a bogus encryption/decryption key. That is, since we ``decrypted'' plaintext, the key should have identical top and bottom lines, since no letters are renamed. However, the key we get by reading off the labels has top line etoai... and bottom line eatho...: attempting to ``decrypt'' Genesis with this key would yield gibberish. These results pretty conclusively show that sorting unigram frequencies will not work.⁷
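The sanity check can be sketched as follows. The two strings are contrived toy ``plaintexts'' of ours, not the corpora from the figures, and the function names and alphabetical tie-breaking are assumptions of this sketch. The point is only to show how the unigram distance can drop to its minimum while the key still fails to be the identity.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase

def unigram_percentages(text):
    """Map each letter to its share (in percent) of the letters in `text`."""
    counts = Counter(c for c in text.lower() if c in ALPHABET)
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty input
    return {c: 100.0 * counts[c] / total for c in ALPHABET}

def l1_distance(xs, ys):
    """Sum of absolute differences between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(xs, ys))

def sanity_check(text_a, text_b):
    """Sort both plaintexts' unigrams and report what the pairing does.

    Returns (distance with labels aligned, distance after sorting,
    list of (top, bottom) pairs where a letter is not matched with itself).
    """
    pa, pb = unigram_percentages(text_a), unigram_percentages(text_b)
    top = sorted(ALPHABET, key=lambda c: (-pa[c], c))
    bottom = sorted(ALPHABET, key=lambda c: (-pb[c], c))
    by_label = l1_distance([pa[c] for c in ALPHABET], [pb[c] for c in ALPHABET])
    after_sort = l1_distance([pa[c] for c in top], [pb[c] for c in bottom])
    mismatched = [(t, b) for t, b in zip(top, bottom) if t != b]
    return by_label, after_sort, mismatched

by_label, after_sort, mismatched = sanity_check("eeeettta", "eeeeaaat")
```

On this toy pair the distance falls from 50.0 to 0.0, yet `mismatched` is `[('t', 'a'), ('a', 't')]`: a correct key for two plaintexts would leave that list empty.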

However, observe that it ``almost'' works. Although the two lines of labels are not identical as they should be (no letters should be renamed), the two lines are not ``too different''. That is, not every letter is matched with itself, but letters are not ``too far'' from where they should be, e.g. 'a' in Romeo and Juliet is ``two positions'' away from the 'a' in Genesis. An experienced human could probably take the ``almost-solution'' and play around with it to get to the correct solution by thinking of higher-order patterns. For example, vowels and consonants tend to alternate, and we're pretty sure that 'u' will follow 'q'. But higher-order frequencies reflect higher-order patterns, and bigrams are the next higher-order pattern after unigrams! And indeed, if you look at the L¹ distance between bigram frequencies (percentages), you'll see that our attempted approach made these distances go up, indicating that we are moving farther away from a solution. Thus, it makes sense to base our next attempted approach on bigram frequencies.
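One way to check the bigram claim on your own texts is sketched below. Everything here is an assumption of the sketch rather than the method used for the figures: the function names, counting bigrams over letters only (so pairs may span word boundaries), alphabetical tie-breaking, and the two contrived toy strings, which we picked so that the effect is visible in a few characters.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase

def letters_only(text):
    """Lowercase letters of `text`, everything else dropped."""
    return [c for c in text.lower() if c in ALPHABET]

def sorted_alphabet(text):
    """All 26 letters, most frequent first; ties alphabetical (an assumption)."""
    counts = Counter(letters_only(text))
    return "".join(sorted(ALPHABET, key=lambda c: (-counts[c], c)))

def bigram_percentages(text):
    """Map each adjacent letter pair to its share (in percent) of all pairs."""
    letters = letters_only(text)
    counts = Counter(zip(letters, letters[1:]))
    total = sum(counts.values()) or 1  # avoid dividing by zero on short input
    return {pair: 100.0 * n / total for pair, n in counts.items()}

def l1_distance(p, q):
    """L1 distance between two sparse tables; missing entries count as 0."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def decrypt_with_sorted_key(training_text, other_text):
    """Apply the key obtained by sorting unigram frequencies of both texts."""
    top = sorted_alphabet(training_text)
    bottom = sorted_alphabet(other_text)
    return other_text.lower().translate(str.maketrans(bottom, top))

a, b = "thethat", "hehehat"  # toy "plaintexts" with different letter rankings
before = l1_distance(bigram_percentages(a), bigram_percentages(b))
renamed = decrypt_with_sorted_key(a, b)
after = l1_distance(bigram_percentages(a), bigram_percentages(renamed))
```

For this toy pair, sorting renames `b` to `ththtae`, and the bigram distance rises from about 100 to about 133 even though the sorted unigram counts of the two strings agree exactly. A single contrived pair proves nothing about real corpora, but it shows the kind of comparison to run.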

Roadmap
Section 3.7	Encipher plaintext $\Rightarrow$ *scramble* frequencies.
Section 3.7	Decipher ciphertext $\Rightarrow$ *unscramble* frequencies.
Section 4.2	(Hope) Unscramble frequencies $\Rightarrow$ decipher ciphertext
Section 4.3	Unscramble = Bring ``close'' to *intrinsic* frequencies
	Approximate *intrinsic* frequencies with *training text*
	Assume ciphertext is medium to large so that unscrambled frequencies resemble intrinsic frequencies
Section 4.4	Use the L¹ *distance* to measure ``closeness''; ignore labels.
Section 5	Q: What are legal and effective ways to rearrange frequencies?
Section 5.1	Sorting unigram frequencies does not work, but ``almost'' does
	(Hope) When the frequency table for ciphertext matches the table for training text, read the encryption key off of the labels
$(\rightarrow)$ Section 5.2	Q: Does sorting bigram frequencies work?

5.1.1 Sorting implies Swapping

Thomas Yan
2000-05-01