Here is an ``obvious'' approach that you might think of, as we did. Although the discussion in Section 4.2 already warns us that it might not work, it is still worth trying. Form the encryption key as follows:
At first, it looks like we are partly successful if we look at only the sorted unigram frequencies for Announcements and Enciphered Announcements. The unigram and bigram distances are 0, which is as close as you can get. Furthermore, you can read off the encryption key (top line "etosian...", bottom line "hwrvld...") from the labels! That is, we have discovered the following important potential fact:
Observe that for all other pairs of plaintexts, letters are not properly matched up, e.g. 't' in Romeo and Juliet is matched in Genesis to 'a' instead of 't'. Also observe that although the distance between unigram frequencies (percentages) went down as a result of sorting --indeed, sorting makes unigram tables as close as possible-- we got a bogus encryption/decryption key. That is, since we ``decrypted'' plaintext, the key should have identical top and bottom lines, because no letters are renamed. However, the key we get by reading off the labels has top line etoai... and bottom line eatho...: attempting to ``decrypt'' Genesis with this key would yield gibberish. These results pretty conclusively show that sorting unigram frequencies alone will not work.7
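The sorting approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to produce the results above: the function names, the alphabetical tie-breaking rule, and the toy texts are all hypothetical, and the shift-by-three key is chosen purely for the demonstration.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase


def unigram_percentages(text):
    """Return {letter: percentage} for the letters a-z that occur in text."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    return {ch: 100.0 * n / total for ch, n in counts.items()}


def labels_by_frequency(freqs):
    """Letters from most to least frequent (assumption: ties broken alphabetically)."""
    return "".join(sorted(freqs, key=lambda ch: (-freqs[ch], ch)))


def guess_key_by_sorting(training_text, ciphertext):
    """Pair the k-th most frequent training letter with the k-th most frequent cipher letter."""
    top = labels_by_frequency(unigram_percentages(training_text))
    bottom = labels_by_frequency(unigram_percentages(ciphertext))
    return top, bottom


def sorted_l1_distance(text_a, text_b):
    """L1 distance between the two sorted percentage lists, ignoring labels."""
    a = sorted(unigram_percentages(text_a).values(), reverse=True)
    b = sorted(unigram_percentages(text_b).values(), reverse=True)
    n = max(len(a), len(b))
    a += [0.0] * (n - len(a))  # pad the shorter list with zeros
    b += [0.0] * (n - len(b))
    return sum(abs(x - y) for x, y in zip(a, b))


# Toy plaintext with distinct letter counts, so sorting has no ties.
plain = "e" * 6 + "t" * 5 + "o" * 4 + "s" * 3 + "i" * 2 + "a"
# Hypothetical key for the demo: shift every letter by three positions.
shift3 = str.maketrans(ALPHABET, ALPHABET[3:] + ALPHABET[:3])
cipher = plain.translate(shift3)

print(guess_key_by_sorting(plain, cipher))  # key recovered when the texts match
print(sorted_l1_distance(plain, cipher))

# A different plaintext with the same sorted counts but another letter order.
other = "e" * 6 + "a" * 5 + "t" * 4 + "h" * 3 + "o" * 2 + "i"
print(guess_key_by_sorting(plain, other))   # top and bottom lines disagree
print(sorted_l1_distance(plain, other))     # yet the sorted distance is still 0
```

The last two lines reproduce the failure in miniature: the sorted unigram distance is as small as it can be, but the ``key'' read off the labels renames letters that should have stayed put.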
However, observe that it ``almost'' works. Although the two lines of labels are not identical as they should be, i.e. no letters should be renamed, the two lines are not ``too different''. That is, not every letter is matched with itself, but letters are not ``too far'' from where they should be, e.g. 'a' in Romeo and Juliet is ``two positions'' away from the 'a' in Genesis. An experienced human could probably take the ``almost-solution'' and play around with it to get to the correct solution by thinking of higher-order patterns. For example, vowels and consonants tend to alternate, and we're pretty sure that 'u' will follow 'q'. But higher-order frequencies reflect higher-order patterns, and bigrams are the next higher-order pattern after unigrams! And indeed, if you look at the L1 distance between bigram frequencies (percentages), you'll see that our attempted approach made these distances go up, indicating that we are moving farther away from a solution. Thus, it makes sense to base our next attempted approach on bigram frequencies.
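To make the bigram check concrete, here is a minimal Python sketch of an L1 distance between bigram percentage tables. It rests on several assumptions that are not taken from the text: the helper names are hypothetical, bigrams are counted over the letter stream with spaces and punctuation removed (so they may span word boundaries), and a candidate key is applied by renaming letters before the tables are compared label by label.

```python
from collections import Counter


def bigram_percentages(text):
    """Return {bigram: percentage} over adjacent letters a-z (non-letters dropped)."""
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    counts = Counter(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    if total == 0:  # fewer than two letters: no bigrams at all
        return {}
    return {bg: 100.0 * n / total for bg, n in counts.items()}


def l1_distance(freqs_a, freqs_b):
    """Sum of absolute differences; a bigram missing from one table counts as 0%."""
    keys = set(freqs_a) | set(freqs_b)
    return sum(abs(freqs_a.get(k, 0.0) - freqs_b.get(k, 0.0)) for k in keys)


def apply_key(text, top, bottom):
    """Rename each letter in `bottom` to the letter at the same position in `top`."""
    return text.translate(str.maketrans(bottom, top))


# Toy reference text, purely for illustration.
reference = "the cat sat on the mat"

# The correct "key" for plaintext renames nothing, so the distance is 0.
unchanged = apply_key(reference, "at", "at")
print(l1_distance(bigram_percentages(reference), bigram_percentages(unchanged)))

# A bogus key that swaps 'a' and 't' destroys bigrams such as "th".
swapped = apply_key(reference, "at", "ta")
print(l1_distance(bigram_percentages(reference), bigram_percentages(swapped)))
```

A swap of two letters leaves the sorted unigram table untouched, which is why the unigram distance cannot detect it, while the bigram distance moves away from zero.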
Roadmap | |
Section 3.7 | Encipher plaintext |
Section 3.7 | Decipher ciphertext |
Section 4.2 | (Hope) Unscramble frequencies |
Section 4.3 | Unscramble = Bring ``close'' to intrinsic frequencies |
Approximate intrinsic frequencies with training text | |
Assume ciphertext is medium to large so that unscrambled frequencies resemble intrinsic frequencies | |
Section 4.4 | Use the L1 distance to measure ``closeness''; ignore labels. |
Section 5 | Q: What are legal and effective ways to rearrange frequencies? |
Section 5.1 | Sorting unigram frequencies does not work, but ``almost'' does |
(Hope) When the frequency table for ciphertext matches the table for training text, read the encryption key off of the labels | |
Q: Does sorting bigram frequencies work? |