Project 7: Cracking a Substititution Cipher Summary: In Project 6, you developed "tools" for analyzing text in terms of unigram and bigram frequencies. Now you will use those tools plus others shown in this write-up for cracking a substititution cipher. What to do: + You are welcome to program in either MATLAB or Java + You must choose only one of those languages. + You may design your program however you wish. + Your program must be clear, concise, and use good style. + You MUST use the algorithm shown below. Bonus: For bonus points, you may use any other approach to improve the decryption. We will assign bonus points based on the clarity of your deciphered text and the elegance of your code. Prize(s): Whoever best deciphers the text with the "best" code wins a prize...free Java textbook(s)! What to hand in: + All code. + All output. + (For bonus code) An explanation of your algorithm. We will supply the ciphertext we wish you to decipher sometime during the week of 4/25. _______________________________________________________________________________ Algorithm for cracking a substititution cipher: (The motivation and justification will be in a separate write-up.) 0. Obtain our solutions to Project 6. Rewrite Project 6 methods and classes using fractions, e.g. percentages, for frequencies -- do not use tallies. 1. Obtain a large "training" plaintext: Pick a large webpage written in English. For example, you can use Romeo & Juliet: http://etext.library.cornell.edu/cgi-bin/shakeEd-idx?type=HTML&byte=186196093&rgn=play or Genesis from the New English Bible: http://etext.library.cornell.edu/cgi-bin/bie-idx?type=HTML&byte=116542360&rgn=book Note 1: enter each URL as a single line. Note 2: Save documents as *text*, not as HTML. Compute the table of bigram frequencies for the large "training" plaintext. Include the column and row labels as part of the "table." How? Use separate arrays or arrays of objects, objects with arrays, or anything else you prefer. 2. Compute the table of bigram frequencies for ciphertext. Include row and column labels for this step, too. You may use the following for ciphertext: http://courses.cs.cornell.edu/cs100/2000sp/cryptannounce.txt 3. To crack the ciphertext, you need to bring the ciphertext bigram frequencies "close" to training frequencies. To measure "closeness", use the notion of array distance from Milestone 4 of Project 6. Here is how to bring the ciphertext bigram table close to the training bigram table: repeat until the ciphertext table stops getting closer: for each pair of letters swap the labels and swap the columns and rows compute the distance between ciphertext and training bigram frequencies choose the table closest to training frequencies Each time you swap a pair of labels, you are really testing a new decryption key. So, this algorithm tries out multiple decryption keys. Notes: + Milestone 5 from Project 6 accomplishes "swap the columns and the rows"; don't forget to also swap the labels. See "Elaboration" below for an example. + You might need to "unswap" the labels and "unswap" the columns and rows. (Think of choosing your next move in a game of chess. For each piece, you try moving it to another place ("swap its location"), and then move it back to the original place ("unswap its location") before trying another move.) 4. Set the decryption key to: top line = the labels of the ciphertext bigram table bottom line = the labels of the training bigram table 6. Transform the ciphertext according to the decryption key. _______________________________________________________________________________ Elaboration of the body of the inner loop in Step 3. Suppose the ciphertext bigram table is currently a b c d a 1 2 3 4 b 5 6 7 8 c 9 0 1 2 d 3 4 5 6 If we choose the pair ('a', 'd'), then swapping the labels and the columns and rows yields: d b c a d 6 4 5 3 b 8 6 7 5 c 2 0 1 9 a 4 2 3 1 Performing the swaps again (swap the labels and the columns and rows) undoes the swaps to yield the original table a b c d a 1 2 3 4 b 5 6 7 8 c 9 0 1 2 d 3 4 5 6