Problem Set 4: AVL Trees and RSA

Due Thursday, October 25, 2007, 11:00 pm.

Oct 23: Part4: The height and size values for Empty should be swapped. Changes are highlighted in red.
Oct 15: Part1: there was no way to check equality of two AVL tree leaves. This can now be done by the = operator in the updated ps4.zip.
Oct 12: Part1: the upper bound k for balance factors was missing from ord.sig. This file has been updated in ps4.zip.

This problem set has 3 parts. In parts 1 you'll implement a map structure using balanced binary trees. In Part 2 you'll learn about the RSA cryptosystem and implement algorithms in this system. Part 3 is a written problem. As before, you will write your solutions in the appropriate files and submit the files separately on CMS. To get you started, we are providing a zip file containing several files. You can edit the necessary files, and submit your solution using CMS when you have finished. Non-compiles and warnings will receive an automatic zero.

It is recommended (but not required) that you work with one partner for this assignment. Please use CMS to pair up with your partner.

For this assignment you are required to use CVS (or any other versioning system) for the concurrent code development within your team. A quick CVS tutorial can be found on the main page of the course web site.

[Download the zip file ps4.zip]

Part 1: AVL Trees

Recall that an AVL tree is a binary search tree where all nodes have the property that the height difference between their children (of balance factor b) is -1, 0, or 1. Operations that insert and remove elements from such trees must perform re-balancing operations to ensure that each node is properly balanced. It is possible to generalize the notion of balancing by requiring that balance factors (height difference between children) be bounded by a constant k: -k <= balance factor <= k. The parameter k controls how balanced such trees are. We say that a tree is k-AVL if it is a binary search and each node in the tree has a balance factor between -k and k.

For k = 1 we obtain the AVL trees discussed in class. For larger values of k, trees become taller and less balanced; insertions and deletions take longer (still O(log n) if k is a constant with respect to n), but will require fewer rebalancing operations. If k is arbitrarily large, rotations will never be necessary, and the trees will behave as standard binary search trees.

In this part of the assignment you will implement a map abstraction using k-AVL, i.e., AVL trees with k-bounded balance factors. The AVL tree set is parameterized on a signature ORD_BIND containing the types of keys and values, a function that compares two keys, and the constant k representing the upper bound for balance factors. Each node of the tree contains an element and the height of the current node. The datatype definition of tree sets is as follows:

datatype bind = key * value 
datatype avl = Nil | Node of bind * bal * avl * avl
type map = avl

Your task is the following:

repOk. Implement a repOk function that checks whether a tree is a valid k-AVL tree map. For full score, your implementation must be O(n) in the number of nodes of the tree.
Balancing. Implement the rebalancing function bal for k-AVL trees. The argument of the function is a tree whose balance factor is -k-1 or k+1, and all subtrees are k-AVL. It returns a k-AVL tree that contains the same bindings as the given tree.
Put and remove. Implement the put and remove operations using the balancing function bal.
Equals. Write a function compare that returns true if and only if the two maps contain the same bindings.
Other functions. Implement the other functions that complete the map interface map.sig, that is functions get, fold, and size. The fold function will traverse the nodes of the tree using in-order traversal, and function size returns the number of keys in the map.

To do: turn in the solution in the file avlmap.sml.

Part 2: The RSA Cryptosystem

In a world that relies increasingly upon digital information, public key encryption and digital signatures play an important part in achieving private communication. RSA is a popular public-key encryption system. It is based on the fact that there are fast algorithms for exponentiation and for testing prime number—but no known fast algorithms for factoring large numbers. In this problem set you will implement a version of the RSA system. The algorithms will you implement have both a deep mathematical foundation and practical importance.

Cryptographic systems typically use keys for encryption and decryption. An encryption key is used to convert the original message (the plaintext) to coded form (the ciphertext). A corresponding decryption key is used to convert the ciphertext back to the original plaintext. In traditional cryptographic systems, the same key is used for both encryption and decryption. Two parties can exchange coded messages only if they share a secret key. Since anyone who learned that key would be able to decode the messages, keys must be carefully guarded and transmitted only under tight security: for example, couriers handcuffed to locked, tamper-resistant briefcases!

Diffie and Hellman (1976) discovered a new approach to encryption and decryption: public key cryptography. In this approach, the encryption and decryption keys are different. Knowing the encryption key cannot help you find the decryption key. Thus you can publish your encryption key on the web, and anyone who wants to send you a secret message can use it to encode a message to send to you. You do not have to worry about key security at all, for even if everyone in the world knew your encryption key, no one could decrypt messages sent to you without knowing your decryption key, which you keep private to yourself. The RSA system, due to Rivest, Shamir, and Adelman, is the best-known public-key cryptosystem; the security of your web browsing probably depends on it.

How RSA Works

RSA does not work with characters, but with integers. That is, it encrypts numbers. To encrypt a piece of text, just combine the ASCII codes of several characters into a big number, and then encrypt the resulting number. A big number is an integer of arbitrary magnitude (i.e., is not restricted to 32 bits or 64 bits).

Suppose you want to obtain public and private keys in the RSA system. You select two very large prime numbers p and q. (Recall that a prime number is a positive integer greater than 1 with no divisors other than itself and 1). You then compute numbers n and m:

n = pq
m = (p−1)(q−1)

It turns out that no one knows how to compute m, p, or q efficiently, knowing n.

(Notation: [a=b] mod n means that (a mod n) = (b mod n), where a mod n is the remainder obtained when dividing a by n using ordinary integer division. Equivalently, [a=b] mod n if a−b is divisible by n. For example, [17 = 32] mod 5.)

Now you pick a number e < m relatively prime to m; that is, such that e and m have no factors in common except 1. The significance of relative primality is that e is relatively prime to m iff e is invertible mod m; that is, iff there exists a d such that [de = 1] mod m. Moreover, it is possible to compute d from e and m using Euclid's algorithm, described below. Then:

The number n is the modulus.
The public key, which you can advertise to the world, is the pair (n,e), containing the modulus and the public exponent.
The private key is (n,d), containing the modulus and the private exponent.

Anyone who wants to send you a secret message s (represented by an integer) encrypts it by computing a number E(s). The cyphertext E(s) is obtained by raising s to the power e, then taking the remainder modulo n :

    E(s) = s^e mod n

The decryption process is exactly the same, except that d is used instead of e:

    D(s) = s^d mod n

The operations E and D are inverses:

    D(E(s))    = (s^e)^dmod n
                    = s^de mod n
                    = s^1+km mod n
                    = s(s^m)^k mod n
                    = s(1)^k mod n
                    = s mod n
                    = s

The integer s representing the plaintext must be less than n. That's why we break the message up into chunks. In the above reasoning, the equality [s^m=1] mod n holds only if s is relatively prime to n (using a theorem due to Euler and Fermat). Otherwise, if s divides n, that is, s is either p or q, then D(E(s)) = s^de mod n = s, trivially. Therefore, it is always the case that D(E(s)) = s.

In summary, to use the RSA system:

    1. pick large primes p and q;
    2. compute n = pq and m = (p−1)(q−1);
    3. choose e relatively prime to m and use this to compute d such that [de = 1] mod m;
    4. publish the pair (n,e) as your public key, but keep d, p, and q secret.

How secure is RSA? At present, the only known way to obtain d from e and n is to factor n into its prime factors p and q, then compute m and proceed as above. But despite centuries of effort by number theorists, factoring large integers efficiently is still an open problem. Until someone comes up with an efficient way to factor numbers, or discovers some other way to compute d from e and n, the cryptosystem appears to be secure for large numbers n. Although several "better" factorization algorithms have been proposed, the problem remains computationally intractable for numbers that are large enough. For example, it is infeasible with today's best supercomputers and best algorithms known to date to factor an arbitrary 1024-bit number. In fact, the RSA Factoring Challenge offers money prizes for factoring large numbers. To date, the largest number from this challenge that has been successfully factored had 640 bits (or 193 decimal digits). It took about 5 months for several tens of machines working in parallel to factor this number.

Implement RSA

In the file rsa.sml you have been given supporting code for implementing the RSA algorithm. The code relies on the implementation of big integers and random numbers from the SML/NJ library. Note that this library is not loaded by default! Using the SML Compilation Manager, the library is loaded by the line smlnj-lib.cm in the sources.cm file (check the comments in that file if you have trouble loading the library).

The given code provides a variety of functions, including functions to generate prime numbers, and to test numbers for primality; code that generates key pairs; to read and write keys to and from files; or to convert between lists of bytes and big numbers. We briefly discuss a few of these functions.

Function expmod(b, e, m) computes b^e mod m. Function euclid(m,n) yields numbers (s,t,g) such that sm + tn = g, where g is the greatest common divisor of m and n.
Function isPrime(n) tests is number is prime. This is a probabilistic test: it yields true if there is a high probability that n is prime; and yields false if n is definitely not prime. This function is based on Fermat's theorem, which states that if n is prime, then all numbers a in the range 0 < a < n satisfy [a^n-1 = 1] mod n; and if n is not prime, then at most half do. Therefore, successfully checking if [a^n-1 = 1] mod n for k numbers a indicates that n is prime with a probability of error of 1 / 2^k.
The function generateKeyPair(r) generates an RSA key pair. It picks random primes p and q that are in the range from 2^r to 2*2^r so that n = pq will be in the range 2^2r to 2^2r+2. To find the keys e and d given m = (p-1)(q-1), function selectKeys picks a random e, and computes d using Euclid's algorithm.

You must add the encryption and decryption functions specified below. Your functions encode an input list of bytes into an output list of bytes. Therefore, these functions operate on blocks of bytes. Bytes are represented as values of type Word8.word (check the Word8 structure for operations on values of this type). Each block consists of M bytes or (M-1) bytes, where M is the number of bytes required to store the modulus n of the key (n,e) or (n,d).

Encryption then works as follows: take each (M-1)-byte block at a time (i.e. a list of M-1 bytes), treat it as a large integer, encode it with RSA, and then convert the resulting number to an M-byte list. The resulting block must have exactly M bytes, so you need to pad the resulting number with leading zeros if necessary (the zero bytes must be added at the beginning of the last block).
You must implement the function that encrypts an input message represented as a list of bytes:
```
  (* encrypt k l encrypts byte list l with key k.
   * If M is the number of bytes in k's modulus, then break l into
   * (M-1)-byte blocks, and encrypt each block into an M-byte output
   * block. Pad last input block to form an (M-1)-byte block.
   * Returns: a list n'::l', where n' is the amount of padding in the
   * last block, and l' is the concatenation of all encrypted blocks. *)
  val encrypt : key -> word list -> word list
```

Decryption performs the opposite operation: it decodes each M-byte block into an M-1 byte block. Remember that each block in the cyphertext is the RSA-transform of the current block and the encrypted previous block. Decrypting a block must yield exactly M-1 bytes.

  (* decrypt k l decrypts byte list l with key k.
   * Requires that l = n'::l', where n' is the amount of padding that
   * must be removed from the last block, and l' consists of encrypted
   * M-byte blocks. As above, M is the number of bytes of k's modulus.
   * The function decrypts each M-byte block into an (M-1)-byte block.
   * Returns: the decrypted list. *)
  val decrypt : key -> word list -> word list

Note: You may find it useful to use a single function that performs the RSA-transform, and use this function for both encryption and decryption:

(* transform k n l encodes a list of bytes l into an
 * n-byte output list l', using key k. 
 * Pads the result with leading zeroes if necessary. *)
  val transform : key -> int -> word list -> word list

To do: Complete encrypt and decrypt in rsa.sml.

Digital Signatures

Public key encryption can be used to solve another problem of secure communication. Suppose you wish to send a message by electronic mail and sign it so that the recipient can be sure that it really came from you. What is required is some scheme for signing a message in a way that cannot be forged. This is called a digital signature.

Suppose you want to sign a digital message M in a way that convinces anyone else that the message came from you. This can be achieved by providing a signature that could only have been sent by someone with your private key. This signature is simply D(M). Anyone who receives M and D(M) can verify that the signature corresponding to the message by applying E to it, obtaining E(D(M)) = M. Only someone in possession of the private key d could have constructed a signature with this property. If someone wanted to forge a signature to a bogus message M, they would have to compute D(M), but assuming that RSA is secure, this cannot be done without the private key d.

We can combine the two techniques as follows. Suppose Amy wants to send Bob a message that only Bob can read, and sign the message so that Bob knows it could only be from her. She encrypts the message using Bob's public key. Then she signs the encrypted result using her own private key using the signature method just described. When Bob receives the message, he first uses Amy's public key to authenticate the signature, then decrypts the message using his own private key. Bob can be sure that only someone with Amy's private key could have sent the message. Amy can be sure that only someone with Bob's private key can read the message. This is accomplished without exchanging any secret information between Bob and Amy. It's this capacity for achieving secure communication without having to worry about exchanging secret keys that makes public key cryptography such an appealing technique.

Note: For efficiency and other reasons, real implementations of digital signatures do not sign the message itself, but just sign a small hashcode of the message. That is, for a message M, the digital signature is not D(M), but D(H(M)), where H is a secure one-way hash function (an example hash function is MD5). The pair (M, D(H(M))) is the signed document.

To Do: Implement the functions encrypt_and_sign and unsign_and_decrypt in rsa.sml.

Part 3: Written Problem

Consider two functions nodes() and height() defined for binary trees as follows:

  datatype BST = Empty | Node of BST * int * BST
  
  fun height(Empty) = -1
    | height(Node(l,_,r) = 1 + Int.max(height l, height r)

  fun size(Empty) = 0
    | size(Node(l,_,r) = 1 + (size l) + (size r)

Show that size(t) <= 2 ^{height(t) + 1} - 1

Consider two functions f(n) and g(n). For each of the statements below, indicate whether the claim is true or false. Justify each of your answers.
1. if f is O(n²) and g is O(n³) then f is O(g)
2. if f is O(n²) and and f is O(g) then g is O(n³)
3. if f is Θ(n²) and g is Θ(n³) then f is O(g)
4. if f is Θ(n²) and g is O(n³) then f is O(g)
Analyze the asymptotic complexity of your implementation of function equals from Part 1.
Give an asymptotic upper bound for T(n) defined by the recurrence relations below. Prove your answer.
1. T(0) = T(1) = 1
2. T(n) = 4T(n/2) + n²

To do: turn in the solution in the file written.txt.

Other Requirements

CVS: We ask you to use CVS while developing code with your partner. CVS is a popular, convenient versioning system that allow you and your partner to share code in a common repository. It allows you to share common code, commit your code into the repository, or update your version with the changes of your partner.

To do: submit a file log file cvslog showing your cvs activity during the assignment. For this, look into the "cvs log" command. This will count for 5% of the grade for this assignment.
After submission, you will meet with the consultants to explain your work. This count for another 5% of the grade.