CS100 M Spring 2002
Project 3
Due Thursday 3/7 at beginning of lecture

1. Goals

You will design and write a program to solve problems in the field of computational genomics. Completing this project will help you learn about the following:

Most of us are not experts in computational genomics, so do not despair if you do not understand everything about the genome and DNA. You need only to read carefully and write a program to solve the specific tasks identified in sections 4, 6, and 7. Follow the instructions on CS100M-->Homework-->Projects in assembling your project for submission.

2. Motivation

Today's advanced technology can perform DNA sequencing on a truly industrial scale. Research generates new data at an explosive rate. The field of computational genomics endeavors to provide biological "understanding" from this data-flood through mathematical models and computer algorithms. Since DNA is so fundamental to biology, advances in this field will have impact on medicine and agriculture and will improve insights into many questions we have in biology.

3. Background

3.1 Genome and DNA

A genome is all the DNA in an organism, including the organism's genes. A gene carries information for making all the proteins required by all organisms. These proteins determine, among other things, how the organism looks, behaves, and fights infections.

DNA is short for deoxyribonucleic acid. A group of similar chemicals called nucleotides make up DNA. The four types of nucleotides, also called bases, are Adenine, Cytosine, Guanine, and Thymine. We will refer to the four bases by their first letters, A, C, G, and T. These bases repeat millions or billions of times throughout a genome. The human genome, for example, has three billion pairs of bases. DNA forms a double helix. A double helix has two strands of DNA that bind together and twist into a helical shape. One strand in the helix "complements" the other. Knowing one strand allows you to determine uniquely the complementary strand. Therefore, considering only one strand of DNA does not discard any essential data in some applications. The particular order of As, Cs, Gs, and Ts underlies all of life's diversity. Different orders generate humans, flies, rice, and so on!

3.2 Computer scientist's view of DNA sequences

Computer scientists think of a DNA sequence as a string from an alphabet set of four characters: A, C, G, and T. Like the English alphabet set of 26 characters, the order is extremely important. For example, the English strings mate, meat, tame, and team certainly mean different things!

3.3 DNA Sequence Analysis

DNA sequence analysis requires tools that take large amounts of DNA sequence data and derive information, such as statistics, for organizing and understanding the "data flood." Some of this derived information includes:

4. DNA Sequence Analysis Program

Develop a DNA Sequence Analysis Program to perform the following tasks:
4.1 Determine the reverse complement
For every DNA sequence in the input, determine the complementary bases in reverse order. Recall that in the DNA double helix, two strands twist together and "face" each other. A base from one strand forms a complementary pair with a base from the other strand at the same position. A and T are complementary, and C and G are complementary. For example, given the DNA sequence
                                  AGTAGCAT
the complementary sequence must be
                                  TCATCGTA
The reverse complement is the complementary sequence in reverse order:
                                  ATGCTACT
Write a function rComplement to perform this task. Function rComplement takes one string as the input argument and returns a string containing the reverse complement.
4.2 Calculate the frequencies of the bases
For each DNA sequence in the input, calculate
4.3 Determine the Hamming distance
The number of bases two DNA sequences have in common provides a simple measure of similarity. A converse criterion called Hamming distance measures the dissimilarity by counting the number of positions with different bases. The lines in the diagram below identify the locations at which the bases are different. (Note: these are two different sequences, not the two complementary strands in a sequence.)
     AAGTACATTGC
      |     |
     AGGTACACTGC
The Hamming distance between the two sequences above is 2.
Write a function hammingDist to calculate the Hamming distance between two sequences of equal length. Function hammingDist takes two strings as input arguments and returns the Hamming distance value.
4.4 Determine the weighted distance
Another criterion for measuring the dissimilarity between two sequences is the weighted distance. The idea is that some bases replace others more easily. E.g., a G replaces an A more easily than does a C. Therefore, a weight is assigned to each difference between two DNA sequences. These weights are shown below in a substitution matrix:
       A  C  G  T
     |------------
    A| 0  3  1  3
    C| 3  0  3  1
    G| 1  3  0  3
    T| 3  1  3  0
Each row corresponds to the base in one sequence; each column corresponds to the base in the other sequence. The number at location (row,col)=(A,T) represents the weight for changing A to T. The diagonal values are zero because there is no difference between A and A. E.g., the weighted distance between the two sequences AAG and CAA is 3+0+1=4 using the above substitution matrix.
(Note: The Hamming distance can be considered a weighted distance where the substitution matrix has zeros in the diagonal and the value one everywhere else.)
Write a function weightedDist to calculate the weighted distance between two sequences of equal length. Function weightedDist takes two strings as input arguments and returns the weighted distance value.
4.5 Program input and components

The DNA Sequence Analysis Program will work with two types of input:

The input strings can contain a mix of upper and lower case letters. Given the two input DNA sequences, the program will:

5. Specifications and example

Do not use global variables in your program. You may write additional functions not specified above. The program should start execution from a script M-file that "drives" the entire program. Upon execution, the program should display the menu shown below:

  Welcome to DNA Sequence Analysis Program.

  Select input type:
  (I) User enters 2 sequences
  (F) File DNAdata.mat will be read as input

  Enter I or F:

After reading the input (from interactive entries or file), the program checks the input strings and then start computation if the input strings are "legal." Suppose the input strings are AgTAGcAT and cAgtcagt, then the program output should be:

  Sequence 1:  AGTAGCAT
  Reverse complement:  ATGCTACT

  Frequencies:
      A      C      G      T
    0.375  0.125  0.250  0.250

  Pairwise frequencies:
      0      0      2      1
      1      0      0      0
      0      1      0      1
      1      0      0      0

  Sequence 2:  CAGTCAGT
  Reverse complement:  ACTGACTG

  Frequencies:
      A      C      G      T
    0.250  0.250  0.250  0.250

  Pairwise frequencies:
      0      0      2      0
      2      0      0      0
      0      0      0      2
      0      1      0      0

  Number of bases in each sequence:  8
  Hamming distance between sequences 1 and 2:  7
  Weighted distance between sequences 1 and 2:  17

Your program output does not need to have the exact same format but should be similar to that shown above.

6. Testing

Test your program using the following five input cases:

7. Submission

In less than one page, describe the organization of your program. This description should include a list of all the scripts and functions that you write. Each script and function should be described concisely. Any functions not specified in the above project statement must be clearly marked and carefully described. You are the software developer who needs to tell users (graders) how your program is organized and how it should be used. Your project will not be graded unless the organization of your DNA Sequence Analysis Program is described clearly. This description should be the first page after the table of contents. Submit all script and function files and all output from the test cases listed above.