CS100M Spring 2002

CS100 M Spring 2002
Project 3
Due Thursday 3/7 at beginning of lecture

1. Goals

You will design and write a program to solve problems in the field of computational genomics. Completing this project will help you learn about the following:

program design and organization
user-defined function
characters and strings
two-dimensional array (matrix)
file input and output
program documentation

Most of us are not experts in computational genomics, so do not despair if you do not understand everything about the genome and DNA. You need only to read carefully and write a program to solve the specific tasks identified in sections 4, 6, and 7. Follow the instructions on CS100M-->Homework-->Projects in assembling your project for submission.

2. Motivation

Today's advanced technology can perform DNA sequencing on a truly industrial scale. Research generates new data at an explosive rate. The field of computational genomics endeavors to provide biological "understanding" from this data-flood through mathematical models and computer algorithms. Since DNA is so fundamental to biology, advances in this field will have impact on medicine and agriculture and will improve insights into many questions we have in biology.

3. Background

3.1 Genome and DNA

A genome is all the DNA in an organism, including the organism's genes. A gene carries information for making all the proteins required by all organisms. These proteins determine, among other things, how the organism looks, behaves, and fights infections.

DNA is short for deoxyribonucleic acid. A group of similar chemicals called nucleotides make up DNA. The four types of nucleotides, also called bases, are Adenine, Cytosine, Guanine, and Thymine. We will refer to the four bases by their first letters, A, C, G, and T. These bases repeat millions or billions of times throughout a genome. The human genome, for example, has three billion pairs of bases. DNA forms a double helix. A double helix has two strands of DNA that bind together and twist into a helical shape. One strand in the helix "complements" the other. Knowing one strand allows you to determine uniquely the complementary strand. Therefore, considering only one strand of DNA does not discard any essential data in some applications. The particular order of As, Cs, Gs, and Ts underlies all of life's diversity. Different orders generate humans, flies, rice, and so on!

3.2 Computer scientist's view of DNA sequences

Computer scientists think of a DNA sequence as a string from an alphabet set of four characters: A, C, G, and T. Like the English alphabet set of 26 characters, the order is extremely important. For example, the English strings mate, meat, tame, and team certainly mean different things!

3.3 DNA Sequence Analysis

DNA sequence analysis requires tools that take large amounts of DNA sequence data and derive information, such as statistics, for organizing and understanding the "data flood." Some of this derived information includes:

Difference Compare different DNA sequences from different species or from different places in one species' genome to investigate similarities. Very similar DNA sequences encode similar proteins, which perform similar functions. Biologists compare newly discovered genes with all known DNA sequences and hypothesize that the new gene functions similarly to the genes that it closely resembles.
Statistics Find the frequencies of specific bases or sequences of bases (e.g., the number of As or the number of times the substring ATA appears). The frequencies are important as the bonds between different bases have different strengths.

4. DNA Sequence Analysis Program

Develop a DNA Sequence Analysis Program to perform the following tasks:

4.1 Determine the reverse complement

For every DNA sequence in the input, determine the complementary bases in reverse order. Recall that in the DNA double helix, two strands twist together and "face" each other. A base from one strand forms a complementary pair with a base from the other strand at the same position. A and T are complementary, and C and G are complementary. For example, given the DNA sequence

                                  AGTAGCAT

the complementary sequence must be

                                  TCATCGTA

The reverse complement is the complementary sequence in reverse order:

                                  ATGCTACT

Write a function rComplement to perform this task. Function rComplement takes one string as the input argument and returns a string containing the reverse complement.

4.2 Calculate the frequencies of the bases

For each DNA sequence in the input, calculate

the relative frequency of each base. The relative frequency of a base is the number of times that base occurs divided by the total number of bases in the sequence. E.g., in the sequence AGTAGCAT, base T has a relative frequency of 0.25.
Write a function uniFrequency to perform the above task. FunctionuniFrequency takes one string as input argument and returns a vector containing the four frequencies.
all pairwise frequencies of the bases. Since there are four bases, there are 16 possible pairs:
```
    AA AC AG AT
    CA CC CG CT
    GA GC GG GT
    TA TC TG TT
```
E.g., the pairwise frequencies of the sequence AGTAGCAT can be organized in a matrix as shown below:
```
    0  0  2  1
    1  0  0  0
    0  1  0  1
    1  0  0  0
```
The above matrix assumes the row and column "labels" in the order A C G T.
Write a function pairFrequency to perform the above task. Function pairFrequency takes a string as input argument and returns a matrix containing the 16 pairwise frequencies.

4.3 Determine the Hamming distance

The number of bases two DNA sequences have in common provides a simple measure of similarity. A converse criterion called Hamming distance measures the dissimilarity by counting the number of positions with different bases. The lines in the diagram below identify the locations at which the bases are different. (Note: these are two different sequences, not the two complementary strands in a sequence.)

     AAGTACATTGC
      |     |
     AGGTACACTGC

The Hamming distance between the two sequences above is 2.
Write a function hammingDist to calculate the Hamming distance between two sequences of equal length. Function hammingDist takes two strings as input arguments and returns the Hamming distance value.

4.4 Determine the weighted distance

Another criterion for measuring the dissimilarity between two sequences is the weighted distance. The idea is that some bases replace others more easily. E.g., a G replaces an A more easily than does a C. Therefore, a weight is assigned to each difference between two DNA sequences. These weights are shown below in a substitution matrix:

       A  C  G  T
     |------------
    A| 0  3  1  3
    C| 3  0  3  1
    G| 1  3  0  3
    T| 3  1  3  0

Each row corresponds to the base in one sequence; each column corresponds to the base in the other sequence. The number at location (row,col)=(A,T) represents the weight for changing A to T. The diagonal values are zero because there is no difference between A and A. E.g., the weighted distance between the two sequences AAG and CAA is 3+0+1=4 using the above substitution matrix.
(Note: The Hamming distance can be considered a weighted distance where the substitution matrix has zeros in the diagonal and the value one everywhere else.)
Write a function weightedDist to calculate the weighted distance between two sequences of equal length. Function weightedDist takes two strings as input arguments and returns the weighted distance value.

4.5 Program input and components

The DNA Sequence Analysis Program will work with two types of input:

Interactive The program should prompt the user to enter two input strings of equal length containing only the bases. The two input strings should be entered separately.
File The program loads two DNA sequences contained in a file DNAdata.mat. The data is organized as a matrix of characters. The two rows have the same length and each row is one DNA sequence.

The input strings can contain a mix of upper and lower case letters. Given the two input DNA sequences, the program will:

Check that the two input strings are of the same length and contain only the letters A, C, G, and T. If these two criteria are not met, then the program should display a helpful error message and then terminate. Write a function checkBases to perform the task of checking for the letters A, C, G, and T. Function checkBases takes a string as input argument and does not return any value.
Determine the reverse complement for each sequence. In each case, print the original sequence above its complementary sequence. Print all sequences in capital letters.
Calculate and display the relative and pairwise frequencies for all bases for each input sequence.
Calculate and display the Hamming and weighted distances between the two input sequences. Use the substitution matrix shown above. Display also the number of bases in each sequence.

5. Specifications and example

Do not use global variables in your program. You may write additional functions not specified above. The program should start execution from a script M-file that "drives" the entire program. Upon execution, the program should display the menu shown below:

  Welcome to DNA Sequence Analysis Program.

  Select input type:
  (I) User enters 2 sequences
  (F) File DNAdata.mat will be read as input

  Enter I or F:

After reading the input (from interactive entries or file), the program checks the input strings and then start computation if the input strings are "legal." Suppose the input strings are AgTAGcAT and cAgtcagt, then the program output should be:

  Sequence 1:  AGTAGCAT
  Reverse complement:  ATGCTACT

  Frequencies:
      A      C      G      T
    0.375  0.125  0.250  0.250

  Pairwise frequencies:
      0      0      2      1
      1      0      0      0
      0      1      0      1
      1      0      0      0

  Sequence 2:  CAGTCAGT
  Reverse complement:  ACTGACTG

  Frequencies:
      A      C      G      T
    0.250  0.250  0.250  0.250

  Pairwise frequencies:
      0      0      2      0
      2      0      0      0
      0      0      0      2
      0      1      0      0

  Number of bases in each sequence:  8
  Hamming distance between sequences 1 and 2:  7
  Weighted distance between sequences 1 and 2:  17

Your program output does not need to have the exact same format but should be similar to that shown above.

6. Testing

Test your program using the following five input cases:

Case 1: Enter interactively the following two sequences: gtatga, tcatcg
Case 2: Use as input the file DNAdata.mat given in CS100M-->Homework-->Projects
Case 3: Enter incorrect input interactively--two strings of different lengths
Case 4: Enter incorrect input interactively--string(s) contain(s) letters that are not bases
Case 5: Enter interactively two "legal" sequences of your choice

7. Submission

In less than one page, describe the organization of your program. This description should include a list of all the scripts and functions that you write. Each script and function should be described concisely. Any functions not specified in the above project statement must be clearly marked and carefully described. You are the software developer who needs to tell users (graders) how your program is organized and how it should be used. Your project will not be graded unless the organization of your DNA Sequence Analysis Program is described clearly. This description should be the first page after the table of contents. Submit all script and function files and all output from the test cases listed above.