CS100 M Spring 2002
Project 3
Due Thursday 3/7 at beginning of lecture
1. Goals
You will design and write a program to solve problems in the field of computational genomics. Completing this project will help you learn about the following:
- program design and organization
- user-defined function
- characters and strings
- two-dimensional array (matrix)
- file input and output
- program documentation
Most of us are not experts in computational genomics, so do not despair if you do not understand everything about the genome and DNA. You need only to read carefully and write a program to solve the specific tasks identified in sections 4, 6, and 7. Follow the instructions on CS100M-->Homework-->Projects in assembling your project for submission.
2. Motivation
Today's advanced technology can perform DNA sequencing on a truly industrial scale. Research generates new data at an explosive rate. The field of computational genomics endeavors to provide biological "understanding" from this data-flood through mathematical models and computer algorithms. Since DNA is so fundamental to biology, advances in this field will have impact on medicine and agriculture and will improve insights into many questions we have in biology.
3. Background
3.1 Genome and DNA
A genome is all the DNA in an organism, including the organism's genes. A gene carries information for making all the proteins required by all organisms. These proteins determine, among other things, how the organism looks, behaves, and fights infections.
DNA is short for deoxyribonucleic acid. A group of similar chemicals called nucleotides make up DNA. The four types of nucleotides, also called bases, are Adenine, Cytosine, Guanine, and Thymine. We will refer to the four bases by their first letters, A, C, G, and T. These bases repeat millions or billions of times throughout a genome. The human genome, for example, has three billion pairs of bases.
DNA forms a double helix. A double helix has two strands of DNA that bind together and twist into a helical shape. One strand in the helix "complements" the other. Knowing one strand allows you to determine uniquely the complementary strand. Therefore, considering only one strand of DNA does not discard any essential data in some applications. The particular order of As, Cs, Gs, and Ts underlies all of life's diversity. Different orders generate humans, flies, rice, and so on!
3.2 Computer scientist's view of DNA sequences
Computer scientists think of a DNA sequence as a string from an alphabet set of four characters: A, C, G, and T. Like the English alphabet set of 26 characters, the order is extremely important. For example, the English strings mate, meat, tame, and team certainly mean different things!
3.3 DNA Sequence Analysis
DNA sequence analysis requires tools that take large amounts of DNA sequence data and derive information, such as statistics, for organizing and understanding the "data flood." Some of this derived information includes:
- Difference Compare different DNA sequences from different species or from different places in one species' genome to investigate similarities. Very similar DNA sequences encode similar proteins, which perform similar functions. Biologists compare newly discovered genes with all known DNA sequences and hypothesize that the new gene functions similarly to the genes that it closely resembles.
- Statistics Find the frequencies of specific bases or sequences of bases (e.g., the number of As or the number of times the substring ATA appears). The frequencies are important as the bonds between different bases have different strengths.
4. DNA Sequence Analysis Program
Develop a DNA Sequence Analysis Program to perform the following tasks:
4.1 Determine the reverse complement
For every DNA sequence in the input, determine the complementary bases in reverse order. Recall that in the DNA double helix, two strands twist together and "face" each other. A base from one strand forms a complementary pair with a base from the other strand at the same position. A and T are complementary, and C and G are complementary. For example, given the DNA sequence
AGTAGCAT
the complementary sequence must be
TCATCGTA
The reverse complement is the complementary sequence in reverse order:
ATGCTACT
Write a function rComplement to perform this task. Function rComplement takes one string as the input argument and returns a string containing the reverse complement.
4.2 Calculate the frequencies of the bases
For each DNA sequence in the input, calculate
4.3 Determine the Hamming distance
The number of bases two DNA sequences have in common provides a simple measure of similarity. A converse criterion called Hamming distance measures the dissimilarity by counting the number of positions with different bases. The lines in the diagram below identify the locations at which the bases are different. (Note: these are two different sequences, not the two complementary strands in a sequence.)
AAGTACATTGC
| |
AGGTACACTGC
The Hamming distance between the two sequences above is 2.
Write a function hammingDist to calculate the Hamming distance between two sequences of equal length. Function hammingDist takes two strings as input arguments and returns the Hamming distance value.
4.4 Determine the weighted distance
Another criterion for measuring the dissimilarity between two sequences is the weighted distance. The idea is that some bases replace others more easily. E.g., a G replaces an A more easily than does a C. Therefore, a weight is assigned to each difference between two DNA sequences. These weights are shown below in a substitution matrix:
A C G T
|------------
A| 0 3 1 3
C| 3 0 3 1
G| 1 3 0 3
T| 3 1 3 0
Each row corresponds to the base in one sequence; each column corresponds to the base in the other sequence. The number at location (row,col)=(A,T) represents the weight for changing A to T. The diagonal values are zero because there is no difference between A and A. E.g., the weighted distance between the two sequences AAG and CAA is 3+0+1=4 using the above substitution matrix.
(Note: The Hamming distance can be considered a weighted distance where the substitution matrix has zeros in the diagonal and the value one everywhere else.)
Write a function weightedDist to calculate the weighted distance between two sequences of equal length. Function weightedDist takes two strings as input arguments and returns the weighted distance value.
4.5 Program input and components
The DNA Sequence Analysis Program will work with two types of input:
- Interactive The program should prompt the user to enter two input strings of equal length containing only the bases. The two input strings should be entered separately.
- File The program loads two DNA sequences contained in a file DNAdata.mat. The data is organized as a matrix of characters. The two rows have the same length and each row is one DNA sequence.
The input strings can contain a mix of upper and lower case letters. Given the two input DNA sequences, the program will:
- Check that the two input strings are of the same length and contain only the letters A, C, G, and T. If these two criteria are not met, then the program should display a helpful error message and then terminate. Write a function checkBases to perform the task of checking for the letters A, C, G, and T. Function checkBases takes a string as input argument and does not return any value.
- Determine the reverse complement for each sequence. In each case, print the original sequence above its complementary sequence. Print all sequences in capital letters.
- Calculate and display the relative and pairwise frequencies for all bases for each input sequence.
- Calculate and display the Hamming and weighted distances between the two input sequences. Use the substitution matrix shown above. Display also the number of bases in each sequence.
5. Specifications and example
Do not use global variables in your program. You may write additional functions not specified above. The program should start execution from a script M-file that "drives" the entire program. Upon execution, the program should display the menu shown below:
Welcome to DNA Sequence Analysis Program.
Select input type:
(I) User enters 2 sequences
(F) File DNAdata.mat will be read as input
Enter I or F:
After reading the input (from interactive entries or file), the program checks the input strings and then start computation if the input strings are "legal." Suppose the input strings are AgTAGcAT and cAgtcagt, then the program output should be:
Sequence 1: AGTAGCAT
Reverse complement: ATGCTACT
Frequencies:
A C G T
0.375 0.125 0.250 0.250
Pairwise frequencies:
0 0 2 1
1 0 0 0
0 1 0 1
1 0 0 0
Sequence 2: CAGTCAGT
Reverse complement: ACTGACTG
Frequencies:
A C G T
0.250 0.250 0.250 0.250
Pairwise frequencies:
0 0 2 0
2 0 0 0
0 0 0 2
0 1 0 0
Number of bases in each sequence: 8
Hamming distance between sequences 1 and 2: 7
Weighted distance between sequences 1 and 2: 17
Your program output does not need to have the exact same format but should be similar to that shown above.
6. Testing
Test your program using the following five input cases:
- Case 1: Enter interactively the following two sequences: gtatga, tcatcg
- Case 2: Use as input the file DNAdata.mat given in CS100M-->Homework-->Projects
- Case 3: Enter incorrect input interactively--two strings of different lengths
- Case 4: Enter incorrect input interactively--string(s) contain(s) letters that are not bases
- Case 5: Enter interactively two "legal" sequences of your choice
7. Submission
In less than one page, describe the organization of your program. This description should include a list of all the scripts and functions that you write. Each script and function should be described concisely. Any functions not specified in the above project statement must be clearly marked and carefully described. You are the software developer who needs to tell users (graders) how your program is organized and how it should be used. Your project will not be graded unless the organization of your DNA Sequence Analysis Program is described clearly. This description should be the first page after the table of contents. Submit all script and function files and all output from the test cases listed above.