You will design and write a program to solve problems in the field of computational genomics. Completing this project will help you learn about the following:
Today's advanced technology can perform DNA sequencing on a truly industrial scale. Research generates new data at an explosive rate. The field of computational genomics endeavors to provide biological "understanding" from this data-flood through mathematical models and computer algorithms. Since DNA is so fundamental to biology, advances in this field will have impact on medicine and agriculture and will improve insights into many questions we have in biology.
A genome is all the DNA in an organism, including the organism's genes. A gene carries information for making all the proteins required by all organisms. These proteins determine, among other things, how the organism looks, behaves, and fights infections.
DNA is short for deoxyribonucleic acid. A group of similar chemicals called nucleotides make up DNA. The four types of nucleotides, also called bases, are Adenine, Cytosine, Guanine, and Thymine. We will refer to the four bases by their first letters, A, C, G, and T. These bases repeat millions or billions of times throughout a genome. The human genome, for example, has three billion pairs of bases. DNA forms a double helix. A double helix has two strands of DNA that bind together and twist into a helical shape. One strand in the helix "complements" the other. Knowing one strand allows you to determine uniquely the complementary strand. Therefore, considering only one strand of DNA does not discard any essential data in some applications. The particular order of As, Cs, Gs, and Ts underlies all of life's diversity. Different orders generate humans, flies, rice, and so on!
Computer scientists think of a DNA sequence as a string from an alphabet set of four characters: A, C, G, and T. Like the English alphabet set of 26 characters, the order is extremely important. For example, the English strings mate, meat, tame, and team certainly mean different things!
DNA sequence analysis requires tools that take large amounts of DNA sequence data and derive information, such as statistics, for organizing and understanding the "data flood." Some of this derived information includes:
AGTAGCAT
the complementary sequence must be
TCATCGTA
The reverse complement is the complementary sequence in reverse order:
ATGCTACT
Write a function rComplement to perform this task. Function rComplement takes one string as the input argument and returns a string containing the reverse complement.
AA AC AG AT
CA CC CG CT
GA GC GG GT
TA TC TG TT
E.g., the pairwise frequencies of the sequence AGTAGCAT can be organized in a matrix as shown below:
0 0 2 1
1 0 0 0
0 1 0 1
1 0 0 0
The above matrix assumes the row and column "labels" in the order A C G T.
AAGTACATTGC
| |
AGGTACACTGC
The Hamming distance between the two sequences above is 2.
A C G T
|------------
A| 0 3 1 3
C| 3 0 3 1
G| 1 3 0 3
T| 3 1 3 0
Each row corresponds to the base in one sequence; each column corresponds to the base in the other sequence. The number at location (row,col)=(A,T) represents the weight for changing A to T. The diagonal values are zero because there is no difference between A and A. E.g., the weighted distance between the two sequences AAG and CAA is 3+0+1=4 using the above substitution matrix.
The DNA Sequence Analysis Program will work with two types of input:
The input strings can contain a mix of upper and lower case letters. Given the two input DNA sequences, the program will:
Do not use global variables in your program. You may write additional functions not specified above. The program should start execution from a script M-file that "drives" the entire program. Upon execution, the program should display the menu shown below:
Welcome to DNA Sequence Analysis Program. Select input type: (I) User enters 2 sequences (F) File DNAdata.mat will be read as input Enter I or F:
After reading the input (from interactive entries or file), the program checks the input strings and then start computation if the input strings are "legal." Suppose the input strings are AgTAGcAT and cAgtcagt, then the program output should be:
Sequence 1: AGTAGCAT
Reverse complement: ATGCTACT
Frequencies:
A C G T
0.375 0.125 0.250 0.250
Pairwise frequencies:
0 0 2 1
1 0 0 0
0 1 0 1
1 0 0 0
Sequence 2: CAGTCAGT
Reverse complement: ACTGACTG
Frequencies:
A C G T
0.250 0.250 0.250 0.250
Pairwise frequencies:
0 0 2 0
2 0 0 0
0 0 0 2
0 1 0 0
Number of bases in each sequence: 8
Hamming distance between sequences 1 and 2: 7
Weighted distance between sequences 1 and 2: 17
Your program output does not need to have the exact same format but should be similar to that shown above.