Identifying DNA Sequence by using String Matching Techniques
Computer Science and Engineering
Anil Kumar Pandey
Computer Science and Engineering
Abstract: In bioinformatics, patterns in DNA sequences have become the subject of many research papers. DNA sequences can be identified and compared with string matching techniques. This paper assesses two algorithms used for DNA sequence comparison: Longest Common Substring (LCS) and Rabin-Karp (R-K). The two algorithms are evaluated across different code implementations, using two criteria: accuracy and performance. The results show why these two algorithms are popular.
Keywords: DNA Similarity Algorithms; String Search; DNA Sequence Comparison; DNA Analysis; Longest Common Substring; Rabin Karp Algorithm.
People share many similar gene sequences, but some regions of the DNA sequence vary from person to person. Comparing the variation in these regions allows scientists to determine whether two DNA samples come from the same person. DNA is a molecule that carries the genetic instructions used by living organisms for growth, development, functioning and reproduction. A DNA sequence represents the genetic code within an organism. The genetic code is a set of sequences that defines which proteins are built within the organism.
This paper discusses two similarity algorithms and which of them performs better. These algorithms are used in both simple and complex commercial tools. A wide range of applications exists for DNA comparison, analysis, construction, etc.
Various algorithms are used for DNA comparison, such as Longest Common Substring (LCS) and Rabin-Karp (R-K). We evaluate the algorithms on two parameters: performance and accuracy.
II. METHODS TO DETECT DOCUMENT SIMILARITY
There are many techniques in this field for evaluating similarity between documents. The principle of brute force string matching is quite simple: check for a match between the first character of the pattern and the first character of the text, as shown in the figure below. If the characters match, the next pair of characters is compared; otherwise the pattern is shifted by one position and the comparison starts again.
Figure 1. Brute force string matching
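The brute force idea can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the evaluated implementations; the function name `brute_force_search` and the 0-based return convention are assumptions made here.

```python
def brute_force_search(text: str, pattern: str) -> int:
    """Return the 0-based index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    # Try every possible alignment of the pattern against the text.
    for start in range(n - m + 1):
        offset = 0
        # Compare character by character until a mismatch or a full match.
        while offset < m and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == m:
            return start
    return -1


if __name__ == "__main__":
    print(brute_force_search("ACGTACGT", "TAC"))
```

In the worst case every alignment is compared almost to the end of the pattern, which is why the approach becomes expensive on long texts.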
The brute force technique compares the subject text document with the investigated text documents word by word. This approach has pros and cons, and approaches of this kind are time consuming as well as resource consuming. A more efficient approach is based on document parameters such as the number of statements, paragraphs and punctuation marks; a similarity index between the documents is then calculated from these parameters. Alternatively, a document can be taken word by word and compared with each paragraph of the other document. On the other hand, comparing words can minimize the effect of changing a small number of words relative to the total document. Sentence-by-sentence or paragraph-by-paragraph comparison is also affected by factors such as size differences between the compared documents and the number of words revised in statements or paragraphs.
Hashing algorithms are used to calculate document similarity. They are also used in security to verify the integrity of a disk drive under investigation, that is, to detect whether its contents have been changed. Cryptographic hash functions are designed so that even a small change to the input produces a different hash value. A hash can be computed for a word, a paragraph, a page, or a whole text document.
Manber presented the approximate index concept for calculating the similarity of character strings between text documents, and the concept was later adopted by various similarity detection systems. 'Sif' is a tool that was developed for finding similar files in a large file system. Two search algorithms are used in the Sif tool. The first algorithm searches the file directory for documents that may be identical to the subject text document. The second algorithm uses the Internet to search for identical documents.
In most cases, similarity of cosmetic attributes (such as file type, size, number of words, etc.) is not a desirable basis for calculating the similarity between text documents. A checksum algorithm, also known as a 'fingerprint', is based on defining keywords in each text document; similarity is then calculated by parsing a certain number of characters starting from those predefined keywords.
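As a rough illustration of hashing at the paragraph level, the following sketch hashes each paragraph of two documents and reports the share of paragraph hashes they have in common. The choice of SHA-256, the blank-line paragraph split, and the Jaccard-style ratio are all assumptions made for this example; none of them is taken from Sif or from the tools surveyed here.

```python
import hashlib


def paragraph_hashes(document: str) -> set[str]:
    """Hash every non-empty paragraph (paragraphs are assumed to be separated by blank lines)."""
    paragraphs = [p.strip() for p in document.split("\n\n")]
    return {hashlib.sha256(p.encode("utf-8")).hexdigest() for p in paragraphs if p}


def similarity_index(doc_a: str, doc_b: str) -> float:
    """Shared paragraph hashes divided by all distinct paragraph hashes (assumed metric)."""
    hashes_a, hashes_b = paragraph_hashes(doc_a), paragraph_hashes(doc_b)
    if not hashes_a and not hashes_b:
        return 1.0  # two empty documents are treated as identical
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)


if __name__ == "__main__":
    a = "one\n\ntwo\n\nthree"
    b = "one\n\ntwo\n\nfour"
    print(similarity_index(a, b))
```

Because a hash changes completely when a single character changes, this kind of index only detects paragraphs that are exactly identical.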
A. DNA sequence Comparison
A DNA sequence is built from four nucleotide bases: Adenine, Guanine, Cytosine, and Thymine. DNA sequence and gene analysis is used in various research areas in biology, medicine and agriculture, such as diagnosis of possible diseases or abnormalities, forensics, pattern matching and biotechnology. Information technology tools and techniques are applied to the analysis and comparison of DNA sequences to advance knowledge in biological science.
DNA sequence analysis can be used to identify possible errors. An abnormality in a DNA sequence can be identified by comparing the sequence to a normal one. A sequence can also be compared with 'similar' genes from the same or different organisms to predict the function of a particular gene.
When a new DNA sequence is discovered, its functionality is predicted by matching it against known DNA sequences. This technique is used in various medical fields and research areas.
• Alignment of DNA Sequence
In DNA sequence alignment, the resemblance between two or more genetic codes is quantified. These comparisons can be used to determine information such as evolutionary divergence, the origins of disease, and ways to apply genetic codes from one organism to another. During alignment, the sequences are arranged against each other and matching characters in the two sequences are marked (Figure 2 and Figure 3).
Figure 2. Example of Sequence Alignment
Figure 2 shows an alignment of simple sequences, in which the matching characters of the two sequences are marked.
Figure 3. Sample of DNA Sequence Alignment
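The marking of matching characters between two aligned sequences can be sketched as follows. The sketch assumes the sequences have already been aligned to equal length, with '-' standing for a gap; the function name `mark_matches` and the '|' marker are choices made for this example only.

```python
def mark_matches(seq_a: str, seq_b: str) -> str:
    """Return a marker line with '|' under every position where both sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # A gap character never counts as a match.
    return "".join("|" if a == b and a != "-" else " " for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    top, bottom = "ACGT-A", "ACCTGA"
    print(top)
    print(mark_matches(top, bottom))
    print(bottom)
```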
• Tool for DNA Comparison and Analysis
Differentiating and comparing genes or genomes is used in biology and forensics. The available tools differ in size, complexity and functionality. Some small tools and websites have been developed for research or experimental purposes as free or open source software. Examples of such small, limited-purpose tools are Double Act (http://www.hpa-bioinfotools.org.uk/pise/double_act.html), Genomatix (http://www.genomatix.de), Mobyle (http://mobyle.pasteur.fr), ALIGN, FASTA, etc. An example of a large tool is BLAST (Basic Local Alignment Search Tool). The Smith-Waterman algorithm is also used for sequence alignment, including DNA-to-DNA sequence comparison in forensic crime investigation. It selects various segments (for example eight segments) from different locations of the DNA. BLAST uses dynamic programming and 'seeding' to find possible matches. The main goal is to find matches between DNA sequences without taking a significant amount of time and resources.
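To give a feel for the dynamic programming behind local alignment, here is a minimal sketch of Smith-Waterman scoring. It computes only the best local alignment score, not the alignment itself, and the scoring values (match = 2, mismatch = -1, gap = -1) are assumptions chosen for illustration rather than parameters used by any of the tools named above.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score of a and b under a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for ch_a in a:
        curr = [0]  # first column is always 0
        for j, ch_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A cell never drops below 0: a local alignment may start anywhere.
            curr.append(max(0, diagonal, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Keeping only two rows of the matrix is enough when just the score is needed; recovering the aligned segments would require the full matrix and a traceback.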
The ranking of different matches is another aspect that differentiates tools. Ranking is needed when several matches of the same size are found (for example, pqrs and pacq are both of size 4). Some algorithms then report the first match and others report the last match. Figure 4 shows a comparison of two DNA sequences in which the match between the sequences is displayed.
LCS: The Longest Common Subsequence (LCS) problem is to find the longest subsequence common to all sequences in a set of sequences. At least two sequences are needed for the comparison. Finding the longest common subsequence differs from finding common substrings: unlike substrings, subsequences are not required to occupy consecutive positions in the original sequences. LCS is a classic string comparison technique.
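The difference between a subsequence and a substring is easiest to see in code. The sketch below uses the standard dynamic programming formulation of the longest common subsequence; it is a textbook version written for this illustration, not one of the implementations evaluated in this paper.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (characters need not be adjacent)."""
    rows, cols = len(a), len(b)
    # length[i][j] = LCS length of a[:i] and b[:j]
    length = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    # Walk back through the table to recover the characters.
    result = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(result))


if __name__ == "__main__":
    print(longest_common_subsequence("AGGTVB", "GXTXVBNB"))
```

For AGGTVB and GXTXVBNB the common subsequence skips over the X characters, so it is longer than any contiguous block the two strings share.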
Table 1: LCS for several example string pairs (each result is a contiguous common substring)
String No 1    String No 2    LCS
ZABZDC         ABCBAE         AB
SBCRGH         SGDFHR         S
AGGTVB         GXTXVBNB       VB
GCGCBATG       GCCCTGCG       GCG
BCBABD C       EBABDDCBDA     BABD
cs106b         Rocks          C
LCS 1 Algorithm: The LCS 1 algorithm compares two strings, examines the candidate combinations of characters, and returns the longest common match. No historical knowledge (stored intermediate results) is required. The algorithm loops through the two strings to return the longest common string: one string is set as the reference and the other as the string to loop through. If several longest common strings have the same length, the first one found is returned by default. In another approach, the loop length is defined as the length of the smaller string, for example l = minimum(string1.Length, string2.Length). The algorithm may become slow when the strings are large.
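A loop-based search in the spirit of this description might look like the sketch below. It is an interpretation written for illustration, not the LCS 1 source code: it looks for the longest contiguous common block, keeps no table of earlier results, and returns the first block found when several have the same length. The names `reference` and `other` are assumptions.

```python
def longest_common_substring(reference: str, other: str) -> str:
    """Return the first longest block of consecutive characters shared by both strings."""
    best = ""
    for i in range(len(reference)):          # start position in the reference string
        for j in range(len(other)):          # start position in the looped-through string
            k = 0
            # Extend the match while both strings agree.
            while (i + k < len(reference) and j + k < len(other)
                   and reference[i + k] == other[j + k]):
                k += 1
            if k > len(best):                # strict '>' keeps the first match on ties
                best = reference[i:i + k]
    return best


if __name__ == "__main__":
    for first, second in [("ZABZDC", "ABCBAE"), ("SBCRGH", "SGDFHR"),
                          ("AGGTVB", "GXTXVBNB"), ("GCGCBATG", "GCCCTGCG")]:
        print(first, second, longest_common_substring(first, second))
```

The three nested loops make the cost grow quickly with string length, which matches the remark that the algorithm becomes slow on large inputs.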
Modifications of this algorithm are referred to as LCS 2, LCS 3, LCS 4, LCS 5, LCS 6 and LCS 7.
Figure 4 shows a dot plot built from two strings, ATGCGATAGAGT and CCCAGTATAGATTA. In a dot plot, the longest common diagonal line shows the LCS.
Figure 4. Dot Plot of DNA
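A text version of a dot plot, together with a search for its longest diagonal run, can be sketched as follows. The '*' and '.' symbols and the function names are assumptions made for this example; the diagonal search is the usual dynamic programming formulation in which each cell stores the length of the diagonal run ending there.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[str]:
    """One row per character of seq_a; '*' marks positions where the characters are equal."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


def longest_diagonal(seq_a: str, seq_b: str) -> str:
    """Return the characters along the longest diagonal run of matches in the dot plot."""
    best_len, best_end = 0, 0
    prev = [0] * (len(seq_b) + 1)
    for i, a in enumerate(seq_a, start=1):
        curr = [0] * (len(seq_b) + 1)
        for j, b in enumerate(seq_b, start=1):
            if a == b:
                curr[j] = prev[j - 1] + 1    # extend the diagonal ending at (i-1, j-1)
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return seq_a[best_end - best_len:best_end]


if __name__ == "__main__":
    s1, s2 = "ATGCGATAGAGT", "CCCAGTATAGATTA"
    print("\n".join(dot_plot(s1, s2)))
    print(longest_diagonal(s1, s2))
```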
Rabin-Karp: The Rabin-Karp algorithm is a string matching algorithm that uses hashing. It is widely used for searching multiple patterns.
The Rabin-Karp string matching algorithm first calculates the hash value of the pattern and then the hash value of each M-character subsequence of the text to be compared. If the hash values are not equal, the algorithm moves on to the hash value of the next M-character subsequence. If the hash values are equal, the algorithm compares the pattern and the M-character subsequence character by character. In this way, only one hash comparison is required per text subsequence, and character matching is needed only when the hash values match.
Rabin Karp Algorithm
B. Use of Hashing for Shifting Substring Search
The Rabin-Karp algorithm speeds up the testing of the pattern against the substrings of the text by using a hash function. A hash function converts every string into a numeric value, called its hash value; for example, we might have hash("good") = 4. The algorithm exploits the fact that if two strings are equal, their hash values are also equal. Thus string matching is reduced to computing the hash value of the search pattern and then looking for substrings of the text with the same hash value.
function RK(string s[1..m], string pattern[1..n])
    hpattern = hash(pattern[1..n]); hs = hash(s[1..n])
    for j = 1 to m-n+1
        if hs = hpattern
            if s[j..j+n-1] = pattern[1..n]
                return j
        hs = hash(s[j+1..j+n])
    return not found
Figure 5. Algorithm for Hashing
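A direct Python rendering of the pseudocode in Figure 5 might look like the sketch below. It uses Python's built-in `hash` function as a stand-in for the unspecified hash function, recomputes the hash of every window from scratch as the pseudocode does, and returns a 0-based index or -1; these choices are assumptions made for illustration.

```python
def rk_search(s: str, pattern: str) -> int:
    """Return the 0-based index of the first match of pattern in s, or -1 if not found."""
    m, n = len(s), len(pattern)
    if n == 0 or n > m:
        return -1
    h_pattern = hash(pattern)
    for j in range(m - n + 1):
        window = s[j:j + n]
        # Compare characters only when the hash values agree.
        if hash(window) == h_pattern and window == pattern:
            return j
    return -1


if __name__ == "__main__":
    print(rk_search("acrbcadabra", "cad"))
```

Python randomizes string hashes between interpreter runs, but within a single run equal strings always hash to the same value, which is all this sketch relies on.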
There are two problems with hashing. First, because there are so many different strings and so few hash values, some different strings will have the same hash value. Therefore, when the hash values match, the pattern and the substring do not necessarily match, and the potential match must be confirmed by comparing the search pattern and the substring directly. This comparison takes a long time for long substrings and less time for short ones. Fortunately, a good hash function has few collisions on typical strings, so the expected search time is acceptable. The second problem is the cost of recomputing the hash value for every shifted substring, which is addressed by the rolling hash described in the next section.
C. Generation of hash function
The performance of the Rabin-Karp algorithm depends on the efficient calculation of hash values for the consecutive substrings of the text. An efficient rolling hash function is the Rabin fingerprint. The Rabin fingerprint treats each substring as a number in some base, the base commonly being a large prime number. For example, if the substring is "mn" and the base is 102, the hash value is $109 \times 102^{1} + 110 \times 102^{0} = 11228$ (the ASCII value of 'm' is 109 and the ASCII value of 'n' is 110).
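The "mn" example can be checked with a few lines of Python. The function name `positional_hash` is an assumption made here; the base of 102 follows the example in the text, and no modulus is applied so that the numbers stay comparable with the hand calculation.

```python
def positional_hash(s: str, base: int = 102) -> int:
    """Treat each character code as a digit of a number written in the given base."""
    n = len(s)
    # The leftmost character gets the highest power of the base.
    return sum(ord(ch) * base ** (n - 1 - i) for i, ch in enumerate(s))


if __name__ == "__main__":
    print(ord("m"), ord("n"))
    print(positional_hash("mn"))
```

Practical implementations usually reduce the value modulo a large number to keep it bounded; that step is left out of this sketch.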
Technically, this hash is only similar to a true number in a non-decimal positional representation, since the 'base' may be smaller than one of the 'digits'. The main benefit of a rolling hash function is that the hash value of the next substring can be calculated from the hash value of the preceding substring with a constant number of operations, independent of the length of the substrings.
For example, if the text is "acrbcadabra" and we are searching for a pattern of length 3, then the hash of the first substring, "acr", using base 102 is computed from the ASCII values of a, c and r:
a = 97, c = 99, r = 114
$\mathrm{Hash}(\text{"acr"}) = (97 \times 102^{2}) + (99 \times 102^{1}) + (114 \times 102^{0}) = 1019400$
We can then calculate the hash value of the next substring, "crb", from the hash value of "acr" by subtracting the term added for the leading 'a' of "acr", i.e. $97 \times 102^{2}$, multiplying the result by the base, and adding the term for the trailing 'b' of "crb", i.e. $98 \times 102^{0}$ (the ASCII value of 'b' is 98):
$\mathrm{Hash}(\text{"crb"}) = [102 \times (1019400 - (97 \times 102^{2}))] + (98 \times 102^{0}) = 1041722$
Here 102 is the base, 1019400 is the old hash, $97 \times 102^{2}$ is the term for the old 'a', and $98 \times 102^{0}$ is the term for the new 'b'.
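The rolling update for the "acrbcadabra" text can be verified numerically. The sketch below computes the hash of the first window "acr" directly and then derives the hash of the next window "crb" from it with a fixed number of arithmetic operations; the helper names `polynomial_hash` and `roll` are assumptions made for this example.

```python
BASE = 102  # base used in the worked example


def polynomial_hash(s: str, base: int = BASE) -> int:
    """Hash a string by reading its character codes as digits in the given base."""
    value = 0
    for ch in s:
        value = value * base + ord(ch)
    return value


def roll(old_hash: int, outgoing: str, incoming: str, length: int, base: int = BASE) -> int:
    """Remove the leading character's term, shift by one digit, add the new trailing character."""
    return (old_hash - ord(outgoing) * base ** (length - 1)) * base + ord(incoming)


if __name__ == "__main__":
    text = "acrbcadabra"
    first = polynomial_hash(text[0:3])         # window "acr"
    second = roll(first, text[0], text[3], 3)  # window "crb"
    print(first, second, second == polynomial_hash("crb"))
```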
If the substrings are long, this algorithm achieves large savings compared with many other hashing schemes.
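Putting the pieces together, a Rabin-Karp search with a rolling hash could be sketched as follows. This is an illustrative version, not one of the evaluated implementations: the function name `rabin_karp_all`, the modulus 1,000,000,007 used to keep hash values small, and the choice to return every match position are assumptions made here.

```python
def rabin_karp_all(text: str, pattern: str, base: int = 102,
                   modulus: int = 1_000_000_007) -> list[int]:
    """Return the 0-based start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, modulus)  # weight of the leading character in a window
    pattern_hash = window_hash = 0
    for k in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[k])) % modulus
        window_hash = (window_hash * base + ord(text[k])) % modulus
    matches = []
    for j in range(n - m + 1):
        # Equal hashes may still be a collision, so confirm with a direct comparison.
        if window_hash == pattern_hash and text[j:j + m] == pattern:
            matches.append(j)
        if j + m < n:
            # Roll the window: drop text[j], append text[j + m].
            window_hash = ((window_hash - ord(text[j]) * high) * base
                           + ord(text[j + m])) % modulus
    return matches


if __name__ == "__main__":
    print(rabin_karp_all("ATGCGATAGAGT", "AG"))
```

Each window hash is obtained from the previous one in constant time, so the character-by-character comparison runs only on windows whose hash equals the pattern hash.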
III. RELATED WORK
There are several methods for finding similarity in a sequence. Some of these search methods use no alignment [1, 2, 6], while others allow for alignment with insertions or deletions and try to find the best possible alignment.
A sequence similarity search method that allows insertions and deletions was published together with a computer program for finding similarities in the amino acid sequences of two proteins.
Some similarity algorithms depend on the longest common subsequence (LCS) idea, which is commonly used in computer science to find the similarity between different sequences. One study introduced new variants of the LCS problem and presented efficient algorithms to solve them. Its authors showed that their algorithms can solve several molecular biology problems.
Furthermore, a parallel version of the LCS algorithm that finds the alignment between DNA and protein sequences was built in BLAST, which is a large tool. The algorithm was tested and showed a performance increase of about 24-30% over the serial LCS.
LCS is also a building block for an algorithm that searches for specific motifs in a DNA database. That algorithm was then generalized to solve the common subsequence problem from the computational aspect. Although the complexity of the algorithm is exponential in general, it is polynomial when the threshold value (t) and the length of the longest common subsequence (c) are sufficiently close.
IV. GOALS AND APPROACHES
We can calculate DNA sequence similarity based on:
1. The ratio of the number of string matches to the total size of the DNA sequences.
2. The number of characters in the maximum string match between the two DNA sequences.
3. LCS and R-K are two popular metrics for measuring the level of similarity between two DNA sequences. We evaluate different implementations of the LCS and R-K algorithms based on performance and accuracy.
4. While surveying related research papers and articles, we noticed some conflicting results in calculating LCS and R-K. In this experiment, we tried to define the different approaches used to develop those algorithms in order to compare their results in terms of accuracy and performance.
5. In the comparison of the two algorithms, Rabin-Karp is better than LCS because LCS is more time consuming than Rabin-Karp.
6. The performance and accuracy of the Rabin-Karp algorithm are better than those of LCS because Rabin-Karp compares hash values before comparing characters.
All of these algorithms are implemented in the C#, C and Java programming languages. Some of the code is taken from research resources, while we developed other implementations based on either algorithmic descriptions or pseudocode given in the literature.
In this paper, we evaluated code implementations of two widely popular DNA sequence comparison algorithms: Longest Common Substring (LCS) and Rabin-Karp (R-K). In our comparison, the Rabin-Karp algorithm performed better than the LCS algorithm. A survey of these widely used algorithms in bioinformatics and DNA sequence comparison showed that they have different implementations. In addition, evaluating the same DNA sequences with different tools may show different results. While some of the differences are expected and stem from different default considerations or interpretations of those algorithms, other results showed that implementations of the same algorithm are somewhat different and inconsistent. Using new programming data structures and algorithms showed significant improvement in the efficiency of finding the solution. Further, reduction algorithms and techniques should be used to reduce the calculation time.
[1] G. Huang, H. Zhou, Y. Li and L. Xu, "Alignment-free comparison of genome sequences by a new numerical characterization", Journal of Theoretical Biology, vol. 281, no. 1, (2011), pp. 107-112.
[2] C. Yu, S.-Y. Cheng, R. L. He and S. S.-T. Yau, "Protein map: An alignment-free sequence comparison method based on various properties of amino acids", Gene, vol. 486, (2011), pp. 110-118.
[3] Double Act Tool Home Page: http://www.hpa-bioinfotools.org.uk/pise/double_act.html
[4] Genomatix Tool Home Page: http://www.genomatix.de
[5] Y. Guo and T.-M. Wang, "A new method to analyze the similarity of the DNA sequences", Journal of Molecular Structure: THEOCHEM, vol. 853, (2008), pp. 62-67.
[6] BLAST, http://blast.ncbi.nlm.nih.gov/Blast.cgi, (2011) September.
[7] C. S. Iliopoulos and M. S. Rahman, "Algorithms for Computing Variants of the Longest Common Subsequence Problem", Theoretical Computer Science, vol. 395, no. 2-3, (2008), pp. 255-267.
[8] I. Alsmadi and M. Nuser, "String Matching Evaluation Methods for DNA Comparison", International Journal of Advanced Science and Technology, vol. 47, (2012) October.
[9] T. M. Inbamalar and R. Sivakumar, "Study of DNA Sequence Analysis Using DSP Technique", Journal of Automation and Control Engineering, vol. 1, no. 4, (2013) December.