Sub-linear algorithms for shortest unique substring and maximal unique matches
Senanayaka, Sanjeewa Bandara
University of Wisconsin--Whitewater
MetadataShow full item record
Living in an era where we produce quintillions of data on a daily basis means that there are many intriguing (possibly unrevealed) facts behind them. Some of the data that exists in large quantities is encountered in Computational Biology; our DNA strands are extremely long ( 3.2 billion characters). Genomic problems such as comparing similarities between two strands of DNA or finding the uniqueness between two similar strands of DNA can hugely benefit in the understanding of evolution and possibly create medical advances. A couple of avenues of addressing a problem of similar flavor (known as gene alignment) are using Shortest Unique Substrings and Maximal Unique Matches. Existing techniques for these problems either necessitate the use of high performance computers, or using techniques that have large memory footprints, or use of (theoretical) techniques that are complicated to construct. Prior to our work, Ganguly et al. [TCS’ 17] showed that these problems can be solved in O(n2 log n ) time and O(n=) space for any parameter = !(1) and = o(n). However, these techniques require the use of complex data structures and techniques, which makes them practically intractable. Presented here are sub-linear algorithms for the Shortest Unique Substring problem and the Maximal Unique Matches problem. Besides using minimal amounts of work space and being (reasonably) fast, our algorithms are a combination of two simple techniques – Rabin-Karp fingerprint and Bloom Filters; this makes our work easily amenable to implementations. However, our algorithms suffer from a bounded error probability, which can be reduced using (slightly) more space.
Data structures (Computer science)