Thursday, June 28, 2012

An Introduction to Hashing

Hashing is a convenient technique used to search for a key in almost constant time. Hashing basically places the keys of a record into an array H[0...m-1] using a hash function. The hash function must be chosen well because it affects the performance of the algorithm. A good hash function distributes the keys almost evenly over the table, yet is easy to compute, since excessive computation can slow the algorithm down. The modulo operation is often used here.
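
    To make this concrete, here is a minimal sketch of a modulo-based hash function in Python; the table size m = 13 and the way the key's characters are folded into an integer are illustrative assumptions, not something prescribed above.

    def h(key, m=13):
        # Fold the characters of the key into one integer,
        # then take the remainder modulo the table size m.
        total = 0
        for ch in str(key):
            total = total * 31 + ord(ch)
        return total % m

    print(h("BANANA"))   # some index in the range 0..12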

    One of the problems with hashing is collisions. A collision happens when two or more keys are hashed to the same cell. Collisions are inevitable when m < n, where m is the size of the table and n is the number of keys; this follows from the pigeonhole principle. Two ways of resolving collisions are open hashing and closed hashing.

    Open hashing makes a linked list whenever a collision occurs: each cell of the table holds the list of all keys hashed to it. So the hash table in open hashing is really a collection of linked lists, and when the hash function spreads the keys evenly, each list is much shorter than a single list holding all the keys would be.
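
    A small sketch of open hashing in Python (the table size and the use of Python's built-in hash() are assumptions made for illustration): each cell is a Python list standing in for a linked list, and colliding keys simply end up in the same list.

    class OpenHashTable:
        """Open hashing (separate chaining): each cell holds a list of keys."""

        def __init__(self, m=13):
            self.m = m
            self.table = [[] for _ in range(m)]

        def _hash(self, key):
            return hash(key) % self.m

        def insert(self, key):
            cell = self.table[self._hash(key)]
            if key not in cell:          # colliding keys share the same list
                cell.append(key)

        def search(self, key):
            return key in self.table[self._hash(key)]

    t = OpenHashTable()
    for k in ["A", "AND", "ARE", "KEY"]:
        t.insert(k)
    print(t.search("KEY"), t.search("MISSING"))   # True False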

    Closed hashing doesn’t use linked lists; instead the keys are stored in the table itself, so it is necessary for m to be at least as large as n. The simplest form of closed hashing is linear probing. The idea is that if a collision occurs, the computer checks the next cell; if that cell is free the key is written there, otherwise it checks the next one, and so on. The array here is treated as circular.
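
    The following is a rough sketch of linear probing in Python, again with an assumed table size; note how the probe index wraps around, treating the array as circular, and how a search stops as soon as it hits an empty cell.

    class LinearProbingTable:
        """Closed hashing with linear probing; the array is treated as circular."""

        def __init__(self, m=13):
            self.m = m
            self.table = [None] * m       # None marks an empty cell

        def _hash(self, key):
            return hash(key) % self.m

        def insert(self, key):
            i = self._hash(key)
            for _ in range(self.m):       # requires m >= n, i.e. the table is not full
                if self.table[i] is None or self.table[i] == key:
                    self.table[i] = key
                    return
                i = (i + 1) % self.m      # collision: check the next cell, wrapping around
            raise RuntimeError("hash table is full")

        def search(self, key):
            i = self._hash(key)
            for _ in range(self.m):
                if self.table[i] is None:
                    return False          # an empty cell ends the probe: key is absent
                if self.table[i] == key:
                    return True
                i = (i + 1) % self.m
            return False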

    A problem that may occur with linear probing is clustering; hash tables that are nearly full are especially vulnerable to it. A cluster is a sequence of contiguously occupied cells, and it makes the computer spend more time searching for a free cell. One solution is double hashing, another form of closed hashing, where one uses a second hash function s(K) to determine a fixed increment for the probing sequence used after a collision at location l = h(K).
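
    As a sketch, the probe sequence under double hashing could look like this in Python; the particular second hash s(K) = 1 + (K mod (m - 2)) is a common textbook choice assumed here for illustration, with m taken to be prime so that every cell is eventually probed.

    def double_hash_probes(k, m=13):
        """Yield the cells probed for integer key k under double hashing."""
        first = k % m                  # h(K): the first cell tried
        step = 1 + (k % (m - 2))       # s(K): fixed increment used after a collision
        for j in range(m):
            yield (first + j * step) % m

    print(list(double_hash_probes(27)))   # every cell appears once when m is prime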

References
Levitin, Anany. Introduction to the Design and Analysis of Algorithms. 2nd ed.

Suffix Trees and Pattern Matching: An Introduction

    With the advent of the Human Genome Project, the need for efficient and fast pattern-matching algorithms has become more apparent. The problem of pattern matching is a simple one: given a text t, one must find occurrences of a string p called the pattern. Applied to genes, given a long string of DNA sample, one is to find whether a sequence of nucleotides (that perhaps codes for a trait) exists in that sample. Fairly easy, but as one can see, brute-force algorithms could take as long as O(nm) to search for a match, where n is the length of the text and m the length of the pattern. Of course, one has to factor in the lengths of the DNA sample and of the nucleotide sequence, and oftentimes the DNA sample could contain thousands, if not millions, of nucleotides. This means that matching patterns by brute force is not something that can be done quickly.
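
    For comparison, a brute-force matcher takes only a few lines of Python; the sample text and pattern below are made up for illustration.

    def brute_force_match(text, pattern):
        """Slide the pattern over the text one position at a time: O(nm) worst case."""
        n, m = len(text), len(pattern)
        positions = []
        for i in range(n - m + 1):
            if text[i:i + m] == pattern:
                positions.append(i)
        return positions

    print(brute_force_match("ACGTAGTAAGTCACG", "AGTAAGTC"))   # [4]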

    Pattern matching comes in two types. Exact pattern matching searches for the exact pattern: if the string being searched for is AGTAAGTC, then that is precisely the string it looks for, and no variation would count as a match. The other type is approximate pattern matching. As the name implies, it searches for an approximate match of a sequence.

    One data structure that is effective for exact matching is the suffix tree. It is basically a tree of the suffixes of the whole text. For example, if the word is BANANA (with the terminator $ appended), then its suffixes are {BANANA$, ANANA$, NANA$, ANA$, NA$, A$, $}. A suffix tree is basically a keyword tree of these suffixes with nonbranching nodes collapsed. It is effective because it can match a pattern in time linear in the pattern's length. It applies the concept of threading: the string one wants to match is fed into the tree, and if the threading is complete, that is, it ends on an edge or a vertex after traversing the tree, then a match is found; otherwise there is no match.
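
    The sketch below builds an uncompressed trie of suffixes (a keyword tree, without the collapsing of nonbranching nodes that a true suffix tree performs) and then threads a pattern through it; it takes quadratic time and space to build, but it shows the threading idea.

    def build_suffix_trie(text):
        """Build an uncompressed trie containing every suffix of text + '$'."""
        text = text + "$"
        root = {}
        for i in range(len(text)):        # one insertion per suffix: quadratic overall
            node = root
            for ch in text[i:]:
                node = node.setdefault(ch, {})
        return root

    def thread(trie, pattern):
        """Thread the pattern through the trie; complete threading means a match."""
        node = trie
        for ch in pattern:
            if ch not in node:
                return False              # threading got stuck: no match
            node = node[ch]
        return True

    trie = build_suffix_trie("BANANA")
    print(thread(trie, "ANA"), thread(trie, "NAB"))   # True False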

    One problem with suffix trees is that their construction could take quadratic time; this happens when one builds them in the keyword-tree style described above. Fortunately, there exist algorithms that can construct the tree in linear time. One such algorithm was designed by Weiner.

    In general, suffix trees are extremely efficient when several patterns are being matched against one long text. They offer linear search time, faster than other known algorithms. The only downside is that they search for a string exactly, so if a nucleotide in the text has mutated, the suffix tree may fail to find a match.

References
Jones, Neil C. and Pevzner, Pavel A. An Introduction to Bioinformatics Algorithms.

Levitin, Anany. Introduction to the Design and Analysis of Algorithms. 2nd ed.