Uses of suffix trees
1) Finding a
string that occurs more than ‘m’ times in the text
A
node (with contents from root to that node as ‘c’) which
has ‘k’ leaf level children indicates that ‘c’
occurs ‘k’ times in the text. For example, in the suffix
tree above, ‘e’ has 3 leaf-level-children, so it must
occur 3 times in ‘peeper’.
Similarly, ‘pe’ has 2 leaf level children, so it occurs 2
times in ‘peeper’.
Hence
to find a string that occurs more than ‘m’ times in a
text, we find all nodes in the trie which have more than ‘m’
leaf level children. For all those nodes, strings from root to those
nodes occur more than ‘m’ times.
2) To find all
the occurrences of a pattern
We
search the pattern normally as we search a given string. If the
search terminates at a leaf, then the pattern occurs exactly once,
else all its occurrences can be found in the paths leading from that
node to all its leaves. To further speed this up, one can link all
the leaves of the trie into a linked list and keep first-leaf and
last-leaf in every node. To find all occurrences then, we can
directly traverse from first-leaf to last-leaf.
3) Common
substring among N substrings
To
find common substring (not subsequence*) among N strings, following
can be done:
1)
Append a unique end-of-string marker to each input string like $1,
$2, $3 etc.
2)
Form a suffix tree by getting suffixes from all the above strings.
3)
In this suffix tree, find the node which has the longest depth from
the root and its leaf nodes have at least one suffix node in each of
the input strings ($1, $2, $3... end-of-string markers help here to
determine that the chosen node’s leaves are present in all the
strings).
*
Elements in a subsequence need not be contiguous like a substring
4) Longest
palindrome in a string
Palindrome
is a string which reads the same from front to end as it reads from
end to front. Example: abcdcba
To
find the longest palindrome in string S1, following can be done:
1)
Reverse the string S1 to get another string S2
Find
the longest repeated substring in S1 and S2 by making a common
suffix tree for both the strings.
A
naïve solution for this could be to check every point in the
string and note the length of the maximum palindrome around it.
Order for this solution would be O(n2)
5) Longest
Repeated Substring (LRS)
This
is a variation of #1 above. We need to find the deepest node which
has more than 1 children. The characters from root to that deepest
node form the longest repeated substring and the leaves form those
suffixes where that LRS is present.
A
naive LRS finding algorithm is shown below:
(
Note the
order of searching LRS is O(n)3
)
|