
Chapter 6: Strings in Algorithms

Strings are a fundamental data type in computer science, representing sequences of characters. The study of algorithms and data structures for processing strings is a crucial part of any computer science curriculum. In this chapter, we explore several important string-processing algorithms, including string sorts, tries, substring search, regular expressions, and data compression. These algorithms have numerous practical applications, from text editing and document retrieval to bioinformatics and natural language processing.

String Sorts

Sorting is a fundamental operation in computer science, and sorting strings is a common task in many applications. While general-purpose sorting algorithms like quicksort and mergesort can be used to sort strings, there are more efficient algorithms that take advantage of the special properties of strings.

Key-Indexed Counting

Key-indexed counting is a simple and efficient method for sorting items whose keys are characters drawn from a small alphabet. Applied to strings, it sorts on a single character position at a time (the example below sorts by the first character), and it is the basic building block of the radix sorts that follow. The idea is to count the number of occurrences of each key and then use those counts to determine each string's position in the sorted order.

Here's an example of key-indexed counting in action:

Input:  ["dab", "cab", "fad", "bad", "dad", "ebb", "ace"]
Output: ["ace", "bad", "cab", "dab", "dad", "ebb", "fad"]

The algorithm works as follows:

  1. Count the frequency of each key character across the strings.
  2. Transform the counts into starting indices in the sorted array by accumulating a running sum.
  3. Move each string to its position in the sorted array, incrementing the corresponding index as it is used (which keeps the sort stable).

Key-indexed counting runs in O(n + R) time, where n is the number of strings and R is the size of the alphabet (the number of distinct key values). This makes it a very efficient way to sort on keys drawn from a small alphabet.
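
Here is a minimal Python sketch of the three steps (the function name and the extended-ASCII alphabet, R = 256, are assumptions of this sketch, not part of the algorithm):

def key_indexed_counting(strings, d=0, R=256):
    """Stably sort strings by the character at position d."""
    count = [0] * (R + 1)
    for s in strings:                      # 1. count key frequencies
        count[ord(s[d]) + 1] += 1
    for r in range(R):                     # 2. counts -> starting indices
        count[r + 1] += count[r]
    aux = [None] * len(strings)
    for s in strings:                      # 3. distribute, preserving order
        aux[count[ord(s[d])]] = s
        count[ord(s[d])] += 1
    return aux

print(key_indexed_counting(["dab", "cab", "fad", "bad", "dad", "ebb", "ace"]))
# ['ace', 'bad', 'cab', 'dab', 'dad', 'ebb', 'fad']

The extra slot in the count array (the +1 offset while counting) is what turns the running sums directly into starting indices.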

LSD Radix Sort

Least-significant-digit-first (LSD) radix sort extends key-indexed counting to sort fixed-length strings. The basic idea is to sort the strings character by character, starting from the last character and moving towards the first, using key-indexed counting (which is stable) for each pass.

Here's an example of LSD radix sort:

Input:  ["4PGC938", "2IYE230", "3CIO720", "1ICK750", "1OHV845", "4JZY524", "1ICK750", "3CIO720", "1OHV845", "1OHV845", "2RLA629", "2RLA629", "3ATW723"]
Output: ["1ICK750", "1ICK750", "1OHV845", "1OHV845", "1OHV845", "2IYE230", "2RLA629", "2RLA629", "3ATW723", "3CIO720", "3CIO720", "4JZY524", "4PGC938"]

The algorithm works as follows:

  1. Sort the strings by the last character using key-indexed counting.
  2. Repeat step 1 for each character moving towards the first.

LSD radix sort runs in O(w(n + R)) time, commonly summarized as O(w * n) for a fixed alphabet, where w is the length of the strings, n is the number of strings, and R is the alphabet size. Its correctness rests on stability: after the pass on position d, the strings are in sorted order with respect to their last w - d characters. This makes it an efficient choice for sorting fixed-length strings.
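
Because key-indexed counting is stable, LSD radix sort is just w passes of it, from the last character position to the first. A sketch, reusing key_indexed_counting from above:

def lsd_sort(strings, w):
    """Sort fixed-length strings, one character position per pass."""
    for d in range(w - 1, -1, -1):         # rightmost position first
        strings = key_indexed_counting(strings, d)
    return strings

plates = ["4PGC938", "2IYE230", "3CIO720", "1ICK750", "1OHV845"]
print(lsd_sort(plates, 7))
# ['1ICK750', '1OHV845', '2IYE230', '3CIO720', '4PGC938']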

MSD Radix Sort

Most-significant-digit-first (MSD) radix sort is a recursive variant of radix sort that can handle strings of different lengths. Like quicksort, MSD radix sort partitions the array into subarrays that can be sorted independently.

Here's an example of MSD radix sort:

Input:  ["she", "sells", "seashells", "by", "the", "sea", "shore", "the", "shells", "she", "sells", "are", "surely", "seashells"]
Output: ["are", "by", "sea", "seashells", "seashells", "sells", "sells", "she", "she", "shells", "shore", "surely", "the", "the"]

The algorithm works as follows:

  1. Partition the array into subarrays based on the first character of each string.
  2. Recursively sort each subarray.

MSD radix sort examines O(n * w) characters in the worst case, but in practice it looks only at the leading characters needed to distinguish the strings. Unlike LSD radix sort, it handles strings of varying lengths naturally, which makes it the usual choice for such inputs.
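
A Python sketch of the recursion follows. Treating the end of a string as a sentinel key of -1 is what lets strings of different lengths sort correctly; the small-subarray insertion-sort cutoff used in production implementations is omitted here, and the extended-ASCII alphabet is an assumption of the sketch:

R = 256  # alphabet size (extended ASCII)

def char_at(s, d):
    # -1 past the end, so shorter strings precede longer ones
    # that share the same prefix.
    return ord(s[d]) if d < len(s) else -1

def msd_sort(a):
    aux = [None] * len(a)
    _sort(a, aux, 0, len(a) - 1, 0)
    return a

def _sort(a, aux, lo, hi, d):
    if hi <= lo:
        return
    count = [0] * (R + 2)
    for i in range(lo, hi + 1):            # count frequencies
        count[char_at(a[i], d) + 2] += 1
    for r in range(R + 1):                 # counts -> indices
        count[r + 1] += count[r]
    for i in range(lo, hi + 1):            # distribute
        c = char_at(a[i], d) + 1
        aux[count[c]] = a[i]
        count[c] += 1
    for i in range(lo, hi + 1):            # copy back
        a[i] = aux[i - lo]
    for r in range(R):                     # sort each class on the next char
        _sort(a, aux, lo + count[r], lo + count[r + 1] - 1, d + 1)

print(msd_sort(["she", "sells", "seashells", "by", "the", "sea", "shore"]))
# ['by', 'sea', 'seashells', 'sells', 'she', 'shore', 'the']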

Tries

A trie (pronounced "try") is a tree-like data structure for storing strings that enables efficient prefix matching and string searches. Each node in a trie represents a character, and the path from the root to a node represents a prefix of the strings stored in the trie.

Here's an example of a trie storing the strings "sea", "sells", "she", "shells", and "shore":

       (root)
          |
          s
        /   \
       e     h
      / \   / \
     a   l e   o
         | |   |
         l l   r
         | |   |
         s l   e
           |
           s

Every key begins with "s", so the root has a single child. Nodes at which a stored string ends are marked in a real implementation; for instance, the "e" on the path s-h-e marks the end of "she" even though "shells" continues below it.

Tries support the following operations:

  • Insert a string into the trie.
  • Search for a string in the trie.
  • Delete a string from the trie.
  • Find all strings with a given prefix.

Insertion, search, and deletion each take O(w) time, where w is the length of the string involved; a prefix query takes time proportional to the length of the prefix plus the size of the result. This makes tries a very efficient data structure for string-related tasks.
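
A compact Python sketch of a trie supporting insertion, search, and prefix queries (deletion is omitted for brevity, and the class and method names are illustrative):

class TrieNode:
    def __init__(self):
        self.children = {}    # character -> child node
        self.is_end = False   # does a stored string end here?

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, s):
        node = self.root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def search(self, s):
        node = self._walk(s)
        return node is not None and node.is_end

    def keys_with_prefix(self, prefix):
        node, results = self._walk(prefix), []
        if node is not None:
            self._collect(node, prefix, results)
        return results

    def _walk(self, s):
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def _collect(self, node, path, results):
        if node.is_end:
            results.append(path)
        for ch, child in node.children.items():
            self._collect(child, path + ch, results)

t = Trie()
for w in ["sea", "sells", "she", "shells", "shore"]:
    t.insert(w)
print(t.search("she"), t.keys_with_prefix("sh"))
# True ['she', 'shells', 'shore']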

Substring Search

Substring search is the problem of finding all occurrences of a pattern string within a larger text string. This is a fundamental problem in string processing, with applications in text editing, bioinformatics, and many other domains.

Brute-Force Search

The simplest approach to substring search is the brute-force algorithm, which checks each possible position in the text for a match with the pattern.

Here's an example of brute-force search:

Text:    "abacadabrabracabracadabrabrabracad"
Pattern: "abracadabra"
Output:  [14]

The brute-force algorithm has a worst-case running time of O(n * m), where n is the length of the text and m is the length of the pattern. While simple to implement, this can be inefficient for large texts and patterns.
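
A straightforward Python sketch, which makes the nested-loop structure behind the O(n * m) bound explicit (indices are 0-based):

def brute_force_search(text, pattern):
    """Return the 0-based start index of every occurrence of pattern."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):             # every candidate alignment
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                         # all m characters matched
            matches.append(i)
    return matches

print(brute_force_search("abacadabrabracabracadabrabrabracad", "abracadabra"))
# [14]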

Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt (KMP) algorithm is an efficient substring search algorithm that uses a precomputed "failure function" to avoid unnecessary comparisons.

The failure function records, for each position in the pattern, the length of the longest proper prefix of the portion matched so far that is also a suffix of that portion. On a mismatch, this lets us shift the pattern forward by the right amount without ever backing up in the text.

Here's an example of the KMP algorithm:

Text:    "abacadabrabracabracadabrabrabracad"
Pattern: "abracadabra"
Output:  [14]

The KMP algorithm has a running time of O(n + m), making it much more efficient than the brute-force approach for large texts and patterns.
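
A Python sketch of KMP; the fail array below is the failure function described above, and notice that the text index i only ever moves forward:

def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    # fail[j]: length of the longest proper prefix of pattern[:j+1]
    # that is also a suffix of pattern[:j+1].
    fail = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    matches, k = [], 0
    for i in range(n):                     # i never decreases
        while k > 0 and text[i] != pattern[k]:
            k = fail[k - 1]                # slide the pattern, not the text
        if text[i] == pattern[k]:
            k += 1
        if k == m:                         # full match ends at i
            matches.append(i - m + 1)
            k = fail[k - 1]
    return matches

print(kmp_search("abacadabrabracabracadabrabrabracad", "abracadabra"))
# [14]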

Boyer-Moore Algorithm

The Boyer-Moore algorithm is another efficient substring search algorithm that works by preprocessing the pattern string. Unlike KMP, which starts comparisons from the beginning of the pattern, Boyer-Moore starts from the end and works backwards.

The key idea behind Boyer-Moore is to use two precomputed functions: the "good suffix" function and the "bad character" function. These functions allow us to skip ahead in the text when we find a mismatch, similar to KMP.

Here's an example of the Boyer-Moore algorithm:

Text:    "abacadabrabracabracadabrabrabracad"
Pattern: "abracadabra"
Output:  [14]

Boyer-Moore has a best-case running time of O(n/m) and a worst-case running time of O(n * m), but in practice, it is often the fastest substring search algorithm for large alphabets.
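
The sketch below implements only the bad-character heuristic, as many textbook presentations do; adding the good-suffix rule improves the worst case but complicates the code:

def boyer_moore_search(text, pattern):
    n, m = len(text), len(pattern)
    # right[c]: index of the rightmost occurrence of c in the pattern.
    right = {c: j for j, c in enumerate(pattern)}
    matches = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1                         # compare right to left
        if j < 0:
            matches.append(i)              # full match at alignment i
            i += 1
        else:
            # Line up the mismatched text character with its rightmost
            # occurrence in the pattern (or skip past it entirely).
            i += max(1, j - right.get(text[i + j], -1))
    return matches

print(boyer_moore_search("abacadabrabracabracadabrabrabracad", "abracadabra"))
# [14]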

Regular Expressions

Regular expressions are a powerful tool for describing patterns in strings. They provide a concise and flexible notation for matching, searching, and manipulating text.

A regular expression is a sequence of characters that define a search pattern. The simplest form of a regular expression is a literal string, such as "hello", which matches exactly the string "hello". However, regular expressions also include special characters and constructs that allow for more complex patterns:

  • . (dot) matches any single character.
  • * (asterisk) matches zero or more occurrences of the preceding character or group.
  • + (plus) matches one or more occurrences of the preceding character or group.
  • ? (question mark) matches zero or one occurrence of the preceding character or group.
  • ^ (caret) matches the start of a line.
  • $ (dollar) matches the end of a line.
  • [] (square brackets) define a character class, matching any single character within the brackets.

Here are some examples of regular expressions and the patterns they match:

  • a.b matches any three-character string that starts with "a" and ends with "b", such as "acb", "a5b", or "a b".
  • a*b matches any string that consists of zero or more "a" characters followed by a single "b", such as "b", "ab", "aab", or "aaab".
  • a+b matches any string that consists of one or more "a" characters followed by a single "b", such as "ab", "aab", or "aaab", but not "b".
  • a?b matches either "ab" or "b".
  • ^a matches any string that starts with "a".
  • a$ matches any string that ends with "a".
  • [aeiou] matches any single vowel character.

Regular expressions are supported by many programming languages and tools, including Java, Python, and Unix utilities like grep and sed. They are widely used for tasks such as data validation, text processing, and log analysis.
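
For instance, Python's built-in re module supports all of the constructs listed above. A few of the patterns, run directly (a quick demonstration, not a tutorial on re):

import re

print(re.findall(r"a.b", "acb a5b a b xb"))   # ['acb', 'a5b', 'a b']
print(re.findall(r"a+b", "b ab aab aaab"))    # ['ab', 'aab', 'aaab']
print(re.findall(r"a?b", "ab b"))             # ['ab', 'b']
print(bool(re.search(r"^a", "apple")))        # True: starts with "a"
print(bool(re.search(r"a$", "banana")))       # True: ends with "a"
print(re.findall(r"[aeiou]", "shore"))        # ['o', 'e']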

Data Compression

Data compression is the process of encoding information using fewer bits than the original representation. This is useful for reducing storage requirements and transmission times. There are two main types of compression: lossless and lossy. Lossless compression allows the original data to be perfectly reconstructed from the compressed data, while lossy compression allows for some loss of information in exchange for greater compression ratios.

Run-Length Encoding

Run-length encoding (RLE) is a simple lossless compression technique that replaces repeated sequences of identical symbols (runs) with a single instance of the symbol and a count of the number of times it is repeated.

Here's an example of RLE:

Input:  "AAAABBBCCCDDDEEE"
Output: "4A3B3C3D3E"

RLE is effective for data with many long runs, such as simple graphic images. However, it can actually increase the size of the data if there are few or no runs.
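
A minimal Python sketch of the encoder. The count-then-character text format here is just for illustration; real RLE formats are binary and must handle runs longer than nine:

def rle_encode(s):
    """Encode each run of identical characters as count + character."""
    if not s:
        return ""
    out, run_char, run_len = [], s[0], 1
    for ch in s[1:]:
        if ch == run_char:
            run_len += 1                   # extend the current run
        else:
            out.append(f"{run_len}{run_char}")
            run_char, run_len = ch, 1      # start a new run
    out.append(f"{run_len}{run_char}")     # flush the final run
    return "".join(out)

print(rle_encode("AAAABBBCCCDDDEEE"))      # 4A3B3C3D3E
print(rle_encode("ABC"))                   # 1A1B1C -- longer than the input!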

Huffman Coding

Huffman coding is a lossless compression algorithm that assigns variable-length codes to symbols based on their frequencies in the input data. More frequent symbols are assigned shorter codes, while less frequent symbols are assigned longer codes.

The algorithm works by building a binary tree, called a Huffman tree, from the bottom up. Each leaf node represents a symbol and its frequency, while each internal node represents the sum of the frequencies of its children. The tree is constructed by repeatedly merging the two nodes with the lowest frequencies until only one node remains.

Once the tree is built, the code for each symbol is determined by the path from the root to the corresponding leaf node, with a left branch representing a "0" and a right branch representing a "1".

Here's an example of Huffman coding:

Input:  "AAAABBBCCCDDDEEE"
Output: "000010011001011100101"

Huffman tree:
      (15)
     /    \
   (7)    (8)
   / \    / \
 (4) (3) (3) (5)
 /\  /\  /\  /\
A B  C D E

In this example, "A" is assigned the code "00", "B" is assigned "01", "C" is assigned "10", "D" is assigned "110", and "E" is assigned "111". The compressed output is obtained by replacing each symbol in the input with its corresponding code.

Huffman coding produces an optimal prefix code: no codeword is a prefix of any other, which allows unambiguous decoding, and no other prefix code achieves a shorter total encoding for the given symbol frequencies. Huffman coding is widely used as a component of file formats and compression tools such as JPEG, MP3, and ZIP.
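
The merge loop translates almost directly into Python using a priority queue. A sketch; because of tie-breaking in the heap, the exact codes may differ from the example above, but the total encoded length is the same:

import heapq
from collections import Counter

def huffman_codes(text):
    """Return a {symbol: code} table for the given (nonempty) text."""
    freq = Counter(text)
    # Heap entries: (frequency, tiebreaker, subtree). A subtree is a
    # symbol (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:                   # merge the two lightest nodes
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def assign(node, code):
        if isinstance(node, tuple):        # internal: recurse on children
            assign(node[0], code + "0")    # left branch  -> 0
            assign(node[1], code + "1")    # right branch -> 1
        else:
            codes[node] = code or "0"      # leaf: record the code
    assign(heap[0][2], "")
    return codes

codes = huffman_codes("AAAABBBCCCDDDEEE")
encoded = "".join(codes[c] for c in "AAAABBBCCCDDDEEE")
print(codes, len(encoded))                 # 38 bits either way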

LZW Compression

Lempel-Ziv-Welch (LZW) compression is a dictionary-based compression algorithm that builds a dictionary (or codebook) of strings while compressing the input. LZW is widely used in file compression utilities and was used in the GIF image format.

The key idea behind LZW is to replace strings of characters with single codes. It reads the input character by character, replacing progressively longer strings with fixed-length codes as those strings are added to the dictionary. The longer the string replaced, the more space is saved by encoding it as a single number.

Here's a step-by-step description of how LZW compression works:

  1. Initialize the dictionary to contain all single-character strings.
  2. Find the longest string W in the dictionary that matches the current input.
  3. Emit the dictionary index for W to output and remove W from the input.
  4. Add W followed by the next symbol in the input to the dictionary.
  5. Go to Step 2.

Let's consider an example. Suppose we want to compress the string "ABABABABA" using LZW.

  1. Initialize the dictionary to contain "A" and "B".
  2. The longest match is "A". Emit its index (0) and remove it from the input. The dictionary now contains "A", "B", and "AB".
  3. The longest match is "B". Emit its index (1) and remove it from the input. The dictionary now contains "A", "B", "AB", and "BA".
  4. The longest match is "AB". Emit its index (2) and remove it from the input. The dictionary now contains "A", "B", "AB", "BA", and "ABA".
  5. The longest match is "ABA". Emit its index (4) and remove it from the input. The dictionary now contains "A", "B", "AB", "BA", "ABA", and "ABAB".
  6. The longest match is "BA". Emit its index (3). The input is now empty.

The compressed representation of "ABABABABA" is thus the sequence of indices 0, 1, 2, 4, 3, which requires fewer bits to represent than the original ASCII representation.
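
The compression loop is short enough to show in full. This Python sketch seeds the dictionary with just "A" and "B" to match the example; a real implementation would start with all 256 byte values:

def lzw_compress(text):
    dictionary = {"A": 0, "B": 1}          # example alphabet only
    w, output = "", []
    for ch in text:
        if w + ch in dictionary:
            w += ch                        # keep extending the match
        else:
            output.append(dictionary[w])   # emit code for longest match
            dictionary[w + ch] = len(dictionary)  # learn a new string
            w = ch
    if w:
        output.append(dictionary[w])       # flush the final match
    return output

print(lzw_compress("ABABABABA"))           # [0, 1, 2, 4, 3]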

Decompression works similarly, but in reverse:

  1. Initialize the dictionary to contain all single-character strings.
  2. Read a code X from the input.
  3. Output the string for X from the dictionary. (In one special case, X is the code about to be added to the dictionary; the decoder reconstructs its string as the previous string plus that string's own first character, as shown in the sketch below.)
  4. If there is a previous string, add it, concatenated with the first character of the string for X, to the dictionary.
  5. Go to Step 2.
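
A matching decompression sketch, assuming the same two-symbol starting dictionary as the compressor above:

def lzw_decompress(codes):
    dictionary = {0: "A", 1: "B"}          # must match the compressor
    prev = dictionary[codes[0]]
    output = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            entry = prev + prev[0]         # the special case from step 3
        output.append(entry)
        # Step 4: add previous string + first character of the current one.
        dictionary[len(dictionary)] = prev + entry[0]
        prev = entry
    return "".join(output)

print(lzw_decompress([0, 1, 2, 4, 3]))     # ABABABABA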

LZW compression is simple and fast, making it a good choice for many applications. However, it does have some limitations. The dictionary can grow large, consuming a significant amount of memory, so practical implementations cap its size and either freeze or reset it once it fills, which can hurt compression when the character of the input changes. LZW also does little for very small files, since the dictionary never has a chance to accumulate long repeated strings.

Despite these limitations, LZW remains a popular and effective compression algorithm, particularly for applications where speed is more important than achieving the highest possible compression ratios.

Conclusion

In this chapter, we explored several important string-processing algorithms, including string sorts, tries, substring search, regular expressions, and data compression. These algorithms form the basis for many real-world applications and are essential tools for any programmer working with textual data.

We began by discussing string sorts, which are optimized sorting algorithms that take advantage of the special properties of strings. Key-indexed counting, LSD radix sort, and MSD radix sort provide efficient methods for sorting strings based on their individual characters.

Next, we examined tries, a tree-like data structure for storing and retrieving strings. Tries enable fast prefix matching and are commonly used in applications such as autocomplete and IP routing tables.

Substring search algorithms, such as the Knuth-Morris-Pratt and Boyer-Moore algorithms, allow us to efficiently search for patterns within larger strings. These algorithms have numerous applications in text editing, computational biology, and information retrieval.

Regular expressions provide a powerful and flexible way to describe string patterns. We discussed the basic syntax of regular expressions and how they can be used for pattern matching and string manipulation in various programming languages and tools.

Finally, we explored data compression algorithms, which reduce the size of data by exploiting redundancy and patterns within the input. We covered run-length encoding, Huffman coding, and Lempel-Ziv-Welch compression, each of which has its own strengths and applications.

Understanding these string-processing algorithms and data structures is crucial for anyone working with textual data. As the amount of unstructured data continues to grow, the ability to efficiently manipulate, search, and compress strings will only become more valuable. By mastering the techniques covered in this chapter, you will be well-equipped to tackle a wide range of string-processing challenges in your own projects and applications.