Longest Common Subsequence


The Longest Common Subsequence (LCS) problem is one where you're trying to find the longest sequence in common between two sequences. In the basic form of the problem, the sequence doesn't have to be contiguous. For example, the LCS of the character sequences (strings) "pine" and "springtime" is "pine" even though the string "pine" doesn't appear contiguously within the word "springtime".

Brute Force Approach

A brute force algorithm for finding the LCS of two sequences X and Y involves generating each subsequence in X and checking if it is a subsequence of Y. This clearly would take exponential time in the length of X.

Dynamic Programming Approach

Another approach involves Dynamic Programming. You start by characterizing the structure of the optimal solution. You might discover that the LCS of two sequences contains as a prefix an LCS of prefixes of the sequences. Here's how you can formulate that idea recursively:

Let X and Y be the two strings we want to find the length of the LCS of, and Xi be the substring <X0,X1,...,Xi> and Yj be the substring <Y0,Y1,...,Yj>. Xi and Yj are called prefixes of the strings X and Y. Further, let c[i,j] be the length of the optimal LCS between the strings Xi and Yj. Then we can have the recursive relation:


c[i,j] = 
\begin{cases}
0 & \text{if $i=0$ or $j=0$}\\
c[i-1][j-1]+1 & \text{if $i,j>0$ and $x_i = y_i$}\\
max(c[i,j-1],c[i-1,j]) & \text{if $i,j>0$ and $x_i \neq y_j$}
\end{cases}

Doing this recursion the naive way in C++ isn't hard but that can easily take exponential time. Instead, we can use dynamic programming because there are only m*n distinct subproblems, and compute the values in a bottom up fashion. In other words, you set up a table pre-loaded with the trivial solution (X has length 0, Y has length 0). Each entry can be calculated depending only on the neighbors on its top, left, and top-left, as shown in the recurrence relation. For example, (2,3) depends on (1,3) (2,2) and (1,2). Fill in the table, and you get the length of the longest common substring in c[m, n].

As you might notice in the recursive relation, case 2 (where we use the table entry c[i-1, j-1]) represents every time we've extended the LCS by one character. As you fill in the table, you may want to mark which table entry was used ("backpointers" is the term used in many textbooks for these markings). After finding the length of the LCS, you can use these backpointers to trace backwards from c[m][n] and print out the LCS.
Computer Science @ George Mason University