Longest Common Subsequence (LCS)
The Longest Common Subsequence (LCS) is one of the most fundamental problems in dynamic programming, with applications ranging from DNA sequence analysis to version control systems. Understanding LCS deeply provides the foundation for solving many string-based optimization problems.
Understanding Longest Common Subsequence
Subsequence vs Substring: A Critical Distinction
Before diving into the problem, let's understand the fundamental difference:
Substring: Characters must be contiguous (adjacent in the original string)
- Example: In "ABCDE", "ABC", "BCD", "CDE" are substrings
- "ACE" is NOT a substring (characters not adjacent)
Subsequence: Characters maintain relative order but need not be contiguous
- Example: In "ABCDE", "ACE", "ABE", "BCE" are all valid subsequences
- Order matters: "CEA" is NOT a subsequence (C comes before E before A)
The flexibility of subsequences (not requiring contiguity) makes LCS more powerful than substring matching. This is why diff algorithms use LCS - they can find similarities even when content is rearranged.
The LCS Problem Definition
Given: Two sequences X = x₁x₂...xₘ and Y = y₁y₂...yₙ
Find: The longest sequence Z that is a subsequence of both X and Y
Key Insight: If we know the LCS of all prefixes, we can build the LCS of the full strings.
Building Intuition: Small Examples
Let's start with simple cases to understand the pattern:
Example 1: No Common Subsequence
X = "ABC"
Y = "DEF"
LCS = "" (empty string, length = 0)
Example 2: One Character Match
X = "ABC"
Y = "DAE"
LCS = "A" (length = 1)
Example 3: Multiple Matches, Different Orders
X = "ABCD"
Y = "ACBD"
LCS = "ABD" or "ACD" (both length = 3)
Notice: Both maintain relative order from original strings
Example 4: Identical Strings
X = "HELLO"
Y = "HELLO"
LCS = "HELLO" (length = 5)
The Recurrence Relation: Understanding the "Why"
The LCS problem has a beautiful recursive structure:
LCS(X[0..i], Y[0..j]) =
if i = 0 or j = 0:
0 (base case: empty string has no common subsequence)
else if X[i] = Y[j]:
1 + LCS(X[0..i-1], Y[0..j-1]) (characters match: include and move both pointers)
else:
max(LCS(X[0..i-1], Y[0..j]), (skip character from X)
LCS(X[0..i], Y[0..j-1])) (skip character from Y)
Why do we take the maximum when characters don't match?
This is the crucial insight: When X[i] ≠ Y[j], we have two possibilities:
- The LCS doesn't include X[i] → Look at LCS(X[0..i-1], Y[0..j])
- The LCS doesn't include Y[j] → Look at LCS(X[0..i], Y[0..j-1])
We don't know which character to skip, so we try both and take the maximum!
When characters don't match, at least one of them is NOT in the optimal LCS. We explore both possibilities and choose the better one. This is the "optimal substructure" property of DP.
DP Table Structure
We build a 2D table dp[m+1][n+1] where:
dp[i][j]= length of LCS of X[0..i-1] and Y[0..j-1]- First row and column are 0 (base case: empty string)
- We fill the table row by row, left to right
Visual Structure:
"" Y[0] Y[1] Y[2] ...
"" 0 0 0 0 ...
X[0] 0 ? ? ? ...
X[1] 0 ? ? ? ...
X[2] 0 ? ? ? ...
. . . . .
. . . . .
Complete Worked Example with Traceback
Let's work through a detailed example that demonstrates every aspect of the algorithm.
Example: DNA Sequence Comparison
Input Sequences:
X = "AGGTAB" (length = 6)
Y = "GXTXAYB" (length = 7)
Goal: Find the longest common subsequence
Step 1: Initialize the DP Table
First, create the table with base cases (first row and column = 0):
"" G X T X A Y B
"" 0 0 0 0 0 0 0 0
A 0
G 0
G 0
T 0
A 0
B 0
Step 2: Fill the DP Table
Let's fill this systematically, showing the reasoning for each cell.
For X[0]='A', Y[0]='G': A ≠ G
dp[1][1] = max(dp[0][1], dp[1][0]) = max(0, 0) = 0
For X[0]='A', Y[1]='X': A ≠ X
dp[1][2] = max(dp[0][2], dp[1][1]) = max(0, 0) = 0
For X[0]='A', Y[2]='T': A ≠ T
dp[1][3] = max(dp[0][3], dp[1][2]) = max(0, 0) = 0
For X[0]='A', Y[3]='X': A ≠ X
dp[1][4] = max(dp[0][4], dp[1][3]) = max(0, 0) = 0
For X[0]='A', Y[4]='A': A = A ✓ (MATCH!)
dp[1][5] = 1 + dp[0][4] = 1 + 0 = 1
For X[0]='A', Y[5]='Y': A ≠ Y
dp[1][6] = max(dp[0][6], dp[1][5]) = max(0, 1) = 1
For X[0]='A', Y[6]='B': A ≠ B
dp[1][7] = max(dp[0][7], dp[1][6]) = max(0, 1) = 1
After filling all rows, complete DP table:
"" G X T X A Y B
"" 0 0 0 0 0 0 0 0
A 0 0 0 0 0 1 1 1
G 0 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1
T 0 1 1 2 2 2 2 2
A 0 1 1 2 2 3 3 3
B 0 1 1 2 2 3 3 4
Each cell dp[i][j] tells us: "The longest common subsequence between X[0..i-1] and Y[0..j-1] has this length."
For example, dp[4][3] = 2 means LCS of "AGGT" and "GXT" has length 2.
Step 3: Traceback to Construct the LCS
The length is dp[6][7] = 4, but what is the actual subsequence?
Traceback Algorithm:
Start at dp[m][n] and work backwards:
- If X[i-1] = Y[j-1]: This character is in LCS → Add to result, move diagonally (i--, j--)
- Else if dp[i-1][j] > dp[i][j-1]: Move up (i--)
- Else: Move left (j--)
Detailed Traceback with Arrows:
"" G X T X A Y B
"" 0 0 0 0 0 0 0 0
A 0 0 0 0 0 1← 1← 1
↑
G 0 1 1→ 1→ 1→ 1 1← 1
↖
G 0 1 1 1 1 1 1← 1
↑
T 0 1 1 2 2→ 2→ 2← 2
↖ ↑
A 0 1 1 2 2 3 3→ 3
↖ ↑
B 0 1 1 2 2 3 3 4*
↖
Step-by-step traceback:
-
Start at dp[6][7] = 4, compare X[5]='B' and Y[6]='B'
- MATCH! ✓ LCS includes 'B'
- Move diagonally to dp[5][6] = 3
- Current LCS: "B"
-
At dp[5][6] = 3, compare X[4]='A' and Y[5]='Y'
- No match. dp[4][6] = 3 = dp[5][5] = 3 (equal, choose either)
- Move left to dp[5][5] = 3
-
At dp[5][5] = 3, compare X[4]='A' and Y[4]='A'
- MATCH! ✓ LCS includes 'A'
- Move diagonally to dp[4][4] = 2
- Current LCS: "AB"
-
At dp[4][4] = 2, compare X[3]='T' and Y[3]='X'
- No match. dp[3][4] = 1 < dp[4][3] = 2
- Move left to dp[4][3] = 2
-
At dp[4][3] = 2, compare X[3]='T' and Y[2]='T'
- MATCH! ✓ LCS includes 'T'
- Move diagonally to dp[3][2] = 1
- Current LCS: "TAB"
-
At dp[3][2] = 1, compare X[2]='G' and Y[1]='X'
- No match. dp[2][2] = 1 = dp[3][1] = 1
- Move up to dp[2][2] = 1
-
At dp[2][2] = 1, compare X[1]='G' and Y[1]='X'
- No match. dp[1][2] = 0 < dp[2][1] = 1
- Move up to dp[2][1] = 1
-
At dp[2][1] = 1, compare X[1]='G' and Y[0]='G'
- MATCH! ✓ LCS includes 'G'
- Move diagonally to dp[1][0] = 0
- Current LCS: "GTAB"
-
At dp[1][0] = 0, we've reached the boundary. Done!
Final LCS (reversed): "GTAB"
We build the LCS backwards during traceback, so remember to reverse it! The sequence "GTAB" maintains relative order in both original strings:
- X = "AGGTAB": G (pos 1 or 2), T (pos 3), A (pos 4), B (pos 5) ✓
- Y = "GXTXAYB": G (pos 0), T (pos 2), A (pos 4), B (pos 6) ✓
Verification
Let's verify "GTAB" is indeed a subsequence of both:
X = "AGGTAB"
- G appears at index 1 (or 2) ✓
- T appears at index 3 ✓
- A appears at index 4 ✓
- B appears at index 5 ✓
Y = "GXTXAYB"
- G appears at index 0 ✓
- T appears at index 2 ✓
- A appears at index 4 ✓
- B appears at index 6 ✓
Both maintain relative order, and length = 4 matches dp[6][7] = 4 ✓
LCS Applications and Problem Variations
Real-World Applications
1. Diff Algorithms and Version Control
Git and other version control systems use LCS to compute minimal differences between files.
Original: "ABCDEFG"
Modified: "ABXCDEYG"
LCS = "ABCDEG" (length 6)
Changes needed:
- Insert 'X' after B
- Insert 'Y' after E
The diff is computed as: insertions + deletions = m + n - 2*LCS
- Lines in LCS = unchanged lines (shown without markers)
- Lines in original but not LCS = deletions (shown with -)
- Lines in modified but not LCS = insertions (shown with +)
2. DNA Sequence Alignment
Biologists use LCS to find similarities between DNA sequences:
Sequence 1: "AGGTAB"
Sequence 2: "GXTXAYB"
Common DNA: "GTAB"
This reveals shared genetic material, suggesting:
- Common evolutionary origin
- Functional similarities
- Potential gene relationships
3. Plagiarism Detection
LCS helps identify copied content:
Document A: "The quick brown fox jumps"
Document B: "The brown fox jumps high"
Common: "The brown fox jumps" (high similarity = potential plagiarism)
LCS length / max(|A|, |B|) = similarity ratio
4. Speech Recognition
Comparing spoken output with expected text:
Expected: "HELLO WORLD"
Recognized: "HELO WORD"
LCS: "HELO WOR"
Error rate = (m + n - 2*LCS) / max(m, n)
Connection to Edit Distance
LCS and Edit Distance are deeply connected:
Edit Distance Formula:
edit_distance = m + n - 2 * LCS
Where:
- m = length of string X
- n = length of string Y
- LCS = length of longest common subsequence
Why this works:
Think of the relationship visually:
- Total characters in both strings: m + n
- Characters in LCS appear in both: subtract LCS twice
- Remaining characters need to be inserted or deleted
Example:
X = "AGGTAB" (m = 6)
Y = "GXTXAYB" (n = 7)
LCS = "GTAB" (length = 4)
Edit Distance = 6 + 7 - 2(4) = 13 - 8 = 5
Verification:
- Delete: A (from X)
- Delete: G (from X)
- Insert: X (after G in result)
- Insert: X (after T in result)
- Insert: Y (after A in result)
Total: 5 operations ✓
- LCS: Maximizes commonality (what stays)
- Edit Distance: Minimizes changes (what's removed/added)
- Together they tell the complete story of string similarity
Implementation
Java Implementation with Traceback
public class LongestCommonSubsequence {
/**
* Compute LCS length using dynamic programming
* Time Complexity: O(m * n)
* Space Complexity: O(m * n)
*/
public static int lcsLength(String X, String Y) {
int m = X.length();
int n = Y.length();
// Create DP table
int[][] dp = new int[m + 1][n + 1];
// Fill the DP table
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
// Characters match: include in LCS
dp[i][j] = 1 + dp[i - 1][j - 1];
} else {
// Characters don't match: take maximum
dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
}
}
}
return dp[m][n];
}
/**
* Compute LCS and return the actual subsequence
* Time Complexity: O(m * n)
* Space Complexity: O(m * n)
*/
public static String lcs(String X, String Y) {
int m = X.length();
int n = Y.length();
// Build DP table
int[][] dp = new int[m + 1][n + 1];
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
dp[i][j] = 1 + dp[i - 1][j - 1];
} else {
dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
}
}
}
// Traceback to construct LCS
StringBuilder lcs = new StringBuilder();
int i = m, j = n;
while (i > 0 && j > 0) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
// This character is part of LCS
lcs.append(X.charAt(i - 1));
i--;
j--;
} else if (dp[i - 1][j] > dp[i][j - 1]) {
// Move up
i--;
} else {
// Move left
j--;
}
}
// Reverse because we built it backwards
return lcs.reverse().toString();
}
/**
* Space-optimized version (only stores length, not string)
* Space Complexity: O(min(m, n))
*/
public static int lcsLengthOptimized(String X, String Y) {
// Ensure X is the shorter string to minimize space
if (X.length() > Y.length()) {
String temp = X;
X = Y;
Y = temp;
}
int m = X.length();
int n = Y.length();
// Only need two rows
int[] prev = new int[m + 1];
int[] curr = new int[m + 1];
for (int j = 1; j <= n; j++) {
for (int i = 1; i <= m; i++) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
curr[i] = 1 + prev[i - 1];
} else {
curr[i] = Math.max(prev[i], curr[i - 1]);
}
}
// Swap rows
int[] temp = prev;
prev = curr;
curr = temp;
}
return prev[m];
}
/**
* Compute edit distance using LCS
*/
public static int editDistance(String X, String Y) {
int lcsLen = lcsLength(X, Y);
return X.length() + Y.length() - 2 * lcsLen;
}
// Example usage
public static void main(String[] args) {
String X = "AGGTAB";
String Y = "GXTXAYB";
System.out.println("String X: " + X);
System.out.println("String Y: " + Y);
System.out.println("LCS Length: " + lcsLength(X, Y));
System.out.println("LCS String: " + lcs(X, Y));
System.out.println("Edit Distance: " + editDistance(X, Y));
// Edge cases
System.out.println("\nEdge Cases:");
System.out.println("Empty strings: " + lcs("", "ABC"));
System.out.println("Identical strings: " + lcs("HELLO", "HELLO"));
System.out.println("No common: " + lcs("ABC", "DEF"));
}
}
Problem Variations
1. Shortest Common Supersequence (SCS)
Problem: Find the shortest string that has both X and Y as subsequences.
Solution: SCS_length = m + n - LCS_length
Example:
X = "AGGTAB"
Y = "GXTXAYB"
LCS = "GTAB" (length 4)
SCS length = 6 + 7 - 4 = 9
SCS string = "AGGXTXAYB"
Verification:
- "AGGTAB" is a subsequence of "AGGXTXAYB" ✓
- "GXTXAYB" is a subsequence of "AGGXTXAYB" ✓
Implementation:
public static String shortestCommonSupersequence(String X, String Y) {
int m = X.length();
int n = Y.length();
// Build DP table for LCS
int[][] dp = new int[m + 1][n + 1];
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
dp[i][j] = 1 + dp[i - 1][j - 1];
} else {
dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
}
}
}
// Build SCS using traceback
StringBuilder scs = new StringBuilder();
int i = m, j = n;
while (i > 0 && j > 0) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
// Common character: add once
scs.append(X.charAt(i - 1));
i--;
j--;
} else if (dp[i - 1][j] > dp[i][j - 1]) {
// X[i-1] not in LCS: add from X
scs.append(X.charAt(i - 1));
i--;
} else {
// Y[j-1] not in LCS: add from Y
scs.append(Y.charAt(j - 1));
j--;
}
}
// Add remaining characters
while (i > 0) {
scs.append(X.charAt(i - 1));
i--;
}
while (j > 0) {
scs.append(Y.charAt(j - 1));
j--;
}
return scs.reverse().toString();
}
2. Minimum Delete Operations to Make Strings Equal
Problem: Given two strings, find minimum deletions needed to make them equal.
Solution: deletions = m + n - 2 * LCS_length
Example:
X = "sea"
Y = "eat"
LCS = "ea" (length 2)
Deletions = 3 + 3 - 2(2) = 2
- Delete 's' from X
- Delete 't' from Y
Result: Both become "ea"
Implementation:
public static int minDeleteOperations(String X, String Y) {
int lcsLen = lcsLength(X, Y);
return X.length() + Y.length() - 2 * lcsLen;
}
3. Longest Palindromic Subsequence
Problem: Find the longest palindromic subsequence in a string.
Solution: LPS(X) = LCS(X, reverse(X))
Example:
X = "BBABCBCAB"
Y = reverse(X) = "BACBCBABB"
LCS = "BABCBAB" (length 7)
This is the longest palindromic subsequence
Implementation:
public static String longestPalindromicSubsequence(String X) {
String Y = new StringBuilder(X).reverse().toString();
return lcs(X, Y);
}
4. Print All LCS
Problem: There can be multiple LCS of the same length. Print all of them.
Example:
X = "ABCD"
Y = "ACBD"
All LCS of length 3:
- "ABD"
- "ACD"
Implementation:
public static Set<String> allLCS(String X, String Y) {
int m = X.length();
int n = Y.length();
// Build DP table
int[][] dp = new int[m + 1][n + 1];
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
dp[i][j] = 1 + dp[i - 1][j - 1];
} else {
dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
}
}
}
// Find all LCS using backtracking
Set<String> result = new HashSet<>();
findAllLCS(X, Y, m, n, dp, "", result);
return result;
}
private static void findAllLCS(String X, String Y, int i, int j,
int[][] dp, String current, Set<String> result) {
// Base case
if (i == 0 || j == 0) {
result.add(new StringBuilder(current).reverse().toString());
return;
}
// If characters match
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
findAllLCS(X, Y, i - 1, j - 1, dp,
current + X.charAt(i - 1), result);
} else {
// Explore all directions that lead to optimal solution
if (dp[i - 1][j] == dp[i][j]) {
findAllLCS(X, Y, i - 1, j, dp, current, result);
}
if (dp[i][j - 1] == dp[i][j]) {
findAllLCS(X, Y, i, j - 1, dp, current, result);
}
}
}
Practice Problems
Apply your understanding of LCS to these problems:
-
Uncrossed Lines (LeetCode 1035): Maximum number of connecting lines without crossing
- This is exactly the LCS problem in disguise!
-
Longest Common Subpath (LeetCode 1923): Find longest common subarray in 2D arrays
- Extend LCS concept to 2D
-
Distinct Subsequences (LeetCode 115): Count distinct occurrences of subsequence
- Variation counting all occurrences, not just longest
-
Minimum ASCII Delete Sum (LeetCode 712): Minimize ASCII sum of deleted characters
- Weighted version of LCS
-
Maximum Length of Repeated Subarray (LeetCode 718): Longest common substring
- Remember: substring (contiguous) vs subsequence (non-contiguous)
- Understand the basic LCS algorithm thoroughly
- Practice reconstructing the LCS string (traceback)
- Solve edit distance using LCS
- Tackle SCS and related variations
- Apply to real-world problems (diff, DNA alignment)
Complexity Analysis
Time Complexity: O(m × n)
- We fill m × n cells in the DP table
- Each cell computation is O(1)
Space Complexity:
- Standard: O(m × n) for the DP table
- Optimized (length only): O(min(m, n)) using rolling array
- With traceback: O(m × n) needed to reconstruct string
Practical Considerations:
- For very long strings (m, n > 10,000), consider:
- Using space-optimized version if only length needed
- Memory-mapped files for extremely large sequences
- Parallel computation for multiple comparisons
- Approximate algorithms for faster results with acceptable accuracy