Skip to main content

Longest Common Subsequence (LCS)

The Longest Common Subsequence (LCS) is one of the most fundamental problems in dynamic programming, with applications ranging from DNA sequence analysis to version control systems. Understanding LCS deeply provides the foundation for solving many string-based optimization problems.

Understanding Longest Common Subsequence

Subsequence vs Substring: A Critical Distinction

Before diving into the problem, let's understand the fundamental difference:

Substring: Characters must be contiguous (adjacent in the original string)

  • Example: In "ABCDE", "ABC", "BCD", "CDE" are substrings
  • "ACE" is NOT a substring (characters not adjacent)

Subsequence: Characters maintain relative order but need not be contiguous

  • Example: In "ABCDE", "ACE", "ABE", "BCE" are all valid subsequences
  • Order matters: "CEA" is NOT a subsequence (C comes before E before A)
Why This Matters

The flexibility of subsequences (not requiring contiguity) makes LCS more powerful than substring matching. This is why diff algorithms use LCS - they can find similarities even when content is rearranged.

The LCS Problem Definition

Given: Two sequences X = x₁x₂...xₘ and Y = y₁y₂...yₙ

Find: The longest sequence Z that is a subsequence of both X and Y

Key Insight: If we know the LCS of all prefixes, we can build the LCS of the full strings.

Building Intuition: Small Examples

Let's start with simple cases to understand the pattern:

Example 1: No Common Subsequence

X = "ABC"
Y = "DEF"
LCS = "" (empty string, length = 0)

Example 2: One Character Match

X = "ABC"
Y = "DAE"
LCS = "A" (length = 1)

Example 3: Multiple Matches, Different Orders

X = "ABCD"
Y = "ACBD"
LCS = "ABD" or "ACD" (both length = 3)
Notice: Both maintain relative order from original strings

Example 4: Identical Strings

X = "HELLO"
Y = "HELLO"
LCS = "HELLO" (length = 5)

The Recurrence Relation: Understanding the "Why"

The LCS problem has a beautiful recursive structure:

LCS(X[0..i], Y[0..j]) =
if i = 0 or j = 0:
0 (base case: empty string has no common subsequence)

else if X[i] = Y[j]:
1 + LCS(X[0..i-1], Y[0..j-1]) (characters match: include and move both pointers)

else:
max(LCS(X[0..i-1], Y[0..j]), (skip character from X)
LCS(X[0..i], Y[0..j-1])) (skip character from Y)

Why do we take the maximum when characters don't match?

This is the crucial insight: When X[i] ≠ Y[j], we have two possibilities:

  1. The LCS doesn't include X[i] → Look at LCS(X[0..i-1], Y[0..j])
  2. The LCS doesn't include Y[j] → Look at LCS(X[0..i], Y[0..j-1])

We don't know which character to skip, so we try both and take the maximum!

Key Insight

When characters don't match, at least one of them is NOT in the optimal LCS. We explore both possibilities and choose the better one. This is the "optimal substructure" property of DP.

DP Table Structure

We build a 2D table dp[m+1][n+1] where:

  • dp[i][j] = length of LCS of X[0..i-1] and Y[0..j-1]
  • First row and column are 0 (base case: empty string)
  • We fill the table row by row, left to right

Visual Structure:

       ""  Y[0]  Y[1]  Y[2] ...
"" 0 0 0 0 ...
X[0] 0 ? ? ? ...
X[1] 0 ? ? ? ...
X[2] 0 ? ? ? ...
. . . . .
. . . . .

Complete Worked Example with Traceback

Let's work through a detailed example that demonstrates every aspect of the algorithm.

Example: DNA Sequence Comparison

Input Sequences:

X = "AGGTAB"  (length = 6)
Y = "GXTXAYB" (length = 7)

Goal: Find the longest common subsequence

Step 1: Initialize the DP Table

First, create the table with base cases (first row and column = 0):

      ""  G   X   T   X   A   Y   B
"" 0 0 0 0 0 0 0 0
A 0
G 0
G 0
T 0
A 0
B 0

Step 2: Fill the DP Table

Let's fill this systematically, showing the reasoning for each cell.

For X[0]='A', Y[0]='G': A ≠ G

  • dp[1][1] = max(dp[0][1], dp[1][0]) = max(0, 0) = 0

For X[0]='A', Y[1]='X': A ≠ X

  • dp[1][2] = max(dp[0][2], dp[1][1]) = max(0, 0) = 0

For X[0]='A', Y[2]='T': A ≠ T

  • dp[1][3] = max(dp[0][3], dp[1][2]) = max(0, 0) = 0

For X[0]='A', Y[3]='X': A ≠ X

  • dp[1][4] = max(dp[0][4], dp[1][3]) = max(0, 0) = 0

For X[0]='A', Y[4]='A': A = A ✓ (MATCH!)

  • dp[1][5] = 1 + dp[0][4] = 1 + 0 = 1

For X[0]='A', Y[5]='Y': A ≠ Y

  • dp[1][6] = max(dp[0][6], dp[1][5]) = max(0, 1) = 1

For X[0]='A', Y[6]='B': A ≠ B

  • dp[1][7] = max(dp[0][7], dp[1][6]) = max(0, 1) = 1

After filling all rows, complete DP table:

      ""  G   X   T   X   A   Y   B
"" 0 0 0 0 0 0 0 0
A 0 0 0 0 0 1 1 1
G 0 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1
T 0 1 1 2 2 2 2 2
A 0 1 1 2 2 3 3 3
B 0 1 1 2 2 3 3 4
Understanding the Numbers

Each cell dp[i][j] tells us: "The longest common subsequence between X[0..i-1] and Y[0..j-1] has this length."

For example, dp[4][3] = 2 means LCS of "AGGT" and "GXT" has length 2.

Step 3: Traceback to Construct the LCS

The length is dp[6][7] = 4, but what is the actual subsequence?

Traceback Algorithm: Start at dp[m][n] and work backwards:

  • If X[i-1] = Y[j-1]: This character is in LCS → Add to result, move diagonally (i--, j--)
  • Else if dp[i-1][j] > dp[i][j-1]: Move up (i--)
  • Else: Move left (j--)

Detailed Traceback with Arrows:

      ""  G   X   T   X   A   Y   B
"" 0 0 0 0 0 0 0 0
A 0 0 0 0 0 1← 1← 1

G 0 1 1→ 1→ 1→ 1 1← 1

G 0 1 1 1 1 1 1← 1

T 0 1 1 2 2→ 2→ 2← 2
↖ ↑
A 0 1 1 2 2 3 3→ 3
↖ ↑
B 0 1 1 2 2 3 3 4*

Step-by-step traceback:

  1. Start at dp[6][7] = 4, compare X[5]='B' and Y[6]='B'

    • MATCH! ✓ LCS includes 'B'
    • Move diagonally to dp[5][6] = 3
    • Current LCS: "B"
  2. At dp[5][6] = 3, compare X[4]='A' and Y[5]='Y'

    • No match. dp[4][6] = 3 = dp[5][5] = 3 (equal, choose either)
    • Move left to dp[5][5] = 3
  3. At dp[5][5] = 3, compare X[4]='A' and Y[4]='A'

    • MATCH! ✓ LCS includes 'A'
    • Move diagonally to dp[4][4] = 2
    • Current LCS: "AB"
  4. At dp[4][4] = 2, compare X[3]='T' and Y[3]='X'

    • No match. dp[3][4] = 1 < dp[4][3] = 2
    • Move left to dp[4][3] = 2
  5. At dp[4][3] = 2, compare X[3]='T' and Y[2]='T'

    • MATCH! ✓ LCS includes 'T'
    • Move diagonally to dp[3][2] = 1
    • Current LCS: "TAB"
  6. At dp[3][2] = 1, compare X[2]='G' and Y[1]='X'

    • No match. dp[2][2] = 1 = dp[3][1] = 1
    • Move up to dp[2][2] = 1
  7. At dp[2][2] = 1, compare X[1]='G' and Y[1]='X'

    • No match. dp[1][2] = 0 < dp[2][1] = 1
    • Move up to dp[2][1] = 1
  8. At dp[2][1] = 1, compare X[1]='G' and Y[0]='G'

    • MATCH! ✓ LCS includes 'G'
    • Move diagonally to dp[1][0] = 0
    • Current LCS: "GTAB"
  9. At dp[1][0] = 0, we've reached the boundary. Done!

Final LCS (reversed): "GTAB"

Order Matters

We build the LCS backwards during traceback, so remember to reverse it! The sequence "GTAB" maintains relative order in both original strings:

  • X = "AGGTAB": G (pos 1 or 2), T (pos 3), A (pos 4), B (pos 5) ✓
  • Y = "GXTXAYB": G (pos 0), T (pos 2), A (pos 4), B (pos 6) ✓

Verification

Let's verify "GTAB" is indeed a subsequence of both:

X = "AGGTAB"

  • G appears at index 1 (or 2) ✓
  • T appears at index 3 ✓
  • A appears at index 4 ✓
  • B appears at index 5 ✓

Y = "GXTXAYB"

  • G appears at index 0 ✓
  • T appears at index 2 ✓
  • A appears at index 4 ✓
  • B appears at index 6 ✓

Both maintain relative order, and length = 4 matches dp[6][7] = 4 ✓

LCS Applications and Problem Variations

Real-World Applications

1. Diff Algorithms and Version Control

Git and other version control systems use LCS to compute minimal differences between files.

Original: "ABCDEFG"
Modified: "ABXCDEYG"

LCS = "ABCDEG" (length 6)

Changes needed:
- Insert 'X' after B
- Insert 'Y' after E

The diff is computed as: insertions + deletions = m + n - 2*LCS

Understanding Diff
  • Lines in LCS = unchanged lines (shown without markers)
  • Lines in original but not LCS = deletions (shown with -)
  • Lines in modified but not LCS = insertions (shown with +)

2. DNA Sequence Alignment

Biologists use LCS to find similarities between DNA sequences:

Sequence 1: "AGGTAB"
Sequence 2: "GXTXAYB"
Common DNA: "GTAB"

This reveals shared genetic material, suggesting:
- Common evolutionary origin
- Functional similarities
- Potential gene relationships

3. Plagiarism Detection

LCS helps identify copied content:

Document A: "The quick brown fox jumps"
Document B: "The brown fox jumps high"
Common: "The brown fox jumps" (high similarity = potential plagiarism)

LCS length / max(|A|, |B|) = similarity ratio

4. Speech Recognition

Comparing spoken output with expected text:

Expected: "HELLO WORLD"
Recognized: "HELO WORD"
LCS: "HELO WOR"

Error rate = (m + n - 2*LCS) / max(m, n)

Connection to Edit Distance

LCS and Edit Distance are deeply connected:

Edit Distance Formula:

edit_distance = m + n - 2 * LCS

Where:
- m = length of string X
- n = length of string Y
- LCS = length of longest common subsequence

Why this works:

Think of the relationship visually:

  • Total characters in both strings: m + n
  • Characters in LCS appear in both: subtract LCS twice
  • Remaining characters need to be inserted or deleted

Example:

X = "AGGTAB" (m = 6)
Y = "GXTXAYB" (n = 7)
LCS = "GTAB" (length = 4)

Edit Distance = 6 + 7 - 2(4) = 13 - 8 = 5

Verification:
- Delete: A (from X)
- Delete: G (from X)
- Insert: X (after G in result)
- Insert: X (after T in result)
- Insert: Y (after A in result)
Total: 5 operations ✓
Two Sides of the Same Coin
  • LCS: Maximizes commonality (what stays)
  • Edit Distance: Minimizes changes (what's removed/added)
  • Together they tell the complete story of string similarity

Implementation

Java Implementation with Traceback

public class LongestCommonSubsequence {

/**
* Compute LCS length using dynamic programming
* Time Complexity: O(m * n)
* Space Complexity: O(m * n)
*/
public static int lcsLength(String X, String Y) {
int m = X.length();
int n = Y.length();

// Create DP table
int[][] dp = new int[m + 1][n + 1];

// Fill the DP table
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
// Characters match: include in LCS
dp[i][j] = 1 + dp[i - 1][j - 1];
} else {
// Characters don't match: take maximum
dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
}
}
}

return dp[m][n];
}

/**
* Compute LCS and return the actual subsequence
* Time Complexity: O(m * n)
* Space Complexity: O(m * n)
*/
public static String lcs(String X, String Y) {
int m = X.length();
int n = Y.length();

// Build DP table
int[][] dp = new int[m + 1][n + 1];

for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
dp[i][j] = 1 + dp[i - 1][j - 1];
} else {
dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
}
}
}

// Traceback to construct LCS
StringBuilder lcs = new StringBuilder();
int i = m, j = n;

while (i > 0 && j > 0) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
// This character is part of LCS
lcs.append(X.charAt(i - 1));
i--;
j--;
} else if (dp[i - 1][j] > dp[i][j - 1]) {
// Move up
i--;
} else {
// Move left
j--;
}
}

// Reverse because we built it backwards
return lcs.reverse().toString();
}

/**
* Space-optimized version (only stores length, not string)
* Space Complexity: O(min(m, n))
*/
public static int lcsLengthOptimized(String X, String Y) {
// Ensure X is the shorter string to minimize space
if (X.length() > Y.length()) {
String temp = X;
X = Y;
Y = temp;
}

int m = X.length();
int n = Y.length();

// Only need two rows
int[] prev = new int[m + 1];
int[] curr = new int[m + 1];

for (int j = 1; j <= n; j++) {
for (int i = 1; i <= m; i++) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
curr[i] = 1 + prev[i - 1];
} else {
curr[i] = Math.max(prev[i], curr[i - 1]);
}
}
// Swap rows
int[] temp = prev;
prev = curr;
curr = temp;
}

return prev[m];
}

/**
* Compute edit distance using LCS
*/
public static int editDistance(String X, String Y) {
int lcsLen = lcsLength(X, Y);
return X.length() + Y.length() - 2 * lcsLen;
}

// Example usage
public static void main(String[] args) {
String X = "AGGTAB";
String Y = "GXTXAYB";

System.out.println("String X: " + X);
System.out.println("String Y: " + Y);
System.out.println("LCS Length: " + lcsLength(X, Y));
System.out.println("LCS String: " + lcs(X, Y));
System.out.println("Edit Distance: " + editDistance(X, Y));

// Edge cases
System.out.println("\nEdge Cases:");
System.out.println("Empty strings: " + lcs("", "ABC"));
System.out.println("Identical strings: " + lcs("HELLO", "HELLO"));
System.out.println("No common: " + lcs("ABC", "DEF"));
}
}

Problem Variations

1. Shortest Common Supersequence (SCS)

Problem: Find the shortest string that has both X and Y as subsequences.

Solution: SCS_length = m + n - LCS_length

Example:

X = "AGGTAB"
Y = "GXTXAYB"
LCS = "GTAB" (length 4)

SCS length = 6 + 7 - 4 = 9
SCS string = "AGGXTXAYB"

Verification:
- "AGGTAB" is a subsequence of "AGGXTXAYB" ✓
- "GXTXAYB" is a subsequence of "AGGXTXAYB" ✓

Implementation:

public static String shortestCommonSupersequence(String X, String Y) {
int m = X.length();
int n = Y.length();

// Build DP table for LCS
int[][] dp = new int[m + 1][n + 1];
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
dp[i][j] = 1 + dp[i - 1][j - 1];
} else {
dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
}
}
}

// Build SCS using traceback
StringBuilder scs = new StringBuilder();
int i = m, j = n;

while (i > 0 && j > 0) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
// Common character: add once
scs.append(X.charAt(i - 1));
i--;
j--;
} else if (dp[i - 1][j] > dp[i][j - 1]) {
// X[i-1] not in LCS: add from X
scs.append(X.charAt(i - 1));
i--;
} else {
// Y[j-1] not in LCS: add from Y
scs.append(Y.charAt(j - 1));
j--;
}
}

// Add remaining characters
while (i > 0) {
scs.append(X.charAt(i - 1));
i--;
}
while (j > 0) {
scs.append(Y.charAt(j - 1));
j--;
}

return scs.reverse().toString();
}

2. Minimum Delete Operations to Make Strings Equal

Problem: Given two strings, find minimum deletions needed to make them equal.

Solution: deletions = m + n - 2 * LCS_length

Example:

X = "sea"
Y = "eat"
LCS = "ea" (length 2)

Deletions = 3 + 3 - 2(2) = 2
- Delete 's' from X
- Delete 't' from Y
Result: Both become "ea"

Implementation:

public static int minDeleteOperations(String X, String Y) {
int lcsLen = lcsLength(X, Y);
return X.length() + Y.length() - 2 * lcsLen;
}

3. Longest Palindromic Subsequence

Problem: Find the longest palindromic subsequence in a string.

Solution: LPS(X) = LCS(X, reverse(X))

Example:

X = "BBABCBCAB"
Y = reverse(X) = "BACBCBABB"
LCS = "BABCBAB" (length 7)

This is the longest palindromic subsequence

Implementation:

public static String longestPalindromicSubsequence(String X) {
String Y = new StringBuilder(X).reverse().toString();
return lcs(X, Y);
}

4. Print All LCS

Problem: There can be multiple LCS of the same length. Print all of them.

Example:

X = "ABCD"
Y = "ACBD"

All LCS of length 3:
- "ABD"
- "ACD"

Implementation:

public static Set<String> allLCS(String X, String Y) {
int m = X.length();
int n = Y.length();

// Build DP table
int[][] dp = new int[m + 1][n + 1];
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
dp[i][j] = 1 + dp[i - 1][j - 1];
} else {
dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
}
}
}

// Find all LCS using backtracking
Set<String> result = new HashSet<>();
findAllLCS(X, Y, m, n, dp, "", result);
return result;
}

private static void findAllLCS(String X, String Y, int i, int j,
int[][] dp, String current, Set<String> result) {
// Base case
if (i == 0 || j == 0) {
result.add(new StringBuilder(current).reverse().toString());
return;
}

// If characters match
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
findAllLCS(X, Y, i - 1, j - 1, dp,
current + X.charAt(i - 1), result);
} else {
// Explore all directions that lead to optimal solution
if (dp[i - 1][j] == dp[i][j]) {
findAllLCS(X, Y, i - 1, j, dp, current, result);
}
if (dp[i][j - 1] == dp[i][j]) {
findAllLCS(X, Y, i, j - 1, dp, current, result);
}
}
}

Practice Problems

Apply your understanding of LCS to these problems:

  1. Uncrossed Lines (LeetCode 1035): Maximum number of connecting lines without crossing

    • This is exactly the LCS problem in disguise!
  2. Longest Common Subpath (LeetCode 1923): Find longest common subarray in 2D arrays

    • Extend LCS concept to 2D
  3. Distinct Subsequences (LeetCode 115): Count distinct occurrences of subsequence

    • Variation counting all occurrences, not just longest
  4. Minimum ASCII Delete Sum (LeetCode 712): Minimize ASCII sum of deleted characters

    • Weighted version of LCS
  5. Maximum Length of Repeated Subarray (LeetCode 718): Longest common substring

    • Remember: substring (contiguous) vs subsequence (non-contiguous)
Mastery Path
  1. Understand the basic LCS algorithm thoroughly
  2. Practice reconstructing the LCS string (traceback)
  3. Solve edit distance using LCS
  4. Tackle SCS and related variations
  5. Apply to real-world problems (diff, DNA alignment)

Complexity Analysis

Time Complexity: O(m × n)

  • We fill m × n cells in the DP table
  • Each cell computation is O(1)

Space Complexity:

  • Standard: O(m × n) for the DP table
  • Optimized (length only): O(min(m, n)) using rolling array
  • With traceback: O(m × n) needed to reconstruct string

Practical Considerations:

  • For very long strings (m, n > 10,000), consider:
    • Using space-optimized version if only length needed
    • Memory-mapped files for extremely large sequences
    • Parallel computation for multiple comparisons
    • Approximate algorithms for faster results with acceptable accuracy