Skip to main content

Pattern 4: Sequence Alignment (Edit Distance)

The DNA Matching Pattern: Find optimal alignment between two sequences using dynamic programming with insertions, deletions, and substitutions. Classic application in computational genomics.


Understanding Sequence Alignment

THE REAL-WORLD PROBLEM

Sequence Alignment in Simple Terms

Imagine you have two strings of letters:

  • String A has n letters (the pattern you're searching for)
  • String B has m letters (the longer sequence, where m >> n)

Goal: Find the substring within B that most closely matches A.

Why "Perturbed" Matching?

The match doesn't have to be perfect! You can modify sequences by:

  1. Inserting extra letters (gap penalty)
  2. Deleting letters (gap penalty)
  3. Substituting letters (mismatch penalty)

Each modification has a cost, and you want to find the alignment with the minimum total cost.

REAL-WORLD EXAMPLE: DNA MATCHING

Say you're looking for a gene sequence "ACGT" (A) inside a longer DNA strand "AACGGTC" (B).

You might find that "ACGG" matches best if you:

  • Delete one G (cost = 1)
  • Total cost = 1

This is better than other possible matches with higher costs!

The Dynamic Programming Structure

Think of it as a grid/table problem:

  1. You process the longer sequence B in m stages (one for each position)
  2. At each stage, you track n possible states (how well you're matching each position of A)
  3. At each state, you have d decisions (insert, delete, substitute, or match)

Computational complexity: O(m × n × d) = O(m × n)

This is why it's called an "m-stage problem with n values at each stage and d decisions per state".

THIS IS INTERVAL DP!

State: dp[i][j] = minimum cost to align A[0..i] with B[0..j]

Transition: Try all three operations:

dp[i][j] = min(
dp[i-1][j] + deletion_cost,
dp[i][j-1] + insertion_cost,
dp[i-1][j-1] + match/mismatch_cost
)

Build order: Fill the 2D table row by row (or column by column)


The Needleman-Wunsch Algorithm

The Needleman-Wunsch algorithm is the foundational global alignment method in bioinformatics, published in 1970. It guarantees finding the optimal alignment of two entire sequences.

Algorithm Components

Scoring System:

  • Match: score = +1 (or 0 for edit distance)
  • Mismatch: penalty = -1 (or +1 for edit distance)
  • Gap (insertion/deletion): penalty = -1 (or +1 for edit distance)

State Definition: dp[i][j] = minimum cost to transform first i characters of sequence A into first j characters of sequence B

Base Cases:

dp[0][j] = j * gap_penalty  // insert j characters
dp[i][0] = i * gap_penalty // delete i characters

Recurrence Relation:

dp[i][j] = min(
dp[i-1][j] + gap_penalty, // deletion
dp[i][j-1] + gap_penalty, // insertion
dp[i-1][j-1] + match/mismatch_cost // match or substitution
)

Implementation

public int minDistance(String word1, String word2) {
int m = word1.length();
int n = word2.length();
int[][] dp = new int[m + 1][n + 1];

// Base cases
for (int i = 0; i <= m; i++) {
dp[i][0] = i; // deletions
}
for (int j = 0; j <= n; j++) {
dp[0][j] = j; // insertions
}

// Fill the DP table
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (word1.charAt(i-1) == word2.charAt(j-1)) {
// Characters match - no cost
dp[i][j] = dp[i-1][j-1];
} else {
// Take minimum of three operations
dp[i][j] = 1 + Math.min(
Math.min(
dp[i-1][j], // delete from word1
dp[i][j-1] // insert into word1
),
dp[i-1][j-1] // substitute
);
}
}
}

return dp[m][n];
}
THE THREE CHOICES AT EACH CELL

When filling dp[i][j], visualize three arrows:

  • From above dp[i-1][j]: Delete from sequence A
  • From left dp[i][j-1]: Insert into sequence A
  • From diagonal dp[i-1][j-1]: Match or substitute

Pick the arrow with minimum total cost!


Complete Example: DNA Sequence Alignment

Let's solve a realistic genomics problem: aligning two DNA sequences.

PROBLEM

Given:

  • Sequence A: "GCATGCU"
  • Sequence B: "GATTACA"

Find: Minimum edit distance and optimal alignment

Operations:

  • Match: cost 0
  • Mismatch: cost 1
  • Gap: cost 1

Step 1: Initialize the DP Table

     ""  G   A   T   T   A   C   A
"" 0 1 2 3 4 5 6 7
G 1
C 2
A 3
T 4
G 5
C 6
U 7

Explanation:

  • First row: Cost to insert 0, 1, 2, ... characters from B
  • First column: Cost to delete 0, 1, 2, ... characters from A

Step 2: Fill the DP Table

For each cell dp[i][j], compute:

if A[i-1] == B[j-1]:
dp[i][j] = dp[i-1][j-1] // match, no cost
else:
dp[i][j] = 1 + min(
dp[i-1][j], // delete
dp[i][j-1], // insert
dp[i-1][j-1] // substitute
)

Complete DP Table:

     ""  G   A   T   T   A   C   A
"" 0 1 2 3 4 5 6 7
G 1 0 1 2 3 4 5 6
C 2 1 1 2 3 4 4 5
A 3 2 1 2 3 3 4 4
T 4 3 2 1 2 3 4 5
G 5 4 3 2 2 3 4 5
C 6 5 4 3 3 3 3 4
U 7 6 5 4 4 4 4 4
HOW TO FILL EACH CELL

Example: dp[2][6] (C vs C)

  • Characters match (C == C)
  • Take diagonal value: dp[1][5] = 4
  • Result: dp[2][6] = 4

Example: dp[3][2] (A vs A)

  • Characters match (A == A)
  • Take diagonal value: dp[2][1] = 1
  • Result: dp[3][2] = 1

Example: dp[4][3] (T vs T)

  • Characters match (T == T)
  • Take diagonal value: dp[3][2] = 1
  • Result: dp[4][3] = 1

Example: dp[7][7] (U vs A)

  • Characters don't match (U ≠ A)
  • Options:
    • Delete U: 1 + dp[6][7] = 1 + 4 = 5
    • Insert A: 1 + dp[7][6] = 1 + 4 = 5
    • Substitute: 1 + dp[6][6] = 1 + 3 = 4
  • Result: dp[7][7] = 4 (minimum)

Step 3: Traceback to Find Optimal Alignment

Starting from dp[7][7] = 4, trace back following the minimum path:

public String[] getAlignment(String A, String B, int[][] dp) {
StringBuilder alignA = new StringBuilder();
StringBuilder alignB = new StringBuilder();

int i = A.length();
int j = B.length();

while (i > 0 || j > 0) {
if (i > 0 && j > 0 && A.charAt(i-1) == B.charAt(j-1)) {
// Match
alignA.insert(0, A.charAt(i-1));
alignB.insert(0, B.charAt(j-1));
i--;
j--;
} else if (i > 0 && j > 0 &&
dp[i][j] == dp[i-1][j-1] + 1) {
// Substitution
alignA.insert(0, A.charAt(i-1));
alignB.insert(0, B.charAt(j-1));
i--;
j--;
} else if (j > 0 && dp[i][j] == dp[i][j-1] + 1) {
// Insertion
alignA.insert(0, '-');
alignB.insert(0, B.charAt(j-1));
j--;
} else {
// Deletion
alignA.insert(0, A.charAt(i-1));
alignB.insert(0, '-');
i--;
}
}

return new String[]{alignA.toString(), alignB.toString()};
}

Optimal Alignment:

A: GCA-TGCU
B: G-ATTACA
| || |

Operations:
1. G matches G
2. C → delete (or A → insert)
3. A matches A
4. T → delete (or T → insert)
5. T matches T
6. G → C (substitute)
7. C → A (substitute)
8. U → A (substitute)

Total cost: 4

Alternative Valid Alignment

A: GCATGCU
B: GATTACA
| | |

Another valid alignment with cost 4:
A: G-CATGCU
B: GATT-ACA
MULTIPLE OPTIMAL ALIGNMENTS

There can be multiple alignments with the same minimum cost!

The traceback algorithm finds one of them. To find all optimal alignments, you need to explore all paths that give the minimum cost.


Why This Matters for Genomics

In computational genomics, scientists use these algorithms to:

  1. Find genes in large genomes

    • Search for known gene sequences in newly sequenced DNA
    • Cost: O(m × n) for sequences of length m and n
  2. Compare DNA sequences across species

    • Identify evolutionary relationships
    • Measure genetic similarity/distance
  3. Identify mutations or variations

    • Find single nucleotide polymorphisms (SNPs)
    • Detect insertions/deletions (indels)
  4. Protein sequence alignment

    • Compare protein structures
    • Predict protein function from similar sequences
REAL-WORLD OPTIMIZATION

For very long sequences (millions of bases), optimized versions exist:

  • BLAST: Heuristic approach for faster approximate matching
  • Smith-Waterman: Local alignment (find best matching substring)
  • Affine gap penalties: More realistic gap scoring
  • Banded alignment: Limit search space for similar sequences

But they all build on this fundamental DP approach!


Pattern Recognition

WHEN YOU SEE THIS PATTERN

Keywords: "edit distance", "sequence alignment", "minimum operations", "DNA matching", "transform one string to another"

Structure:

  • Two sequences to align
  • Operations: insert, delete, substitute
  • Each operation has a cost
  • Find minimum total cost

State: dp[i][j] = minimum cost to align first i characters of A with first j characters of B

Transition: Three choices at each step

dp[i][j] = min(
dp[i-1][j] + gap_cost, // delete
dp[i][j-1] + gap_cost, // insert
dp[i-1][j-1] + match/mismatch // match or substitute
)

Build Order: Row by row (or column by column)


Variations and Extensions

Variation 1: Edit Distance (LC 72)

Problem: Transform word1 to word2 using minimum operations

Solution: Exactly the Needleman-Wunsch algorithm above!

LeetCode 72: https://leetcode.com/problems/edit-distance/

Variation 2: One Edit Distance (LC 161)

Problem: Check if two strings are exactly one edit away

Optimization: Can be solved in O(n) without DP by comparing character by character

Variation 3: Delete Operation for Two Strings (LC 583)

Problem: Minimum deletions to make two strings equal

Solution: Find longest common subsequence (LCS), then:

min_deletions = (len(s1) - LCS) + (len(s2) - LCS)

Variation 4: Minimum ASCII Delete Sum (LC 712)

Problem: Minimize sum of ASCII values of deleted characters

Solution: Same DP structure, but use ASCII values as costs


Space Optimization

SPACE-OPTIMIZED VERSION

The basic version uses O(m × n) space. We can optimize to O(min(m, n)) by keeping only two rows:

public int minDistanceOptimized(String word1, String word2) {
int m = word1.length();
int n = word2.length();

// Use shorter string for columns
if (m > n) {
return minDistanceOptimized(word2, word1);
}

int[] prev = new int[m + 1];
int[] curr = new int[m + 1];

// Initialize first row
for (int i = 0; i <= m; i++) {
prev[i] = i;
}

for (int j = 1; j <= n; j++) {
curr[0] = j; // first column

for (int i = 1; i <= m; i++) {
if (word1.charAt(i-1) == word2.charAt(j-1)) {
curr[i] = prev[i-1];
} else {
curr[i] = 1 + Math.min(
Math.min(prev[i], curr[i-1]),
prev[i-1]
);
}
}

// Swap arrays
int[] temp = prev;
prev = curr;
curr = temp;
}

return prev[m];
}

Space: O(m) where m is the shorter string length

Trade-off: Cannot reconstruct the alignment path with this optimization!


Connection to Other Patterns

RELATIONSHIP TO OTHER DP PATTERNS

Pattern 4 (Interval DP):

  • Edit Distance is a 2D interval problem
  • State depends on smaller subproblems
  • Similar to LCS, palindrome problems

Pattern 8 (String DP):

  • Edit Distance is a fundamental string DP problem
  • Template for many string transformation problems

Pattern 10 (LIS/LCS):

  • LCS can be used to solve edit distance
  • edit_distance = m + n - 2 * LCS(A, B)

Practice Problems

RECOMMENDED PRACTICE ORDER

Easy (Build Foundation):

  1. Edit Distance (LC 72) - The classic problem
  2. Delete Operation for Two Strings (LC 583)
  3. One Edit Distance (LC 161)

Medium (Apply Variations): 4. Minimum ASCII Delete Sum (LC 712) 5. Distinct Subsequences (LC 115) - Related counting problem 6. Longest Common Subsequence (LC 1143) - Foundation for edit distance

Hard (Advanced Applications): 7. Regular Expression Matching (LC 10) - Complex state transitions 8. Wildcard Matching (LC 44) - Similar to regex 9. Interleaving String (LC 97) - 2D matching problem

Real-World Application:

  • Implement BLAST algorithm basics
  • Create a simple spell checker
  • Build a DNA sequence matcher

Summary: The Edit Distance Template

// Template for Edit Distance / Sequence Alignment
public int editDistance(String A, String B) {
int m = A.length();
int n = B.length();
int[][] dp = new int[m + 1][n + 1];

// Base cases: empty string transformations
for (int i = 0; i <= m; i++) dp[i][0] = i;
for (int j = 0; j <= n; j++) dp[0][j] = j;

// Fill DP table
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (A.charAt(i-1) == B.charAt(j-1)) {
dp[i][j] = dp[i-1][j-1]; // match
} else {
dp[i][j] = 1 + Math.min(
Math.min(
dp[i-1][j], // delete
dp[i][j-1] // insert
),
dp[i-1][j-1] // substitute
);
}
}
}

return dp[m][n];
}
REMEMBER THE VISUALIZATION
Each cell dp[i][j] has three incoming arrows:
↖ diagonal (match/substitute)
↑ from above (delete)
← from left (insert)

Pick the arrow with minimum total cost!

This is the essence of sequence alignment.


Final Thoughts

SEQUENCE ALIGNMENT IS FUNDAMENTAL

The dynamic programming approach guarantees finding the optimal alignment efficiently, rather than testing every possible combination (which would be impossibly slow).

Time Complexity: O(m × n) Space Complexity: O(m × n) or O(min(m, n)) with optimization

This algorithm is used billions of times daily in:

  • NCBI BLAST searches
  • Genomic databases
  • Protein structure prediction
  • Evolutionary biology research
  • Spell checkers and diff tools

Master this pattern, and you'll understand one of the most impactful algorithms in computational biology!