Pattern 4: Sequence Alignment (Edit Distance)

The DNA Matching Pattern: Find optimal alignment between two sequences using dynamic programming with insertions, deletions, and substitutions. Classic application in computational genomics.

Understanding Sequence Alignment

THE REAL-WORLD PROBLEM

Sequence Alignment in Simple Terms

Imagine you have two strings of letters:

String A has n letters (the pattern you're searching for)
String B has m letters (the longer sequence, where m >> n)

Goal: Find the substring within B that most closely matches A.

Why "Perturbed" Matching?

The match doesn't have to be perfect! You can modify sequences by:

Inserting extra letters (gap penalty)
Deleting letters (gap penalty)
Substituting letters (mismatch penalty)

Each modification has a cost, and you want to find the alignment with the minimum total cost.

REAL-WORLD EXAMPLE: DNA MATCHING

Say you're looking for a gene sequence "ACGT" (A) inside a longer DNA strand "AACGGTC" (B).

You might find that "ACGG" matches best if you:

Delete one G (cost = 1)
Total cost = 1

This is better than other possible matches with higher costs!

The Dynamic Programming Structure

Think of it as a grid/table problem:

You process the longer sequence B in m stages (one for each position)
At each stage, you track n possible states (how well you're matching each position of A)
At each state, you have d decisions (insert, delete, substitute, or match)

Computational complexity: O(m × n × d) = O(m × n)

This is why it's called an "m-stage problem with n values at each stage and d decisions per state".

THIS IS INTERVAL DP!

State: dp[i][j] = minimum cost to align A[0..i] with B[0..j]

Transition: Try all three operations:

dp[i][j] = min(
    dp[i-1][j] + deletion_cost,
    dp[i][j-1] + insertion_cost,
    dp[i-1][j-1] + match/mismatch_cost
)

Build order: Fill the 2D table row by row (or column by column)

The Needleman-Wunsch Algorithm

The Needleman-Wunsch algorithm is the foundational global alignment method in bioinformatics, published in 1970. It guarantees finding the optimal alignment of two entire sequences.

Algorithm Components

Scoring System:

Match: score = +1 (or 0 for edit distance)
Mismatch: penalty = -1 (or +1 for edit distance)
Gap (insertion/deletion): penalty = -1 (or +1 for edit distance)

State Definition: dp[i][j] = minimum cost to transform first i characters of sequence A into first j characters of sequence B

Base Cases:

dp[0][j] = j * gap_penalty  // insert j characters
dp[i][0] = i * gap_penalty  // delete i characters

Recurrence Relation:

dp[i][j] = min(
    dp[i-1][j] + gap_penalty,           // deletion
    dp[i][j-1] + gap_penalty,           // insertion
    dp[i-1][j-1] + match/mismatch_cost  // match or substitution
)

Implementation

public int minDistance(String word1, String word2) {
    int m = word1.length();
    int n = word2.length();
    int[][] dp = new int[m + 1][n + 1];

    // Base cases
    for (int i = 0; i <= m; i++) {
        dp[i][0] = i;  // deletions
    }
    for (int j = 0; j <= n; j++) {
        dp[0][j] = j;  // insertions
    }

    // Fill the DP table
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            if (word1.charAt(i-1) == word2.charAt(j-1)) {
                // Characters match - no cost
                dp[i][j] = dp[i-1][j-1];
            } else {
                // Take minimum of three operations
                dp[i][j] = 1 + Math.min(
                    Math.min(
                        dp[i-1][j],      // delete from word1
                        dp[i][j-1]       // insert into word1
                    ),
                    dp[i-1][j-1]         // substitute
                );
            }
        }
    }

    return dp[m][n];
}

THE THREE CHOICES AT EACH CELL

When filling dp[i][j], visualize three arrows:

↓ From above dp[i-1][j]: Delete from sequence A
← From left dp[i][j-1]: Insert into sequence A
↖ From diagonal dp[i-1][j-1]: Match or substitute

Pick the arrow with minimum total cost!

Complete Example: DNA Sequence Alignment

Let's solve a realistic genomics problem: aligning two DNA sequences.

PROBLEM

Given:

Sequence A: "GCATGCU"
Sequence B: "GATTACA"

Find: Minimum edit distance and optimal alignment

Operations:

Match: cost 0
Mismatch: cost 1
Gap: cost 1

Step 1: Initialize the DP Table

     ""  G   A   T   T   A   C   A
""    0   1   2   3   4   5   6   7
G     1
C     2
A     3
T     4
G     5
C     6
U     7

Explanation:

First row: Cost to insert 0, 1, 2, ... characters from B
First column: Cost to delete 0, 1, 2, ... characters from A

Step 2: Fill the DP Table

For each cell dp[i][j], compute:

if A[i-1] == B[j-1]:
    dp[i][j] = dp[i-1][j-1]  // match, no cost
else:
    dp[i][j] = 1 + min(
        dp[i-1][j],      // delete
        dp[i][j-1],      // insert
        dp[i-1][j-1]     // substitute
    )

Complete DP Table:

     ""  G   A   T   T   A   C   A
""    0   1   2   3   4   5   6   7
G     1   0   1   2   3   4   5   6
C     2   1   1   2   3   4   4   5
A     3   2   1   2   3   3   4   4
T     4   3   2   1   2   3   4   5
G     5   4   3   2   2   3   4   5
C     6   5   4   3   3   3   3   4
U     7   6   5   4   4   4   4   4

HOW TO FILL EACH CELL

Example: dp[2][6] (C vs C)

Characters match (C == C)
Take diagonal value: dp[1][5] = 4
Result: dp[2][6] = 4

Example: dp[3][2] (A vs A)

Characters match (A == A)
Take diagonal value: dp[2][1] = 1
Result: dp[3][2] = 1

Example: dp[4][3] (T vs T)

Characters match (T == T)
Take diagonal value: dp[3][2] = 1
Result: dp[4][3] = 1

Example: dp[7][7] (U vs A)

Characters don't match (U ≠ A)
Options:
- Delete U: 1 + dp[6][7] = 1 + 4 = 5
- Insert A: 1 + dp[7][6] = 1 + 4 = 5
- Substitute: 1 + dp[6][6] = 1 + 3 = 4
Result: dp[7][7] = 4 (minimum)

Step 3: Traceback to Find Optimal Alignment

Starting from dp[7][7] = 4, trace back following the minimum path:

public String[] getAlignment(String A, String B, int[][] dp) {
    StringBuilder alignA = new StringBuilder();
    StringBuilder alignB = new StringBuilder();

    int i = A.length();
    int j = B.length();

    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && A.charAt(i-1) == B.charAt(j-1)) {
            // Match
            alignA.insert(0, A.charAt(i-1));
            alignB.insert(0, B.charAt(j-1));
            i--;
            j--;
        } else if (i > 0 && j > 0 &&
                   dp[i][j] == dp[i-1][j-1] + 1) {
            // Substitution
            alignA.insert(0, A.charAt(i-1));
            alignB.insert(0, B.charAt(j-1));
            i--;
            j--;
        } else if (j > 0 && dp[i][j] == dp[i][j-1] + 1) {
            // Insertion
            alignA.insert(0, '-');
            alignB.insert(0, B.charAt(j-1));
            j--;
        } else {
            // Deletion
            alignA.insert(0, A.charAt(i-1));
            alignB.insert(0, '-');
            i--;
        }
    }

    return new String[]{alignA.toString(), alignB.toString()};
}

Optimal Alignment:

A: GCA-TGCU
B: G-ATTACA
   |   || |

Operations:
1. G matches G
2. C → delete (or A → insert)
3. A matches A
4. T → delete (or T → insert)
5. T matches T
6. G → C (substitute)
7. C → A (substitute)
8. U → A (substitute)

Total cost: 4

Alternative Valid Alignment

A: GCATGCU
B: GATTACA
   |  | |

Another valid alignment with cost 4:
A: G-CATGCU
B: GATT-ACA

MULTIPLE OPTIMAL ALIGNMENTS

There can be multiple alignments with the same minimum cost!

The traceback algorithm finds one of them. To find all optimal alignments, you need to explore all paths that give the minimum cost.

Why This Matters for Genomics

In computational genomics, scientists use these algorithms to:

Find genes in large genomes
- Search for known gene sequences in newly sequenced DNA
- Cost: O(m × n) for sequences of length m and n
Compare DNA sequences across species
- Identify evolutionary relationships
- Measure genetic similarity/distance
Identify mutations or variations
- Find single nucleotide polymorphisms (SNPs)
- Detect insertions/deletions (indels)
Protein sequence alignment
- Compare protein structures
- Predict protein function from similar sequences

REAL-WORLD OPTIMIZATION

For very long sequences (millions of bases), optimized versions exist:

BLAST: Heuristic approach for faster approximate matching
Smith-Waterman: Local alignment (find best matching substring)
Affine gap penalties: More realistic gap scoring
Banded alignment: Limit search space for similar sequences

But they all build on this fundamental DP approach!

Pattern Recognition

WHEN YOU SEE THIS PATTERN

Keywords: "edit distance", "sequence alignment", "minimum operations", "DNA matching", "transform one string to another"

Structure:

Two sequences to align
Operations: insert, delete, substitute
Each operation has a cost
Find minimum total cost

State: dp[i][j] = minimum cost to align first i characters of A with first j characters of B

Transition: Three choices at each step

dp[i][j] = min(
    dp[i-1][j] + gap_cost,          // delete
    dp[i][j-1] + gap_cost,          // insert
    dp[i-1][j-1] + match/mismatch   // match or substitute
)

Build Order: Row by row (or column by column)

Variations and Extensions

Variation 1: Edit Distance (LC 72)

Problem: Transform word1 to word2 using minimum operations

Solution: Exactly the Needleman-Wunsch algorithm above!

LeetCode 72: https://leetcode.com/problems/edit-distance/

Variation 2: One Edit Distance (LC 161)

Problem: Check if two strings are exactly one edit away

Optimization: Can be solved in O(n) without DP by comparing character by character

Variation 3: Delete Operation for Two Strings (LC 583)

Problem: Minimum deletions to make two strings equal

Solution: Find longest common subsequence (LCS), then:

min_deletions = (len(s1) - LCS) + (len(s2) - LCS)

Variation 4: Minimum ASCII Delete Sum (LC 712)

Problem: Minimize sum of ASCII values of deleted characters

Solution: Same DP structure, but use ASCII values as costs

Space Optimization

SPACE-OPTIMIZED VERSION

The basic version uses O(m × n) space. We can optimize to O(min(m, n)) by keeping only two rows:

public int minDistanceOptimized(String word1, String word2) {
    int m = word1.length();
    int n = word2.length();

    // Use shorter string for columns
    if (m > n) {
        return minDistanceOptimized(word2, word1);
    }

    int[] prev = new int[m + 1];
    int[] curr = new int[m + 1];

    // Initialize first row
    for (int i = 0; i <= m; i++) {
        prev[i] = i;
    }

    for (int j = 1; j <= n; j++) {
        curr[0] = j;  // first column

        for (int i = 1; i <= m; i++) {
            if (word1.charAt(i-1) == word2.charAt(j-1)) {
                curr[i] = prev[i-1];
            } else {
                curr[i] = 1 + Math.min(
                    Math.min(prev[i], curr[i-1]),
                    prev[i-1]
                );
            }
        }

        // Swap arrays
        int[] temp = prev;
        prev = curr;
        curr = temp;
    }

    return prev[m];
}

Space: O(m) where m is the shorter string length

Trade-off: Cannot reconstruct the alignment path with this optimization!

Connection to Other Patterns

RELATIONSHIP TO OTHER DP PATTERNS

Pattern 4 (Interval DP):

Edit Distance is a 2D interval problem
State depends on smaller subproblems
Similar to LCS, palindrome problems

Pattern 8 (String DP):

Edit Distance is a fundamental string DP problem
Template for many string transformation problems

Pattern 10 (LIS/LCS):

LCS can be used to solve edit distance
edit_distance = m + n - 2 * LCS(A, B)

Practice Problems

RECOMMENDED PRACTICE ORDER

Easy (Build Foundation):

Edit Distance (LC 72) - The classic problem
Delete Operation for Two Strings (LC 583)
One Edit Distance (LC 161)

Medium (Apply Variations): 4. Minimum ASCII Delete Sum (LC 712) 5. Distinct Subsequences (LC 115) - Related counting problem 6. Longest Common Subsequence (LC 1143) - Foundation for edit distance

Hard (Advanced Applications): 7. Regular Expression Matching (LC 10) - Complex state transitions 8. Wildcard Matching (LC 44) - Similar to regex 9. Interleaving String (LC 97) - 2D matching problem

Real-World Application:

Implement BLAST algorithm basics
Create a simple spell checker
Build a DNA sequence matcher

Summary: The Edit Distance Template

// Template for Edit Distance / Sequence Alignment
public int editDistance(String A, String B) {
    int m = A.length();
    int n = B.length();
    int[][] dp = new int[m + 1][n + 1];

    // Base cases: empty string transformations
    for (int i = 0; i <= m; i++) dp[i][0] = i;
    for (int j = 0; j <= n; j++) dp[0][j] = j;

    // Fill DP table
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            if (A.charAt(i-1) == B.charAt(j-1)) {
                dp[i][j] = dp[i-1][j-1];  // match
            } else {
                dp[i][j] = 1 + Math.min(
                    Math.min(
                        dp[i-1][j],      // delete
                        dp[i][j-1]       // insert
                    ),
                    dp[i-1][j-1]         // substitute
                );
            }
        }
    }

    return dp[m][n];
}

REMEMBER THE VISUALIZATION

Each cell dp[i][j] has three incoming arrows:
    ↖ diagonal (match/substitute)
    ↑ from above (delete)
    ← from left (insert)

Pick the arrow with minimum total cost!

This is the essence of sequence alignment.

Final Thoughts

SEQUENCE ALIGNMENT IS FUNDAMENTAL

The dynamic programming approach guarantees finding the optimal alignment efficiently, rather than testing every possible combination (which would be impossibly slow).

Time Complexity: O(m × n) Space Complexity: O(m × n) or O(min(m, n)) with optimization

This algorithm is used billions of times daily in:

NCBI BLAST searches
Genomic databases
Protein structure prediction
Evolutionary biology research
Spell checkers and diff tools

Master this pattern, and you'll understand one of the most impactful algorithms in computational biology!

Understanding Sequence Alignment​

Why "Perturbed" Matching?​

The Dynamic Programming Structure​

The Needleman-Wunsch Algorithm​

Algorithm Components​

Implementation​

Complete Example: DNA Sequence Alignment​

Step 1: Initialize the DP Table​

Step 2: Fill the DP Table​

Step 3: Traceback to Find Optimal Alignment​

Alternative Valid Alignment​

Why This Matters for Genomics​

Pattern Recognition​

Variations and Extensions​

Variation 1: Edit Distance (LC 72)​

Variation 2: One Edit Distance (LC 161)​

Variation 3: Delete Operation for Two Strings (LC 583)​

Variation 4: Minimum ASCII Delete Sum (LC 712)​

Space Optimization​

Connection to Other Patterns​

Practice Problems​

Summary: The Edit Distance Template​

Final Thoughts​