How can I calculate the diff between two strings in JavaScript?

Have you ever wondered how Git's diff algorithm works? From what I've read, it's based on the Myers diff algorithm, which itself is based on solving the longest common subsequence problem, which I've covered in a previous article. So, let's take a look at how we can implement this algorithm in JavaScript!
I highly recommend you read the article on the longest common subsequence before diving into this one, as it will help you understand the underlying concepts better.
Problem statement
I find that examples always help clarify things, so let's start with one. Assuming two sequences, a
and b
, we want to find the minimum number of edits required to convert a
into b
. These edits can be either insertions, deletions, or replacements. For this example, we'll start with two strings, but we'll later build a more general solution that can handle any sequence.
const a = 'kitten'; const b = 'sitting';
The expected output for this example should be something like this:
const diff = [ { value: 'k', operation: 'delete' }, { value: 's', operation: 'insert' }, { value: 'i', operation: 'equal' }, { value: 't', operation: 'equal' }, { value: 't', operation: 'equal' }, { value: 'e', operation: 'delete' }, { value: 'i', operation: 'insert' }, { value: 'n', operation: 'equal' }, { value: 'g', operation: 'insert' } ];
As you can see, we have a list of objects, each containing a value
and an operation
. The value
is the character from the original sequence, and the operation
is the type of edit required to convert the original sequence into the new one. If there is no edit required, the operation is equal
. We also make sure that delete
operations take precedence over insert
operations. This makes the output more readable and easier to understand.
Solution explanation
Same as the previous problem, we'll use dynamic programming. The approach is very much similar to LCS, but with some tweaks. Instead of tiring you with all the details, I'll just explain the algorithm's steps, given two sequences, a
and b
, of lengths m
and n
:
- Create a
(m + 1) x (n + 1)
matrix,dp
. Each column represents a character ofa
, and each row represents a character ofb
. - Fill the first row with the numbers
0
ton
and the first column with the numbers0
tom
, representing the number of deletions and insertions required to convert an empty string into the corresponding string. - Fill the matrix, starting from the second row and second column, using the following rules:
- If
a[i - 1]
equalsb[j - 1]
, setdp[i][j] = dp[i - 1][j - 1]
(no edit required). - Otherwise, set
dp[i][j] = Math.min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]) + 1
(minimum of the three operations).
- If
- Backtrack through the matrix to find the differences between the two sequences:
- If both
i > 0
andj > 0
anda[i - 1]
equalsb[j - 1]
, add an object withvalue: a[i - 1]
andoperation: 'equal'
to thediffs
array. - If
i > 0
and eitherj === 0
ordp[i][j] === dp[i - 1][j] + 1
, add an object withvalue: a[i - 1]
andoperation: 'delete'
to thediffs
array (deletion takes precedence over insertion). - Otherwise, add an object with
value: b[j - 1]
andoperation: 'insert'
to thediffs
array.
- If both
- Return the
diffs
array when bothi
andj
are0
.
There are various optimizations that can be made to this algorithm, but this article is meant primarily as a learning resource. Therefore, I'll keep things nice and simple and leave these optimizations out for now.
Algorithm implementation
As you can probably tell, this is very similar to the LCS algorithm, but with a few tweaks to handle deletions and insertions. Let's implement this in JavaScript. We'll again make sure it can handle both strings and arrays as input.
const myersDiff = (a, b) => { // Find the lengths of the two sequences const m = a.length, n = b.length; // Create a 2D array to store the differences const dp = Array(m + 1).fill(0).map(() => Array(n + 1).fill(0)); // Initialize the first row and column dp[0] = dp[0].map((_, j) => j); dp.forEach((row, i) => row[0] = i); // Fill in the rest of the 2D array for (let i = 1; i <= m; i++) { for (let j = 1; j <= n; j++) { if (a[i - 1] === b[j - 1]) { dp[i][j] = dp[i - 1][j - 1]; } else { dp[i][j] = 1 + Math.min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]); } } } // Reconstruct the differences let i = m, j = n, diffs = []; while (i > 0 || j > 0) { if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) { diffs.unshift({ value: a[i - 1], operation: 'equal' }); i--; j--; } else if (i > 0 && (j === 0 || dp[i][j] === dp[i - 1][j] + 1)) { diffs.unshift({ value: a[i - 1], operation: 'delete' }); i--; } else { diffs.unshift({ value: b[j - 1], operation: 'insert' }); j--; } } return diffs; } myersDiff('kitten', 'sitting'); // [ // { value: 'k', operation: 'delete' }, // { value: 's', operation: 'insert' }, // { value: 'i', operation: 'equal' }, // { value: 't', operation: 'equal' }, // { value: 't', operation: 'equal' }, // { value: 'e', operation: 'delete' }, // { value: 'i', operation: 'insert' }, // { value: 'n', operation: 'equal' }, // { value: 'g', operation: 'insert' } // ]
Visualizing the diff
This is nice and all, but how would one go about visualizing it the way Git does? It's simple, actually! All we need to do is iterate over the diffs
array and print the values and a prefix based on the operation. While we're at it, we may as well add the number of additions and deletions required as an overview at the beginning of our output.
const visualizeDiff = (diffs) => { const { insertions, deletions, output } = diffs.reduce( (acc, { value, operation }) => { if (operation === 'insert') { acc.insertions++; acc.output.push(`+${value}`); } else if (operation === 'delete') { acc.deletions++; acc.output.push(`-${value}`); } else { acc.output.push(` ${value} `); } return acc; }, { insertions: 0, deletions: 0, output: [] }); const insertionsCount = insertions > 0 ? `${insertions} insertions(+), ` : ''; const deletionsCount = deletions > 0 ? `${deletions} deletions(-)` : ''; return `${insertionsCount}${deletionsCount}\n\n${output.join('\n')}`; };
We'll explore a more complex example for this, using arrays of strings, instead of just strings, so that it looks a little more interesting.
const fileA = [ '20 bottles of beer on the wall', '20 bottles of beer', 'Take one down, pass it around', '19 bottles of beer on the wall' ]; const fileB = [ '19 bottles of beer on the wall', '19 bottles of beer', 'Take one down, pass it around', '18 bottles of beer on the wall' ]; const diffs = myersDiff(fileA, fileB); visualizeDiff(diffs); // 3 insertions(+), 3 deletions(-) // // -20 bottles of beer on the wall // +19 bottles of beer on the wall // -20 bottles of beer // +19 bottles of beer // Take one down, pass it around // -19 bottles of beer on the wall // +18 bottles of beer on the wall
Looks a lot like a Git diff, doesn't it? You can definitely make it even better by adding colors or skipping over long unchanged subsequences, but I'll leave that up to you.
Conclusion
And there you have it! You can now find the differences between two sequences using the Myers diff algorithm and visualize them in a readable format.