The Archive Base


Editorial Team
Asked: May 13, 2026

I am in the process of writing a diff text tool to compare two


I am in the process of writing a diff text tool to compare two similar source code files.

There are many such “diff” tools around, but mine shall be a little improved:

If it finds that a set of lines is mismatched on both sides (i.e., in both files), it shall not only highlight those lines but also highlight the individual changes within them (I call this inter-line comparison here).

An example of my somewhat working solution:

[Image: diff example, http://files.tempel.org/tmp/diff_example.png]

What it currently does is take a set of mismatched lines and run their individual characters through the diff algorithm once more, producing the pink highlighting.
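A minimal sketch of that second, character-level pass. It is not the code from the tool described here: it uses Python's standard `difflib.SequenceMatcher` as a stand-in for the diff algorithm, and the function name `changed_spans` is made up for illustration. It returns the character ranges on each side that a viewer would highlight.

```python
from difflib import SequenceMatcher


def changed_spans(left: str, right: str):
    """Return (left_spans, right_spans): half-open character ranges that differ.

    Hypothetical helper; difflib stands in for the diff algorithm of the tool.
    """
    left_spans, right_spans = [], []
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged run, nothing to highlight
        if i2 > i1:
            left_spans.append((i1, i2))   # deleted or replaced on the left
        if j2 > j1:
            right_spans.append((j1, j2))  # inserted or replaced on the right
    return left_spans, right_spans


if __name__ == "__main__":
    print(changed_spans("abc", "abXc"))
```

The spans are offsets into each line, so the renderer can colour just those characters.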

However, the second set of mismatches, containing “original 2”, requires more work: here, the first two lines on the right (“added line a/b”) were added, while the third line is an altered version of the line on the left. I want my software to distinguish between a likely alteration and a probable new line.

When looking at this simple example, I can rather easily detect this case:

With an algorithm such as Levenshtein, I could find that, of all the right lines in the set 3 to 5, line 5 matches left line 3 best. I could thus deduce that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.
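The simple case can be sketched as follows. The line texts are invented for illustration (the screenshot is not reproduced here), and `levenshtein` and `best_match` are hypothetical names, not part of any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def best_match(left_line: str, right_lines: list) -> int:
    """Index of the right-hand line with the smallest edit distance."""
    return min(range(len(right_lines)),
               key=lambda j: levenshtein(left_line, right_lines[j]))


if __name__ == "__main__":
    # Illustrative data only; the real lines may differ.
    right = ["added line a", "added line b", "original 2b"]
    print(best_match("original 2", right))
```

With one line on the left this is enough; the lines before the best match are then treated as added.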

So far, so good. But I am still stuck with how to turn this into a more general algorithm for this purpose.

In a more complex situation, a set of different lines could have added lines on both sides, with a few closely matching lines in between. This gets quite complicated:

I’d have to match not only the first line on the left to the best one on the right, but vice versa as well, and so on with all the other lines. Basically, I have to match every line on the left against every one on the right. At worst, this might even create crossings, so that it is no longer clear which lines were newly inserted and which were just altered (Note: I do not want to deal with possibly moved lines in such a block, unless that would actually simplify the algorithm).

Sure, this is never going to be perfect, but I’m trying to make it better than it is now. Any suggestions that are practical rather than too theoretical (I’m not good at understanding abstract algorithms) are appreciated.

Update

I must admit that I do not even understand how the LCS algo works. I simply feed it two arrays of strings and out comes a list of which sequences do not match. I am basically using the code from here: http://www.incava.org/projects/java/java-diff

Looking at the code, I find one function, equal(), that is responsible for telling the algorithm whether two lines match or not. Based on what Pavel suggested, I wonder if that’s the place where I’d make the changes. But how? This function only returns a boolean, not a relative value that could identify the quality of the match. And I cannot simply use a fixed Levenshtein ratio to decide whether a similar line is still considered equal or not. I’ll need something that adapts itself to the entire set of lines in question.

So, what I’m basically saying is that I still do not understand where I’d apply the fuzzy value that relates to the relative similarity of lines that do not (exactly) match.


1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 4:52 pm

    With an algo such as Levenshtein, I could find that of all right lines in the set of 3 to 5, line 5 matches left line 3 best, thus I could deduce that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.

    After you have determined that, use the same algorithm to determine which lines in these two chunks match each other. But you need to make a slight modification. When you used the algorithm to match equal lines, two lines could either match or not match, which added either 0 or 1 to the cell of the table you used.

    When comparing the strings within one chunk, some of them are “more equal” than others (ack. to Orwell). So they can add a real number from 0 to 1 to the cell when considering which sequence matches best so far.
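One way to read this suggestion is as an LCS-style table whose cells accumulate fractional scores instead of 0 or 1. The sketch below is an interpretation, not the answerer's code: `align_chunk` and `line_similarity` are hypothetical names, `difflib.SequenceMatcher.ratio()` stands in for the per-line metric, and `min_score` is an optional floor added here as an assumption (the default of 0.0 leaves it switched off).

```python
from difflib import SequenceMatcher


def line_similarity(a: str, b: str) -> float:
    """Stand-in metric in [0, 1]; any similarity function could be plugged in."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def align_chunk(left, right, min_score=0.0):
    """Pair up lines of two mismatched chunks, keeping their order.

    Returns (left_index, right_index) pairs judged to be alterations of each
    other; unpaired lines can be treated as deleted (left) or added (right).
    """
    n, m = len(left), len(right)
    score = [[line_similarity(l, r) for r in right] for l in left]
    # best[i][j] = highest total score achievable using left[i:] and right[j:]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            skip = max(best[i + 1][j], best[i][j + 1])
            pair = (best[i + 1][j + 1] + score[i][j]) if score[i][j] > min_score else 0.0
            best[i][j] = max(skip, pair)
    # Walk the table from the top-left corner to recover the chosen pairs.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        s = score[i][j]
        if s > min_score and best[i][j] == best[i + 1][j + 1] + s:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif best[i][j] == best[i + 1][j]:
            i += 1
        else:
            j += 1
    return pairs


if __name__ == "__main__":
    # Illustrative data only.
    print(align_chunk(["foo = 1", "bar = 2"],
                      ["foo = 10", "new line here", "bar = 20"]))
```

Because pairs are only taken in increasing order on both sides, this never produces crossings, and the choice of pairs depends on the whole chunk rather than on a fixed cutoff for each pair.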

    To compute this metric (from 0 to 1), you can apply to each pair of strings you encounter… right, the same algorithm again (actually, you already did this during the first pass of the Levenshtein algorithm). This computes the length of the LCS, whose ratio to the average length of the two strings is the metric value.
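The metric described above can be sketched directly: the LCS length of the two lines, taken character by character, divided by the average of their lengths. The names `lcs_length` and `similarity` are made up for illustration, and treating two empty lines as identical is an assumption.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """LCS length divided by the average length of the two strings (0..1)."""
    if not a and not b:
        return 1.0  # assumption: two empty lines count as identical
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


if __name__ == "__main__":
    print(similarity("original 2", "original 2b"))
```

A value of 1.0 means the lines are identical and 0.0 means they share no characters in order.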

    Or you can borrow the algorithm from one of the existing diff tools. For instance, vimdiff can highlight the matches you require.
