My code just creates inline-diff (on a per-word basis) of a string using HTML

Asked: May 27, 2026

My code creates an inline diff (on a per-word basis) of a string using HTML tags, so that CSS can show or hide what was removed or added.

In my tests, I use () for additions and {} for removals.

Here is my text:

Inputs:

"e&nbsp;<b><u>Zerg</u></b>&nbsp;a"
"e Zerg a"

Output:

"e(?)(\240){&nbsp;<b>}{<u>}Zerg(?)(\240){</u>}{</b>}{&nbsp;}a"

I don't change the encoding anywhere, so I'm really confused as to how a question mark and \240 got in there.

What kind of encoding is this?

I’m using Ruby 1.8.7.

I found the source of the problem. It happens when I convert the new string to an array for Diff::LCS to use:

The code for that:

  def self.convert_html_string_to_html_array(str)
=begin
  Things like &nbsp (and other char codes), and tags need to be considered the same element
  also handles the decision to diff per char or per word

  also need to take into consideration JavaScript and CSS that might be in the middle of a selection
=end
    result = Array.new
    compare_words = str.has_at_least_one_word?
    i = 0
    while i < str.length do
      cur_char = str[i..i]
      case cur_char
      when "&"
        # For this we have two situations: a stray char code, and a char code preceding a tag
        next_index = str.index(";", i)
        case str[next_index + 1..next_index + 1] # Check to see if tag
        when "<"
          next_index = str.index(">", i)
        end
        result << str[i..next_index]
        i = next_index
      when "<"
        next_index = str.index(">", i)
        result << str[i..next_index]
        i = next_index
      when " "
        result << cur_char
      else
        if compare_words
          # In here we need to check the above rules again, cause tags can be touching regular text
          next_index = i + 1
          next_index = str.index(" ", next_index)
          next_index = str.length if next_index.nil?
          next_index -= 1

          if i < str.length and str[i..next_index].include?("<") # Beginning of a tag
            next_index = str.index(">", i)
          end

          result << str[i..next_index] # We don't want to include the space
          i = next_index
        else
          result << cur_char
        end
      end
      i += 1
    end

    return result
  end

To clarify, this:

'e Zerg a'

gets turned into this:

[
    [0] "e",
    [1] "\302",
    [2] "\240",
    [3] "Z",
    [4] "e",
    [5] "r",
    [6] "g",
    [7] "\302",
    [8] "\240",
    [9] "a"
]


1 Answer

Editorial Team · Answer 1 · 2026-05-27T23:31:00+00:00


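Update to Ruby 1.9.2 or above (I recommend using RVM). In 1.8.7 a String is a plain sequence of bytes with no encoding awareness, so str[i..i] returns a single byte rather than a character. The \302 and \240 in your array are the two UTF-8 bytes of a non-breaking space (U+00A0), split into separate elements; the question mark is most likely just how the lone \302 byte gets displayed.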
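To make the byte-versus-character difference concrete, here is a minimal sketch meant to be run on Ruby 1.9 or later. The sample string is an assumption: it rebuilds the question's input with explicit U+00A0 characters, since the array in the question suggests the "spaces" were non-breaking spaces. The variable names are made up for illustration.

```ruby
# Sketch only: assumes the question's input contains U+00A0 (non-breaking space).
str = "e\u00A0Zerg\u00A0a"

# On 1.9+, str[i..i] indexes by character, so each NBSP stays in one piece.
chars = (0...str.length).map { |i| str[i..i] }

# The byte view is what 1.8.7's str[i..i] effectively walks over.
bytes = str.bytes.to_a

# The two UTF-8 bytes of U+00A0, written in octal like the question's output.
nbsp_octal = "\u00A0".bytes.map { |b| "\\%o" % b }

# Splitting with a UTF-8-aware regex also yields whole characters.
chars_via_scan = str.scan(/./mu)

puts chars.length      # 8 characters
puts bytes.length      # 10 bytes, matching the 10 elements in the question
puts nbsp_octal.join   # \302\240
```

If upgrading is not an option, the str.scan(/./mu) idiom is a commonly used way to split a UTF-8 string into characters on 1.8.7 as well, though the "\u00A0" literal above would have to be written as "\302\240" there. The index arithmetic in convert_html_string_to_html_array would still be byte-based on 1.8.7, so treat this as a starting point rather than a drop-in fix.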
