My code just creates an inline diff (on a per-word basis) of a string using HTML tags, so that CSS can hide or show what was removed or added.
In my tests, I use () for additions and {} for removals.
Here is my test:
Inputs:
"e <b><u>Zerg</u></b> a"
"e Zerg a"
Output:
"e(?)(\240){ <b>}{<u>}Zerg(?)(\240){</u>}{</b>}{ }a"
Now, I don’t do anything with changing the encoding at all, so… I’m really confused as to how a question mark and \240 got in there. o.o
What kind of encoding is this?
I’m using Ruby 1.8.7.
I found the source of the problem: it happens when I convert the new string to an array for Diff::LCS to use.
The code for that:
def self.convert_html_string_to_html_array(str)
=begin
Things like &nbsp; (and other char codes), and tags need to be considered the same element
also handles the decision to diff per char or per word
also need to take into consideration JavaScript and CSS that might be in the middle of a selection
=end
  result = Array.new
  compare_words = str.has_at_least_one_word?
  i = 0
  while i < str.length do
    cur_char = str[i..i]
    case cur_char
    when "&"
      # For this we have two situations: a stray char code, and a char code preceding a tag
      next_index = str.index(";", i)
      case str[next_index + 1..next_index + 1] # Check to see if tag
      when "<"
        next_index = str.index(">", i)
      end
      result << str[i..next_index]
      i = next_index
    when "<"
      next_index = str.index(">", i)
      result << str[i..next_index]
      i = next_index
    when " "
      result << cur_char
    else
      if compare_words
        # In here we need to check the above rules again, because tags can be touching regular text
        next_index = i + 1
        next_index = str.index(" ", next_index)
        next_index = str.length if next_index.nil?
        next_index -= 1
        if i < str.length and str[i..next_index].include?("<") # Beginning of a tag
          next_index = str.index(">", i)
        end
        result << str[i..next_index] # We don't want to include the space
        i = next_index
      else
        result << cur_char
      end
    end
    i += 1
  end
  return result
end
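For comparison, the same kind of splitting (tags, character entities, words, and whitespace as separate elements) can be sketched with a single regular expression. This is not the code above but a swapped-in technique: tokenize_html and TOKEN_PATTERN are hypothetical names, it assumes Ruby 1.9 or later so that scan works per character, and unlike the original it always splits per word and never merges an entity with the tag that follows it.

```ruby
# encoding: utf-8
# Hypothetical helper (not from the original code): split an HTML string into
# tags, character entities, words, and single whitespace characters.
# The trailing "." alternative catches stray "&" or "<" characters.
TOKEN_PATTERN = /<[^>]*>|&[^;\s]+;|[^\s<&]+|\s|./m

def tokenize_html(str)
  str.scan(TOKEN_PATTERN)
end

p tokenize_html("e <b><u>Zerg</u></b> a")
# ["e", " ", "<b>", "<u>", "Zerg", "</u>", "</b>", " ", "a"]
```

The resulting arrays could then be handed to Diff::LCS in place of the hand-built ones.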
To clarify, this:
'e Zerg a'
gets turned into this:
[
[0] "e",
[1] "\302",
[2] "\240",
[3] "Z",
[4] "e",
[5] "r",
[6] "g",
[7] "\302",
[8] "\240",
[9] "a"
]
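Update to Ruby 1.9.2 or above (I recommend using RVM). Ruby 1.8.7 treats a string as a plain sequence of bytes, so str[i..i] returns one byte rather than one character. \302\240 is the two-byte UTF-8 encoding of a non-breaking space (U+00A0), which is why it ends up split across two array elements. From 1.9 on, strings are encoding-aware and indexing works per character.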
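To see the difference between the two views of the same string, here is a minimal sketch meant for Ruby 1.9 or later. The variable names are only illustrative, and the byte-wise array is an approximation of what the 1.8.7 loop produces when the string is not made of words, not a reproduction of it.

```ruby
# encoding: utf-8
# "\u00A0" is a non-breaking space; in UTF-8 it is the bytes 0xC2 0xA0 (octal \302 \240).
text = "e\u00A0Zerg\u00A0a"

# Byte-wise view: one element per byte, similar to what str[i..i] yields on 1.8.7.
byte_values = text.bytes.to_a

# Character-wise view: one element per character, which is what the diff needs.
characters = text.chars.to_a

puts byte_values.length                              # 10
puts characters.length                               # 8
puts byte_values[1..2].map { |b| "\\%o" % b }.join   # \302\240
```

If upgrading is not an option, a workaround often used on 1.8.7 was to set $KCODE = 'u' and split with str.scan(/./mu) instead of indexing; treat that as an assumption to verify on your setup rather than a guaranteed fix.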