You may have noticed that we now show an edit summary on Community Wiki posts:
community wiki
220 revisions, 48 users
I’d like to also show the user who ‘most owns’ the final content displayed on the page, as a percentage of the remaining text:
community wiki
220 revisions, 48 users
kronoz 87%
Yes, there could be top (n) ‘owners’, but for now I want the top 1.
Assume you have this data structure, a list of user/text pairs ordered chronologically by the time of the post:
User Id  Post-Text
-------  ---------
12       The quick brown fox jumps over the lazy dog.
27       The quick brown fox jumps, sometimes.
30       I always see the speedy brown fox jumping over the lazy dog.
Which of these users most ‘owns’ the final text?
I’m looking for a reasonable algorithm — it can be an approximation, it doesn’t have to be perfect — to determine the owner. Ideally expressed as a percentage score.
Note that we need to factor in edits, deletions, and insertions, so the final result feels reasonable and right. You can use any Stack Overflow post with a decent revision history (not just retagging, but frequent post body changes) as a test corpus. Here’s a good one, with 15 revisions from 14 different authors. Who is the ‘owner’?
https://stackoverflow.com/revisions/327973/list
Click ‘view source’ to get the raw text of each revision.
I should warn you that a pure algorithmic solution might end up being a form of the Longest Common Substring Problem. But as I mentioned, approximations and estimates are fine too if they work well.
Solutions in any language are welcome, but I prefer solutions that:
- Are fairly easy to translate into C#.
- Are free of dependencies.
- Put simplicity before efficiency.
It is extraordinarily rare for a post on SO to have more than 25 revisions, so efficiency is not a big concern. But the result should ‘feel’ accurate, so that if you eyeballed the edits you’d agree with the final decision. I encourage you to test your algorithm out on Stack Overflow posts with revision histories and see if you agree with the final output.
I have now deployed the following approximation, which you can see in action for every new saved revision on Community Wiki posts:
- do a line based diff of every revision where the body text changes
- sum the insertion and deletion lines for each revision as ‘editcount’
- each userid gets sum of ‘editcount’ they contributed
- first revision author gets 2 × ‘editcount’ as initial score, as a primary authorship bonus
- to determine final ownership percentage: each user’s edited line count total divided by total number of edited lines in all revisions
(There are also some guard clauses for common simple conditions like 1 revision, only 1 author, et cetera. The line-based diff makes it fairly speedy to recalculate for all revisions; in a typical case of say 10 revisions it’s ~50ms.)
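To make the steps above concrete, here is a minimal Python sketch of that kind of line-based scoring. It is not the deployed code: the function names `edit_count` and `ownership` are made up for illustration, `difflib` stands in for whatever diff the site uses, and two details are assumptions on my part, namely that the first revision's ‘editcount’ is its line count and that the final percentages are normalized over all scores including the bonus.

```python
import difflib
from collections import defaultdict


def edit_count(old_text, new_text):
    """Count inserted plus deleted lines between two revision bodies."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    count = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # (i2 - i1) lines removed from the old body, (j2 - j1) lines added
            count += (i2 - i1) + (j2 - j1)
    return count


def ownership(revisions):
    """revisions: chronological list of (user_id, body_text) pairs.

    Returns {user_id: percentage of total edit score}.
    """
    if not revisions:
        return {}
    first_user, first_text = revisions[0]
    # Guard clause: a single revision or a single author owns everything.
    if len({user for user, _ in revisions}) == 1:
        return {first_user: 100.0}

    scores = defaultdict(float)
    # Assumption: the first revision counts as pure insertion, doubled as the
    # primary authorship bonus.
    scores[first_user] += 2 * edit_count("", first_text)
    for (_, prev_text), (user, text) in zip(revisions, revisions[1:]):
        if text != prev_text:  # skip revisions where the body did not change
            scores[user] += edit_count(prev_text, text)

    total = sum(scores.values())
    if total == 0:
        return {first_user: 100.0}
    return {user: 100.0 * score / total for user, score in scores.items()}


fox = [
    (12, "The quick brown fox jumps over the lazy dog."),
    (27, "The quick brown fox jumps, sometimes."),
    (30, "I always see the speedy brown fox jumping over the lazy dog."),
]
print(ownership(fox))
```

On the three one-line fox revisions from earlier, every edit replaces the only line (one deletion plus one insertion), so each user ends up with one third of the score regardless of how much of the sentence they actually changed.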
This works fairly well in my testing. It does break down a little when you have small 1 or 2 line posts that several people edit, but I think that’s unavoidable. I’m accepting Joel Neely’s answer as closest in spirit to what I went with, and I’ve upvoted everything else that seemed workable.
Saw your tweet earlier. From the display of the 327973 link, it appears you already have a single-step diff in place. Based on that, I’ll focus on the multi-edit composition:
A, the original poster, owns 100% of the post.
When B, a second poster, makes edits such that e.g. 90% of the text is unchanged, the ownership is A:90%, B:10%.
Now C, a third party, changes 50% of the text. (A:45%, B:5%, C:50%)
In other words, when a poster makes edits such that x% is changed and y = (100-x)% is unchanged, then that poster now owns x% of the text and all previous ownership is multiplied by y%.
To make it interesting, now suppose…
A makes a 20% edit. Then A owns a ‘new’ 20%, and the residual ownerships are now multiplied by 80%, leaving (A:36%, B:4%, C:40%). The ‘net’ ownership is therefore (A:56%, B:4%, C:40%).
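The rule above fits in a few lines. This is only a sketch: `apply_edit` is a hypothetical helper name, and the changed fraction is taken as a given input (in practice it would come from whatever single-step diff is already in place).

```python
def apply_edit(ownership, editor, changed_fraction):
    """Return new ownership shares after `editor` changes part of the text.

    ownership: {user: share}, shares summing to 1.0
    changed_fraction: portion of the text changed by this edit, 0.0 to 1.0
    """
    if not 0.0 <= changed_fraction <= 1.0:
        raise ValueError("changed_fraction must be between 0 and 1")
    kept = 1.0 - changed_fraction
    # Every existing share (including the editor's own) shrinks by the kept factor.
    updated = {user: share * kept for user, share in ownership.items()}
    # The editor then gains the changed portion.
    updated[editor] = updated.get(editor, 0.0) + changed_fraction
    return updated


shares = {"A": 1.0}
shares = apply_edit(shares, "B", 0.10)  # A: 90%, B: 10%
shares = apply_edit(shares, "C", 0.50)  # A: 45%, B: 5%, C: 50%
shares = apply_edit(shares, "A", 0.20)  # A: 56%, B: 4%, C: 40%
print({user: round(100 * share) for user, share in shares.items()})
```

Each step needs only the previous shares and the size of the latest change, which is what keeps this version cheap.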
Applying this to your specimen (327973) with everything rounded to the nearest percent:
Version 0: The original post.
Version 1: Your current diff tool shows pure addition of text, so all those characters belong to the second poster.
Version 2: The diff shows replacement of a word. The new word belongs to the third poster, and the remaining text belongs to the prior posters.
Version 3: Tag-only edit. Since your question was about the text, I’m ignoring the tags.
Version 4: Addition of text.
I hope that’s enough to give the sense of this proposal. It does have a couple of limitations, but I’m sliding these in under your statement that an approximation is acceptable. 😉
1. It spreads the effect of each change across all prior owners by brute force. If A posts, B does a pure addition, and C edits half of what B added, this simplistic approach just applies C’s ownership across the entire post, without trying to parse out which prior ownership was changed the most.
2. It accounts for additions or changes, but doesn’t give any ownership credit for deletion, because the deleter adds 0% to the remaining text. You can either regard this as a bug or a feature. I chose door number 2.
Update: A bit more about issue #1 above. I believe that fully tracking the ownership of the part of a post that is edited would require one of two things (the margin of the web page is not big enough for a formal proof ;-):
Changing the way text is stored to reflect ownership of individual portions of the text (e.g. A owns words 1-47, B owns words 48-59, A owns words 60-94,…), applying the ‘how much remains’ approach in my proposal to each portion, and updating the portion-ownership data.
Considering all versions from first to current (in effect, recomputing the portion-ownership data on the fly).
So this is a nice example of a trade-off between a quick-and-dirty approximation (at the cost of precision), a change to the entire database (at the cost of space), or every calculation having to look at the entire history (at the cost of time).
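For what it's worth, the "look at the entire history" option could be sketched roughly as below. This is an illustration under assumptions, not a worked-out design: the names `word_owners` and `ownership_percentages` are invented here, words are split on whitespace, and Python's `difflib.SequenceMatcher` stands in for the diff. Surviving words keep their owner, inserted or replaced words go to the editor, and deletions earn nothing, as in the simpler proposal.

```python
import difflib
from collections import Counter


def word_owners(revisions):
    """Replay chronological (user, text) revisions.

    Returns a list naming the owner of each word in the final text.
    """
    words, owners = [], []
    for user, text in revisions:
        new_words = text.split()
        new_owners = []
        matcher = difflib.SequenceMatcher(None, words, new_words, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged words keep whoever owned them before.
                new_owners.extend(owners[i1:i2])
            else:
                # Inserted or replaced words belong to this editor;
                # a pure deletion has j1 == j2 and adds nothing.
                new_owners.extend([user] * (j2 - j1))
        words, owners = new_words, new_owners
    return owners


def ownership_percentages(revisions):
    """Percentage of the final text's words owned by each user."""
    owners = word_owners(revisions)
    if not owners:
        return {}
    counts = Counter(owners)
    return {user: 100.0 * n / len(owners) for user, n in counts.items()}


history = [
    ("A", "one two three four"),
    ("B", "one two three four five"),
    ("C", "one TWO three four five"),
]
print(word_owners(history))            # ['A', 'C', 'A', 'A', 'B']
print(ownership_percentages(history))  # A: 60.0, C: 20.0, B: 20.0
```

Here C's replacement is charged only against the word it actually touched, which is the precision the quick approximation gives up, at the price of diffing every revision on each calculation.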