It seems that someone must have done this already, but I cannot find the end product I’m looking for.
Using a version control system for text is laborious. You need newline characters at the end of each sentence, and even in the midst of long sentences. Looking at the git source, it seems that by changing a few routines that check for '\n', it should be possible to have git (or any other version control system) match '\n' or the pattern '\\.\s'. It is, however, a task that needs to be done meticulously, or I can see things breaking pretty badly.
Does anyone know of someone who has already done this, or of any alternatives?
Thanks!
Any version control system should be able to handle prose. The question is how efficiently it can do so.
The `git diff` command uses something like `diff -u` to display the differences between two versions of a file. If the file consists of text with very long lines (i.e., many characters between '\n' characters), then it might have some difficulty displaying the differences meaningfully; it might show two 5000-character lines with only a single character change.
But that doesn’t necessarily imply that that’s how `git` stores the files. I’m not intimately familiar with git’s internal storage format, but my understanding is that it does reasonably well with binary files, which could have many megabytes of data with no '\n' characters.
Note that some older version control systems (SCCS, RCS) probably do store differences between versions on a line-by-line basis. But even for such systems, at worst you’d be storing a full copy of each version plus some overhead. The system should still be able to work properly.
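To make the long-line display problem concrete, here is a minimal Python sketch. It uses the standard library's `difflib.unified_diff` as a stand-in for `diff -u`; it is not what git runs internally. The sample paragraph and the helper name `line_diff` are made up for illustration.

```python
import difflib

def line_diff(old: str, new: str) -> list[str]:
    """Return unified-diff lines for two texts, comparing line by line."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old", tofile="new", lineterm="",
    ))

# A whole paragraph stored as one long line, as prose often is.
old = "The quick brown fox jumps over the lazy dog. " * 20
new = old.replace("lazy", "sleepy", 1)  # change a single word

diff = line_diff(old, new)
# Keep only removed/added content lines, skipping the ---/+++ file headers.
changed = [
    line for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("---", "+++"))
]
for line in changed:
    # Report the marker and the length of the line it flags as changed.
    print(line[0], len(line) - 1, "characters")
```

Although only one word differs, the line-based diff reports the entire paragraph as removed and re-added, which is the readability problem rather than a storage problem.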
Note that `git diff --word-diff` should at least partially work around the problem of comparing versions.
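As a rough illustration of what a word-level comparison does, the sketch below diffs two strings word by word with Python's `difflib.SequenceMatcher`. It is an approximation for explanation only, not git's algorithm; the function name `word_diff` is hypothetical, and the `[-removed-]` / `{+added+}` markers are borrowed from the plain output style of `git diff --word-diff`.

```python
import difflib

def word_diff(old: str, new: str) -> str:
    """Mark word-level changes as [-removed-] and {+added+}."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged words are passed through as-is.
            out.extend(a[i1:i2])
        else:
            if i1 != i2:  # words present only in the old text
                out.append("[-" + " ".join(a[i1:i2]) + "-]")
            if j1 != j2:  # words present only in the new text
                out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)

print(word_diff("the lazy dog sleeps", "the sleepy dog sleeps"))
# prints: the [-lazy-] {+sleepy+} dog sleeps
```

Because the comparison unit is a word rather than a line, a one-word edit in a 5000-character paragraph shows up as a one-word change. Note that this sketch collapses whitespace, so it does not preserve the original line breaks.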