How can I format natural-language text while taking punctuation into account? Vim's built-in gq command and command-line tools such as fmt or par break lines without regard to punctuation. Here is an example.
fmt -w 40 does not give what I want:
we had everything before us, we had
nothing before us, we were all going
direct to Heaven, we were all going
direct the other way
smart_formatter -w 40 would give:
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way
Of course, when no punctuation mark is found within the given text width, it can fall back to the standard text formatting behavior.
The reason why I want this is to get a meaningful diff of text where I can spot which sentence or subsentence changed.
Here is a not very elegant but working method I finally came up with. Suppose a line break at a punctuation mark is worth 6 characters. That means I’ll accept a result that is more ragged but has more lines ending in a punctuation mark, as long as the extra “raggedness” is less than 6 characters. For example, this is OK (“raggedness” is 3 characters).
This is not OK (“raggedness” is more than 6 characters).
The method is to add 6 dummy characters after each punctuation mark, format the text, then remove the dummy characters.
Here is the code for this
I used " _" (space + underscore) as a pair of dummy characters, assuming this pair does not occur in the text. The result looks quite good.
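For readers who want to experiment with the padding trick, here is a minimal Python sketch of the same idea. It is not the code from this answer: the function name punctuation_aware_fill, the set of punctuation marks, and the use of Python's textwrap module (a greedy wrapper, so its line breaks will differ from those of fmt or par) are all assumptions made for illustration.

```python
import re
import textwrap

# Six dummy characters: three (space + underscore) pairs.
# Assumption: the text contains no standalone "_" words.
PAD = " _" * 3


def punctuation_aware_fill(text, width=40, pad=PAD):
    """Hypothetical helper: wrap text, favouring breaks after punctuation."""
    # 1. Add the dummy words after every punctuation mark that is
    #    followed by whitespace (the set of marks is an assumption).
    padded = re.sub(r"([,.;:!?])(?=\s)", lambda m: m.group(1) + pad, text)

    # 2. Wrap with an ordinary formatter (greedy, unlike fmt or par).
    wrapped = textwrap.fill(padded, width=width)

    # 3. Remove the dummy words again and drop lines that held only padding.
    lines = []
    for line in wrapped.splitlines():
        words = [w for w in line.split() if w != "_"]
        if words:
            lines.append(" ".join(words))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "we had everything before us, we had nothing before us, "
        "we were all going direct to Heaven, we were all going "
        "direct the other way"
    )
    print(punctuation_aware_fill(sample, width=30))
```

With a greedy wrapper the padding only helps when a punctuation mark falls within roughly the padding length of the right margin; in that case the dummy words use up the rest of the line and the next real word starts a new one. Lines that begin with leftover padding come out a little shorter than the requested width after cleanup.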