Let’s say that I have written a custom e-mail management application for the company that I work for. It reads e-mails from the company’s support account and stores cleaned-up, plain text versions of them in a database, doing other neat things like associating it with customer accounts and orders in the process. When an employee replies to a message, my program generates an e-mail that is sent to the customer with a formatted version of the discussion thread. If the customer responds, the app looks for a unique number in the subject line to read the incoming message, strip out the previous discussion, and add it as a new item in the thread. For example:
This is a message from Contoso customer service. Recently, you requested customer support. Below is a summary of your request and our reply. -------------------------------------------------------------------- Contoso (Fred) on Tuesday, December 30, 2008 at 9:04 a.m. -------------------------------------------------------------------- John: I've modified your address. You can confirm my work by logging into 'Your Account' on our Web site. Your order should ship out today. Thanks for shopping at Contoso. -------------------------------------------------------------------- You on Tuesday, December 30, 2008 at 8:03 a.m. -------------------------------------------------------------------- Oops, I entered my address incorrectly. Can you change it to Fred Smith 123 Main St Anytown, VA 12345 Thanks! -- Fred Smith Contoso Product Lover
Generally, this all works great, but there’s one area that I’ve kind of putting off cleaning up for a while now, and it deals with text wrapping. In order to generate the pretty e-mail format like the one above, I need to re-wrap the text that the customer originally sent.
I’ve written an algorithm that does this (though looking at the code, I’m not entirely sure how it works anymore–it could use some refactoring). But it can’t distinguish between a hard-wrap newline, an ‘end of paragraph’ newline, and a ‘semantic’ newline. For example, a hard-wrap newline is one that the e-mail client inserted within a paragraph to wrap a long line of text, say, at 79 columns. An end of paragraph newline is one that the user added after the last sentence in a paragraph. And a semantic newline would be something like the br tag, such as the address that the Fred typed above.
My algorithm instead only sees two newlines in a row as indicating a new paragraph, so it would make the customer’s e-mail be formatted something like the following:
Oops, I entered my address incorrectly. Can you change it to Fred Smith 123 Main St Anytown, VA 12345 Thanks! -- Fred Smith Contoso Product Lover
Whenever I try to write a version that would re-wrap this text as intended, I basically hit a wall in that I need to know the semantics of the text, the difference between a ‘hard-wrap’ newline and a ‘I really meant it like a br‘-type newline, such as in the customer’s address. (I use two newlines in a row to determine when to start a new paragraph, which coincides with how the majority of people seem to actually type e-mails.)
Anyone have an algorithm that can re-wrap the text as intended? Or is this implementation ‘good enough’ when weighing the complexity of any given solution?
Thanks.
You could try to check if a newline has been inserted to keep the line length below a maximum (aka hard wrap): Just check for the longest line in the text. Then, for any given line, you append the first word of the following line to it. If the resulting line exceeds the maximum length, the line break probably was a hard wrap.
Even simpler you might just consider all breaks in
(maxlength - 15) <= length <= maxlengthas being hardwraps (with 15 just being an educated guess). This would certainly filter out intentional breaks as in addresses and stuff, and any missed break in this range wouldn’t influence the result too badly.