I'm trying to write a regular expression in Ruby 1.8.7 that removes the quoted thread from emails. To do so, I need to remove all content between the mail boundaries that matches the thread pattern. For example, for a message sent from Mac Mail I would need to remove the quoted thread that starts at the "On ... wrote:" line in each part of the message below (the sample HTML is simplified to save space; the HTML in real mails is far less succinct):
From: XXXX
... mail headers ...
Content-Type: multipart/alternative; boundary="Apple-Mail=_EFA7D6C2-C778-4C8E-AA13-C97DF1FA9036"
... more mail headers ...

--Apple-Mail=_EFA7D6C2-C778-4C8E-AA13-C97DF1FA9036
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=us-ascii

New comment added from Mac Mail

On 12/06/2012, at 12:51, XXXX@example.com wrote:

> Thread
> text
> to be
> removed
>=20

--Apple-Mail=_EFA7D6C2-C778-4C8E-AA13-C97DF1FA9036
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=us-ascii

<html>... lots of HTML...
<span>On 12/06/2012, at 12:51, XXXX@example.com wrote:</span>
<span> Thread </span>
<span> text </span>
<span> to be </span>
<span> removed </span>
<span>=20 </span>
</html>=
--Apple-Mail=_EFA7D6C2-C778-4C8E-AA13-C97DF1FA9036--
The regular expression I thought would capture the required text is:
--Apple-Mail=_EFA7D6C2-C778-4C8E-AA13-C97DF1FA9036.+?(\bOn.+?)(?!--Apple-Mail=_EFA7D6C2-C778-4C8E-AA13-C97DF1FA9036)
But this does not work: it only captures from the boundary up to the first "On ".
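A likely explanation, sketched here under assumptions: the lazy `.+?` inside the group is satisfied after a single character, and the negative look-ahead only asserts that a boundary does not start at that exact position, so the match ends right after "On ". The shortened boundary `Apple-Mail=_X`, the sample text and the `/m` flag (so that `.` also matches newlines) are stand-ins for illustration, not taken from the post.

```ruby
# Shortened, hypothetical boundary and message used only for this demo.
boundary = Regexp.escape("Apple-Mail=_X")
text = "--Apple-Mail=_X\nbody\nOn 12/06 wrote:\n> quoted\n--Apple-Mail=_X--\n"

# Same shape as the expression above; /m is an assumption.
failing = /--#{boundary}.+?(\bOn.+?)(?!--#{boundary})/m

# String#[] with a regexp and a group index returns that capture group.
captured = text[failing, 1]

# The lazy quantifier stops after one character because the look-ahead
# already succeeds there (no boundary starts at that position).
puts captured.inspect
```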
OK, so the solution for this turned out to be pretty simple. I ended up with an expression like the following:
No need to perform a look-ahead/behind for this.
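One way to do this without look-around, as a rough sketch and not necessarily the exact expression used: match lazily from the "On ... wrote:" line up to the next boundary, capture the boundary, and put it back in the replacement. The helper name `strip_thread`, the "wrote:" anchor and the `/m` flag are assumptions made for illustration.

```ruby
# Hypothetical helper: the name, the "wrote:" anchor and the /m flag are
# assumptions for illustration, not the expression from the post.
def strip_thread(mail, boundary)
  b = Regexp.escape(boundary)
  # /m lets "." match newlines. The lazy ".*?" stops at the next boundary,
  # which is captured so the replacement can keep it.
  pattern = /\bOn [^\n]*wrote:.*?(--#{b})/m
  mail.gsub(pattern) { $1 }
end

boundary = "Apple-Mail=_EFA7D6C2-C778-4C8E-AA13-C97DF1FA9036"
mail = [
  "--#{boundary}",
  "Content-Type: text/plain; charset=us-ascii",
  "",
  "New comment added from Mac Mail",
  "",
  "On 12/06/2012, at 12:51, XXXX@example.com wrote:",
  "",
  "> Thread",
  "> text",
  ">=20",
  "",
  "--#{boundary}--",
  ""
].join("\n")

cleaned = strip_thread(mail, boundary)
puts cleaned
```

In an HTML part this sketch would also drop everything after the "On ... wrote:" text up to the boundary, including closing tags, so real messages would probably need a more specific pattern for that part.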