I have a text like:
I've got a date with this fellow tomorrow. Well me and thousands of others. <br /><br /><img src="http://www.newwest.net/images/thumbnails_feature/barack_obama_westerners.jpg"><br /><br />Tomorrow morning I will be getting up at stupid o'clock and driving up to Manchester, NH to see Barak Obama speak. <br /><br />You all should come too!<br /><br /><a href="http://nh.barackobama.com/manchesterchange">RSVP for the event</a>
I would like to clean it up to:
I’ve got a date with this fellow
tomorrow. Well me and thousands of
others
http://www.newwest.net/images/thumbnails_feature/barack_obama_westerners.jpg
Tomorrow morning I
will be getting up at stupid
o’clock and driving up to
Manchester, NH to see Barak Obama
speak.You all
should come too!
h**p://nh.barackobama.com/manchesterchange RSVP
for the event
I would like to write a Java program to do this. Any pointers/suggestions would be appreciated. The tags aren’t limited to those in the above post; this was just an example.
Thanks!
PS: Replace *’s by t’s in the second hyperlink as Stack Overflow doesn’t allow me to post more than one link.
The simplest way of ‘tidying’ text that contains XML tags is to use a regular expression that matches anything that is a tag (i.e. anything that starts with ‘<’ and ends with ‘>’, plus everything in between). Note that this works whether or not the XML is ‘well-formed’, since it removes every tag regardless of whether opening tags have matching closing tags.
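For example, replacing every match of a pattern such as <[^>]*> with an empty string (in Java, via String.replaceAll) will remove all tags from a given string. The downside is that it won’t preserve the image link or the hyperlink as in your example. Hope this helps though!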
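The tag-stripping idea, and one possible way around the lost links, can be sketched as follows. This is only an illustration under stated assumptions: the class and method names (TagStripper, stripTags, clean) are hypothetical, the link pattern assumes quoted src/href attributes, and <br> tags are assumed to become line breaks. Regular expressions are not a full HTML parser, so unusual markup may slip through.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names and patterns are illustrative assumptions.
final class TagStripper {
    // An <img> or <a> tag with a quoted src/href value; group 1 is the URL.
    private static final Pattern LINK_TAG = Pattern.compile(
            "<(?:img|a)\\b[^>]*?\\b(?:src|href)\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>",
            Pattern.CASE_INSENSITIVE);
    // <br>, <br/> and <br /> in any letter case.
    private static final Pattern BREAK =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);
    // Anything from '<' up to the next '>'.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    // Plain version: removes every tag, including image and hyperlink URLs.
    static String stripTags(String html) {
        return ANY_TAG.matcher(html).replaceAll("");
    }

    // Keeps src/href URLs as text, turns <br> into newlines, then strips the rest.
    static String clean(String html) {
        String text = LINK_TAG.matcher(html).replaceAll(" $1 ");
        text = BREAK.matcher(text).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        // Tidy whitespace: collapse runs of blanks and repeated line breaks.
        return text.replaceAll("[ \\t]+", " ")
                   .replaceAll(" ?\\n ?", "\n")
                   .replaceAll("\\n{2,}", "\n")
                   .trim();
    }
}
```

Calling TagStripper.clean on the post from the question should leave the two URLs in the text while dropping the markup around them. For anything beyond simple posts, a real HTML parser is likely to be more robust than this sketch.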
Edited 11:58 04/04/10: To remove HTML-encoded HTML tags (i.e. anything that starts with ‘&lt;’ and ends with ‘&gt;’), apply the same kind of replacement with a pattern that matches from ‘&lt;’ up to the next ‘&gt;’.
Then, to remove any other HTML-encoded bits like ‘&quot;’ (i.e. anything that starts with & and ends with ; and in between is a valid word without spaces or breaks), use a pattern such as &\w+; in the same way.
If there’s any malformed HTML/XML beyond those, there’s no way of catching it unless it follows a known pattern.
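A minimal sketch of those two replacements, assuming the input still contains the literal entity text (‘&lt;’, ‘&gt;’, ‘&quot;’ and so on). The class and method names are hypothetical, and the exact patterns are an assumption based on the description above rather than the original snippets.

```java
// Hypothetical helper; names and patterns are illustrative assumptions.
final class EntityStripper {
    // Removes encoded tags: everything from "&lt;" up to the next "&gt;".
    // (?s) lets '.' cross line breaks; '.*?' stops at the first "&gt;".
    static String stripEncodedTags(String s) {
        return s.replaceAll("(?s)&lt;.*?&gt;", "");
    }

    // Removes named entities such as &quot; or &amp; ('&', word chars, ';').
    // Numeric entities like &#39; are not matched, because '#' is not a word char.
    static String stripEntities(String s) {
        return s.replaceAll("&\\w+;", "");
    }

    // Order matters: strip encoded tags first, otherwise the "&lt;" and "&gt;"
    // markers are removed as entities and the tag text between them is left behind.
    static String clean(String s) {
        return stripEntities(stripEncodedTags(s));
    }
}
```

Note that this deletes entities rather than decoding them, so an encoded ampersand or quotation mark disappears from the text instead of being turned back into the character.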