I am currently using Hpple to parse HTML, like so:
// Parse the HTML string and collect every text node found inside a <p> element.
TFHpple *htmlParser = [TFHpple hppleWithHTMLData:[currentString dataUsingEncoding:NSUTF8StringEncoding]];
NSString *paragraphsXpathQuery = @"//p//text()";
NSArray *paragraphNodes = [htmlParser searchWithXPathQuery:paragraphsXpathQuery];
if ([paragraphNodes count] > 0) {
    NSMutableArray *tempArray = [NSMutableArray array];
    // Each text node becomes its own array entry, including text nested in child tags.
    for (TFHppleElement *element in paragraphNodes) {
        [tempArray addObject:[element content]];
    }
    article.paragraphs = tempArray;
}
This way I get an array of paragraphs, and I can use NSString *result = [myArray componentsJoinedByString:@"\n\n"]; to combine it into a single body of text with line breaks.
However, if a paragraph contains other tags, the text inside them is returned as separate entries and ends up on its own line, so from HTML like this:
<p>I went to the <a href="blablabla.html">shop</a> to get some milk!</p>
<p>It was awesome!</p>
I get this:
I went to the
shop
to get some milk!
It was awesome!
And of course I would like to get this (ignore other tags inside the p tag):
I went to the shop to get some milk!
It was awesome!
Can you help me out?
Don't forget to include this in your code: #import "RegexKitLite.h". The library can be downloaded here: http://regexkit.sourceforge.net/#Downloads
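For the problem described in the question, one possible approach that needs no regex library is to collect the text nodes one paragraph at a time, so that the text of nested tags is merged back into its paragraph. The following is only a sketch: ParagraphStringsFromHTML is a hypothetical helper name, not part of Hpple, and it uses only the Hpple calls that already appear in the question (hppleWithHTMLData:, searchWithXPathQuery: and content). It also assumes that content returns text nodes with their surrounding whitespace intact; if your Hpple version trims them, append a space between fragments instead.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper (not part of Hpple): returns one NSString per <p> element,
// with the text of nested tags such as <a> merged into that paragraph.
static NSArray *ParagraphStringsFromHTML(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    TFHpple *parser = [TFHpple hppleWithHTMLData:data];

    // Only the number of paragraphs is needed here, not the elements themselves.
    NSUInteger paragraphCount = [[parser searchWithXPathQuery:@"//p"] count];
    NSMutableArray *paragraphs = [NSMutableArray array];

    // XPath positions are 1-based: (//p)[1] is the first paragraph in the document.
    for (NSUInteger i = 1; i <= paragraphCount; i++) {
        NSString *query = [NSString stringWithFormat:@"(//p)[%lu]//text()", (unsigned long)i];
        NSMutableString *text = [NSMutableString string];

        // Concatenate every text node under this paragraph, in document order.
        // Assumption: fragments keep their leading and trailing spaces.
        for (TFHppleElement *node in [parser searchWithXPathQuery:query]) {
            NSString *fragment = [node content];
            if (fragment != nil) {
                [text appendString:fragment];
            }
        }

        // Skip paragraphs that contain no text at all (for example, only an image).
        if ([text length] > 0) {
            [paragraphs addObject:text];
        }
    }
    return paragraphs;
}
```

With this in place, article.paragraphs = ParagraphStringsFromHTML(currentString); could replace the loop from the question, and componentsJoinedByString:@"\n\n" can be applied to the result as before. Running one XPath query per paragraph is not the fastest option, but it keeps the sketch independent of how a given Hpple version exposes child nodes.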