I saved the source code of a web page to a string and successfully removed all the HTML tags. However, varying amounts of whitespace are left between paragraphs: sometimes it is only one blank line, other times it is 4 or 5.
Here is what I did:
- (NSString *)parseHTMLText:(NSString *)text {
    NSString *startingPt = @"<!-- (START) Pagination Content Wrapper -->";
    NSString *endingPt = @"<!-- (END) Pagination Content Wrapper -->";
    // isolate body text from entire source code
    NSString *leftTrimmed = [text substringFromIndex:NSMaxRange([text rangeOfString:startingPt])];
    NSString *completeTrimmed = [leftTrimmed substringToIndex:[leftTrimmed rangeOfString:endingPt].location];
    completeTrimmed = [completeTrimmed stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    completeTrimmed = [self removeHTMlTagsFromString:completeTrimmed];
    completeTrimmed = [completeTrimmed stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    return completeTrimmed;
}
- (NSString *)removeHTMlTagsFromString:(NSString *)text {
    // find the first "<"; if there is none, there are no tags left
    NSRange open = [text rangeOfString:@"<"];
    if (open.location == NSNotFound) {
        return text;
    }
    // find the first ">" that comes after that "<"
    NSRange searchRange = NSMakeRange(open.location, text.length - open.location);
    NSRange close = [text rangeOfString:@">" options:0 range:searchRange];
    if (close.location == NSNotFound) {
        return text;
    }
    // remove the tag, then recurse on what is left
    NSRange tag = NSMakeRange(open.location, NSMaxRange(close) - open.location);
    text = [text stringByReplacingCharactersInRange:tag withString:@""];
    return [self removeHTMlTagsFromString:text];
}
I tried this, but it doesn’t work:
    completeTrimmed = [completeTrimmed stringByReplacingOccurrencesOfString:@" " withString:@""];
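If the original HTML had its tags on separate lines, then when you remove all the tags you’ll still have the newlines that separated them. A tag that sat on a line by itself leaves an empty line behind, so several such tags in a row leave several blank lines.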
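If you want to keep the string-based approach, one way to tidy up the leftover newlines is to split the stripped text into lines, drop the empty ones, and join what remains with a single blank line. The sketch below is only an illustration: CollapseBlankLines is a hypothetical helper name, and it assumes every non-empty line should be treated as its own paragraph.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the original code): collapses any run of
// blank or whitespace-only lines into a single blank line between paragraphs.
static NSString *CollapseBlankLines(NSString *text) {
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceCharacterSet];
    NSCharacterSet *newlines = [NSCharacterSet newlineCharacterSet];
    NSMutableArray<NSString *> *kept = [NSMutableArray array];

    for (NSString *line in [text componentsSeparatedByCharactersInSet:newlines]) {
        // Trim spaces and tabs so whitespace-only lines count as empty.
        NSString *trimmed = [line stringByTrimmingCharactersInSet:whitespace];
        if (trimmed.length > 0) {
            [kept addObject:trimmed];
        }
    }
    // Assumption: one blank line between paragraphs is the desired output.
    return [kept componentsJoinedByString:@"\n\n"];
}
```

It could be called on the result of removeHTMlTagsFromString: in place of the final trimming step. Joining with @"\n" instead would remove the blank lines entirely.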
Use a DOM parsing library rather than primitive string functions, and your problem should be solved.
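As a rough sketch of what that could look like, the function below uses libxml2's HTML parser to build a document tree and then reads the text content of the root node. The choice of libxml2 is my assumption (the answer does not name a library), PlainTextFromHTML is a hypothetical name, and the project would need to link against libxml2 and add its header search path. Text from every element, including scripts, ends up in the result, so real code would probably walk the tree and pick only the nodes it wants.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/tree.h>

// Hypothetical helper: parses HTML into a tree with libxml2 and returns the
// concatenated text content of the whole document, without any tags.
static NSString *PlainTextFromHTML(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    int options = HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING;
    htmlDocPtr doc = htmlReadMemory(data.bytes, (int)data.length, NULL, "UTF-8", options);
    if (doc == NULL) {
        return nil;
    }

    NSString *result = @"";
    xmlNodePtr root = xmlDocGetRootElement(doc);
    if (root != NULL) {
        // xmlNodeGetContent returns a newly allocated string that we must free.
        xmlChar *content = xmlNodeGetContent(root);
        if (content != NULL) {
            result = [NSString stringWithUTF8String:(const char *)content] ?: @"";
            xmlFree(content);
        }
    }
    xmlFreeDoc(doc);
    return result;
}
```

Whitespace between elements in the source is still part of the text content, so the result may need its blank lines normalized afterwards.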