I am building an app that will display a monthly journal. The journal has no XML feed; each month they just change the title header and the URL of the PDF. These always sit in the same place in the page source, so my plan is to find all the text inside the
<div class="entry clearfix post"> ... </div>
tag and then extract the first URL. I have parsed XML before, but never HTML. What would be my best option for this?
UPDATE:
The phrase "To Download the PDF, click here" appears at only one point in the page source, so I set up the following scanner:
    NSURL *url = [NSURL URLWithString:@"http://www.thejenkinsinstitute.com/Journal/"];
    NSString *content = [NSString stringWithContentsOfURL:url];
    NSString * aString = content;
    NSMutableArray *substrings = [NSMutableArray new];
    NSScanner *scanner = [NSScanner scannerWithString:aString];
    [scanner scanUpToString:@"<p>To Download the PDF, <a href=\"http://michaelwhitworth.com/wp-content/HE22.pdf\">" intoString:nil]; // Scan all characters before #
    while(![scanner isAtEnd]) {
        NSString *substring = nil;
        [scanner scanString:@"<p>To Download the PDF, <a href=\"" intoString:nil]; // Scan the # character
        if([scanner scanUpToString:@"\"" intoString:&substring]) {
            // If the space immediately followed the #, this will be skipped
            [substrings addObject:substring];
        }
        [scanner scanUpToString:@"#" intoString:nil]; // Scan all characters before next #
    }
    NSLog(@"Here is the Substring%@", substrings);
    // do something with substrings
    [substrings release];
In the console, the first element logged is the URL of the PDF, but the array contains much more. Here is a brief excerpt:
"2012-11-23 15:33:36.383 Jenkins[8306:c07] Here is the Substring(
"http://michaelwhitworth.com/wp-content/HE22.pdf",
"#8220;As the Bible School Goes So Goes the Congregation” by Ira North</a></p>\n<p style=","
What am I doing wrong? How can I get just the URL and nothing more?
I did something similar: I put up a small web service (an API that was basically a simple Ruby app) that scraped the HTML I needed and returned it in a RESTful way. A web service/API is a good idea because if anything changes in the HTML (for example, an element's id), you don't have to update your iOS app just to change the path of the node you're parsing.
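If you stay with NSScanner on the device, the extra entries in the question's output appear to come from the loop: after the first match, the scanner runs on to the next "#" and keeps collecting fragments until it reaches the end of the page. A minimal sketch that scans once and stops might look like the following. FirstPDFLink is a hypothetical helper name, and the code assumes the page is UTF-8 encoded and that the marker text in front of the link stays exactly as shown in the question.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the href that follows the download marker,
// or nil if the marker (or a non-empty href) is not found.
static NSString *FirstPDFLink(NSString *html) {
    // Assumption: the page keeps this exact markup in front of the link.
    // The marker stops at the opening quote, so it does not depend on
    // the PDF URL, which changes every month.
    NSString *marker = @"<p>To Download the PDF, <a href=\"";

    NSScanner *scanner = [NSScanner scannerWithString:html];
    // Do not silently skip whitespace while matching the marker.
    [scanner setCharactersToBeSkipped:nil];

    // Move to the marker. The return value is ignored because it is NO
    // when the marker sits at the very start of the string.
    [scanner scanUpToString:marker intoString:NULL];

    NSString *link = nil;
    // Consume the marker, then read up to the closing quote of the href.
    if ([scanner scanString:marker intoString:NULL] &&
        [scanner scanUpToString:@"\"" intoString:&link]) {
        return link;
    }
    return nil;
}

int main(void) {
    @autoreleasepool {
        NSURL *url = [NSURL URLWithString:@"http://www.thejenkinsinstitute.com/Journal/"];
        NSError *error = nil;
        // Assumption: UTF-8. This call blocks, so keep it off the main
        // thread in a real app.
        NSString *html = [NSString stringWithContentsOfURL:url
                                                  encoding:NSUTF8StringEncoding
                                                     error:&error];
        if (html == nil) {
            NSLog(@"Could not load the page: %@", error);
            return 1;
        }
        NSLog(@"PDF link: %@", FirstPDFLink(html));
    }
    return 0;
}
```

Because there is no loop and no array, there is nothing to release afterwards, and a nil result tells you the page layout has changed.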