I am parsing a table in an HTML page, but when I display the data, random characters are added, as in this example:
I get Preowiveding, but it should be Preding.
I don't know whether this is a security feature to prevent people from parsing their data.
It is strange because some text is shown correctly and other text is shown wrong…
The page where I get the data from is this one here.
The HTML code of the table looks a bit strange:
<a target='_blank' href='#' class='draggableVerein' >L<span style='display:none;'>i<span style='display:none;'>sivba</span><u></u>vbao</span><u></u>iebenau</a>
Between the text there are span and u tags that seem to do nothing in the browser but produce these errors when parsing.
I use Ben Reeves' HTML parser.
Example:
HTMLNode *node = [rowNode findChildWithAttribute:@"class" matchingName:@"rang" allowPartial:TRUE];
team.rang = [node allContents];
Edit:
Now I tried libXML2 with HPPLE:
NSArray *elements = [xpathParser searchWithXPathQuery:@"//table[2]/tr[5]/td/a"];
// Access the first cell (firstObject returns nil instead of throwing if the query matched nothing)
TFHppleElement *element = [elements firstObject];
NSString *content = [element content];
NSLog(@"content: %@",content);
The output is ersdorf instead of Eggersdorf.
HTML of this example:
<a target='_blank' href='/datenservice/portal/verein/aktuelles.ds?vereinsNr=8070&sektionsId=485215725|665233118344931246&awVerband=ST_' class='draggableVerein' drag_img='/netzwerk/imagedownload/379402779304830775_383470150383145150-60-60-EfcSAtkX.jpg'>Eggersdorf</a>
It is really strange code.
Any tips?
It looks like there are two things going on here.
First, some characters appear to be written in encoded form, for example as a character reference that renders as L instead of a literal L. This may be an attempt at obfuscation.
Second, they are using <span style='display:none'>…</span> to tell the browser not to display certain text. This may be an attempt to introduce invisible garbage into the text. The browser will not display it, but an HTML parser will still spit out that text.
If you want to discard the garbage text, your code will have to process the <span> and </span> tags and automatically discard any text inside an element whose style is set to display:none.
NB: The source for the page you linked to has a copyright statement (in German).
IANAL, but you may need a translator and a lawyer to make sure you are not violating their terms of service by scraping the page.
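The filtering described above (drop any text nested inside an element styled display:none) can be sketched roughly as follows. This is only an illustration of the idea, written in Python with the standard-library html.parser so that it is self-contained; it is not the API of Ben Reeves' parser or of Hpple, and the names VisibleTextExtractor and visible_text are made up for this example. The same depth-counting logic should carry over to walking child nodes in Objective-C. It assumes the hidden markup is reasonably well nested and that the hiding is done with an inline style attribute rather than a stylesheet.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collects text, skipping anything nested inside a display:none element."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &#76; into "L".
        super().__init__(convert_charrefs=True)
        self.hidden_depth = 0  # > 0 while we are inside a hidden element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        # Once inside a hidden element, every nested element is hidden too.
        if self.hidden_depth > 0 or "display:none" in style:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth == 0:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# The anchor element quoted in the question.
sample = ("<a target='_blank' href='#' class='draggableVerein' >L"
          "<span style='display:none;'>i<span style='display:none;'>sivba</span>"
          "<u></u>vbao</span><u></u>iebenau</a>")
print(visible_text(sample))  # Liebenau
```

Counting depth instead of using a simple flag matters here because the hidden spans are nested: a flag would be cleared at the first </span> and the remaining garbage (vbao in the sample) would leak through. If the page's markup is badly unbalanced, the count can drift, so a DOM-based walk may be more robust.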