I was using Python and regular expressions to find things in an HTML document and, unlike what most people say, it was working perfectly, even though things could go wrong. Anyway, I decided Beautiful Soup would be faster and easier, but I don’t really know how to make it do what I did with regex, which was fairly easy but messy.
I am using this page’s HTML:
http://www.locationary.com/places/duplicates.jsp?inPID=1000000001
EDIT:
Here is the HTML for the main place:
<tr>
<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel </td>
<td class="Large Bold" width="100%">80 Riverside Drive, New York, New York, United States</td>
<td class="Large Bold" nowrap="nowrap" width="55"> <input name="selectCheckBox" type="checkbox" checked="checked" disabled="disabled" />Yes
</td>
</tr>
Example of the first similar place:
<td class="" nowrap="nowrap"><a href="http://www.locationary.com/place/en/US/New_York/New_York/54_Riverside_Dr_Owners_Corp-p1009633680.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>
<td width="100%"> 54 Riverside Dr, New York, New York, United States</td>
<td nowrap="nowrap" width="55">
When my program gets it and I use Beautiful Soup to make it more readable, the HTML comes out a little different than Firefox’s “view source”…I don’t know why.
These were my regular expressions:
PlaceName = re.findall(r'"nowrap">(.*) </td>', main)
PlaceAddress = re.findall(r'width="100%">(.*)</td>\n<td class="Large Bold"', main)
cNames = re.findall(r'target="_blank">(.*)</a></td>\n<td width="100%"> ', main)
cAddresses = re.findall(r'<td width="100%"> (.*)</td>\n<td nowrap="nowrap" width="55">', main)
cURLs = re.findall(r'<td class="" nowrap="nowrap"><a href="(.*)" target="_blank">', main)
The first two are for the main place and address. The rest are for the information of the rest of the places. After I made these, I decided I only wanted the first 5 results for cNames, cAddresses, and cURLs, because I don’t need 91 or whatever it was.
I don’t know how to find this kind of information with BS. All I can do with BS is find specific tags and do things with them. This HTML is kind of complicated because all of the info I want is in tables, and the table tags are kind of a mess too…
How do you get that info, and limit it only to the first 5 results or so?
Thanks.
People say that you can’t parse HTML with regular expressions for a reason, but here’s a simple reason that applies to your regexps: you’ve got \n and exact spacing in them, and those can and will change at random on the page(s) you are trying to parse. When that happens, your regexps won’t match and your code will stop working.
However, the task you are trying to do is really simple. A three-line script yields all the anchor tags, no matter where they appear in the deeply nested structure of this page. Here are lines I excerpted from the output of that three-line script:
It is hard to imagine less code that could do that much work for you.
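To go from "all the anchor tags" to the specific fields asked about, here is a minimal sketch, not the original script. It assumes the modern bs4 package with the built-in "html.parser" backend, and it assumes that every similar place is an anchor with target="_blank" whose address sits in the next td of the same row, as in the excerpts from the question. The names extract_places and SAMPLE_HTML are made up for illustration, and the sample markup is just the question's excerpts wrapped in a table.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the excerpts in the question; the real page
# has many more rows, and its structure may differ from this assumption.
SAMPLE_HTML = """
<table>
<tr>
<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel </td>
<td class="Large Bold" width="100%">80 Riverside Drive, New York, New York, United States</td>
</tr>
<tr>
<td class="" nowrap="nowrap"><a href="http://www.locationary.com/place/en/US/New_York/New_York/54_Riverside_Dr_Owners_Corp-p1009633680.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>
<td width="100%"> 54 Riverside Dr, New York, New York, United States</td>
</tr>
</table>
"""


def extract_places(html, limit=5):
    """Return (main_name, main_address, similar), where similar is a list
    of (name, address, url) tuples holding at most `limit` entries."""
    soup = BeautifulSoup(html, "html.parser")

    # Main place: the first two cells whose class list includes "Large".
    main_cells = soup.find_all("td", class_="Large", limit=2)
    main_name = main_cells[0].get_text(strip=True)
    main_address = main_cells[1].get_text(strip=True)

    similar = []
    # Assumption: each similar place is an <a target="_blank"> inside a <td>,
    # and the next <td> in that row holds its address.
    for link in soup.find_all("a", target="_blank", limit=limit):
        address_cell = link.find_parent("td").find_next_sibling("td")
        similar.append(
            (link.get_text(strip=True),
             address_cell.get_text(strip=True),
             link["href"])
        )
    return main_name, main_address, similar


if __name__ == "__main__":
    name, address, similar = extract_places(SAMPLE_HTML)
    print(name, "|", address)
    for place in similar:
        print(place)
```

Because the lookups go through the parsed tree rather than the raw text, they do not care whether the cells are separated by newlines, spaces, or nothing at all. The limit argument to find_all stops the search after that many matches, which covers the "first 5 results" part of the question.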