I’m first time poster here trying to pick up some Python skills; please be kind to me 🙂
While I’m not a complete stranger to programming concepts (I’ve been messing around with PHP before), the transition to Python has turned out to be somewhat difficult for me. I guess this mostly has to do with the fact that I lack most – if not all – basic understanding of common “design patterns” (?) and such.
Having that said, this is the problem. Part of my current project involves writing a simple scraper by utilizing Beautiful Soup. The data to be processed has a somewhat similar structure to the one which is laid out below.
<table>
<tr>
<td class="date">2011-01-01</td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link"><a href="#">Link</a></td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link"><a href="#">Link</a></td>
</tr>
<tr>
<td class="date">2011-01-02</td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link"><a href="#">Link</a></td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link"><a href="#">Link</a></td>
</tr>
</table>
The main issue is that I simply can’t get my head around how to 1) keep track of the current date (tr->td class=”date”) while 2) looping over the items in the subsequent tr:s (tr class=”item”->td class=”headline” and tr class=”item”->td class=”link”) and 3) store the processed data in an array.
Additionally, all data will be inserted into a database where each entry must contain the following information;
- date
- headline
- link
Note that crud:ing the database is not part of the problem, I only mentioned this in order to better illustrate what I’m trying to accomplish here 🙂
Now, there are many different ways to skin a cat. So while a solution to the issue at hand is indeed very welcome, I’d be extremely grateful if someone would care to elaborate on the actual logic and strategy you would make use of in order to “attack” this kind of problem 🙂
Last but not least, sorry for such a noobish question.
The basic problem is that this table is marked up for looks, not for semantic structure. Properly done, each date and its related items should share a parent. Unfortunately, they don’t, so we’ll have to make do.
The basic strategy is to iterate through each row in the table
.