As an exercise in RSS, I would like to be able to search through pretty much all Unix discussions on this group.
I know enough Python and understand basic RSS, but I am stuck on … how do I grab all messages between particular dates, or at least all messages between Nth recent and Mth recent?
High-level descriptions and pseudo-code are welcome.
Thank you!
EDIT:
I would like to be able to go back more than 100 messages, but I do not want to grab and parse 10 messages at a time, for example by using this URL:
http://groups.google.com/group/comp.unix.shell/topics?hl=en&start=2000&sa=N
There must be a better way.
As Randal mentioned, this violates Google’s ToS. However, as a hypothetical, or for use on another site without these restrictions, you could pretty easily rig something up with urllib and BeautifulSoup. Use urllib to open the page, then use BeautifulSoup to grab all the thread topics (and links, if you want to crawl deeper). You can then programmatically find the link to the next page of results and make another urllib request to fetch page 2, repeating the process until there are no more pages.
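The fetch, extract, follow-next-page loop described above could look roughly like the sketch below. To keep it dependency-free it uses the standard library's html.parser in place of BeautifulSoup, which is a swap on my part; with BeautifulSoup the link extraction would be shorter. The names LinkCollector, crawl, is_topic and is_next are made up for illustration, and the two predicates are assumptions you would have to write yourself after looking at the real page markup. Only run something like this against a site whose terms allow it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url):
    """Download a page with urllib and return it as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, is_topic, is_next, fetch=fetch, max_pages=50):
    """Walk the result pages, returning (title, absolute_url) per topic.

    is_topic(href, text) and is_next(href, text) are hypothetical,
    site-specific predicates supplied by the caller.
    """
    topics, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = LinkCollector()
        parser.feed(fetch(url))
        next_url = None
        for href, text in parser.links:
            if is_topic(href, text):
                topics.append((text, urljoin(url, href)))
            elif next_url is None and is_next(href, text):
                next_url = urljoin(url, href)
        url = next_url  # None when no "next" link was found, ending the loop
    return topics
```

Passing fetch in as a parameter means you can try the loop on saved HTML files first, and the seen set plus max_pages keep it from looping forever if the "next" link ever points back to a page already visited.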
At this point you should have all the raw data; after that, it is just a matter of manipulating the data and implementing your search functionality.
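As a rough idea of the searching step, here is a minimal sketch that filters already-collected messages by keyword and by an optional date range, since that is what the question asked about. It assumes each message has been stored as a dict with "title" and "date" keys; that layout and the function name search are assumptions for illustration, and you would only have dates if you scraped them from the pages.

```python
from datetime import date


def search(messages, keyword, start=None, end=None):
    """Return messages whose title contains keyword, within [start, end]."""
    keyword = keyword.lower()
    hits = []
    for msg in messages:
        if keyword not in msg["title"].lower():
            continue
        if start is not None and msg["date"] < start:
            continue
        if end is not None and msg["date"] > end:
            continue
        hits.append(msg)
    return hits


# Made-up sample data, just to show the call.
sample = [
    {"title": "sed one-liner question", "date": date(2009, 3, 1)},
    {"title": "awk vs sed", "date": date(2009, 6, 15)},
    {"title": "bash arrays", "date": date(2009, 6, 20)},
]
recent_sed = search(sample, "sed", start=date(2009, 6, 1))
```

For a few thousand messages a linear scan like this is probably enough; anything bigger would likely want a proper index.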