I am interested in knowing whether there are any open source projects (preferably in Python) that can be used to download (crawl?) the mailing list archives of open source projects such as Lucene/Hadoop (for example, http://mail-archives.apache.org/mod_mbox/lucene-java-user/). I am especially looking for a crawler/downloader customized for (Apache) mailing list archives, not a generic crawler such as Scrapy. Any pointers are highly appreciated.
Thank you.
There are usually facilities for downloading mbox files. With the link you provided, you can append the mbox name to get the mail archive directly. For example, the mbox for October 2012:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201210.mbox
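Here is a rough sketch of how that could be scripted with only the standard library. It assumes every month follows the same `YYYYMM.mbox` naming as the URL above; the helper names (`mbox_url`, `month_range`, `download_mbox`) are made up for illustration and are not part of any existing tool.

```python
import urllib.request
from pathlib import Path

# Base URL taken from the question; other lists are assumed to follow the same layout.
BASE = "http://mail-archives.apache.org/mod_mbox/lucene-java-user/"


def mbox_url(base, year, month):
    """Build the URL of one monthly archive, e.g. .../201210.mbox."""
    return f"{base.rstrip('/')}/{year:04d}{month:02d}.mbox"


def month_range(start, end):
    """Yield (year, month) pairs from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def download_mbox(base, year, month, dest_dir="."):
    """Fetch one monthly mbox and save it as YYYYMM.mbox in dest_dir."""
    url = mbox_url(base, year, month)
    dest = Path(dest_dir) / f"{year:04d}{month:02d}.mbox"
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        out.write(resp.read())
    return dest


# Example usage (performs network requests, so it is left commented out):
# for y, m in month_range((2012, 1), (2012, 12)):
#     download_mbox(BASE, y, m, dest_dir="archives")
```

Months with no archive will presumably raise `urllib.error.HTTPError`, so you may want to catch that and skip them, and add a short pause between requests to be polite to the server.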
So getting the archives programmatically is pretty straightforward. Once you have them: