I want to crawl a website and store the content on my computer for later analysis. However my OS file system has a limit on the number of sub directories, meaning storing the original folder structure is not going to work.
Suggestions?
Map the URL to some filename so can store flatly? Or just shove it in a database like sqlite to avoid file system limitations?
It all depends on the effective amount of text and/or web pages you intent to crawl. A generic solution is probably to
Such info would be stored in a simple table (maybe with a very few support/related tables) containing fields such as Url, FileName (where you’ll be saving it), Offset in File where stored (the idea is to keep several pages in the same file) date of crawl, size, and a few other fields.
The file name and path matters little (i.e. the path may be shallow and the name cryptic/automatically generated). This name / path is stored in the meta-data. Several crawled pages are stored in the same flat file, to optimize the overhead in the OS to manage too many files. The text itself may be compressed (ZIP etc.) on a per-page basis (there’s little extra compression gain to be had by compressing bigger chunks.), allowing a per-file handling (no need to decompress all the text before it!). The decision to use compression depends on various factors; the compression/decompression overhead is typically relatively minimal, CPU-wise, and offers a nice saving on HD Space and generally disk I/O performance.
The advantage of this approach is that the DBMS remains small, but is available for SQL-driven queries (of an ad-hoc or programmed nature) to search on various criteria. There is typically little gain (and a lot of headache) associated with storing many/big files within the SQL server itself. Furthermore as each page gets processed / analyzed, additional meta-data (such as say title, language, most repeated 5 words, whatever) can be added to the database.