I’m writing a webcrawler in Python that will store the HTML code of a large set of pages in a MySQL database. I’d like to make sure my methods of storage and processing are optimal before I begin processing data. I would like to:
-
Minimize storage space used in the database – possibly by minifying HTML code, Huffman encoding, or some other form of compression. I’d like to maintain the possibility of fulltext searching the field – I don’t know if compression algorithms like Huffman encoding will allow this.
-
Minimize the processor usage necessary to encode and store large volumes of rows.
Does anyone have any suggestions or experience in this or a similar issue? Is Python the optimal language to be doing this in, given that it’s going to require a number of HTTP requests and regular expressions plus whatever compression is optimal?
If you don’t mind the HTML being opaque to MySQL, you can use the COMPRESS function to store the data and UNCOMPRESS to retrieve it. You won’t be able to use the HTML contents in a WHERE clause (using, e.g., LIKE).