I’m writing a webcrawler in Python that will store the HTML code of a

Asked: May 26, 2026 (2026-05-26T18:58:45+00:00)

I’m writing a webcrawler in Python that will store the HTML code of a large set of pages in a MySQL database. I’d like to make sure my methods of storage and processing are optimal before I begin processing data. I would like to:

Minimize storage space used in the database – possibly by minifying HTML code, Huffman encoding, or some other form of compression. I’d like to maintain the possibility of fulltext searching the field – I don’t know if compression algorithms like Huffman encoding will allow this.
Minimize the processor usage necessary to encode and store large volumes of rows.

Does anyone have any suggestions or experience in this or a similar issue? Is Python the optimal language to be doing this in, given that it’s going to require a number of HTTP requests and regular expressions plus whatever compression is optimal?

1 Answer

Editorial Team · Answer 1 · 2026-05-26T18:58:46+00:00

Added an answer on May 26, 2026 at 6:58 pm

If you don’t mind the HTML being opaque to MySQL, you can use the COMPRESS function to store the data and UNCOMPRESS to retrieve it. You won’t be able to use the HTML contents in a WHERE clause (using, e.g., LIKE).
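
As a rough illustration of the trade-off, here is a minimal Python sketch that does the compression on the crawler side with the standard zlib module instead of inside MySQL. The helper names pack_html and unpack_html, and the table and column names in the comments, are hypothetical and not part of the answer above. The compressed bytes would go into a BLOB column and are just as opaque to LIKE or fulltext search as COMPRESS output.

```python
import zlib


def pack_html(html: str, level: int = 6) -> bytes:
    """Compress a page before storing it (hypothetical helper).

    Lower levels cost less CPU per row; higher levels save more space.
    """
    return zlib.compress(html.encode("utf-8"), level)


def unpack_html(blob: bytes) -> str:
    """Reverse pack_html after reading the BLOB back from the database."""
    return zlib.decompress(blob).decode("utf-8")


# Repetitive markup, which is typical of HTML, compresses well.
sample = "<html><body>" + "<p>Hello, crawler!</p>" * 200 + "</body></html>"
blob = pack_html(sample)
print(len(sample.encode("utf-8")), "bytes raw ->", len(blob), "bytes compressed")

# The server-side equivalent from the answer, with assumed table/column names:
#   INSERT INTO pages (url, html) VALUES (%s, COMPRESS(%s))
#   SELECT UNCOMPRESS(html) FROM pages WHERE url = %s
```

Compressing in the crawler moves the CPU cost off the database server, while COMPRESS keeps the client code simpler. Either way, searching the page text needs a separate uncompressed column or an external index.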
