Suppose I want to brute-force a range of web pages:
for example, http://www.example.com/index.php?id=<1 – 99999>
and search each page to see whether it contains a certain text.
If the page contains the text, I store it in a string.
I more or less have it working in Python, but it is quite slow: around 1-2 seconds per page, which means a day or more for the whole range. Is there a better solution? I am thinking of using C/C++, because I have heard that Python is not very efficient. On second thought, though, the bottleneck might not be Python itself but the way I access the HTML (I convert the whole page to text and then search it, and the content is quite long).
So how can I improve the speed of bruteforcing?
Most likely your problem has nothing to do with your ability to quickly parse HTML and everything to do with the latency of page retrieval and blocking on sequential tasks.
1-2 seconds is a reasonable amount of time to retrieve a page, and finding text on the page should be orders of magnitude faster than that. However, if you process pages one at a time, you spend nearly all of that time blocked waiting for a response from the web server when you could be fetching and searching other pages. You could instead retrieve multiple pages at once via worker processes and wait only for their output.
The following code has been modified from Python’s multiprocessing docs to fit your problem a little more closely.
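What follows is only a minimal sketch of the worker-pool idea, not a definitive implementation. It assumes Python 3 and the standard-library `multiprocessing.Pool` and `urllib.request`. The names `BASE_URL`, `NEEDLE`, `fetch_page`, `check_page` and `scan` are made up for illustration, as are the pool size of 20 and the 10-second timeout. It collects the ids of matching pages rather than the pages themselves, which is one reading of "store it".

```python
import multiprocessing
import urllib.request

# URL pattern taken from the question; NEEDLE is a hypothetical placeholder
# for whatever text you are actually searching for.
BASE_URL = "http://www.example.com/index.php?id={}"
NEEDLE = "certain text"


def fetch_page(page_id):
    """Download one page and return its body as text."""
    url = BASE_URL.format(page_id)
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def check_page(page_id, fetch=fetch_page):
    """Return page_id if the page contains NEEDLE, otherwise None."""
    try:
        html = fetch(page_id)
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses;
        # a page that cannot be retrieved is treated as "no match".
        return None
    return page_id if NEEDLE in html else None


def scan(page_ids, processes=20):
    """Check pages in parallel and return the sorted ids of matching pages."""
    with multiprocessing.Pool(processes=processes) as pool:
        # imap_unordered hands out ids to the workers and yields each result
        # as soon as a worker finishes, so one slow page does not block the rest.
        results = pool.imap_unordered(check_page, page_ids, chunksize=10)
        return sorted(pid for pid in results if pid is not None)
```

To run it, call something like `print(scan(range(1, 100000)))` from under an `if __name__ == "__main__":` guard, which `multiprocessing` needs on platforms that start workers by spawning a fresh interpreter. Because the workers spend nearly all their time waiting on the network, the pool can be considerably larger than the number of CPU cores; how large depends on how many simultaneous requests the server will tolerate.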