I’ve built a crawler that had to run on about 5M pages (by increasing

Asked: May 24, 2026 (2026-05-24T03:53:23+00:00)

I’ve built a crawler that has to run over about 5M pages (by incrementing the URL ID) and then parse the pages that contain the info I need.

After running an algorithm over 200K URLs and saving the good and bad results, I found that I’m wasting a lot of time. I could see that there are a few recurring subtrahends (the differences between consecutive good IDs) which I can use to find the next valid URL.

You can spot the subtrahends quite quickly (a small example of the first few “good IDs”):

510000011 # +8
510000029 # +18
510000037 # +8
510000045 # +8
510000052 # +7
510000060 # +8
510000078 # +18
510000086 # +8
510000094 # +8
510000102 # +8
510000110 # etc'
510000128
510000136
510000144
510000151
510000169
510000177
510000185
510000193
510000201
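
The gaps noted next to the IDs above can be tallied mechanically. This is only a minimal sketch in Python 3: it uses the twenty IDs listed here, and `gap_counts` is a hypothetical helper name, not part of the original script.

```python
from collections import Counter

# The twenty "good IDs" listed above.
good_ids = [
    510000011, 510000029, 510000037, 510000045, 510000052,
    510000060, 510000078, 510000086, 510000094, 510000102,
    510000110, 510000128, 510000136, 510000144, 510000151,
    510000169, 510000177, 510000185, 510000193, 510000201,
]

def gap_counts(ids):
    """Count how often each difference between consecutive IDs occurs."""
    ordered = sorted(ids)
    return Counter(b - a for a, b in zip(ordered, ordered[1:]))

counts = gap_counts(good_ids)
print(counts.most_common())  # [(8, 13), (18, 4), (7, 2)]
```

On this small sample only 8, 18 and 7 appear; the other offsets tried in the function (17, 6, 16) presumably show up in the larger 200K run.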

After crawling about 200K URLs, which gave me only 14K good results, I knew I was wasting my time and needed to optimize, so I ran some statistics and built a function that checks the URLs while increasing the ID by 8\18\17\8 (the most frequently recurring subtrahends), etc.

This is the function:

def checkNextID(ID):
    global numOfRuns, curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(3) # sleep every 10 iterations
            if isValid(ID + 8):
                parseHTML(curRes)
                checkNextID(ID + 8)
                return 0
            if isValid(ID + 18):
                parseHTML(curRes)
                checkNextID(ID + 18)
                return 0
            if isValid(ID + 7):
                parseHTML(curRes)
                checkNextID(ID + 7)
                return 0
            if isValid(ID + 17):
                parseHTML(curRes)
                checkNextID(ID + 17)
                return 0
            if isValid(ID+6):
                parseHTML(curRes)
                checkNextID(ID + 6)
                return 0
            if isValid(ID + 16):
                parseHTML(curRes)
                checkNextID(ID + 16)
                return 0
            else:
                checkNextID(ID + 1)
                return 0
        except Exception, e:
            print "somethin went wrong: " + str(e)

What it basically does: checkNextID(ID) receives the first ID I know contains the data, minus 8, so the first iteration matches the first “if isValid” clause (isValid(ID + 8) will return True).

lastResult is a variable which saves the last known url id, so we’ll run until numOfRuns is

isValid() is a function that gets an ID plus one of the subtrahends and returns True if the URL contains what I need, saving a soup object of the page to a global variable named ‘curRes’; it returns False if the URL doesn’t contain the data I need.

parseHTML is a function that gets the soup object (curRes), parses the data I need, saves it to a CSV, and then returns True.

If isValid() returns True, we call parseHTML() and then try the next ID plus the subtrahends (by calling checkNextID(ID + subtrahend)). If none of them returns what I’m looking for, I increase the ID by 1 and check again until I find the next valid URL.

you can see the rest of the code here

After running the code I got about 950 good results, and then suddenly an exception was raised:

“somethin went wrong: maximum recursion depth exceeded while calling a
Python object”

I could see in Wireshark that the script got stuck on ID 510009541 (I started my script with 510000003); the script tried fetching the URL with that ID a few times before I noticed the error and stopped it.

I was really excited to see that I got the same results 25x-40x faster than with my old script, with fewer HTTP requests. It’s very precise: I missed only 1 result per 1000 good results, which is fine by me. It’s impossible to run 5M times: my old script ran for 30 hours and got 14-15K results, whereas my new script gave me about 960 results in 5-10 minutes.

I read about stack limitations, but there must be a solution for the algorithm I’m trying to implement in Python (I can’t go back to my old “algorithm”, it will never end).
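
For context, the stack limitation mentioned here can be reproduced in isolation. CPython caps the call-stack depth (reported by `sys.getrecursionlimit()`, 1000 by default), and a function that calls itself once per step keeps every earlier frame alive, which is consistent with the failure appearing after roughly 950 results. The sketch below is Python 3 and uses hypothetical helper names; in Python 3 the error is raised as `RecursionError`.

```python
import sys

def count_down_recursive(n):
    """Call itself n times, holding one stack frame per step."""
    if n == 0:
        return 0
    return count_down_recursive(n - 1)

def hits_limit(depth):
    """Return True if recursing `depth` levels exceeds the interpreter's limit."""
    try:
        count_down_recursive(depth)
        return False
    except RecursionError:
        return True

limit = sys.getrecursionlimit()
print(hits_limit(10))           # False: shallow recursion is fine
print(hits_limit(limit + 100))  # True: deeper than the limit allows
```

Raising the limit with `sys.setrecursionlimit` only postpones the failure for a crawl of this size, so removing the recursion is the more robust route.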

Thanks!

1 Answer

Editorial Team · Answer 1 · 2026-05-24T03:53:24+00:00

This turns the recursion into a loop:

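import time

def checkNextID(ID):
    # isValid, parseHTML, numOfRuns, curRes and lastResult come from the question's script
    global numOfRuns, curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(3) # sleep every 10 iterations
            # try the most common offsets first; advance ID instead of recursing
            if isValid(ID + 8):
                parseHTML(curRes)
                ID = ID + 8
            elif isValid(ID + 18):
                parseHTML(curRes)
                ID = ID + 18
            elif isValid(ID + 7):
                parseHTML(curRes)
                ID = ID + 7
            elif isValid(ID + 17):
                parseHTML(curRes)
                ID = ID + 17
            elif isValid(ID + 6):
                parseHTML(curRes)
                ID = ID + 6
            elif isValid(ID + 16):
                parseHTML(curRes)
                ID = ID + 16
            else:
                # no known offset matched: fall back to stepping by 1
                ID = ID + 1
        except Exception as e:
            # ID is unchanged here, so the same ID is retried on the next pass
            print("somethin went wrong: " + str(e))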
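
The six branches in the loop above differ only in the offset, so the same idea can be written as a loop over a tuple of offsets. The following is a sketch in Python 3 under some assumptions, not the asker's actual code: `next_valid_id` and `crawl` are hypothetical names, and `is_valid` and `handle` are passed in as plain callables that take an ID (the original `isValid` instead stores the page in the global `curRes`).

```python
import time

# Offsets in the same order as the branches of the loop above.
OFFSETS = (8, 18, 7, 17, 6, 16)

def next_valid_id(current_id, is_valid, offsets=OFFSETS):
    """Return (next_id, found): the first offset that hits, else current_id + 1."""
    for offset in offsets:
        if is_valid(current_id + offset):
            return current_id + offset, True
    return current_id + 1, False

def crawl(start_id, last_id, is_valid, handle, pause_every=10, pause_seconds=3.0):
    """Iteratively walk from start_id up to last_id, calling handle on each hit."""
    current_id = start_id
    runs = 0
    found_ids = []
    while current_id < last_id:
        runs += 1
        if pause_seconds and runs % pause_every == 0:
            time.sleep(pause_seconds)  # throttle, as in the original loop
        current_id, found = next_valid_id(current_id, is_valid)
        if found:
            handle(current_id)
            found_ids.append(current_id)
    return found_ids

# Usage with a stand-in validity check built from a few IDs in the question.
sample_good = {510000011, 510000029, 510000037, 510000045, 510000052}
hits = crawl(510000003, 510000052, lambda i: i in sample_good,
             handle=lambda i: None, pause_seconds=0)
print(hits)  # [510000011, 510000029, 510000037, 510000045, 510000052]
```

Because each step only reassigns `current_id`, the stack depth stays constant no matter how many IDs are visited.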
