I am trying to scrape a site that returns its data via Javascript. The

Question

Asked: June 17, 2026 (2026-06-17T18:25:58+00:00)

I am trying to scrape a site that returns its data via JavaScript. The code I wrote using BeautifulSoup works pretty well, but at random points during scraping I get the following error:

Traceback (most recent call last):
  File "scraper.py", line 48, in <module>
    accessible = accessible[0].contents[0]
IndexError: list index out of range

Sometimes I can scrape 4 URLs, sometimes 15, but at some point the script fails with the above error. I can find no pattern behind the failures, so I'm really at a loss here. What am I doing wrong?

from bs4 import BeautifulSoup
import urllib
import urllib2
import jabba_webkit as jw
import csv
import string
import re
import time

countries = csv.reader(open("countries.csv", 'rb'), delimiter=",")
database = csv.writer(open("herdict_database.csv", 'w'), delimiter=',')

basepage = "https://www.herdict.org/explore/"
session_id = "indepth;jsessionid=C1D2073B637EBAE4DE36185564156382"
ccode = "#fc=IN"
end_date = "&fed=12/31/"
start_date = "&fsd=01/01/"

year_range = range(2009, 2011)
years = [str(year) for year in year_range]

def get_number(var):
    number = re.findall("(\d+)", var)

    if len(number) > 1:
        thing = number[0] + number[1]
    else:
        thing = number[0]

    return thing

def create_link(basepage, session_id, ccode, end_date, start_date, year):
    link = basepage + session_id + ccode + end_date + year + start_date + year
    return link



for ccode, name in countries:
    for year in years:
        link = create_link(basepage, session_id, ccode, end_date, start_date, year)
        print link
        html = jw.get_page(link)
        soup = BeautifulSoup(html, "lxml")

        accessible = soup.find_all("em", class_="accessible")
        inaccessible = soup.find_all("em", class_="inaccessible")

        accessible = accessible[0].contents[0]
        inaccessible = inaccessible[0].contents[0]

        acc_num = get_number(accessible)
        inacc_num = get_number(inaccessible)

        print acc_num
        print inacc_num
        database.writerow([name]+[year]+[acc_num]+[inacc_num])

        time.sleep(2)

1 Answer

Editorial Team · Answer 1 · 2026-06-17T18:25:59+00:00

You need to add error handling to your code. When you scrape many pages, some of them will be malformed or otherwise broken. When that happens, the search finds nothing and you end up indexing into an empty result, which is what raises the IndexError.

Look through the code, find every place where you assume an operation succeeded, and check for failure there.

For that specific case, I would do this:

if not inaccessible or not accessible:
    # malformed page
    continue
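
The same idea can be taken a little further. The following is only a sketch, not a drop-in fix: `first_text`, `scrape_counts`, and the `load_soup` callable are hypothetical names introduced here, and the retry is based on the assumption (not confirmed in the question) that an empty result sometimes means the JavaScript-rendered page was not ready yet. `get_number` is a guarded variant of the function in the question that returns None instead of failing when there are no digits.

```python
import re
import time


def first_text(tags):
    """Return the first child of the first tag, or None if either is missing."""
    if not tags or not tags[0].contents:
        return None
    return tags[0].contents[0]


def get_number(text):
    """Join the first two digit groups ("1,234" -> "1234"); None if no digits."""
    if text is None:
        return None
    groups = re.findall(r"\d+", text)
    return "".join(groups[:2]) if groups else None


def scrape_counts(load_soup, link, attempts=3, delay=2.0):
    """Try up to `attempts` times; return (accessible, inaccessible) or None.

    load_soup is a hypothetical callable that takes a link and returns a
    soup-like object with a find_all method.
    """
    for attempt in range(attempts):
        soup = load_soup(link)
        acc = get_number(first_text(soup.find_all("em", class_="accessible")))
        inacc = get_number(first_text(soup.find_all("em", class_="inaccessible")))
        if acc is not None and inacc is not None:
            return acc, inacc
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before loading the page again
    return None
```

In the loop from the question, `load_soup` could be something like `lambda link: BeautifulSoup(jw.get_page(link), "lxml")`. A None result then means the page never yielded both numbers, so the loop can log the link and `continue` instead of crashing.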
