I am more than a bit tired, but here goes: I am doing some


Editorial Team

Asked: May 17, 2026 (2026-05-17T14:41:31+00:00)

I am more than a bit tired, but here goes:

I am doing some HTML scraping in Python 2.6.5 with BeautifulSoup on an Ubuntu box.

Reason for Python 2.6.5: BeautifulSoup sucks under 3.1.

I try to run the following code:

# data retrieval from HTML files from DETHERM
# -*- coding: utf-8 -*-

import sys,os,re,csv
from BeautifulSoup import BeautifulSoup


sys.path.insert(0, os.getcwd())

raw_data = open('download.php.html','r')
soup = BeautifulSoup(raw_data)

for numdiv in soup.findAll('div', {"id" : "sec"}):
    currenttable = numdiv.find('table',{"class" : "data"})
    if currenttable:
        numrow=0
        numcol=0
        data_list=[]
        for row in currenttable.findAll('td', {"class" : "dataHead"}):
            numrow=numrow+1
        for ncol in currenttable.findAll('th', {"class" : "dataHead"}):
            numcol=numcol+1
        for col in currenttable.findAll('td'):
            col2 = ''.join(col.findAll(text=True))
        if col2.index('±'):
        col2=col2[:col2.index('±')]
            print(col2.encode("utf-8"))
        ref=numdiv.find('a')
        niceref=''.join(ref.findAll(text=True))

Now, due to the ± signs, I get the following error when I try to run the code with:

python code.py

Traceback (most recent call last):
  File "detherm-wtest.py", line 25, in <module>
    if col2.index('±'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

How do I solve this? Adding a u prefix, so that '±' becomes u'±', results in:

Traceback (most recent call last):
  File "detherm-wtest.py", line 25, in <module>
    if col2.index(u'±'):
ValueError: substring not found

The code file's encoding is UTF-8.

Thank you.


1 Answer

Editorial Team · Answer 1 · 2026-05-17T14:41:31+00:00

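Byte strings like '±' (in Python 2.x) are stored in the source file's encoding, which might not be what you want: in a UTF-8 file, '±' is the two bytes 0xc2 0xb1. If col2 is really a Unicode object (BeautifulSoup converts the document to Unicode, so it should be), Python 2 tries to decode that byte string with the ASCII codec before searching, and that is where the UnicodeDecodeError on byte 0xc2 comes from. You should use u'±' instead, like you already tried. The ValueError you then get is a separate issue: somestring.index raises an exception if it doesn't find an occurrence, whereas somestring.find returns -1, and evidently not every table cell contains a ±. Therefore, this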

    if col2.index('±'):
        col2=col2[:col2.index('±')] # this is not indented correctly in the question BTW
        print(col2.encode("utf-8"))

should be

    if u'±' in col2:
        col2=col2[:col2.index(u'±')]
        print(col2.encode("utf-8"))

so that the if statement doesn’t lead to an exception.
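
As a rough sketch of the same idea, the cut can also be made with partition, which never raises when the separator is missing. The helper name strip_uncertainty and the sample strings are made up for illustration, and the .strip() call is an addition that the code in the question does not have. The ± sign is written as the escape u'\u00b1' so the example does not depend on the file's encoding.

```python
# -*- coding: utf-8 -*-
# Sketch only: strip_uncertainty is a hypothetical helper, not part of BeautifulSoup.

# The ± character as a Unicode escape, independent of the source file's encoding.
PLUS_MINUS = u'\u00b1'


def strip_uncertainty(text):
    """Return the part of a Unicode string before the first ±.

    If there is no ±, the whole string is returned. Surrounding whitespace
    is stripped (an addition, not something the original code did).
    """
    # partition() returns (head, separator, tail); when the separator is
    # absent, head is the whole string and no exception is raised.
    value, _sep, _rest = text.partition(PLUS_MINUS)
    return value.strip()


# UTF-8 encodes ± as two bytes. The first one, 0xc2, is the byte named in
# the UnicodeDecodeError from the question.
assert PLUS_MINUS.encode('utf-8') == b'\xc2\xb1'

print(strip_uncertainty(u'298.15 \u00b1 0.02'))
print(strip_uncertainty(u'no uncertainty here'))
```

A side benefit over the `if col2.index(...)` form: index returns 0 when ± is the first character, and 0 counts as false in an if test, so that case would be skipped silently.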
