I am trying to parse an HTML page in Python. When I reach a certain tag, I would like to start printing all the data. So far I have come up with this:
class MyHTMLParser(HTMLParser):
    start = False;
    counter = 0;
    def handle_starttag(self,tag,attrs):
        if(tag == 'TBODY'):
            start = True;
            counter +=1
            #if counter == 1
    def handle_data(self,data):
        if (start == True): # this is the error line
            print data
The problem is that there is an error saying that it doesn’t know what start is. I know I could use global, but wouldn’t that force me to define the variable outside the whole class?
EDIT:
Changing start to self.start solves the problem, but is there a way to define it inside __init__ without messing up the HTMLParser __init__?
The assignments start = False and counter = 0 in your class body do not do what you think they do!
In Java, C#, or similar languages, what the analogous code does is declare that all objects of the class MyHTMLParser have an attribute start with the initial value of False, and counter with the initial value of 0.

In Python classes are objects too. They have their own attributes, just like every other object. So what the above does in Python is create a class object named MyHTMLParser, with an attribute start set to False and an attribute counter set to 0. [1]

Another thing to keep in mind is that there is no way whatsoever to make an assignment to a bare name like start = True set an attribute on an object. It always sets a variable named start. [2]

So your class contains no code that ever sets any attributes on any of your MyHTMLParser instances; the code in the class body is setting attributes on the class object itself, and the code in handle_starttag is setting local variables which are then discarded when they fall out of scope.

Your code in handle_data is reading from a variable named start (which was never set), for similar reasons. In Python there is no way to read an attribute without specifying in which object to look for it. A bare start always refers to a variable, either in the local function scope or in some outer scope. You need self.start to read the start attribute of the self object.

Remember, the def block defining a method is nothing special; it’s a function like any other. It’s only later, when that function happens to be stored in an attribute of a class object, that the function can be classified as a method. So the self parameter behaves the same as any other parameter, and indeed any other name. It doesn’t have to be named self (though that’s a wise convention to follow), and it has no special privileges making reads and writes of bare names look for attributes of self.

So:
- Don’t define your attributes with their initial values in the class block; that’s for values which are shared by all instances of the class, not attributes of each instance. Instance attributes can only be initialised once you have a reference to the particular instance; most commonly this is done in the __init__ method, which is called as soon as the object exists.

- You must specify in which object you want to read or write attributes. This applies always, in every context. In particular, you will usually refer to attributes inside methods as self.attribute.

Applying that (and eliminating the semicolons, which you don’t need in Python):
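A minimal sketch of what the result could look like, assuming Python 3 (so html.parser and print as a function rather than the Python 2 print statement in the question). It also compares against 'tbody' instead of 'TBODY', because HTMLParser passes tag names to handle_starttag in lower case:

```python
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        # Let HTMLParser initialise its own state first.
        super().__init__()
        # Per-instance attributes, set once the instance exists.
        self.start = False
        self.counter = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, whatever the page used.
        if tag == 'tbody':
            self.start = True
            self.counter += 1

    def handle_data(self, data):
        # Print text only after the first <tbody> has been seen.
        if self.start:
            print(data)
```

Calling super().__init__() before setting your own attributes is what keeps the HTMLParser initialisation intact, which is the concern raised in the question's edit.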
[1] The methods handle_starttag and handle_data are also nothing more than functions which happen to be attributes of an object that is used as a class.

[2] Usually a local variable; if you’ve declared start to be global or nonlocal then it might be an outer variable. But it’s definitely not an attribute on some object you happen to have nearby, even if that other object is bound to the name self.
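The difference between class attributes, instance attributes and bare names can be checked with a small experiment. This is a minimal sketch, assuming Python 3; the class Demo and its methods set_bare and set_attr are hypothetical names made up for illustration:

```python
import types


class Demo:
    flag = False  # attribute of the class object Demo itself

    def set_bare(self):
        # A bare assignment creates a local variable that is discarded on
        # return; it touches neither the class nor the instance.
        flag = True

    def set_attr(self):
        # An explicit attribute assignment creates an attribute on this instance.
        self.flag = True


a, b = Demo(), Demo()

a.set_bare()
print(a.flag, Demo.flag)            # False False

a.set_attr()
print(a.flag, b.flag, Demo.flag)    # True False False

# Only the instance that ran set_attr has its own 'flag' attribute.
print('flag' in vars(a), 'flag' in vars(b))   # True False

# The method stored on the class is an ordinary function object.
print(isinstance(vars(Demo)['set_attr'], types.FunctionType))   # True
```

Note that a.flag is readable before set_attr is called because attribute lookup on an instance falls back to the class object; assigning through self.flag then creates a separate attribute on that one instance and leaves Demo.flag untouched.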