I made a program that extracts the text from an HTML file. It recurses down the HTML document and returns the list of tags. For example,
input: <li>no way <b>you</b> are doing this</li>
output: ['no', 'way', 'you', 'are', ...]
Here is some highly simplified pseudocode for this:
    def get_leaves(node):
        kids = getchildren(node)
        for i in kids:
            if leafnode(i):
                # leaf: process it and record the result
                a = process_leaf(i)
                list_of_leaves.append(a)
            else:
                # internal node: recurse into its children
                get_leaves(i)

    def calling_fn():
        global list_of_leaves
        list_of_leaves = []  # which is now in global scope
        get_leaves(rootnode)
        print(list_of_leaves)
At the moment list_of_leaves lives in the global scope: calling_fn() declares it and get_leaves() appends to it.
My question is: how do I modify my function so that I can write something like list_of_leaves = get_leaves(rootnode), i.e. without using a global variable?
I don't want each instance of the function to duplicate the list, as the list can get quite big.
Please don't criticize the design of this particular pseudocode, as I simplified it. It is meant for another purpose: extracting tokens along with their associated tags using BeautifulSoup.
You can pass the result list as an optional argument.
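A minimal sketch of that idea follows. It does not use the question's getchildren/leafnode/process_leaf helpers; instead it assumes a made-up stand-in tree in which a node is either a string (a text leaf) or a list of child nodes, and it treats "processing a leaf" as splitting the text into words.

```python
# Hypothetical stand-in for a parsed HTML tree: a node is either a string
# (a text leaf) or a list of child nodes.
def get_leaves(node, result=None):
    # Create the list once, on the outermost call. Using None as the default
    # avoids the pitfall of a shared mutable default such as result=[].
    if result is None:
        result = []
    for child in node:
        if isinstance(child, str):
            # Leaf: split the text into words and add them in place.
            result.extend(child.split())
        else:
            # Internal node: pass the same list down; nothing is copied.
            get_leaves(child, result)
    return result


# Roughly <li>no way <b>you</b> are doing this</li>
rootnode = [["no way ", ["you"], " are doing this"]]
list_of_leaves = get_leaves(rootnode)
print(list_of_leaves)
```

Every recursive call appends to the one list created by the outermost call, so the list is never duplicated, and the caller gets it back as the return value instead of reading a global.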
Python passes arguments by object reference: the function receives a reference to the caller's object, not a copy of it. This has been discussed before here. Some of the built-in types are immutable (e.g. int, str), so you cannot modify them in place (a new string is created when you concatenate two strings and assign the result to a variable). Instances of mutable types (e.g. list) can be modified in place. We take advantage of this by passing the original list down through the recursive calls to accumulate the result.
For extracting text from HTML in a real application, using a mature library like BeautifulSoup or lxml.html is always a much better option (as others have suggested).
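To illustrate the point about mutable and immutable arguments, here is a small sketch. The helper names append_word and add_suffix are made up for this example.

```python
def append_word(words, word):
    # list.append mutates the caller's list object in place.
    words.append(word)


def add_suffix(text, suffix):
    # str is immutable: += builds a new string and rebinds only the local
    # name, so the caller's string is left unchanged.
    text += suffix
    return text


words = ["no", "way"]
append_word(words, "you")

greeting = "no way"
longer = add_suffix(greeting, " you")
print(words, greeting, longer)
```

After these calls, words has grown by one element because the function changed the very list it was handed, while greeting still holds its original value and the longer text exists only as the new string that was returned.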