I’m working through “Programming Collective Intelligence”. In chapter 4, Toby Segaran builds an artificial neural network. The following function appears on page of the book:
def generatehiddennode(self,wordids,urls):
    if len(wordids)>3: return None
    # Check if we already created a node for this set of words
    sorted_words=[str(id) for id in wordids]
    sorted_words.sort()
    createkey='_'.join(sorted_words)
    res=self.con.execute(
        "select rowid from hiddennode where create_key='%s'" % createkey).fetchone()
    # If not, create it
    if res==None:
        cur=self.con.execute(
            "insert into hiddennode (create_key) values ('%s')" % createkey)
        hiddenid=cur.lastrowid
        # Put in some default weights
        for wordid in wordids:
            self.setstrength(wordid,hiddenid,0,1.0/len(wordids))
        for urlid in urls:
            self.setstrength(hiddenid,urlid,1,0.1)
        self.con.commit()
What I can’t understand is the reason for the first line in this function: `if len(wordids)>3: return None`. Is it debug code that was meant to be removed later?
P.S. This is not homework.
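For a published book, that’s pretty terrible code! (You can download all the examples for the book from here; the relevant file is `chapter4/nn.py`.)

- There’s no docstring. What roles do `wordids` and `urls` play?
- The SQL is built by string substitution, which invites SQL injection: the `wordids` probably come from a user query and so may be untrusted (but then, maybe they are ids rather than words, so it’s OK in practice but still a very bad habit to get into).
- To check whether a key exists, it would be better to use `SELECT EXISTS(...)` rather than asking the database to send you a bunch of records which you’re then going to ignore.
- The function silently does nothing if there’s already a node with this `createkey`. No error. Is that correct? Who can say?
- The weights for the words are scaled by `len(wordids)`, but the weights for the URLs are a constant `0.1` (perhaps there are always 10 URLs, but it would be better style to scale by `len(urls)` here).

I could go on and on, but I’d better not.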
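To make the points about untrusted input and `SELECT EXISTS(...)` concrete, here is a minimal sketch. It assumes the connection is a `sqlite3` connection, which I haven’t verified for the book’s code. The helper name `hidden_node_exists` and the in-memory table are made up for illustration; only the `hiddennode` table and its `create_key` column come from the function in the question.

```python
import sqlite3


def hidden_node_exists(con, createkey):
    """Return True if a hidden node with this create_key is already stored."""
    # The ? placeholder passes createkey as data, so no string
    # substitution into the SQL text is needed.
    row = con.execute(
        "SELECT EXISTS(SELECT 1 FROM hiddennode WHERE create_key = ?)",
        (createkey,),
    ).fetchone()
    return bool(row[0])


# Hypothetical in-memory database, just to exercise the helper.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hiddennode (create_key TEXT)")
con.execute("INSERT INTO hiddennode (create_key) VALUES (?)", ("101_102",))

found = hidden_node_exists(con, "101_102")
missing = hidden_node_exists(con, "101_103")
# A hostile-looking key is compared as a plain string, not run as SQL.
injected = hidden_node_exists(con, "x' OR '1'='1")
```

With the `'%s' % createkey` style, that last key would turn the `WHERE` clause into a condition that is always true; with a placeholder it simply fails to match any row.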
Anyway, to answer your question, it looks as though this function is adding a database entry for a node in the hidden layer of a neural network. This neural network has, I think, words in the input layer, and URLs in the output layer. The idea of the application is to attempt to train a neural network to find good search results (URLs) based on the words in the query. See the function `trainquery`, which takes the arguments `(wordids, urlids, selectedurl)`. Presumably (since there’s no docstring I have to guess) `wordids` are the words the user searched for, `urlids` are the URLs the search engine offered the user, and `selectedurl` is the one the user picked. The idea is to train the neural net to better predict which URLs users will pick, and so place those URLs higher in future search results.

So the mysterious line of code prevents nodes from being created in the hidden layer with links to more than three nodes in the input layer. In the context of the search application this makes sense: there’s no point in training up the network on queries that are too specialized, because these queries won’t recur often enough for the training to be worth it.
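As a rough sketch of that reading, the guard and the key construction can be pulled out into a small pure function. The name `hidden_node_key` and the `max_words` parameter are my own inventions, not the book’s; the sorting and joining mirror what the function in the question does, including sorting the ids as strings.

```python
def hidden_node_key(wordids, max_words=3):
    """Return the create key for a query, or None if the query is too long."""
    # Queries with more than max_words words get no hidden node at all.
    if len(wordids) > max_words:
        return None
    # Sort the ids as strings so the same set of words always gives
    # the same key, whatever order the user typed them in.
    return "_".join(sorted(str(wordid) for wordid in wordids))


short_key = hidden_node_key([102, 101])
same_words_other_order = hidden_node_key([101, 102])
too_long = hidden_node_key([101, 102, 103, 104])
```

Here the two-word query gets the key `101_102` in either word order, while the four-word query returns `None` and so never creates a hidden node.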