Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 3285140
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 17, 20262026-05-17T20:14:01+00:00 2026-05-17T20:14:01+00:00

I am trying to build a wikipedia link crawler on google app engine. I

  • 0

I am trying to build a wikipedia link crawler on google app engine. I wanted to store an index in the datastore. But I run into the DeadlineExceededError for both cron jobs and task queue.

for the cron job I have this code:


def buildTree(self):

    start=time.time()
    self.log.info(" Start Time: %f" % start)
    nobranches=TreeNode.all()       

    for tree in nobranches:            
        if tree.branches==[]:
            self.addBranches(tree)
            time.sleep(1)
        if (time.time()-start) > 10 :                
            break
        self.log.info("Time Eclipsed: %f" % (time.time()-start))

    self.log.info(" End Time:%f" % time.clock())

I don’t understand why the for loop doesn’t break after 10 seconds. It does on the dev server. Something must be wrong with the time.time() on the server. Is there another function I can use?

for the task queue I have this code:

def addNewBranch(self, keyword, level=0):

    self.log.debug("Add Tree")        
    self.addBranches(keyword)

    t=TreeNode.gql("WHERE name=:1", keyword).get()
    branches=t.nodes

    if level < 3:
        for branch in branches:
            if branch.branches == []:
                taskqueue.add(url="/addTree/%s" % branch.name)
                self.log.debug("url:%s" % "/addTree/%s" % branch.name)

The logs show that they both run into the DeadlineExceededError. Shouldn’t background processing have a longer that the 30 seconds for the page request. Is there a way around the exception?

Here is the code for addBranch()

def addBranches(self, keyword):

    tree=TreeNode.gql("WHERE name=:1", keyword).get()
    if tree is None:
        tree=TreeNode(name=keyword)


    self.log.debug("in addBranches arguments: tree %s", tree.name)     
    t=urllib2.quote(tree.name.encode('utf8'))
    s="http://en.wikipedia.org/w/api.php?action=query&titles=%s&prop=links&pllimit=500&format=xml" % t
    self.log.debug(s)
    try:        
        usock = urllib2.urlopen(s)       
    except :        

        self.log.error( "Could not retrieve doc: %s" % tree.name)
        usock=None

    if usock is not None:

        try:
            xmldoc=minidom.parse(usock)
        except Exception , error:
            self.log.error("Parse Error: %s" % error) 
            return None   
        usock.close()            
        try:
            pyNode= xmldoc.getElementsByTagName('pl')  
            self.log.debug("Nodes to be added: %d" % pyNode.length)
        except Exception, e:
            pyNode=None
            self.log.error("Getting Nodes Error: %s" % e)
            return None
        newNodes=[]    
        if pyNode is not None:
            for child in pyNode: 
                node=None             
                node= TreeNode.gql("WHERE name=:1", child.attributes["title"].value).get()

                if node is None:
                    newNodes.append(TreeNode(name=child.attributes["title"].value))           

                else:
                    tree.branches.append(node.key())  
            db.put(newNodes)
            for node in newNodes:
                tree.branches.append(node.key())
                self.log.debug("Node Added: %s" % node.name)                    
            tree.put()
            return tree.branches 

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-17T20:14:02+00:00Added an answer on May 17, 2026 at 8:14 pm

    I have had great success with datetimes on GAE.

    from datetime import datetime, timedelta
    time_start = datetime.now()
    time_taken = datetime.now() - time_start
    

    time_taken will be a timedelta. You can compare it against another timedelta that has the duration you are interested in.

    ten_seconds = timedelta(seconds=10)
    if time_taken > ten_seconds:
        ....do something quick.
    

    It sounds like you would be far better served using mapreduce or Task Queues. Both are great fun for dealing with huge numbers of records.

    A cleaner pattern for the code you have is to fetch only some records.

    nobranches=TreeNode.all().fetch(100)
    

    This code will only pull 100 records. If you have a full 100, when you are done, you can throw another item on the queue to launch off more.

    — Based on comment about needing trees without branches —

    I do not see your model up there, but if I were trying to create a list of all of the trees without branches and process them, I would: Fetch the keys only for trees in blocks of 100 or so. Then, I would fetch all of the branches that belong to those trees using an In query. Order by the tree key. Scan the list of branches, the first time you find a tree’s key, pull the key tree from the list. When done, you will have a list of “branchless” tree keys. Schedule each one of them for processing.

    A simpler version is to use MapReduce on the trees. For each tree, find one branch that matches its ID. If you cannot, flag the tree for follow up. By default, this function will pull batches of trees (I think 25) with 8 simultaneous workers. And, it manages the job queues internally so you don’t have to worry about timing out.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I'm trying to build a .swc containing a .xml config file into Flash Builder,
Trying to build my own user control: <%@ Control Language=C# AutoEventWireup=true CodeFile=TabMenu.ascx.cs Inherits=TabMenu %>
Im trying to build a page that will allow a user to select a
Im trying to build an authenticating api with rest and also post a large
I trying to build a tree with BioPython, Phylo module. What I've done so
I trying to build an application that is using the torrent technology to make
I trying to build a dynamic query similar to: def domain = DomainName def
http://en.wikipedia.org/wiki/Upsert Insert Update stored proc on SQL Server Is there some clever way to
Note below how ã changes to a . NOTE2: Before you blame this on

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.