The Archive Base

Editorial Team
Asked: May 10, 2026 (2026-05-10T23:53:32+00:00)

I am using Google’s App Engine API (from google.appengine.api import urlfetch) to fetch a webpage.

I am using Google’s App Engine API

from google.appengine.api import urlfetch 

to fetch a webpage. The result of

result = urlfetch.fetch('http://www.example.com/index.html') 

is a string of the HTML content (in result.content). The problem is that the data I want to parse is not really in HTML form, so I don’t think a Python HTML parser will work for me. I need to parse all of the plain text in the body of the HTML document. The only problem is that urlfetch returns the entire HTML document as a single string, with all newlines and extra spaces removed.

EDIT: Okay, I tried fetching a different URL, and apparently urlfetch does not strip the newlines; it was the original webpage I was trying to parse that served the HTML file that way… END EDIT

If the document is something like this:

<html><head></head><body>
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
...
A4A       288        AAA
</body></html>

result.content will be this, after urlfetch fetches it:

'<html><head></head><body>AAA 123 888 2008-10-30 ABCBBB 987     2009-01-02 JSE...A4A     288            AAA</body></html>' 

Using an HTML parser will not help me with the data between the body tags, so I was going to use regular expressions to parse my data. But as you can see, the last part of one line gets combined with the first part of the next line, and I don’t know how to split it. I tried

result.content.split('\n') 

and

result.content.split('\r') 

but in both cases the resulting list had just one element. I don’t see any option in Google’s urlfetch function to keep the newlines.
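One way to confirm whether a fetched string contains any line breaks at all is to look at its repr and to call str.splitlines, which treats \n, \r\n and \r alike. This is only a minimal sketch: the two sample strings stand in for result.content, and count_lines is a made-up helper name.

```python
def count_lines(content):
    """Return how many lines content has, whatever line-break style it uses."""
    return len(content.splitlines())

# Hypothetical stand-ins for result.content
with_breaks = '<body>\nAAA 123\r\nBBB 987\rA4A 288\n</body>'
without_breaks = '<body>AAA 123BBB 987A4A 288</body>'

print(repr(with_breaks))           # escape sequences become visible
print(count_lines(with_breaks))
print(count_lines(without_breaks))
```

If the count is 1, the string really has no line breaks, and splitting on '\n' or '\r' cannot help.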

Any ideas how I can parse this data? Maybe I need to fetch it differently?

Thanks in advance!

1 Answer

  1. Added an answer on May 10, 2026 at 11:53 pm (2026-05-10T23:53:33+00:00)

    I understand that the format of the document is the one you have posted. In that case, I agree that a parser like Beautiful Soup may not be a good solution.

    I assume that you are already getting the interesting data (between the BODY tags) with a regular expression like

    import re
    # result.content holds the fetched HTML as a string
    data = re.findall(r'<body>([^<]*)</body>', result.content)[0]

    then, it should be as easy as:

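    start = 0
    end = 5  # assumed fixed record width; adjust to the real data
    while end < len(data):
        print(data[start:end])
        start = end  # the next record begins where this one ended
        end = end + 5
    print(data[start:])  # whatever is left after the last full record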

    (Note: I have not checked this code against boundary cases, and I expect it to fail on some of them. It is only here to show the general idea.)
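If the records do not all have the same width, slicing by a fixed length will not work. A possible alternative is to match each record by the shape of its fields. The following is only a sketch under assumptions: the field layout (a three-character code, one or two numbers, a YYYY-MM-DD date, and a three-letter code) is guessed from the sample in the question, and RECORD_RE, extract_body and parse_records are illustrative names, not part of any library.

```python
import re

# Assumed record layout, guessed from the question's sample:
#   3-char code, number, optional second number, YYYY-MM-DD date, 3-letter code
RECORD_RE = re.compile(
    r'([A-Z0-9]{3})\s+(\d+)\s+(\d+)?\s*(\d{4}-\d{2}-\d{2})\s+([A-Z]{3})'
)

def extract_body(html):
    """Return the text between <body> and </body>, or '' if there is none."""
    match = re.search(r'<body>(.*?)</body>', html, re.DOTALL)
    return match.group(1) if match else ''

def parse_records(html):
    """Return one tuple of fields per record found in the body text."""
    return RECORD_RE.findall(extract_body(html))

# Stand-in for result.content, taken from the question's flattened sample
sample = ('<html><head></head><body>AAA 123 888 2008-10-30 ABC'
          'BBB 987     2009-01-02 JSE</body></html>')
for record in parse_records(sample):
    print(record)
```

A missing optional field comes back from findall as an empty string. Rows with a different layout, such as the last row in the question's sample, would need their own pattern.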
