Doug Lea (concurrent programming in Java) uses A vertical time-line…

Question

Asked: May 10, 2026 (2026-05-10T23:53:32+00:00)

I am using Google’s App Engine API

from google.appengine.api import urlfetch

to fetch a webpage. The result of

result = urlfetch.fetch('http://www.example.com/index.html')

is a string of the HTML content (in result.content). The problem is that the data I want to parse is not really in HTML form, so I don’t think a Python HTML parser will work for me. I need to parse all of the plain text in the body of the HTML document. The only problem is that urlfetch returns the entire HTML document as a single string, with all newlines and extra spaces removed.

EDIT: Okay, I tried fetching a different URL and apparently urlfetch does not strip the newlines, it was the original webpage I was trying to parse that served the HTML file that way… END EDIT

If the document is something like this:

<html><head></head><body>
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
...
A4A       288        AAA
</body></html>

result.content will be this, after urlfetch fetches it:

'<html><head></head><body>AAA 123 888 2008-10-30 ABCBBB 987     2009-01-02 JSE...A4A     288            AAA</body></html>'

Using an HTML parser will not help me with the data between the body tags, so I was going to use regular expressions to parse my data. As you can see, though, the last part of one line gets combined with the first part of the next line, and I don’t know how to split them. I tried

result.content.split('\n')

and

result.content.split('\r')

but in both cases the resulting list had just one element. I don’t see any option in Google’s urlfetch function to keep it from removing newlines.

Any ideas how I can parse this data? Maybe I need to fetch it differently?

Thanks in advance!

1 Answer

score 0 · Answer 1 · 2026-05-10T23:53:33+00:00

I understand that the format of the document is the one you have posted. In that case, I agree that a parser like Beautiful Soup may not be a good solution.

I assume that you are already getting the interesting data (between the BODY tags) with a regular expression like

import re
# result is the urlfetch response; the HTML string is in result.content
data = re.findall(r'<body>([^<]*)</body>', result.content)[0]

then, it should be as easy as:

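width = 5  # placeholder record width; use the real width of one line
start = 0
end = width
while end < len(data):
    print(data[start:end])
    start = end  # the next record starts exactly where this one ended
    end += width
print(data[start:])  # whatever is left after the last full record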

(note: I did not check this code against boundary cases, and I do expect it to fail. It is only here to show the generic idea)
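To make the idea a bit more concrete, here is a minimal sketch of the same fixed-width approach. It rests on assumptions that are not confirmed anywhere in the question: that every record is exactly 26 characters wide, and that the columns are laid out as 3 characters plus a space, three times over, then a 10-character date plus a space, then a 3-character code. I inferred those widths from the sample string only. The helper names (extract_body, split_records, split_fields) and the test rows are made up for illustration; in your case the input would be result.content.

```python
import re

# Assumption: column widths inferred from the sample in the question
# (3 chars + space, 3 + space, 3 + space, 10-char date + space, 3 chars).
FIELD_WIDTHS = (4, 4, 4, 11, 3)
RECORD_WIDTH = sum(FIELD_WIDTHS)  # 26 characters per record


def extract_body(html):
    """Return the text between <body> and </body>, or '' if not found."""
    match = re.search(r'<body>(.*?)</body>', html, re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else ''


def split_records(text, width=RECORD_WIDTH):
    """Cut text into consecutive chunks of `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def split_fields(record, widths=FIELD_WIDTHS):
    """Slice one record into stripped fields; blank columns become ''."""
    fields, pos = [], 0
    for width in widths:
        fields.append(record[pos:pos + width].strip())
        pos += width
    return fields


# Made-up rows mirroring the sample, padded to the assumed widths.
rows = [('AAA', '123', '888', '2008-10-30', 'ABC'),
        ('BBB', '987', '', '2009-01-02', 'JSE'),
        ('A4A', '', '288', '', 'AAA')]
body = ''.join('{:<4}{:<4}{:<4}{:<11}{:<3}'.format(*row) for row in rows)
html = '<html><head></head><body>' + body + '</body></html>'

records = split_records(extract_body(html))
parsed = [split_fields(record) for record in records]
for fields in parsed:
    print(fields)
```

Slicing by column position instead of splitting on whitespace keeps empty columns in place, which seems to matter here because some rows in the sample have blank fields. If the real body starts with extra whitespace or the widths differ, the constants at the top would need adjusting.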
