Basically, I am designing a web search engine, so I designed a crawler to

Question

Asked: June 8, 2026 (2026-06-08T16:39:34+00:00)

Basically, I am designing a web search engine, so I designed a crawler to get web pages.

When read in, the web pages are in HTML format, so all the tags are there. I need to extract keywords from the body and title, so I’m trying to remove all the tags (anything between '<' and '>').

The code below works well for small HTML pages, but when I try to use it on a large scale (i.e., starting from http://www.google.com), I run out of memory.

0 def remove_tags(self, s):
1     while '<' in s:
2         start = s.index('<')
3         end = s.index('>')
4         s = s[:start] + " " + s[end+1:]
5     return s.split()

The memory error occurs at line 4. How do I fix my code so that taking the substrings of s doesn’t consume excessive memory?

1 Answer

Editorial Team · Answer 1 · 2026-06-08T16:39:35+00:00

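Your general approach is fragile. First, use a real HTML parser such as BeautifulSoup, which is forgiving of malformed HTML. Scanning for < and > breaks as soon as those characters appear outside of tags, for example inside scripts, comments, or attribute values. Note also that s.index('>') returns the first '>' anywhere in the string, even one that comes before the '<'. When that happens, line 4 makes s longer instead of shorter, and the loop keeps growing the string until memory runs out.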
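As a rough illustration of the parser-based route, here is a minimal sketch that uses the standard library's html.parser module rather than BeautifulSoup, purely so that it runs without installing anything. The names TextExtractor and extract_words are made up for this example, and skipping script and style contents is an assumption about what counts as a keyword.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects words from text nodes, ignoring <script> and <style> bodies.

    Hypothetical helper for illustration; not part of the original code.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if not self._skip_depth:
            self.words.extend(data.split())


def extract_words(html):
    """Return the list of whitespace-separated words found outside tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.words
```

Because the parser tracks where tags begin and end, a '<' inside a script no longer confuses the extraction. BeautifulSoup would likely be the more tolerant choice for badly broken pages.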

Second, you read the whole page into memory and manipulate it there. Python strings are immutable, so every slice and concatenation on line 4 builds a new copy of the remaining text, once per tag. Instead, iterate over the input stream and process the data as it arrives. Think of remove_tags as a filter on the input rather than a function that rewrites one large string.
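To make the filter idea concrete, here is a minimal sketch of a generator that takes the page in chunks and yields words as it goes, so no copy of the whole page is ever built. The name strip_tags and the chunked input are assumptions for illustration. It keeps the simple "anything between < and >" rule from the question, so it inherits that rule's weaknesses with scripts and comments.

```python
def strip_tags(chunks):
    """Yield the words found outside <...> in an iterable of text chunks.

    Hypothetical helper for illustration. Only the current word is held
    in memory, regardless of how large the page is.
    """
    in_tag = False
    word = []
    for chunk in chunks:
        for ch in chunk:
            if ch == "<":
                in_tag = True
                ch = " "  # a tag separates words, like whitespace
            elif ch == ">" and in_tag:
                in_tag = False
                continue
            elif in_tag:
                continue  # drop everything inside the tag
            if ch.isspace():
                if word:
                    yield "".join(word)
                    word = []
            else:
                word.append(ch)
    if word:
        yield "".join(word)
```

A file opened in text mode could be fed in as iter(lambda: f.read(8192), ""), where the 8192 chunk size is arbitrary. A tag split across two chunks is handled because the in_tag state carries over from one chunk to the next.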
