I’m extracting information from a webpage in Swedish. This page is using characters like:

Editorial Team

Asked: June 1, 2026 (2026-06-01T23:01:55+00:00)

I’m extracting information from a webpage in Swedish. This page is using characters like: öäå.

My problem is that when I print the information the öäå are gone.

I’m extracting the information using Beautiful Soup. I think that the problem is that I do a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location) to remove everything except for the letters. Before this I guess that Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.

So if I’m correct, the regexes are removing the öäå, right? But then the only thing left of the hex escape after the regex should be the x, and there are no x characters in place of öäå on my page, so maybe this little theory is not correct. Either way, how do you solve this? When I later print the extracted information to my webpage I use self.response.out.write() in Google App Engine (I don’t know if that helps in solving the problem).

EDIT: The encoding on the Swedish site is utf-8 and the encoding on my site is also utf-8.
EDIT2: You can use ISO-8859-10 for Swedish, but according to Google Chrome the encoding on this specific site is Unicode (UTF-8).

1 Answer

Editorial Team · Answer 1 · 2026-06-01T23:01:57+00:00

Always work in Unicode internally: decode incoming bytes as early as possible, and convert back to an encoded representation only when necessary, typically when writing output.
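As a rough illustration of that rule, here is a minimal sketch written for Python 3 (the answer itself targets Python 2). The sample string and the variable names are made up for the example and do not come from the question.

```python
# Hypothetical stand-in for bytes received from the Swedish page.
raw = "Malmö".encode("utf-8")

# Decode once, at the input boundary: bytes -> Unicode text.
text = raw.decode("utf-8")

# The byte string is longer than the text, because ö takes two bytes in UTF-8.
print(len(raw), len(text))  # prints 6 5

# Do all processing on the Unicode text...
processed = text.upper()

# ...and encode once, at the output boundary.
out = processed.encode("utf-8")
print(out.decode("utf-8"))  # prints MALMÖ
```

Processing the decoded text rather than the raw bytes is what lets operations such as upper() treat ö as a single letter.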

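For this particular situation, you also need to use the re.U flag. In Python 2, \w matches only ASCII letters, digits and the underscore by default, so [^\w] treats öäå as non-word characters and removes them. With re.U, \w matches Unicode letters as well: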

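# coding: utf-8
import re

# Python 2: decode the UTF-8 byte string into a unicode object first.
location = "öäå".decode('utf-8')
# re.U makes \w match Unicode letters, so öäå survive the substitution.
location = re.sub(r'([^\w])+', '', location, flags=re.U)

print location  # prints öäå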
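For readers on Python 3, the difference can be reproduced with the sketch below. There, str patterns are Unicode-aware by default, so re.ASCII is used only to imitate what Python 2 does without re.U. The sample string is an assumption made for the example.

```python
import re

# Hypothetical sample value; any text containing öäå behaves the same way.
location = "Malmö, Skåne!"

# ASCII-only \w (similar to Python 2 without re.U): ö and å count as non-word
# characters and are stripped along with the punctuation.
ascii_only = re.sub(r"[^\w]+", "", location, flags=re.ASCII)
print(ascii_only)  # prints MalmSkne

# Unicode-aware \w (the Python 3 default for str patterns): the letters survive.
unicode_aware = re.sub(r"[^\w]+", "", location)
print(unicode_aware)  # prints MalmöSkåne
```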
