I want to scrape the content of websites with Python. Just like this: Apple’s

Question

Asked: June 16, 2026 (2026-06-16T18:17:38+00:00)

I want to scrape the content of websites with Python. Just like this:

Apple’s stock continued to dominate the news over the weekend, with Barron’s placing it on the top of its favorite 2013 stock list.

But when I print the text, the result is garbled:

Apple âs stock continued to dominate the news over the weekend, with Barronâs placing it on the top of its favorite 2013 stock list.

The “’” character isn’t displayed correctly. Here is my code:

    #-*- coding: utf-8 -*-

    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    import urllib
    from lxml import *
    import urllib
    import lxml.html as HTML

    url = "http://www.forbes.com/sites/panosmourdoukoutas/2012/12/09/apple-tops-barrons-10-favorite-stocks-for-2013/?partner=yahootix"
    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()

    root = HTML.document_fromstring(htmlSource)
    contents = ' '.join([x.strip() for x in root.xpath("//div[@class='body']/descendant::text()")])

    print contents

    f = open('C:/Users/yinyao/Desktop/Python Code/data.txt','w')
    f.write(contents)
    f.close()

However, after setting the default encoding this way, print no longer works as expected. Why? And what should I do?
I’m using Windows, and the default encoding is GBK.

1 Answer

Editorial Team · Answer 1 · 2026-06-16T18:17:40+00:00

First, make sure you have read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

Second, always use unicode internally. Decode early, encode late: when you scrape a website, decode the response to unicode right away and process it as unicode throughout your script. Otherwise your code will crash at unpredictable points, for example when it encounters an unexpected character in a comment on some Chinese web page. Encode only when you pass the text on to something else (e.g., a writable stream), preferably as UTF-8.
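The "decode early, encode late" rule can be sketched roughly as follows. This is a minimal illustration, not the asker's exact script: `raw_bytes` is a hypothetical stand-in for what `sock.read()` returns, the page is assumed to be UTF-8, and `out_path` is a made-up temporary location. The last line decodes the same bytes as Latin-1, which is one plausible way to end up with the "â" seen in the question.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical stand-in for the bytes returned by sock.read().
# U+2019 (the curly apostrophe) is the three bytes E2 80 99 in UTF-8.
raw_bytes = b"Apple\xe2\x80\x99s  stock\n"

# Decode early: turn bytes into unicode text as soon as they arrive.
# Assumption: the page is UTF-8; a real page may declare something else.
text = raw_bytes.decode("utf-8")

# Process as unicode only (here: collapse runs of whitespace).
cleaned = u" ".join(text.split())

# Encode late: choose the byte encoding only at the output boundary.
out_path = os.path.join(tempfile.gettempdir(), "data.txt")  # hypothetical path
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(cleaned)

# For comparison: the same bytes read with the wrong codec give mojibake.
garbled = raw_bytes.decode("latin-1")
```

`io.open` with an explicit `encoding` is used so that the output file does not depend on the platform default (GBK in the asker's case), which also makes `sys.setdefaultencoding` unnecessary.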

Third, use BeautifulSoup 4, which converts incoming documents to unicode for you.
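A rough sketch of what that could look like, assuming the `beautifulsoup4` package is installed. The HTML snippet is a made-up stand-in for the downloaded page, and `from_encoding="utf-8"` is an assumption about the page's encoding (it can be left out to let BeautifulSoup guess).

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw bytes downloaded from the site.
html_bytes = (
    b'<html><head><meta charset="utf-8"></head><body>'
    b'<div class="body">Apple\xe2\x80\x99s stock <b>continued</b> to dominate</div>'
    b"</body></html>"
)

# BeautifulSoup decodes the bytes to unicode while parsing.
# from_encoding is an assumption here; omit it to rely on auto-detection.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="utf-8")

# Rough equivalent of the XPath //div[@class='body']/descendant::text()
body = soup.find("div", class_="body")
contents = body.get_text(" ", strip=True)
```

`contents` is a unicode string at this point, so it can be written out with an explicit encoding at the very end, in line with the second point above.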
