The Archive Base Latest Questions

Editorial Team
Asked: May 12, 2026

I want to extract some text from a certain website

I want to extract some text from a certain website.
Here is the address of the page I want to build a scraper for:
http://news.search.naver.com/search.naver?sm=tab_hty&where=news&query=times&x=0&y=0
On this page, I want to extract the text into separate subject and content fields.
For example, if you open that page, you can see text like this:

JAPAN TOKYO INTERNATIONAL FILM FESTIVAL
EPA연합뉴스 세계 | 2009.10.25 (일) 오후 7:21
Japan, 25 October 2009. Gayet won the Best Actress Award for her role in the film ‘Eight Times Up’ directed by French filmmaker Xabi Molia. EPA/DAI KUROKAWA

JAPAN TOKYO INTERNATIONAL FILM FESTIVAL
EPA연합뉴스 세계 | 2009.10.25 (일) 오후 7:18
she learns that she won the Best Actress Award for her role in the film ‘Eight Times Up’ by French film director Xabi Molia during the award ceremony of the 22nd Tokyo …

and so on.

Finally, I want to output the extracted text in a format like this:

SUBJECT:JAPAN TOKYO INTERNATIONAL FILM FESTIVAL
CONTENT:EPA연합뉴스 세계 | 2009.10.25 (일) 오후 7:21 Japan, 25 October 2009. Gayet won the Best Actress Award for her role in the film ‘Eight Times Up’ directed by French filmmaker Xabi Molia. EPA/DAI KUROKAWA

SUBJECT: …
CONTENT: …

and so on.
If anyone can help, I'd really appreciate it.
Thanks in advance.


1 Answer

  1. Editorial Team
    Added an answer on May 12, 2026 at 7:19 pm

    In general, to solve such problems you must first download the page of interest as text and study it to understand its structure. Use urllib.urlopen or anything else, even external utilities such as curl or wget, but not a browser, since you want to see how the page looks before any JavaScript has had a chance to run. In this case, after some study, you’ll find the relevant parts are as follows (snipping some irrelevant parts in the head and breaking lines up for readability):

    <body onload=nx_init();>
     <dl>
     <dt>
    <a href="http://news.naver.com/main/read.nhn?mode=LSD&mid=sec&sid1=&oid=091&aid=0002497340"
     [[snipping other attributes of this tag]]>
    JAPAN TOKYO INTERNATIONAL FILM FESTIVAL</a>
    </dt>
     <dd class="txt_inline">
    EPA¿¬ÇÕ´º½º ¼¼°è <span class="bar">
    |</span>
     2009.10.25 (ÀÏ) ¿ÀÈÄ 7:21</dd>
     <dd class="sh_news_passage">
     Japan, 25 October 2009. Gayet won the Best Actress Award for her role in the film 'Eight <b>
    Times</b>
     Up' directed by French filmmaker Xabi Molia. EPA/DAI KUROKAWA</dd>
    

    and so forth. So, you want as “subject” the content of an <a> tag within a <dt>, and as “content” the content of <dd> tags following it (in the same <dl>).
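    As a rough illustration of that rule only, here is a minimal Python 3 sketch that uses the standard library's html.parser instead of the parser used later in this answer. The class name DlPairs, the helper extract_pairs, and the shortened SAMPLE markup are all made up for this example, and the sketch assumes the tags are closed the way they are in the snippet above, which a real page may not guarantee.

```python
from html.parser import HTMLParser


class DlPairs(HTMLParser):
    """Collect [subject, content] pairs from <dt><a>..</a></dt><dd>..</dd> runs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pairs = []            # one [subject, content] entry per <dt>
        self._dt_open = False      # currently inside a <dt>
        self._in_dt_link = False   # currently inside an <a> within a <dt>
        self._in_dd = False        # currently inside a <dd>

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._dt_open = True
            self.pairs.append(["", ""])       # a new <dt> starts a new pair
        elif tag == "a" and self._dt_open:
            self._in_dt_link = True
        elif tag == "dd" and self.pairs:
            self._in_dd = True
            if self.pairs[-1][1]:
                self.pairs[-1][1] += " "      # keep consecutive <dd>s apart

    def handle_endtag(self, tag):
        if tag == "dt":
            self._dt_open = False
        elif tag == "a":
            self._in_dt_link = False
        elif tag == "dd":
            self._in_dd = False

    def handle_data(self, data):
        if not self.pairs:
            return                            # text before the first <dt>
        if self._in_dt_link:
            self.pairs[-1][0] += data
        elif self._in_dd:
            self.pairs[-1][1] += data         # includes text in <span>, <b>, ...


def extract_pairs(html_text):
    """Return a list of (subject, content) tuples with whitespace collapsed."""
    parser = DlPairs()
    parser.feed(html_text)
    parser.close()
    return [(" ".join(s.split()), " ".join(c.split())) for s, c in parser.pairs]


# Shortened, hypothetical stand-in for the markup shown above.
SAMPLE = """
<dl>
<dt><a href="http://news.naver.com/main/read.nhn?mode=LSD">
JAPAN TOKYO INTERNATIONAL FILM FESTIVAL</a></dt>
<dd class="txt_inline">EPA <span class="bar">|</span> 2009.10.25</dd>
<dd class="sh_news_passage">Gayet won the Best Actress Award for her role in
the film 'Eight <b>Times</b> Up'</dd>
</dl>
"""

for subject, content in extract_pairs(SAMPLE):
    print("SUBJECT:" + subject)
    print("CONTENT:" + content)
```

    html.parser only reports tags as it meets them and does not repair broken markup, so this is a way to check your reading of the structure rather than a robust scraper.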

    The headers you get contain:

    Content-Type: text/html; charset=ks_c_5601-1987
    

    so you must also find a way to decode that encoding into Unicode. I believe that encoding is also known as 'euc_kr', and my Python installation appears to come with a codec for it, but you should check yours, too.
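    One way to check your own installation, sketched in Python 3 with the standard library only. The helper name charset_from_content_type is made up for this example, and the Korean string is the source label from the question, used here as test data. The sketch also shows why text from this page can look garbled: that is what EUC-KR bytes look like when they are read as Latin-1.

```python
import codecs
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


charset = charset_from_content_type("text/html; charset=ks_c_5601-1987")

# codecs.lookup raises LookupError if this Python has no codec for the name.
codec_name = codecs.lookup(charset).name

korean = "EPA연합뉴스 세계"
raw = korean.encode(codec_name)      # bytes roughly as a server would send them
garbled = raw.decode("latin-1")      # misread: every byte becomes one character
recovered = raw.decode(codec_name)   # decoded with the codec named in the header

print(charset, "->", codec_name)
print("round trip ok:", recovered == korean)
```

    If the lookup fails on your machine, that is the point at which to find another codec before going any further.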

    Once you’ve determined all of these aspects, you try to lxml.etree.parse the URL and find that, just like so many other web pages, it doesn’t parse: it doesn’t really present well-formed HTML (try W3C’s validators on it to find out about some of the ways it’s broken).

    Because badly formed HTML is so common on the web, there exist “tolerant parsers” that try to compensate for common errors. The most popular in Python is BeautifulSoup, and indeed lxml can work with it: with lxml 2.0.3 or later, you can use BeautifulSoup as the underlying parser, then proceed “just as if” the document had parsed correctly. I find it simpler, though, to use BeautifulSoup directly.

    For example, here’s a script to emit the first few subject/content pairs at that URL (the results have changed since; originally they were the same as the ones you give ;-). You need a terminal that supports Unicode output (for example, I run this without problems on a Mac’s Terminal.App set to utf-8). Of course, instead of printing, you can collect the Unicode fragments (e.g. append them to a list and ''.join them when you have all the required pieces), encode them however you wish, and so on.

    # Written for Python 2 and BeautifulSoup 3 (note the print statements).
    from BeautifulSoup import BeautifulSoup
    import urllib
    
    def getit(pagetext, howmany=0):
      """Print SUBJECT/CONTENT pairs; howmany=0 means no limit."""
      soup = BeautifulSoup(pagetext)
      dls = soup.findAll('dl')
      for adl in dls:
        thedt = adl.dt
        while thedt:
          # The subject is the text of the <a> inside this <dt>.
          thea = thedt.a
          if thea:
            print 'SUBJECT:', thea.string
          # The content is every text node in the <dd> siblings that follow.
          thedd = thedt.findNextSibling('dd')
          if thedd:
            print 'CONTENT:',
            while thedd:
              for x in thedd.findAll(text=True):
                print x,
              thedd = thedd.findNextSibling('dd')
            print
          # Stop once the requested number of entries has been printed.
          howmany -= 1
          if not howmany: return
          print
          thedt = thedt.findNextSibling('dt')
    
    theurl = ('http://news.search.naver.com/search.naver?'
              'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
    thepage = urllib.urlopen(theurl).read()
    getit(thepage, 3)
    

    The logic in lxml, or “BeautifulSoup in lxml clothing”, is not very different; only the spelling and capitalization of the various navigational operations change a bit.
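    The earlier remark about appending fragments to a list and joining them, instead of printing, could look roughly like this in Python 3. The function name format_entry and the fragment list are made up for this example. The fragments imitate what a parser hands back for the <dd> shown earlier, where the <b> tag splits the sentence into pieces.

```python
def format_entry(subject, content_fragments):
    """Build the SUBJECT:/CONTENT: text for one search result."""
    # Fragments come from different tags, so join them first and then
    # collapse the line breaks and repeated spaces left over from the page.
    content = " ".join("".join(content_fragments).split())
    return "SUBJECT:{}\nCONTENT:{}".format(" ".join(subject.split()), content)


fragments = [
    "\n Japan, 25 October 2009. Gayet won the Best Actress Award "
    "for her role in the film 'Eight ",
    "\nTimes",
    "\n Up' directed by French filmmaker Xabi Molia. EPA/DAI KUROKAWA",
]

entry = format_entry("JAPAN TOKYO INTERNATIONAL FILM FESTIVAL\n", fragments)
print(entry)

# Encode only at the very end, when writing to a file or a socket.
entry_bytes = entry.encode("utf-8")
```

    Keeping everything as Unicode text until the last step means the choice of output encoding does not affect the extraction logic.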


© 2021 The Archive Base. All Rights Reserved