The Archive Base

Editorial Team
Asked: June 15, 2026 (2026-06-15T01:07:59+00:00)

I am downloading HTML pages that have data defined in them in the following way:

... <script type= "text/javascript">    window.blog.data = {"activity":{"type":"read"}}; </script> ...

I would like to extract the JSON object defined in 'window.blog.data'.
Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.)

Thanks

Edit:
Would it be possible, and more correct, to do this with a Python headless browser (e.g., Ghost.py)?

1 Answer

    Editorial Team added an answer on June 15, 2026 at 1:08 am (2026-06-15T01:08:00+00:00)

    BeautifulSoup is an HTML parser; you also need a JavaScript parser here. Note that some JavaScript object literals are not valid JSON (though in your example the literal is also a valid JSON object).
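    To illustrate the difference, here is a minimal sketch using only the standard json module. The two literals and the helper parses_as_json are made up for this illustration; the first is legal JavaScript (unquoted keys, single-quoted string, trailing comma) but is rejected by a strict JSON parser.

```python
import json

# A valid JavaScript object literal that is NOT valid JSON:
# unquoted keys, a single-quoted string, and a trailing comma.
js_literal = "{activity: {type: 'read',}}"

# The same data written as strict JSON.
json_literal = '{"activity": {"type": "read"}}'


def parses_as_json(text):
    """Return True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(parses_as_json(js_literal))    # False
print(parses_as_json(json_literal))  # True
```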

    In simple cases you could:

    1. extract the <script> element's text using an HTML parser
    2. assume that the window.blog... assignment is on a single line, or that there is no ';' inside the object, and extract the JavaScript object literal using simple string manipulation or a regex
    3. assume that the extracted string is valid JSON and parse it using the json module

    Example:

    #!/usr/bin/env python
    import json
    import re
    
    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
    
    html = """<!doctype html>
    <title>extract javascript object as json</title>
    <script>
    // ..
    window.blog.data = {"activity":{"type":"read"}};
    // ..
    </script>
    <p>some other html here
    """
    
    # 1. find the <script> whose text mentions window.blog.data
    #    (name the parser explicitly; string= needs beautifulsoup4 >= 4.4.0)
    soup = BeautifulSoup(html, 'html.parser')
    script = soup.find('script', string=re.compile(r'window\.blog\.data'))
    # 2. cut the object literal out of the assignment statement
    json_text = re.search(r'^\s*window\.blog\.data\s*=\s*(\{.*?\})\s*;\s*$',
                          script.string, flags=re.DOTALL | re.MULTILINE).group(1)
    # 3. parse the literal as JSON
    data = json.loads(json_text)
    assert data['activity']['type'] == 'read'
    

    If the assumptions are incorrect then the code fails.
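    A lighter-weight way to relax the second assumption, while still assuming the literal is valid JSON, is to let the json module find the end of the object itself. This is a sketch of a different technique from the regex above, not part of the original answer: json.JSONDecoder.raw_decode parses one JSON value starting at a given index and ignores whatever follows it. The script_text sample and the helper name extract_json_after are hypothetical.

```python
import json

# Hypothetical script text, as it might come back from the HTML parser.
# The object spans several lines and has a ';' inside a string value.
script_text = """
// ..
window.blog.data = {"activity": {"type": "read",
                                 "note": "a; b"}};
window.blog.other = {"x": 1};
"""


def extract_json_after(text, marker):
    """Decode the JSON object that follows the first occurrence of marker."""
    start = text.index(marker) + len(marker)
    # raw_decode does not skip leading characters, so move to the '{' first.
    brace = text.index("{", start)
    # Parse exactly one JSON value; trailing text (';', more code) is ignored.
    value, _end = json.JSONDecoder().raw_decode(text, brace)
    return value


data = extract_json_after(script_text, "window.blog.data")
assert data["activity"]["type"] == "read"
```

    This still fails (with ValueError or json.JSONDecodeError) if the marker is missing or the literal is not valid JSON, so it does not replace a real JavaScript parser.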

    To relax the second assumption, a JavaScript parser could be used instead of a regex, e.g., slimit (suggested by @approximatenumber):

    # NOTE: reuses html, json, and BeautifulSoup from the previous example
    from slimit import ast  # $ pip install slimit
    from slimit.parser import Parser as JavascriptParser
    from slimit.visitors import nodevisitor
    
    soup = BeautifulSoup(html, 'html.parser')
    tree = JavascriptParser().parse(soup.script.string)
    obj = next(node.right for node in nodevisitor.visit(tree)
               if (isinstance(node, ast.Assign) and
                   node.left.to_ecma() == 'window.blog.data'))
    # HACK: easy way to parse the javascript object literal
    data = json.loads(obj.to_ecma())  # NOTE: json format may be slightly different
    assert data['activity']['type'] == 'read'
    

    There is no need to treat the object literal (obj) as a JSON object. To get the necessary information, obj can be visited recursively like any other AST node. That would make it possible to support arbitrary JavaScript code (as long as slimit can parse it).


© 2021 The Archive Base. All Rights Reserved