I’m doing research on processing news texts from the internet, so I’m writing a program that fetches news articles by URL and stores them in a DB.
For instance, this is a random news URL (Spanish news website). I’m using BeautifulSoup to get the HTML content, and after a little simple processing I have the title, summary, content, category and more information about the article.
But, as you can see in the article I used in the example, there is also some “social networking” information (to the right of the news image):
- number of recommendations (facebook)
- number of tweets (twitter)
- number of +1s (google+)
I would like to obtain this information too, so I tried to process the HTML content from that part, but it’s not there! This is what I’ve done:
>>> import urllib
>>> from BeautifulSoup import BeautifulSoup as Soup
>>> news = urllib.urlopen('http://elcomercio.pe/mundo/1396187/noticia-horror-eeuu-cinco-ninos-muertos-deja-tiroteo-escuela-religiosa')
>>> soup = Soup(news.read())
>>> sociales = soup.findAll('ul', {'class': 'sociales'})[0].findAll('li')
>>> len(sociales)
3
This is the HTML content of the Facebook part:
>>> sociales[0] # facebook
<li class="top">
<div class="fb-plg">
<div id="fb-root"></div>
<script>(function(d, s, id) {
var js, fjs = d.getElementsByTagName(s)[0];
if (d.getElementById(id)) {return;}
js = d.createElement(s); js.id = id;
js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=224939367568467";
fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));</script>
<div class="fb-like" data-href="http://elcomercio.pe/noticia/1396187/horror-eeuu-cinco-ninos-muertos-deja-tiroteo-escuela-religiosa" data-send="false" data-layout="box_count" data-width="70" data-show-faces="false" data-action="recommend"></div></div></li>
Twitter part:
>>> sociales[1] # twitter
<li><a href="https://twitter.com/share" class="twitter-share-button" data-count="vertical" data-via="elcomercio" data-lang="es">Tweet</a><script type="text/javascript" src="//platform.twitter.com/widgets.js"></script></li>
Google+ part:
>>> sociales[2] # google+
<li><script type="text/javascript" src="https://apis.google.com/js/plusone.js">
{lang: 'es'}
</script><g:plusone size="tall"></g:plusone></li>
As you can see, the information I’m looking for is not included in the HTML content. I’m guessing it is loaded afterwards by those scripts, through some sort of API.
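To make that concrete, here is a minimal sketch (Python 3, standard library only) that parses the static Facebook markup shown above and checks what the fb-like div actually holds. The FbLikeInspector class and the STATIC_MARKUP constant are names made up for this illustration; the markup string is copied from the output above.

```python
from html.parser import HTMLParser

# The fb-like placeholder exactly as it appears in the downloaded HTML.
STATIC_MARKUP = (
    '<div class="fb-like" data-href="http://elcomercio.pe/noticia/1396187/'
    'horror-eeuu-cinco-ninos-muertos-deja-tiroteo-escuela-religiosa" '
    'data-send="false" data-layout="box_count" data-width="70" '
    'data-show-faces="false" data-action="recommend"></div>'
)


class FbLikeInspector(HTMLParser):
    """Collects the attributes and inner text of the first div.fb-like."""

    def __init__(self):
        super().__init__()
        self.attrs = None   # attributes of the fb-like div, once found
        self.text = ""      # any text found inside that div
        self._depth = 0     # > 0 while we are inside the fb-like div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            if tag == "div":
                self._depth += 1
        elif (tag == "div" and self.attrs is None
              and "fb-like" in (attrs.get("class") or "").split()):
            self.attrs = attrs
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag == "div":
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.text += data


inspector = FbLikeInspector()
inspector.feed(STATIC_MARKUP)
print(inspector.attrs["data-href"])  # the URL the widget will ask about
print(repr(inspector.text))          # '' : no count in the static HTML
```

The div only carries data-* configuration for the widget; the number itself is filled in later by the JavaScript that the page loads.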
So my question is: is there any way I can obtain the information I’m looking for (number of Facebook recommendations, number of tweets, number of +1s) from the HTML content of a given news article?
Here’s my solution. I’m posting it because maybe someday someone will have the same problem. I followed @Hoff’s advice and used phantomjs. So first I installed it (Linux, Windows or MacOS, doesn’t matter). You just have to be able to run it as a command in your prompt/console.
Here is the phantomjs installation guide.
Then I wrote a simple script that receives a URL and returns a BeautifulSoup object (after executing all the JavaScript). That’s it!
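A minimal sketch of that kind of script, not necessarily the original one. It assumes Python 3.7+, a phantomjs executable on the PATH, and the bs4 package (the question used the older BeautifulSoup 3 import). The names RENDER_JS, phantomjs_command, render_html and get_soup are invented for this example, and the 2 second wait is an arbitrary guess at how long the widgets need.

```python
import os
import subprocess
import tempfile

# PhantomJS-side script: open the URL passed as first argument, wait a
# moment so the page's JavaScript can run, then print the rendered DOM.
RENDER_JS = """
var page = require('webpage').create();
var system = require('system');
page.open(system.args[1], function (status) {
    if (status !== 'success') {
        phantom.exit(1);
    } else {
        // Assumption: 2 seconds is enough for the social widgets to load.
        window.setTimeout(function () {
            console.log(page.content);
            phantom.exit(0);
        }, 2000);
    }
});
"""


def phantomjs_command(script_path, url, executable="phantomjs"):
    """Build the argument list: phantomjs <script.js> <url>."""
    return [executable, script_path, url]


def render_html(url, executable="phantomjs", timeout=60):
    """Return the page HTML after PhantomJS has executed its JavaScript."""
    # Write the PhantomJS script to a temporary file it can be run from.
    with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as handle:
        handle.write(RENDER_JS)
        script_path = handle.name
    try:
        result = subprocess.run(
            phantomjs_command(script_path, url, executable),
            capture_output=True, encoding="utf-8", errors="replace",
            timeout=timeout, check=True,  # raises if phantomjs exits non-zero
        )
        return result.stdout
    finally:
        os.remove(script_path)


def get_soup(url):
    """Return a BeautifulSoup object built from the rendered page."""
    # Imported here so the rest of the module works without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(render_html(url), "html.parser")
```

Calling get_soup with the article URL then gives a soup you can query the same way as in the question. Whether the counters end up in the main document or inside iframes created by each widget depends on the widget, so it is worth inspecting the rendered HTML before relying on a particular selector.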
PS: I’ve only tested on Linux, so if any of you try this on Windows and/or MacOS, please share your “experience”. Thanks 🙂
PS 2: I’ve tested in Windows too, works like a charm!
I also posted this in my personal blog 🙂