I am trying to parse a web page and retrieve the text of every hyperlink in it.
For example:
<a href="www.example.com">This is an Example</a>
I need to retrieve “This is an Example”, which I am able to do for pages that don't have broken tags. I am unable to retrieve it in the following case:
<html>
<body>
<a href = "http:\\www.google.com">Google<br>
<a href = "http:\\www.example.com">Example</a>
</body>
</html>
In such cases the code is unable to retrieve “Google” because the tag that links to Google is never closed, so it only gives me “Example”. Is there a way to also retrieve “Google”?
My code is here:
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
f = open("sol.html", "r")
soup = BeautifulSoup(f, parse_only=SoupStrainer('a'))
for link in soup.findAll('a', text=True):
    print link.renderContents()
Please note that sol.html contains exactly the HTML shown above.
Thanks
– AJ
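Remove text=True from your code and it should work just fine. With text=True, findAll only returns tags whose .string is set, and the unclosed Google link has more than one child (the text, the <br> and whatever follows it), so its .string is None and it gets skipped.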
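Here is a minimal sketch of that, assuming Python 3 and the built-in "html.parser" backend (the question does not say which parser is in use). The markup is inlined instead of being read from sol.html so the snippet runs on its own. With html.parser, the unclosed link ends up wrapping the second one, so printing its full contents would also include the nested link. The own_text helper is a hypothetical name, not part of Beautiful Soup; it keeps only the strings that sit directly inside each tag.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Same markup as in the question; the first <a> is never closed.
html = r"""<html>
<body>
<a href = "http:\\www.google.com">Google<br>
<a href = "http:\\www.example.com">Example</a>
</body>
</html>"""

# Parser choice is an assumption; other parsers may build a different tree.
soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))


def own_text(tag):
    """Hypothetical helper: join only the strings that are direct children of tag."""
    return "".join(tag.find_all(string=True, recursive=False)).strip()


# No text=True here, so links with several children are kept as well.
links = soup.find_all("a")

# .string is None for a tag with more than one child,
# which is why the text=True filter dropped the first link.
strings = [link.string for link in links]

link_texts = [own_text(link) for link in links]
for text in link_texts:
    print(text)
```

With this parser the loop should print Google and then Example. If you would rather have everything under a link, including nested tags, link.get_text() is the usual alternative to own_text.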