I am trying to parse HTML data from a website using BeautifulSoup for Python. However, neither urllib2 nor mechanize reads the whole page. The returned data is:
<html>
<head>
<title>
EC 4.1.2.13 - Fructose-bisphosphate aldolase </title>
<meta name="description" content="Information on EC 4.1.2.13 - Fructose-bisphosphate aldolase">
<meta name="keywords" content="EC,Number,Enzyme,Pathway,Reaction,Organism,Substrate,Cofactor,Inhibitor,Compound,KM Value,KI Value,IC50 Value,pi Value,Turnover Number,pH,Temperature,Optimum,Range,Source Tissue,BLAST,Subunits,Modification,Crystallization,Stability,Purification">
</head>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<frameset cols="190,*" border="0">
<frame name="navigation" src="flat_navigation.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475" frameborder="no">
<frameset rows="110,*" border="0">
<frame name="header" src="flat_head.php4?ecno=4.1.2.13" frameborder="no">
<frame name="flat" src="flat_result.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475" frameborder="no">
</frameset>
</frameset>
<noframes>
<body>
<h1>EC 4.1.2.13 - Fructose-bisphosphate aldolase </h1>
<a href="flat_result.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475">More detailed information on the enzyme EC 4.1.2.13 - Fructose-bisphosphate aldolase</a>
Sorry, but your browser doesn't support frames. Please use another browser!
</body>
</noframes>
</html>
When I manually open the website in Internet Explorer, the whole HTML can be read. Is there any way to work around this using urllib2, mechanize, or BeautifulSoup?
That’s because the content is in the frames. You can either parse the page and look for the src attribute of the main <frame> element, or request the frame directly. In most browsers, you can right-click and select “Frame Properties” or similar to get the frame’s URL.
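The first option can be sketched roughly as follows. This is only an illustration, not a tested solution for that site: it uses Python 3's standard-library parser so it runs without extra packages, it assumes the frame named "flat" is the main one (the <noframes> link points at the same flat_result.php4 page), and BASE_URL is a made-up placeholder for the address the frameset page was fetched from.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Placeholder: replace with the real address of the frameset page.
BASE_URL = "http://example.com/enzyme/index.php4"

# Shortened copy of the frameset returned in the question.
SAMPLE_HTML = """
<frameset cols="190,*" border="0">
<frame name="navigation" src="flat_navigation.php4?ecno=4.1.2.13" frameborder="no">
<frameset rows="110,*" border="0">
<frame name="header" src="flat_head.php4?ecno=4.1.2.13" frameborder="no">
<frame name="flat" src="flat_result.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis" frameborder="no">
</frameset>
</frameset>
"""


class FrameCollector(HTMLParser):
    """Collect a name -> src mapping for every <frame> tag."""

    def __init__(self):
        super().__init__()
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            attrs = dict(attrs)
            if "src" in attrs:
                self.frames[attrs.get("name")] = attrs["src"]


def frame_url(page_html, base_url, frame_name):
    """Return the absolute URL of the frame called frame_name."""
    parser = FrameCollector()
    parser.feed(page_html)
    src = parser.frames[frame_name]
    # The src is relative and contains a raw space, so resolve it
    # against the page URL and percent-encode the space.
    return urljoin(base_url, src.replace(" ", "%20"))


if __name__ == "__main__":
    print(frame_url(SAMPLE_HTML, BASE_URL, "flat"))
```

The resulting URL can then be requested the same way as the original page, and that response is what should be handed to BeautifulSoup. With BeautifulSoup itself, the lookup step would be something like soup.find('frame', {'name': 'flat'})['src'].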