Trying the following code doesn’t seem to work out for me quite as planned:
from beautifulsoup import BeautifulSoup
definition = """From encyclopedia:\n<i></i><p>Infobox Country<br>fullcountryname=Thailand ราชอาณาจักรไทยRaja-anachakra Thai <br>image_flag= Flag of Thailand.svg <br>image_coa= Coat of arms of Thailand.png <br>image_location= LocationThailand.png <br>nationalmotto= none <br>nationalsong= Phleng Chat <br>nationalflower= n/a <br>nationalanimal= n/a <br>officiallanguages= Thai (<r><i>Thai language</i></r>) <br>populationtotal= 65,444,371 <br>populationrank= 19 <br>populationdensity= 127 <br>countrycapital= <r>Bangkok</r> <br>countrylargestcity= <r>Bangkok</r> <br>areatotal= 514,000 <br>arearank= 49 <br>areawater= n/a <br>areawaterpercent= 0.4 <br>establishedin= <r>April 7</r>, <r>1782</r> <br>leadertitlename= <br>currency= <r>Baht</r> <br>utcoffset= +7 <br>dialingcode= 66 <br>internettld= .th<p><b>Thailand</b> is a <r>country</r> in Southeast <r>Asia</r>. Its edges touch <r>Laos</r>, <r>Cambodia</r>, <r>Malaysia</r>, and <r>Myanmar</r> (which is also called Burma.) Thailand was called Siam until 1949."""
print BeautifulSoup(definition).find('p[1]').text
This does not return anything. I’m sure it’s a mistake in how I’m using BeautifulSoup. Does anybody have an idea how I could simply get:
Infobox Country
fullcountryname=Thailand Raja-anachakra Thai
image_flag= Flag of Thailand.svg
image_coa= Coat of arms of Thailand.png
image_location= LocationThailand.png
nationalmotto= none
nationalsong= Phleng Chat
nationalflower= n/a
nationalanimal= n/a
officiallanguages= Thai (Thai language)
populationtotal= 65,444,371
populationrank= 19
populationdensity= 127
countrycapital= Bangkok
countrylargestcity= Bangkok
areatotal= 514,000
arearank= 49
areawater= n/a
areawaterpercent= 0.4
establishedin= April 7, 1782
leadertitlename=
currency= Baht
utcoffset= +7
dialingcode= 66
internettld= .th
Thank you 🙂
EDIT: I would actually prefer if I could get the text between the word “Infobox” and the last tag, so that I could use the script to parse live Wikipedia pages.
You’re using XPath syntax, which Beautiful Soup doesn’t support. Lattyware’s answer is correct. As for the question in your edit, you can use Beautiful Soup 4’s .stripped_strings generator to get approximately what you want. Some example code:
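What follows is a minimal sketch rather than a definitive solution. It assumes Beautiful Soup 4 (imported as bs4) on Python 3, the built-in "html.parser" backend, and a shortened, tidied-up stand-in for the definition string from the question, with closing </p> tags added so the result does not depend on how a parser repairs unclosed paragraphs. The names pieces and lines are only illustrative. The <br>-to-newline step is an extra assumption of mine for getting one entry per infobox line.

```python
from bs4 import BeautifulSoup

# Shortened, well-formed stand-in for the question's `definition` string
# (an assumption for illustration; the real string has many more fields).
definition = (
    "From encyclopedia:\n<i></i><p>Infobox Country<br>"
    "fullcountryname=Thailand <br>"
    "countrycapital= <r>Bangkok</r> <br>"
    "establishedin= <r>April 7</r>, <r>1782</r> <br>"
    "internettld= .th</p>"
    "<p><b>Thailand</b> is a <r>country</r> in Southeast <r>Asia</r>.</p>"
)

soup = BeautifulSoup(definition, "html.parser")

# find() matches tag names, so "p[1]" looks for a tag literally named "p[1]"
# instead of being evaluated as an XPath expression.
missing = soup.find("p[1]")   # no such tag, so this is None
first_p = soup.find("p")      # first <p> in document order

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that contain only whitespace.
pieces = list(first_p.stripped_strings)

# Optional: swap each <br> for a newline to get one entry per infobox line.
for br in first_p.find_all("br"):
    br.replace_with("\n")
lines = [ln.strip() for ln in first_p.get_text().splitlines() if ln.strip()]

for line in lines:
    print(line)
```

Note that .stripped_strings splits the text at every tag boundary, so "countrycapital=" and "Bangkok" come out as separate items in pieces. That is why it only gives approximately the requested output, and why the sketch also builds lines from the <br> positions.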