I’m rather new to Python and programming ;-), and I’m writing a program to scrape data from a website that has over 6000 lines for a single page, and I’m going to scrape about 20,000 such pages. I’m using Python 2.7.4.
I have seen some tutorials on how to use regular expressions, but they did not work for me.
I’m using Beautiful Soup to find particular tags. Specifically, I need to find tags like these:
<tr class="room_loop_counter1 maintr">
<tr class="room_loop_counter1 extendedRow">
<tr class="room_loop_counter2 maintr odd">
<tr class="room_loop_counter2 extendedRow odd">
<tr class="room_loop_counter3 maintr">
<tr data-occupancy="2" class="room_loop_counter1 ">
<tr data-occupancy="2" class="room_loop_counter2 odd">
<tr data-occupancy="3" class="room_loop_counter3 ">
<tr data-occupancy="3" class="room_loop_counter4 odd">
etc. I’m not sure about the space in front of the closing quote after room_loop_counter1 and room_loop_counter3.
I was trying to write an expression that would fit the following code:
soup = BeautifulSoup(html_part)
av = soup.find_all('tr', class_=REGULAR_EXP)
REGULAR_EXP = re.compile('"room_loop_counter"\d\s.')
but I have obviously written the wrong regular expression for the class.
How do I write one that will be valid?
I suppose it should be an expression that finds all “room_loop_counter” followed by any number of characters (numbers, spaces, letters, but not a newline character).
Thank you in advance.
The following regex finds all “room_loop_counter” followed by any number of characters (numbers, spaces, letters, but not a newline character):
Your regex "room_loop_counter"\d\s. matches "room_loop_counter" (note the enclosing quotes) followed by a digit, followed by a space and then any character. So it matches "room_loop_counter"1 x and "room_loop_counter"3 ! but not "room_loop_counter1".
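To make the difference concrete, here is a minimal sketch that uses only the standard `re` module. The pattern `room_loop_counter\d+` is an assumption of mine (the literal prefix followed by one or more digits), not necessarily the regex intended above, and the sample strings are the class values copied from the question plus one made-up non-matching value.

```python
import re

# Assumed pattern: the literal prefix followed by one or more digits.
# No quote characters, because the quotes are HTML syntax and are not
# part of the class attribute's value.
ROOM_CLASS = re.compile(r'room_loop_counter\d+')

# The pattern from the question, kept for comparison.
ORIGINAL = re.compile(r'"room_loop_counter"\d\s.')

# Class attribute values as they appear in the question, plus one
# hypothetical value that should not match.
class_values = [
    'room_loop_counter1 maintr',
    'room_loop_counter2 extendedRow odd',
    'room_loop_counter1 ',
    'room_loop_counter4 odd',
    'unrelated_row',
]

# Values accepted by the assumed pattern.
matched = [value for value in class_values if ROOM_CLASS.search(value)]

# Values the original pattern fails to match (all of them, since none
# contains a literal double quote).
missed_by_original = [value for value in class_values
                      if not ORIGINAL.search(value)]

print(matched)
print(len(missed_by_original))
```

With Beautiful Soup, a compiled pattern like this can be passed as the `class_` argument, e.g. `soup.find_all('tr', class_=ROOM_CLASS)`, provided the pattern is defined before that call. Beautiful Soup tests the pattern against the individual class names, so a trailing space or extra classes such as `odd` should not need to be part of the regex.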