I have scraped a webpage using Scrapy and need to extract the background color from certain objects. Because inline-css is not part of the DOM, or so I have read, I need to create a regex that will augment my current XPath and select the needed value within an object’s style attribute. My current XPath returns the entire style value like so:
background:#80FF00;height:48px;width:98px;color:#FFFFFF
I need a regex that will select only the background hex value (i.e. #80FF00). I do not need to verify that the value is properly formatted (e.g. with ([0-9A-Fa-f]{3}|[0-9A-Fa-f]{6})\b); I just need to grab whatever is between ‘background:’ and the following ‘;’.
I am new to writing regular expressions and appreciate the help.
The following regex should do what you want; the value you want to grab will be in the first capture group:
background:(.*?);
In Python
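A minimal sketch of how this might look with Python's standard re module. It assumes the pattern is background:(.*?); (as described in the explanation that follows) and uses the style string from the question as sample input; the helper name extract_background is made up for illustration.

```python
import re

# Sample style value taken from the question.
style = "background:#80FF00;height:48px;width:98px;color:#FFFFFF"

# Assumed pattern: lazily capture everything after "background:"
# up to the first ";" that follows it.
BACKGROUND_RE = re.compile(r"background:(.*?);")


def extract_background(style_value):
    """Return the text between 'background:' and the next ';', or None."""
    match = BACKGROUND_RE.search(style_value)
    return match.group(1) if match else None


print(extract_background(style))  # #80FF00
```

If you are working with Scrapy selectors directly, they also offer a .re() method that takes a pattern like this one, so the same regex could probably be applied to the result of your XPath without a separate step.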
Here . matches any character, * means repeat the previous element any number of times, and the ? makes it a lazy match, so it will match as few characters as possible. This is necessary to make sure the capture stops at the first semicolon; a greedy match would run on through the other declarations and only stop at the last semicolon. An alternative would be background:([^;]*) since [^;] only matches non-semicolon characters.
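To make the difference concrete, here is a small comparison of the lazy pattern, a greedy version of it, and the [^;] alternative. This is only a sketch: the second input string (background as the last declaration, with no trailing semicolon) is a made-up case, not something from the question, and first_group is a hypothetical helper.

```python
import re

# Style value from the question.
style = "background:#80FF00;height:48px;width:98px;color:#FFFFFF"
# Hypothetical variant: background comes last and has no trailing ";".
style_no_trailing = "color:#FFFFFF;background:#80FF00"

lazy = re.compile(r"background:(.*?);")      # stops at the first ";"
greedy = re.compile(r"background:(.*);")     # runs on to the last ";"
negated = re.compile(r"background:([^;]*)")  # stops before any ";"


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if no match."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(first_group(lazy, style))     # #80FF00
print(first_group(greedy, style))   # #80FF00;height:48px;width:98px
print(first_group(negated, style))  # #80FF00

# Without a trailing ";" the lazy pattern has nothing to stop at,
# while the negated character class still matches.
print(first_group(lazy, style_no_trailing))     # None
print(first_group(negated, style_no_trailing))  # #80FF00
```

The [^;] version does not require a closing semicolon, which may make it the safer choice if the background declaration can appear at the end of the style attribute.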