I’m trying to extract a token from a string and return it. The same method works fine on other strings, but for this one it returns no result, neither with findall nor with search.
import re

pattern = re.compile(r'<input class="token" value="(.+?)" name="csrftoken_reply">')
matches = pattern.findall(htmlstring)
for match in matches:
    print(match)
There is only one value in each response string, yet the loop prints nothing.
I also tried re.search, but the same thing happens: a NoneType object is returned…
MORE INFO:
This is part of the HTML I’m parsing:
<form id="threadReplyForm" class="clearfix" method="post" action="/go/messages/private/threadID=0551796">
<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">
<input type="hidden" value="reply" name="action">
<div class="editorWrapper">
<div id="premiumSmiliesNotAllowed" class="warning" style="display: none;">
<div id="editor_13" class="clearfix editor" mode="full">
<ul id="editorToolbar_13" class="editorToolbar clearfix">
<textarea id="messageInput" class="autogrow" cols="20" rows="8" name="message"></textarea>
<div id="previewDiv" class="previewArea" style="display: none;"></div>
</div>
<script>
</div>
<script>
<span class="loadingIndicator right loadingIndicatorMessage">
<p class="clearfix">
</form>
I’m parsing it with this:
pattern = re.compile(r'<input class="csrftoken" type="hidden" value="(.+?)" name="csrftoken_reply">')
matches = pattern.findall(str(response.read()))
for match in matches:
    print(match)
I’m trying to get a7b161b7 as the output.
I’m not a Python person, and I don’t recommend regex for parsing HTML, but it might be
possible to get unordered attribute-value data this way. Just put in the pairs that are
needed to qualify the tag. It doesn’t have to be all of them, and they can be in any order.
Modifiers: expanded, single-line string, global.
The value capture group is $5
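For anyone who wants to try the lookahead idea with Python's re module, here is a minimal sketch. It is not the exact expression from this answer. It assumes Python 3, quoted attribute values, and that class and name are the pairs chosen to qualify the tag. The names TOKEN_RE and find_tokens are made up for illustration, and the value is read from a named group instead of a numbered one. re.VERBOSE and re.DOTALL correspond to the "expanded" and "single-line" modifiers, and finditer gives the "global" behaviour.

```python
import re

# Hypothetical pattern: each lookahead scans the tag from just after "<input",
# stepping over quoted values as whole units so it can never run past the
# closing ">" of this tag. Assumes attribute values are quoted.
TOKEN_RE = re.compile(
    r"""
    <input
    (?= (?: [^>"'] | "[^"]*" | '[^']*' )*? \s class \s*=\s* (["']) csrftoken \1 )
    (?= (?: [^>"'] | "[^"]*" | '[^']*' )*? \s name  \s*=\s* (["']) csrftoken_reply \2 )
    (?= (?: [^>"'] | "[^"]*" | '[^']*' )*? \s value \s*=\s* (["']) (?P<value> .*? ) \3 )
    (?: [^>"'] | "[^"]*" | '[^']*' )*
    >
    """,
    re.VERBOSE | re.DOTALL,
)


def find_tokens(html):
    """Return the value of every qualifying <input> tag, in document order."""
    return [m.group("value") for m in TOKEN_RE.finditer(html)]


# The tag from the question, plus a reordered variant with single quotes.
as_posted = '<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">'
reordered = "<input name='csrftoken_reply' type=\"hidden\" value='a7b161b7' class=\"csrftoken\" />"
other = '<input type="hidden" value="reply" name="action">'

print(find_tokens(as_posted))
print(find_tokens(other + reordered))
```

Because the attribute order no longer matters, the same pattern works whether the server sends the attributes as shown in the question or in some other order.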
Edit
Changed
(?= (?:".*?"|\'.*?\'|[^>]*?)+
to
(?= (?:[^>"\']|(?>".*?"|\'.*?\'))*?
because the lazy quantifier in the original form will be forced to overrun markup boundaries to satisfy the lookahead. The new sub-expression handles attr="so< m >e" embedded markup without overruns.
All the caveats apply: the tag could be hidden in embedded code, could be inside comments, etc.
Extra regex logic is needed for that.
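If the comment and embedded-code cases matter, one alternative to adding more regex logic is to let an HTML parser do the tokenizing. The following is only a sketch, assuming Python 3 and its standard html.parser module. The names TokenFinder and extract_tokens are hypothetical, and matching on name="csrftoken_reply" alone is an assumption about what qualifies the tag.

```python
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Collects the value of every <input name="csrftoken_reply"> start tag."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs in whatever order they appear.
        if tag != "input":
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "csrftoken_reply" and attr_map.get("value") is not None:
            self.tokens.append(attr_map["value"])


def extract_tokens(html):
    parser = TokenFinder()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Tags inside comments or <script> bodies are not reported as start tags.
sample = """<!-- <input value="old" name="csrftoken_reply"> -->
<script>var s = '<input value="js" name="csrftoken_reply">';</script>
<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">"""

print(extract_tokens(sample))
```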