I have a text file containing entries like this: @markwarner VIRGINIA – Mark Warner

Asked: June 7, 2026 (2026-06-07T07:47:05+00:00)

I have a text file containing entries like this:

@markwarner VIRGINIA - Mark Warner 
@senatorleahy VERMONT - Patrick Leahy NO 
@senatorsanders VERMONT - Bernie Sanders 
@orrinhatch UTAH - Orrin Hatch NO 
@jimdemint SOUTH CAROLINA - Jim DeMint NO 
@senmikelee UTAH -- Mike Lee 
@kaybaileyhutch TEXAS - Kay Hutchison 
@johncornyn TEXAS - John Cornyn 
@senalexander TENNESSEE - Lamar Alexander

I have written the following to remove the ‘NO’ and the dashes using regular expressions:

import re

# Read the whole file; the with-block closes it afterwards
with open('testfile.txt') as politicians:
    text = politicians.read()

# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s@[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)

## Make the list a string
newlist = ' '.join(no)

## Replace the dashes in the string with a space
deldash = re.compile(r'\s-*\s')
a = deldash.sub(' ', newlist)

# Delete 'NO' in the string
delno = re.compile(r'NO\s')
b = delno.sub('', a)

# make the string into a list
# problem with @jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(@[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)

for i in lst1:
    print(i)

When I run the code, it captures the Twitter handle, state, and full name for every entry except Jim DeMint, whose surname is missing. I have set the regex to ignore case.

Any ideas? Why is the expression not capturing this surname?

1 Answer

Editorial Team · Answer 1 · 2026-06-07T07:47:07+00:00

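The pattern misses it because his state name contains two words: SOUTH CAROLINA. Your second regex allows for roughly three words after the handle, so the extra state word uses up the slot meant for the surname and the match stops at "Jim".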
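To see the failure concretely, here is a minimal sketch that runs the question's second pattern, unchanged, on a hypothetical two-entry sample. The sample string is an assumption standing in for the cleaned text (NOs and dashes already removed); it is not taken from the real file.

```python
import re

# The second pattern from the question, copied as-is.
regex2 = re.compile(r'(@[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)

# Hypothetical cleaned input: one single-word state, one two-word state.
sample = '@orrinhatch UTAH Orrin Hatch @jimdemint SOUTH CAROLINA Jim DeMint'

matches = regex2.findall(sample)
for m in matches:
    print(m)
# @orrinhatch UTAH Orrin Hatch
# @jimdemint SOUTH CAROLINA Jim
```

The single-word state matches in full, while the two-word state leaves no room for the surname.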

Change your second regex to the following, which should help:

 (@[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)

I added

(?:\s\w+)?

This is an optional, non-capturing group that matches a whitespace character followed by one or more word characters (letters, digits, or underscores).

http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
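The same check can be done in Python. This is a small sketch, assuming a hypothetical two-entry sample of already-cleaned text; the variable names `fixed` and `sample` are illustrative only.

```python
import re

# The answer's pattern: the question's regex plus an optional trailing word.
fixed = re.compile(
    r'(@[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)',
    re.I,
)

# Hypothetical cleaned input (NOs and dashes already stripped).
sample = '@orrinhatch UTAH Orrin Hatch @jimdemint SOUTH CAROLINA Jim DeMint'

matches = fixed.findall(sample)
for m in matches:
    print(m)
# @orrinhatch UTAH Orrin Hatch
# @jimdemint SOUTH CAROLINA Jim DeMint
```

The optional group is skipped for the first entry because the next token starts with `@`, which `\w` does not match. If a stray word such as NO were still present after a single-word-state entry, the optional group would pick it up, so this relies on the earlier cleanup steps.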

EDIT:
If you want one master regex to capture and split everything properly, use the following after you remove the NOs and dashes:

((@[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))

Which you can play with here: http://regexr.com?31fvk

The full match is available in $1, the Twitter handle in $2, the state in $3, and the name in $4. In Python these correspond to group(1) through group(4) of the match object.

Each capturing group works as follows:

(@[\w]+?\s)

This matches an @ sign followed by one or more word characters, as few as possible, up to and including a whitespace character.

((?:(?:[\w]+?)\s){1,2})

This matches and captures one or two words, which should be the state. It works only because the next piece MUST match exactly two words, which forces the engine to backtrack and leave those two words for the name.

((?:[\w]+?\s){2})

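This matches and captures exactly two words, where a word is defined as one or more word characters, as few as possible, followed by a whitespace character. Every name, including the last one in the input, must therefore be followed by whitespace.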
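As a rough illustration of how the groups described above map onto Python, the sketch below applies the master pattern to a hypothetical cleaned sample and builds (handle, state, name) tuples. The sample string and the names `master` and `records` are assumptions for the example; note the trailing space, which the pattern needs after the final name.

```python
import re

# The master pattern from the answer, unchanged.
master = re.compile(r'((@[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))')

# Hypothetical cleaned input; the trailing space lets the last name match.
sample = '@orrinhatch UTAH Orrin Hatch @jimdemint SOUTH CAROLINA Jim DeMint '

# group(2) = handle, group(3) = state, group(4) = name; strip the captured spaces.
records = [
    (m.group(2).strip(), m.group(3).strip(), m.group(4).strip())
    for m in master.finditer(sample)
]

for handle, state, name in records:
    print(handle, '|', state, '|', name)
# @orrinhatch | UTAH | Orrin Hatch
# @jimdemint | SOUTH CAROLINA | Jim DeMint
```

Because the name group is fixed at two words, a three-word name or a one-word name would not split correctly with this pattern.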
