I have some sample data like so:
MADISON COUNTY,,,,,,,,,,,,, "London, City of",,,,,,,,,,,,597,519
2.1,mill /s,(replacement),for,5 years,",",commencing in,2007,",",first due in calendar year,2008,",",, for current operating expenses
-,,,,,,,,,,,,, London Public Library District,,,,,,,,,,,,716,869 1.2,mill /s,(replacement),"& increase of 1.7 mills, for 15 years, commencing in 2007, first due in",,,,,,,,,, "calendar year 2008, for
current expenses -",,,,,,,,,,,,, "Range, Township of",,,,,,,,,,,,62,13
1.7,mill /s,(renewal),for,5 years,",",commencing in,2007,",",first due in calendar year,2008,",",, for fire protection -,,,,,,,,,,,,,
What I need at the end is a list of all “Towns”, so the output should be:
["London, City of", "London Public Library District", "Range, Township of"]
I’m struggling here because I don’t really know how to narrow it down to just these fields. As you can see, the runs of commas are a pretty good start, but there are also unwanted runs of commas that don’t follow the pattern. Originally I thought I would match 5 commas on both sides of a string shorter than 100 characters, but that is thwarted by the arbitrary commas here:
first due in",,,,,,,,,, "cale
Any clues?
Further, the data is generally in this format:
SOME COUNTY,,,,,,,,,,,,, SOME TOWN,,,,,,,,,,,,some long string possibly with commas
,,,,,,,,,,,,, SOME TOWN,,,,,,,,,,,,some long string possibly with commas ... etc
It’s hard to tell from your sample data as I think it has extra line breaks in it, but from your summary of the data format it seems that the Town is the 14th column in each line.
As the data is in CSV format, you don’t need a regular expression; you can use the csv module to parse the data instead. Extracting the town names should be as easy as:
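A minimal sketch of that approach in Python, assuming each record sits on a single line as in the format summary, so that the town is the 14th column (index 13). The function name extract_towns and the inline sample rows are made up for illustration. skipinitialspace=True is used because the quoted town names are preceded by a space, which would otherwise stop the csv module from treating the quotes as quoting characters.

```python
import csv

TOWN_INDEX = 13  # 14th column, zero-based (assumption taken from the format summary)


def extract_towns(lines, town_index=TOWN_INDEX):
    """Return the non-empty values found in the town column of each CSV row."""
    towns = []
    # skipinitialspace=True drops the space after each comma, so a field
    # written as ` "London, City of"` is parsed as one quoted field.
    for row in csv.reader(lines, skipinitialspace=True):
        # Skip rows that are too short or have an empty town column.
        if len(row) > town_index and row[town_index].strip():
            towns.append(row[town_index].strip())
    return towns


# Illustrative rows built to match the stated format: 13 commas before the
# town and 12 commas after it, followed by the rest of the record.
sample_lines = [
    "MADISON COUNTY" + "," * 13 + ' "London, City of"' + "," * 12 + "597,519",
    "-" + "," * 13 + " London Public Library District" + "," * 12 + "716,869",
    "-" + "," * 13 + ' "Range, Township of"' + "," * 12 + "62,13",
]

towns = extract_towns(sample_lines)
print(towns)
# ['London, City of', 'London Public Library District', 'Range, Township of']
```

To run this against a real file, pass an open file object instead of the list, opened with newline="" as the csv documentation recommends. If the real data does contain the extra line breaks seen in the sample, the column position will not hold for every line and the rows would need to be rejoined first.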