I am trying to process a Wikipedia SQL dump with Python’s re module. The ultimate goal is to import the dump into PostgreSQL, but I know that apostrophes inside strings need to be doubled first.
Every apostrophe inside a string in this dump is already preceded by a backslash, though, and I’d rather not remove the backslashes.
(42,'Thirty_Years\'_War',33,5,0,0)
Using the command
re.match(".*?([\w]+?'[\w\s]+?).*?", line)
I cannot match the apostrophe in the middle of 'Thirty_Years\'_War' when line is read from a text file.
For comparison, these lines are handled correctly (all except the last one):
The person's car
The person's car's gasoline
Hodges' Harbrace Handbook
'Hodges' Harbrace Handbook'
portspeople',1475,29,0,0),(42,'Thirty_Years\'_War',33,5,0,0)
Correct and expected output (sans the last line):
The person''s car
The person''s car''s gasoline
Hodges'' Harbrace Handbook
('Hodges'' Harbrace Handbook')
portspeople',1475,29,0,0),(42,'Thirty_Years\'_War',33,5,0,0)
Using the command
re.match(".*?([\w\\]+?'[\w\s]+?).*?", line)
breaks it.
The person''s car
The person''''s car''''s gasoline
Hodges'' Harbrace Handbook
(''''''''Hodges'''''''' Harbrace Handbook'''''''')
portspeople'''''''''''''''',1475,29,0,0),(42,''''''''''''''''Thirty_Years\''''''''''''''''_War'''''''''''''''',33,5,0,0)
Is it stuck in some sort of loop? What is the correct regex to use?
I am not thinking about SQL injection attacks because this script is only going to be used for parsing dumps of Wikipedia articles (that don’t contain examples of SQL injection attacks).
If the dump consists of things like the string you provided, you could try something like this:
Where the character class contains all known separators.
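One way to read this suggestion, as a minimal sketch and not necessarily the pattern originally posted: double an apostrophe only when neither neighbour is a separator or a line edge. The separator set used here (an opening parenthesis or comma before, a comma or closing parenthesis after) and the function name are assumptions based on the sample lines in the question.

```python
import re

# Assumed separator set, taken from the tuple syntax in the sample lines.
# An apostrophe is treated as "inside a string" only if the character before
# it is not "(" or "," and the character after it is not "," or ")".
# The lookarounds also require that a neighbour exists, so apostrophes at the
# very start or end of a line are left alone.
INNER_APOSTROPHE = re.compile(r"(?<=[^(,\n])'(?=[^,)\n])")


def double_inner_apostrophes(line: str) -> str:
    """Double apostrophes that are not adjacent to a separator or line edge."""
    return INNER_APOSTROPHE.sub("''", line)


sample = r"portspeople',1475,29,0,0),(42,'Thirty_Years\'_War',33,5,0,0)"
print(double_inner_apostrophes(sample))
```

This leaves the backslash in place and only touches the apostrophe after it. It will misfire on values that contain an apostrophe directly next to a comma or parenthesis, so the character classes would need extending for such data.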
EDIT: Only use regex for parsing when there is no better way 🙂
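As an example of a non-regex route for the dump itself: if, as the question states, every apostrophe inside a string is preceded by a backslash, a plain string replacement may be enough. This is a sketch under that assumption, and the function name is made up for illustration.

```python
def double_escaped_apostrophes(line: str) -> str:
    """Turn each backslash-apostrophe pair into backslash plus two apostrophes.

    Assumes every apostrophe inside a string value is preceded by a backslash,
    and that delimiting apostrophes never are.
    """
    return line.replace("\\'", "\\''")


sample = r"(42,'Thirty_Years\'_War',33,5,0,0)"
print(double_escaped_apostrophes(sample))
```

A value that ends in an escaped backslash right before its closing apostrophe would break the assumption, so it is worth checking the dump for that case before relying on this.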