I have a Wikipedia dump and am struggling to find an appropriate regex pattern to remove the double square brackets in the text. Here is an example of such text:
line = 'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the [[herbicide]]s and [[defoliant]]s used by the [[United States armed forces|U.S. military]] as part of its [[herbicidal warfare]] program, [[Operation Ranch Hand]], during the [[Vietnam War]] from 1961 to 1971.'
I am looking to remove all of the square brackets with the following conditions:
- If there is no vertical separator within the square brackets, remove the brackets.
  Example: `[[herbicide]]s` becomes `herbicides`.
- If there is a vertical separator within the brackets, remove the brackets and only use the phrase after the separator.
  Example: `[[United States armed forces|U.S. military]]` becomes `U.S. military`.
I tried using re.match and re.search but was not able to arrive at the desired output.
Thank you for your help!
What you need is `re.sub`. Note that both square brackets and pipes are meta-characters, so they need to be escaped.
The `\1` in the replacement string refers to what was matched inside the parentheses that do not start with `?:` (i.e. in any case the text you want to keep).
There are two caveats. First, this allows for only a single pipe between the opening and closing brackets. If there is more than one, you would need to specify whether you want everything after the first or everything after the last one. Second, a single `]` between the opening and closing brackets is not allowed. If that is a problem, there would still be a regex solution, but it would be considerably more complicated.
For a full explanation of the pattern:
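The substitution described above could look like the following sketch. The exact pattern and the names `WIKILINK` and `strip_wikilinks` are assumptions chosen for illustration, not necessarily what the answer originally used. The pattern is written with `re.VERBOSE` so that each piece carries its own comment.

```python
import re

# Hypothetical pattern consistent with the description above (an assumption,
# not necessarily the exact pattern the answer had in mind).
WIKILINK = re.compile(
    r"""
    \[\[              # literal opening brackets (escaped)
    (?:[^\]|]*\|)?    # optional, non-capturing: link target followed by a literal pipe
    ([^\]|]*)         # group 1: the text to keep (must not contain a bracket or pipe)
    \]\]              # literal closing brackets (escaped)
    """,
    re.VERBOSE,
)


def strip_wikilinks(text):
    """Replace [[target|label]] with label and [[label]] with label."""
    return WIKILINK.sub(r"\1", text)


line = "one of the [[herbicide]]s used by the [[United States armed forces|U.S. military]]"
print(strip_wikilinks(line))
# prints: one of the herbicides used by the U.S. military
```

In line with the two caveats, this sketch leaves a link untouched if it contains more than one pipe or a lone `]` between the opening and closing brackets.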