I’m currently playing with the Stack Overflow data dumps and am trying to construct (what I imagine is) a simple regular expression to extract tag names from inside of < and > characters. So, for each question, I have a list of one or more tags like <tagone><tag-two>...<tag-n> and am trying to extract just a list of tag names. Here are a few example tag strings taken from the data dump:
<javascript><internet-explorer>
<c#><windows><best-practices><winforms><windows-services>
<c><algorithm><sorting><word>
<java>
For reference, I don’t need to divide tag names into words, so for examples like <best-practices> I would like to get back best-practices (not best and practices). Also, for what it’s worth, I’m using Python if it makes any difference. Any suggestions?
Since Stack Overflow tag names do not contain embedded < or > characters, you can use the regex:
<(.*?)>
or
<([^>]*)>
Explanation:
<: A literal <
(..): To group and remember the match.
.*?: To match anything in a non-greedy way.
>: A literal >
[^>]: A char class to match anything other than a >
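Since you mentioned Python, here is a minimal sketch of how this could look with re.findall, which returns the contents of the capture group for every match. The helper name extract_tags is just an illustrative assumption, and the pattern uses the negated character class form described in the explanation (the non-greedy form should behave the same on these inputs).

```python
import re

# A literal <, a captured run of characters that are not >, then a literal >.
TAG_PATTERN = re.compile(r"<([^>]*)>")


def extract_tags(tag_string):
    """Return the tag names found in a string such as '<c#><windows>'."""
    # With exactly one capture group, findall returns a list of the
    # captured strings rather than the full matches.
    return TAG_PATTERN.findall(tag_string)


print(extract_tags("<c#><windows><best-practices><winforms><windows-services>"))
# ['c#', 'windows', 'best-practices', 'winforms', 'windows-services']
print(extract_tags("<java>"))
# ['java']
```

Hyphenated names such as best-practices come back whole, because the pattern only treats < and > as delimiters.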