I’m currently playing with the Stack Overflow data dumps and am trying to construct

Question


Asked: May 16, 2026


I’m currently playing with the Stack Overflow data dumps and am trying to construct (what I imagine is) a simple regular expression to extract tag names from inside of < and > characters. So, for each question, I have a list of one or more tags like <tagone><tag-two>...<tag-n> and am trying to extract just a list of tag names. Here are a few example tag strings taken from the data dump:

<javascript><internet-explorer>

<c#><windows><best-practices><winforms><windows-services>

<c><algorithm><sorting><word>

<java>

For reference, I don’t need to divide tag names into words, so for examples like <best-practices> I would like to get back best-practices (not best and practices). Also, for what it’s worth, I’m using Python if it makes any difference. Any suggestions?


1 Answer

Editorial Team · Answer 1 · 2026-05-16T20:07:36+00:00


Since Stack Overflow tag names do not contain embedded < or > characters, you can use the regex:

<(.*?)>

or

<([^>]*)>

Explanation:

< : A literal <
(..) : Groups and remembers (captures) the match.
.*? : Matches anything, in a non-greedy way.
> : A literal >
[^>] : A character class that matches any character other than >
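
Since the question mentions Python, here is a minimal sketch of how the second pattern could be applied with the standard re module. The names TAG_PATTERN and extract_tags are illustrative only and are not part of the data dump or of any library.

```python
import re

# Capture everything between a literal < and the next literal >.
# TAG_PATTERN and extract_tags are hypothetical names chosen for this sketch.
TAG_PATTERN = re.compile(r"<([^>]*)>")


def extract_tags(tag_string):
    """Return the tag names found in a string such as '<c#><windows>'."""
    # With a single capturing group, findall returns just the captured text.
    return TAG_PATTERN.findall(tag_string)


if __name__ == "__main__":
    samples = [
        "<javascript><internet-explorer>",
        "<c#><windows><best-practices><winforms><windows-services>",
        "<c><algorithm><sorting><word>",
        "<java>",
    ]
    for sample in samples:
        print(extract_tags(sample))
```

Hyphenated names such as best-practices come back as a single item, because the pattern only splits on the angle brackets. The non-greedy form <(.*?)> should give the same results on these strings.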
