I’m having a hard time understanding this regex stuff…
I have a string like this:
<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">
I want to use findall() and groups to get this:
['56242','saddelmageri']
I can match the number with something like "synset-[0-9]" and the word with something like "{(.*?)}" but how do I write it to get the above result?
And here’s a follow-up question – some lines look like this:
<wn20schema:NounSynset rdf:about="&dn;synset-2589" rdfs:label="{cykel_3: trehjulet cykel; tricykel,1_1}">
In this case I want to extract the stuff between the {} with this result:
['2589', ['cykel', 'trehjulet cykel', 'tricykel']]
so that I can drop it in a dictionary later as a key (2589) : value (['cykel', 'trehjulet cykel', 'tricykel']) pair.
Any thoughts?
Since this appears to be XML data, you would be better off using an XML parser; parsing XML with regular expressions is very difficult to get right.
However, since you specifically asked for a regular expression…
Your specifications are a bit imprecise, and with regular expressions you need to be very precise in what constitutes a match. For example, will the rdfs:label value always have a _1 that you want to strip off? Will there always only be one of these blocks of data per line, or multiple per line? Also, is the order of the result important?
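To make the point about precision concrete, here is a small sketch using the two patterns from the question on the first sample line. The variable name `s` is just for illustration. Note that `[0-9]` on its own matches exactly one digit, and that findall() returns the whole match when the pattern has no group.

```python
import re

s = '<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">'

# No group and no quantifier: the whole match is returned, with a single digit.
print(re.findall(r'synset-[0-9]', s))      # ['synset-5']

# One group plus '+': only the captured digits are returned.
print(re.findall(r'synset-([0-9]+)', s))   # ['56242']

# Lazy match between literal braces; the '_1' suffix is still attached.
print(re.findall(r'\{(.*?)\}', s))         # ['saddelmageri_1']
```

So the answers to the questions above (is there always a suffix, how many blocks per line) decide what the final pattern has to look like.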
Here’s a quick hack that might give you close to what you want:
When I run the above, I get the following output, which is a list of two-tuples containing the two strings you wanted (though in a different order):
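For reference, here is a minimal sketch of one way to get both values with findall() and two groups. It is not necessarily the pattern used above. It assumes the number always follows "synset-", the label is always wrapped in {} inside the same tag, terms are separated by ":" or ";", and sense suffixes look like "_1" or ",1_1". The names `SYNSET_RE`, `parse_label` and `extract` are made up for this example.

```python
import re

# Group 1: digits after "synset-". Group 2: the text inside {...} in the same tag.
SYNSET_RE = re.compile(r'synset-(\d+)[^>]*?\{(.*?)\}')

def parse_label(label):
    """Split a label on ':' or ';' and strip suffixes such as '_1' or ',1_1'."""
    terms = []
    for part in re.split(r'[:;]', label):
        term = re.sub(r'(?:,\d+)?_\d+$', '', part.strip())
        if term:
            terms.append(term)
    return terms

def extract(text):
    """Return a dict mapping each synset number (as a string) to its terms."""
    return {num: parse_label(label) for num, label in SYNSET_RE.findall(text)}

lines = [
    '<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">',
    '<wn20schema:NounSynset rdf:about="&dn;synset-2589" '
    'rdfs:label="{cykel_3: trehjulet cykel; tricykel,1_1}">',
]
for line in lines:
    print(extract(line))
```

Because findall() returns one tuple per match, this also copes with several tags on one line. The keys are strings; wrap `num` in int() if you want integer keys in the dictionary.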