I’m not sure how to even ask this question so bear with me. I have a list of (mostly) alpha-numerics that are drawing numbers in a giant XML that I’m tweaking a schema for. There appears to be no standard as to how they’ve been created, so I’m trying to create an XSD regex pattern for them to validate against. Normally, I’d just grind through them but in this case there are hundreds of them. What I want to do is isolate them down to a single instance of each type of drawing number, and then from that, I can create a regex with appropriate OR statements in the XSD.
My environment is Win7, but I’ve got an Ubuntu VM as well as Cygwin (where I’m currently doing all of this). I don’t know if there’s a Linux utility that can do this, or if my grep/sed-fu is just weak. I have no idea how to reduce this problem down except by brute force (which I’ve done for other pieces of this puzzle that weren’t as large as this one).
I used this command line statement to grab the drawing “numbers”. It looks for the drawing number, sorts them, only gives me uniques, and then strips away the enclosing tags:
grep "DrawingNumber" uber.xml | sort | uniq | sed -e :a -e 's/<[^>]*>//g;/</N;//ba'
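As a rough sketch of what that pipeline does, here is a simplified version run on a few inline lines instead of uber.xml. The helper name extract_numbers is made up for illustration, and the sketch assumes every tag sits on a single line, so it drops the :a / N / ba loop that the original sed expression uses to cope with tags that span lines. LC_ALL=C is also an added assumption, there only to make the sort order predictable.

```shell
# Hypothetical helper: reads XML lines on stdin, prints the unique drawing numbers.
extract_numbers() {
  grep "DrawingNumber" |   # keep only lines that mention the element
    LC_ALL=C sort -u |     # sort and drop duplicate lines (same effect as sort | uniq)
    sed 's/<[^>]*>//g'     # delete every <...> tag, leaving the text between them
}

# Stand-in for uber.xml: three lines, one of them a duplicate.
printf '%s\n' \
  '<DrawingNumber>10431</DrawingNumber>' \
  '<DrawingNumber>10430A</DrawingNumber>' \
  '<DrawingNumber>10431</DrawingNumber>' |
  extract_numbers
# prints:
# 10430A
# 10431
```

Any indentation in front of the tags would survive this, since only the tags themselves are removed.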
Here is a sample of some of the actual drawing “numbers” (there are hundreds more):
10023C/10024C *<= this is how it's represented in the XML & I can't (easily) change it.
10023C
10043E
10051B
10051D
10058B
10059C
10447B 10447B *<= this is how it's represented in the XML & I can't (easily) change it.
10064A
10079B
10079D
10082B
10095A
10098B
10100B
10102
10109B
10109C
10115
101178
10118F
What I want is a list that would reduce the list of drawing numbers to a single instance of each type. For instance, this group of drawing “numbers”:
10023C
10043E
10051B
10051D
10058B
10059C
Would reduce to:
nnnnnx
to represent all instances of 5 digits followed by a single letter for which I can create a pattern like so:
[0-9]{5}[a-zA-Z]{1}
Similarly,
10102
10115
would reduce to:
nnnnn
which would represent all instances of 5 digits with nothing following and be captured with:
[0-9]{5}
and so on. I hope this is enough information to present the problem in a workable form. Like I said, I didn’t even know how to frame the question, and frequently when I get as far as writing a question on SO I realize a solution and don’t even submit it, but this one has me stumped.
Update:
Using @nullrevolution’s answer, here’s what I came up with (this clarifies my comment below which is largely unreadable).
The command line I eventually used was:
grep "DrawingNumber" uber.xml | sort -d | uniq | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed 's/[A-Za-z]/x/g;s/[0-9]/n/g' | sort -u
On data that looked like this:
<DrawingNumber>10430A</DrawingNumber>
<DrawingNumber>10431</DrawingNumber>
<DrawingNumber>10433</DrawingNumber>
<DrawingNumber>10434</DrawingNumber>
<DrawingNumber>10443A</DrawingNumber>
<DrawingNumber>10444</DrawingNumber>
<DrawingNumber>10446</DrawingNumber>
<DrawingNumber>10446A</DrawingNumber>
<DrawingNumber>10447</DrawingNumber>
<DrawingNumber>10447B 10447B</DrawingNumber>
<DrawingNumber>10447B</DrawingNumber>
<DrawingNumber>10454A</DrawingNumber>
<DrawingNumber>10454B</DrawingNumber>
<DrawingNumber>10455</DrawingNumber>
<DrawingNumber>10457</DrawingNumber>
Which gave me a generified output of (for all my data, not the snippet above):
nnnnn
nnnnnn
nnnnnx
nnnnnx nnnnnx
nnnnnx/nnnnnx
nnxxx
Which is exactly what I needed. Turns out the next two instances of things I need to figure out will benefit from this new method, so who knows how many hours this just saved me?
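For the last step (turning those shapes into the OR'd pattern for the XSD), something like the following might work. This is only a sketch under a few assumptions: shapes_to_pattern is a name invented here, n always means a digit and x always means a letter, and any other character (the space and the slash in my data) is copied through literally, which is only safe when that character has no special meaning in a regex.

```shell
# Hypothetical helper: reads one shape per line (n = digit, x = letter)
# and prints a single pattern with the alternatives joined by "|".
shapes_to_pattern() {
  awk '
    # Append the current run of identical characters to the output pattern.
    function emit() {
      if (count == 0) return
      if (prev == "n") atom = "[0-9]"
      else if (prev == "x") atom = "[A-Za-z]"
      else atom = prev                     # assumed to be a harmless literal
      rep = (count > 1) ? "{" count "}" : ""
      out = out atom rep
    }
    {
      out = ""; prev = ""; count = 0
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == prev) count++
        else { emit(); prev = c; count = 1 }
      }
      emit()
      print out
    }' | paste -sd"|" -
}

printf '%s\n' 'nnnnn' 'nnnnnx' 'nnxxx' | shapes_to_pattern
# prints: [0-9]{5}|[0-9]{5}[A-Za-z]|[0-9]{2}[A-Za-z]{3}
```

Since an XSD pattern has to match the whole value, the alternatives should not need explicit anchors.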
try stripping away the enclosing tags first, then:
which will replace all letters with “x” and all digits with “n”, then remove all duplicates.
run against your sample input file, the output is:
if that’s not feasible, then could you share a portion of the input file in its original form?
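for reference, a minimal sketch of the substitution described in this answer, using the question's notation (n for a digit, x for a letter). the sed expressions are the ones quoted in the question's update; the function name generify, the inline sample lines, and the LC_ALL=C setting (added so the sort order is the same everywhere) are assumptions made for illustration.

```shell
# Hypothetical helper: reads tag-free drawing numbers on stdin and prints
# each distinct "shape" once.
generify() {
  # Letters must be replaced first: "n" and "x" are letters themselves,
  # so doing the digits first would turn every result into a run of x's.
  sed 's/[A-Za-z]/x/g; s/[0-9]/n/g' | LC_ALL=C sort -u
}

printf '%s\n' '10023C/10024C' '10023C' '10051B' '10447B 10447B' '10102' '101178' |
  generify
# prints:
# nnnnn
# nnnnnn
# nnnnnx
# nnnnnx nnnnnx
# nnnnnx/nnnnnx
```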