I’m not sure how to even ask this question so bear with me. I

Asked: June 15, 2026 (2026-06-15T15:44:02+00:00)

I’m not sure how to even ask this question so bear with me. I have a list of (mostly) alpha-numerics that are drawing numbers in a giant XML that I’m tweaking a schema for. There appears to be no standard as to how they’ve been created, so I’m trying to create an XSD regex pattern for them to validate against. Normally, I’d just grind through them but in this case there are hundreds of them. What I want to do is isolate them down to a single instance of each type of drawing number, and then from that, I can create a regex with appropriate OR statements in the XSD.

My environment is Win7, but I’ve got an Ubuntu VM as well as Cygwin (where I’m currently doing all of this). I don’t know if there’s a Linux utility that can do this, or if my grep/sed-fu is just weak. I have no idea how to reduce this problem down except by brute force (which I’ve done for other pieces of this puzzle that weren’t as large as this one).

I used this command line statement to grab the drawing “numbers”. It looks for the drawing number, sorts them, only gives me uniques, and then strips away the enclosing tags:

grep "DrawingNumber" uber.xml | sort | uniq | sed -e :a -e 's/<[^>]*>//g;/</N;//ba'

Here is a sample of some of the actual drawing “numbers” (there are hundreds more):

10023C/10024C *<= this is how it's represented in the XML & I can't (easily) change it.
10023C
10043E
10051B
10051D
10058B
10059C
10447B 10447B *<= this is how it's represented in the XML & I can't (easily) change it.
10064A
10079B
10079D
10082B
10095A
10098B
10100B
10102
10109B
10109C
10115
101178
10118F

What I want is a list that would reduce the list of drawing numbers to a single instance of each type. For instance, this group of drawing “numbers”:

Would reduce to:

nnnnnx

to represent all instances of 5 digits followed by a single letter for which I can create a pattern like so:

[0-9]{5}[a-zA-Z]{1}

Similarly,

10102
10115

would reduce to:

nnnnn

which would represent all instances of 5 digits with nothing following and be captured with:

[0-9]{5}

and so on. I hope this is enough information to present the problem in a workable form. Like I said, I didn’t even know how to frame the question, and frequently when I get as far as writing a question in SO I realize a solution & don’t even submit it, but this one has me stumped.
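
Once a candidate pattern exists, one way to check it is to list the drawing numbers it does not cover. The sketch below is only an illustration: `uncovered` is a hypothetical helper name, and it uses grep's extended regex syntax, which is not identical to XSD's but agrees with it for the simple character classes and counted repetitions used here. The `-x` flag makes grep match whole lines, mirroring the way an XSD pattern must match the entire value.

```shell
# Hypothetical helper: print the values on stdin that a candidate pattern
# does NOT cover.
#   -E  extended regular expressions
#   -x  the pattern must match the whole line (like an XSD pattern facet)
#   -v  invert the match, so only uncovered values are printed
uncovered() {
    grep -Evx "$1"
}

# Five digits with an optional trailing letter covers everything here
# except the six-digit value.
printf '%s\n' 10023C 10043E 10102 10115 101178 |
    uncovered '[0-9]{5}[a-zA-Z]?'
# prints 101178
```

Anything the helper prints is a shape that still needs its own alternative in the pattern.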

Update:
Using @nullrevolution’s answer, here’s what I came up with (this clarifies my comment below which is largely unreadable).

The command line I eventually used was:

grep "DrawingNumber" uber.xml | sort -d | uniq | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed 's/[A-Za-z]/x/g;s/[0-9]/n/g' | sort -u

On data that looked like this:

<DrawingNumber>10430A</DrawingNumber>
<DrawingNumber>10431</DrawingNumber>
<DrawingNumber>10433</DrawingNumber>
<DrawingNumber>10434</DrawingNumber>
<DrawingNumber>10443A</DrawingNumber>
<DrawingNumber>10444</DrawingNumber>
<DrawingNumber>10446</DrawingNumber>
<DrawingNumber>10446A</DrawingNumber>
<DrawingNumber>10447</DrawingNumber>
<DrawingNumber>10447B 10447B</DrawingNumber>
<DrawingNumber>10447B</DrawingNumber>
<DrawingNumber>10454A</DrawingNumber>
<DrawingNumber>10454B</DrawingNumber>
<DrawingNumber>10455</DrawingNumber>
<DrawingNumber>10457</DrawingNumber>

Which gave me a generified output of (for all my data, not the snippet above):

nnnnn
nnnnnn
nnnnnx
nnnnnx nnnnnx
nnnnnx/nnnnnx
nnxxx

Which is exactly what I needed. Turns out the next two instances of things I need to figure out will benefit from this new method, so who knows how many hours this just saved me?
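
The shape list can also be turned into regex fragments mechanically. This is a rough sketch, not part of the original solution: `shape_to_regex` is a hypothetical helper that run-length encodes each run of `n` or `x`, and it assumes the only other characters in the shapes are harmless separators such as a space or `/` (anything that is a regex metacharacter would need escaping by hand).

```shell
# Hypothetical helper: read shapes such as "nnnnnx" on stdin and print a
# regex fragment for each one, e.g. [0-9]{5}[A-Za-z].
shape_to_regex() {
    awk '{
        out = ""
        len = length($0)
        i = 1
        while (i <= len) {
            c = substr($0, i, 1)
            if (c != "n" && c != "x") {
                # Separator such as " " or "/": copy it through literally.
                out = out c
                i++
                continue
            }
            # Count how many times the same marker repeats.
            run = 1
            while (i + run <= len && substr($0, i + run, 1) == c) run++
            cls = (c == "n") ? "[0-9]" : "[A-Za-z]"
            suffix = (run > 1) ? "{" run "}" : ""
            out = out cls suffix
            i += run
        }
        print out
    }'
}

# Join the fragments with "|" to get one alternation for the XSD pattern.
printf '%s\n' nnnnn nnnnnx 'nnnnnx nnnnnx' | shape_to_regex | paste -sd '|' -
```

The joined alternation is only a starting point; shapes that differ by an optional trailing letter can usually be merged by hand into a single shorter alternative.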

1 Answer

Editorial Team · Answer 1 · 2026-06-15T15:44:04+00:00

try stripping away the enclosing tags first, then:

sed 's/[A-Za-z]/x/g;s/[0-9]/n/g' file | sort -u

which will replace all letters with “x” and all digits with “n”, then remove all duplicates.

run against your sample input file, the output is:

nnnnnx

if that’s not feasible, then could you share a portion of the input file in its original form?
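
If stripping the tags separately is awkward, the two steps can probably be folded into one sed call. The following is a sketch under the assumption that each DrawingNumber element sits on a single line of the input; `shapes_with_counts` is a hypothetical name, and the `uniq -c` step is an extra that shows how many values fall into each shape.

```shell
# Hypothetical helper: strip the tags, generalize, and count each shape.
# Assumes every <DrawingNumber>...</DrawingNumber> element is on one line.
shapes_with_counts() {
    grep 'DrawingNumber' "$1" |
        sed 's/<[^>]*>//g
             s/^[[:space:]]*//
             s/[[:space:]]*$//
             s/[A-Za-z]/x/g
             s/[0-9]/n/g' |
        sort | uniq -c | sort -rn
}

# Usage, with the file name from the question:
#   shapes_with_counts uber.xml
```

The tags have to be removed before the letter substitution, otherwise the element names themselves would be turned into runs of x. Trimming the surrounding whitespace also drops the trailing carriage returns that files with Windows line endings carry.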
