I am using Beautiful Soup to identify a specific tag and its contents. The


Asked: June 2, 2026


I am using Beautiful Soup to identify a specific tag and its contents. The contents are html-links and I want to extract the text of these tags.

The problem is that the text is made up of different numbers according to a specific pattern. I am only interested in numbers such as “61993J0417” and “61991CJ0316”, and I need the regexp to match whether the number has a “J” or a “CJ” in the middle.

I have used this code to achieve this:

soup.find_all(text=re.compile('[6][1-2][0-9]{3}[J]|[CJ][0-9]{4}'))

The soup variable is the contents of the specific tag. This code works in 9 out of 10 cases. However, when I run this script on one of my source files, it also matches numbers such as “51987PC0716”.

I cannot understand why so I turn to you for assistance.


1 Answer

Editorial Team · Answer 1 · 2026-06-02T17:35:38+00:00


You haven’t specified what the | applies to; by default it’s the entire regex, meaning you have asked for either

[6][1-2][0-9]{3}[J]

(which is the same thing as 6[12][0-9]{3}J) or

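[CJ][0-9]{4}

(where [CJ] is a character class meaning “either C or J”, not the literal sequence CJ). The second alternative on its own is enough to match the “C0716” inside “51987PC0716”, which is why that number slips through. Use parentheses to specify what the alternatives are: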

^6[12][0-9]{3}(J|CJ)[0-9]{4}$

which is better written

^6[12][0-9]{3}C?J[0-9]{4}$
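
To see the difference, here is a minimal sketch using only the standard re module. The names original, fixed and samples are made up for illustration, and the sample strings are the three numbers quoted in the question.

```python
import re

# Pattern from the question: the | splits the whole regex in two,
# so the second alternative is just "[CJ] followed by four digits".
original = re.compile('[6][1-2][0-9]{3}[J]|[CJ][0-9]{4}')

# Corrected pattern: optional C, mandatory J, anchored at both ends.
fixed = re.compile(r'^6[12][0-9]{3}C?J[0-9]{4}$')

samples = ['61993J0417', '61991CJ0316', '51987PC0716']

for s in samples:
    m = original.search(s)
    # Show which substring the original pattern latched onto,
    # and whether the corrected pattern accepts the string.
    print(s, '| original matched:', m.group(0) if m else None,
          '| fixed matches:', fixed.search(s) is not None)
```

The compiled pattern can then be passed to Beautiful Soup the same way as in the question, e.g. soup.find_all(text=fixed). This assumes each text node holds only the number. If the link text carries leading whitespace or other characters, the ^ and $ anchors will stop it from matching, and you may need to strip the text or loosen the anchors.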
