I am processing a flat file, with data in line by line format, like

Asked: June 6, 2026 (2026-06-06T05:41:35+00:00)

I am processing a flat file, with data in line by line format, like this

... blah blah blah | sku: 01234567 | price: 150 | ... blah blah blah

I want to extract the sku field, which is an 8-character number. However, I am not sure whether I should use split or regex; I am not very good at using regex in Python.


1 Answer

Editorial Team · Answer 1 · 2026-06-06T05:41:36+00:00

Assuming your sku values are always 8 characters long and are always preceded by ‘sku’, possibly followed by ‘:’ (with or without spaces in between), I would use the regex r'sku[\s:]*(\d{8})':

>>> import re
>>> string = '... | sku: 01234567 | price: 150 | ... '
>>> re.findall(r'sku[\s:]*(\d{8})', string)[0]
'01234567'
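
Since the question also mentions split, here is a minimal sketch of that route for comparison. It assumes every field is separated by ‘|’ and written as name: value; the helper name extract_sku_with_split is made up for this illustration.

```python
def extract_sku_with_split(line):
    """Return the value of the 'sku' field, or None if the line has none."""
    for field in line.split('|'):
        # partition() splits on the first ':' only; sep is '' when there is no ':'
        key, sep, value = field.partition(':')
        if sep and key.strip() == 'sku':
            return value.strip()
    return None


line = '... blah blah blah | sku: 01234567 | price: 150 | ... blah blah blah'
print(extract_sku_with_split(line))  # 01234567
```

This works as long as the layout is that strict. The regex is more forgiving when the spacing or the separator varies.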

If the length of your sku values may vary, just use: r'sku[\s:]*(\d*)':

>>> import re
>>> string = '... | sku: 01234 | price: 150 | sku: 99872453 | blah blah ... '
>>> re.findall(r'sku[\s:]*(\d*)', string)[0]
'01234'
>>> re.findall(r'sku[\s:]*(\d*)', string)[1]
'99872453'
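
One thing to watch when running this over a whole file: findall(...)[0] raises IndexError on a line with no ‘sku’ at all, and \d* happily returns an empty string when ‘sku’ is not followed by digits. The sketch below is one possible way to handle both cases, using re.search and \d+ (my substitution, not part of the patterns above); the names SKU_PATTERN and first_sku are only illustrative.

```python
import re

# Compile once when the same pattern is applied to many lines.
# \d+ (one or more digits) is used instead of \d* so that a 'sku'
# with no digits after it is not reported as an empty match.
SKU_PATTERN = re.compile(r'sku[\s:]*(\d+)')


def first_sku(line):
    """Return the first sku in the line, or None when there is none."""
    match = SKU_PATTERN.search(line)
    return match.group(1) if match else None


lines = [
    '... | sku: 01234 | price: 150 | ...',
    '... | price: 75 | ...',  # no sku field on this line
]
for line in lines:
    print(first_sku(line))  # prints 01234, then None
```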

edit

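If your ‘sku’ is followed by some other non-digit characters, like sku-sp or sku_anything, you could try this (note that a suffix containing digits, such as sku1, sku2 or sku-18, would have those digits captured as the value instead):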

>>> re.findall(r'sku\D*(\d*)', string)[0]

For ASCII input, this is equivalent to:

>>> re.findall(r'sku[^0-9]*([0-9]*)', string)[0]

It’s very general. It matches ‘sku’, followed by any number of non-digit characters (\D*, or [^0-9]*), followed by a run of digits (\d*, or [0-9]*), and it returns the latter (a string of digits of undetermined length).
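
When the field name itself can contain digits, \D* stops at the first digit it meets. A possible workaround, assuming the value always comes after a ‘:’ and fields are separated by ‘|’, is to skip ahead to the separator instead. This is only a sketch under that assumption:

```python
import re

string = '... | sku-18: 01234567 | price: 150 | ...'

# \D* stops at the first digit, so the '18' in the field name is captured.
print(re.findall(r'sku\D*(\d*)', string))         # ['18']

# Assumed layout: name, then ':', then the value. [^:|]* skips the rest
# of the field name without crossing into the next '|'-separated field.
print(re.findall(r'sku[^:|]*:\s*(\d+)', string))  # ['01234567']
```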

Now, here is what the pieces I used to build these expressions mean:

quantifiers

*: when following a single character or a class of characters, this symbol means the expression will match any number of repetitions of the character or class it follows, including none (* means "0 or more", + means "at least one", ? means "0 or 1").
{}: used in the same way as *, + and ?, i.e. they follow a character or a class of characters and are also quantifiers. c{4} matches exactly 4 ‘c’s in a row; c{1,6} matches between 1 and 6 ‘c’s in a row.
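
A quick way to get a feel for the quantifiers is to try them with re.fullmatch, which succeeds only when the whole string matches. The small helper matches below is just for this illustration:

```python
import re


def matches(pattern, text):
    """True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches(r'ab*', 'a'))           # True:  * allows zero 'b'
print(matches(r'ab+', 'a'))           # False: + needs at least one 'b'
print(matches(r'ab?', 'abb'))         # False: ? allows at most one 'b'
print(matches(r'c{4}', 'cccc'))       # True:  exactly four 'c'
print(matches(r'c{1,6}', 'ccccccc'))  # False: seven 'c' is more than six
```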

classes

[]: defines a class of characters. [abc] means any of the characters ‘a’, ‘b’, or ‘c’. [a-z] means any lower case letter, [A-Z] any upper case letter, [a-zA-Z] any lower or upper case letter, [0-9] any decimal digit. If you want to match decimals with dots or commas, with plus, minus and ‘e’ (for exponentials, for example), just say [0-9,.+e-]. The ‘-’ goes last (or is escaped as \-) because between two characters it denotes a range: +-e would mean every character from ‘+’ to ‘e’.
^: at the start of a class defined with [], it means ‘inverted class’, everything but the class. So [^0-9] means anything but decimal digits, [^a-z] anything but lower case letters, and so on.
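
The following sketch tries a few classes on a made-up sample string, including the range pitfall with ‘-’ inside a class:

```python
import re

text = 'x = -1.5e+3, y = 42'

print(re.findall(r'[0-9]+', text))      # ['1', '5', '3', '42']
print(re.findall(r'[^0-9 ]+', text))    # ['x', '=', '-', '.', 'e+', ',', 'y', '=']
print(re.findall(r'[0-9.+e-]+', text))  # ['-1.5e+3', '42'] ('-' last, so literal)

# Pitfall: between two characters, '-' denotes a range. [+-e] covers every
# character from '+' to 'e' in code-point order, which includes 'A' and '='.
print(re.findall(r'[+-e]', 'A=z'))      # ['A', '=']
```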

predefined classes

These are classes that are predefined in Python, to make regex syntax friendlier:

\s: will match any spacing character (space, tabulation, etc.)
\d: will match any decimal digit (0, 1, 2, 3, 4, 5, 6, 7, 8, 9 … For ASCII text this is equivalent to [0-9], which is another way to express a character class in regexes; with Python 3 str patterns, \d also matches other Unicode decimal digits unless re.ASCII is set)
\D: will match any non-digit character … For ASCII text this is equivalent to [^0-9], which is another way to express an excluded class of characters in regexes.
\S: will match any non-spacing character …
\w: will match any ‘word character’
\W: will match any non-word character
…
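
To see the predefined classes side by side, here is a small sketch on a made-up string. The last three lines show where \d and [0-9] differ for Python 3 str patterns:

```python
import re

text = 'id_7:\t42'

print(re.findall(r'\s', text))   # ['\t']
print(re.findall(r'\S+', text))  # ['id_7:', '42']
print(re.findall(r'\w+', text))  # ['id_7', '42']
print(re.findall(r'\W', text))   # [':', '\t']
print(re.findall(r'\d+', text))  # ['7', '42']
print(re.findall(r'\D+', text))  # ['id_', ':\t']

arabic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit
print(re.findall(r'\d', arabic_three) == [arabic_three])  # True
print(re.findall(r'[0-9]', arabic_three))                 # []
print(re.findall(r'\d', arabic_three, re.ASCII))          # []
```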

groups

() defines a group. Groups have many uses. Here, in findall, the group marks the part of the match you want returned, i.e. (\d{8}) or ([0-9]{8}) means you want only the string of 8 decimal digits from the full match.
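
The effect of groups on what findall returns can be checked directly. A short sketch using the sample line from the question:

```python
import re

string = '... | sku: 01234567 | price: 150 | ...'

# No group: findall returns the whole match.
print(re.findall(r'sku[\s:]*\d{8}', string))          # ['sku: 01234567']

# One group: findall returns only what the group captured.
print(re.findall(r'sku[\s:]*(\d{8})', string))        # ['01234567']

# Several groups: findall returns one tuple per match.
print(re.findall(r'(sku|price)[\s:]*(\d+)', string))  # [('sku', '01234567'), ('price', '150')]

# With re.search, group(0) is the whole match and group(1) the first group.
match = re.search(r'sku[\s:]*(\d{8})', string)
print(match.group(0))  # sku: 01234567
print(match.group(1))  # 01234567
```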

Regular expressions are really easy to use, and very useful. You just have to understand very well what they can and can’t do (they are limited to regular languages; if you need to deal with nested structures, for example, or other languages defined by context-free grammars, regexes won’t be enough). You would probably want to have a look at the following pages:
