I am processing a flat file with one record per line, like this:
... blah blah blah | sku: 01234567 | price: 150 | ... blah blah blah
I want to extract the sku field, which is an 8-digit number. However, I am not sure whether I should use split or a regex; I am not very good at using regexes in Python.
Assuming your sku values are always 8 characters long, and are always preceded by ‘sku’ and possibly some ‘:’ (with or without spaces in between), then I would use the regex:
r'sku[\s:]*(\d{8})'
If the length of your sku values may vary, just use:
r'sku[\s:]*(\d*)'
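For what it's worth, here is a minimal sketch of both patterns applied with re.findall to the sample line from the question. The variable names are only illustrative, and the "n/a" record is a made-up input to show one caveat: because \d* also accepts zero digits, the variable-length pattern can return an empty string.

```python
import re

# Sample record taken from the question.
line = "... blah blah blah | sku: 01234567 | price: 150 | ... blah blah blah"

# Fixed-length variant: 'sku', optional spaces/colons, then exactly 8 digits.
fixed = re.findall(r'sku[\s:]*(\d{8})', line)

# Variable-length variant: any number of digits (including none).
variable = re.findall(r'sku[\s:]*(\d*)', line)

# Hypothetical record without a numeric sku: \d* matches the empty string.
empty = re.findall(r'sku[\s:]*(\d*)', "sku: n/a")

print(fixed)     # ['01234567']
print(variable)  # ['01234567']
print(empty)     # ['']
```

If an empty result is not acceptable, replacing \d* with \d+ requires at least one digit.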
If your ‘sku’ is followed by some other characters, like sku1, sku2, sku-sp, sku-18 or sku_anything, you could try this:
This is the exact equivalent of:
It’s very general. It will match anything that begins with sku, followed by any number of non-digit characters (\D*, or [^0-9]*), and then by some digits (\d*, or [0-9]*). It will return the latter (a string of digits of undetermined length).
Now, what do the things I used to build these expressions mean?
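Before breaking the expressions down, here is a sketch of the general pattern as it is described above. The two regexes are reconstructed from that description (sku, then \D*, then a captured \d*) and are an assumption, not necessarily the exact snippets originally posted. The sample strings are made up.

```python
import re

# Reconstructed from the prose description; treat both as assumptions.
general = re.compile(r'sku\D*(\d*)')
general_explicit = re.compile(r'sku[^0-9]*([0-9]*)')

# Hypothetical inputs with different things following 'sku'.
samples = ["sku: 01234567", "sku-sp = 42", "sku_anything 0007 | price: 150"]

results = [general.findall(s) for s in samples]
results_explicit = [general_explicit.findall(s) for s in samples]

print(results)           # [['01234567'], ['42'], ['0007']]
print(results_explicit)  # same output for these ASCII inputs

# Caveat: digits inside the suffix itself are captured first.
suffix_digits = general.findall("sku-18: 01234567")
print(suffix_digits)     # ['18']
```

Note that in Python 3, \d on str patterns also matches non-ASCII decimal digits unless re.ASCII is set, so the two forms are only strictly equivalent on ASCII input.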
quantifiers
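To make the quantifiers in this part concrete, here is a small sketch using re.fullmatch, which only succeeds if the whole string matches. The test strings are arbitrary examples.

```python
import re

# '*' allows zero repetitions, so 'a' alone matches 'ab*'.
star = re.fullmatch(r'ab*', 'a') is not None

# '+' needs at least one 'b', so 'a' alone does not match 'ab+'.
plus = re.fullmatch(r'ab+', 'a') is not None

# '?' allows at most one 'b', so 'abb' does not match 'ab?'.
optional = re.fullmatch(r'ab?', 'abb') is not None

# '{4}' means exactly four repetitions.
exact = re.fullmatch(r'c{4}', 'cccc') is not None

# '{1,6}' means between one and six repetitions (tested with 0, 1, 6, 7 'c's).
ranged = [re.fullmatch(r'c{1,6}', 'c' * n) is not None for n in (0, 1, 6, 7)]

print(star, plus, optional, exact)  # True False False True
print(ranged)                       # [False, True, True, False]
```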
*: when following a single character or a class of characters, this symbol means that the expression will match any number of repetitions of the character or class it follows (* means "0 or more", + means "at least one", ? means "0 or 1").
{}: used in the same way as *, + and ?, i.e. they follow a character or a class of characters. They are also quantifiers. If you say c{4}, it will match a string composed of exactly 4 ‘c’s. If you say c{1,6}, it will match a string composed of between 1 and 6 ‘c’s.
classes
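As a quick illustration of the character classes covered in this part, here is a minimal sketch. The sample strings are made up. It also shows why the position of ‘-’ inside a class matters: between two characters it defines a range, at the end it is a literal minus sign.

```python
import re

text = "abc XYZ 42"

lower = re.findall(r'[a-z]', text)         # single lower-case letters
upper = re.findall(r'[A-Z]+', text)        # runs of upper-case letters
not_digits = re.findall(r'[^0-9]+', text)  # runs of anything but digits

# Digits, comma, dot, plus, 'e' and a literal '-' (placed last).
numbers = re.findall(r'[0-9,.+e-]+', "x=-1.5e+3 y=2")

# Pitfall: '+-e' is a RANGE from '+' to 'e', which includes 'A'.
range_hit = re.findall(r'[+-e]', 'A')
literal_hit = re.findall(r'[+e-]', 'A')

print(lower)       # ['a', 'b', 'c']
print(upper)       # ['XYZ']
print(not_digits)  # ['abc XYZ ']
print(numbers)     # ['-1.5e+3', '2']
print(range_hit, literal_hit)  # ['A'] []
```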
[]: defines a class of characters. [abc] means any of the characters ‘a’, ‘b’, or ‘c’. [a-z] means any lower-case letter, [A-Z] any upper-case letter, [a-zA-Z] any lower- or upper-case letter, and [0-9] any digit. If you want to match decimal numbers with dots or commas, plus or minus signs and ‘e’ (for exponents, for example), just say [0-9,.+e-]. The ‘-’ goes last so that it is taken literally; written as +-e it would define a range from ‘+’ to ‘e’.
^: at the start of a class defined with [], means ‘inverted class’, everything but the class. So [^0-9] means anything but digits, [^a-z] anything but lower-case letters, and so on.
predefined classes
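Here is a short sketch of the predefined classes in this part, run on a made-up record that contains a tab and an underscore so that \s and \w have something interesting to match.

```python
import re

# Hypothetical record: a space, a tab, and an underscore in a field name.
text = "sku: 01234567\tprice_usd: 150"

digits = re.findall(r'\d+', text)      # runs of digits
spaces = re.findall(r'\s', text)       # single whitespace characters
words = re.findall(r'\w+', text)       # runs of word characters
non_words = re.findall(r'\W+', text)   # runs of non-word characters
non_spaces = re.findall(r'\S+', text)  # runs of non-whitespace characters

print(digits)      # ['01234567', '150']
print(spaces)      # [' ', '\t', ' ']
print(words)       # ['sku', '01234567', 'price_usd', '150']
print(non_words)   # [': ', '\t', ': ']
print(non_spaces)  # ['sku:', '01234567', 'price_usd:', '150']
```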
These are classes that are predefined in Python to make regex syntax friendlier:
\s: matches any whitespace character (space, tab, etc.)
\d: matches any digit (0, 1, 2, 3, 4, 5, 6, 7, 8, 9). This is equivalent to [0-9], which is another way to express a character class in regexes.
\D: matches any non-digit character. This is equivalent to [^0-9], which is another way to express an excluded class of characters in regexes.
\S: matches any non-whitespace character
\w: matches any ‘word character’ (letters, digits and underscore)
\W: matches any non-word character
groups
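To show what a group changes in practice, here is a minimal sketch comparing findall with and without parentheses on a shortened version of the sample line. The two-group pattern is an extra illustration, not something from the original answer.

```python
import re

line = "... | sku: 01234567 | price: 150 | ..."

# With a group, findall returns only the captured part.
with_group = re.findall(r'sku[\s:]*(\d{8})', line)

# Without a group, findall returns the whole match.
without_group = re.findall(r'sku[\s:]*\d{8}', line)

# With several groups, findall returns one tuple per match.
two_groups = re.findall(r'(sku|price)[\s:]*(\d+)', line)

# With search, group(0) is the full match and group(1) the first group.
m = re.search(r'sku[\s:]*(\d{8})', line)
full, captured = m.group(0), m.group(1)

print(with_group)      # ['01234567']
print(without_group)   # ['sku: 01234567']
print(two_groups)      # [('sku', '01234567'), ('price', '150')]
print(full, captured)  # sku: 01234567 01234567
```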
(): defines a group. Groups have many uses. Here, in findall, the group marks what you want the expression to return, i.e. (\d{8}) or ([0-9]{8}) means you want only the string of 8 digits returned, not the full match.
Regular expressions are really easy to use, and very useful. You just have to understand well what they can and can’t do (they are limited to regular languages; if you need to deal with levels of nested things, for example, or other languages defined by context-free grammars, regexes won’t be enough). You will probably want to have a look at the following pages: