I have the following regular expression for data validation:
lexer = /(?:
(.{18}|(?:.*)(?=\s\S{2,})|(?:[^\s+]\s){1,})\s*
(.{18}|(?:.*)(?=\s\S{2,})|(?:[^\s+]\s){1,})\s*
(?:\s+([A-Za-z][A-Za-z0-9]{2}(?=\s))|(\s+))\s*
(Z(?:RO[A-DHJ]|EQ[A-C]|HIB|PRO|PRP|RMA)|H(?:IB[2E]|ALB)|F(?:ER[2T]|LUP2|ST4Q))\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\s+\d{10}|\s+)\s*
(\d{6})\s*
(.*)(?=((?:\d{2}\/){2}\d{4}))\s*
((?:\d{2}\/){2}\d{4})\s*
(\S+)
)/x
The problem is that I have to iterate through a file of about 10,000 lines on average, validating each line against the regular expression, and the resulting parser is slow.
filename = File.new(@file, "r")
filename.each_line.with_index do |line, index|
next if index < INFO_AT + 1
lexer = /(?:
(.{18}|(?:.*)(?=\s\S{2,})|(?:[^\s+]\s){1,})\s*
(.{18}|(?:.*)(?=\s\S{2,})|(?:[^\s+]\s){1,})\s*
(?:\s+([A-Za-z][A-Za-z0-9]{2}(?=\s))|(\s+))\s*
(Z(?:RO[A-DHJ]|EQ[A-C]|HIB|PRO|PRP|RMA)|H(?:IB[2E]|ALB)|F(?:ER[2T]|LUP2|ST4Q))\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*
(\s+\d{10}|\s+)\s*
(\d{6})\s*
(.*)(?=((?:\d{2}\/){2}\d{4}))\s*
((?:\d{2}\/){2}\d{4})\s*
(\S+)
)/x
m = lexer.match(line)
begin
if (m) then ...
Edit
Here you can find some of the lines that I need to parse: File
Edit II
@Mike R
I’m parsing a file that contains 25 columns per line, and each column might have its own validation rule. A column could be anything from pure whitespace to a full character set.
- That validation is required, since I have to drop any line whose column doesn’t match its pattern.
- Might not be necessary
- It’s necessary
I don’t believe that the expression is badly constructed; the lookahead is actually used. Maybe you mean the part where I repeated the code (I just didn’t remember the capturing group back-references \1…\n, if this is what you mean!). I also believe that catastrophic backtracking is happening here.
If you look at the file, maybe you’ll understand why I’m doing this! Take the first column as an example. I have to match a “Part Number” and I don’t have any rule for how it is formatted. Examples:
- 123456789
- 1 555 989
- 0123456789123456789
Neither a simple \S+ nor (\S+\s){1,} could solve this problem, because I wouldn’t be guaranteeing data integrity.
Ty!
Any improvements or suggestions?
~ Eder Quiñones
Your file is a format with fixed-width fields. Ruby has a string method called unpack that is well suited to parsing this type of file. Define the field widths once as a format string, then call unpack with that format on each line in your loop.
Now you have an array (row) with the contents of each field. You can then apply a regex to each one for validation. This is faster, more manageable, and also allows for field-specific error messages.
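As a rough sketch of that approach (not the original code from this answer), the snippet below unpacks a line by column width and then checks each field with its own small, anchored regex. The field names, the widths 18/18/4/5, and the `FIELDS`, `FORMAT` and `parse_line` names are all assumptions made for illustration; the real widths have to come from your file's layout. Only the type-code alternation is taken from the question's expression.

```ruby
# Hypothetical layout: name, width in bytes, per-field validator.
# The widths and names are assumed; replace them with the real column layout.
FIELDS = [
  [:part_number, 18, /\A\S.*\z/],                          # free-form, may contain spaces
  [:description, 18, /\A.*\z/],                            # anything, may be blank
  [:unit,         4, /\A(?:[A-Za-z][A-Za-z0-9]{2})?\z/],   # three characters or blank
  [:type,         5, /\A(?:Z(?:RO[A-DHJ]|EQ[A-C]|HIB|PRO|PRP|RMA)|H(?:IB[2E]|ALB)|F(?:ER[2T]|LUP2|ST4Q))\z/]
].freeze

# "A18A18A4A5": each "A<n>" takes n bytes and strips trailing spaces.
FORMAT = FIELDS.map { |_name, width, _pattern| "A#{width}" }.join.freeze

# Returns [values, errors], where errors lists the names of the fields
# that failed their own validation.
def parse_line(line)
  values = line.chomp.unpack(FORMAT)
  errors = []
  FIELDS.each_with_index do |(name, _width, pattern), i|
    errors << name unless pattern.match?(values[i].to_s)
  end
  [values, errors]
end
```

Inside the line loop this could be used as `row, errors = parse_line(line)` followed by `next unless errors.empty?`. Because every validator is anchored and only sees one short field, there is no cross-field backtracking. Note that `unpack` counts bytes, so this sketch assumes single-byte characters in the data.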