I’m quite inexperienced with RegEx: I have only written the occasional straightforward RegEx for a programming task, worked out by trial and error. Now I have a serious RegEx challenge:
I have about 970 text files containing Sybase Transact SQL snippets, and I need to find every table name in those files and prefix the table name with ‘ #’. So my options are to either spend a week editing the files by hand, or write a script or application using RegEx (Python 3 or Delphi-PCRE) that will perform this task.
The rules are as follows:
Table names are ALWAYS uppercase, so I’m only looking for uppercase words;
Column names, SQL expressions and variables are ALWAYS lowercase;
SQL keywords, table aliases and column values CAN BE uppercase, but must NOT be prefixed with ‘ #’;
Table aliases (must not be prefixed) always directly follow a table name, separated from it only by whitespace;
Column values (must not be prefixed) will either be numerical values or characters enclosed in quotes.
Here is some sample text requiring application of all the above mentioned rules:
update SYBASE_TABLE
set ok = convert(char(10),MB.limit)
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
AND PPL.mot_ind = 'B'
AND PPL.trade_type_ind = 'P'
So far I’ve only gotten this far (not too far…):
(?-i)[[:upper:]]
Any help would be most appreciated.
TIA,
MN
This is not doable with a simple regex replacement. You will not be able to distinguish between uppercase words that are table names, uppercase words inside string literals, and uppercase words inside comments:
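As a small illustration of the problem, here is a hypothetical sketch in Python. The SQL line and the names `FOO`, `BAR` and `BAZ` are made up for this example, and `#` is assumed as the prefix. A naive "prefix every uppercase word" replacement also rewrites the string value and the word in the comment:

```python
import re

# Made-up input: FOO is a table, 'BAR' is a column value,
# and BAZ only appears inside a comment.
sql = "select a from FOO where b = 'BAR' -- see BAZ\n"

# Naive approach: prefix every run of uppercase letters/underscores.
naive = re.sub(r"\b[A-Z_]+\b", lambda m: "#" + m.group(), sql)

print(naive)
# select a from #FOO where b = '#BAR' -- see #BAZ
```

Only `FOO` should have been touched here; the regex has no way of knowing where a literal or a comment starts and ends.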
EDIT
You seem to think that determining whether a word is inside a string literal is easy. If so, consider SQL like this:
or
EDIT II
Okay, I (obsessively) hammered on the fact that a simple regex replacement is not doable. But I didn’t offer a (possible) solution yet. What you could do is create some sort of “hybrid lexer” based on a couple of different regexes. You scan through the input file and, at each character position, try to match either a comment, a string literal, a keyword, or a capitalized word. If none of these 4 patterns matches, you just consume a single character and repeat the process. A little demo in Python:
which produces:
This may not be the exact output you want, but I’m hoping the script is simple enough for you to adjust to your needs.
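A minimal sketch of such a hybrid lexer could look like the following. Everything in it is an assumption to adapt rather than a finished tool: the names `KEYWORDS`, `TOKEN` and `prefix_tables` are illustrative, the keyword list is deliberately incomplete, `#` is assumed as the prefix, and the alias handling (an uppercase word that directly follows another uppercase word is treated as an alias) is just one reading of the rules in the question.

```python
import re

# Assumed, deliberately incomplete keyword list; extend it for real input.
KEYWORDS = ["AND", "OR", "NOT", "IN", "IS", "NULL", "SELECT", "UPDATE",
            "INSERT", "DELETE", "SET", "FROM", "WHERE", "INTO", "VALUES"]

# At each position, the first alternative that matches wins.
TOKEN = re.compile(
    r"(?P<comment>--[^\n]*|/\*.*?\*/)"   # line or block comment
    r"|(?P<string>'(?:[^']|'')*'"        # single-quoted literal
    r'|"(?:[^"]|"")*")'                  # double-quoted literal
    r"|(?P<keyword>\b(?:" + "|".join(KEYWORDS) + r")\b)"
    r"|(?P<word>\b[A-Z][A-Z0-9_]*\b)"    # other all-uppercase word
    r"|(?P<space>\s+)"
    r"|(?P<other>\w+|.)",                # anything else is just consumed
    re.DOTALL,
)

def prefix_tables(sql, prefix="#"):
    tokens = [(m.lastgroup, m.group()) for m in TOKEN.finditer(sql)]

    # Pass 1: collect aliases. An uppercase word that follows another
    # uppercase word, separated only by whitespace, is taken as an alias.
    aliases = set()
    after_table = False
    for kind, text in tokens:
        if kind == "space":
            continue
        if kind == "word" and after_table:
            aliases.add(text)
            after_table = False
        else:
            after_table = kind == "word"

    # Pass 2: prefix every remaining uppercase word; comments, strings,
    # keywords and aliases are copied through unchanged.
    out = []
    for kind, text in tokens:
        if kind == "word" and text not in aliases:
            out.append(prefix + text)
        else:
            out.append(text)
    return "".join(out)

sample = """update SYBASE_TABLE
set ok = convert(char(10),MB.limit)
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
AND PPL.mot_ind = 'B'
AND PPL.trade_type_ind = 'P'
"""
print(prefix_tables(sample))
```

On the sample from the question this prefixes `SYBASE_TABLE`, `MOVE_BOOKS` and `PEOPLEPLACES`, and leaves `MB`, `PPL`, `AND` and the quoted values alone. Collecting the aliases in a separate first pass is what lets `MB.limit` be recognized even though the alias is only declared on a later line.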
Good luck.