I’m trying to write a single regular expression to convert all uppercase words to lowercase while excluding uppercase Roman numerals from being converted.
The only way I found was to convert all uppercased words that are followed by a space, comma, or period, as well as hyphenated words into lowercase. Then convert all Roman numerals back to uppercase.
I used this to convert to lowercase:
(\u+[ ,.-])
Then I had to go through and find and replace all suspected Roman numerals.
What is a better way to do this? I tried negative lookahead expressions with no luck but I’m not very strong at writing them.
The sample that I’m testing this on is the U.S. Constitution. Here’s a sample of the input:
WE, the PEOPLE of the UNITED STATES, in order to form a more perfect
union, establish justice, ensure domestic tranquility, provide for the
common defence, promote the general welfare, and secure the blessings
of liberty to ourselves and our posterity, do ordain and establish
this Constitution for the United States of America.
ARTICLE I.
Sect. 1. ALL legislative powers, herein granted, shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.
Sect. 2. The House of Representatives shall
be composed of Members chosen every second year by all the people of
the several States, and the Electors in each State shall have the
qualifications requisite for Electors of the most numerous branch of
the State Legislature. No person shall be a Representative who shall
not have attained to the age of twenty-five years, and been seven
years a citizen of the United States, and who shall not, when elected,
be an inhabitant of that State in which he shall be chosen.
ARTICLE IV.
ARTICLE V.
ARTICLE VI.
If the regex flavour supports negative lookaheads, you could try a pattern that says “any whole upper-case words that aren’t entirely composed of L, X, I, V, C, D, M” (the Roman numeral letters).
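As a rough illustration, here is a minimal Python sketch of that idea. The exact pattern is an assumption reconstructed from the description above, and the names `UPPER_NOT_ROMAN` and `lower_except_roman` are made up for this example. The lower-casing is done by a replacement callback, since the regex itself only selects the words.

```python
import re

# Assumed pattern: a whole upper-case word that is NOT made up solely of
# the Roman numeral letters L, X, I, V, C, D, M.
UPPER_NOT_ROMAN = re.compile(r"\b(?![LXIVCDM]+\b)[A-Z]+\b")

def lower_except_roman(text):
    """Lower-case every all-caps word that the pattern matches."""
    return UPPER_NOT_ROMAN.sub(lambda m: m.group(0).lower(), text)

sample = "WE, the PEOPLE of the UNITED STATES. ARTICLE IV. Sect. 1. ALL powers"
print(lower_except_roman(sample))
# we, the people of the united states. article IV. Sect. 1. all powers
```

Mixed-case words such as "Sect" are left alone because the trailing `\b` cannot match in the middle of a word.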
It also conveniently stops the word “I” from being converted. (As an aside, if you wanted to prevent one-letter capital words from being converted, use [A-Z]{2,}; this would prevent a capital “A” (at the start of a sentence) and “I” from being converted, which you usually want to stay in their normal case.)
It would also stop any word consisting entirely of these letters from being converted, though. The only ones I can think of are “DID”, and perhaps “DIV” (as in HTML), “DIM” (as in dimension), “MID”, “MIDI”, “VIC” (as in Victoria?)…
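To make the aside and the caveat concrete, this small Python sketch uses the `{2,}` variant. The pattern is again an assumed reconstruction, and `TWO_OR_MORE` and `lower_long_caps` are hypothetical names. It shows a sentence-initial "A" surviving, and "DID" being left in capitals because every one of its letters is a Roman numeral letter.

```python
import re

# Assumed variant: only all-caps words of two or more letters are candidates,
# and words built purely from L, X, I, V, C, D, M are skipped.
TWO_OR_MORE = re.compile(r"\b(?![LXIVCDM]+\b)[A-Z]{2,}\b")

def lower_long_caps(text):
    """Lower-case matching words; one-letter capitals are never touched."""
    return TWO_OR_MORE.sub(lambda m: m.group(0).lower(), text)

print(lower_long_caps("A PERSON and I DID VOTE in ARTICLE VI"))
# A person and I DID vote in article VI
```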
Although, you could certainly alter the Roman numerals regex to be a little more considerate of the rules, e.g. by only accepting the letters in the order and repetition counts that valid numerals allow.
I think that covers all possible Roman numerals.
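One possible stricter version, sketched in Python under some assumptions: the `ROMAN` sub-pattern below is a commonly used thousands/hundreds/tens/ones layout and is not necessarily the pattern originally posted, and the names are hypothetical. Because that sub-pattern can match the empty string, the lookahead ends with `(?!\w)` rather than `\b`; otherwise an empty match at the start of any word would block every conversion.

```python
import re

# Assumed "well-formed numeral" pattern (allows up to MMMM for thousands).
ROMAN = r"M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"

# Skip a word only if the WHOLE word is a well-formed numeral.
STRICT = re.compile(r"\b(?!" + ROMAN + r"(?!\w))[A-Z]+\b")

def lower_except_strict_roman(text):
    """Lower-case all-caps words unless they form a valid Roman numeral."""
    return STRICT.sub(lambda m: m.group(0).lower(), text)

print(lower_except_strict_roman("ARTICLE XIV: I DID MIX MIDI"))
# article XIV: I did MIX midi
```

Words like "DID" and "MIDI" are now converted, but real words that happen to be valid numerals, such as "MIX" (1009) or "DIV" (504), are still left in capitals.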
If your regex flavour doesn’t support negative lookaheads, maybe you could match either a Roman numeral or any other upper-case word, each in its own capture group, and replace with “$2$3_converted_to_lower_case” (sorry, I don’t know how to do the actual conversion itself).
The above would work because the regex only ever matches either the Roman numeral regex (captured in $2) or the other regex (captured in $3), so one of $2 or $3 is always empty.
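The two-group idea can be sketched in Python, where a replacement function handles the conversion step. The pattern is an assumption chosen to match the `$2`/`$3` numbering described above, and `EITHER`, `replace_word` and `lower_without_lookahead` are hypothetical names. In Python a group that did not take part in the match is `None` rather than an empty string.

```python
import re

# Group 1 wraps the alternation, group 2 is the Roman numeral letters,
# group 3 is any other all-caps word. No lookahead is needed.
EITHER = re.compile(r"\b(([LXIVCDM]+)|([A-Z]+))\b")

def replace_word(match):
    roman, other = match.group(2), match.group(3)
    if roman is not None:
        return roman          # leave suspected numerals untouched
    return other.lower()      # convert everything else

def lower_without_lookahead(text):
    return EITHER.sub(replace_word, text)

print(lower_without_lookahead("WE, the PEOPLE. ARTICLE VI."))
# we, the people. article VI.
```

The Roman numeral alternative has to come first, so that a word made only of those letters is captured in group 2 before the general alternative gets a chance to match it.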