So I’m working with a huge dataset in Java trying to scrub the text of everything but alpha characters. Right now I’m doing this with:
snippet = snippet.toLowerCase();
snippet.replaceAll("[^A-Za-z]", "");
However, the sanitization is not going as planned. Some extraneous @, #, ?, and : characters are making their way through. Ideas?
In Java, Strings are immutable: their value can’t be changed. Consequently,
replaceAll() returns the altered String; it doesn’t change the String on which it was called. You must assign the return value back to the variable:
snippet = snippet.replaceAll("[^A-Za-z]", "");
Although this behaviour at first seems “non Object Oriented”, when the class is immutable it does make sense.
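To see the difference side by side, here is a minimal sketch. The class name, method names, and sample input are made up for illustration; only the replaceAll() call and the regex come from the question.

```java
public class ImmutableStringDemo {

    // Mirrors the question: replaceAll() is called but its result is discarded.
    static String scrubDiscardingResult(String snippet) {
        snippet.replaceAll("[^A-Za-z]", ""); // return value is thrown away
        return snippet;                      // still the original, unscrubbed text
    }

    // Assigns the result back, so the scrubbed text is what gets returned.
    static String scrubKeepingResult(String snippet) {
        snippet = snippet.replaceAll("[^A-Za-z]", "");
        return snippet;
    }

    public static void main(String[] args) {
        String input = "Hi @you: ok?#1"; // hypothetical sample input
        System.out.println(scrubDiscardingResult(input)); // prints "Hi @you: ok?#1"
        System.out.println(scrubKeepingResult(input));    // prints "Hiyouok"
    }
}
```

The first method compiles without complaint, which is why this bug is easy to miss: the compiler does not warn when a returned String is ignored.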
Also, you don’t need the call to
.toLowerCase() for the regex to work: your regex already matches uppercase letters too. Keep it only if you want the output itself lowercased.
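A quick sketch of that last point, again with hypothetical class and method names. The Locale.ROOT argument is my own addition (the question used the no-argument toLowerCase()); it just keeps the lowercasing independent of the machine's default locale.

```java
import java.util.Locale;

public class LowerCaseDemo {

    // The character class covers both cases, so no lowercasing is needed
    // for the scrub itself; the original casing is preserved.
    static String lettersOnly(String snippet) {
        return snippet.replaceAll("[^A-Za-z]", "");
    }

    // Lowercase first only if the output should be lowercase;
    // the regex can then be narrowed to [^a-z].
    static String lettersOnlyLowerCased(String snippet) {
        return snippet.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("Foo#Bar?"));           // prints "FooBar"
        System.out.println(lettersOnlyLowerCased("Foo#Bar?")); // prints "foobar"
    }
}
```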