Consider three regular expressions designed to remove non-Latin characters from a string.
String x = "some†¥¥¶¶ˆ˚˚word";
long now = System.nanoTime();
System.out.println(x.replaceAll("[^a-zA-Z]", "")); // 5ms
System.out.println(System.nanoTime() - now);
now = System.nanoTime();
System.out.println(x.replaceAll("[^a-zA-Z]+", "")); // 2ms
System.out.println(System.nanoTime() - now);
now = System.nanoTime();
System.out.println(x.replaceAll("[^a-zA-Z]*", "")); // <1ms
System.out.println(System.nanoTime() - now);
All three produce the same result, with vastly different performance.
Why is that?
The first one is slower because the regex matches each non-Latin character individually, so replaceAll operates on each character individually. The other patterns match the whole sequence of non-Latin characters, so replaceAll can replace the whole sequence in one go. I can’t explain the performance difference between these two, though. It probably has something to do with the difference in how the regex engine handles * and +.
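To make the "one match per character" versus "one match per sequence" point concrete, here is a minimal sketch that counts how many matches each pattern finds in the question's string. The class name RegexMatchCount and the helper countMatches are made up for this illustration, and the string is written with Unicode escapes only so that the file encoding does not matter. It counts matches and does not measure time.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatchCount {
    // Same text as the string in the question, spelled with Unicode escapes
    // so the result does not depend on the source file's encoding.
    static final String INPUT =
            "some\u2020\u00A5\u00A5\u00B6\u00B6\u02C6\u02DA\u02DA" + "word";

    // Hypothetical helper: counts the separate matches the regex finds.
    // replaceAll performs one replacement per match found this way.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] patterns = {"[^a-zA-Z]", "[^a-zA-Z]+", "[^a-zA-Z]*"};
        for (String regex : patterns) {
            System.out.println(regex + " -> " + countMatches(regex, INPUT)
                    + " matches, result: " + INPUT.replaceAll(regex, ""));
        }
    }
}
```

Note that the * pattern also matches the empty string at every position where a letter follows, so it reports more matches than the + pattern, even though each of those extra matches replaces nothing. Match count alone therefore would not account for * being the fastest of the three in the question's timings.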