I need to remove all extra spaces in a string.
I use regexes to match strings, and I replace the matched strings with other text.
For better understanding, please see the examples below:
3 input strings:
Hello, how are you?
Hello , how are you?
Hello , how are you ?
These are 3 strings that should all be matched by a single regex pattern.
It looks something like this:
Hello\s*,\s+how\s+are\s+you\s*\?
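A quick way to check a pattern like this against the three inputs, as a minimal sketch. The class and method names are made up for illustration, and the final question mark is written as \? here so that it matches a literal "?" instead of acting as a quantifier modifier:

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // The pattern from the question, with the trailing '?' escaped
    // so it matches a literal question mark.
    public const string Pattern = @"Hello\s*,\s+how\s+are\s+you\s*\?";

    // Returns true when the input contains a match for the pattern.
    public static bool Matches(string input) => Regex.IsMatch(input, Pattern);

    public static void Main()
    {
        string[] inputs =
        {
            "Hello, how are you?",
            "Hello , how are you?",
            "Hello ,  how   are you ?",
        };

        foreach (string input in inputs)
            Console.WriteLine($"{Matches(input)}: {input}");
    }
}
```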
It works fine, but there is a performance problem.
If I have a lot of patterns (~20k) and try to execute each of them, it runs very slowly (3-5 minutes).
Maybe there is a better way of doing this?
For example, using some third-party libraries?
UPD: Folks, this question is not about how to do this. It’s about how to do this with the best performance. 🙂
Let me explain in more detail. The main goal is to tokenize text (replace some tokens with special symbols).
For example I have a token “nice try”.
Then I input text “this is nice try”.
result: “this is @tokenizedtext@”, where @tokenizedtext@ stands for some special symbols. Which ones doesn’t matter in this case.
Next I have the string “Mike said it was a nice try”.
result should be “Mike said it was a @tokenizedtext@”.
I think the main idea is clear.
So I can have a lot of tokens. When I process them, I convert each token from “nice try” to the pattern “nice\s+try” and try to replace matches of this pattern in the input text.
It works fine. But if the tokens contain more spaces and also punctuation, my regexes become bigger and work very slowly.
Do you have any suggestions (technical or logical) for solving this problem?
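For reference, the token-to-pattern conversion described above could be sketched roughly like this. The names Tokenizer, ToPattern and Tokenize are hypothetical, and this simple version only handles whitespace between words, not the optional spaces around punctuation:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    // Turns a token such as "nice try" into the pattern "nice\s+try":
    // the token is split on whitespace, each word is regex-escaped,
    // and the words are joined with \s+.
    public static string ToPattern(string token)
    {
        string[] words = token.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(@"\s+", words.Select(Regex.Escape));
    }

    // Replaces every occurrence of the token in the text with the placeholder.
    public static string Tokenize(string text, string token, string placeholder)
    {
        return Regex.Replace(text, ToPattern(token), placeholder);
    }

    public static void Main()
    {
        Console.WriteLine(Tokenize("Mike said it was a nice   try", "nice try", "@tokenizedtext@"));
    }
}
```

With ~20k tokens, a loop over Tokenize runs one full pass over the text per token, which is where the time goes.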
I can suggest a few solutions.
First of all, avoid the static Regex methods. Create an instance of it (and store it, don’t call the constructor for each replacement!) and, if possible, use RegexOptions.Compiled. It should improve your performance.
Second, you can try to review your pattern. I’ll do some profiling, but I’m currently undecided between:
With replacement being an empty string or:
With a space as a replacement. You can try this code, in the meantime:
EDIT: After doing some measurements, the second pattern seems to be faster. I’m editing my sample to adapt it.
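The first suggestion (build the Regex once, keep it, and use RegexOptions.Compiled) might look something like the sketch below. The pattern \s+ with a single space as the replacement is an assumption chosen for illustration, not necessarily either of the patterns being compared, and the class name is made up:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceCollapser
{
    // Constructed once and reused for every call, instead of going through
    // the static Regex.Replace method or calling the constructor each time.
    // Assumption: \s+ is an illustrative pattern, not the answer's original one.
    private static readonly Regex Whitespace = new Regex(@"\s+", RegexOptions.Compiled);

    // Replaces each run of whitespace with a single space.
    public static string Collapse(string input) => Whitespace.Replace(input, " ");

    public static void Main()
    {
        Console.WriteLine(Collapse("Hello ,   how  are   you ?"));
    }
}
```

The same idea applies to a set of token patterns: create each Regex up front, store the instances in a list, and reuse them for every input text.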
EDIT 2: I’ve written an unsafe method. It’s much faster than the other ones presented here, including the Regex ones, but, as the word itself says, it’s unsafe. I don’t think that there’s any problem with the code I’ve written, but I may be wrong, so please check it again and again in case there’s a bug in the method.
Usage (compile with /unsafe):
Profiling done in a Release build, optimizations on, 1000000 iterations:
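Separately from the measurements, the idea of a hand-written, non-Regex routine can also be tried without unsafe code. The following is a minimal sketch of that idea using a plain char buffer. It is not the unsafe method from EDIT 2, the name ManualCollapser is hypothetical, and it has not been profiled against the other variants:

```csharp
using System;

public static class ManualCollapser
{
    // Single pass over the input: every run of whitespace is written
    // to the buffer as exactly one space, all other characters are copied.
    public static string Collapse(string input)
    {
        char[] buffer = new char[input.Length];
        int length = 0;
        bool previousWasSpace = false;

        foreach (char c in input)
        {
            if (char.IsWhiteSpace(c))
            {
                if (!previousWasSpace)
                    buffer[length++] = ' ';
                previousWasSpace = true;
            }
            else
            {
                buffer[length++] = c;
                previousWasSpace = false;
            }
        }

        return new string(buffer, 0, length);
    }

    public static void Main()
    {
        Console.WriteLine(Collapse("Hello ,   how  are   you ?"));
    }
}
```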