Hi, I would like to understand why the following C++ code, which splits a string using a regex,
#include <regex>
#include <string>
#include <vector>

std::vector<std::string> split(const std::string &s) {
    static const std::regex rsplit(" +");
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);
    auto rend = std::sregex_token_iterator();
    auto res = std::vector<std::string>(rit, rend);
    return res;
}

int main() {
    for (auto i = 0; i < 10000; ++i)
        split("a b c");  // split() takes a single argument
    return 0;
}
is slower than the following Python code:
import re
for i in range(10000):
    re.split(' +', 'a b c')
Here are the timings:
> python test.py 0.05s user 0.01s system 94% cpu 0.070 total
./test 0.26s user 0.00s system 99% cpu 0.296 total
I’m using clang++ on OS X.
Compiling with -O3 brings it down to 0.09s user 0.00s system 99% cpu 0.109 total
Notice
See also this answer: https://stackoverflow.com/a/21708215 which was the base for the EDIT 2 at the bottom here.
I’ve increased the loop count to 1000000 to get a better timing measure.
This is my Python timing:
Here’s an equivalent of your code, just a bit prettier:
Timing:
This is an optimization to avoid construction/allocation of vector and string objects:
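A minimal sketch of that idea, assuming an output-parameter signature (the names and the loop are illustrative, not necessarily the exact listing): the caller passes in the vector, so it is allocated once and reused on every call.

```cpp
#include <regex>
#include <string>
#include <vector>

// Sketch: the caller owns the vector, so its buffer is reused across calls.
void split(const std::string &s, std::vector<std::string> &v)
{
    static const std::regex rsplit(" +");
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);
    auto rend = std::sregex_token_iterator();
    v.clear();                    // drops the elements, keeps the capacity
    for (; rit != rend; ++rit)
        v.push_back(rit->str());  // copy each token into the reused vector
}
```

With this shape, the benchmark loop would declare one std::vector<std::string> v before the loop and call split("a b c", v) inside it.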
Timing:
This is near a 100% performance improvement.
The vector is created before the loop, and can grow its memory in the first iteration. Afterwards there’s no memory deallocation by clear(); the vector retains its memory and constructs the strings in place.
Another performance increase would be to avoid constructing and destroying std::string completely, and hence the allocation/deallocation of its objects.
This is an attempt in this direction:
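As a sketch only, assuming the input arrives as a null-terminated C string and the output is still a caller-owned vector, the function could use std::cregex_token_iterator so that no temporary std::string is built for the argument:

```cpp
#include <cstring>
#include <regex>
#include <string>
#include <vector>

// Sketch: takes a C string, so the call site does not construct a std::string.
void split(const char *s, std::vector<std::string> &v)
{
    static const std::regex rsplit(" +");
    auto rit = std::cregex_token_iterator(s, s + std::strlen(s), rsplit, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();                    // keeps the capacity from earlier calls
    for (; rit != rend; ++rit)
        v.push_back(rit->str());  // tokens are still copied into strings
}
```

The tokens themselves are still stored as std::string, so this only removes the allocation for the input argument.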
Timing:
An ultimate improvement would be to have a std::vector of const char * as return, where each char pointer would point to a substring inside the original s C string itself. The problem is that you can’t do that because each of them would not be null terminated (for this, see the usage of the C++1y string_ref in a later sample).
This last improvement could also be achieved with this:
I’ve built the samples with clang 3.3 (from trunk) with -O3. Maybe other regex libraries are able to perform better, but in any case, allocations/deallocations are frequently a performance hit.
Boost.Regex
This is the boost::regex timing for the C string arguments sample:
Same code; the boost::regex and std::regex interfaces in this sample are identical, I just needed to change the namespace and the include.
Best wishes for it to get better over time; C++ stdlib regex implementations are in their infancy.
EDIT
For the sake of completeness, I’ve tried this (the above-mentioned "ultimate improvement" suggestion) and it didn’t improve the performance of the equivalent std::vector<std::string> &v version at all:
This has to do with the array_ref and string_ref proposal. Here’s a sample code using it:
It will also be cheaper to return a vector of string_ref rather than string copies for the case of split with vector return.
EDIT 2
This new solution is able to get output by return. I have used Marshall Clow’s string_view (string_ref got renamed) libc++ implementation found at https://github.com/mclow/string_view.
Timing:
Note how much faster this is compared to the previous results. Of course, it’s not filling a vector inside the loop (nor, probably, matching anything in advance), but you get a range anyway, which you can iterate over with a range-based for, or even use to fill a vector.
As ranging over the iterator_range creates string_views over an original string (or a null-terminated string), this gets very lightweight, never generating unnecessary string allocations.
Just to compare, using this split implementation but actually filling a vector, we could do this:
This uses the Boost range copy algorithm to fill the vector in each iteration; the timing is:
As can be seen, there is not much difference in comparison with the optimized string_view output-param version.
Note also there’s a proposal for a std::split that would work like this.
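For readers on a newer toolchain, the string_view idea can be sketched with C++17’s std::string_view in place of the standalone implementation mentioned above. This is a simplified variant, not the iterator_range version: it eagerly fills a vector of views, and it assumes the viewed string outlives the result.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <string_view>
#include <vector>

// Sketch (C++17): each token is a view into the caller's buffer, so no
// per-token std::string is allocated. The views dangle if the buffer dies.
std::vector<std::string_view> split(std::string_view s)
{
    static const std::regex rsplit(" +");
    std::vector<std::string_view> out;
    const char *first = s.data();
    const char *last = s.data() + s.size();
    std::cregex_token_iterator rit(first, last, rsplit, -1), rend;
    for (; rit != rend; ++rit)
        out.emplace_back(rit->first, static_cast<std::size_t>(rit->length()));
    return out;
}
```

The only allocation left per call is the vector of views itself, which could again be avoided by passing the vector in as an output parameter.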