Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 9139537
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 17, 20262026-06-17T09:24:02+00:00 2026-06-17T09:24:02+00:00

hi i would like to understand why the following code which does a split

  • 0

hi i would like to understand why the following code which does a split string split using regex

#include<regex>
#include<vector>
#include<string>

std::vector<std::string> split(const std::string &s){
    static const std::regex rsplit(" +");
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);
    auto rend = std::sregex_token_iterator();
    auto res = std::vector<std::string>(rit, rend);
    return res;
}

int main(){
    for(auto i=0; i< 10000; ++i)
       split("a b c", " ");
    return 0;
}

is slower then the following python code

import re
for i in range(10000):
    re.split(' +', 'a b c')

here’s

> python test.py  0.05s user 0.01s system 94% cpu 0.070 total
./test  0.26s user 0.00s system 99% cpu 0.296 total

Im using clang++ on osx.

compiling with -O3 brings it to down to 0.09s user 0.00s system 99% cpu 0.109 total

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-17T09:24:03+00:00Added an answer on June 17, 2026 at 9:24 am

    Notice

    See also this answer: https://stackoverflow.com/a/21708215 which was the base for the EDIT 2 at the bottom here.


    I’ve augmented the loop to 1000000 to get a better timing measure.

    This is my Python timing:

    real    0m2.038s
    user    0m2.009s
    sys     0m0.024s
    

    Here’s an equivalent of your code, just a bit prettier:

    #include <regex>
    #include <vector>
    #include <string>
    
    std::vector<std::string> split(const std::string &s, const std::regex &r)
    {
        return {
            std::sregex_token_iterator(s.begin(), s.end(), r, -1),
            std::sregex_token_iterator()
        };
    }
    
    int main()
    {
        const std::regex r(" +");
        for(auto i=0; i < 1000000; ++i)
           split("a b c", r);
        return 0;
    }
    

    Timing:

    real    0m5.786s
    user    0m5.779s
    sys     0m0.005s
    

    This is an optimization to avoid construction/allocation of vector and string objects:

    #include <regex>
    #include <vector>
    #include <string>
    
    void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
    {
        auto rit = std::sregex_token_iterator(s.begin(), s.end(), r, -1);
        auto rend = std::sregex_token_iterator();
        v.clear();
        while(rit != rend)
        {
            v.push_back(*rit);
            ++rit;
        }
    }
    
    int main()
    {
        const std::regex r(" +");
        std::vector<std::string> v;
        for(auto i=0; i < 1000000; ++i)
           split("a b c", r, v);
        return 0;
    }
    

    Timing:

    real    0m3.034s
    user    0m3.029s
    sys     0m0.004s
    

    This is near a 100% performance improvement.

    The vector is created before the loop, and can grow its memory in the first iteration. Afterwards there’s no memory deallocation by clear(), the vector maintains the memory and construct strings in-place.


    Another performance increase would be to avoid construction/destruction std::string completely, and hence, allocation/deallocation of its objects.

    This is a tentative in this direction:

    #include <regex>
    #include <vector>
    #include <string>
    
    void split(const char *s, const std::regex &r, std::vector<std::string> &v)
    {
        auto rit = std::cregex_token_iterator(s, s + std::strlen(s), r, -1);
        auto rend = std::cregex_token_iterator();
        v.clear();
        while(rit != rend)
        {
            v.push_back(*rit);
            ++rit;
        }
    }
    

    Timing:

    real    0m2.509s
    user    0m2.503s
    sys     0m0.004s
    

    An ultimate improvement would be to have a std::vector of const char * as return, where each char pointer would point to a substring inside the original s c string itself. The problem is that, you can’t do that because each of them would not be null terminated (for this, see usage of C++1y string_ref in a later sample).


    This last improvement could also be achieved with this:

    #include <regex>
    #include <vector>
    #include <string>
    
    void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
    {
        auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
        auto rend = std::cregex_token_iterator();
        v.clear();
        while(rit != rend)
        {
            v.push_back(*rit);
            ++rit;
        }
    }
    
    int main()
    {
        const std::regex r(" +");
        std::vector<std::string> v;
        for(auto i=0; i < 1000000; ++i)
           split("a b c", r, v); // the constant string("a b c") should be optimized
                                 // by the compiler. I got the same performance as
                                 // if it was an object outside the loop
        return 0;
    }
    

    I’ve built the samples with clang 3.3 (from trunk) with -O3. Maybe other regex libraries are able to perform better, but in any case, allocations/deallocations are frequently a performance hit.


    Boost.Regex

    This is the boost::regex timing for the c string arguments sample:

    real    0m1.284s
    user    0m1.278s
    sys     0m0.005s
    

    Same code, boost::regex and std::regex interface in this sample are identical, just needed to change the namespace and include.

    Best wishes for it to get better over time, C++ stdlib regex implementations are in their infancy.

    EDIT

    For sake of completion, I’ve tried this (the above mentioned "ultimate improvement" suggestion) and it didn’t improved performance of the equivalent std::vector<std::string> &v version in anything:

    #include <regex>
    #include <vector>
    #include <string>
    
    template<typename Iterator> class intrusive_substring
    {
    private:
        Iterator begin_, end_;
    
    public:
        intrusive_substring(Iterator begin, Iterator end) : begin_(begin), end_(end) {}
    
        Iterator begin() {return begin_;}
        Iterator end() {return end_;}
    };
    
    using intrusive_char_substring = intrusive_substring<const char *>;
    
    void split(const std::string &s, const std::regex &r, std::vector<intrusive_char_substring> &v)
    {
        auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
        auto rend = std::cregex_token_iterator();
        v.clear(); // This can potentially be optimized away by the compiler because
                   // the intrusive_char_substring destructor does nothing, so
                   // resetting the internal size is the only thing to be done.
                   // Formerly allocated memory is maintained.
        while(rit != rend)
        {
            v.emplace_back(rit->first, rit->second);
            ++rit;
        }
    }
    
    int main()
    {
        const std::regex r(" +");
        std::vector<intrusive_char_substring> v;
        for(auto i=0; i < 1000000; ++i)
           split("a b c", r, v);
    
        return 0;
    }
    

    This has to do with the array_ref and string_ref proposal. Here’s a sample code using it:

    #include <regex>
    #include <vector>
    #include <string>
    #include <string_ref>
    
    void split(const std::string &s, const std::regex &r, std::vector<std::string_ref> &v)
    {
        auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
        auto rend = std::cregex_token_iterator();
        v.clear();
        while(rit != rend)
        {
            v.emplace_back(rit->first, rit->length());
            ++rit;
        }
    }
    
    int main()
    {
        const std::regex r(" +");
        std::vector<std::string_ref> v;
        for(auto i=0; i < 1000000; ++i)
           split("a b c", r, v);
    
        return 0;
    }
    

    It will also be cheaper to return a vector of string_ref rather than string copies for the case of split with vector return.

    EDIT 2

    This new solution is able to get output by return. I have used Marshall Clow’s string_view (string_ref got renamed) libc++ implementation found at https://github.com/mclow/string_view.

    #include <string>
    #include <string_view>
    #include <boost/regex.hpp>
    #include <boost/range/iterator_range.hpp>
    #include <boost/iterator/transform_iterator.hpp>
    
    using namespace std;
    using namespace std::experimental;
    using namespace boost;
    
    string_view stringfier(const cregex_token_iterator::value_type &match) {
        return {match.first, static_cast<size_t>(match.length())};
    }
    
    using string_view_iterator =
        transform_iterator<decltype(&stringfier), cregex_token_iterator>;
    
    iterator_range<string_view_iterator> split(string_view s, const regex &r) {
        return {
            string_view_iterator(
                cregex_token_iterator(s.begin(), s.end(), r, -1),
                stringfier
            ),
            string_view_iterator()
        };
    }
    
    int main() {
        const regex r(" +");
        for (size_t i = 0; i < 1000000; ++i) {
            split("a b c", r);
        }
    }
    

    Timing:

    real    0m0.385s
    user    0m0.385s
    sys     0m0.000s
    

    Note how faster this is compared to previous results. Of course, it’s not filling a vector inside the loop (nor matching anything in advance probably too), but you get a range anyway, which you can range over with range-based for, or even use it to fill a vector.

    As ranging over the iterator_range creates string_views over an original string(or a null terminated string), this gets very lightweight, never generating unnecessary string allocations.

    Just to compare using this split implementation but actually filling a vector we could do this:

    int main() {
        const regex r(" +");
        vector<string_view> v;
        v.reserve(10);
        for (size_t i = 0; i < 1000000; ++i) {
            copy(split("a b c", r), back_inserter(v));
            v.clear();
        }
    }
    

    This uses boost range copy algorithm to fill the vector in each iteration, the timing is:

    real    0m1.002s
    user    0m0.997s
    sys     0m0.004s
    

    As can be seen, no much difference in comparison with the optimized string_view output param version.

    Note also there’s a proposal for a std::split that would work like this.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

the following code is not behaving like I would expect. Please help me understand
I'm trying to understand the following valid Java code. I have two questions, which
The following code #include <cassert> #include <cstddef> template <typename T> struct foo { foo(std::nullptr_t)
I would like to understand what Implement Interface Explicitly entails in C# based on
I would like to understand whats the best way to handle exceptions in Multicast
I would like to understand how Etherpad's timeline feature work. If you don't know
I would like to understand that what is Mesh Object and its conection with
I would like to understand what class << self stands for in the next
I would like to understand how object deletion works on python. Here is a
I am new to silver light and would like to understand a bit more

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.