Editorial Team
Asked: May 30, 2026


I’m trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin was much faster in Python than C++ (see this). Today, I finally figured out how to split a string in C++ with merging delimiters (similar semantics to Python’s split()), and am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude longer, as was the case in yesterday’s lesson).

Python Code:

#!/usr/bin/env python
from __future__ import print_function                                            
import time
import sys

count = 0
start_time = time.time()
dummy = None

for line in sys.stdin:
    dummy = line.split()
    count += 1

delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
    lps = int(count/delta_sec)
    print("  Crunch Speed: {0}".format(lps))
else:
    print('')

C++ Code:

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

void split1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find the first delimiter after that (the end of the first token)
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the vector
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the next delimiter (the end of the next token)
        pos = str.find_first_of(delimiters, lastPos);
    }
}

void split2(vector<string> &tokens, const string &str, char delim=' ') {
    stringstream ss(str); //convert string to stream
    string item;
    while(getline(ss, item, delim)) {
        tokens.push_back(item); //add token to vector
    }
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        split2(spline, input_line);

        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp

Note that I tried two different split implementations. One (split1) uses string methods to search for tokens and is able to merge consecutive delimiters as well as handle multiple delimiter characters (it comes from here). The second (split2) uses getline to read the string as a stream, doesn’t merge delimiters, and only supports a single delimiter character (that one was posted by several StackOverflow users in answers to string splitting questions).
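To make the semantic difference concrete, here is a minimal sketch (not part of the original post) with condensed copies of the two functions that return their tokens by value. The names `split_merge` and `split_getline` are hypothetical stand-ins for split1 and split2.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Condensed split1 (hypothetical name): runs of delimiters are merged,
// so consecutive spaces never produce empty tokens.
std::vector<std::string> split_merge(const std::string &str,
                                     const std::string &delimiters = " ") {
    std::vector<std::string> tokens;
    std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    std::string::size_type pos = str.find_first_of(delimiters, lastPos);
    while (std::string::npos != pos || std::string::npos != lastPos) {
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        lastPos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, lastPos);
    }
    return tokens;
}

// Condensed split2 (hypothetical name): every single delimiter ends a token,
// so consecutive spaces produce empty tokens.
std::vector<std::string> split_getline(const std::string &str, char delim = ' ') {
    std::vector<std::string> tokens;
    std::stringstream ss(str);
    std::string item;
    while (std::getline(ss, item, delim)) {
        tokens.push_back(item);
    }
    return tokens;
}
```

For the input "a  b" (two spaces), `split_merge` yields two tokens while `split_getline` yields three, the middle one being empty. On the single-space test file used here both produce the same three tokens per line, so the timings remain comparable.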

I ran this multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, Quad Core), not that it matters much. I’m testing with a 20M line text file with three space-separated columns that each look similar to this: “foo.bar 127.0.0.1 home.foo.bar”

Results:

$ /usr/bin/time cat test_lines_double | ./split.py
       15.61 real         0.01 user         0.38 sys
Python: Saw 20000000 lines in 15 seconds.   Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
       23.50 real         0.01 user         0.46 sys
C++   : Saw 20000000 lines in 23 seconds.  Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
       44.69 real         0.02 user         0.62 sys
C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444

What am I doing wrong? Is there a better way to do string splitting in C++ that does not rely on external libraries (i.e. no boost), supports merging sequences of delimiters (like Python’s split), is thread safe (so no strtok), and whose performance is at least on par with Python?

Edit 1 / Partial Solution?:

I tried to make the comparison fairer by having Python reset the dummy list and append to it each time, as C++ does. This still isn’t exactly what the C++ code is doing, but it’s a bit closer. Basically, the loop is now:

for line in sys.stdin:
    dummy = []
    dummy += line.split()
    count += 1

The performance of Python is now about the same as the split1 C++ implementation.

$ /usr/bin/time cat test_lines_double | ./split5.py
       22.61 real         0.01 user         0.40 sys
Python: Saw 20000000 lines in 22 seconds.   Crunch Speed: 909090

I am still surprised that, even if Python is so optimized for string processing (as Matt Joiner suggested), these C++ implementations are not faster. If anyone has ideas about how to do this in a more optimal way using C++, please share your code. (I think my next step will be trying to implement this in pure C, although I’m not going to trade off programmer productivity to re-implement my overall project in C, so this will just be an experiment for string splitting speed.)

Thanks to all for your help.

Final Edit/Solution:

Please see Alf’s accepted answer. Since Python deals with strings strictly by reference and STL strings are often copied, performance is better with vanilla Python implementations. For comparison, I compiled and ran my data through Alf’s code, and here is the performance on the same machine as all the other runs. It is essentially identical to the naive Python implementation (though faster than the Python implementation that resets/appends the list, as shown in the above edit):

$ /usr/bin/time cat test_lines_double | ./split6
       15.09 real         0.01 user         0.45 sys
C++   : Saw 20000000 lines in 15 seconds.  Crunch speed: 1333333

My only small remaining gripe is regarding the amount of code necessary to get C++ to perform in this case.

One of the lessons from this issue and yesterday’s stdin line reading issue (linked above) is that one should always benchmark instead of making naive assumptions about languages’ relative “default” performance. I appreciate the education.
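The programs above time themselves with time(), which only has one-second resolution. As a minimal sketch of a finer-grained measurement, assuming a C++11 compiler, the hypothetical helper `time_seconds` below (not from the original post) uses std::chrono::steady_clock and returns fractional seconds.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in seconds as a double, so sub-second differences are not truncated.
template <typename Work>
double time_seconds(Work &&work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping the read-and-split loop in a lambda and passing it to `time_seconds` would give lines per second without the rounding that integer seconds introduce.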

Thanks again to all for your suggestions!

1 Answer

  1. Editorial Team
    Added an answer on May 30, 2026 at 9:31 am

    As a guess, Python strings are reference counted immutable strings, so that no strings are copied around in the Python code, while C++ std::string is a mutable value type, and is copied at the smallest opportunity.

    If the goal is fast splitting, then one would use constant time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).
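    As an illustration of the same idea with a standard type, here is a minimal sketch that assumes a C++17 compiler (newer than the -std=c++0x used in this answer) and uses std::string_view as the non-owning reference. The function name `split_views` and its default delimiter set are assumptions made for this example.

    ```cpp
    #include <cstddef>
    #include <string>
    #include <string_view>
    #include <vector>

    // Hypothetical helper: splits `text` on any character in `delims`, merging
    // runs of delimiters. Each token is a view into `text`; no characters are
    // copied. It uses no shared state, so concurrent calls do not interfere.
    std::vector<std::string_view> split_views(std::string_view text,
                                              std::string_view delims = " \t\r\n") {
        std::vector<std::string_view> tokens;
        std::size_t start = text.find_first_not_of(delims);
        while (start != std::string_view::npos) {
            std::size_t end = text.find_first_of(delims, start);
            if (end == std::string_view::npos) {
                end = text.size();  // last token runs to the end of the text
            }
            tokens.push_back(text.substr(start, end - start));
            start = text.find_first_not_of(delims, end);
        }
        return tokens;
    }
    ```

    The views are only valid while the underlying std::string is alive and unmodified, so in a getline loop the tokens must be consumed (or copied) before the next line is read into the same buffer.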

    The C++ std::string class has one redeeming feature, though: it is standard, so that it can be used to pass strings safely and portably around where efficiency is not a main consideration. But enough chat. Code — and on my machine this is of course faster than Python, since Python’s string handling is implemented in C which is a subset of C++ (he he):

    #include <iostream>                                                              
    #include <string>
    #include <sstream>
    #include <time.h>
    #include <vector>
    
    using namespace std;
    
    class StringRef
    {
    private:
        char const*     begin_;
        int             size_;
    
    public:
        int size() const { return size_; }
        char const* begin() const { return begin_; }
        char const* end() const { return begin_ + size_; }
    
        StringRef( char const* const begin, int const size )
            : begin_( begin )
            , size_( size )
        {}
    };
    
    vector<StringRef> split3( string const& str, char delimiter = ' ' )
    {
        vector<StringRef>   result;
    
        enum State { inSpace, inToken };
    
        State state = inSpace;
        char const*     pTokenBegin = 0;    // Init to satisfy compiler.
        for( auto it = str.begin(); it != str.end(); ++it )
        {
            State const newState = (*it == delimiter? inSpace : inToken);
            if( newState != state )
            {
                switch( newState )
                {
                case inSpace:
                    result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
                    break;
                case inToken:
                    pTokenBegin = &*it;
                }
            }
            state = newState;
        }
        if( state == inToken )
        {
            // Use data() + size() for the end pointer; dereferencing str.end() is not valid.
            result.push_back( StringRef( pTokenBegin, str.data() + str.size() - pTokenBegin ) );
        }
        return result;
    }
    
    int main() {
        string input_line;
        vector<string> spline;
        long count = 0;
        int sec, lps;
        time_t start = time(NULL);
    
        cin.sync_with_stdio(false); //disable synchronous IO
    
        while(cin) {
            getline(cin, input_line);
            //spline.clear(); //empty the vector for the next line to parse
    
            //I'm trying one of the two implementations, per compilation, obviously:
    //        split1(spline, input_line);  
            //split2(spline, input_line);
    
            vector<StringRef> const v = split3( input_line );
            count++;
        };
    
        count--; //subtract for final over-read
        sec = (int) time(NULL) - start;
        cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
        if (sec > 0) {
            lps = count / sec;
            cerr << "  Crunch speed: " << lps << endl;
        } else
            cerr << endl;
        return 0;
    }
    
    //compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x
    

    Disclaimer: I hope there aren’t any bugs. I haven’t tested the functionality, but only checked the speed. But I think, even if there is a bug or two, correcting that won’t significantly affect the speed.
