Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7623001
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 31, 20262026-05-31T04:35:45+00:00 2026-05-31T04:35:45+00:00

My current Python project will require a lot of string splitting to process incoming

  • 0

My current Python project will require a lot of string splitting to process incoming packages. Since I will be running it on a pretty slow system, I was wondering what the most efficient way to go about this would be. The strings would be formatted something like this:

Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5

Explanation: This particular example would come from a list where the first two items are a title and a date, while item 3 to item 5 would be invited people (the number of those can be anything from zero to n, where n is the number of registered users on the server).

From what I see, I have the following options:

  1. repeatedly use split()
  2. Use a regular expression and regular expression functions
  3. Some other Python functions I have not thought about yet (there are probably some)

Solution 1 would include splitting at | and then splitting the last element of the resulting list at <> for this example, while solution 2 would probably result in a regular expression like:

((.+)|)+((.+)(<>)?)+

Okay, this regular expression is horrible, I can see that myself. It is also untested. But you get the idea.

Now, I am looking for the way that a) takes the least amount of time and b) ideally uses the least amount of memory. If only one of the two is possible, I would prefer less time. The ideal solution would also work for strings that have more items separated with | and strings that completely lack the <>. At least the regular expression-based solution would do that.

My understanding would be that split() would use more memory (since you basically get two resulting lists, one that splits at | and the second one that splits at <>), but I don’t know enough about Python’s implementation of regular expressions to judge how the regular expression would perform. split() is also less dynamic than a regular expression if it comes to different numbers of items and the absence of the second separator. Still, I can’t shake the impression that Python can do this better without regular expressions, and that’s why I am asking.

Some notes:

  • Yes, I could just benchmark both solutions, but I’m trying to learn something about Python in general and how it works here, and if I just benchmark these two, I still don’t know what Python functions I have missed.
  • Yes, optimizing at this level is only really required for high-performance stuff, but as I said, I am trying to learn things about Python.
  • Addition: in the original question, I completely forgot to mention that I need to be able to distinguish the parts that were separated by | from the parts with the separator <>, so a simple flat list as generated by re.split(\||<>,input) (as proposed by obmarg) will not work too well. Solutions fitting this criterium are much appreciated.

To sum the question up: Which solution would be the most efficient one, for what reasons?

Due to multiple requests, I have run some timeit on the split()-solution and the first proposed regular expression by obmarg, as well as the solutions by mgibsonbr and duncan:

import timeit
import re

def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def regexit(input):
    return re.split( "\||<>", input )


def mgibsonbr(input): # Solution by mgibsonbr
    items = re.split(r'\||<>', input) # Split input in items
    offset = 0
    result = [] # The result: strings for regular items, lists for <> separated ones
    acc = None
    for i in items:
        delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>'
        offset += len(i) + len(delimiter)
        if delimiter == '<>': # Will always put the item in a list
            if acc is None:
                acc = [i] # Create one if doesn't exist
                result.append(acc)
            else:
                acc.append(i)
        else:
            if acc is not None: # If there was a list, put the last item in it
                acc.append(i)
            else:
                result.append(i) # Add the regular items
            acc = None # Clear the list, since what will come next is a regular item or a new list
    return result

def split2(input): # Solution by duncan
    res0 = input.split("|")
    res1, res2 = [], []
    for r in res0:
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2

print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "split:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()

The results:

mgibs: 14.7349407408
split: 6.403942732
split2: 3.68306812233
regex: 5.28414318792
mgibs: 107.046683735
split: 46.0844590775
split2: 26.5595985591
regex: 28.6513302646

At the moment, it looks like split2 by duncan beats all other algorithms, regardless of length (with this limited dataset at least), and it also looks like mgibsonbr’s solution has some performance issues (sorry about that, but thanks for the solution regardless).

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-31T04:35:46+00:00Added an answer on May 31, 2026 at 4:35 am

    I was slightly surprised that split() performed so badly in your code, so I looked at it a bit more closely and noticed that you’re calling list.remove() in the inner loop. Also you’re calling split() an extra time on each string. Get rid of those and a solution using split() beats the regex hands down on shorter strings and comes a pretty close second on the longer one.

    import timeit
    import re
    
    def splitit(input):
        res0 = input.split("|")
        res = []
        for element in res0:
            t = element.split("<>")
            if t != [element]:
                res0.remove(element)
                res.append(t)
        return (res0, res)
    
    def split2(input):
        res0 = input.split("|")
        res1, res2 = [], []
        for r in res0:
            if "<>" in r:
                res2.append(r.split("<>"))
            else:
                res1.append(r)
        return res1, res2
    
    def regexit(input):
        return re.split( "\||<>", input )
    
    rSplitter = re.compile("\||<>")
    
    def regexit2(input):
        return rSplitter.split(input)
    
    print("split: ", timeit.Timer("splitit( 'a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit())
    print("split2:", timeit.Timer("split2(  'a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit())
    print("regex: ", timeit.Timer("regexit( 'a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit())
    print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit())
    print("split: ", timeit.Timer("splitit( 'a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit())
    print("split2:", timeit.Timer("split2(  'a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit())
    print("regex: ", timeit.Timer("regexit( 'a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit())
    print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit())
    

    Which gives the following result:

    split:  1.8427431439631619
    split2: 1.0897291360306554
    regex:  1.6694280610536225
    regex2: 1.2277749050408602
    split:  14.356198082969058
    split2: 8.009285948995966
    regex:  9.526430513011292
    regex2: 9.083608677960001
    

    And of course split2() gives the nested lists that you wanted whereas the regex solution doesn’t.

    Compiling the regex will improve performance. It does make a slight difference, but Python caches compiled regular expressions so the saving is not as much as you might expect. I think usually it isn’t worth doing it for speed (though it can be in some cases), but it is often worthwhile to make the code clearer.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

My current project is in Rails. Coming from a Symfony (PHP) and Django (Python)
I have a current existing project repository that has a lot of code that
So my current project is mostly in Python, but I'm looking to rewrite the
I am adding a feature to my current project that will allow network admins
I need to integrate a email client in my current python web app. Anything
How do I get the current time in Python?
A reliable coder friend told me that Python's current multi-threading implementation is seriously buggy
In Python, how can I print the current call stack from within a method
How do I find out the current runtime directory from within a Python script?
I'd like to calculate Hebrew dates (primarily the current Hebrew date) in Python. Which

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.