The Archive Base


Editorial Team
Asked: June 10, 2026

I am looking for a way to speed up loading a file like this

The data contains about 1 million lines, tab-separated (“\t”) and UTF-8 encoded. Parsing the full file with the code below takes about 9 seconds, but I would like to get it down to roughly one second!

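import codecs
import sys

def load(filename):
    features = []
    with codecs.open(filename, 'rb', 'utf-8') as f:
        # An empty tuple sorts before any parsed line.
        previous = ()
        for n, s in enumerate(f):
            # Each line must hold exactly two tab-separated fields.
            splitted = tuple(s.rstrip().split("\t"))
            if len(splitted) != 2:
                sys.exit("wrong format!")
            # Lines must appear in strictly increasing order.
            if previous >= splitted:
                sys.exit("unordered feature")
            previous = splitted
            features.append(splitted)
    return features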

I am wondering whether a binary data format could speed things up, or whether I could benefit from NumPy or another library to load the file faster.

Maybe you could give me advice on another speed bottleneck?
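On the binary-format question, one option that could be measured is the standard library's pickle module. The sketch below is written for Python 3 and is only an illustration: the helper names save_features and load_features are hypothetical, and whether loading a pickle beats parsing the text file has to be checked on the real data.

```python
import pickle

def save_features(features, path):
    # Hypothetical helper: store the parsed list of tuples in binary form.
    with open(path, "wb") as f:
        pickle.dump(features, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_features(path):
    # Hypothetical helper: read the list of tuples back.
    # Only unpickle files you created yourself; pickle is not safe
    # for untrusted input.
    with open(path, "rb") as f:
        return pickle.load(f)
```

The idea is to parse and validate the text file once, save the result, and load the binary file on later runs.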

EDIT: I tried some of your ideas, thanks! By the way, I really do need the (string, string) tuples inside the huge list. Here are the results; I am saving 50% of the time 🙂 Next I am going to look into NumPy binary data, as I noticed that another huge file was really quick to load.

import codecs
import datetime

def load0(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return f.readlines()

def load1(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return [tuple(x.rstrip().split("\t")) for x in f.readlines()]

def load3(filename):
    features = []
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for n, s in enumerate(f):
            splitted = tuple(s.rstrip().split("\t"))
            features.append(splitted)
    return features

def load4(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for s in f:
            yield tuple(s.rstrip().split("\t"))

# myfile holds the path to the tab-separated input file.
a = datetime.datetime.now()
r0 = load0(myfile)
b = datetime.datetime.now()
print("f.readlines(): %s" % (b-a))

a = datetime.datetime.now()
r1 = load1(myfile)
b = datetime.datetime.now()
print("""[tuple(x.rstrip().split("\\t")) for x in f.readlines()]: %s""" % (b-a))

a = datetime.datetime.now()
r3 = load3(myfile)
b = datetime.datetime.now()
print("""load3: %s""" % (b-a))
if r1 == r3: print("OK: speeded and similars!")

a = datetime.datetime.now()
r4 = list(load4(myfile))
b = datetime.datetime.now()
print("""load4: %s""" % (b-a))
if r4 == r3: print("OK: speeded and similars!")

Results:

f.readlines(): 0:00:00.208000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.310000
load3: 0:00:07.883000
OK: speeded and similars!
load4: 0:00:07.943000
OK: speeded and similars!
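Since the list comprehension in load1 is the fastest parsing variant here but drops the format and ordering checks of the original load, a possible middle ground is to parse first and validate afterwards. This is a sketch for Python 3, not a measured result: the name load_checked is hypothetical, and it raises ValueError instead of calling sys.exit.

```python
import io
from itertools import islice

def load_checked(filename):
    # Hypothetical helper: parse every line in one list comprehension.
    with io.open(filename, encoding="utf-8") as f:
        features = [tuple(line.rstrip().split("\t")) for line in f]
    # Format check: every line must have exactly two fields.
    if any(len(pair) != 2 for pair in features):
        raise ValueError("wrong format!")
    # Ordering check: each tuple must be strictly greater than the previous one.
    for previous, current in zip(features, islice(features, 1, None)):
        if previous >= current:
            raise ValueError("unordered feature")
    return features
```

The checks now run over the finished list, so a malformed file is only rejected after it has been read completely.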

Something very strange: I sometimes (but not every time) see the time almost double between two consecutive runs:

>>> ================================ RESTART ================================
>>> 
f.readlines(): 0:00:00.220000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.479000
load3: 0:00:08.288000
OK: speeded and similars!
>>> ================================ RESTART ================================
>>> 
f.readlines(): 0:00:00.279000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:04.983000
load3: 0:00:10.404000
OK: speeded and similars!
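A single datetime difference per run is sensitive to whatever else the machine is doing, so one way to get steadier numbers is to repeat each measurement and keep the fastest run. The sketch below uses the standard timeit module in Python 3; the helper name best_of is hypothetical.

```python
import timeit

def best_of(func, *args, repeat=5):
    # Hypothetical helper: call func(*args) `repeat` times, once per
    # measurement, and return the fastest wall-clock time in seconds.
    # Note: timeit disables garbage collection while timing by default,
    # so the figures may differ from plain datetime differences.
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=1))
```

For example, best_of(load1, myfile) would time load1 five times and report the best run.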

EDIT LATEST: Well, I tried switching to numpy.load, and the result is very strange to me. I started from a “normal” file with my 1022860 strings, and 10 KB.
After doing a numpy.save(numpy.array(load1(myfile))) the file grew to 895 MB! Reloading it with numpy.load() then gives this kind of timing on consecutive runs:

  >>> ================================ RESTART ================================
  loading: 0:00:11.422000 done.
  >>> ================================ RESTART ================================
  loading: 0:00:00.759000 done.

Maybe NumPy does some memory caching to avoid future reloads?


1 Answer

  1. Editorial Team
    Added an answer on June 10, 2026 at 8:53 pm

    Check how many seconds it actually takes just to read the lines of the file, like this:

    import codecs

    def load(filename):
        with codecs.open(filename, 'rb', 'utf-8') as f:
            return f.readlines()
    

    If it is significantly less than 9 seconds, then

    1. try using multiprocessing to split the work of checking lines across CPU cores, and/or
    2. use a faster interpreter such as PyPy

    and see whether either of these speeds things up.
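The multiprocessing suggestion could look roughly like the sketch below, written for Python 3. The names parse_chunk and load_parallel and the default workers and chunk_size values are assumptions for illustration. Sending lines to the workers and tuples back involves pickling, which may cancel out the gain, so this is something to measure rather than a guaranteed speed-up.

```python
import io
from multiprocessing import Pool

def parse_chunk(lines):
    # Hypothetical worker: split each line into a tuple of fields.
    return [tuple(line.rstrip().split("\t")) for line in lines]

def load_parallel(filename, workers=4, chunk_size=100000):
    # Hypothetical helper: read all lines, then parse chunks in parallel.
    with io.open(filename, encoding="utf-8") as f:
        lines = f.readlines()
    chunks = [lines[i:i + chunk_size]
              for i in range(0, len(lines), chunk_size)]
    with Pool(workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        parsed = pool.map(parse_chunk, chunks)
    return [pair for chunk in parsed for pair in chunk]
```

On platforms that start worker processes by re-importing the main module, load_parallel should be called from inside an `if __name__ == "__main__":` block.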
