Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6629265
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 25, 20262026-05-25T22:14:07+00:00 2026-05-25T22:14:07+00:00

I have very large datasets that are stored in binary files on the hard

  • 0

I have very large datasets that are stored in binary files on the hard disk. Here is an example of the file structure:

File Header

149 Byte ASCII Header

Record Start

4 Byte Int - Record Timestamp

Sample Start

2 Byte Int - Data Stream 1 Sample
2 Byte Int - Data Stream 2 Sample
2 Byte Int - Data Stream 3 Sample
2 Byte Int - Data Stream 4 Sample

Sample End

There are 122,880 Samples per Record and 713 Records per File. This yields a total size of 700,910,521 Bytes. The sample rate and number of records does vary sometimes so I have to code for detection of the number of each per file.

Currently the code I use to import this data into arrays works like this:

from time import clock
from numpy import zeros , int16 , int32 , hstack , array , savez
from struct import unpack
from os.path import getsize

start_time = clock()
file_size = getsize(input_file)

with open(input_file,'rb') as openfile:
  input_data = openfile.read()

header = input_data[:149]
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2

time_series = zeros(0,dtype=int32)
t_series = zeros(0,dtype=int16)
x_series = zeros(0,dtype=int16)
y_series = zeros(0,dtype=int16)
z_series = zeros(0,dtype=int16)

for record in xrange(number_of_records):

  time_stamp = array( unpack( '<l' , input_data[ 149 + (record * record_size) : 149 + (record * record_size) + 4 ] ) , dtype = int32 )
  unpacked_record = unpack( '<' + str(sample_rate * 4) + 'h' , input_data[ 149 + (record * record_size) + 4 : 149 + ( (record + 1) * record_size ) ] ) 

  record_t = zeros(sample_rate , dtype=int16)
  record_x = zeros(sample_rate , dtype=int16)
  record_y = zeros(sample_rate , dtype=int16)
  record_z = zeros(sample_rate , dtype=int16)

  for sample in xrange(sample_rate):

    record_t[sample] = unpacked_record[ ( sample * 4 ) + 0 ]
    record_x[sample] = unpacked_record[ ( sample * 4 ) + 1 ]
    record_y[sample] = unpacked_record[ ( sample * 4 ) + 2 ]
    record_z[sample] = unpacked_record[ ( sample * 4 ) + 3 ]

  time_series = hstack ( ( time_series , time_stamp ) )
  t_series = hstack ( ( t_series , record_t ) )
  x_series = hstack ( ( x_series , record_x ) )
  y_series = hstack ( ( y_series , record_y ) )
  z_series = hstack ( ( z_series , record_z ) )

savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, time=time_series)
end_time = clock()
print 'Total Time',end_time - start_time,'seconds'

This currently takes about 250 seconds per 700 MB file, which to me seems very high. Is there a more efficient way I could do this?

Final Solution

Using the numpy fromfile method with a custom dtype cut the runtime to 9 seconds, 27x faster than the original code above. The final code is below.

from numpy import savez, dtype , fromfile 
from os.path import getsize
from time import clock

start_time = clock()
file_size = getsize(input_file)

openfile = open(input_file,'rb')
header = openfile.read(149)
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2

record_dtype = dtype( [ ( 'timestamp' , '<i4' ) , ( 'samples' , '<i2' , ( sample_rate , 4 ) ) ] )

data = fromfile(openfile , dtype = record_dtype , count = number_of_records )
time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()

savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, fid=time_series)

end_time = clock()

print 'It took',end_time - start_time,'seconds'
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-25T22:14:07+00:00Added an answer on May 25, 2026 at 10:14 pm

    Some hints:

    • Don’t use the struct module. Instead, use Numpy’s structured data types and fromfile. Check here: http://scipy-lectures.github.com/advanced/advanced_numpy/index.html#example-reading-wav-files

    • You can read all of the records at once, by passing in a suitable count= to fromfile.

    Something like this (untested, but you get the idea):

    import numpy as np
    
    file = open(input_file, 'rb')
    header = file.read(149)
    
    # ... parse the header as you did ...
    
    record_dtype = np.dtype([
        ('timestamp', '<i4'), 
        ('samples', '<i2', (sample_rate, 4))
    ])
    
    data = np.fromfile(file, dtype=record_dtype, count=number_of_records)
    # NB: count can be omitted -- it just reads the whole file then
    
    time_series = data['timestamp']
    t_series = data['samples'][:,:,0].ravel()
    x_series = data['samples'][:,:,1].ravel()
    y_series = data['samples'][:,:,2].ravel()
    z_series = data['samples'][:,:,3].ravel()
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a very large file that looks like this (see below). I have
I have a very large code base that contains extensive unit tests (using CppUnit).
We have some very large data files (5 gig to 1TB) where we need
I have a very large codebase (read: thousands of modules) that has code shared
I have a very large C project with many separate C files and headers
Here at work we have a very large application with multiple sub applications. (500
Alright. So I have a very large amount of binary data (let's say, 10GB)
I am working on a project that must store very large datasets and associated
I have a data processing system that generates very large reports on the data
I have a very large (~6GB) SVN repository, for which I've written a batch

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.