
The Archive Base

Editorial Team
Asked: June 17, 2026 (2026-06-17T15:31:43+00:00)


I’m trying to build an ETL toolkit with pandas and HDF5.

My plan was:

  1. extract a table from MySQL into a DataFrame;
  2. put this DataFrame into an HDFStore.

But while doing step 2, I found that putting a DataFrame into a *.h5 file takes too much time.

  • size of the table on the source MySQL server: 498 MB
    • 52 columns
    • 924,624 records
  • size of the *.h5 file after putting the DataFrame inside: 513 MB
    • the ‘put’ operation takes 849.345677137 seconds

My questions are:
Is this time cost normal?
Is there any way to make it faster?
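To put the reported numbers in perspective, the write throughput can be worked out directly from the figures above. This is only a small sketch: the variable names are illustrative, and "MB" is taken at face value from the sizes quoted in the question.

```python
# Figures quoted in the question.
rows = 924_624                 # records in the table
h5_size_mb = 513               # size of the resulting *.h5 file, in MB
put_seconds = 849.345677137    # reported duration of the 'put' operation

rows_per_second = rows / put_seconds
mb_per_second = h5_size_mb / put_seconds

print(f"{rows_per_second:.0f} rows/s, {mb_per_second:.2f} MB/s")
```

That is roughly a thousand rows and well under 1 MB written per second, which hints that the time is being spent somewhere other than raw disk I/O.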


Update 1

Thanks, Jeff.

  • My code is pretty simple:

    from pandas import HDFStore
    extract_store = HDFStore('extract_store.h5')
    extract_store['df_staff'] = df_staff

  • When I tried ‘ptdump -av file.h5’, I got an error, but I could still load the DataFrame object from this h5 file:

tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "../../../src/H5F.c", line 1512, in H5Fopen
    unable to open file
  File "../../../src/H5F.c", line 1307, in H5F_open
    unable to read superblock
  File "../../../src/H5Fsuper.c", line 305, in H5F_super_read
    unable to find file signature
  File "../../../src/H5Fsuper.c", line 153, in H5F_locate_signature
    unable to find a valid file signature

End of HDF5 error back trace

Unable to open/create file 'extract_store.h5'

  • Some other info:
    • pandas version: ‘0.10.0’
    • os: ubuntu server 10.04 x86_64
    • cpu: 8 * Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
    • MemTotal: 51634016 kB

I will update pandas to 0.10.1-dev and try again.
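The "unable to find a valid file signature" message in the trace above means the HDF5 library did not see the 8-byte HDF5 signature where it expects it. A minimal sketch of checking for that signature by hand follows; the helper name `has_hdf5_signature` and the two throwaway files are invented for illustration. One possible cause, not confirmed by the post, is that the store was still open and not yet flushed when ptdump ran, in which case calling `close()` on the store first may be worth trying.

```python
import os
import tempfile

# The 8-byte signature that begins every HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path):
    """Return True if an HDF5 signature is found at offset 0, 512, 1024, ..."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        offset = 0
        while offset + len(HDF5_SIGNATURE) <= size:
            fh.seek(offset)
            if fh.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            # The superblock may sit at 0, then at 512 and each doubling of that.
            offset = 512 if offset == 0 else offset * 2
    return False


# Demonstration with two throwaway files (not real HDF5 data).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.h5")
    bad = os.path.join(tmp, "bad.h5")
    with open(good, "wb") as fh:
        fh.write(HDF5_SIGNATURE + b"\x00" * 64)
    with open(bad, "wb") as fh:
        fh.write(b"not an hdf5 file")
    good_ok = has_hdf5_signature(good)
    bad_ok = has_hdf5_signature(bad)

print(good_ok, bad_ok)
```

Running the helper against the real 'extract_store.h5' path would show whether the file that ptdump opened is the same, fully written file that pandas reads back.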


Update 2

  • I updated pandas to ‘0.10.1.dev-6e2b6ea’.
  • But the time cost did not decrease; it took 884.15 s this time.
  • The output of ‘ptdump -av file.h5’ is:
    / (RootGroup) ''  
      /._v_attrs (AttributeSet), 4 attributes:  
       [CLASS := 'GROUP',  
        PYTABLES_FORMAT_VERSION := '2.0',  
        TITLE := '',  
        VERSION := '1.0']  
    /df_bugs (Group) ''  
      /df_bugs._v_attrs (AttributeSet), 12 attributes:  
       [CLASS := 'GROUP',  
        TITLE := '',  
        VERSION := '1.0',  
        axis0_variety := 'regular',  
        axis1_variety := 'regular',  
        block0_items_variety := 'regular',  
        block1_items_variety := 'regular',  
        block2_items_variety := 'regular',  
        nblocks := 3,  
        ndim := 2,  
        pandas_type := 'frame',  
        pandas_version := '0.10.1']  
    /df_bugs/axis0 (Array(52,)) ''  
      atom := StringAtom(itemsize=19, shape=(), dflt='')  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/axis0._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/axis1 (Array(924624,)) ''  
      atom := Int64Atom(shape=(), dflt=0)  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'little'  
      chunkshape := None  
      /df_bugs/axis1._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'integer',  
        name := None,  
        transposed := True]  
    /df_bugs/block0_items (Array(5,)) ''  
      atom := StringAtom(itemsize=12, shape=(), dflt='')  
      maindim := 0   
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/block0_items._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/block0_values (Array(924624, 5)) ''  
      atom := Float64Atom(shape=(), dflt=0.0)  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'little'  
      chunkshape := None  
      /df_bugs/block0_values._v_attrs (AttributeSet), 5 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        transposed := True]  
    /df_bugs/block1_items (Array(19,)) ''  
      atom := StringAtom(itemsize=19, shape=(), dflt='')  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/block1_items._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/block1_values (Array(924624, 19)) ''  
      atom := Int64Atom(shape=(), dflt=0)  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'little'  
      chunkshape := None  
      /df_bugs/block1_values._v_attrs (AttributeSet), 5 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',   
        VERSION := '2.3',  
        transposed := True]  
    /df_bugs/block2_items (Array(28,)) ''  
      atom := StringAtom(itemsize=18, shape=(), dflt='')  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/block2_items._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/block2_values (VLArray(1,)) ''  
      atom = ObjectAtom()  
      byteorder = 'irrelevant'  
      nrows = 1  
      flavor = 'numpy'  
      /df_bugs/block2_values._v_attrs (AttributeSet), 5 attributes:  
       [CLASS := 'VLARRAY',  
        PSEUDOATOM := 'object',  
        TITLE := '',   
        VERSION := '1.3',  
        transposed := True]  
  • I also tried your code (putting the DataFrame into the HDFStore with the ‘table’ parameter set to True), but got an error instead; it seems that Python’s datetime type is not supported:

Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()


Update 3

Thanks, Jeff.
Sorry for the delay.

  • tables.version: ‘2.4.0’.
  • Yes, the 884 seconds covers only the put operation, not the pull from MySQL.
  • A row of the DataFrame (df.ix[0]):
bug_id                                   1
assigned_to                            185
bug_file_loc                          None
bug_severity                      critical
bug_status                          closed
creation_ts            1998-05-06 21:27:00
delta_ts               2012-05-09 14:41:41
short_desc                    Two cursors.
host_op_sys                        Unknown
guest_op_sys                       Unknown
priority                                P3
rep_platform                          IA32
reporter                                56
product_id                               7
category_id                            983
component_id                         12925
resolution                           fixed
target_milestone                       ws1
qa_contact                             412
status_whiteboard                         
votes                                    0
keywords                                SR
lastdiffed             2012-05-09 14:41:41
everconfirmed                            1
reporter_accessible                      1
cclist_accessible                        1
estimated_time                        0.00
remaining_time                        0.00
deadline                              None
alias                                 None
found_in_product_id                      0
found_in_version_id                      0
found_in_phase_id                        0
cf_type                             Defect
cf_reported_by                 Development
cf_attempted                           NaN
cf_failed                              NaN
cf_public_summary                         
cf_doc_impact                            0
cf_security                              0
cf_build                               NaN
cf_branch                                 
cf_change                              NaN
cf_test_id                             NaN
cf_regression                      Unknown
cf_reviewer                              0
cf_on_hold                               0
cf_public_severity                     ---
cf_i18n_impact                           0
cf_eta                                None
cf_bug_source                          ---
cf_viss                               None
Name: 0, Length: 52
  • The summary of the DataFrame (just typing ‘df’ in an IPython notebook):

Int64Index: 924624 entries, 0 to 924623
Data columns:
bug_id                 924624  non-null values
assigned_to            924624  non-null values
bug_file_loc           427318  non-null values
bug_severity           924624  non-null values
bug_status             924624  non-null values
creation_ts            924624  non-null values
delta_ts               924624  non-null values
short_desc             924624  non-null values
host_op_sys            924624  non-null values
guest_op_sys           924624  non-null values
priority               924624  non-null values
rep_platform           924624  non-null values
reporter               924624  non-null values
product_id             924624  non-null values
category_id            924624  non-null values
component_id           924624  non-null values
resolution             924624  non-null values
target_milestone       924624  non-null values
qa_contact             924624  non-null values
status_whiteboard      924624  non-null values
votes                  924624  non-null values
keywords               924624  non-null values
lastdiffed             924509  non-null values
everconfirmed          924624  non-null values
reporter_accessible    924624  non-null values
cclist_accessible      924624  non-null values
estimated_time         924624  non-null values
remaining_time         924624  non-null values
deadline               0  non-null values
alias                  0  non-null values
found_in_product_id    924624  non-null values
found_in_version_id    924624  non-null values
found_in_phase_id      924624  non-null values
cf_type                924624  non-null values
cf_reported_by         924624  non-null values
cf_attempted           89622  non-null values
cf_failed              89587  non-null values
cf_public_summary      510799  non-null values
cf_doc_impact          924624  non-null values
cf_security            924624  non-null values
cf_build               327460  non-null values
cf_branch              614929  non-null values
cf_change              300612  non-null values
cf_test_id             12610  non-null values
cf_regression          924624  non-null values
cf_reviewer            924624  non-null values
cf_on_hold             924624  non-null values
cf_public_severity     924624  non-null values
cf_i18n_impact         924624  non-null values
cf_eta                 3910  non-null values
cf_bug_source          924624  non-null values
cf_viss                725  non-null values
dtypes: float64(5), int64(19), object(28)
  • After ‘convert_objects()’:
dtypes: datetime64[ns](2), float64(5), int64(19), object(26)
  • Putting the converted DataFrame into the HDFStore took 749.50 s 🙂
    • It seems that reducing the number of ‘object’ dtype columns is the key to decreasing the time cost.
  • Putting the converted DataFrame into the HDFStore with the ‘table’ parameter set to True still raises that error:
/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                 raise
   2204             except (Exception), detail:
-> 2205                 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207 
Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()
  • I’m trying to put the dataframe without datetime columns
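The "cannot find the correct atom type" errors name the Python type that is hiding inside an object column, but not the column itself. A minimal sketch of listing what each object column actually contains is shown here; the helper `object_column_types` and the toy values are invented for illustration, and only the column names are borrowed from the post.

```python
import datetime
from decimal import Decimal

import pandas as pd


def object_column_types(df):
    """Map each object-dtype column to the set of Python type names it holds."""
    report = {}
    for name in df.columns:
        col = df[name]
        if col.dtype == object:
            report[name] = {type(v).__name__ for v in col}
    return report


# Toy frame with invented values; column names are borrowed from the post.
df = pd.DataFrame({
    "bug_id": [1, 2],
    "lastdiffed": pd.Series(
        [datetime.datetime(2012, 5, 9, 14, 41, 41), None], dtype=object
    ),
    "estimated_time": pd.Series(
        [Decimal("0.00"), Decimal("1.50")], dtype=object
    ),
})

types_found = object_column_types(df)
print(types_found)
```

Columns that report anything other than plain strings are candidates for conversion before attempting a table-format put.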

Update 4

  • There are 4 columns in mysql whose type is datetime:
    • creation_ts
    • delta_ts
    • lastdiffed
    • deadline

After calling convert_objects():

  • creation_ts:
Timestamp: 1998-05-06 21:27:00
  • delta_ts:
Timestamp: 2012-05-09 14:41:41
  • lastdiffed
datetime.datetime(2012, 5, 9, 14, 41, 41)
  • deadline is always None, both before and after calling ‘convert_objects’
None
  • Putting the DataFrame without the ‘lastdiffed’ column took 691.75 s.
  • When putting the DataFrame without the ‘lastdiffed’ column and with the ‘table’ parameter set to True, I got a new error:
/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                 raise
   2204             except (Exception), detail:
-> 2205                 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207 

Exception: cannot find the correct atom type -> [dtype->object] object of type 'Decimal' has no len()
  • The columns ‘estimated_time’, ‘remaining_time’, and ‘cf_viss’ are of type ‘decimal’ in MySQL.

Update 5

  • I converted these ‘decimal’ columns to ‘float’ with the code below:
no_diffed_converted_df_bugs.estimated_time = no_diffed_converted_df_bugs.estimated_time.map(float)
  • Now the time cost is 372.84 s.
  • But the ‘table’ version of the put still raises an error:
/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                 raise
   2204             except (Exception), detail:
-> 2205                 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207 

Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.date' has no len()
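The remaining errors come from Python objects (Decimal, datetime.date) left inside object columns. A minimal sketch of converting both kinds to native dtypes follows. It is written against a current pandas rather than the 0.10.x release used in the post; the helper `decimals_to_float`, the column name `some_date`, and all values are hypothetical. A plain `map(float)` raises a TypeError on None, so the helper maps None to NaN explicitly.

```python
import datetime
from decimal import Decimal

import pandas as pd


def decimals_to_float(series):
    """Convert Decimal values to float, mapping None to NaN."""
    converted = series.map(lambda v: float("nan") if v is None else float(v))
    return converted.astype("float64")


# Toy data with invented values; 'some_date' is a hypothetical column name.
df = pd.DataFrame({
    "cf_viss": pd.Series([Decimal("7.5"), None], dtype=object),
    "some_date": pd.Series([datetime.date(2012, 5, 9), None], dtype=object),
})

df["cf_viss"] = decimals_to_float(df["cf_viss"])
# pd.to_datetime turns datetime.date objects into a datetime64 column (None -> NaT).
df["some_date"] = pd.to_datetime(df["some_date"])

print(df.dtypes)
```

After conversion neither column is of object dtype, so neither should hit the Decimal or datetime.date atom-type errors shown above.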
1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 3:31 pm

    How to make this faster?

    1. Use ‘io.sql.read_frame’ to load data from a SQL database into a DataFrame, because ‘read_frame’ takes care of columns whose type is ‘decimal’ by converting them to float.
    2. Fill the missing data in each column.
    3. Call ‘DataFrame.convert_objects’ before the put operation.
    4. If the DataFrame has string columns, use ‘table’ instead of ‘storer’:

    store.put('key', df, table=True)
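Steps 2 to 4 above can be sketched with a current pandas release as follows. This is an approximation and not the original code: `convert_objects` is no longer available in recent pandas, so `infer_objects()` stands in for it here, and `format="table"` is the current spelling of `table=True`. The helpers `prepare_for_table` and `write_table` and the toy values are invented for illustration.

```python
import pandas as pd


def prepare_for_table(df, text_columns, fill=""):
    """Fill missing values in text columns and let pandas infer better dtypes."""
    out = df.copy()
    for name in text_columns:
        out[name] = out[name].fillna(fill)
    # infer_objects() stands in for the older convert_objects() call.
    return out.infer_objects()


def write_table(df, path, key):
    """Write df in table format; needs PyTables installed. Not called below."""
    with pd.HDFStore(path, mode="w") as store:
        store.put(key, df, format="table")


# Toy frame with invented values; column names are borrowed from the post.
raw = pd.DataFrame({
    "bug_id": pd.Series([1, 2], dtype=object),
    "cf_branch": pd.Series(["main", None], dtype=object),
})
clean = prepare_for_table(raw, text_columns=["cf_branch"])
print(clean.dtypes)
```

With PyTables available, `write_table(clean, "extract_store.h5", "df_bugs")` would then perform the actual put.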

    After these changes, the put operation on the same data set is much faster:

    CPU times: user 42.07 s, sys: 28.17 s, total: 70.24 s
    Wall time: 98.97 s
    

    Profile logs of the second test:

    95984 function calls (95958 primitive calls) in 68.688 CPU seconds
    
       Ordered by: internal time
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          445   16.757    0.038   16.757    0.038 {numpy.core.multiarray.array}
           19   16.250    0.855   16.250    0.855 {method '_append_records' of 'tables.tableExtension.Table' objects}
           16    7.958    0.497    7.958    0.497 {method 'astype' of 'numpy.ndarray' objects}
           19    6.533    0.344    6.533    0.344 {pandas.lib.create_hdf_rows_2d}
            4    6.284    1.571    6.388    1.597 {method '_fillCol' of 'tables.tableExtension.Row' objects}
           20    2.640    0.132    2.641    0.132 {pandas.lib.maybe_convert_objects}
            1    1.785    1.785    1.785    1.785 {pandas.lib.isnullobj}
            7    1.619    0.231    1.619    0.231 {method 'flatten' of 'numpy.ndarray' objects}
           11    1.059    0.096    1.059    0.096 {pandas.lib.infer_dtype}
            1    0.997    0.997   41.952   41.952 pytables.py:2468(write_data)
           19    0.985    0.052   40.590    2.136 pytables.py:2504(write_data_chunk)
            1    0.827    0.827   60.617   60.617 pytables.py:2433(write)
         1504    0.592    0.000    0.592    0.000 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
            4    0.534    0.133   13.676    3.419 pytables.py:1038(set_atom)
            1    0.528    0.528    0.528    0.528 {pandas.lib.max_len_string_array}
            4    0.441    0.110    0.571    0.143 internals.py:1409(_stack_arrays)
           35    0.358    0.010    0.358    0.010 {method 'copy' of 'numpy.ndarray' objects}
            1    0.276    0.276    3.135    3.135 internals.py:208(fillna)
            5    0.263    0.053    2.054    0.411 common.py:128(_isnull_ndarraylike)
           48    0.253    0.005    0.253    0.005 {method '_append' of 'tables.hdf5Extension.Array' objects}
            4    0.240    0.060    1.500    0.375 internals.py:1400(_simple_blockify)
            1    0.234    0.234   12.145   12.145 pytables.py:1066(set_atom_string)
           28    0.225    0.008    0.225    0.008 {method '_createCArray' of 'tables.hdf5Extension.Array' objects}
           36    0.218    0.006    0.218    0.006 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
         6110    0.155    0.000    0.155    0.000 {numpy.core.multiarray.empty}
            4    0.097    0.024    0.097    0.024 {method 'all' of 'numpy.ndarray' objects}
            6    0.084    0.014    0.084    0.014 {tables.indexesExtension.keysort}
           18    0.084    0.005    0.084    0.005 {method '_g_close' of 'tables.hdf5Extension.Leaf' objects}
        11816    0.064    0.000    0.108    0.000 file.py:1036(_getNode)
           19    0.053    0.003    0.053    0.003 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
         1528    0.045    0.000    0.098    0.000 array.py:342(_interpret_indexing)
        11709    0.040    0.000    0.042    0.000 file.py:248(__getitem__)
            2    0.027    0.013    0.383    0.192 index.py:1099(get_neworder)
            1    0.018    0.018    0.018    0.018 {numpy.core.multiarray.putmask}
            4    0.013    0.003    0.017    0.004 index.py:607(final_idx32)
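A listing like the one above, ordered by internal time, can be produced with the standard library profiler. The sketch below shows one way to do it; `profile_call` and `busy_work` are hypothetical names, and `busy_work` merely stands in for the `store.put(...)` call being measured.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, report sorted by internal time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(10)  # show the top 10 entries only
    return result, buffer.getvalue()


def busy_work(n):
    """Hypothetical stand-in for the store.put(...) call being measured."""
    return sum(i * i for i in range(n))


result, report = profile_call(busy_work, 10_000)
print(report)
```

In the listing above, the entries with the largest tottime (array construction, `_append_records`, and `astype`) are where most of the remaining put time goes.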
    