I’m trying to build an ETL toolkit with pandas, hdf5. My plan was extracting

Editorial Team

Asked: June 17, 2026 (2026-06-17T15:31:43+00:00)

I’m trying to build an ETL toolkit with pandas and HDF5.

My plan was:

1. extract a table from MySQL into a DataFrame;
2. put this DataFrame into an HDFStore.

But while doing step 2, I found that putting a DataFrame into a *.h5 file takes too much time.

Size of the table on the source MySQL server: 498 MB
- 52 columns
- 924,624 records
Size of the *.h5 file after putting the DataFrame inside: 513 MB
- the ‘put’ operation took 849.345677137 seconds

My questions are:
- Is this time cost normal?
- Is there any way to make it faster?
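To compare timings like the one above, the write can be measured in isolation from the database pull. The helper below is a minimal sketch; `timed` is a hypothetical name, and the workload is a stand-in, not the actual HDFStore write.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the real pipeline this would be the HDFStore write.
total, seconds = timed(sum, range(1000000))
print("sum=%d took %.3f s" % (total, seconds))
```

With an open store, the same helper could wrap the write alone, for example `timed(extract_store.put, 'df_staff', df_staff)`, so that the MySQL extraction is not counted.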

Update 1

Thanks, Jeff.

My code is pretty simple:

    from pandas import HDFStore

    extract_store = HDFStore('extract_store.h5')
    extract_store['df_staff'] = df_staff

When I tried ‘ptdump -av file.h5’, I got an error, but I could still load the DataFrame object from this h5 file:

    tables.exceptions.HDF5ExtError: HDF5 error back trace

      File "../../../src/H5F.c", line 1512, in H5Fopen
        unable to open file
      File "../../../src/H5F.c", line 1307, in H5F_open
        unable to read superblock
      File "../../../src/H5Fsuper.c", line 305, in H5F_super_read
        unable to find file signature
      File "../../../src/H5Fsuper.c", line 153, in H5F_locate_signature
        unable to find a valid file signature

    End of HDF5 error back trace

    Unable to open/create file 'extract_store.h5'
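The "unable to find a valid file signature" message means the HDF5 library did not find its 8-byte signature where the format expects it (offset 0, or 512, 1024, 2048 and so on). The post does not establish why that happened here; an empty, truncated, or not yet flushed file are possible causes, but that is an assumption. A minimal sketch of the same check in plain Python, with `has_hdf5_signature` as a hypothetical helper name:

```python
import os
import tempfile

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path):
    """Return True if the HDF5 signature sits at offset 0, 512, 1024, ..."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        offset = 0
        while offset + len(HDF5_SIGNATURE) <= size:
            handle.seek(offset)
            if handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            # Candidate offsets double after the first 512-byte step.
            offset = 512 if offset == 0 else offset * 2
    return False


# Demonstration on two throwaway files (not real HDF5 content).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.h5")
    bad = os.path.join(tmp, "bad.h5")
    with open(good, "wb") as handle:
        handle.write(HDF5_SIGNATURE + b"\x00" * 8)
    with open(bad, "wb") as handle:
        handle.write(b"not an hdf5 file")
    good_ok = has_hdf5_signature(good)
    bad_ok = has_hdf5_signature(bad)

print(good_ok, bad_ok)
```

Running it against the real 'extract_store.h5' after closing the store would show whether the file on disk starts with a valid signature.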

Some other info:
- pandas version: ‘0.10.0’
- OS: Ubuntu Server 10.04 x86_64
- CPU: 8 * Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
- MemTotal: 51634016 kB

I will update pandas to 0.10.1-dev and try again.

Update 2

I updated pandas to ‘0.10.1.dev-6e2b6ea’, but the time cost did not decrease: the put took 884.15 seconds this time.
The output of ‘ptdump -av file.h5’ is:

    / (RootGroup) ''  
      /._v_attrs (AttributeSet), 4 attributes:  
       [CLASS := 'GROUP',  
        PYTABLES_FORMAT_VERSION := '2.0',  
        TITLE := '',  
        VERSION := '1.0']  
    /df_bugs (Group) ''  
      /df_bugs._v_attrs (AttributeSet), 12 attributes:  
       [CLASS := 'GROUP',  
        TITLE := '',  
        VERSION := '1.0',  
        axis0_variety := 'regular',  
        axis1_variety := 'regular',  
        block0_items_variety := 'regular',  
        block1_items_variety := 'regular',  
        block2_items_variety := 'regular',  
        nblocks := 3,  
        ndim := 2,  
        pandas_type := 'frame',  
        pandas_version := '0.10.1']  
    /df_bugs/axis0 (Array(52,)) ''  
      atom := StringAtom(itemsize=19, shape=(), dflt='')  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/axis0._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/axis1 (Array(924624,)) ''  
      atom := Int64Atom(shape=(), dflt=0)  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'little'  
      chunkshape := None  
      /df_bugs/axis1._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'integer',  
        name := None,  
        transposed := True]  
    /df_bugs/block0_items (Array(5,)) ''  
      atom := StringAtom(itemsize=12, shape=(), dflt='')  
      maindim := 0   
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/block0_items._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/block0_values (Array(924624, 5)) ''  
      atom := Float64Atom(shape=(), dflt=0.0)  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'little'  
      chunkshape := None  
      /df_bugs/block0_values._v_attrs (AttributeSet), 5 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        transposed := True]  
    /df_bugs/block1_items (Array(19,)) ''  
      atom := StringAtom(itemsize=19, shape=(), dflt='')  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/block1_items._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/block1_values (Array(924624, 19)) ''  
      atom := Int64Atom(shape=(), dflt=0)  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'little'  
      chunkshape := None  
      /df_bugs/block1_values._v_attrs (AttributeSet), 5 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',   
        VERSION := '2.3',  
        transposed := True]  
    /df_bugs/block2_items (Array(28,)) ''  
      atom := StringAtom(itemsize=18, shape=(), dflt='')  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/block2_items._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/block2_values (VLArray(1,)) ''  
      atom = ObjectAtom()  
      byteorder = 'irrelevant'  
      nrows = 1  
      flavor = 'numpy'  
      /df_bugs/block2_values._v_attrs (AttributeSet), 5 attributes:  
       [CLASS := 'VLARRAY',  
        PSEUDOATOM := 'object',  
        TITLE := '',   
        VERSION := '1.3',  
        transposed := True]

I also tried your code (putting the DataFrame into the HDFStore with the parameter ‘table’ set to True), but got an error instead. It seems that Python’s datetime type is not supported:

    Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()
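Errors of this kind name a Python type hiding inside an object-dtype column, so it can help to list which types each object column actually holds. This is a minimal diagnostic sketch; `object_column_types` is a hypothetical helper, and the small frame only borrows column names from the question.

```python
import datetime
from decimal import Decimal

import pandas as pd


def object_column_types(df):
    """Map each object-dtype column to the sorted names of the Python types in it."""
    report = {}
    for name in df.columns:
        if df[name].dtype == object:
            report[name] = sorted({type(value).__name__ for value in df[name]})
    return report


# Tiny stand-in for the real table; dtype=object keeps the raw Python values.
df = pd.DataFrame({
    "bug_id": [1, 2],
    "lastdiffed": pd.Series(
        [datetime.datetime(2012, 5, 9, 14, 41, 41), None], dtype=object
    ),
    "estimated_time": pd.Series([Decimal("0.00"), Decimal("1.50")], dtype=object),
    "priority": pd.Series(["P3", "P2"], dtype=object),
})

report = object_column_types(df)
print(report)
```

Any column whose report contains something other than `str` is a candidate for conversion before a table-format put.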

Update 3

Thanks, Jeff.
Sorry for the delay.

tables.version: ‘2.4.0’.
Yes, the 884 seconds covers only the put operation, not the pull from MySQL.
A row of the DataFrame (df.ix[0]):

bug_id                                   1
assigned_to                            185
bug_file_loc                          None
bug_severity                      critical
bug_status                          closed
creation_ts            1998-05-06 21:27:00
delta_ts               2012-05-09 14:41:41
short_desc                    Two cursors.
host_op_sys                        Unknown
guest_op_sys                       Unknown
priority                                P3
rep_platform                          IA32
reporter                                56
product_id                               7
category_id                            983
component_id                         12925
resolution                           fixed
target_milestone                       ws1
qa_contact                             412
status_whiteboard                         
votes                                    0
keywords                                SR
lastdiffed             2012-05-09 14:41:41
everconfirmed                            1
reporter_accessible                      1
cclist_accessible                        1
estimated_time                        0.00
remaining_time                        0.00
deadline                              None
alias                                 None
found_in_product_id                      0
found_in_version_id                      0
found_in_phase_id                        0
cf_type                             Defect
cf_reported_by                 Development
cf_attempted                           NaN
cf_failed                              NaN
cf_public_summary                         
cf_doc_impact                            0
cf_security                              0
cf_build                               NaN
cf_branch                                 
cf_change                              NaN
cf_test_id                             NaN
cf_regression                      Unknown
cf_reviewer                              0
cf_on_hold                               0
cf_public_severity                     ---
cf_i18n_impact                           0
cf_eta                                None
cf_bug_source                          ---
cf_viss                               None
Name: 0, Length: 52

A summary of the DataFrame (just typing ‘df’ in an IPython notebook):


Int64Index: 924624 entries, 0 to 924623
Data columns:
bug_id                 924624  non-null values
assigned_to            924624  non-null values
bug_file_loc           427318  non-null values
bug_severity           924624  non-null values
bug_status             924624  non-null values
creation_ts            924624  non-null values
delta_ts               924624  non-null values
short_desc             924624  non-null values
host_op_sys            924624  non-null values
guest_op_sys           924624  non-null values
priority               924624  non-null values
rep_platform           924624  non-null values
reporter               924624  non-null values
product_id             924624  non-null values
category_id            924624  non-null values
component_id           924624  non-null values
resolution             924624  non-null values
target_milestone       924624  non-null values
qa_contact             924624  non-null values
status_whiteboard      924624  non-null values
votes                  924624  non-null values
keywords               924624  non-null values
lastdiffed             924509  non-null values
everconfirmed          924624  non-null values
reporter_accessible    924624  non-null values
cclist_accessible      924624  non-null values
estimated_time         924624  non-null values
remaining_time         924624  non-null values
deadline               0  non-null values
alias                  0  non-null values
found_in_product_id    924624  non-null values
found_in_version_id    924624  non-null values
found_in_phase_id      924624  non-null values
cf_type                924624  non-null values
cf_reported_by         924624  non-null values
cf_attempted           89622  non-null values
cf_failed              89587  non-null values
cf_public_summary      510799  non-null values
cf_doc_impact          924624  non-null values
cf_security            924624  non-null values
cf_build               327460  non-null values
cf_branch              614929  non-null values
cf_change              300612  non-null values
cf_test_id             12610  non-null values
cf_regression          924624  non-null values
cf_reviewer            924624  non-null values
cf_on_hold             924624  non-null values
cf_public_severity     924624  non-null values
cf_i18n_impact         924624  non-null values
cf_eta                 3910  non-null values
cf_bug_source          924624  non-null values
cf_viss                725  non-null values
dtypes: float64(5), int64(19), object(28)

after ‘convert_objects()’:

dtypes: datetime64[ns](2), float64(5), int64(19), object(26)

Putting the converted DataFrame into the HDFStore took 749.50 s 🙂
- It seems that reducing the number of ‘object’ dtype columns is the key to decreasing the time cost.

Putting the converted DataFrame into the HDFStore with the parameter ‘table’ set to True still raises that error:

/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                 raise
   2204             except (Exception), detail:
-> 2205                 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207 
Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()

I’m now trying to put the DataFrame without the datetime columns.

Update 4

There are 4 columns in MySQL whose type is datetime:
- creation_ts
- delta_ts
- lastdiffed
- deadline

After calling the convert_objects():

creation_ts:

Timestamp: 1998-05-06 21:27:00

delta_ts:

Timestamp: 2012-05-09 14:41:41

lastdiffed:

datetime.datetime(2012, 5, 9, 14, 41, 41)

deadline is always None, both before and after calling ‘convert_objects’:

None
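A column that holds nothing but None stays object dtype whatever conversion is applied. One option, which the post does not try and is offered here only as an assumption, is to drop columns without any non-null value before writing. A minimal sketch with a stand-in frame:

```python
import pandas as pd

# Stand-in frame: 'deadline' and 'alias' have no non-null values, as in the summary.
df = pd.DataFrame({
    "bug_id": [1, 2, 3],
    "deadline": pd.Series([None, None, None], dtype=object),
    "alias": pd.Series([None, None, None], dtype=object),
})

# Collect the columns in which every value is missing, then drop them.
empty_columns = [name for name in df.columns if df[name].isnull().all()]
trimmed = df.drop(columns=empty_columns)

print(empty_columns)
print(list(trimmed.columns))
```

Whether dropping is acceptable depends on whether downstream code expects those columns to exist.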

Putting the DataFrame without the column ‘lastdiffed’ took 691.75 s.
When putting the DataFrame without the column ‘lastdiffed’ and with the parameter ‘table’ set to True, I got a new error:

/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                 raise
   2204             except (Exception), detail:
-> 2205                 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207 

Exception: cannot find the correct atom type -> [dtype->object] object of type 'Decimal' has no len()

The columns ‘estimated_time’, ‘remaining_time’ and ‘cf_viss’ have type ‘decimal’ in MySQL.
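Decimal values arrive as Python `Decimal` objects, so such a column is object dtype until it is converted. A plain `float` call fails on None, which matters for a sparsely filled column like ‘cf_viss’. The sketch below maps None to NaN; `decimal_to_float` is a hypothetical helper and the sample values are made up.

```python
import math
from decimal import Decimal

import pandas as pd


def decimal_to_float(value):
    """Convert a Decimal to float; map None to NaN so the column becomes float64."""
    return float("nan") if value is None else float(value)


# Made-up sample standing in for a decimal column with missing values.
cf_viss = pd.Series([Decimal("7.5"), None, Decimal("0.00")], dtype=object)
converted = cf_viss.map(decimal_to_float)

print(converted.dtype)
```

The result is a float64 column, which no longer needs the object handling path.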

Update 5

I converted these ‘decimal’ columns to float with the code below:

    no_diffed_converted_df_bugs.estimated_time = no_diffed_converted_df_bugs.estimated_time.map(float)

Now the time cost is 372.84 s, but the put with ‘table’ set to True still raises an error:

/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                 raise
   2204             except (Exception), detail:
-> 2205                 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207 

Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.date' has no len()
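This last error points at `datetime.date` objects left in an object column. The post does not say which column holds them. A minimal sketch of one way to convert such values, using `pd.to_datetime` from current pandas (an assumption, since the question uses 0.10.x) on made-up data:

```python
import datetime

import pandas as pd

# Made-up column holding datetime.date objects and a missing value.
dates = pd.Series([datetime.date(2012, 5, 9), None], dtype=object)

# errors="coerce" turns anything unparseable into NaT instead of raising.
converted = pd.to_datetime(dates, errors="coerce")

print(converted.dtype)
```

After the conversion the column has a datetime64 dtype, with NaT in place of None.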

1 Answer

Editorial Team · Answer 1 · 2026-06-17T15:31:44+00:00

How to make this faster?

- Use ‘io.sql.read_frame’ to load data from a SQL database into a DataFrame, because ‘read_frame’ takes care of columns whose type is ‘decimal’ by turning them into float.
- Fill the missing data for each column.
- Call ‘DataFrame.convert_objects’ before the put operation.
- If the DataFrame has string columns, use ‘table’ instead of ‘storer’:

    store.put('key', df, table=True)

After these changes, the put operation is much faster on the same data set:

CPU times: user 42.07 s, sys: 28.17 s, total: 70.24 s
Wall time: 98.97 s
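The preparation steps listed above can be sketched with current pandas names. Everything here is an assumption rather than the original code: `pd.read_sql` stands in for ‘io.sql.read_frame’, `pd.to_datetime` for ‘convert_objects’, an in-memory SQLite table for the MySQL source, and the column names are borrowed from the question.

```python
import sqlite3

import pandas as pd

# Stand-in for the MySQL source: a small in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bugs (bug_id INTEGER, priority TEXT, cf_branch TEXT, delta_ts TEXT)"
)
conn.executemany(
    "INSERT INTO bugs VALUES (?, ?, ?, ?)",
    [
        (1, "P3", None, "2012-05-09 14:41:41"),
        (2, "P2", "main", "2012-06-01 08:00:00"),
    ],
)

# Step 1: load the table into a DataFrame.
df = pd.read_sql("SELECT * FROM bugs", conn)
conn.close()

# Step 2: fill missing values in the string column.
df["cf_branch"] = df["cf_branch"].fillna("")

# Step 3: move the timestamp column off the object/string path.
df["delta_ts"] = pd.to_datetime(df["delta_ts"])

print(df.dtypes)
```

The final write is left out so the sketch runs without PyTables. In current pandas the table-format put is spelled `store.put('key', df, format='table')`, which replaces the older `table=True` argument.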

Profile logs of the second test:

95984 function calls (95958 primitive calls) in 68.688 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      445   16.757    0.038   16.757    0.038 {numpy.core.multiarray.array}
       19   16.250    0.855   16.250    0.855 {method '_append_records' of 'tables.tableExtension.Table' objects}
       16    7.958    0.497    7.958    0.497 {method 'astype' of 'numpy.ndarray' objects}
       19    6.533    0.344    6.533    0.344 {pandas.lib.create_hdf_rows_2d}
        4    6.284    1.571    6.388    1.597 {method '_fillCol' of 'tables.tableExtension.Row' objects}
       20    2.640    0.132    2.641    0.132 {pandas.lib.maybe_convert_objects}
        1    1.785    1.785    1.785    1.785 {pandas.lib.isnullobj}
        7    1.619    0.231    1.619    0.231 {method 'flatten' of 'numpy.ndarray' objects}
       11    1.059    0.096    1.059    0.096 {pandas.lib.infer_dtype}
        1    0.997    0.997   41.952   41.952 pytables.py:2468(write_data)
       19    0.985    0.052   40.590    2.136 pytables.py:2504(write_data_chunk)
        1    0.827    0.827   60.617   60.617 pytables.py:2433(write)
     1504    0.592    0.000    0.592    0.000 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
        4    0.534    0.133   13.676    3.419 pytables.py:1038(set_atom)
        1    0.528    0.528    0.528    0.528 {pandas.lib.max_len_string_array}
        4    0.441    0.110    0.571    0.143 internals.py:1409(_stack_arrays)
       35    0.358    0.010    0.358    0.010 {method 'copy' of 'numpy.ndarray' objects}
        1    0.276    0.276    3.135    3.135 internals.py:208(fillna)
        5    0.263    0.053    2.054    0.411 common.py:128(_isnull_ndarraylike)
       48    0.253    0.005    0.253    0.005 {method '_append' of 'tables.hdf5Extension.Array' objects}
        4    0.240    0.060    1.500    0.375 internals.py:1400(_simple_blockify)
        1    0.234    0.234   12.145   12.145 pytables.py:1066(set_atom_string)
       28    0.225    0.008    0.225    0.008 {method '_createCArray' of 'tables.hdf5Extension.Array' objects}
       36    0.218    0.006    0.218    0.006 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
     6110    0.155    0.000    0.155    0.000 {numpy.core.multiarray.empty}
        4    0.097    0.024    0.097    0.024 {method 'all' of 'numpy.ndarray' objects}
        6    0.084    0.014    0.084    0.014 {tables.indexesExtension.keysort}
       18    0.084    0.005    0.084    0.005 {method '_g_close' of 'tables.hdf5Extension.Leaf' objects}
    11816    0.064    0.000    0.108    0.000 file.py:1036(_getNode)
       19    0.053    0.003    0.053    0.003 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
     1528    0.045    0.000    0.098    0.000 array.py:342(_interpret_indexing)
    11709    0.040    0.000    0.042    0.000 file.py:248(__getitem__)
        2    0.027    0.013    0.383    0.192 index.py:1099(get_neworder)
        1    0.018    0.018    0.018    0.018 {numpy.core.multiarray.putmask}
        4    0.013    0.003    0.017    0.004 index.py:607(final_idx32)
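A listing in this format, ordered by internal time, can be produced with the standard library profiler. The answer does not say how its profile was collected, so the sketch below is only one way to get comparable output; `profile_call` is a hypothetical helper and the workload is a stand-in.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, report sorted by internal time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(10)  # keep only the top 10 rows
    return result, buffer.getvalue()


# Stand-in workload; the real target would be the HDFStore put.
result, report = profile_call(sorted, range(100000, 0, -1))
print(report)
```

Wrapping the put call the same way, for example `profile_call(store.put, 'key', df)`, would show where the write spends its time.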
