The Archive Base

The Archive Base Latest Questions

Editorial Team
Asked: June 18, 2026

I am using Python for some data analysis. I have two tables, the first (let’s call it ‘A’) has 10 million rows and 10 columns and the second (‘B’) has 73 million rows and 2 columns. They have 1 column with common ids and I want to intersect the two tables based on that column. In particular I want the inner join of the tables.

I could not load table B into memory as a pandas DataFrame to use the normal pandas merge function. I tried reading the file for table B in chunks, intersecting each chunk with A, and then concatenating these intersections (the output of the inner joins). This is OK on speed, but every now and then it gives me problems and spits out a segmentation fault, which is not so great. The error is difficult to reproduce, but it happens on two different machines (Mac OS X v10.6 (Snow Leopard) and UNIX, Red Hat Linux).

Finally, I tried a combination of pandas and PyTables: writing table B to disk and then iterating over table A, selecting the matching rows from table B. This last option works, but it is slow. Table B in PyTables is already indexed by default.

How do I tackle this problem?


1 Answer

  1. Editorial Team
    Added an answer on June 18, 2026 at 3:18 am

    This is somewhat pseudocode-ish, but I think it should be quite fast.

    It is a straightforward disk-based merge, with all tables on disk. The
    key is that you are not doing selection per se, just indexing
    into the table via start/stop, which is quite fast.
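As a small illustration of the start/stop idea, the sketch below only computes the row windows; the helper name `chunk_bounds` is hypothetical and not part of pandas or PyTables.

```python
def chunk_bounds(nrows, chunk_size):
    """Yield (start, stop) row windows that cover range(nrows) exactly once."""
    for start in range(0, nrows, chunk_size):
        yield start, min(start + chunk_size, nrows)


# Example: 10 rows read 4 at a time.
windows = list(chunk_bounds(10, 4))
print(windows)  # [(0, 4), (4, 8), (8, 10)]
```

Each pair can then be passed as the `start` and `stop` arguments of `HDFStore.select`, so a chunk is read by position rather than by evaluating a condition on every row.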

    Selecting the rows in B that meet a criterion (using A’s ids) won’t
    be very fast, because I think it might bring the data into Python space
    rather than doing an in-kernel search. I am not sure, but you might want
    to look at the in-kernel optimization section on pytables.org;
    there is a way to tell whether a query will be in-kernel or not.

    Also, if you are up to it, this is a very parallel problem. Just don’t write
    the results to the same file from multiple processes; PyTables is not write-safe for that.
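One way to respect the "no shared output file" rule is to give every task its own result file and combine the parts afterwards. The following is only a sketch under several assumptions: the function name `merge_chunk_to_file`, the key column `id`, and the `part_N.csv` naming are all hypothetical, CSV is used instead of HDF5 to keep the example small, and a thread pool stands in for the process pool you would more likely use in a real script.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def merge_chunk_to_file(task):
    """Merge one (a_chunk, b_chunk) pair and write it to that task's own file."""
    task_id, a_chunk, b_chunk, out_dir = task
    merged = pd.merge(a_chunk, b_chunk, on="id", how="inner")
    path = os.path.join(out_dir, f"part_{task_id}.csv")
    merged.to_csv(path, index=False)
    return path


a = pd.DataFrame({"id": [1, 2, 3, 4], "x": [10, 20, 30, 40]})
b = pd.DataFrame({"id": [2, 4, 3, 8], "y": [0.2, 0.4, 0.3, 0.8]})
out_dir = tempfile.mkdtemp()

# Split B into two chunks of two rows; every task writes to a distinct path.
tasks = [(i, a, b.iloc[i * 2:(i + 1) * 2], out_dir) for i in range(2)]
with ThreadPoolExecutor(max_workers=2) as pool:
    paths = list(pool.map(merge_chunk_to_file, tasks))

# Combine the per-task parts once all workers have finished.
combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
print(combined)
```

Because no two tasks ever touch the same file, the final concatenation is the only step that needs to run in a single process.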

    See this answer for a comment on how doing a join operation will actually be an ‘inner’ join.

    For your merge_a_b operation I think you can use a standard pandas join
    which is quite efficient (when in-memory).
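For illustration, here is what an in-memory inner join of two small chunks could look like. The column names (`id`, `x`, `y`) and the toy data are assumptions made for the example, not taken from the question.

```python
import pandas as pd

# Toy stand-ins for one chunk of A and one chunk of B.
a = pd.DataFrame({"id": [1, 2, 3, 4], "x": [10.0, 20.0, 30.0, 40.0]})
b = pd.DataFrame({"id": [3, 4, 5], "y": ["c", "d", "e"]})

# Index both frames by the shared key, then keep only ids present in both.
m = a.set_index("id").join(b.set_index("id"), how="inner")
print(m)
```

Only the ids that appear in both chunks (3 and 4 here) survive, which is the inner-join behaviour the question asks for.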

    One other option (depending on how ‘big’ A is) might be to separate A into two pieces that are indexed the same, keeping a smaller piece (maybe a single column) in the first table. Instead of storing the merge results per se, store the row index; later you can pull out the data you need (kind of like using an indexer and take). See http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries
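A rough in-memory sketch of the "store the row index, pull the data later" idea follows. The frames, the `id` column name, and the variable names are assumptions for illustration; on disk, the narrow key column and the wide data would live in separate tables with the same row order.

```python
import numpy as np
import pandas as pd

# Wide table A; only its key column is needed while matching.
a = pd.DataFrame({"id": [7, 3, 9, 4],
                  "x": [0.1, 0.2, 0.3, 0.4],
                  "y": list("abcd")})
b_ids = pd.Series([4, 5, 7, 7])  # ids seen in (a chunk of) B

# Step 1: work on the narrow key column and record matching row positions.
a_keys = a["id"]
matched_rows = np.flatnonzero(a_keys.isin(b_ids).to_numpy())

# Step 2 (later): pull the full rows of A by position.
result = a.take(matched_rows)
print(matched_rows)  # [0 3]
print(result)
```

Note that this records which rows of A have a match; it does not reproduce duplicate matches from B the way a full inner join would.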

    import pandas as pd

    A = pd.HDFStore('A.h5')
    B = pd.HDFStore('B.h5')
    # This is your result store (the file name is an assumption)
    store = pd.HDFStore('result.h5')

    nrows_a = A.get_storer('df').nrows
    nrows_b = B.get_storer('df').nrows
    a_chunk_size = 1000000
    b_chunk_size = 1000000

    def merge_a_b(a, b):
        # Function that returns an operation on passed
        # frames, a and b.
        # It could be a merge, join, concat, or other operation that
        # results in a single frame. An inner merge on the shared id
        # column is used here; the column name 'id' is an assumption.
        return pd.merge(a, b, on='id', how='inner')

    # Walk over A in row windows; each window is read by position.
    for a_start_i in range(0, nrows_a, a_chunk_size):
        a_stop_i = min(a_start_i + a_chunk_size, nrows_a)
        a = A.select('df', start=a_start_i, stop=a_stop_i)

        # For each chunk of A, walk over all of B in row windows.
        for b_start_i in range(0, nrows_b, b_chunk_size):
            b_stop_i = min(b_start_i + b_chunk_size, nrows_b)
            b = B.select('df', start=b_start_i, stop=b_stop_i)

            m = merge_a_b(a, b)

            # Append only non-empty results to the result store.
            if len(m):
                store.append('df_result', m)

    A.close()
    B.close()
    store.close()
    
© 2021 The Archive Base. All Rights Reserved