What would anyone consider the most efficient way to merge two datasets using Python?
A little background – this code will take 100K+ records in the following format:
{user: aUser, transaction: UsersTransactionNumber}, ...
and using the following data
{transaction: aTransactionNumber, activationNumber: associatedActivationNumber}, ...
to create
{user: aUser, activationNumber: associatedActivationNumber}, ...
N.B. These are not Python dictionaries, just the closest thing to portraying the record format cleanly.
So in theory, all I am trying to do is create a view of two lists (or tables) joining on a common key – at first this points me towards sets (unions etc), but before I start learning these in depth, are they the way to go? So far I felt this could be implemented as:
- Create a list of dictionaries and iterate over the list comparing the key each time; however, in the worst case this could run up to len(inputDict)*len(outputDict) comparisons <- Not sure?
- Manipulate the data as an in-memory SQLite table? Preferably not: although there is no strict requirement for Python 2.4, it would make life easier.
- Some kind of set-based magic?
Clarification
The whole purpose of this script is to summarise; the actual data sets come from two different sources. The user and transaction numbers arrive as a CSV, output by a performance test that measures email activation code throughput. The second dataset comes from parsing the test mailboxes, which contain the transaction id and activation code. The output of this test is then a CSV that will get pumped back into stage 2 of the performance test, activating user accounts using the activation codes that were paired up.
Apologies if my notation for the records was misleading; I have updated it accordingly.
Thanks for the replies, I am going to give two ideas a try:
- Sorting the lists first (I don’t know how expensive this is)
- Creating a dictionary with the transactionCodes as the key then store the user and activation code in a list as the value
Performance isn’t paramount for me; I just want to try and get into good habits with my Python programming.
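For illustration, the second idea (a dictionary keyed on the transaction code, with the user and activation code stored in a list) might be sketched roughly as follows. The sample records and all variable names are made up for the example, and it assumes each transaction code belongs to exactly one user.

```python
# Hypothetical sample records; the real ones would come from the CSV
# and from the parsed mailboxes.
user_rows = [("alice", "T1"), ("bob", "T2"), ("carol", "T3")]
activation_rows = [("T2", "A-222"), ("T1", "A-111")]

# transaction -> [user, activation code]; the code stays None until it is seen.
by_transaction = {}
for user, transaction in user_rows:
    by_transaction[transaction] = [user, None]

for transaction, activation in activation_rows:
    entry = by_transaction.get(transaction)
    if entry is not None:  # ignore mail for transactions we never issued
        entry[1] = activation

# Keep only the users whose activation mail actually turned up.
paired = [(user, code) for user, code in by_transaction.values() if code is not None]
```

Neither input needs sorting here; each record is touched once, and carol is simply left out because no activation was found for T3.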
Here’s a radical approach.
Don’t.
You have two CSV files; one (users) is clearly the driver. Leave this alone.
The other — transaction codes for a user — can be turned into a simple dictionary.
Don’t “combine” or “join” anything except when absolutely necessary. Certainly don’t “merge” or “pre-join”.
Write your application to simply do simple lookups in the other collection.
Close. It looks like this. Note: No Sort.
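A minimal sketch of such a lookup, not necessarily the exact code intended here. The column names (user, transaction, activationNumber) are assumptions taken from the record notation in the question, and io.StringIO stands in for the two real CSV files so the example is self-contained.

```python
import csv
import io

# Stand-ins for the two files; column names are assumed for illustration.
users_csv = io.StringIO("user,transaction\nalice,T1\nbob,T2\ncarol,T3\n")
activations_csv = io.StringIO("transaction,activationNumber\nT2,A-222\nT1,A-111\n")

# Lookup side: transaction -> activation number.
activations = {
    row["transaction"]: row["activationNumber"]
    for row in csv.DictReader(activations_csv)
}

# Driving side: one pass over users, one dictionary lookup each, no sort.
output = io.StringIO()
writer = csv.writer(output, lineterminator="\n")
writer.writerow(["user", "activationNumber"])
for row in csv.DictReader(users_csv):
    code = activations.get(row["transaction"])
    if code is not None:  # skip users whose activation mail was not found
        writer.writerow([row["user"], code])

result = output.getvalue()
```

With real files, the StringIO objects would be replaced by open(...) calls; whether to skip or report users without a matching activation is a choice the sketch makes arbitrarily.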
This is fast and simple. Save the dictionaries (use shelve or pickle).
As for the worst case being len(inputDict)*len(outputDict): false.
One list is the “driving” list. The other is the lookup list. You’ll drive by iterating through users and lookup appropriate values for transaction. This is O( n ) on the list of users. The lookup is O( 1 ) because dictionaries are hashes.
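To make the difference concrete, here is a small sketch that counts the work done by a nested-loop comparison versus the driving-list-plus-dictionary approach. The function names and the generated data are hypothetical and only serve the comparison.

```python
def nested_loop_join(users, activations):
    """Compare each user against the activations one by one until a match is found."""
    comparisons = 0
    joined = []
    for user, transaction in users:
        for act_transaction, code in activations:
            comparisons += 1
            if transaction == act_transaction:
                joined.append((user, code))
                break
    return joined, comparisons


def dict_lookup_join(users, activations):
    """Build the lookup once, then do one hashed lookup per user."""
    lookup = dict(activations)
    lookups = 0
    joined = []
    for user, transaction in users:
        lookups += 1
        code = lookup.get(transaction)
        if code is not None:
            joined.append((user, code))
    return joined, lookups


# Hypothetical data: 1,000 users whose activations arrive in reverse order.
users = [("user%d" % i, "T%d" % i) for i in range(1000)]
activations = [("T%d" % i, "A%d" % i) for i in reversed(range(1000))]

slow_result, slow_count = nested_loop_join(users, activations)
fast_result, fast_count = dict_lookup_join(users, activations)
```

Both functions return the same pairs, but for this data the nested loop needs 500,500 comparisons while the dictionary version needs 1,000 lookups, one per user. Building the dictionary is a single extra pass over the activations.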