I have a huge database table whose rows have the fields “date, ad, site, impressions, clicks”.
I fetched all of them in Python using:
cursor.execute("select * from dabase")
data = cursor.fetchall()
From all this data, I need to select every row whose (ad, site) pair produced more than zero clicks at some point in time. For instance:
row(1) : (t1, ad1, site1) -> clicks = 1 (t is time)
row(2) : (t2, ad1, site1) -> clicks = 0
Since ad1 on site1 had clicks > 0 at time t1, every row in data containing ad1 and site1 must be collected into another list, which I called final_list. It would contain both row(1) and row(2): row(2) has 0 clicks, but because ad1 and site1 had clicks > 0 at time t1, that row must be included as well.
When I tried doing this in MySQL Workbench, it took so long that I got the error message “Lost Connection to Database”. I think this happens because the table has almost 40 million rows. Even though I've seen people working with much larger amounts of data, MySQL is not able to handle it here, which is why I used Python. (In fact, getting the rows with clicks > 0 took a few seconds in Python, whereas it took more than 10 minutes in MySQL; I'm not sure precisely how long.)
What I did then was to first select the (ad, site) points with clicks > 0:
points = [(row[1], row[2]) for row in data if row[4]]
points = list(set(points))
dic = {}
for element in points:
    dic[element] = 1
This code took just a few seconds to run. With a dictionary of the wanted points, I began inserting data into final_list:
final_list = []
for row in data:
    try:
        if dic[(row[1], row[2])] == 1:
            final_list.append(row)
    except KeyError:
        continue
But it's taking too long, and I've been trying to figure out a way to make it faster. Is it possible?
I appreciate any help!
I know the comments have asked why you aren’t able to just do this in the database, which I wonder as well… but to at least address your code: you probably don’t need the intermediate steps such as converting list -> set -> list -> dictionary. I suspect the repeated list append() calls and the explicit for loops are what is slowing you down.
What about this?
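A minimal sketch of that idea, assuming `data` is the list of `(date, ad, site, impressions, clicks)` tuples returned by `cursor.fetchall()`. The sample rows below are made up for illustration. It builds the set of (ad, site) pairs directly and then filters with a list comprehension:

```python
# Hypothetical sample rows in the (date, ad, site, impressions, clicks) layout;
# in the real script `data` would come from cursor.fetchall().
data = [
    ("t1", "ad1", "site1", 10, 1),
    ("t2", "ad1", "site1", 12, 0),
    ("t3", "ad2", "site1", 8, 0),
]

# Build the set of (ad, site) pairs that had at least one click, in one pass.
points = {(row[1], row[2]) for row in data if row[4] > 0}

# Keep every row whose (ad, site) pair is in that set.
# Set membership tests are O(1) on average, so no try/except is needed.
final_list = [row for row in data if (row[1], row[2]) in points]
```

With the sample rows, `final_list` keeps both `ad1`/`site1` rows (including the one with 0 clicks) and drops the `ad2` row.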
You could even see if this is faster to get your point set:
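One possible variant, again only a sketch with made-up sample rows, pushes the filtering and tuple extraction into `filter`, `map`, and `operator.itemgetter`. Whether it is actually faster on 40 million rows is an assumption that would need to be measured, for example with `timeit`:

```python
from operator import itemgetter

# Hypothetical sample rows in the (date, ad, site, impressions, clicks) layout.
data = [
    ("t1", "ad1", "site1", 10, 1),
    ("t2", "ad1", "site1", 12, 0),
    ("t3", "ad2", "site1", 8, 0),
]

get_clicks = itemgetter(4)    # row -> clicks
get_point = itemgetter(1, 2)  # row -> (ad, site) tuple

# filter() keeps rows whose clicks value is truthy (non-zero);
# map() then extracts the (ad, site) pair, and set() removes duplicates.
points = set(map(get_point, filter(get_clicks, data)))
```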
My answer gives your question the benefit of the doubt that you have no option for doing this directly in SQL with a better SQL query.
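If a database-side approach does turn out to be possible, the same filter can be written as a join against the distinct (ad, site) pairs that had clicks. The sketch below uses Python's built-in `sqlite3` with an in-memory table as a stand-in for the MySQL table. The table name `ads` and the sample rows are hypothetical, and how this performs on the real 40-million-row table would depend on things like an index on `(ad, site)`, which is not tested here:

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ads (date TEXT, ad TEXT, site TEXT, impressions INTEGER, clicks INTEGER)"
)
conn.executemany(
    "INSERT INTO ads VALUES (?, ?, ?, ?, ?)",
    [
        ("t1", "ad1", "site1", 10, 1),
        ("t2", "ad1", "site1", 12, 0),
        ("t3", "ad2", "site1", 8, 0),
    ],
)

# Join every row against the distinct (ad, site) pairs that had clicks > 0.
query = """
SELECT a.*
FROM ads AS a
JOIN (SELECT DISTINCT ad, site FROM ads WHERE clicks > 0) AS p
  ON a.ad = p.ad AND a.site = p.site
ORDER BY a.date
"""
final_list = conn.execute(query).fetchall()
conn.close()
```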