I have a database with over 30,000 tables and ~40-100 rows in each table. I want to retrieve a list of table names which contain a string under a specific column.
So for example:
I want to retrieve the names of all tables which contain ‘foo’…
Database
Table_1
ID: 1, STR: bar
ID: 2, STR: foo
ID: 3, STR: bar
Table_2
ID: 1, STR: bar
ID: 2, STR: bar
ID: 3, STR: bar
Table_3
ID: 1, STR: bar
ID: 2, STR: bar
ID: 3, STR: foo
So in this case the function should return [‘Table_1’, ‘Table_3’]
So far I have this, it works fine but takes over 2 minutes to execute, which is way too long for the application I have in mind.
self.m('SHOW TABLES')
result = self.db.store_result()
tablelist = result.fetch_row(0, 1)
for table in tablelist:
table_name = table['Tables_in_definitions']
self.m("""SELECT `def` FROM `""" + table_name + """` WHERE `def` = '""" + str + """'""")
result = self.db.store_result()
r = result.fetch_row(1, 1)
if len(r) > 0:
results.append(table_name)
I’m not smart enough to come up with a way to speed this up so if anyone has any suggestions it would be greatly appreciated, thanks!
If you are just testing for the existence of one row in each table where
def = 'str', one easy thing to do (with no other changes) is to add aLIMIT 1clause to the end of your query.(If your query is performing a full table scan, MySQL can halt it once the first row is found. If no rows are found, the full table scan has to run to the end of the table.)
This also avoids overhead of preparing lots of rows to be returned to the client, and returning them to the client, if they aren’t needed.
Also, an index with
defas a leading column (at least on your largest tables) will likely help performance, if your query is looking through large tables for “a needle in haystack”.UPDATE:
I’ve re-read your question, and I see that you have 30,000 tables to check, that’s 30,000 separate queries, 30,000 roundtrips to the database. (ACCCKKK.)
So my previous suggestion is pretty much useless. (That would be more appropriate with 40 tables each having 30,000 rows.)
Another approach would be to query a bunch of those tables at the same time. I’d be hesitant to even try more than a couple hundred tables at a time though, so I’d do it in batches.
If
defis unique in each table, or, if it’s nearly unique, and you can handle duplicate table_name values being returned, you could get rid of the DISTINCT keyword.You do need to ensure that every table in the list has a column named
def. If you encounter a table that doesn’t have that column in it, the whole batch would fail. And aSHOW TABLESdoesn’t do that check of the column names. I’d be using a query like this to get the list of table names that have a column nameddef: