I know this question looks like a duplicate, but I don’t know how to ask it differently.
I have two very simple tables in a MySQL database. The first is the Users table:
id  user_id
1   1
2   3
4   4
The second is the Friends table:
id  user_id  friend_id
1   1        3
2   1        4
3   1        8
I loaded the data from a CSV file, and I would like to clean it. I need to check whether each friend_id in Friends also exists as a user_id in Users. The Users table has around 30,000 rows, but the Friends table has around 30 million rows.
I use this query to check:
SELECT u.user_id, uf.friend_id AS exists_friend_ids
FROM Users u, Friends uf
WHERE u.user_id = '1'
  AND uf.friend_id IN (SELECT user_id FROM eventify.Users)
My desired output is shown below, but because the query above never finishes, I cannot get any results to test with, so I cannot continue.
user_id  exists_friend_ids
1        3
1        4
You can see that 8 is not there, because it doesn’t exist in the Users table. But because the Friends table has over 30 million records, the query just runs forever on my computer. Am I doing it right, or is there a better way to do it? Or should I learn Hadoop instead?
I have updated my query to use an equi-join.
Have you tried a LEFT JOIN query with a GROUP BY friend_id? If a user doesn’t exist, it won’t add a row to the result.
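As a rough sketch of the join approach, something like the following might work. It assumes that neither user_id column is indexed yet and that user_id values in Users are unique; the index names are made up for this example. Note that a LEFT JOIN on its own keeps unmatched rows (with NULL in the Users columns), so the first query uses an INNER JOIN to drop them, and the second uses a LEFT JOIN with an IS NULL filter to list the rows that would need cleaning.

```sql
-- Assumption: these indexes do not exist yet; the names are hypothetical.
CREATE INDEX idx_users_user_id ON Users (user_id);
CREATE INDEX idx_friends_user_friend ON Friends (user_id, friend_id);

-- Friends of user 1 whose friend_id also exists in Users.
-- The ON clause ties each Friends row to at most one Users row,
-- which avoids the cross product of "FROM Users u, Friends uf".
SELECT f.user_id, f.friend_id AS exists_friend_ids
FROM Friends AS f
INNER JOIN Users AS u ON u.user_id = f.friend_id
WHERE f.user_id = 1;

-- The opposite set: Friends rows whose friend_id is missing from Users.
SELECT f.id, f.user_id, f.friend_id
FROM Friends AS f
LEFT JOIN Users AS u ON u.user_id = f.friend_id
WHERE u.user_id IS NULL;
```

With the sample rows from the question, the first query should return (1, 3) and (1, 4), and the second should return the row with friend_id 8. Running EXPLAIN on either query is a quick way to check whether MySQL actually uses the indexes before trying it on all 30 million rows.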