I need to query a large database (Snort alerts) to find duplicate entries. I came up with the query below, but it takes a very long time to execute!
SELECT sid, cid, timestamp, sig_name, inet_ntoa(ip_src), layer4_sport,
       inet_ntoa(ip_dst), layer4_dport
  FROM DB
 WHERE ip_dst IN
       (SELECT ip_dst FROM DB GROUP BY ip_dst HAVING count(*) > 1)
   AND timestamp IN
       (SELECT timestamp FROM DB GROUP BY timestamp HAVING count(*) > 1)
   AND layer4_dport IN
       (SELECT layer4_dport FROM DB GROUP BY layer4_dport HAVING count(*) > 1)
The query above tries to find alerts whose ip_dst has the same timestamp and layer4_dport, if they occur more than once. I hope it's clear!
Any tips or tricks to make it more efficient?
I’ve formatted your query… if we break it down, you seem to be applying a function, inet_ntoa, a couple of times. If you don’t have a pressing need for these calls then get rid of them (especially if they look at a table).

Secondly, if we look at your query you are doing a full scan of DB 3 times for your various counts, and then at the very minimum a range scan in your top level select.

By not linking your subqueries back to the main table, you’ve assumed that ip_dst, timestamp and layer4_dport are each unique across the whole table, and you are then trying to find where the unlikely occurrence of 3 independently unique values happened to have duplicates in the same row.

I suspect what you actually want is to link the table back to a single subquery that groups on timestamp and layer4_dport together and keeps only the groups with more than one row.
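A sketch of that join, assuming the table really is called DB as in your question; the aliases a and b are only illustrative, and I have not run this against your schema:

```sql
-- Sketch only: DB is the table name from the question; a and b are illustrative aliases.
SELECT a.sid, a.cid, a.timestamp, a.sig_name,
       inet_ntoa(a.ip_src), a.layer4_sport,
       inet_ntoa(a.ip_dst), a.layer4_dport
  FROM DB a
  JOIN ( -- one pass over DB: every (timestamp, layer4_dport) pair seen more than once
         SELECT timestamp, layer4_dport
           FROM DB
          GROUP BY timestamp, layer4_dport
         HAVING COUNT(*) > 1 ) b
    ON a.timestamp = b.timestamp
   AND a.layer4_dport = b.layer4_dport;
```

The grouping happens once, on the combination of columns, instead of three separate scans that each count a single column on its own.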
This finds all the rows where there is more than one identical timestamp and layer4_dport combination, as per your question.

If you want to find the duplicates at the level of ip_dst as well, then you need to add it to your sub-query.
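For example, a variant that also groups on ip_dst might look like the sketch below. It uses the same assumed table name DB and illustrative aliases a and b, and is untested against your data:

```sql
-- Sketch only: same assumptions as before (table DB, illustrative aliases a and b).
SELECT a.sid, a.cid, a.timestamp, a.sig_name,
       inet_ntoa(a.ip_src), a.layer4_sport,
       inet_ntoa(a.ip_dst), a.layer4_dport
  FROM DB a
  JOIN ( -- groups are now (ip_dst, timestamp, layer4_dport) triples seen more than once
         SELECT ip_dst, timestamp, layer4_dport
           FROM DB
          GROUP BY ip_dst, timestamp, layer4_dport
         HAVING COUNT(*) > 1 ) b
    ON a.ip_dst = b.ip_dst
   AND a.timestamp = b.timestamp
   AND a.layer4_dport = b.layer4_dport;
```

Every column added to the GROUP BY also has to appear in the join condition, otherwise rows would match on only part of the duplicate key.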