The data is fairly large and the query takes a few minutes to run each time, so debugging this problem is taking a lot of time. When I run like concat('%',T.item,'%') on a smaller data set, it seems to identify items properly. However, when I run it on the main DB (the code shown), it still shows many (maybe even all) of the exceptions.
EDIT:
It seems that when I add NOT, it stops identifying items.
select distinct T.comment
from (select comment, source, item from data, non_informative where ticker != "O" and source != 7 and source != 6) as T
where T.comment not like concat('%',T.item,'%')
order by T.comment;
comment and source are in data, item is in non_informative
Some items from T.item:
'Stock Analysis -', '#InsideTrades', 'IIROC Trade'
Example comment which should be removed
‘#InsideTrades #4 | MACNAB CRAIG (Director,Officer,Chief Executive
Officer): Filed Form 4 for $NNN (NATIONAL RETA’
I can’t seem to figure out why it shows all the items.
You’ve got a Cartesian product between the non_informative and data tables. (It’s not at all clear which table the column ticker is from.)

Understand that for a “comment” to be returned, all that is required (to satisfy the predicates in your query) is for one row to be found in non_informative which does not “match” the comment. There may be rows in non_informative that do match, but your query doesn’t care about those. Your query is only looking for the existence of a row that does NOT match. The query is effectively saying that a “comment” will be excluded ONLY if it matches every single row in non_informative.

If what you want to return is the values of “comment” for which there is NO matching row in non_informative, you need a different query. (I’m going to assume that the ticker column is from the data table.)

I’m also going to exclude the corner cases of an empty string value for item, since that will essentially “match” every non-null value for comment.

SQL Fiddle here
Using a NOT EXISTS predicate:
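The statement itself did not survive in this copy of the answer. As a minimal sketch of what a NOT EXISTS version could look like, assuming (as stated above) that ticker is a column of data, and using d and n as illustrative aliases:

```sql
-- Sketch only: return comments for which NO non-empty item is a substring.
-- Assumes ticker, source and comment are columns of data; item is in non_informative.
SELECT DISTINCT d.comment
  FROM data d
 WHERE d.ticker <> 'O'
   AND d.source NOT IN (6, 7)
   AND NOT EXISTS
       ( SELECT 1
           FROM non_informative n
          WHERE n.item <> ''                               -- skip the empty-string corner case
            AND d.comment LIKE CONCAT('%', n.item, '%')    -- comment contains this item
       )
 ORDER BY d.comment;
```

A comment is kept only when the subquery finds no item contained in it, which is the reverse of the original query's "at least one item is not contained" test.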
Or, using an anti-join:
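This statement is also missing from this copy. A hedged sketch of an anti-join form, under the same assumption that ticker belongs to data (the aliases d and n are illustrative):

```sql
-- Sketch only: LEFT JOIN to the matching items, then keep rows that found none.
SELECT DISTINCT d.comment
  FROM data d
  LEFT
  JOIN non_informative n
    ON n.item <> ''                                -- skip the empty-string corner case
   AND d.comment LIKE CONCAT('%', n.item, '%')     -- "match": comment contains the item
 WHERE d.ticker <> 'O'
   AND d.source NOT IN (6, 7)
   AND n.item IS NULL                              -- no matching item was found
 ORDER BY d.comment;
```

The filters on data stay in the WHERE clause, while the match conditions go in the ON clause so that unmatched rows from data are still produced with NULL in n.item.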
These two statements should return an equivalent result set (but different from the result set of your original query). They will also likely exhibit different performance characteristics, depending on the version of MySQL and on whether the MySQL engine can transform the NOT EXISTS predicate into an anti-join operation. Performance is really going to depend on what indexes are available and on the generated execution plan.
If we don’t bother with the empty string corner-case, we can simplify the second statement a bit…
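The simplified statement is not preserved here either. One plausible reading, offered as a sketch rather than the original text, is the anti-join with the empty-string check dropped (ticker is again assumed to be in data, and n is an illustrative alias):

```sql
-- Sketch only: anti-join without the empty-string guard.
-- Note: an empty-string item would now "match" every non-NULL comment.
SELECT DISTINCT d.comment
  FROM data d
  LEFT
  JOIN non_informative n
    ON d.comment LIKE CONCAT('%', n.item, '%')
 WHERE d.ticker <> 'O'
   AND d.source NOT IN (6, 7)
   AND n.item IS NULL
 ORDER BY d.comment;
```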
Basically, for every row in the data table, we’re checking for a “match” in the non_informative table. For any row where we find a “match”, that row will be excluded by the “n.item IS NULL” predicate. For any row from data where it doesn’t find a matching row in non_informative, the LEFT JOIN operation will generate a NULL value for the “item” column, so the row will be included in the result set.

PERFORMANCE:
Your original query includes an inline view (aliased as T). MySQL is going to materialize that as an intermediate MyISAM table before the outer query runs. And that kind of thing can be a real performance killer with large tables.

But before we “tune” that statement, we really need a statement that returns a correct result set. (There’s no sense in rewriting that statement if it doesn’t return the desired result set, except as an exercise.)
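Once a statement returns the right rows, prefixing it with EXPLAIN is one way to see how MySQL intends to execute it. The query below is only an illustrative candidate (a NOT EXISTS form, assuming ticker is in data), not a recommendation of one form over another:

```sql
-- Sketch only: ask MySQL for the execution plan instead of running the query.
EXPLAIN
SELECT DISTINCT d.comment
  FROM data d
 WHERE d.ticker <> 'O'
   AND d.source NOT IN (6, 7)
   AND NOT EXISTS
       ( SELECT 1
           FROM non_informative n
          WHERE n.item <> ''
            AND d.comment LIKE CONCAT('%', n.item, '%')
       );
```

Because the LIKE pattern starts with a wildcard, an index on comment cannot be used to narrow down the substring match itself, so the plan is mostly useful for checking how the rows of data are filtered and how often non_informative is scanned.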