I have a query in SQL that I’m trying to translate into Pig Latin (for use on a Hadoop cluster). Most of the time I have no problem moving the queries over to Pig, but I’ve encountered something I can’t seem to figure out.
Imagine a select statement like this:
SELECT a.f1, b.f2, b.f3 -- just for example
FROM tableA a, tableB b
WHERE(
(
a.f1 = b.f2
AND
(
a.f2 = b.f1
OR
(
(a.f2 = 'somestring1' OR a.f2 = 'somestring2')
AND
(b.f1 is null OR b.f1 ='somestring3' OR b.f1 = 'somestring4')
)
)
OR
(
a.f3 = b.f4
AND
(
a.f4 = b.f3
OR
(
(a.f4 = 'somestring5' OR a.f4 = 'somestring6')
AND
(b.f3 is null OR b.f3 ='somestring7' OR b.f3 = 'somestring8')
)
)
)
)
GROUP BY -- some other stuff
Now, I know that a direct translation to Pig might not be possible.
If that’s the case, I’m wondering how this statement gets decomposed into the JOINs (and filters) required to compute it (as it may be easier to see how to construct a Pig query out of that).
I’ve gone to the Pig mailing list but I haven’t found a good solution yet, as Pig doesn’t really do “OR” with JOINs. I am aware of how odd this question is.
(If anyone does have an idea of how this be done in Pig, I wouldn’t be opposed to looking at it…)
Edit: Does anyone know if this would be easier in Hive, maybe?
UNION is another way of spelling OR … sort of.
Maybe: