I have two tables in Hive, t1 and t2
> describe t1;
date_id string
> describe t2;
messageid string,
createddate string,
userid int
> select * from t1 limit 3;
2011-01-01 00:00:00
2011-01-02 00:00:00
2011-01-03 00:00:00
> select * from t2 limit 3;
87211389 2011-01-03 23:57:01 13864753
87211656 2011-01-03 23:57:59 13864769
87211746 2011-01-03 23:58:25 13864785
What I want is to count the distinct userids over the previous three days for a given date.
For example, for date 2011-01-03, I want to count the distinct userids from 2011-01-01 to 2011-01-03.
For date 2011-01-04, I want to count the distinct userids from 2011-01-02 to 2011-01-04.
I wrote the following query, but it does not return the three-day result. It returns the distinct userid count per day instead.
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP by to_date(t1.date_id);
`to_date()` and `date_sub()` are date functions in Hive.
In other words, the following part has no effect:
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
EDIT: One solution can be (but it is super slow):
SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
SELECT * FROM t1 LEFT OUTER JOIN t2
WHERE
(date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
)
) t3
GROUP by to_date(t3.date_id);
UPDATE: Thanks for all the answers. They are good.
But Hive is a bit different from SQL. Unfortunately, they cannot be used in Hive.
My current solution is to use UNION ALL.
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
Then I group by date and count. This way, I get what I want.
Although it is not elegant, it is much more efficient than a cross join.
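Put together, the UNION ALL approach might look like the sketch below. It assumes the t1 and t2 schemas shown above; the subquery alias `u` and the column alias `users_3d` are illustrative names, and the union is wrapped in a subquery on the assumption that the Hive release in use requires that.

```sql
-- Sketch only: three equi-joins, one per day offset, combined and then counted.
SELECT u.date_id, COUNT(DISTINCT u.userid) AS users_3d   -- illustrative alias
FROM (
  -- messages created on the reporting date itself
  SELECT to_date(t1.date_id) AS date_id, t2.userid AS userid
  FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
  UNION ALL
  -- messages created one day before the reporting date
  SELECT to_date(t1.date_id) AS date_id, t2.userid AS userid
  FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
  UNION ALL
  -- messages created two days before the reporting date
  SELECT to_date(t1.date_id) AS date_id, t2.userid AS userid
  FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
) u
GROUP BY u.date_id;
```

Selecting only `date_id` and `userid` in each branch, rather than `*`, keeps the intermediate rows small and gives every branch the same column names.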
Something like the following should work in standard SQL…
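A minimal sketch of such a query, assuming the t1 and t2 schemas from the question. It reuses the question's own three-day bounds, moved from the `WHERE` clause into the `ON` clause; the `users_3d` alias is illustrative, and the date functions are the Hive ones from the question, so another dialect would need its own equivalents.

```sql
-- Sketch only: the question's three-day range used as the join condition.
SELECT
  to_date(t1.date_id)       AS date_id,
  COUNT(DISTINCT t2.userid) AS users_3d   -- illustrative alias
FROM t1
LEFT OUTER JOIN t2
  ON  to_date(t2.createddate) >  date_sub(to_date(t1.date_id), 3)  -- from two days earlier
  AND to_date(t2.createddate) <= to_date(t1.date_id)               -- up to the date itself
GROUP BY to_date(t1.date_id);
```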
It will, however, be slow, because you are storing dates as strings and then using `to_date()` to convert them to dates. What this means is that indexes can’t be used, and the SQL engine can’t do anything clever to reduce the effort being expended.
As a result, every possible combination of rows needs to be compared. If you have 100 entries in t1 and 10,000 entries in t2, your SQL engine is processing a million combinations.
If you store these values as dates, you don’t need `to_date()`. And if you index the dates, the SQL engine can quickly home in on the range of dates being specified.
NOTE: The format of the `ON` clause means that you do not need to round `t2.createddate` down to a daily value.
EDIT Why your code didn’t work…
This joins t1 to t2 with an `ON` clause of `(to_date(t2.createddate) = to_date(t1.date_id))`. As the join is a LEFT OUTER JOIN, the values in `t2.createddate` MUST now either be NULL (no matches) or be the same as `t1.date_id`.
The `WHERE` clause allows a much wider range (3 days). But the `ON` clause of the `JOIN` has already restricted your data down to a single day.
The example I gave above simply takes your `WHERE` clause and puts it in place of the old `ON` clause.
EDIT
Hive doesn’t allow `<=` and `>=` in the `ON` clause? Are you really fixed in to using Hive???
If you really are, what about `BETWEEN`?
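As a hedged sketch, a `BETWEEN` version of the join could look like this, again assuming the question's schemas and an illustrative `users_3d` alias. Whether Hive accepts it is untested here: a release that only allows equality conditions in `ON` may well reject `BETWEEN` for the same reason.

```sql
-- Sketch only: same three-day window, expressed with BETWEEN (inclusive on both ends).
SELECT
  to_date(t1.date_id)       AS date_id,
  COUNT(DISTINCT t2.userid) AS users_3d   -- illustrative alias
FROM t1
LEFT OUTER JOIN t2
  ON to_date(t2.createddate)
     BETWEEN date_sub(to_date(t1.date_id), 2) AND to_date(t1.date_id)
GROUP BY to_date(t1.date_id);
```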
Alternatively, refactor your table of dates to enumerate the dates you want to include…
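One way this could be sketched, assuming the question's schemas: build a helper table with one row per reporting date and date to include, so that the join to t2 becomes a plain equality. The table name `t1_window` and the column names `include_date` and `users_3d` are hypothetical, chosen only for illustration.

```sql
-- Hypothetical helper table: each reporting date paired with the three dates it covers.
CREATE TABLE t1_window AS
SELECT d.date_id, d.include_date
FROM (
  SELECT to_date(date_id) AS date_id, to_date(date_id) AS include_date FROM t1
  UNION ALL
  SELECT to_date(date_id) AS date_id, date_sub(to_date(date_id), 1) AS include_date FROM t1
  UNION ALL
  SELECT to_date(date_id) AS date_id, date_sub(to_date(date_id), 2) AS include_date FROM t1
) d;

-- The range condition is now an equi-join, which Hive accepts in ON.
SELECT w.date_id, COUNT(DISTINCT t2.userid) AS users_3d
FROM t1_window w
JOIN t2 ON (to_date(t2.createddate) = w.include_date)
GROUP BY w.date_id;
```

The helper table is three times the size of t1, but t2 is joined only once, and the window length can be changed by adding or removing branches.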