I am trying to run a Hive query to filter out invalid records. Here is what I am doing:
1. Load the CSV file into a single-column table.
2. Define a UDF, my_validation, to validate each record.
3. Execute the query:
FROM pgstg
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/validrecords.out'
SELECT * WHERE my_validation(record) IS NOT NULL
INSERT OVERWRITE TABLE PGERR
SELECT record WHERE my_validation(record) IS NULL;
Here are my questions:
a. Is there a better way to filter invalid records?
b. Does the my_validation UDF run twice on the whole table?
c. What is the best way to split a single column into multiple columns?
Thanks much for your help.
To answer your questions:
1) If you have custom validation criteria, a UDF is probably the way to go. If I were doing it, I would create an is_valid UDF that returns a boolean (instead of returning NULL vs. non-NULL).
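As a rough sketch of how the query might look with a boolean UDF: is_valid here is a hypothetical function that is not defined anywhere in the question, and it is assumed to return TRUE or FALSE and never NULL.

```sql
-- is_valid is a hypothetical UDF returning BOOLEAN; it would need to be
-- written and registered (CREATE TEMPORARY FUNCTION ...) before this runs.
FROM pgstg
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/validrecords.out'
  SELECT record WHERE is_valid(record)
INSERT OVERWRITE TABLE PGERR
  SELECT record WHERE NOT is_valid(record);
```

If is_valid could ever return NULL, a row would match neither branch, so the UDF should always return an explicit TRUE or FALSE.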
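2) Yes, the UDF does get run twice: a multi-insert scans pgstg only once, but each WHERE clause evaluates my_validation(record) separately for every row.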
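One way to try to avoid the double evaluation is to compute the UDF result once in a subquery and filter on that column in both inserts. This is only a sketch: the alias v and the column name validated are made up for illustration, and whether the optimizer really evaluates the UDF once depends on the Hive version and the plan it produces, so it is worth checking with EXPLAIN.

```sql
-- "v" and "validated" are illustrative names, not from the original post.
FROM (
  SELECT record, my_validation(record) AS validated
  FROM pgstg
) v
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/validrecords.out'
  SELECT v.record WHERE v.validated IS NOT NULL
INSERT OVERWRITE TABLE PGERR
  SELECT v.record WHERE v.validated IS NULL;
```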
3) Glad you asked. Look at the split and explode functions available in Hive: split turns a delimited string into an array, and explode turns an array into rows.
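To get columns rather than rows, the array returned by Hive's built-in split function can be indexed directly. A minimal sketch, assuming the records are comma-delimited with exactly three fields; the column names col_a, col_b, and col_c are placeholders, since the question does not describe the file layout.

```sql
-- Assumes comma-delimited records with three fields; the output column
-- names are hypothetical. split() takes a regular expression as its
-- second argument and returns array<string>.
SELECT
  fields[0] AS col_a,
  fields[1] AS col_b,
  fields[2] AS col_c
FROM (
  SELECT split(record, ',') AS fields
  FROM pgstg
) t;
```

This does not handle quoted fields that contain commas; for real CSV data, a CSV-aware SerDe on the table definition is probably a better fit than splitting in the query.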