I am new to the Hadoop MapReduce framework, and I am thinking of using it to parse my data. I have thousands of big delimited files, and I want to write a MapReduce job that parses them and loads them into a Hive data warehouse. I have written a parser in Perl which can parse those files, but I am stuck at doing the same with Hadoop MapReduce.
For example: I have a file like
x=a y=b z=c…..
x=p y=q z=s…..
x=1 z=2 ….
and so on
Now I have to load this file as columns (x, y, z) in a Hive table, but I am not able to figure out how to proceed. Any guidance with this would be really helpful.
Another problem is that in some files the field y is missing, and I have to handle that condition in the MapReduce job. So far, I have tried using streaming.jar with my parser.pl as the mapper. I think that is not the way to do it :), but I was just trying to see if that would work. I also thought of using the load function of Hive, but the missing column will create a problem if I specify RegexSerDe in the Hive table.
I am lost in this now; if anyone could guide me with this I would be thankful 🙂
Regards,
Atul
I posted something about this on my blog a while ago. (Googling “hive parse_url” should put it in the top few results.)
I was parsing URLs, but in this case you will want to use str_to_map.
arg1 => String to process
arg2 => Key Value Pair separator
arg3 => Key Value separator
The result of str_to_map will give you a map<str, str> of 3 key-value pairs. We can pass this to Hive via:
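For the sample rows in the question, the pair separator would be a space and the key-value separator would be "=", so a lookup along the lines of str_to_map(line, ' ', '=')['y'] (where line is a hypothetical column holding the raw row) returns NULL for rows that have no y. As a rough illustration of that behaviour outside Hive, here is a small Python sketch that mimics it. The helper names str_to_map_sketch and to_columns are my own and are not part of Hive; this is only meant to show how the missing field falls out, not how Hive implements it.

```python
def str_to_map_sketch(text, pair_sep=" ", kv_sep="="):
    """Rough imitation of Hive's str_to_map (an illustration, not Hive's code).

    pair_sep splits the text into key-value pairs,
    kv_sep splits each pair into a key and a value.
    """
    result = {}
    for pair in text.split(pair_sep):
        if not pair:
            continue  # skip empty pieces caused by repeated separators
        key, sep, value = pair.partition(kv_sep)
        # A piece without the key-value separator gets no value (assumption).
        result[key] = value if sep else None
    return result


def to_columns(line, columns=("x", "y", "z")):
    """Return the values for the wanted columns, None where a key is absent."""
    fields = str_to_map_sketch(line.strip())
    return [fields.get(col) for col in columns]


if __name__ == "__main__":
    for row in ["x=a y=b z=c", "x=p y=q z=s", "x=1 z=2"]:
        print(to_columns(row))
```

The third row has no y, so its middle column comes back as None, which corresponds to the NULL you would see in the Hive table.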