I have a CSV file with the following format
TRAABRX12903CC4816,1548880,2:19,4:7,...
.
.
.
My problem is that I want to interpret it as
{(key:chararray,key2:int,{(id:int,cnt:int)})}
So far my code is
data = LOAD 'mxm_dataset_test_3.txt' using PigStorage(',');
data0 = foreach data generate $0 as key:chararray, {$2 ..} as bow;
For data0 this gives the schema data0: {key: chararray,bow: {(NULL)}}
When I try to explicitly cast it to (bag{tuple(chararray)}) with
data0 = foreach data generate $0 as key:chararray, (bag{tuple(chararray)}){$2 ..} as bow;
this gives the error Cannot cast bag with schema :bag{:tuple(:NULL)} to bag with schema :bag{:tuple(:chararray)}
Use the built-in TOBAG function to build your bag instead of the {$2 ..} constructor.
If you want to split up the id:cnt pairs, however, this gets trickier. Because there is no way to assign a schema to an arbitrary number of elements, and TOBAG is a UDF, Pig can’t cast the bytearray to a chararray or anything else later.
I would recommend loading the entire line as a string (USING PigStorage('\n')), using STRSPLIT with a limit of 3 to get your key, key2, and comma-delimited list of strings, then iterating with STRSPLIT on comma and then on colon to get the pairs you want, using FLATTEN and TOBAG as needed. I would demonstrate this for you, but I am stuck on Pig 0.9, and judging by PIG-2311, this isn’t possible until Pig 0.10.
The simplest solution may just be to write your own UDF, say myudfs.PARSE_PAIRS, that interprets a string like 2:13,9:4,5:4 and returns a bag with the tuples you want. Good luck.
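For what it's worth, here is a minimal sketch of what such a UDF could look like if you write it in Python and run it through Pig's Jython support. The file name myudfs.py, the output field names, and the choice to silently skip malformed pairs are all my assumptions, not something Pig requires. It also assumes the pair list reaches the function as a single chararray such as 2:13,9:4,5:4.

```python
# myudfs.py (hypothetical file name) - sketch of a PARSE_PAIRS UDF for Pig's Jython engine.

# Pig makes the outputSchema decorator available when it loads a Jython UDF
# script. Outside Pig (for example when testing with plain Python) the name is
# undefined, so fall back to a no-op decorator with the same call shape.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


# Assumed output schema: a bag of (id, cnt) integer tuples.
@outputSchema("pairs:bag{t:tuple(id:int,cnt:int)}")
def PARSE_PAIRS(pairs):
    """Turn a string like '2:13,9:4,5:4' into a list of (id, cnt) tuples."""
    result = []
    if pairs is None:
        return result
    for item in pairs.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty fields, e.g. from a trailing comma
        parts = item.split(":")
        if len(parts) != 2:
            continue  # assumption: skip malformed entries rather than fail the job
        try:
            result.append((int(parts[0]), int(parts[1])))
        except ValueError:
            continue  # assumption: skip non-numeric ids or counts
    return result
```

In the Pig script you would register it with something like REGISTER 'myudfs.py' USING jython AS myudfs; and then call myudfs.PARSE_PAIRS on the chararray field that holds the pairs. Returning a Python list of tuples is what Pig maps to a bag, so no extra conversion should be needed.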