I recently ran into this problem at work; it's about Pig's flatten. Here is a simple example that reproduces it.
Two files:
===file1===
1_a
2_b
4_d
===file2 (tab separated)===
1 a
2 b
3 c
pig script 1:
a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
c = join a1 by num, b by num;
dump c; -- exception java.lang.String cannot be cast to java.lang.Integer
pig script 2:
a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
a2 = foreach a1 generate (int)num as num, ch as ch;
c = join a2 by num, b by num;
dump c; -- exception java.lang.String cannot be cast to java.lang.Integer
pig script 3:
a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2));
a2 = foreach a1 generate (int)$0 as num, $1 as ch;
c = join a2 by num, b by num;
dump c; -- right
I don't know why scripts 1 and 2 fail while script 3 works. I'd also like to know whether there is a more concise way to get relation c. Thanks.
Is there any particular reason you are not using PigStorage? It could make life so much easier for you 🙂.
Also note that, in file1 you used underscore as delimiter, but you give “-” as argument to STRSPLIT.
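To illustrate the PigStorage suggestion, here is a minimal sketch. It assumes file1 really is underscore-delimited with exactly two fields per line, as in the sample data above; I have not checked it against your real input.

```pig
-- Sketch only: assumes file1 is underscore-delimited with two fields per line.
-- PigStorage splits each line on '_' and the loader applies the declared types.
a = load 'file1' using PigStorage('_') as (num:int, ch:chararray);
-- file2 is tab separated, which is PigStorage's default delimiter.
b = load 'file2' as (num:int, ch:chararray);
c = join a by num, b by num;
dump c;
```

One caveat: PigStorage splits on every underscore, whereas STRSPLIT(str,'_',2) stops after the first one. If the second field can itself contain underscores, this sketch would not be equivalent.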
edit:
I have spent some more time on the scripts you provided; scripts 1 and 2 indeed do not work, and script 3 also works like this (without the extra foreach):
As for the source of the problem, I'll take a wild guess and say it might be related to this (as stated in the Pig documentation), combined with Pig's run cycle optimizations:
In your case, I believe the schema of the STRSPLIT result is unknown until runtime.
edit2:
Ok, here is my theory explained:
This is the complete -explain- output for script 2 and this is for script 3. I’ll just paste the interesting parts here.
Above section is for script 2; see the last line. It assumes output of flatten(STRSPLIT) will have a first element of type integer (because you provided the schema that way). But in fact STRSPLIT has a null output schema which is treated as bytearray fields; so output of flatten(STRSPLIT) is actually (n:bytearray, c:bytearray). Because you provided a schema, pig tries to make a java cast (to the output of a1) to num field; which fails as num is in fact a java String represented as bytearray. Since this java-cast fails, pig does not even try to make the explicit cast in the line above.
Let’s see the situation for script 3:
See the last line, here output of a1 is properly treated as bytearray, no problems here. And now look at the second to last line; pig tries (and succeeds) to make an explicit cast operation from bytearray to integer.