I would like to use the functions REPLACE, SUBSTRING and INDEXOF in my Pig script, but I am unable to use them in a clean way.
-
First case: REPLACE in REGEX_EXTRACT_ALL:
data_split = FOREACH data GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, MY_REGULAR_EXPRESSION)) AS ( timestamp: chararray, url: chararray, REPLACE(url , '.*?://', '') AS clean_url: chararray);
I would like to use REPLACE to remove the leading http:// from the URL. In this case I am getting:
Error during parsing. Encountered " "(" "( ""
-
Second case: Reusing output:
ws = FOREACH data_split { clean_url = REPLACE(url , '.*?://', ''); url_index = INDEXOF(clean_url, '/'); web_server = SUBSTRING(clean_url, 0, url_index); GENERATE web_server, timestamp, ip ; };
This case does not work either: when I try to reuse clean_url from the previous call to REPLACE, I get:
Attempt to give operator of type org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc multiple outputs. This operator does not support multiple outputs.
Thanks
I think you can’t use a UDF within the AS clause in which the schema is specified. I assume you already have it this way:
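A minimal sketch of that idea (not necessarily the exact statement meant above): keep the AS clause as a pure schema declaration, then call REPLACE as an ordinary expression in a following FOREACH. The input path, the TextLoader usage and the two-group regular expression are assumptions for illustration; they stand in for the question's own data and MY_REGULAR_EXPRESSION.

```pig
-- Hypothetical input: one "timestamp url" pair per line (path is a placeholder).
data = LOAD 'access.log' USING TextLoader() AS (line: chararray);

-- Step 1: the AS clause only names and types the extracted fields.
-- The regex is an assumed stand-in for MY_REGULAR_EXPRESSION.
data_split = FOREACH data GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line, '(\\S+)\\s+(\\S+)'))
    AS (timestamp: chararray, url: chararray);

-- Step 2: REPLACE is applied as a normal expression and aliased afterwards.
data_clean = FOREACH data_split GENERATE
    timestamp,
    url,
    REPLACE(url, '.*?://', '') AS clean_url;

DUMP data_clean;
```

Here the function call sits in the GENERATE expression list, where Pig accepts expressions, and AS only assigns a name to its result.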
As for your second question:
Which Pig version do you use? I think this was a bug; in version 0.10.0 I couldn’t reproduce it.
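If upgrading is not an option, one possible workaround is to avoid reusing aliases inside a nested block and chain plain FOREACH statements instead. This is only a sketch: whether it sidesteps the error on the affected version is an assumption, and the alias names with_clean and with_index, the input path and the regular expression are hypothetical. The ip field is left out because the schema shown in the question does not define it.

```pig
-- Hypothetical input and regex, standing in for the question's own data.
data = LOAD 'access.log' USING TextLoader() AS (line: chararray);
data_split = FOREACH data GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line, '(\\S+)\\s+(\\S+)'))
    AS (timestamp: chararray, url: chararray);

-- Strip the scheme (for example "http://") from the URL.
with_clean = FOREACH data_split GENERATE
    timestamp,
    REPLACE(url, '.*?://', '') AS clean_url;

-- Locate the first '/' after the host name, searching from index 0.
with_index = FOREACH with_clean GENERATE
    timestamp,
    clean_url,
    INDEXOF(clean_url, '/', 0) AS url_index;

-- Keep everything before the first '/'; if there is none, keep the whole string.
ws = FOREACH with_index GENERATE
    (url_index == -1 ? clean_url : SUBSTRING(clean_url, 0, url_index)) AS web_server,
    timestamp;

DUMP ws;
```

Each intermediate result becomes a named field of a relation, so later steps read it as a plain projection rather than reusing the output of a function call.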