I’m at my wits end trying to solve this one. I have scripts and UDFs that run perfectly with Pig 0.8.1, but when I try to run with Pig 0.10.0, I get:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2218: Invalid resource schema: bag schema must have tuple as its field
The code that calls the UDF from the Pig script looks like this:
parsed = LOAD '$INPUT'
USING pignlproc.storage.ParsingWikipediaLoader('$LANG')
AS (title, id, pageUrl, text, redirect, links, headers, paragraphs);
The ParsingWikipediaLoader class implements LoadMetaData, and the getSchema() method looks like this:
public ResourceSchema getSchema(String location, Job job)
throws IOException {
Schema schema = new Schema();
schema.add(new FieldSchema("title", DataType.CHARARRAY));
schema.add(new FieldSchema("id", DataType.CHARARRAY));
schema.add(new FieldSchema("uri", DataType.CHARARRAY));
schema.add(new FieldSchema("text", DataType.CHARARRAY));
schema.add(new FieldSchema("redirect", DataType.CHARARRAY));
Schema linkInfoSchema = new Schema();
linkInfoSchema.add(new FieldSchema("target", DataType.CHARARRAY));
linkInfoSchema.add(new FieldSchema("begin", DataType.INTEGER));
linkInfoSchema.add(new FieldSchema("end", DataType.INTEGER));
schema.add(new FieldSchema("links", linkInfoSchema, DataType.BAG));
Schema headerInfoSchema = new Schema();
headerInfoSchema.add(new FieldSchema("tagname", DataType.CHARARRAY));
headerInfoSchema.add(new FieldSchema("begin", DataType.INTEGER));
headerInfoSchema.add(new FieldSchema("end", DataType.INTEGER));
schema.add(new FieldSchema("headers", headerInfoSchema, DataType.BAG));
Schema paragraphInfoSchema = new Schema();
paragraphInfoSchema.add(new FieldSchema("tagname", DataType.CHARARRAY));
paragraphInfoSchema.add(new FieldSchema("begin", DataType.INTEGER));
paragraphInfoSchema.add(new FieldSchema("end", DataType.INTEGER));
schema.add(new FieldSchema("paragraphs", paragraphInfoSchema,
DataType.BAG));
return new ResourceSchema(schema);
}
Again, the script and UDF work as expected with Pig 0.8.1, so this has to be some difference between the versions. I’ve searched thoroughly, but can’t find anything about this in the documentation, or on Stack Overflow.
Looks like the difference is in the
ResourceFieldSchemaconstructor.0.8.1 detects a Bag and wraps the inner schema in a tuple, whereas this logic has been removed from 0.10.0. I guess you need to amend your schema definition to wrap the bag schemas in a tuple:
This does however produce a tuple-in-tuple like schema when used on 0.8.1:
{title: chararray,id: chararray,uri: chararray,text: chararray,redirect: chararray,links: {t: (target: chararray,begin: int,end: int)},headers: {t: (tagname: chararray,begin: int,end: int)},paragraphs: {t: (tagname: chararray,begin: int,end: int)}}{title: chararray,id: chararray,uri: chararray,text: chararray,redirect: chararray,links: {t: (t: (target: chararray,begin: int,end: int))},headers: {t: (t: (tagname: chararray,begin: int,end: int))},paragraphs: {t: (t: (tagname: chararray,begin: int,end: int))}}You can fix this by amending the two level access required flag to true:
Which then produces an identical schema between 0.10.0 and 0.8.1:
{title: chararray,id: chararray,uri: chararray,text: chararray,redirect: chararray,links: {t: (target: chararray,begin: int,end: int)},headers: {t: (tagname: chararray,begin: int,end: int)},paragraphs: {t: (tagname: chararray,begin: int,end: int)}}{title: chararray,id: chararray,uri: chararray,text: chararray,redirect: chararray,links: {t: (target: chararray,begin: int,end: int)},headers: {t: (tagname: chararray,begin: int,end: int)},paragraphs: {t: (tagname: chararray,begin: int,end: int)}}0.10.0:
0.8.1