I’m trying to set up a trial Cassandra + Pig cluster. The Cassandra wiki makes it sound like you need Hadoop to integrate with Pig,
but the README in cassandra-src/contrib/pig makes it sound like you can run Pig on Cassandra without Hadoop.
If Hadoop is optional, what do you lose by not using it?
Hadoop is only optional while you are testing things out. To do anything at scale, you will need Hadoop as well.
Running without Hadoop means you are running Pig in local mode, which means all the data is processed inside the single Pig process you launched. This works fine with a single node and example data.
With any significant amount of data, or with multiple machines, you want to run Pig in Hadoop (MapReduce) mode. By running Hadoop task trackers on your Cassandra nodes, Pig can take advantage of what MapReduce provides: the workload is distributed across the nodes, and data locality reduces network transfer.
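To make the two modes concrete, here is a minimal sketch of picking the mode on the command line. Pig's -x flag selects the execution type (local or mapreduce). The helper name pig_mode, the script name example-script.pig, and the idea of using HADOOP_HOME to decide are assumptions for illustration only, not something the Cassandra README prescribes. The snippet only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical helper: choose Pig's execution mode.
# Assumption: HADOOP_HOME being set means a Hadoop cluster is available.
pig_mode() {
  if [ -n "${HADOOP_HOME:-}" ]; then
    echo mapreduce   # distributed: work runs as MapReduce jobs on the cluster
  else
    echo local       # everything runs inside one Pig process on this machine
  fi
}

# Dry run: show the command instead of executing it.
# example-script.pig is a placeholder script name.
echo "pig -x $(pig_mode) example-script.pig"
```

For a first trial on one node, the local form is enough. Once task trackers are running on the Cassandra nodes, switching to mapreduce is what actually spreads the work out.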