I’m trying to run a clustering job on Amazon EMR using Mahout.
I have a Solr index that I uploaded to S3, and I want to vectorize it using Mahout’s lucene.vector. (This is the first step in the job flow.)
The parameters for the step are:
- Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
- MainClass: org.apache.mahout.driver.MahoutDriver
- Args: lucene.vector --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors
The error in the log is:
Unknown program ‘lucene.vector’ chosen.
I’ve done the same process locally with Hadoop and Mahout, and it worked fine.
How should I call the lucene.vector function on EMR?
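For reference, here is a minimal sketch of how the step parameters listed in the question could be written as an EMR step definition in Python. The helper name `build_emr_step`, the step name, and the `ActionOnFailure` value are assumptions made for illustration; only the Jar, MainClass, and Args values come from the question. The dictionary follows the shape that boto3's `add_job_flow_steps` accepts, but nothing is submitted here.

```python
def build_emr_step(name, jar, main_class, args):
    """Build a step definition in the shape boto3's EMR client expects.

    This only constructs the dictionary; it does not contact AWS.
    """
    return {
        "Name": name,  # hypothetical label, not from the question
        "ActionOnFailure": "CANCEL_AND_WAIT",  # assumed; pick what suits the job flow
        "HadoopJarStep": {
            "Jar": jar,
            "MainClass": main_class,
            "Args": list(args),
        },
    }


# Values below are the ones given in the question (the step that fails).
step = build_emr_step(
    name="vectorize-solr-index",
    jar="s3n://mahout-bucket/jars/mahout-core-0.6-job.jar",
    main_class="org.apache.mahout.driver.MahoutDriver",
    args=[
        "lucene.vector",
        "--dir", "s3n://mahout-input/solr_index/",
        "--field", "name",
        "--dictOut", "/test/solr-dict-out/dict.txt",
        "--output", "/test/solr-vectors-out/vectors",
    ],
)

# To submit it, one would pass the step to an EMR client, for example:
#   boto3.client("emr").add_job_flow_steps(JobFlowId=<your job flow id>, Steps=[step])
```

Keeping MainClass as a parameter makes it easy to swap in a different entry point without touching the rest of the step.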
I eventually figured out the answer. The problem was that I was using the wrong MainClass argument. Instead of
I should have used:
Therefore the correct arguments should have been