I’m trying to test out Nutch 2.1 on a single Windows machine. The following command dies:
nutch crawl seeds -dir crawl -solr http://somehost:8983/solr -depth 2 -topN 2
…with a traceback of several exceptions:
java.net.ConnectionException: Connection refused
GoraException
SQLTransientConnectionException
org.hsqldb.HsqlException
This looks like the same problem described in this post: "connection refused error when running Nutch 2".
It looks like Nutch 2 wants some kind of database already installed, but there’s no mention of that in the (sparse) documentation that I can see.
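One way to confirm that reading is to check whether anything is accepting connections on the database port at all. The sketch below is only a diagnostic aid, not part of Nutch or Gora: the class name `PortCheck` and the method `isListening` are made up for illustration, and it assumes the store is an HSQLDB server on localhost at HSQLDB's default server port 9001. Adjust the host and port to whatever your Gora configuration actually points at.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean isListening(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // "Connection refused" and connect timeouts both end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed defaults: localhost and 9001 (HSQLDB's default server port).
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9001;
        boolean up = isListening(host, port, 2000);
        System.out.println(host + ":" + port
                + (up ? " is accepting connections" : " refused the connection or timed out"));
    }
}
```

If this reports a refused connection, the crawl is failing simply because no database server is running at that address, rather than because of a problem in the crawl command itself.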
The production environment will eventually be a Linux/Hadoop cluster, but for the moment I’m just trying to get a simple local system to work out of the box.
So what options are there for a simple Nutch database? How do you tell Nutch and Gora about the database? HBase might be a good answer as soon as we have our Hadoop cluster up and running. In the meantime, however, is there a simple, even slow, database that will work for initial exploration on a single system?
I’ve tried both MySQL and HBase.
For MySQL, this link helps iron out most of the quirks: http://nlp.solutions.asia/?p=180
For HBase, versions above 0.90.x cause problems (an "Invalid Host Value pair" error). I’ve been able to get it working with 0.90.5.