I am trying to import a large dataset (41 million records) into a new Solr index. I have set up the core and it works: I inserted some test docs and they index fine. I have set up data-config.xml as below and started a full-import. After about 12 hours, the import fails.
The documents can get quite large. Could the error be caused by a single large document (or field), or by the sheer volume of data going through the DataImportHandler?
How can I get this frustrating import task working?
I have included the tomcat error log below.
Let me know if there is any info I have missed!
logs:
Jun 1, 2011 5:47:55 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity results with URL: jdbc:sqlserver://myserver;databaseName=mydb;responseBuffering=adaptive;selectMethod=cursor
Jun 1, 2011 5:47:56 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Time taken for getConnection(): 1185
Jun 1, 2011 5:48:02 PM org.apache.solr.core.SolrCore execute
INFO: [results] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=0
...
Jun 2, 2011 5:16:32 AM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:664)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding$StringDecoder.decode(Unknown Source)
at java.lang.StringCoding.decode(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at com.microsoft.sqlserver.jdbc.DDC.convertStreamToObject(DDC.java:419)
at com.microsoft.sqlserver.jdbc.ServerDTVImpl.getValue(dtv.java:1974)
at com.microsoft.sqlserver.jdbc.DTV.getValue(dtv.java:175)
at com.microsoft.sqlserver.jdbc.Column.getValue(Column.java:113)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getValue(SQLServerResultSet.java:1982)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getValue(SQLServerResultSet.java:1967)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getObject(SQLServerResultSet.java:2256)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getObject(SQLServerResultSet.java:2265)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.getARow(JdbcDataSource.java:286)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$700(JdbcDataSource.java:228)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.next(JdbcDataSource.java:266)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.next(JdbcDataSource.java:260)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:78)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:591)
... 5 more
Jun 2, 2011 5:16:32 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Jun 2, 2011 5:16:44 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://myserver;databaseName=mydb;responseBuffering=adaptive;selectMethod=cursor"
user="sa"
password="password"/>
<document>
<entity name="results" query="SELECT fielda, fieldb, fieldc FROM mydb.[dbo].mytable WITH (NOLOCK)">
<field column="fielda" name="fielda"/><field column="fieldb" name="fieldb"/><field column="fieldc" name="fieldc"/>
</entity>
</document>
</dataConfig>
solrconfig.xml snippet:
<indexDefaults>
<useCompoundFile>false</useCompoundFile>
<mergeFactor>25</mergeFactor>
<ramBufferSizeMB>128</ramBufferSizeMB>
<maxFieldLength>100000</maxFieldLength>
<writeLockTimeout>10000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
</indexDefaults>
<mainIndex>
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>128</ramBufferSizeMB>
<mergeFactor>25</mergeFactor>
<infoStream file="INFOSTREAM.txt">true</infoStream>
</mainIndex>
Java heap settings: initial 128 MB, max 512 MB
Environment:
Solr 3.1
Tomcat 7.0.12
Windows Server 2008
Java: v6 update 25 (build 1.6.0_25-b06)
(data coming from: SQL Server 2008 R2)
/admin/stats.jsp - DataImportHandler
Status : IDLE
Documents Processed : 2503083
Requests made to DataSource : 1
Rows Fetched : 2503083
Documents Deleted : 0
Documents Skipped : 0
Total Documents Processed : 0
Total Requests made to DataSource : 0
Total Rows Fetched : 0
Total Documents Deleted : 0
Total Documents Skipped : 0
handlerStart : 1306759913518
requests : 9
errors : 0
EDIT: I am currently running a SQL query to find the largest single field length in any record, as I think this is probably the cause of the exception. I am also running the import again with jconsole attached to monitor heap usage.
EDIT: Read the Solr performance factors page. Changing maxFieldLength to 1000000 and ramBufferSizeMB to 256. Now for another import run (yay...)
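The "Caused by" part of the stack trace (the OutOfMemoryError raised inside com.microsoft.sqlserver.jdbc while reading a column value) makes it pretty obvious that the MS JDBC driver is running out of RAM. Many JDBC drivers default to fetching all their results into memory at once. So see if this can be tuned, or consider using the open-source jTDS driver, which is generally better behaved anyway.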
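As a rough sketch of what switching drivers could look like, the dataSource element below points DIH at jTDS instead of the Microsoft driver. It assumes the jTDS jar is on Solr's classpath and that the server and database names match the ones in the question. The batchSize value is illustrative only: DIH passes it to the JDBC statement as the fetch size, and how strictly that is honoured depends on the driver.

```xml
<dataConfig>
  <!-- Assumption: the jTDS jar has been added to the webapp's lib directory. -->
  <!-- batchSize is handed to the JDBC statement as its fetch size; 500 is only an example value. -->
  <dataSource type="JdbcDataSource"
              driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://myserver/mydb"
              batchSize="500"
              user="sa"
              password="password"/>
  <document>
    <entity name="results" query="SELECT fielda, fieldb, fieldc FROM mydb.[dbo].mytable WITH (NOLOCK)">
      <field column="fielda" name="fielda"/>
      <field column="fieldb" name="fieldb"/>
      <field column="fieldc" name="fieldc"/>
    </entity>
  </document>
</dataConfig>
```

Watching the heap in jconsole during the first few hundred thousand rows should show fairly quickly whether memory use stays flat with this setup.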
I don't believe maxFieldLength is going to help you. It affects how much Lucene truncates at index time, but not how much data is initially transferred from the database. Another option is to transfer only a slice at a time, say 1 million rows, using TOP or ROW_NUMBER() for paging.
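One way the paging idea might be wired into DIH is to pass the row range as request parameters, which DIH exposes to the query as ${dataimporter.request.<name>}. The sketch below is untested against this dataset; the parameter names startRow and endRow and the ordering column id are hypothetical (the question does not say what the table's key is), so substitute a stable, indexed key column.

```xml
<document>
  <!-- Assumption: "id" is a stable, indexed key column; it is not named in the question. -->
  <!-- startRow and endRow are hypothetical request parameters supplied on the import URL. -->
  <entity name="results"
          query="SELECT fielda, fieldb, fieldc
                 FROM (SELECT fielda, fieldb, fieldc,
                              ROW_NUMBER() OVER (ORDER BY id) AS rn
                       FROM mydb.[dbo].mytable WITH (NOLOCK)) AS paged
                 WHERE rn BETWEEN ${dataimporter.request.startRow} AND ${dataimporter.request.endRow}">
    <field column="fielda" name="fielda"/>
    <field column="fieldb" name="fieldb"/>
    <field column="fieldc" name="fieldc"/>
  </entity>
</document>
```

Each slice would then be requested with something like /dataimport?command=full-import&clean=false&startRow=1&endRow=1000000, then 1000001 to 2000000, and so on. The clean=false part matters: a full-import cleans the index by default, so without it each slice would wipe out the previous ones.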