I am using Spring Batch to process multiple files with MultiResourcePartitioner, and all the item readers and writers are step-scoped. Each step processes an individual file and commits to the database at an interval of 1000. When any error occurs during processing, all of that step's previous commits need to be rolled back and the step should fail, so that none of the file's contents are added to the database.
Which of these is the best way to achieve that?
- Using transaction propagation NESTED.
- Setting the chunk commit interval to Integer.MAX_VALUE. This will not work, as the files are large and the job fails with a heap space error.
- Any other way to have a transaction at the step level.
My sample XML configuration is shown below:
<bean id="filepartitioner" class="org.springframework.batch.core.partition.support.MultiResourcePartitioner">
    <property name="resources" value="classpath:${filepath}" />
</bean>
<bean id="fileItemReader" scope="step" autowire-candidate="false" parent="itemReaderParent">
    <property name="resource" value="#{stepExecutionContext[fileName]}" />
</bean>
<step id="step1" xmlns="http://www.springframework.org/schema/batch">
    <tasklet transaction-manager="ratransactionManager" >
        <chunk writer="jdbcItenWriter" reader="fileItemReader" processor="itemProcessor" commit-interval="800" retry-limit="3">
            <retryable-exception-classes>
                <include class="org.springframework.dao.DeadlockLoserDataAccessException"/>
            </retryable-exception-classes>
        </chunk>
        <listeners>
            <listener ref="customStepExecutionListener">
            </listener>
        </listeners>
    </tasklet>
    <fail on="FAILED"/>
</step>
UPDATES:
It seems that the main table (where the direct insert happens) is referenced by other tables and materialized views. If I delete data from this table to remove stale records using a processed-column indicator, the data spooled through the MV will show old data. I think a staging table is needed for my requirement.
To implement a staging table for this requirement, I could:
- Create another parallel step that polls the database and writes the data whose processed column value is Y.
- Transfer the data at the end of each successfully completed file using a step listener (afterStep method).
Any other suggestions are welcome.
In general I agree with @MichaelLange's approach. But perhaps a separate table is too much… You can have an additional column completed in your import table; if it is set to “false”, the record belongs to a file that is currently being processed (or whose processing failed). After you’ve processed the file you issue a simple update for this table (it should not fail, as you don’t have any constraints on this column):
update import_table set completed = true where file_name = 'file001_chunk1.txt'
Before processing a file you should remove “stale” records:
delete from import_table where file_name = 'file001_chunk1.txt'
This solution would be faster and easier to implement than nested transactions. Perhaps with this approach you will face table locks, but with an appropriate choice of isolation level this can be minimised. Optionally, you may wish to create a view over this table to filter out the non-completed records (and add an index on the completed column):
create view import_view as select a, b, c from import_table where completed = true
In general I think nested transactions are not possible in this case, as chunks can be processed in parallel threads, each thread holding its own transaction context. The transaction manager will not be able to start a nested transaction in a new thread, even if you somehow manage to create a “main transaction” in the “top” job thread.
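To make the lifecycle of the completed flag easier to follow, here is a minimal plain-Java sketch that models the import table in memory. It is only an illustration of the order of operations (delete stale rows, insert chunks, flip the flag once at the end); it does not touch a database or Spring Batch, and the class and method names (ImportTableModel, deleteStale, insertChunk, markCompleted, visiblePayloads) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for import_table; a real job would run the SQL statements instead.
final class ImportTableModel {

    // One row of the hypothetical import table.
    static final class Row {
        final String fileName;
        final String payload;
        boolean completed; // stays false until the whole file has been imported

        Row(String fileName, String payload) {
            this.fileName = fileName;
            this.payload = payload;
            this.completed = false;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    // Models: delete from import_table where file_name = ?
    void deleteStale(String fileName) {
        rows.removeIf(r -> r.fileName.equals(fileName));
    }

    // Models one committed chunk: rows are stored but not yet visible to readers.
    void insertChunk(String fileName, List<String> payloads) {
        for (String p : payloads) {
            rows.add(new Row(fileName, p));
        }
    }

    // Models: update import_table set completed = true where file_name = ?
    void markCompleted(String fileName) {
        for (Row r : rows) {
            if (r.fileName.equals(fileName)) {
                r.completed = true;
            }
        }
    }

    // Models import_view: only rows that belong to fully imported files.
    List<String> visiblePayloads() {
        List<String> out = new ArrayList<>();
        for (Row r : rows) {
            if (r.completed) {
                out.add(r.payload);
            }
        }
        return out;
    }

    // Total number of stored rows, including stale (non-completed) ones.
    int storedRowCount() {
        return rows.size();
    }
}
```

If a file fails after some chunks were committed, its rows remain stored but never show up through visiblePayloads(); the next attempt calls deleteStale first and starts clean. In a real job the delete would presumably run before the step and the update after it succeeds, for example from a step listener, but that wiring is not shown here.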
Yet another approach is a continuation of the “temporary table” idea. The import process should create import tables named according to, e.g., the date, and a “super-view” that joins all these tables.
After the import has succeeded, the “super-view” should be re-created.
With this approach you will have difficulties with foreign keys for the import tables.
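As a rough sketch of what re-creating the “super-view” could look like, the helper below only builds the DDL string from a list of table names. I am assuming here that “joins” means stacking the rows of all import tables with union all, and that the database accepts create or replace view; the class name SuperViewSql, the view name import_all and the table names in the usage note are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Builds the DDL for a view that stacks the rows of several dated import tables.
final class SuperViewSql {

    private SuperViewSql() {
    }

    // viewName, columns and tableNames are concatenated as-is, so they must be
    // trusted identifiers (e.g. generated by the import job), never user input.
    static String createView(String viewName, String columns, List<String> tableNames) {
        if (tableNames.isEmpty()) {
            throw new IllegalArgumentException("at least one import table is required");
        }
        // One "select ... from <table>" per import table, combined with union all.
        String body = tableNames.stream()
                .map(t -> "select " + columns + " from " + t)
                .collect(Collectors.joining(" union all "));
        return "create or replace view " + viewName + " as " + body;
    }
}
```

For example, SuperViewSql.createView("import_all", "a, b, c", Arrays.asList("import_20240101", "import_20240102")) produces a single statement that the job could execute after a successful import, so readers of the view never see a half-imported table.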
Yet another approach is to use a separate DB for the import and then feed the imported data from the import DB to the main one (e.g. transfer the binary data).