I have unstructured, generally unclean data in a database field. There are some common structures that appear consistently in the data, namely:
field:
    name:value
fieldset:
    name <FieldSet>
    field(1)
    .
    .
    .
    field(n)
table:
    name <table>
    head(1)... head(n)
    val(1)... val(n)
    .
    .
    .
I was wondering if there is a tool (preferably in Java) that could learn or recognise these data structures, parse the field contents, and convert them to a Map or object that I could run validation checks on.
I am aware of ANTLR, but I understand it is more geared towards tree construction than towards independent bits of data (am I wrong about this?).
Does anyone have any suggestions for the problem as a whole?
I recommend Talend. It is a very versatile, open source data integration tool based on Java. You can use its built-in tools/components to extract data from unstructured data sources, and you can also write complex custom Java code to do what you want.
I used Talend in a couple of my scientific proof-of-concept projects, and it worked for me. The good part is that it is free!
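If you go the custom Java route, a hand-written parser for the three structures in the question could look roughly like the sketch below. It is plain Java, not Talend's API, and it rests on several assumptions that the question does not confirm: each field is one name:value pair per line, a fieldset or table block ends at a blank line, table cells are separated by whitespace, and the class name LooseRecordParser is made up for illustration. Lines that do not fit any of the patterns are silently skipped, which you may want to change for validation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: plain fields become String entries, fieldsets become
// nested maps, and tables become a list of row maps keyed by header name.
class LooseRecordParser {

    static Map<String, Object> parse(String text) {
        Map<String, Object> result = new LinkedHashMap<>();
        String[] lines = text.split("\\R"); // split on any line break
        int i = 0;
        while (i < lines.length) {
            String line = lines[i].trim();
            i++;
            if (line.isEmpty()) {
                continue;
            }
            String lower = line.toLowerCase();
            if (lower.endsWith("<fieldset>")) {
                String name = line.substring(0, line.length() - "<fieldset>".length()).trim();
                Map<String, String> fields = new LinkedHashMap<>();
                // Assumption: the block runs until the next blank line.
                while (i < lines.length && !lines[i].trim().isEmpty()) {
                    putField(fields, lines[i].trim());
                    i++;
                }
                result.put(name, fields);
            } else if (lower.endsWith("<table>")) {
                String name = line.substring(0, line.length() - "<table>".length()).trim();
                List<Map<String, String>> rows = new ArrayList<>();
                String[] heads = null;
                while (i < lines.length && !lines[i].trim().isEmpty()) {
                    // Assumption: cells are separated by whitespace.
                    String[] cells = lines[i].trim().split("\\s+");
                    if (heads == null) {
                        heads = cells; // first line of the block is the header row
                    } else {
                        Map<String, String> row = new LinkedHashMap<>();
                        for (int c = 0; c < heads.length; c++) {
                            // Missing trailing cells are stored as null.
                            row.put(heads[c], c < cells.length ? cells[c] : null);
                        }
                        rows.add(row);
                    }
                    i++;
                }
                result.put(name, rows);
            } else {
                putField(result, line);
            }
        }
        return result;
    }

    // Splits "name:value" at the first colon; lines without a name are ignored.
    private static void putField(Map<String, ? super String> target, String line) {
        int colon = line.indexOf(':');
        if (colon <= 0) {
            return;
        }
        target.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
    }
}
```

Calling LooseRecordParser.parse(text) returns a LinkedHashMap that keeps the original order of the entries, so validation checks can walk it directly. If the real data turns out to be messier than these assumptions allow, a grammar-based tool such as ANTLR may be a better fit than this kind of line-by-line scan.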