I’ve been asked to build an ETL-style application that transfers data from one data source to another. For now I’ve settled on a three-layer architecture, but I’d like to learn more about best practices and about the life cycle described on this Wikipedia page:
http://en.wikipedia.org/wiki/Extract,_transform,_load
Four-layered approach for ETL architecture design
- Functional layer: Core functional ETL processing (extract, transform, and load).
- Operational management layer: Job-stream definition and management, parameters, scheduling, monitoring, communication and alerting.
- Audit, balance and control (ABC) layer: Job-execution statistics, balancing and controls, rejects- and error-handling, codes management.
- Utility layer: Common components supporting all other layers.
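To make the split between the functional layer and the ABC layer a bit more concrete, here is a rough Java sketch. It is only an illustration under my own assumptions: `AuditedJob`, its counters and `balanced()` are hypothetical names, not something taken from the Wikipedia page or from any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: the loop body is the functional layer (validate,
// transform, load); the counters are a tiny audit, balance and control layer.
class AuditedJob {
    int read;      // records seen in the source
    int written;   // records that reached the target
    int rejected;  // records that failed validation

    List<String> run(List<String> source,
                     Predicate<String> isValid,
                     Function<String, String> transform) {
        List<String> target = new ArrayList<>();
        for (String record : source) {
            read++;
            if (!isValid.test(record)) {
                rejected++;          // reject handling would go here
                continue;
            }
            target.add(transform.apply(record));
            written++;
        }
        return target;
    }

    // Balance check: every record read must be either written or rejected.
    // In a real job these counts would come from the source and target
    // systems independently, so the check could actually fail.
    boolean balanced() {
        return read == written + rejected;
    }
}
```

Scheduling, alerting and the shared utility code would sit around a class like this rather than inside it.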
Real-life ETL cycle
The typical real-life ETL cycle consists of the following execution steps:
- Cycle initiation
- Build reference data
- Extract (from sources)
- Validate
- Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
- Stage (load into staging tables, if used)
- Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair)
- Publish (to target tables)
- Archive
- Clean up
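As I understand it, the cycle above is essentially an ordered list of named steps that stops when one of them fails. A minimal Java sketch of that idea follows; `EtlCycle`, `Step` and the `failedStep` / `error` context keys are names I made up for illustration, and the step bodies in `main` are empty placeholders.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: runs named steps in the order they were added.
class EtlCycle {
    /** One step of the cycle; the context map carries state between steps. */
    interface Step {
        void run(Map<String, Object> context) throws Exception;
    }

    private final Map<String, Step> steps = new LinkedHashMap<>();
    private final List<String> completed = new ArrayList<>();

    EtlCycle add(String name, Step step) {
        steps.put(name, step);
        return this;
    }

    /** Stops at the first failing step, records it in the context, returns false. */
    boolean run(Map<String, Object> context) {
        for (Map.Entry<String, Step> entry : steps.entrySet()) {
            try {
                entry.getValue().run(context);
                completed.add(entry.getKey());
            } catch (Exception ex) {
                context.put("failedStep", entry.getKey());
                context.put("error", ex);
                return false;
            }
        }
        return true;
    }

    List<String> completedSteps() {
        return completed;
    }

    public static void main(String[] args) {
        // Placeholder steps named after the list above; real work goes in the lambdas.
        EtlCycle cycle = new EtlCycle()
            .add("extract", ctx -> { })
            .add("validate", ctx -> { })
            .add("transform", ctx -> { })
            .add("stage", ctx -> { })
            .add("publish", ctx -> { });
        boolean ok = cycle.run(new HashMap<>());
        System.out.println(ok + " " + cycle.completedSteps());
    }
}
```

Because a failure leaves the name of the failed step and the exception in the context, an audit report step could pick them up afterwards.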
I don’t know what your situation is or what your requirements are, but you’re likely overthinking the problem.
The name alone is “the” architecture:
Exporting a DB table to a CSV can be considered “ET” while loading the CSV is the “L”. Most ETL problems are simply not complicated.
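To show how little is involved, here is a rough Java sketch of that split. It assumes in-memory lists stand in for the source table and the target, and it uses a naive CSV format with no quoting or escaping; `CsvEtl` and its method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: lists of String[] stand in for database tables.
class CsvEtl {
    /** "ET": read the source rows and write them out as trimmed CSV lines. */
    static List<String> extractAndTransform(List<String[]> sourceRows) {
        List<String> lines = new ArrayList<>();
        for (String[] row : sourceRows) {
            String[] cleaned = new String[row.length];
            for (int i = 0; i < row.length; i++) {
                cleaned[i] = row[i].trim();
            }
            // Naive CSV: assumes no field contains a comma or a newline.
            lines.add(String.join(",", cleaned));
        }
        return lines;
    }

    /** "L": parse each CSV line and append it to the target. */
    static void load(List<String> csvLines, List<String[]> target) {
        for (String line : csvLines) {
            // Limit -1 keeps trailing empty fields.
            target.add(line.split(",", -1));
        }
    }
}
```

In a real job the lists would be replaced by a query on the source and inserts on the target, and a proper CSV library would handle quoting.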
Beyond that, you should look at the countless ETL and ESB packages already available in Java, free and commercial, from libraries to full-blown processing systems, and simply adopt whichever one you like best.
Get a whiteboard, string some bubbles together with lines and turn that into code.
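As a small illustration of the bubbles-and-lines idea, each bubble can be a `java.util.function.Function` and each line a call to `andThen`. This is just a sketch; `BubblePipeline` and the two stages are hypothetical examples, not a recommendation of a particular design.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: two "bubbles" joined by one "line".
class BubblePipeline {
    /** Bubble 1: drop rows that are empty or whitespace only. */
    static final Function<List<String>, List<String>> DROP_BLANKS =
        rows -> rows.stream()
                    .filter(r -> !r.trim().isEmpty())
                    .collect(Collectors.toList());

    /** Bubble 2: upper-case every remaining row. */
    static final Function<List<String>, List<String>> UPPERCASE =
        rows -> rows.stream()
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());

    /** The line between the bubbles: the output of one feeds the next. */
    static final Function<List<String>, List<String>> PIPELINE =
        DROP_BLANKS.andThen(UPPERCASE);
}
```

Calling `BubblePipeline.PIPELINE.apply(rows)` runs both stages in order, and adding another bubble is one more `andThen`.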