We’re building tools to mine information from the web. The system has several pieces, such as:
- Crawl data from the web
- Extract information based on templates & business rules
- Parse results into database
- Apply normalization & filtering rules
- Etc, etc.
The problem is troubleshooting issues & having a good ‘high-level picture’ of what’s happening at each stage.
What techniques have helped you understand and manage complex processes?
- Use workflow tools like Windows Workflow Foundation
- Encapsulate separate functions into command-line tools & use scripting tools to link them together
- Write a Domain-Specific Language (DSL) to specify what order things should happen at a higher level.
Just curious how you get a handle on a system with many interacting components. We’d like to document and understand how the system works at a higher level than tracing through the source code.
The code says what happens at each stage. Using a DSL would be a boon, but possibly not if it comes at the cost of writing your own scripting language and/or compiler.
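One way to get DSL-like readability without writing a compiler is an "internal" DSL: declare the stage order as plain data in the host language. Here is a minimal sketch in Python, assuming each stage takes and returns a list of records. The stage names (`crawl`, `extract`, `normalize`), their bodies, and the `run_pipeline` helper are all hypothetical stand-ins for the real components. The helper also records how many records enter and leave each stage, which gives a rough high-level picture for troubleshooting.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]
Stage = Callable[[List[Record]], List[Record]]


# Hypothetical stages; real ones would call the crawler, the extractor, etc.
def crawl(records: List[Record]) -> List[Record]:
    return [
        {"url": "http://example.com/a", "html": "<b>Price: 10</b>"},
        {"url": "http://example.com/b", "html": ""},
    ]


def extract(records: List[Record]) -> List[Record]:
    # Drop pages with no content, strip the (toy) markup from the rest.
    return [
        dict(r, text=r["html"].replace("<b>", "").replace("</b>", ""))
        for r in records
        if r["html"]
    ]


def normalize(records: List[Record]) -> List[Record]:
    return [dict(r, text=r["text"].strip().lower()) for r in records]


# The "DSL": the order of stages, readable without tracing the source.
PIPELINE: List[Tuple[str, Stage]] = [
    ("crawl", crawl),
    ("extract", extract),
    ("normalize", normalize),
]


def run_pipeline(
    pipeline: List[Tuple[str, Stage]],
    records: Optional[List[Record]] = None,
) -> Tuple[List[Record], List[Tuple[str, int, int]]]:
    """Run stages in order; report (name, records in, records out) per stage."""
    records = list(records or [])
    report = []
    for name, stage in pipeline:
        n_in = len(records)
        records = stage(records)
        report.append((name, n_in, len(records)))
    return records, report


result, report = run_pipeline(PIPELINE)
for name, n_in, n_out in report:
    print(f"{name:<10} in={n_in:<3} out={n_out}")
```

A stage that silently drops most of its input shows up immediately in the per-stage counts, which is usually the first thing you want to know when something looks wrong downstream.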
Higher level documentation should not include details of what happens at each step; it should provide an overview of the steps and how they relate together.
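One way to keep such an overview from drifting out of date is to generate it from the code itself. The sketch below assumes each stage is a function whose docstring starts with a one-line summary; the stage names and the `overview` helper are hypothetical, not part of any existing tool.

```python
import inspect
from typing import Callable, List


# Hypothetical stage functions; only their docstrings matter here.
def crawl():
    """Fetch pages from the web."""


def extract():
    """Pull fields out of pages using templates and business rules.

    Details such as the template format belong here, not in the overview.
    """


def load():
    """Parse extracted results into the database."""


def normalize():
    """Apply normalization and filtering rules."""


STAGES: List[Callable] = [crawl, extract, load, normalize]


def overview(stages: List[Callable]) -> str:
    """Build a numbered overview from the first docstring line of each stage."""
    lines = []
    for number, stage in enumerate(stages, start=1):
        doc = inspect.getdoc(stage) or "(undocumented)"
        summary = doc.splitlines()[0]
        lines.append(f"{number}. {stage.__name__}: {summary}")
    return "\n".join(lines)


print(overview(STAGES))
```

Only the first docstring line reaches the overview, so step-level detail stays next to the code where it belongs.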
Good tips:
I wouldn’t recommend building command-line tools unless you actually have a use for them. There’s no need to maintain tools you don’t use. (That’s not the same as saying they can’t be useful, but most of what you do sounds like it belongs in a library rather than in external processes.)
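To illustrate the library-first approach, here is a minimal sketch: the logic lives in an ordinary function that other stages call in-process, and a command-line wrapper is a thin optional layer added only if someone needs it. The function `normalize_price` and its rule are hypothetical examples, not taken from the system described in the question.

```python
import argparse
from typing import List, Optional


def normalize_price(raw: str) -> float:
    """Library function: turn a scraped price string into a number.

    Hypothetical rule: strip a dollar sign, thousands separators, whitespace.
    """
    return float(raw.replace("$", "").replace(",", "").strip())


def main(argv: Optional[List[str]] = None) -> int:
    """Optional thin CLI wrapper; all real work stays in the library function."""
    parser = argparse.ArgumentParser(description="Normalize a price string.")
    parser.add_argument("raw", help="price as scraped, e.g. '$1,234.50'")
    args = parser.parse_args(argv)
    print(normalize_price(args.raw))
    return 0


# In-process use: no external process, easy to test and to debug.
print(normalize_price("$1,234.50"))

# The same logic through the CLI layer, driven here with an explicit argv.
main(["$1,234.50"])
```

If the wrapper is never used, it can be deleted without touching the library function, so there is nothing extra to maintain.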