This post covers the most important qualities to look for when choosing a Big Data workflow tool.
- **Schedule the ETL process to ingest incremental data from various streams.** The ability to schedule is the most elementary requirement of any orchestration tool, and the tool must be able to ingest and process data simultaneously from various sources to various endpoints. It should support the most common sources and endpoints out of the box (see the Airflow sketch after this list).
- **Notifications on failure of the ETL process.** Notifications and alerts are a must: a failure can set you back to backfilling and can drastically increase infrastructure costs (especially if you pay by the hour for the cluster). A good, reliable notification mechanism is therefore essential.
- **Optimization of resource usage for daily ingestion.** This is not directly supported by the workflow tool, but the tool should persist the execution time and trends for each task, which you can then use to optimize your tasks. This data becomes valuable once the workflow has run enough times for the time trends to be meaningful.
- **Retry on failures; make your tasks idempotent incremental loads.** The ownership of making the workflow idempotent is on the developer, not the tool; the tool's job is to support retries (see the idempotent-load sketch after this list).
- **Zero/minimal manual intervention.**
- **Self-recovering/rescheduling system.**
- **Pause the workflow.** The tool should let you, when needed, pause execution after the currently running task. This enables you to fix an issue with the next task in your workflow without having to fail the whole workflow.
- **Automatic data audits for incrementally loaded data.** The tool should support audits on tasks and help make decisions within the workflow (see the audit sketch after this list).
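
To make the scheduling, retry, and alerting points concrete, here is a minimal sketch using Apache Airflow, one popular orchestration tool. The DAG id, schedule, alert address, and ingestion command are all hypothetical placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical defaults: retry twice before failing, and email on failure.
default_args = {
    "owner": "data-eng",
    "retries": 2,                            # retry on failures
    "retry_delay": timedelta(minutes=10),
    "email": ["oncall@example.com"],         # hypothetical alert address
    "email_on_failure": True,                # notify instead of failing silently
}

with DAG(
    dag_id="incremental_ingest",             # hypothetical DAG name
    schedule_interval="@daily",              # schedule the ETL once a day
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = BashOperator(
        task_id="ingest_increment",
        # Hypothetical ingestion job; {{ ds }} is the run date Airflow templates in.
        bash_command="spark-submit ingest.py --date {{ ds }}",
    )
```

If a task exhausts its retries, Airflow emails the configured address, and it also records each attempt's duration, which is exactly the execution-time trend data mentioned above.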
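Retries are only safe if the load itself is idempotent. One common pattern is delete-then-insert (or MERGE) scoped to the run date, so retrying the same run produces the same result. A minimal sketch against a DB-API connection such as sqlite3, with hypothetical table names:

```python
import sqlite3

def load_partition(conn: sqlite3.Connection, run_date: str) -> None:
    """Idempotent incremental load: re-running for the same run_date yields
    the same rows, so the orchestrator can retry it safely."""
    with conn:  # one transaction: commit on success, roll back on error
        # Remove rows left behind by a previous (possibly partial) attempt...
        conn.execute("DELETE FROM events_daily WHERE ds = ?", (run_date,))
        # ...then re-insert the full increment for that day from staging.
        conn.execute(
            "INSERT INTO events_daily SELECT * FROM events_staging WHERE ds = ?",
            (run_date,),
        )
```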
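For the audit point, Airflow's ShortCircuitOperator can gate downstream tasks on an audit check. A sketch, assuming it sits inside the `with DAG(...)` block above; `count_rows` is a hypothetical stand-in for a real warehouse query:

```python
from airflow.operators.python import ShortCircuitOperator

def count_rows(table: str, ds: str) -> int:
    # Hypothetical stand-in for a COUNT(*) query against your warehouse.
    raise NotImplementedError

def row_counts_match(ds: str, **_) -> bool:
    """Audit the increment: if source and target row counts for the run
    date disagree, skip downstream tasks instead of publishing bad data."""
    return count_rows("events_staging", ds) == count_rows("events_daily", ds)

# Declared inside the `with DAG(...)` block from the earlier sketch.
audit = ShortCircuitOperator(
    task_id="audit_increment",
    python_callable=row_counts_match,
)
```

When the callable returns False, downstream tasks are skipped rather than run against bad data, which is one way a tool can "help make decisions" in the workflow.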