Thursday 17 March 2016

Orchestrating Hadoop Workflows - What your workflow tool should look like.

This post outlines the most important qualities to look for when choosing a Big Data workflow tool.


  1. Schedule the ETL process to ingest incremental data from various streams.

    The ability to schedule jobs is the most elementary requirement of any orchestration tool, and the tool must be able to ingest and process data simultaneously from various sources to various endpoints. It should support the most common sources and endpoints out of the box (a minimal sketch follows this list).

  2. Notifications on failure of the ETL process.

    Notifications and alerts are a must: a failure can set you back to a backfill and can drastically increase infrastructure costs (especially if you pay for the cluster by the hour). A good, reliable notification mechanism is therefore essential.

  3. Optimization of resource usage for daily ingestion.

    This is not directly a feature of the workflow tool, but the tool should persist execution times and trends for each task, which you can then use to optimize those tasks. This data becomes valuable once the workflow has run enough times for the trends to be meaningful.

  4. Retry on failure, and make your tasks idempotent incremental loads.

    Making the workflow idempotent is the developer's responsibility, not the tool's; the tool should support retries.

  5. Zero or minimal manual intervention

  6. Self-recovering / rescheduling system

  7. Pause the workflow

    The tool should let you pause execution after the currently running task when needed. This lets you fix an issue with the next task in your workflow without having to fail the whole workflow.

  8. Automatic data audits for incrementally loaded data.

    The tool should support audits on tasks and use their results to drive decisions in the workflow, for example stopping downstream loads when a row-count check fails.
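
To make points 1, 2, 4 and 8 concrete, here is a minimal sketch of a daily ingestion workflow written for Apache Airflow. The post does not prescribe a specific tool, so treat this purely as an illustration; the DAG name, shell scripts and email address are hypothetical placeholders.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Failure notifications and automatic retries (points 2 and 4)
default_args = {
    'owner': 'etl',
    'email': ['data-alerts@example.com'],   # placeholder address
    'email_on_failure': True,
    'retries': 3,
    'retry_delay': timedelta(minutes=15),
}

# Daily schedule for the incremental ingestion (point 1)
dag = DAG(
    'daily_incremental_ingest',             # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2016, 3, 1),
    schedule_interval='@daily',
)

# Ingest one day of data; ingest.sh is a placeholder script and should be
# written to be idempotent (re-running it for the same day overwrites,
# rather than duplicates, that day's partition).
ingest = BashOperator(
    task_id='ingest_incremental',
    bash_command='ingest.sh {{ ds }} ',
    dag=dag,
)

# Simple data audit after the load (point 8); audit.sh is a placeholder
# that exits non-zero when row counts look wrong, failing the workflow
# and triggering the failure email configured above.
audit = BashOperator(
    task_id='audit_row_counts',
    bash_command='audit.sh {{ ds }} ',
    dag=dag,
)

audit.set_upstream(ingest)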



Friday 4 March 2016

Submit Hive queries to different Queues

When the Hive execution engine is MapReduce:
set mapreduce.job.queuename=<queuename>;

When the Hive execution engine is Tez:
set tez.queue.name=<queuename>;
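
These properties can also be passed when starting the session, for example with hive --hiveconf mapreduce.job.queuename=<queuename>, instead of setting them inside the script.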

Hive Default Field Delimiter

\u0001 (the non-printing ASCII SOH character, usually displayed as ^A / Ctrl+A)
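
If you read a text-format Hive table's files directly, you have to split each line on this character. A minimal Python sketch, with a purely hypothetical file path:

# Fields in a default-delimited Hive text file are separated by \x01 (U+0001).
with open('/tmp/hive_table_export/000000_0') as f:
    for line in f:
        fields = line.rstrip('\n').split('\x01')
        print(fields)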

Hive RuntimeException

"Failed with exception java.io.IOException:java.lang.RuntimeException: serious problem"

If the Hive cluster is Kerberos-secured and queries fail with this exception, a common workaround is to disable the fetch-task optimization:

set hive.fetch.task.conversion=none;
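
hive.fetch.task.conversion controls when Hive skips the execution engine and fetches results directly for simple queries (plain SELECTs, filters, LIMITs). Setting it to none forces every query through MapReduce/Tez, which is slower for trivial queries but avoids the failing fetch-task code path.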