Thursday 17 March 2016

Orchestrating Hadoop Workflows - What your workflow tool should look like.

This post outlines the most important qualities to look for when choosing a Big Data workflow tool.


  1. Schedule the ETL process to ingest incremental data from various streams.

    The ability to schedule jobs is the most elementary requirement of any orchestration tool, and the tool must be able to ingest and process data simultaneously from various sources to various endpoints. It should support the most common sources and endpoints out of the box (a minimal sketch follows this list).

  2. Notifications on failure of the ETL process.

    Notifications and alerts are a must: a failure can set you back to a backfill and can drastically increase infrastructure costs (especially if you pay for the cluster by the hour). A good, reliable notification mechanism is therefore essential.

  3. Optimization of resource usage for daily ingestion.

    This is not directly a feature of the workflow tool, but the tool should persist execution times and trends for each task, which you can then use to optimize those tasks. This data becomes valuable once the workflow has run enough times for the trends to be meaningful.

  4. Retry on failure, and make your tasks idempotent incremental loads.

    Making the workflow idempotent is the developer's responsibility, not the tool's; the tool should support retries.

  5. Zero or minimal manual intervention

  6. Self-recovering / rescheduling system

  7. Pause the workflow

    The tool should let you pause execution after the currently running task when needed. This lets you fix an issue with the next task in your workflow without having to fail the whole workflow.

  8. Automatic data audits for incrementally loaded data.

    The tool should support audits on tasks and use their results to drive decisions in the workflow, for example stopping downstream loads when a row-count check fails.
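
To make points 1, 2, 4 and 8 concrete, here is a minimal sketch of a daily ingestion workflow written for Apache Airflow. The post does not prescribe a specific tool, so treat this purely as an illustration; the DAG name, shell scripts and email address are hypothetical placeholders.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Failure notifications and automatic retries (points 2 and 4)
default_args = {
    'owner': 'etl',
    'email': ['data-alerts@example.com'],   # placeholder address
    'email_on_failure': True,
    'retries': 3,
    'retry_delay': timedelta(minutes=15),
}

# Daily schedule for the incremental ingestion (point 1)
dag = DAG(
    'daily_incremental_ingest',             # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2016, 3, 1),
    schedule_interval='@daily',
)

# Ingest one day of data; ingest.sh is a placeholder script and should be
# written to be idempotent (re-running it for the same day overwrites,
# rather than duplicates, that day's partition).
ingest = BashOperator(
    task_id='ingest_incremental',
    bash_command='ingest.sh {{ ds }} ',
    dag=dag,
)

# Simple data audit after the load (point 8); audit.sh is a placeholder
# that exits non-zero when row counts look wrong, failing the workflow
# and triggering the failure email configured above.
audit = BashOperator(
    task_id='audit_row_counts',
    bash_command='audit.sh {{ ds }} ',
    dag=dag,
)

audit.set_upstream(ingest)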



Friday 4 March 2016

Submit Hive queries to different Queues

When the Hive execution engine is MapReduce:
set mapreduce.job.queuename=<queuename>;

When the Hive execution engine is Tez:
set tez.queue.name=<queuename>;
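
These properties can also be passed when starting the session, for example with hive --hiveconf mapreduce.job.queuename=<queuename>, instead of setting them inside the script.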

Hive Default Field Delimiter

\u0001 (the non-printing ASCII SOH character, usually displayed as ^A / Ctrl+A)
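
If you read a text-format Hive table's files directly, you have to split each line on this character. A minimal Python sketch, with a purely hypothetical file path:

# Fields in a default-delimited Hive text file are separated by \x01 (U+0001).
with open('/tmp/hive_table_export/000000_0') as f:
    for line in f:
        fields = line.rstrip('\n').split('\x01')
        print(fields)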

Hive RuntimeException

"Failed with exception java.io.IOException:java.lang.RuntimeException: serious problem"

If the Hive cluster is Kerberos-secured and queries fail with this exception, a common workaround is to disable the fetch-task optimization:

set hive.fetch.task.conversion=none;
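
hive.fetch.task.conversion controls when Hive skips the execution engine and fetches results directly for simple queries (plain SELECTs, filters, LIMITs). Setting it to none forces every query through MapReduce/Tez, which is slower for trivial queries but avoids the failing fetch-task code path.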