Thursday, November 1, 2012

Pentaho solution for Sqoop call with dynamic partitions

The Pentaho Big Data release doesn't have a step for Sqoop as of this writing.
The simple solution is to use a "Shell script" step to call Sqoop.
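
For example, a minimal Sqoop import that such a shell step could run looks like this (a sketch only; the connection details, credentials, table, and paths are placeholders, not from the original post):

    # Minimal Sqoop import a "Shell script" step might run.
    # All connection details and names below are placeholders.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user --password secret \
      --table orders \
      --target-dir /user/etl/staging/orders \
      -m 4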

But our requirement has a twist, one that most Sqoop users probably share:
- capture incremental data for certain tables and keep it in a new Hive partition
- run the Sqoop extract for a window of dates, where each day's data goes into its own partition (DAILY, HOURLY, MONTHLY, etc.); see the sketch after this list
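
A framed call for one day's window landing in one Hive partition could look like the following (a sketch; the table name, the incremental column updated_at, and the partition key ds are assumptions for illustration):

    # One day's incremental window loaded into one Hive partition.
    # Table, column, and partition names are illustrative only.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user --password secret \
      --table orders \
      --where "updated_at >= '2012-10-31' AND updated_at < '2012-11-01'" \
      --hive-import \
      --hive-table orders \
      --hive-partition-key ds \
      --hive-partition-value '2012-10-31' \
      -m 1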

Design


The solution is to use a database table for the configuration and Pentaho to frame the Sqoop calls.
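
A sketch of what one configuration row could hold (the column names here are hypothetical, not from the original post):

    table_name          orders
    jdbc_url            jdbc:mysql://dbhost:3306/sales
    incremental_column  updated_at
    partition_type      DAILY
    hive_table          orders
    last_extract_date   2012-10-31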

JOB


We call the pensqoop transformation to frame the list of Sqoop calls to run.
The shell step actually runs the Sqoop script. Check "Execute for each input row" here so that the script is called once for each Sqoop call framed in the previous step.
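
A sketch of the shell step's script, assuming the framed Sqoop command arrives as the first argument of each incoming row (the argument wiring is an assumption; the post doesn't spell it out):

    #!/bin/bash
    # The framed Sqoop command is expected as the first script argument,
    # passed in from the row fields (an assumption for this sketch).
    SQOOP_CMD="$1"
    echo "Running: ${SQOOP_CMD}"
    eval "${SQOOP_CMD}"
    exit $?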

TRANSFORMATION



1. Get the table configurations (DB connection, incremental column to use for incremental data capture, etc.)
2. Switch the flow between DAILY partitions and FULL refresh (a sample framed call for the FULL branch is shown after this list)
3. Frame the Sqoop call with string manipulation in a JavaScript step
4. Use "Copy rows to result" to send the output to the calling job
