How to capture output from Hive queries in Oozie is an essential question if you’re going to implement any ETL-like solution using Hive. Most commonly used approach is a shell-action, however it requires Hive CLI to be installed on each node, also it doesnt works for remote clusters. Here i wanted to share more generic approach using custom Java action.

The idea of the code is to write single row of Hive output to some variable. As per Oozie doc:

1 The < tt > capture - output < / tt > element can be used to propagate values back into Oozie context , which can then be accessed via EL - functions . This needs to be written out as a java properties format file . The filename is obtained via a System property specified by the constant < tt > oozie . action . output . properties < / tt >

To query Hive i will use Hive2 server and JDBC driver. This approach allows to run this code from any environment, even if you’re trying to connect to remote cluster. Off course atleast Hive JDBC driver should be in your classpath, usually is already included to Oozie shared libs.

The essential code to do this looks like:

Class.forName("org.apache.hive.jdbc.HiveDriver"); Connection con = DriverManager.getConnection("jdbc:hive2://hive2_server_url:10000/db_name","hive", "password"); Statement stmt = con.createStatement(); ResultSet res = stmt.executeQuery("select 1"); Properties props = new Properties(); if (res.next()) { props.setProperty(o.getOutputKeyName(), res.getString(1)); } else { props.setProperty(o.getOutputKeyName(), "NA"); } File file = new File( System.getProperty("oozie.action.output.properties")); OutputStream os = new FileOutputStream(file); props.store(os, ""); os.close(); 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Class . forName ( "org.apache.hive.jdbc.HiveDriver" ) ; Connection con = DriverManager . getConnection ( "jdbc:hive2://hive2_server_url:10000/db_name" , "hive" , "password" ) ; Statement stmt = con . createStatement ( ) ; ResultSet res = stmt . executeQuery ( "select 1" ) ; Properties props = new Properties ( ) ; if ( res . next ( ) ) { props . setProperty ( o . getOutputKeyName ( ) , res . getString ( 1 ) ) ; } else { props . setProperty ( o . getOutputKeyName ( ) , "NA" ) ; } File file = new File ( System . getProperty ( "oozie.action.output.properties" ) ) ; OutputStream os = new FileOutputStream ( file ) ; props . store ( os , "" ) ; os . close ( ) ;

You can find full version of code on github.

Oozie workflow configuration:

workflow.xml <action name="get_max_timestamp"> <java> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <main-class>org.mmartsen.oozie.actions.hive.GetValueFromQuery</main-class> <arg>${hiveServerUrl}</arg> <arg>${hiveUser}</arg> <arg>${hivePassword}</arg> <arg>select nvl(max(cast(timestamp_ms as bigint)),0) from default.table1</arg> <arg>max_timestamp</arg> <capture-output /> </java> <ok to="process_action"/> <error to="kill"/> </action> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 <action name = "get_max_timestamp" > <java> <job-tracker> ${jobTracker} </job-tracker> <name-node> ${nameNode} </name-node> <main-class> org.mmartsen.oozie.actions.hive.GetValueFromQuery </main-class> <arg> ${hiveServerUrl} </arg> <arg> ${hiveUser} </arg> <arg> ${hivePassword} </arg> <arg> select nvl(max(cast(timestamp_ms as bigint)),0) from default.table1 </arg> <arg> max_timestamp </arg> <capture-output /> </java> <ok to = "process_action" /> <error to = "kill" /> </action>

<action name="process_action"> <hive xmlns="uri:oozie:hive-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <job-xml>hive-site-conf.xml</job-xml> <script>process_data_since_max_timestamp.hql</script> <param>max_timestamp=${wf:actionData('get_max_timestamp')['max_timestamp']</param> </hive> <ok to="end"/> <error to="kill"/> </action> 1 2 3 4 5 6 7 8 9 10 11 <action name = "process_action" > <hive xmlns = "uri:oozie:hive-action:0.2" > <job-tracker> ${jobTracker} </job-tracker> <name-node> ${nameNode} </name-node> <job-xml> hive-site-conf.xml </job-xml> <script> process_data_since_max_timestamp . hql </script> <param> max_timestamp=${wf:actionData('get_max_timestamp')['max_timestamp'] </param> </hive> <ok to = "end" /> <error to = "kill" /> </action>

After execution it will put the result to “max_timestamp” variable, which can be used this way:Here it is. This small piece of code can be very helpful for ETL-like processing with Hive and Oozie.