How to solve the issues in the Oozie-Sqoop-Hive job?

There is an issue in the Oozie-Sqoop-Hive job that drives big data consulting experts crazy. In this post, we explain what the issue is and how to fix it.

Big Data – Why It’s Hot?

In this era everyone is going digital; almost everything you do creates a log or an electronic record – whether you make a purchase, go to a movie, or comment on a friend's photo on Facebook.

Each day, 2.5 quintillion bytes of data are created. These data come from digital pictures, videos, posts to social media sites, intelligent sensors, purchase transaction records, cell phone GPS signals, to name a few. This is known as Big Data.

As organizations gather hundreds of terabytes of that information, they use various tools to mine and analyze it, to help businesses better understand their markets and customers, and even predict what's next.

New technologies come with bugs too, and we will discuss one such bug in this post.

Technology:  

There are various tools in the Hadoop ecosystem for data processing, importing data from external sources, and job scheduling – Hive, Pig, HBase, NiFi, Sqoop, Flume, and Oozie, to name a few.

We encountered one such issue while working with Oozie (job scheduler), Sqoop (data ingestion engine), and Hive (data processing engine) together.

Use-case:

Below is the sample data in MySQL, which we will import into a Hive table.

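The sample data was shown as a screenshot. As a stand-in, a small MySQL table along these lines would do; only the table name, sqoop_test1, comes from the post, and the columns are an assumption since the schema is not shown:

-- Hypothetical sample data; only the table name comes from the post.
CREATE TABLE sqoop_test1 (
  id   INT PRIMARY KEY,
  name VARCHAR(50)
);

INSERT INTO sqoop_test1 VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');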

We will use a basic Sqoop command that names the source table in MySQL and the destination table in Hive via the --hive-import option.

We will run this command from the CLI, which loads the data from sqoop_test1 into hive_sqoop1.

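The exact command appears only as a screenshot; a sketch of it, reusing the JDBC connection details from the workflow.xml later in this post, would be:

# Direct Hive import from the CLI. Connection details are taken from
# the workflow.xml shown later; adjust them for your environment.
$ sqoop import \
    --connect jdbc:mysql://localhost/mysql_db \
    --username root --password cloudera \
    --table sqoop_test1 \
    --hive-import --hive-table hive_sqoop1 \
    -m 1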

After the command ran successfully, we looked into the Hive table (hive_sqoop1) and got the desired result.

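A quick way to check this yourself is a query from the Hive shell:

-- Run in the Hive shell or Beeline to confirm the import landed.
SELECT * FROM hive_sqoop1;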

But what if you want to dump data into the Hive table at regular intervals? Then you need to schedule the above Sqoop command with the job scheduler – Oozie.

We created an Oozie workflow.xml file as below, in which we inserted the Sqoop command.

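The file was shown as a screenshot; a minimal sketch of its likely contents, reusing the sqoop-action syntax from the corrected workflow later in this post (the action name is hypothetical), is:

<workflow-app name="Sqoop-Hive" xmlns="uri:oozie:workflow:0.5">
    <start to="sqoop-import"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Direct --hive-import from a Sqoop action: this is the
                 step that fails when scheduled through Oozie. -->
            <command>import --connect jdbc:mysql://localhost/mysql_db --username root --password cloudera --table sqoop_test1 --hive-import --hive-table hive_sqoop1 -m 1</command>
        </sqoop>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>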

Submit the Oozie job with the command below:

$ oozie job -oozie http://localhost:11000/oozie -config 
/home/cloudera/training/oozie/example1/job.properties -run
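
The job.properties file itself is not shown; a minimal sketch for this workflow, assuming a single-node Cloudera quickstart-style setup (host names, ports, and the HDFS application path are assumptions), might look like:

# Hypothetical job.properties; adjust hosts/ports for your cluster.
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
# Pull the Sqoop/Hive jars from the Oozie share lib.
oozie.use.system.libpath=true
# HDFS directory containing workflow.xml (assumed path).
oozie.wf.application.path=${nameNode}/training/oozie/example1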

[Screenshot: Oozie job submitted from the terminal]

Issue:

We thought everything went well, but then the job failed, and we came to know that we cannot dump data from an external source directly into a Hive table with a Sqoop --hive-import command when the whole process is scheduled with Oozie.

At first we thought we had missed adding some Hive jars to the Oozie share/lib, but after many attempts it became clear that it is a bug in the whole process.

Resolution:

We can break the whole process into 2 steps:

  1. Dump data into HDFS from the external source, here MySQL.
  2. Create an external Hive table on top of the HDFS location, or create a managed Hive table and load the HDFS data into it.
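
The key change in step 1 is that Sqoop now writes to a plain HDFS directory with --target-dir instead of using --hive-import. Run standalone, it is the same command that appears inside the corrected workflow below:

# Step 1 on its own: import from MySQL to an HDFS directory only
# (no --hive-import). Same command as in the workflow's sqoop action.
$ sqoop import \
    --connect jdbc:mysql://localhost/mysql_db \
    --username root --password cloudera \
    --table sqoop_test1 \
    --target-dir /training/oozie/example3/input-data/user/ \
    -m 1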

In the below Hive script, we create a managed Hive table named hive_sqoop2 and load the data from the path '/training/oozie/example3/input-data/user/' into it.

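The script was shown as a screenshot; a minimal sketch of hive_script3.hql follows. The column definitions and the delimiter are assumptions, since the post does not show the schema (Sqoop's default text output is comma-delimited):

-- hive_script3.hql (sketch; column names/types are assumed)
CREATE TABLE IF NOT EXISTS hive_sqoop2 (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Move the Sqoop output files from HDFS into the managed table.
LOAD DATA INPATH '/training/oozie/example3/input-data/user/' INTO TABLE hive_sqoop2;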

Now we insert both of the steps discussed above into the Oozie workflow.xml file:

  1. Dump the data into HDFS from the MySQL table with the help of a Sqoop command.
  2. Load the data into the Hive table from HDFS with the help of the Hive script.
cat > workflow.xml
<workflow-app name="Sqoop-Hive" xmlns="uri:oozie:workflow:0.5">
    <start to="sqoop-5bb2"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <!-- Step 1: Sqoop imports the MySQL table to an HDFS directory. -->
    <action name="sqoop-5bb2">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect jdbc:mysql://localhost/mysql_db --username root --password cloudera --table sqoop_test1 --target-dir /training/oozie/example3/input-data/user/ -m 1</command>
        </sqoop>
        <ok to="hive2-03da"/>
        <error to="Kill"/>
    </action>
    <!-- Step 2: the hive2 action runs the script that loads the data into Hive. -->
    <action name="hive2-03da" cred="hive2">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <jdbc-url>jdbc:hive2://localhost:10000/default</jdbc-url>
            <script>/training/oozie/example3/hive_script3.hql</script>
        </hive2>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
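In this workflow, the sqoop-5bb2 action performs step 1 and, on success, hands off to hive2-03da, which connects to HiveServer2 and runs hive_script3.hql to perform step 2; if either action fails, control routes to the Kill node and the error message is reported.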

Submit the Oozie job with the command below:

$ oozie job -oozie http://localhost:11000/oozie -config 
/home/cloudera/training/oozie/example3/job.properties -run

We can see in Hue that the job ran successfully.

[Screenshot: successful workflow run in Hue]

Verify the same in the Hive table named hive_sqoop2.

[Screenshot: contents of the hive_sqoop2 table]

Conclusion:

This is a known issue in the Oozie-Sqoop-Hive job, and we have to use the 2-step solution discussed above whenever we want to schedule the loading of data from external sources into Hive.

Big data consulting experts have just shared a 2-step solution for the issue in the Oozie-Sqoop-Hive job. If you have any query, ask in the comments.

Author Bio: Aaron Jacobson is a big data application developer at Technoligent, a company providing big data consultancy services. Aaron also has experience in content writing and keeps up with technology feeds.
