I was working on a project last week to load data into an HDFS-based Hive database. This was essentially a periodic load, so Sqoop appeared to be the best tool for the job. My small project had the following goals:
– Connect Sqoop to SQL Server and/or Oracle instances
– Export a set of tables into HDFS files
– Load the data files into Hive tables
Sqoop was a new tool to me, so I started with the latest version, which was 1.99.3. My thinking was that it’s almost always better to start with the newest functionality in case you need it. I struggled through the limited documentation but was eventually able to get Sqoop connected to both Oracle and SQL Server using the command line interface available in Sqoop 2. The most challenging part of this exercise was working through the connection string and TCP/IP issues, but that’s a topic for another time.
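For reference, here is a minimal sketch of the general shape of the JDBC connection strings involved. The URL formats are the standard ones for the Oracle thin driver and the Microsoft SQL Server driver; the helper names, host names, service name, and database name are made up for illustration and are not the values from my project.

```python
def oracle_jdbc_url(host: str, service_name: str, port: int = 1521) -> str:
    """Build an Oracle thin-driver JDBC URL using the service-name form."""
    return f"jdbc:oracle:thin:@//{host}:{port}/{service_name}"


def sqlserver_jdbc_url(host: str, database: str, port: int = 1433) -> str:
    """Build a Microsoft SQL Server JDBC URL for a specific database."""
    return f"jdbc:sqlserver://{host}:{port};databaseName={database}"


if __name__ == "__main__":
    # Hypothetical hosts and names, shown only to illustrate the format.
    print(oracle_jdbc_url("oradb.example.com", "ORCL"))
    print(sqlserver_jdbc_url("sqldb.example.com", "Sales"))
```

The default ports (1521 and 1433) are only defaults; if the database listens elsewhere or a firewall sits in between, that is where the TCP/IP trouble tends to start.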
I was able to export the tables into HDFS relatively easily, and I began looking for the run-time option that would let me create the Hive table automatically. I couldn’t figure out how to do it right away, but I was able to run a LOAD DATA operation in Hive to load the data files into Hive tables. This was an OK solution, but I had expected Sqoop to do this for me. I needed to transfer about 500 tables, so loading them all manually was going to be a real pain.
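To give a feel for what the manual route involves, here is a rough sketch that generates one Hive LOAD DATA statement per table. It assumes each table’s files sit in their own directory under a common HDFS base path and that the Hive tables already exist, since LOAD DATA only moves files into a table and does not create it. The function name, base path, and table names are hypothetical.

```python
from typing import Iterable, List


def load_data_statements(tables: Iterable[str], hdfs_base_dir: str) -> List[str]:
    """Return one HiveQL LOAD DATA statement per table.

    Assumes the files for each table live in <hdfs_base_dir>/<table>
    and that a Hive table with the same name has already been created.
    """
    base = hdfs_base_dir.rstrip("/")
    return [
        f"LOAD DATA INPATH '{base}/{table}' INTO TABLE {table};"
        for table in tables
    ]


if __name__ == "__main__":
    # Hypothetical table names and HDFS path.
    for statement in load_data_statements(["customers", "orders"], "/user/etl/staging/"):
        print(statement)
```

Even with the statements generated, someone still has to write the CREATE TABLE definitions for every table first, which is the part that does not scale to 500 tables.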
After researching the issue further, I discovered that the 1.99 version of Sqoop does not yet support the automatic creation of Hive tables that is available in 1.4.4. Doh! With so many tables this was a key requirement for my project, so choosing 1.99 turned out not to be the best decision. Once I knew that, I began researching how to do the task in version 1.4.5 instead. In this version there is a simple --create-hive-table option that accomplishes my goal easily and seamlessly. Luckily for me, most of the work I had already done on 1.99 translated fairly well back to 1.4.5. This allowed me to complete the project relatively quickly after I decided to roll back to an earlier version.
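As an illustration, the sketch below builds a Sqoop 1.4.x import command line for each table in a list. It is not the exact command from my project. The flags used (--connect, --username, --password-file, --table, --hive-import, --create-hive-table, --num-mappers) are standard Sqoop 1 import options, but the function name, connection string, user name, password file path, and table names are placeholders.

```python
import shlex
from typing import List


def sqoop_hive_import_args(jdbc_url: str, username: str, password_file: str,
                           table: str, num_mappers: int = 1) -> List[str]:
    """Build the argument list for a Sqoop 1 import that also creates the Hive table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,   # HDFS path to a file holding the password
        "--table", table,
        "--hive-import",                    # load the imported files into Hive
        "--create-hive-table",              # create the Hive table; the job fails if it already exists
        "--num-mappers", str(num_mappers),  # a single mapper avoids needing a split column
    ]


if __name__ == "__main__":
    # Placeholder connection details and table names.
    url = "jdbc:sqlserver://sqldb.example.com:1433;databaseName=Sales"
    for table in ["customers", "orders"]:
        args = sqoop_hive_import_args(url, "etl_user", "/user/etl/.db_password", table)
        # shlex.join quotes arguments such as the URL, which contains a semicolon.
        print(shlex.join(args))
```

Printing the commands rather than running them makes it easy to review them or drop them into a shell script. Passing the same argument list to subprocess.run would execute each import directly.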
The moral of this story is that in the Wild Wild West of big data, newer is not always better. It pays to put the work in up front to be sure the version you are selecting meets your needs. In the open source world, the older version is often “old reliable”: more stable and, at times, more fully featured than its successor.