In this blog post, I describe a few surprising gotchas related to the import of a MySQL table into Hive using Sqoop 1.4.5 (the most recent version supported by vendors like Hortonworks or Cloudera at the time of writing this post).
Real-world scenario
In my simple (yet real-world) use-case, I have a MySQL table and I want to copy it as an external table in Hive. The data in Hive should be stored in a columnar format such as Parquet or ORC.
Although Sqoop was created many years ago and it seems to be a very mature tool, you’ll see that, even in this simple use-case, it still lacks useful features here and there.
Two possible ways of importing
Basically, there are two ways to import a MySQL table into Hive using Sqoop:
- One-shot command import – just adding the --hive-import option to your Sqoop command
- Two-hop import – first dumping the content of the table to files in HDFS using a Sqoop command, and then creating a Hive table (or partition) on top of it using a Hive command.
Both of them have their own pros, cons, and surprising limitations.
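To make the two options concrete, here is a minimal sketch of both approaches, assuming placeholder connection details, table name and HDFS path (none of them come from an actual setup):

# One-shot import: Sqoop creates and loads the Hive table in a single command
$ sqoop import \
  --connect jdbc:mysql://mysql-host/mydb \
  --username myuser -P \
  --table mytable \
  --hive-import

# Two-hop import, step 1: dump the table to files in HDFS
$ sqoop import \
  --connect jdbc:mysql://mysql-host/mydb \
  --username myuser -P \
  --table mytable \
  --target-dir /data/mydb/mytable
# Two-hop import, step 2: create a Hive table (or partition) on top of the dumped files
# (example Hive DDL for this step is shown later in this post)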
Gotchas related to the “one-shot command” import
Because of its simplicity, this seems to be the natural first choice. Let’s review it.
1. Non-textual file formats are not supported in Sqoop to Hive import (in 1.4.5)
There is no support for Avro (see SQOOP-324), ORC or Parquet today. The support for Parquet was added to Sqoop 1.4.6 (see SQOOP-1366), but Sqoop 1.4.6 is not shipped with HDP 2.2 or CDH 5.3.
So if you want to use a one-shot import from MySQL to Hive today, you must use the text format. Text is not only more time-consuming to parse and heavier on storage than the columnar formats, but it’s also vulnerable to many issues caused by dirty data. For example, the occurrence of the new line character (\n) or the field delimiter (e.g. \001 or ,) in any individual column of your imported rows might cause problems later when reading them in Hive. Fortunately, you can overcome this problem by using the --hive-delims-replacement option to replace these delimiters with your own string when importing the table, or even drop them by using the --hive-drop-import-delims option. Although this helps you avoid parsing errors, the dataset imported to Hive will slightly differ from the original one in MySQL. Changing the delimiters also impacts the performance of the direct mode, because Sqoop will have to parse mysqldump‘s output into fields and transcode them into the user-specified delimiter instead of just copying the data directly from mysqldump‘s output into HDFS.
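For illustration, a one-shot text-format import that sanitizes the delimiters could look roughly like this (it reuses the concert table from the streamrock database that appears later in this post; the single-space replacement string is an arbitrary choice):

$ sqoop import \
  --connect jdbc:mysql://$(hostname)/streamrock \
  --username kawaa -P \
  --table concert \
  --hive-import \
  --hive-table concert \
  --hive-delims-replacement ' '
# alternatively, use --hive-drop-import-delims to strip \n, \r and \001 from string fields entirely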
2. External tables/partitions are not supported
See SQOOP-816 to learn more.
3. Partitions are created under the default location
When loading data into some partition, Sqoop will create a subdirectory for a partition inside the directory where the table is located. Unfortunately, you can’t override it, because the --target-dir and --warehouse-dir options are not respected. See SQOOP-1293 to learn more.
4. An existing Hive table can’t be truncated before the import
The --delete-target-dir option is not respected. This means that if you run your import more than once, you risk duplication of the dataset.
See SQOOP-1293 to learn more.
In other words…
The one-shot command doesn’t satisfy the requirement stated in this blog post. Hari Sekhon already mentioned this in the last comment of SQOOP-1293, saying that the following feature is needed:
Import a table from database into Hive as external table (or partition) placing data in a given path in HDFS (in columnar format), deleting the directory if it exists to avoid cumulative data build up (i.e a total table refresh operation from source).
Cons of the “two-hop” import
Although some multi-step workarounds can satisfy our requirements, they have their own cons…
1. Inelegant because it forces a multi-step procedure
It’s no longer just a single command… If you have multiple commands, then something can fail in the middle and the cleanup/recovery procedure becomes a bit more complex.
2. You must write code to create a Hive table and partitions
… and probably integrate it with your scheduling tool, e.g. Oozie (and/or Falcon), Azkaban, or Luigi.
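A minimal sketch of that code, assuming each daily Sqoop run dumps to a hypothetical day=… subdirectory under /data/streamrock/concert_text using the default comma delimiter, could look like this:

$ hive -e "CREATE EXTERNAL TABLE concert_text (id int, artist string, location string)
           PARTITIONED BY (day string)
           ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
           LOCATION '/data/streamrock/concert_text';"
$ hive -e "ALTER TABLE concert_text ADD PARTITION (day='2015-06-25')
           LOCATION '/data/streamrock/concert_text/day=2015-06-25';"

And this is exactly the kind of boilerplate that you would then template and schedule in Oozie, Azkaban or Luigi.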
3. Requires an additional job to change the file format
Sqoop 1.4.5 allows you to dump data to Avro or SequenceFile format (using the --as-avrodatafile option, for example), but you still need to implement and run an identity map-only MapReduce job (e.g. in Pig or Hive) to convert your dataset to a columnar format such as Parquet or ORC.
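The conversion itself can be as simple as an identity query in Hive that reads from the text staging table sketched above and writes into an ORC-backed table (the table name, path and partition value are again hypothetical):

$ hive -e "CREATE EXTERNAL TABLE concert_orc (id int, artist string, location string)
           PARTITIONED BY (day string)
           STORED AS ORC
           LOCATION '/data/streamrock/concert_orc';"
# identity INSERT ... SELECT that only rewrites the data in the ORC format
$ hive -e "INSERT OVERWRITE TABLE concert_orc PARTITION (day='2015-06-25')
           SELECT id, artist, location FROM concert_text WHERE day='2015-06-25';"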
If you decide to dump data as Avro files (e.g. to make parsing of your records easier), you won’t be able to leverage the direct (fast) import. It turns out that MySQL direct import currently supports only the text output format. Therefore, an option like --as-avrodatafile isn’t supported with the --direct option when importing a MySQL table. This makes the --direct option much less exciting for MySQL users.
Sqoop, Hive, HCatalog and ORC
Updated on June 25th, 2015
As Venkat Ranganathan wrote in the comments, the import from RDBMS to Hive in ORC format via HCatalog is supported. I tested it and it works fine:
$ hive -e "CREATE TABLE concert (id int, artist string, day date, location string) STORED AS ORCFILE;"
When running Sqoop import to Hive, the main difference is that we use the --hcatalog-database and --hcatalog-table options instead of the --hive-table option as described in Sqoop-HCatalog Integration.
$ sqoop import \
  --connect jdbc:mysql://$(hostname)/streamrock \
  --username kawaa -P \
  --table concert \
  --hcatalog-database streamrock \
  --hcatalog-table concert
The integration of Sqoop and HCatalog doesn’t work with Parquet, though (due to HIVE-7502).
Summary
Although the “two-hop” import is less elegant, it gives us more flexibility. Therefore, it’s a better fit for the use-case that I described at the beginning of this blog post.