Used to capture technical solutions to common issues

Development in Sqoop 1 vs Sqoop 2

I was working on a project last week to load data into an HDFS-based Hive database. This was essentially a periodic load, so Sqoop appeared to be the best tool for the job. My small project had the following goals:

– Connect Sqoop to SQL Server and/or Oracle instances
– Export a set of tables into HDFS files
– Load the data files into Hive tables

Sqoop was new to me, so I started with the latest version available, 1.99.3 (the Sqoop 2 line), on the theory that it is almost always better to start with the newest functionality in case you need it. I struggled through the limited documentation but eventually got Sqoop connected to both Oracle and SQL Server using the command-line interface in Sqoop 2. The most challenging part of the exercise was sorting out the connection string and TCP/IP issues, but that’s a topic for another time.

I was able to export the tables into HDFS relatively easily, and I began looking for the run-time option that would automatically create the Hive table. I couldn’t figure out how to do it right away, but I was able to run a LOAD DATA operation in Hive to load the data files into Hive tables. This was a workable solution, but I had expected Sqoop to do it automatically. I needed to transfer about 500 tables, so loading them all manually was going to be a real pain.

After researching the issue further, I discovered that Sqoop 1.99 does not yet support the automatic creation of Hive tables that is available in 1.4.4. Doh! With so many tables this was a key requirement for my project, so choosing 1.99 turned out not to be the best decision. I then researched how to do the task in version 1.4.5 instead. In that version a simple --create-hive-table option (used together with --hive-import) accomplishes my goal easily and seamlessly. Luckily for me, most of the work I had already done on 1.99 translated fairly well back to 1.4.5, which allowed me to complete the project relatively quickly once I decided to roll back to the earlier version.
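
With roughly 500 tables to move, the import commands are worth generating rather than typing. The following is only a minimal sketch of that idea, not the script used on the project: the host name, user, password file path, Hive database and table list are all hypothetical, and the flags are the ones documented for Sqoop 1.4.x, so check them against your installation. It prints one command per table instead of running it.

```python
import shlex


def sqoop_import_command(table, jdbc_url, username, password_file, hive_db="default"):
    """Return the argument list for one Sqoop 1.4.x import into a new Hive table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,   # keeps the password off the command line
        "--table", table,
        "--hive-import",                    # load the imported files into Hive
        "--create-hive-table",              # fail rather than reuse an existing Hive table
        "--hive-table", f"{hive_db}.{table.lower()}",
        "--num-mappers", "1",               # single mapper, so no split column is needed
    ]


# Hypothetical connection details and table list, for illustration only.
JDBC_URL = "jdbc:sqlserver://dbhost:1433;databaseName=sales"
TABLES = ["CUSTOMERS", "ORDERS", "ORDER_LINES"]

for table in TABLES:
    args = sqoop_import_command(table, JDBC_URL, "etl_user", "/user/etl/.db_password")
    # Print a shell-safe command line; swap in subprocess.run(args, check=True) to execute.
    print(" ".join(shlex.quote(arg) for arg in args))
```

Reviewing the printed commands before running them is a cheap safeguard when the table list is long. In practice the list would come from the source database’s catalog rather than being hard-coded.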

The moral of this story is that in the Wild Wild West of big data, newer is not always better. It pays to put in the work up front to be sure the version you select meets your needs. In the open source world, the older version is often the “old reliable”: more stable and, at times, more fully featured.

Header/Detail XML Query in Oracle

On a client project today we had the need to create a table from an XML data file sent in a hierarchical format. The XML file contains multiple orders. Each order is for a particular store which is provided inside a header tag. Each store can order multiple items which are provided via sub tags of each header. A simplified sample of the file looks like this:

[box type=”shadow”]

<orders>
    <order_header>
        <id>{6315E9FF}</id>
        <account_number>12345</account_number>
        <account_name>JOES CONVENIENCE</account_name>
        <store_address>123 MAIN STREET</store_address>
        <store_address_city>Pittsburgh</store_address_city>
        <store_address_state>PA</store_address_state>
        <store_address_postcode>15235</store_address_postcode>
        <order_line>
            <product_code>123456</product_code>
            <product_externalid>123456</product_externalid>
            <product_name>Peg Counter Spinner</product_name>
            <order_quantity>1</order_quantity>
        </order_line>
        <order_line>
            <product_code>723678</product_code>
            <product_externalid>723678</product_externalid>
            <product_name>Skittles</product_name>
            <order_quantity>1</order_quantity>
        </order_line>
    </order_header>
    <order_header>
        <id>{3CB06C14}</id>
        <account_number>23456</account_number>
        <account_name>SMITH LIMITED</account_name>
        <store_address>132 WEST AVENUE</store_address>
        <store_address_city>Pittsburgh</store_address_city>
        <store_address_state>PA</store_address_state>
        <store_address_postcode>15213</store_address_postcode>
        <order_line>
            <product_code>123456</product_code>
            <product_externalid>123456</product_externalid>
            <product_name>Peg Counter Spinner</product_name>
            <order_quantity>1</order_quantity>
        </order_line>
    </order_header>
</orders>

[/box]

The challenge was to write an Oracle query that returns both the header and detail information on a single row. After some quick research I found that this can be done with two XMLTable calls against the same file. The first XMLTable returns the store-level (header) information, including an XMLType column (order_lines) that holds that header’s order_line elements. The second XMLTable returns the item-level information and takes that order_lines column as its input through the “passing” clause. The outer query simply selects the fields you need. It looks like a Cartesian product, but because the second XMLTable is driven by each header row, every header is paired only with its own order lines. Here is a sample of the query to read the data. It assumes the file is named orders.xml and sits in a previously defined Oracle directory called XMLDIR. The query also reads the store_externalid and store tags, which the simplified sample above leaves out.

[box type=”shadow”]

-- One output row per order line, with its header columns repeated
SELECT stores.order_id
      ,stores.account_number
      ,stores.account_name
      ,stores.retail_number
      ,stores.store_name
      ,stores.store_address
      ,stores.store_state
      ,stores.store_zip
      ,items.*
FROM XMLTable('/orders/order_header'
         -- read orders.xml from the XMLDIR directory object as UTF-8
         passing xmltype( bfilename('XMLDIR','orders.xml'), nls_charset_id('AL32UTF8') )
         columns  order_id       varchar2(2000) path 'id'
                 ,account_number varchar2(50)   path 'account_number'
                 ,account_name   varchar2(50)   path 'account_name'
                 ,retail_number  varchar2(12)   path 'store_externalid'
                 ,store_name     varchar2(50)   path 'store'
                 ,store_address  varchar2(100)  path 'store_address'
                 ,store_state    varchar2(5)    path 'store_address_state'
                 ,store_zip      varchar2(10)   path 'store_address_postcode'
                 -- keep the child order_line elements as XML for the second XMLTable
                 ,order_lines    XMLTYPE        path 'order_line'
     ) stores,
     XMLTable('/order_line'
         -- driven by the order_lines fragment of the current header row
         passing stores.order_lines
         columns  product_code   varchar2(100) path 'product_code'
                 ,product_name   varchar2(100) path 'product_name'
                 ,order_quantity varchar2(100) path 'order_quantity'
     ) items

[/box]
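
The way the second XMLTable is driven by each row of the first can be pictured as a nested loop. The sketch below is not Oracle code. It is a small Python illustration, using a trimmed copy of the sample file and a hypothetical helper name (flatten_orders), of how each header’s fields are repeated on every one of its own order lines and on no others.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file: two headers, three order lines in total.
SAMPLE = """
<orders>
  <order_header>
    <id>{6315E9FF}</id>
    <account_name>JOES CONVENIENCE</account_name>
    <order_line><product_name>Peg Counter Spinner</product_name></order_line>
    <order_line><product_name>Skittles</product_name></order_line>
  </order_header>
  <order_header>
    <id>{3CB06C14}</id>
    <account_name>SMITH LIMITED</account_name>
    <order_line><product_name>Peg Counter Spinner</product_name></order_line>
  </order_header>
</orders>
"""


def flatten_orders(xml_text):
    """Return one dict per order line, with its header fields repeated."""
    root = ET.fromstring(xml_text)
    rows = []
    # Outer loop: plays the role of the first XMLTable (one pass per header).
    for header in root.findall("order_header"):
        header_fields = {
            "order_id": header.findtext("id"),
            "account_name": header.findtext("account_name"),
        }
        # Inner loop: plays the role of the second XMLTable, limited to
        # the order_line children of the current header only.
        for line in header.findall("order_line"):
            rows.append({**header_fields,
                         "product_name": line.findtext("product_name")})
    return rows


for row in flatten_orders(SAMPLE):
    print(row)
```

As with the comma join in the query, a header that has no order_line children produces no rows here. Going a level deeper means adding one more inner loop, just as the query would add one more XMLTable.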

This methodology can easily be extended to query an XML structure n levels deep to return information in one result set.

ASP.NET Issue: The SqlDataSource control does not have a naming container…

I have received the following .NET error several times when developing nested data-bound controls, and the solution is not well covered on the Internet. The curious thing about this error is that it does not appear until the second or third postback on the page.

[box]The SqlDataSource control does not have a naming container. Ensure that the control is added to the page before calling DataBind[/box]

The cause is simple, however. It occurs when you use a ControlParameter on a DataSource that is nested within another data control; I ran into it again today with a GridView inside a DataList, for example. The error occurs because the inner DataSource cannot locate the control that the parameter refers to.

The solution is simple as well. Remove the ControlParameter from the DataSource that sits inside the other control, replace it with a plain <asp:Parameter>, and assign its DefaultValue in code when the outer control binds each item (for example, in the DataList’s ItemDataBound event handler).