This page is used to capture technical solutions to common issues.

Prototype Review of Azure SQL Data Warehouse

Data Warehouse Consultants is constantly investing in our staff to learn new technologies and offerings.  The summary below was published by our analyst Mukesh Thangadurai regarding the use of the Azure SQL Data Warehouse, which is Microsoft’s current cloud offering focusing on the data warehousing space.  For more information on this prototype or similar work we can do for your company, please use our Contact Form.

Environment

  • Microsoft Azure SQL Data Warehouse
  • Microsoft SQL Server 2016
  • Microsoft SQL Server Integration Services

Performance / Costs

  1. DWU 100 – costs 1.51 USD/Hour (Minimum)
  2. DWU 400 – costs 6.05 USD/Hour (Default)

Implementation 1: Transportation Data

Data Set details:

  • 20 Dimension Tables
  • 8 Fact Tables (200+ million records)

  1. The flat files for this implementation were obtained by connecting to a source SQL Server database and converting tabular data to flat files.
  2. These flat files were then loaded to the Azure blob using SSIS.
  3. Polybase scripts were written to convert these flat files into a relational tabular format (a sketch of this pattern follows this list).
  4. TSQL procedures were then written to complete the ETL and convert the data into the format required by the target tables.
  5. These TSQL procedures can also be triggered using SSIS.
  6. We were also able to test incremental loading scenarios, where new files arrive and old files are archived out.
  7. The entire process, including the steps above, was completed using SSIS.
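
For reference, here is a minimal sketch of the Polybase pattern described in step 3, written in Azure SQL Data Warehouse T-SQL.  The data source, file format, credential, and table names are illustrative placeholders, not the actual project objects.

[box type="shadow"]

-- Illustrative only: names, paths, and the credential are placeholders.
-- 1. Point Polybase at the flat files in the Azure blob container
--    (assumes a database scoped credential named BlobStorageCredential already exists).
CREATE EXTERNAL DATA SOURCE AzureBlobSource
WITH ( TYPE = HADOOP,
       LOCATION = 'wasbs://transport-files@mystorageaccount.blob.core.windows.net',
       CREDENTIAL = BlobStorageCredential );

CREATE EXTERNAL FILE FORMAT PipeDelimitedFile
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = '|' ) );

-- 2. Expose one of the flat files as a relational (external) table.
CREATE EXTERNAL TABLE ext.FactTrips
( TripId       BIGINT,
  RouteKey     INT,
  TripDate     DATE,
  PassengerCnt INT )
WITH ( LOCATION = '/fact_trips/',
       DATA_SOURCE = AzureBlobSource,
       FILE_FORMAT = PipeDelimitedFile );

-- 3. Land the data in an internal, distributed table with CTAS; the TSQL ETL
--    procedures then transform it into the format required by the target tables.
CREATE TABLE stage.FactTrips
WITH ( DISTRIBUTION = HASH(TripId) )
AS SELECT * FROM ext.FactTrips;

[/box]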

Implementation 2: Consumer Data

Data Set details:

  • 8 Dimension Tables
  • 1 Fact Table (168 million records)

  1. The flat files for dimension and fact data were already in a structured, delimited format.
  2. Since no data cleansing or transformation was required, dimension data was uploaded straight to the data warehouse using the AzureDWUploadTask function in SSIS.
  3. This function auto-creates temporary external tables and deletes them once the actual tables are loaded.
  4. The data in the fact table was uploaded in phases using regular data uploads (a sketch of this phased approach follows this list).
  5. Uploading the 168 million records took 45 minutes, and the procedure that moved all the data into the fact table took another 35 minutes.
  6. The entire data upload was handled using SSIS.
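
Below is a minimal sketch of the phased fact load mentioned in step 4, assuming the staged data already sits in a table such as stage.FactConsumer (all table, column, and date names here are placeholders):

[box type="shadow"]

-- Illustrative only: table, column names, and date windows are placeholders.
-- Each pass moves one month of staged consumer data into the fact table,
-- keeping individual statements small enough to monitor and restart.
INSERT INTO dbo.FactConsumer (CustomerKey, ProductKey, SaleDate, SaleAmount)
SELECT CustomerKey, ProductKey, SaleDate, SaleAmount
FROM   stage.FactConsumer
WHERE  SaleDate >= '2016-01-01'
  AND  SaleDate <  '2016-02-01';

-- Repeat (or loop) with the next date window until all 168 million rows are moved.

[/box]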


Observations

Detailed benchmark data was captured by running various test queries on this system at both DWU100 and DWU400.

  1. Query performance for intensive queries improved by at least 40% when scaled up from DWU100 to DWU400.
  2. Scaling was effective and took less than 10 minutes to scale up or down.
  3. Azure SQL DWH has some limitations in DDL and doesn’t support primary keys.
  4. Compatibility with SSIS is very tight; the entire implementation above was completed and orchestrated using SSIS.
  5. There are some limitations, such as the lack of primary keys and the unavailability of certain functions such as @@ROWCOUNT; a separate limitations document has been created to capture these.  Even so, this still seems like a strong solution: an upgrade over conventional databases that does not drift too far from traditional database behavior.
  6. The data warehouse provides an option to pause the system when not in use, drastically reducing operating costs.
  7. Polybase does not support DML and depends on PDW external tables.  It also doesn’t skip header rows.
  8. In external tables, default values and constraints are not allowed.
  9. In regular tables, default values in column definitions are limited to literals and constants only.
  10. Non-deterministic expressions and functions such as GETDATE() and CURRENT_TIMESTAMP are not supported in these default definitions.
  11. There is significant caching, and re-running the same queries back-to-back shows improved performance.
  12. The following T-SQL operations are limited or not supported.  Working around them can require substantial TSQL code changes, including temp tables or CTAS statements (a sketch of the CTAS approach follows this list).
  • ANSI JOINs on UPDATEs
  • ANSI JOINs on DELETEs
  • WITH clauses / common table expressions followed by INSERT, UPDATE, DELETE or MERGE
  13. Identity / sequence fields are not supported on REPLICATED tables.
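
To illustrate the ANSI JOIN limitation above, here is a hypothetical sketch of rewriting an UPDATE that uses a JOIN as a CTAS statement.  The table and column names are made up for the example; the pattern is to build the corrected table with CTAS, then swap it in.

[box type="shadow"]

-- Classic SQL Server form (not supported in Azure SQL DWH):
--   UPDATE f
--   SET    f.RegionName = d.RegionName
--   FROM   dbo.FactSales f
--   JOIN   dbo.DimRegion d ON f.RegionKey = d.RegionKey;

-- CTAS workaround: build a new table that already contains the "updated" values.
CREATE TABLE dbo.FactSales_New
WITH ( DISTRIBUTION = HASH(SaleId) )
AS
SELECT f.SaleId,
       f.RegionKey,
       d.RegionName,        -- the value the UPDATE would have written
       f.SaleAmount
FROM   dbo.FactSales f
JOIN   dbo.DimRegion d ON f.RegionKey = d.RegionKey;

-- Swap the new table in for the old one.
RENAME OBJECT dbo.FactSales     TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;
DROP TABLE dbo.FactSales_Old;

[/box]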

Recommendations

Azure SQL DWH was a strong fit for the requirement above, and scaling helped control expenses: the system was kept at DWU100 for data uploads and dimension queries and scaled up to DWU400 for large queries.  This would be a good solution for organizations that want to test a move to the cloud while still having a backup option for recovery scenarios.

Our Experience with Azure HD Insight

Data Warehouse Consultants is constantly investing in our staff to learn new technologies and offerings.  The summary below was published by our analyst Mukesh Thangadurai regarding the use of the Azure HD Insight product, which is Microsoft’s current Hadoop offering.  For more information on this prototype or similar work we can do for your company, please use our Contact Form.

Environment

  • Microsoft Azure HDInsight
  • Microsoft SQL Server Integration Services

Data Set details:

  • 8 Dimension Tables
  • 1 Fact Table (168 million records)

Pricing

Excluding data storage costs

2 master, 2 slave architecture: $4.01/hour

2 master, 4 slave architecture: $6.59/hour

Implementation Notes

  1. The flat files for dimension and fact data were already in a structured, delimited format.
  2. An Azure HDInsight cluster gives the advantage of accessing files straight from the Azure blob, without requiring any actual loading to the servers in the cluster.  This saves a significant amount of time, since loading terabytes of data into the cluster would be a time-consuming task.
  3. The data for the fact and dimension tables was already uploaded to the blob using SSIS.
  4. Uploading the 168 million records to the blob took 45 minutes.
  5. The entire data upload was done using SSIS.
  6. An HDInsight cluster can be created in two ways.
    1. Option 1: Create it from the Azure portal using the UI.
    2. Option 2: Create it from SSIS using the Create HDInsight Cluster function.
  7. We tried both methods, and they took almost the same time to create the cluster.
  8. Microsoft sets the default number of master nodes to 2 and gives the flexibility of choosing the number of slave nodes.  In our implementation, we trialed 2 slave nodes and 4 slave nodes in separate runs.
  9. Once the clusters were created, the DDL for creating the Hive tables was run.  These DDLs specify the location of the files in the Azure blob (a sketch of this DDL follows this list).
  10. We also had the option of creating user-defined partitions, which are established by marking each folder of data as a separate partition.
  11. We implemented the prototype using both styles, although most of the testing was done with unpartitioned data.
  12. Once the dataset was created, we used HiveQL to query the data.
  13. We can query this data from either SQL Server or the cluster dashboard.
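
For reference, here is a minimal sketch of the Hive DDL pattern described in steps 9 and 10.  The storage account, container, table, and column names are placeholders; the key points are that the table reads delimited files directly from the blob and that each folder of data can be registered as a partition.

[box type="shadow"]

-- Illustrative HiveQL; all names and paths are placeholders.
CREATE EXTERNAL TABLE fact_consumer (
    customer_id  BIGINT,
    product_id   BIGINT,
    sale_amount  DOUBLE
)
PARTITIONED BY (sale_month STRING)   -- optional: one blob folder per partition
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'wasb://data@mystorageaccount.blob.core.windows.net/fact_consumer/';

-- When partitioning, register each folder of data as a partition.
ALTER TABLE fact_consumer ADD PARTITION (sale_month = '2016-01')
LOCATION 'wasb://data@mystorageaccount.blob.core.windows.net/fact_consumer/2016-01/';

[/box]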

Key Observations:

  1. Query performance significantly improved when upgraded from 2 slave nodes to 4 slave nodes.
  2. Unlike the conventional Hadoop architecture, Azure defaults the number of master nodes to 2 instead of 1.
  3. Azure does offer two options to store metadata.
    1. Option 1: In a separate database of choice
    2. Option 2: In the same location where the cluster is being built.
  4. The second option is the default, and its drawback is that the metadata is purged when the cluster is deleted.
  5. Unlike Azure SQL Data Warehouse, there is no option to pause the cluster when not in use, and we are charged for every hour the cluster exists.
  6. Hence, the best way to use the cluster is to have all the required data already uploaded to the blob and the DDL scripts and analysis queries prepared, then create the cluster, run the DDL and analysis scripts, and delete the cluster.
  7. The entire task of uploading data to the blob, creating the cluster, running the required DDLs, and deleting the cluster can be orchestrated using SSIS.
  8. There were some expected restrictions, such as DML not being allowed, and some unexpected limitations, such as the inability to use subqueries for comparison in the WHERE clause (a workaround sketch follows this list).
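
As an example of the subquery limitation in item 8, a lookup that would normally be written with an IN subquery can be rewritten as a LEFT SEMI JOIN in HiveQL.  The table and column names below are placeholders.

[box type="shadow"]

-- Not available in the Hive version we used:
--   SELECT * FROM fact_consumer
--   WHERE  customer_id IN (SELECT customer_id FROM dim_customer);

-- Equivalent rewrite using LEFT SEMI JOIN:
SELECT f.*
FROM   fact_consumer f
LEFT SEMI JOIN dim_customer c ON (f.customer_id = c.customer_id);

[/box]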

Recommendations

Azure HDInsight feels like a true big data solution, more suited to high-end analytics than to a traditional data warehouse.  The clusters cost a minimum of $4.01/hour for the 2-slave-node configuration and $6.59/hour for the faster 4-slave-node configuration.  This cost structure seems better suited to high-end, temporary data analytics than to a traditional, frequently queried data warehouse.

Moving Your Data Warehouse to the Cloud

The Cloud offers many advantages and many organizations are looking to utilize Cloud solutions.  But how easy is it to move an existing Data Warehouse to Microsoft Azure or Amazon Web Services?  To gain insight and better advise our clients, Data Warehouse Consultants has begun two research projects to identify the benefits and pitfalls of a move to the Cloud.

One project migrates an existing operational data warehouse to Microsoft Azure utilizing Azure SQL Data Warehouse, SQL Server Integration Services (SSIS) and T-SQL in a cloud hosted server environment.  The other project migrates the same operational data warehouse to AWS Redshift using Python in the AWS Lambda serverless environment.

Through these two parallel development efforts, Data Warehouse Consultants will be better positioned to help clients evaluate the different Cloud based solutions and assist an organization as it migrates existing systems or implements new solutions in a Cloud based environment.

Excel External Content Blocked

We have been encountering a new issue with Microsoft Excel and one of our clients had it today.  It took a little while to figure out, so we want to help out others in this situation.


Issue – Excel External/Web Content is Blocked: In most of our interactive applications, we allow for exporting data (either to csv or to xls extensions). By default, these downloads will open in Excel.  A recent update did something to Excel that prevents these files from opening.  It happened to my PC at home (I just figured I did something). But then today it happened at one of the client sites. Everything looks OK with the file when you download it, but either the file doesn’t open, or Excel opens and there is no file there.  There is no error message or anything – the file just doesn’t open.


Resolution: When the file downloads, instead of opening the file, open the folder location to view it. Right-click the file and click Properties.  At the bottom of the window there is an “Unblock” button. If you click Unblock, you can open the file. This shows that the file content is being blocked because it came from the internet (external).  Unblocking allows the file to open, but you would have to do that every time you download a file.  To avoid doing it every time, you have to change the Excel settings.


Steps:
1. Open Excel and go to Options -> Trust Center.
2. Click “Trust Center Settings”.
3. Go to Protected View and uncheck “Enable Protected View for files originating from the internet”.


After doing some research, it seems we are not the only ones experiencing this issue.  Here is a link to a news article about it:


8/26/2016 Update – It seems like Microsoft has now resolved the issue.

UNIX Quick Start Guides

Unix (all-caps UNIX for the trademark) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, developed in the 1970s at the Bell Labs research center by Ken Thompson, Dennis Ritchie, and others.
From the power user’s or programmer’s perspective, Unix systems are characterized by a modular design that is sometimes called the “Unix philosophy”, meaning that the operating system provides a set of simple tools that each perform a limited, well-defined function, with a unified filesystem as the main means of communication and a shell scripting and command language to combine the tools to perform complex workflows. Aside from the modular design, Unix also distinguishes itself from its predecessors as the first portable operating system: almost the entire operating system is written in the C programming language, which allowed Unix to reach numerous platforms. Read more.


The following are some resources to learn about UNIX:

Unix Operating System
Unix Commands Tutorial 1
— TutorialPoints Unix_Tutorial
— UNIX CheatSheets Unix_CheatSheet_1 Unix_CheatSheet_2

Getting Started with ASP.NET Web Forms and MS Visual Studio

The .NET Framework (pronounced dot net) is a software framework developed by Microsoft that runs primarily on Microsoft Windows. It includes a large class library known as Framework Class Library (FCL) and provides language interoperability (each language can use code written in other languages) across several programming languages. Programs written for .NET Framework execute in a software environment (as contrasted to hardware environment), known as Common Language Runtime (CLR), an application virtual machine that provides services such as security, memory management, and exception handling. FCL and CLR together constitute .NET Framework. Read more.


The following are some resources to learn about ASP.NET, Visual Studio and other Microsoft products:

W3Schools ASP.NET 4 Tutorial
Tutorial Points ASP.NET Tutorial
MS Introduction to ASP.NET Web Forms
— ASP.NET Hands On Project ASP.NET Project
Free Microsoft Training Delivered by Experts


Using Google Apps

Google Apps for Work (formerly Google Apps for Business) is a suite of cloud computing productivity and collaboration software tools and software offered on a subscription basis by Google.

It includes Google’s popular web applications including Gmail, Google Drive, Google Hangouts, Google Calendar, and Google Docs. While these products are available to consumers free of charge, Google Apps for Work adds business-specific features such as custom email addresses at your domain (@yourcompany.com), at least 30 GB of storage for documents and email, and 24/7 phone and email support. As a cloud computing solution, it takes a different approach from off-the-shelf office productivity software by hosting customer information in Google’s network of secure data centers, rather than on traditional in-house servers that are located within companies. Read more.


The following are some resources to learn about Google Apps:

Using Google Apps at Work or School

Introduction to AWS (Amazon Web Services)

Amazon Web Services is a collection of remote computing services, also called web services, that make up a cloud computing platform offered by Amazon.com. These services are based out of 11 geographical regions across the world. To learn more about Cloud Computing please click here.


The following are some resources to learn about Amazon Web Services:

— Webinar: What Is AWS?

Introduction to Amazon Web Services

AWS in plain English

7 Steps to Select the Right Architecture for Your Web Application

AWS Instructional Videos and Labs

Development in Sqoop 1 vs Sqoop 2

I was working on a project last week to load data into an HDFS-based Hive database. This was essentially a periodic load, so Sqoop appeared to be the best tool for the job. My small project consisted of the following goals:

– Connect Sqoop to SQL Server and/or Oracle instances
– Export a set of tables into HDFS files
– Load the data files into Hive tables

Sqoop was a new tool to me, so I started with the highest version, which was 1.99.3. I was thinking that it’s almost always better to start with the new functionality in case you might need it. I struggled through the limited documentation but was eventually able to get Sqoop connected to both Oracle and SQL Server using the Command Line Interface available in Sqoop 2. The most challenging part of this exercise was working out the connection string and TCP/IP issues, but that’s a topic for another time.

I was able to export the tables into HDFS relatively easily, and I began looking for the run-time option that would allow me to automatically create the Hive table. I couldn’t figure out how to do it right away, but I was able to run a LOAD DATA operation in Hive to load the data files into the Hive tables. This was an OK solution, but the problem is that I expected Sqoop to do this automatically. I needed to transfer about 500 tables, so loading them all manually was going to be a real pain.
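
For context, that manual step looks roughly like the HiveQL below: create the Hive table yourself, then load the files Sqoop wrote to HDFS.  The table, column, and path names are placeholders for the real ones.

[box type="shadow"]

-- Placeholder names; assumes the Sqoop import landed delimited files under /user/sqoop/customers.
CREATE TABLE customers (
    customer_id   INT,
    customer_name STRING,
    created_date  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Move the Sqoop-generated files into the Hive warehouse directory for this table.
LOAD DATA INPATH '/user/sqoop/customers' INTO TABLE customers;

[/box]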

After researching the issue further, I discovered that the 1.99 version of Sqoop does not yet support the automatic creation of Hive tables that is available in 1.4.4. Doh! This is a key requirement for my project with so many tables, so it turns out that choosing 1.99 was not the best decision. Once I knew that was the case, I began researching how to do this task in version 1.4.5 instead. In this version there is a simple --create-hive-table option that accomplishes my goal easily and seamlessly. Luckily for me, most of the work I had already done on 1.99 translated fairly well back to 1.4.5. This allowed me to complete the project relatively quickly after I decided to roll back to the earlier version.

The moral of this story is that in the Wild Wild West of big data, newer is not always better. It pays to put the work in up front to be sure the version you are selecting meets your needs. In the open source world, the older version is often “old reliable” – more reliable, and often with more features.

Header/Detail XML Query in Oracle

On a client project today we had the need to create a table from an XML data file sent in a hierarchical format. The XML file contains multiple orders. Each order is for a particular store which is provided inside a header tag. Each store can order multiple items which are provided via sub tags of each header. A simplified sample of the file looks like this:

[box type=”shadow”]

<orders>
<order_header>
        <id>{6315E9FF}</id>
        <account_number>12345</account_number>
        <account_name>JOES CONVENIENCE</account_name>
        <store_address>123 MAIN STREET</store_address>
        <store_address_city>Pittsburgh</store_address_city>
        <store_address_state>PA</store_address_state>
        <store_address_postcode>15235</store_address_postcode>
        <order_line>
            <product_code>123456</product_code>
            <product_externalid>123456</product_externalid>
            <product_name>Peg Counter Spinner</product_name>
            <order_quantity>1</order_quantity>
        </order_line>
        <order_line>
            <product_code>723678</product_code>
            <product_externalid>723678</product_externalid>
            <product_name>Skittles</product_name>
            <order_quantity>1</order_quantity>
        </order_line>
        </order_header>
        <order_header>
        <id>{3CB06C14}</id>
        <account_number>23456</account_number>
        <account_name>SMITH LIMITED</account_name>
        <store_address>132 WEST AVENUE</store_address>
        <store_address_city>Pittsburgh</store_address_city>
        <store_address_state>PA</store_address_state>
        <store_address_postcode>15213</store_address_postcode>
        <order_line>
            <product_code>123456</product_code>
            <product_externalid>123456</product_externalid>
            <product_name>Peg Counter Spinner</product_name>
            <order_quantity>1</order_quantity>
        </order_line>
        </order_header>
</orders>

[/box]

The challenge was to write an Oracle query that could easily return both the header and detail information on a single row.  After some quick research I found that this can be done by using two separate XMLTables that hit the same file.  The first XMLTable contains the store-level information.  The second XMLTable contains the item-level information.  When used together, the second XMLTable starts from the XMLType column (order_lines) that is defined in the first XMLTable.  The outer query simply selects the fields you need.  It’s essentially a Cartesian product, but it only intersects where appropriate given the “passing” clause. Here is a sample of the query to read the data.  This query assumes the file name is orders.xml and that it is sitting in the previously defined XMLDIR Oracle directory.

[box type=”shadow”]

SELECT stores.order_id
      ,stores.account_number
      ,stores.account_name
      ,stores.retail_number
      ,stores.store_name
      ,stores.store_address
      ,stores.store_state
      ,stores.store_zip
      ,items.*
FROM XMLTable('/orders/order_header'
	     passing xmltype(  bfilename('XMLDIR','orders.xml'), nls_charset_id('AL32UTF8') )
         columns  order_id varchar2(2000) path 'id'
                 ,account_number varchar2(50)  path 'account_number'
                 ,account_name   varchar2(50)  path 'account_name'
                 ,retail_number  varchar2(12)  path 'store_externalid'
                 ,store_name     varchar2(50)  path 'store'
                 ,store_address  varchar2(100) path 'store_address'
                 ,store_state    varchar2(5)   path 'store_address_state'
                 ,store_zip      varchar2(10)  path 'store_address_postcode'
                 ,order_lines    XMLTYPE       path 'order_line'
     ) stores,
     XMLTable('/order_line'
	     passing stores.order_lines
         columns  product_code   varchar2(100) path 'product_code'
                 ,product_name   varchar2(100) path 'product_name'
                 ,order_quantity varchar2(100) path 'order_quantity'
     ) items

[/box]

This methodology can easily be extended to query an XML structure n levels deep to return information in one result set.