Our Experience with Azure HDInsight
Data Warehouse Consultants is constantly investing in our staff to learn new technologies and offerings. The summary below was published by our analyst Mukesh Thangadurai regarding the use of Azure HDInsight, Microsoft's current Hadoop offering. For more information on this prototype or similar work we can do for your company, please use our Contact Form.
Environment
- Microsoft Azure HDInsight
- Microsoft SQL Server Integration Services
Data Set details:
- 8 Dimension Tables
- 1 Fact Table (168 million records)
Pricing
Excluding data storage costs:
- 2 master, 2 slave architecture: $4.01/hour
- 2 master, 4 slave architecture: $6.59/hour
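As a worked example based on the rates above, a three-hour ad hoc analysis run on the 2 master, 4 slave configuration would cost roughly 3 × $6.59 ≈ $19.77, before storage charges.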
Implementation Notes
- The flat files for dimension and fact data were already in a structured, delimited format.
- The Azure HDInsight cluster gives the advantage of accessing files straight from Azure blob storage, without requiring any actual loading to the servers in the cluster. This saves considerable time, since loading terabytes of data onto the cluster would be a lengthy task.
- The fact and dimension table data were uploaded to the blob entirely using SSIS; uploading the 168 million fact records took 45 minutes.
- An HDInsight cluster can be created in two ways.
- Option 1: Create it from the Azure portal using the UI.
- Option 2: Create it from SSIS using its Create HDInsight Cluster function.
- We tried both methods, and they took almost the same time to create the cluster.
- Microsoft sets the default number of master nodes to 2 and gives the flexibility of choosing the number of slave nodes. We tested with 2 slave nodes and with 4 slave nodes in separate implementations.
- Once the clusters were created, the DDL statements for creating the Hive tables were run. These DDLs specify the location of the files in the Azure blob (see the HiveQL sketch after this list).
- We also had the option of creating user-defined partitions, which are established by marking each folder of data as a separate partition.
- We implemented the prototype using both styles; however, most of the testing was done with unpartitioned data.
- Once the tables were created, we used HiveQL to query the data.
- We can use either SQL Server or the cluster dashboard to query this data.
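As an illustration of the steps above, the following HiveQL sketch shows an external table defined over delimited files in Azure blob storage, a partitioned variant in which each folder is registered as its own partition, and a sample analysis query. The table names, columns, storage account, and container paths are hypothetical placeholders, not the actual prototype schema.

-- Hypothetical external table over delimited files already sitting in the blob;
-- no data is loaded onto the cluster nodes themselves.
CREATE EXTERNAL TABLE fact_sales (
    order_id     BIGINT,
    customer_key INT,
    product_key  INT,
    sale_amount  DECIMAL(18,2),
    sale_date    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'wasb://data@examplestorage.blob.core.windows.net/fact_sales/';

-- Partitioned variant: each folder of data is registered as a separate partition.
CREATE EXTERNAL TABLE fact_sales_part (
    order_id     BIGINT,
    customer_key INT,
    product_key  INT,
    sale_amount  DECIMAL(18,2)
)
PARTITIONED BY (sale_year STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'wasb://data@examplestorage.blob.core.windows.net/fact_sales_part/';

ALTER TABLE fact_sales_part ADD PARTITION (sale_year = '2015')
LOCATION 'wasb://data@examplestorage.blob.core.windows.net/fact_sales_part/2015/';

-- Sample HiveQL analysis query joining the fact table to a dimension table
-- (dim_product is assumed to be defined the same way as the tables above).
SELECT d.product_name,
       SUM(f.sale_amount) AS total_sales
FROM fact_sales f
JOIN dim_product d ON f.product_key = d.product_key
GROUP BY d.product_name
ORDER BY total_sales DESC
LIMIT 10;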
Key Observations:
- Query performance improved significantly when we upgraded from 2 slave nodes to 4 slave nodes.
- Unlike the conventional Hadoop architecture, Azure defaults the number of master nodes to 2 instead of 1.
- Azure offers two options for storing metadata.
- Option 1: In a separate database of your choice.
- Option 2: In the same location where the cluster is built.
- The second option is the default, and its drawback is that the metadata gets purged when the cluster is deleted.
- Unlike Azure Data Warehouse, there is no option to pause the cluster when it is not in use, and we are charged for every hour the cluster exists.
- Hence, the best way to make use of the cluster is to have all the required data already uploaded to the blob and the DDL scripts and queries for reports or analysis prepared in advance; then create the cluster, run the DDL and analysis scripts, and delete the cluster.
- The entire task of uploading data to the blob, creating the cluster, running the required DDLs, and deleting the cluster can be orchestrated using SSIS.
- There were some expected restrictions, such as DML operations not being allowed, and some unexpected limitations, such as the inability to use subqueries for comparison in the WHERE clause (see the workaround sketched after this list).
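To illustrate the subquery limitation, here is a sketch, again with hypothetical table and column names, of the kind of WHERE-clause subquery that was not accepted, together with the LEFT SEMI JOIN form that HiveQL traditionally supports as the equivalent of an uncorrelated IN subquery.

-- Example of a subquery used for comparison in the WHERE clause,
-- which the HiveQL dialect did not accept:
-- SELECT f.*
-- FROM fact_sales f
-- WHERE f.customer_key IN (SELECT customer_key FROM dim_customer WHERE region = 'East');

-- Equivalent rewrite using LEFT SEMI JOIN, the HiveQL idiom for an
-- uncorrelated IN/EXISTS subquery; the right-hand table may only be
-- referenced inside the ON clause.
SELECT f.*
FROM fact_sales f
LEFT SEMI JOIN dim_customer c
  ON (f.customer_key = c.customer_key AND c.region = 'East');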
Recommendations
Azure HDInsight feels like a true big data solution, better suited for high-end analytics than for a traditional data warehouse. The clusters cost a minimum of $4.01/hour for the 2 slave node configuration and $6.59/hour for the more efficient 4 slave node configuration. This cost structure seems better suited to high-end, temporary data analytics than to a traditional, frequently queried data warehouse.