Prototype Review of Azure SQL Data Warehouse
Data Warehouse Consultants is constantly investing in our staff to learn new technologies and offerings. The summary below was published by our analyst Mukesh Thangadurai regarding the use of the Azure SQL Data Warehouse, which is Microsoft’s current cloud offering focusing on the data warehousing space. For more information on this prototype or similar work we can do for your company, please use our Contact Form.
Environment
- Microsoft Azure SQL Data Warehouse
- Microsoft SQL Server 2016
- Microsoft SQL Server Integration Services
Performance / Costs
- DWU 100 – costs 1.51 USD/Hour (Minimum)
- DWU 400 – costs 6.05 USD/Hour (Default)
Implementation 1: Transportation Data
- Data Set details
- 20 Dimension Tables
- 8 Fact Tables (200+ million records)
- The flat files for this implementation were obtained by connecting to a source SQL Server database and exporting tabular data to flat files.
- These flat files were then loaded to Azure Blob Storage using SSIS.
- PolyBase scripts were written to expose these flat files in a relational, tabular format.
- T-SQL procedures were then written to complete the ETL work, converting this data into the format required by the target tables.
- These T-SQL procedures can also be triggered from SSIS.
- We were also able to test incremental-load scenarios, covering specific cases of new files arriving and old files being archived out.
- The entire process, including the step above, was completed using SSIS.
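The PolyBase step above can be sketched in T-SQL as follows. This is a minimal illustration, not the actual project scripts: the data source, credential, container, schema, and table names are all hypothetical, and the column list is invented for the example.

```sql
-- Hypothetical external data source over the Azure blob container
-- (assumes a database-scoped credential has already been created).
CREATE EXTERNAL DATA SOURCE AzureBlobStage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://stage@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

-- File format matching the delimited flat files exported from the source database.
CREATE EXTERNAL FILE FORMAT PipeDelimited
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- External table that presents the staged flat files in relational form.
CREATE EXTERNAL TABLE ext.FactTrips (
    TripId   BIGINT,
    RouteId  INT,
    TripDate DATE,
    Miles    DECIMAL(10, 2)
)
WITH (
    LOCATION = '/transportation/fact_trips/',
    DATA_SOURCE = AzureBlobStage,
    FILE_FORMAT = PipeDelimited
);

-- CTAS materializes the external data into an internal, distributed table,
-- ready for the T-SQL ETL procedures to transform.
CREATE TABLE dbo.FactTrips
WITH (DISTRIBUTION = HASH(TripId))
AS
SELECT * FROM ext.FactTrips;
```

Because PolyBase cannot skip header rows (noted in the observations below), flat files used this way need to be exported without headers.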
Implementation 2: Consumer Data
- Data Set details
- 8 Dimension Tables
- 1 Fact Table (168 million records)
- The flat files for dimension and fact data are already in a structured, delimited format.
- Since no data cleansing or transformation was required, the dimension data was uploaded straight to the data warehouse using the AzureDWUploadTask function in SSIS.
- This function auto-creates a temporary external table and deletes it once the actual tables are loaded.
- The fact table data was uploaded in phases using regular data uploads.
- Uploading the 168 million records took 45 minutes, and the procedure that moves all the data into the fact table took another 35 minutes.
- The entire data upload was handled using SSIS.
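The phased fact load can be sketched as a T-SQL loop that moves staged rows into the fact table one slice at a time. This is an illustrative pattern only: the staging and fact table names, columns, and the month-by-month slicing are assumptions, not the project's actual procedure.

```sql
-- Hypothetical phased load: move staged consumer data into the fact
-- table one month at a time rather than in a single large insert.
DECLARE @m INT = 1;

WHILE @m <= 12
BEGIN
    INSERT INTO dbo.FactConsumer (CustomerKey, ProductKey, SaleDate, Amount)
    SELECT CustomerKey, ProductKey, SaleDate, Amount
    FROM stg.FactConsumer
    WHERE MONTH(SaleDate) = @m;   -- one phase per month of data

    SET @m = @m + 1;
END;
```

Loading in phases keeps each transaction smaller, which makes failures cheaper to retry mid-load.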
Observations
Detailed benchmark data was captured for various queries tested on this system at both DWU100 and DWU400.
- Query performance for intensive queries improved by at least 40% when scaled up from DWU100 to DWU400.
- Scaling was effective and took less than 10 minutes to scale up or down.
- Azure SQL DWH has some limitations in DDL and doesn’t support primary keys.
- The compatibility with SSIS is very tight and the entire process can be orchestrated using SSIS.
- There are some limitations, such as the lack of primary keys and the unavailability of certain functions such as @@ROWCOUNT; a separate limitations document has been created to capture these. Even so, Azure SQL DWH looks like a strong solution: an upgrade over conventional databases without drifting too far from how traditional databases work.
- The data warehouse provides an option to pause the system when not in use, drastically minimizing operations costs.
- PolyBase does not support DML, depends on PDW external tables, and cannot skip header rows.
- In external tables, default values and constraints are not allowed.
- In regular tables, default values in column definitions are limited to literals and constants only.
- Non-deterministic expressions and functions such as GETDATE() and CURRENT_TIMESTAMP are not supported as default values.
- There is significant caching, and re-running the queries back-to-back shows improved performance.
- The following T-SQL operations are limited or not supported; working around them can require substantial T-SQL code changes, such as temp tables or CTAS statements.
 - ANSI JOINs in UPDATE statements
 - ANSI JOINs in DELETE statements
- WITH clauses / Common table expressions followed by INSERT, UPDATE, DELETE or MERGE
- Identity / sequence fields on tables are not supported in REPLICATED tables
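As an example of the CTAS workaround mentioned above, an ANSI JOIN UPDATE can be rewritten by materializing the joined result into a new table and swapping it in. The table and column names below are hypothetical, chosen only to illustrate the pattern.

```sql
-- Not supported in Azure SQL DWH:
-- UPDATE f SET f.Amount = f.Amount * r.Rate
-- FROM dbo.FactSales AS f JOIN dbo.Rates AS r ON f.RateId = r.RateId;

-- CTAS rewrite: build the "updated" result set as a new table...
CREATE TABLE dbo.FactSales_New
WITH (DISTRIBUTION = HASH(SaleId))
AS
SELECT f.SaleId,
       f.RateId,
       f.Amount * r.Rate AS Amount
FROM dbo.FactSales AS f
JOIN dbo.Rates     AS r ON f.RateId = r.RateId;

-- ...then swap the new table in for the old one.
RENAME OBJECT dbo.FactSales TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;
DROP TABLE dbo.FactSales_Old;
```

The same materialize-and-swap pattern also covers ANSI JOIN DELETEs (select the rows to keep) and the unsupported WITH ... INSERT/UPDATE/DELETE forms.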
Recommendations
Azure SQL DWH was an excellent fit for the requirements above, and scaling helped control expenses: the system was set to DWU100 for data uploads and dimension queries, and scaled up to DWU400 for large queries. This would be a strong solution for organizations that want to test a move to the cloud while still keeping a backup option for recovery scenarios.
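The DWU100/DWU400 scaling described above can be driven from T-SQL (run against the master database) and so can be scheduled around load and query windows. The database name here is a placeholder.

```sql
-- Scale down to DWU100 for data uploads and dimension queries.
ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW100');

-- Scale up to DWU400 before running large, intensive queries.
ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW400');
```

In our testing, scale operations of this kind completed in under 10 minutes; pausing the warehouse entirely when idle reduces costs further.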