Data Warehouse Consultants to Host 2nd Annual Memorial Golf Outing

PITTSBURGH (6/26/2023) – Data Warehouse Consultants (DWC) is hosting the second annual Bill Randall Memorial Golf Outing August 28 at Shannopin Country Club. 

Our friend Bill was taken too soon on April 5, 2021, after a long battle with Huntington’s Disease. He led a normal life until his college years, when the disease he had carried since birth was first diagnosed. He always loved playing sports and spending time with family. Now his family is teaming up with DWC to raise awareness of this disease and to honor Bill’s memory.  

Registration starts at 7:30 a.m., and a continental breakfast begins at 8 a.m. A 9:30 a.m. shotgun start will kick off the 18 holes in scramble format. There will be prizes for various contests during the outing. Afterwards, there will be a cookout-style lunch at the club.

Registration closes July 31 or when spots fill up.

All the proceeds from the golf outing will benefit the Bill Randall Memorial Scholarship at Duquesne University and the Huntington’s Disease Society of America (HDSA).

Last year’s outing raised $22,000 and brought the donation total to $46,132.16 for the Bill Randall Memorial Scholarship. With the help and support of this year’s golfers and donors, we hope to reach our goal of $100,000 in total donations. A donation of $7,600 was also made to the HDSA in 2022.

According to the HDSA, Huntington’s Disease is a genetic brain disease caused by a mistake in the DNA instructions that build our bodies and keep them running. Over time, the damage caused to the brain leads to physical, mental and emotional deterioration. There is currently no cure.

Data Warehouse Consultants is the host and diamond sponsor of this event. There are four other sponsorship options available for anyone who is interested. Each sponsorship level and registration option is described in detail on the Bill Randall Memorial Golf Outing website.

Data Warehouse Consultants is a consulting firm based in Pittsburgh. Founded in 2004, the firm provides custom development services for web applications, corporate databases and data warehouse environments.

Bad Data Costs More Than You Think; Fortunately, Solutions Are Achievable

Do you know what duplicate, incorrect, missing, inconsistent, and otherwise ‘bad data’ is costing you? Most organizations don’t. The costs are surprisingly high. Fortunately, improvements are achievable.

Poor-quality and inaccurate data (“bad data”) is a source of financial and brand losses for most organizations today. On average, organizations spend 3-5 times more than necessary due to bad data. Bad data results from outdated information, conflicting information, missing information, and data entry errors.

Data management best practices can significantly improve data quality. When developing data improvement tactics, consider accessibility and information security. To achieve the best results, follow proven program management: assess problems, define priorities, develop plans, implement them, and evaluate and improve progress. Evaluate resource capabilities and experience when planning; experts achieve the best results. Know your available resources and, if needed, engage consultants to provide guidance, bandwidth, and expertise.

This whitepaper explores ‘bad data’ including:
• Costs and occurrence frequencies
• Causes & remediation strategies
• Data management best practices to consider
• Strategies to achieve ROI from data management projects

Data Warehouse Consultants to Host Memorial Golf Outing

PITTSBURGH (7/5/2022) – Data Warehouse Consultants (DWC) is hosting the Bill Randall Memorial Golf Outing August 29 at Shannopin Country Club. 

Our friend Bill was taken too soon on April 5, 2021, after a long battle with Huntington’s Disease.  He led a normal life until his college years, when the disease he had since birth was first diagnosed.  He always loved playing sports and spending time with family.  Now his family is teaming up with DWC to raise awareness of this disease and to honor Bill’s memory.  

All the proceeds from the golf outing will benefit the Bill Randall Memorial Scholarship at Duquesne University and the Huntington’s Disease Society of America (HDSA).

According to the HDSA, Huntington’s Disease is a genetic brain disease caused by a mistake in the DNA instructions that build our bodies and keep them running. Over time, the damage caused to the brain leads to physical, mental and emotional deterioration. There is currently no cure.

The outing begins at Shannopin Country Club with registration and a continental breakfast at 7 a.m. An 8 a.m. shotgun start will kick off the 18 holes in scramble format. There will be prizes for various contests during the outing, including an opportunity to win a new vehicle from Classic Chevrolet for a hole-in-one. Afterwards, there will be a cookout-style lunch at the club.

Registration closes July 31 or when spots fill up.

Data Warehouse Consultants is the host and diamond sponsor of this event.  There are four other sponsorship options available for anyone who is interested. Each sponsorship level and registration option is described in detail on the Bill Randall Memorial Golf Outing website.

Data Warehouse Consultants is a consulting firm based in Pittsburgh. Founded in 2004, the firm provides custom development services for web applications, corporate databases and data warehouse environments.

County Selected for National Initiative Addressing Homelessness, Employment

The following press release was taken from Partner4Work.

PITTSBURGH – The Department of Human Services (DHS) and Partner4Work (P4W), the workforce development board for the Pittsburgh area, announced today that Allegheny County is one of seven jurisdictions selected to take part in the Pathways Forward Challenge. The Challenge, overseen by the Heartland Alliance’s National Initiatives on Poverty & Economic Opportunity, is aimed at more effectively and equitably connecting homeless and unstably housed job seekers to employment.

Allegheny County was selected based on its proposed plan to strengthen its already robust data warehouse with additional workforce data and provide analysis to more fully understand and serve the employment needs of those experiencing homelessness for the first time. DHS also plans to use this opportunity to analyze and address racial disparities within the homeless and workforce development services systems.

“We are aware that there’s an important connection between employment and housing needs, and that if you can provide stable employment for someone needing housing early enough, you can stave off potential chronic homelessness,” said Marc Cherna, DHS director. “We thank the Heartland Alliance for helping us find solutions in Allegheny County, and also allowing us to be a part of the national conversation.”

Partner4Work will collaborate with DHS in this initiative to help find ways to better meet the workforce development needs of those experiencing homelessness.

“Through this strategic partnership with Allegheny County DHS, human services and workforce development are better aligned to support some of the region’s most vulnerable citizens, breaking down barriers and opening pathways to opportunity and stability,” said Earl Buford, P4W chief executive officer.

As part of the Pathways Forward Challenge, Allegheny County will receive a $75,000 grant from the Heartland Alliance, supported by the Oak Foundation and Melville Charitable Trust, as well as technical assistance and support from a group of national peers.

“The Pathways Forward Challenge communities not only have the potential to move the needle on increasing employment for homeless job seekers but are also advancing new ideas that will add to our learning about how to create better, more equitable systems,” said Carrie Felton, project manager for Heartland Alliance’s National Center on Employment and Homelessness (NCEH).

 

# # #

Data Warehouse Consultants Releases Major Enhancements to Custom Analytics Tool

PITTSBURGH (April 25, 2019) — Data Warehouse Consultants has equipped the Allegheny County Department of Human Services with an updated Outcomes Tool application, enabling advanced analytics across more than 25 human services data sources as of February 1.

The Outcomes Tool is a web-based analytic application that allows authorized users to compile data for evaluation. Its purpose is to compile aggregate results based on the specific outcomes of individual programs and services offered by the Allegheny County Department of Human Services (DHS), a public health organization in Pittsburgh.

Data Warehouse Consultants furnished the Outcomes Tool with intuitive functionality and navigation, making it easy to use and boosting the tool’s efficiency. The tool solves two primary problems that analysts often face.

When studying the effectiveness of a program, analysts must first identify a cohort of individuals meeting specific demographic and programmatic criteria. For example, for a prevention program the analyst may need to identify clients who have had a homeless episode and also recently been in jail. This tool enables them to define the criteria they would like for the cohort across a range of available options, and to generate a list of clients. This cohort is then saved for future analysis over time.

The analyst may then want to evaluate the effect of the program, for example by seeing what percentage of that group has gone back into jail within a designated amount of time.

The problem with this analysis is that the criteria must be evaluated independently for each individual, because the release date and relevant time period are different for everyone. The first person may have been released from jail in January and the next person in March, so the window for determining whether an individual returned to jail within one year changes from person to person. The Outcomes Tool performs this analysis individual by individual, then compiles the results into a common format the analyst can use for future evaluation.

Overcoming both of these hurdles simultaneously provides DHS analysts with a powerful tool that enables consistent program evaluation over time. The release of the recent enhancements to this tool has greatly increased the range of analyses that can be performed on the DHS data warehouse, allowing DHS to enhance the quality of service and care provided to Allegheny County residents.

 

About Allegheny County Department of Human Services: DHS is responsible for providing and administering publicly-funded human services to Allegheny County residents. DHS is dedicated to meeting these human services needs, most particularly for the county’s vulnerable populations, through an extensive range of information exchange, prevention, early intervention, case management, crisis intervention and after-care services.

About Data Warehouse Consultants: Data Warehouse Consultants is a consulting firm based in Pittsburgh, PA. Founded in 2004, the firm provides custom development services for web applications, corporate databases, and data warehouse environments.

 

Data Warehouse Consultants Introduces Unique QuickCount Tool

PITTSBURGH (April 4, 2019) — With the help of Data Warehouse Consultants, the Allegheny County Department of Human Services recently launched an updated QuickCount Tool application.

QuickCount is an online, easy-to-use interface that allows journalists, policy researchers, grant writers, nonprofit leaders and community agencies to obtain counts of participants involved in the programs and services provided by the Allegheny County Department of Human Services, a public health organization located in Pittsburgh.  

Data Warehouse Consultants led the effort to update the application with a new, visually appealing layout and a clearly defined color scheme tied to the Allegheny County brand. The features and functionality conveniently guide users through the application for a pleasant user experience.

The tool is available to the public at https://quickcount.alleghenycounty.us.  Users have the option to see counts of cross-program involvement by geography and client demographics for many public programs in the County.  Information on using the tool can be found in the User’s Guide on the QuickCount Help Site.

About Allegheny County Department of Human Services: DHS is responsible for providing and administering publicly-funded human services to Allegheny County residents. DHS is dedicated to meeting these human services needs, most particularly for the county’s vulnerable populations, through an extensive range of information exchange, prevention, early intervention, case management, crisis intervention and after-care services.

About Data Warehouse Consultants: Data Warehouse Consultants is a consulting firm based in Pittsburgh, PA. Founded in 2004, the firm provides custom development services for web applications, corporate databases, and data warehouse environments.

Loading Flat Files into a Database (Method Overview)

Here we feature a few methods for loading files into an Oracle or SQL Server database. There are many ways to accomplish this, but over the past few projects, after implementing most of the methods discussed here, I documented some of my experiences. The focus is on developing an application or process to load data from files, on a schedule or on demand, when a more sophisticated tool like SSIS or Informatica is not already in use.

Oracle External Tables: When working with Oracle, I have often used external tables to assist with file loading. They are easy to set up, provide good error handling, and allow querying the data prior to loading it for validation. This has always been a favorite method for Oracle. However, on my last few projects this was not a great option. Without access to the database server, it requires coordination to set up the directory object and to get access to the files in that directory on the server. It also requires a secondary process to stage, rename, archive, and log the files on the file server. Additionally, on a project using Amazon RDS, external tables were simply not available.
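
To illustrate, a minimal sketch of the external table pattern is below. It assumes a DBA has already created the directory object (data_dir) pointing at the file location, the connection string and object names are illustrative, and the DDL would normally be run once by a DBA rather than from application code; I show it as an embedded string only so the example stays self-contained.

// Illustrative one-time DDL for an Oracle external table over a delimited file.
// Assumes: CREATE DIRECTORY data_dir AS '/path/to/files'; already exists.
// Requires: using Oracle.ManagedDataAccess.Client; oracleConnectionString is an assumption.
string externalTableDdl = @"
    CREATE TABLE demo_file_ext (
        rec_id NUMBER,
        col1   VARCHAR2(32),
        col2   VARCHAR2(32),
        col3   VARCHAR2(32),
        col4   VARCHAR2(32),
        col5   VARCHAR2(32)
    )
    ORGANIZATION EXTERNAL (
        TYPE ORACLE_LOADER
        DEFAULT DIRECTORY data_dir
        ACCESS PARAMETERS (
            RECORDS DELIMITED BY NEWLINE
            FIELDS TERMINATED BY ','
            MISSING FIELD VALUES ARE NULL
        )
        LOCATION ('demo_file.csv')
    )
    REJECT LIMIT UNLIMITED";

using (OracleConnection conn = new OracleConnection(oracleConnectionString))
using (OracleCommand cmd = new OracleCommand(externalTableDdl, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();

    // The external table can now be queried for validation before any
    // INSERT ... SELECT into the real staging table.
    cmd.CommandText = "SELECT COUNT(*) FROM demo_file_ext WHERE rec_id IS NULL";
    decimal badRows = Convert.ToDecimal(cmd.ExecuteScalar());
}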

SQL Server Bulk Methods (BULK INSERT / OPENROWSET): Similar to Oracle external tables, these methods provide a fast and easy way to get data into the database. Again, however, they require special permissions and/or access to the database server. After going down this path on a recent project, the method worked fairly well, but without access to the database server it was difficult to coordinate getting the files to the server. When using a file share on a remote server, it was difficult to get the necessary permissions on both the remote server (file share) and the database server. It is possible, but with sensitive data it requires extra coordination and permissions, some of which are not desirable. I also find it difficult to do detailed error handling and logging if the file does not arrive in the expected format. This method also requires a secondary process to stage, rename, archive, and log the files on the file server.
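
For reference, a minimal sketch of what the BULK INSERT call looks like when issued from client code is below. The connection string, table, and UNC path are illustrative, and the path must be readable by the SQL Server service account, which is exactly the permissions and coordination issue described above.

// Minimal sketch: issuing BULK INSERT from client code.
// Requires: using System.Data.SqlClient; connectionString and the UNC path are assumptions.
string bulkSql = @"
    BULK INSERT dbo.DemoFileUpload
    FROM '\\fileserver\inbound\demo_file.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)";

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(bulkSql, conn))
{
    conn.Open();
    // The file path is resolved on the database server, not on the client machine.
    cmd.ExecuteNonQuery();
}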

SSIS and Other Tools: If SSIS (or another tool) is available, I find it to be a great method for loading files quickly into the database. Most of my experience is with SSIS to SQL Server, but I know other tools have similar capabilities. Typically, if a client is already working with SSIS and/or Informatica, the client has processes in place for handling the files. If not, there is often a lot of cost and overhead to set up this process from scratch. For our purposes, let's assume the developer is not already working with one of these tools at the client site.

I have had success loading data with each method described thus far. Other than SSIS (or other third-party tools), they all require a separate application or scripting to handle the actual files: something that reads the file, performs validations and logging, moves/renames files, and archives them when complete. This is sometimes done through scripting, but often requires a more in-depth approach for logging and auditing the actions. Additionally, all of the methods discussed thus far require extra coordination, permissions, and/or access. There are other bulk methods I have seen or used that are not mentioned here; most of them were for one-time loads of large data sets, an initial migration, a large data transfer, or imports/exports.

C# Application: This leads me to an application-based method. My preference on the last couple of projects has been to use a C# application to handle the file moving/renaming/logging, file parsing, and data loading. This method is easily scheduled, configured, and migrated. It typically requires little to no additional coordination, software, or permissions. One fairly simple, self-contained application can handle the processing. Additionally, the application is very similar whether running against Oracle or SQL Server (cloud or on-premises). The data loading aspects are also easily converted to web applications that require users to upload data to the database.

Once an application template is designed for logging and error handling, it is fairly simple to expand the application to handle numerous formats in the same manner. This method has been efficient, easy to implement, and easy to reuse between various projects. Within the C# application I have used various options to load the data and found that some are much more efficient than others. The next article describes the methods we have used within our C# applications.

Loading Flat Files into a Database with C#

This article features some methods that I have used to load data within a C# application. The methods discussed are an insert loop (with bind parameters), the DataAdapter.Update() method, and a custom method of passing the table to a stored procedure. For this article, a simple website demo was set up that accepted a 6-column CSV file upload and parsed it into a DataTable (int, string, string, string, string, string). Every X records, that DataTable was inserted into a cloud-hosted remote database.
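
For context, the parsing side of that demo looks roughly like the sketch below. The file path, the batch size value, and the SendBatch helper (which wraps whichever load method is being timed) are assumptions for illustration; the files are assumed to have no header row.

// Parse the uploaded CSV into a DataTable and flush it every X records.
// Requires: using System; using System.Data; using System.IO;
DataTable p_DT = new DataTable("DemoFileUpload");
p_DT.Columns.Add("RecID", typeof(int));
for (int i = 1; i <= 5; i++)
    p_DT.Columns.Add("Col" + i, typeof(string));

int batchSize = 5000;   // "every X records", read from configuration in the real application
foreach (string line in File.ReadLines(uploadedFilePath))   // hypothetical path
{
    string[] fields = line.Split(',');   // demo files contain no embedded commas
    p_DT.Rows.Add(Convert.ToInt32(fields[0]),
                  fields[1], fields[2], fields[3], fields[4], fields[5]);

    if (p_DT.Rows.Count >= batchSize)
    {
        SendBatch(p_DT);   // hypothetical helper: insert loop, DataAdapter, or stored procedure call
        p_DT.Clear();
    }
}
if (p_DT.Rows.Count > 0)
    SendBatch(p_DT);       // flush the final partial batch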

For the purpose of comparison, the SQL Server BULK INSERT method was also used (note: the time for the BULK INSERT method below is measured after moving the files to the database server). The SQL Server portion of this demo was done on an Amazon EC2 instance, so BULK INSERT was an option. It does take a small amount of time to get the file to the remote server, but this time is negligible for the purpose of the comparison. If access and permissions to the database server are not an issue, this certainly highlights that the bulk methods are typically much faster than the methods being discussed in this article. If you are processing very large files, if you are processing them very often, if access to the database server is an option, and if a repeatable process can be written to move the files to the database server, this simple demo illustrates that the bulk methods are much more efficient. However, there are a lot of ifs in that statement.

For fairly large files, repeatable processes, and simple C# solutions, this comparison also highlights that the custom table-to-stored-procedure method is a decent, cost-effective solution in many scenarios.

The timings from my demonstration, for each of the methods, are shown in the table below:

Method                            Record Counts                      Estimated Time
Insert Loop                       20K records – 5K batches           9 minutes
DataAdapter.Update()              20K records – 5K batches           35 seconds
DataAdapter.Update()              100K records – 5K batches          3 minutes
Stored Procedure (Input Table)    20K records – 5K batches           7 seconds
Stored Procedure (Input Table)    100K records – 5K batches          35 seconds
Stored Procedure (Input Table)    500K records – 5K batches          3 minutes
Stored Procedure (Input Table)    1 million records – 5K batches     6 minutes
BULK INSERT                       20K records                        < 1 second
BULK INSERT                       100K records                       2 seconds
BULK INSERT                       500K records                       8 seconds
BULK INSERT                       1 million records                  15 seconds

Insert Loop: A very common method that I have seen and used is a simple insert loop with bind variables. It provides easy control over the data and easy error handling. It is also very slow if the database is remote. I have seen many developers (including myself) use this method in a local development environment, only to have the process grind to a halt in another environment where the database is remote or cloud-based. I have successfully used this method to load hundreds and even thousands of rows. However, it does not scale well when loading tens of thousands, hundreds of thousands, or millions of rows to a remote database. For small files and/or a local database, this method is easy to implement and provides good control over the data.


string mySql = @"
INSERT INTO DemoFileUpload(RecID, Col1, Col2, Col3, Col4, Col5)
VALUES (@RecID, @Col1, @Col2, @Col3, @Col4, @Col5)
";
using (SqlConnection myConn = new SqlConnection())
using (SqlCommand myCmd = new SqlCommand())
{
myConn.ConnectionString = db.GetDBConn1();
myCmd.CommandText = mySql;
myCmd.CommandType = CommandType.Text;
myCmd.Connection = myConn;
myConn.Open();

// Add the command parameters
myCmd.Parameters.Add(“@RecID”, SqlDbType.Int);
myCmd.Parameters.Add(“@Col1”, SqlDbType.NVarChar, 32);
myCmd.Parameters.Add(“@Col2”, SqlDbType.NVarChar, 32);
myCmd.Parameters.Add(“@Col3”, SqlDbType.NVarChar, 32);
myCmd.Parameters.Add(“@Col4”, SqlDbType.NVarChar, 32);
myCmd.Parameters.Add(“@Col5”, SqlDbType.NVarChar, 32);

foreach (DataRow myRow in p_DT.Rows)
{
// Set the params
myCmd.Parameters[“@RecID”].Value = Convert.ToInt32(myRow[“RecID”]);
myCmd.Parameters[“@Col1”].Value = myRow[“Col1”].ToString();
myCmd.Parameters[“@Col2”].Value = myRow[“Col2”].ToString();
myCmd.Parameters[“@Col3”].Value = myRow[“Col3”].ToString();
myCmd.Parameters[“@Col4”].Value = myRow[“Col4”].ToString();
myCmd.Parameters[“@Col5”].Value = myRow[“Col5”].ToString();

// Execute the insert
myCmd.ExecuteNonQuery();
}

myConn.Close();
}

Data Adapter: The DataAdapter.Update() method is a very simple way to transfer data from a DataTable to the database. However, in more complex scenarios or for larger datasets, I often spend a lot of time researching the various options to tune it so that it works efficiently. I also typically spend a lot of time reviewing the magic behind this method, and I prefer to have greater control. While I have had success using this method, as the data grows larger I find that I must customize the process, whereas the next methods work for both larger and smaller sets. I find that DataAdapter.Update() greatly outperforms the insert loop and greatly underperforms the methods described next.

using (SqlConnection myConn = new SqlConnection())
{
    myConn.ConnectionString = db.GetDBConn1();

    // Select an empty result set so the adapter picks up the table structure; all rows are inserts
    SqlDataAdapter myDA = new SqlDataAdapter("SELECT * FROM DemoFileUpload WHERE 1=2", myConn);
    SqlCommandBuilder myCB = new SqlCommandBuilder(myDA);   // generates the INSERT command
    myDA.MissingSchemaAction = MissingSchemaAction.Add;
    myDA.UpdateBatchSize = myUpdateBatchSize;               // batch size from configuration
    myDA.Update(p_DT);
}

 

DataTable to Stored Procedure: The methods I have used on the last couple of projects pass larger amounts of data to the database while giving control over how much data is sent and logging those actions. I find these methods perform reasonably well compared to the bulk methods for moderate amounts of data (albeit still considerably slower), but they are easier to integrate into a self-contained application. While these methods certainly aren't as fast as most of the bulk methods I have used, they are fairly efficient, especially given the simplicity of setup and migration. I also find them simple to understand, set up, implement, and replicate across multiple projects.

First, the input file is parsed, line by line, into a DataTable; this is typical of every method I use. Every X rows (as set by configuration), the DataTable is sent to the database and processed as a set.

If working with SQL Server, I maintain a user-defined table type in the database with that same structure, along with a stored procedure that takes in that table type and performs an insert/select from the input table into the target table. I then pass the DataTable from the C# application to the stored procedure to insert the data.

string mySql = "[dbo].[spDemoFileUploadInsert]";

using (SqlConnection myConn = new SqlConnection())
using (SqlCommand myCmd = new SqlCommand())
{
    myConn.ConnectionString = db.GetDBConn1();
    myCmd.CommandText = mySql;
    myCmd.CommandType = CommandType.StoredProcedure;
    myCmd.Connection = myConn;
    myConn.Open();

    // Pass the entire DataTable as a table-valued parameter
    myCmd.Parameters.Add("@tblInput", SqlDbType.Structured).Value = p_DT;
    myCmd.ExecuteNonQuery();
    myConn.Close();
}
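
For reference, the supporting database objects behind that call look roughly like the sketch below. The procedure name, the @tblInput parameter, and the column sizes match the demo code above; the table type name (dbo.DemoFileUploadType) is my assumption, and the DDL would normally be run once in SSMS rather than from the application; it is shown as embedded strings here only to keep the example in one place.

// One-time setup sketch for the table-type method (run once per database).
string createTypeSql = @"
    CREATE TYPE dbo.DemoFileUploadType AS TABLE
    (
        RecID INT,
        Col1 NVARCHAR(32), Col2 NVARCHAR(32), Col3 NVARCHAR(32),
        Col4 NVARCHAR(32), Col5 NVARCHAR(32)
    )";

string createProcSql = @"
    CREATE PROCEDURE dbo.spDemoFileUploadInsert
        @tblInput dbo.DemoFileUploadType READONLY
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Set-based insert/select from the table-valued parameter into the target table
        INSERT INTO dbo.DemoFileUpload (RecID, Col1, Col2, Col3, Col4, Col5)
        SELECT RecID, Col1, Col2, Col3, Col4, Col5
        FROM @tblInput;
    END";

using (SqlConnection conn = new SqlConnection(db.GetDBConn1()))
using (SqlCommand cmd = new SqlCommand(createTypeSql, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();
    cmd.CommandText = createProcSql;
    cmd.ExecuteNonQuery();
}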

 

If working with Oracle, the process is similar. Rather than taking in a table type parameter, I create a stored procedure that accepts a CLOB. In the C# application, I convert the DataTable to an XML string and pass that string as a CLOB into the stored procedure. In the stored procedure, I again do a simple insert/select from the input CLOB using Oracle XMLTYPE.

// Serialize the DataTable to an XML string
// (requires System.IO and Oracle.ManagedDataAccess.Client)
string myXMLRec;
using (StringWriter sw = new StringWriter())
{
    p_DT.WriteXml(sw);
    myXMLRec = sw.ToString();
}

using (OracleConnection myConn = new OracleConnection())
using (OracleCommand myCmd = new OracleCommand())
{
    myConn.ConnectionString = db.GetDBConn1();
    myCmd.Connection = myConn;
    myCmd.CommandText = "MyPackage.MyProcedure";
    myCmd.CommandType = CommandType.StoredProcedure;

    // Pass the XML string as a CLOB parameter
    myCmd.Parameters.Add("p_XmlRec", OracleDbType.Clob).Value = myXMLRec;

    myConn.Open();
    myCmd.ExecuteNonQuery();
    myConn.Close();
}
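
For reference, the Oracle side can look roughly like the sketch below. I show it as a standalone procedure rather than the packaged procedure called above, the column sizes mirror the SQL Server demo, and the XMLTABLE path assumes the DataTable is named DemoFileUpload (WriteXml emits /DocumentElement/<TableName> row elements); all object names are illustrative, and the DDL would normally be run once from a SQL client rather than from the application.

// One-time setup sketch (requires using Oracle.ManagedDataAccess.Client;).
// The procedure shreds the incoming CLOB with XMLTABLE and insert/selects it into the target table.
string createProcSql = @"
    CREATE OR REPLACE PROCEDURE DemoFileUploadInsert (p_XmlRec IN CLOB)
    AS
    BEGIN
        INSERT INTO DemoFileUpload (RecID, Col1, Col2, Col3, Col4, Col5)
        SELECT x.RecID, x.Col1, x.Col2, x.Col3, x.Col4, x.Col5
        FROM XMLTABLE('/DocumentElement/DemoFileUpload'
                 PASSING XMLTYPE(p_XmlRec)
                 COLUMNS RecID NUMBER       PATH 'RecID',
                         Col1  VARCHAR2(32) PATH 'Col1',
                         Col2  VARCHAR2(32) PATH 'Col2',
                         Col3  VARCHAR2(32) PATH 'Col3',
                         Col4  VARCHAR2(32) PATH 'Col4',
                         Col5  VARCHAR2(32) PATH 'Col5') x;
    END;";

using (OracleConnection conn = new OracleConnection(db.GetDBConn1()))
using (OracleCommand cmd = new OracleCommand(createProcSql, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();
}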

 

While these methods are slower than the built-in bulk methods, I have found them to be simple, repeatable, and fairly efficient, while supporting the logging and control required in most of my projects. They are also easily configurable with respect to batch size: the batch can be increased or decreased from an application configuration setting based on performance in a specific environment. While these methods depend on a stored procedure, I typically have stored procedures that do the subsequent processing of the staged data in the database anyway, so they integrate nicely with the overall process.

Prototype Review of Azure SQL Data Warehouse

Data Warehouse Consultants is constantly investing in our staff to learn new technologies and offerings.  The summary below was published by our analyst Mukesh Thangadurai regarding the use of the Azure SQL Data Warehouse, which is Microsoft’s current cloud offering focusing on the data warehousing space.  For more information on this prototype or similar work we can do for your company, please use our Contact Form.

Environment

  • Microsoft Azure SQL Data Warehouse
  • Microsoft SQL Server 2016
  • Microsoft SQL Server Integration Services

Performance / Costs

  1. DWU 100 – 1.51 USD/hour (minimum)
  2. DWU 400 – 6.05 USD/hour (default)

Implementation 1: Transportation Data

  1. Data Set details
  • 20 Dimension Tables
  • 8 Fact Tables (200+ million records)
  2. The flat files for this implementation were obtained by connecting to a source SQL Server database and converting tabular data to flat files.
  3. These flat files were then loaded to the Azure blob using SSIS.
  4. Polybase scripts were written to convert these flat files into a relational, tabular format (a sketch follows this list).
  5. TSQL procedures were then written to complete the ETL and convert the data into the format required by the target tables.
  6. These TSQL procedures can also be triggered using SSIS.
  7. We were also able to test incremental loading scenarios, where new files arrive and old files are archived out.
  8. The entire process, including the steps above, was completed using SSIS.
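
The Polybase step in item 4 follows a fairly standard pattern: define an external data source over the blob storage account, an external file format, and an external table over the flat files, then materialize the data into a distributed table with CTAS. A rough sketch is below; the object names, columns, and storage location are hypothetical, a database master key and a database-scoped credential for the storage account are assumed to already exist, and while the prototype ran these scripts through SSIS, here they are shown as T-SQL strings executed from a small C# helper for illustration.

// Illustrative Polybase setup for Azure SQL Data Warehouse (all names are hypothetical).
// Requires: using System.Data.SqlClient; dwConnectionString is an assumption.
string[] polybaseDdl =
{
    @"CREATE EXTERNAL DATA SOURCE AzureBlobSrc
      WITH (TYPE = HADOOP,
            LOCATION = 'wasbs://container@account.blob.core.windows.net',
            CREDENTIAL = AzureStorageCredential)",

    @"CREATE EXTERNAL FILE FORMAT CsvFileFormat
      WITH (FORMAT_TYPE = DELIMITEDTEXT,
            FORMAT_OPTIONS (FIELD_TERMINATOR = ','))",

    // External table: a schema over the flat files; no data is moved yet.
    @"CREATE EXTERNAL TABLE dbo.FactTripsExternal
      (TripID INT, TripDate DATE, Fare DECIMAL(10,2))
      WITH (LOCATION = '/fact_trips/',
            DATA_SOURCE = AzureBlobSrc,
            FILE_FORMAT = CsvFileFormat)",

    // CTAS materializes the flat-file data as a distributed columnstore table.
    @"CREATE TABLE dbo.FactTrips
      WITH (DISTRIBUTION = HASH(TripID), CLUSTERED COLUMNSTORE INDEX)
      AS SELECT * FROM dbo.FactTripsExternal"
};

using (SqlConnection conn = new SqlConnection(dwConnectionString))
{
    conn.Open();
    foreach (string ddl in polybaseDdl)
    {
        using (SqlCommand cmd = new SqlCommand(ddl, conn))
        {
            cmd.ExecuteNonQuery();
        }
    }
}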

Implementation 2: Consumer Data

  1. Data Set details
  • 8 Dimension Tables
  • 1 Fact Table (168 million records)
  2. The flat files for dimension and fact data were already in a structured, delimited format.
  3. Since no data cleansing or transformation was required, dimension data was uploaded straight to the data warehouse using the AzureDWUploadTask function in SSIS.
  4. This function auto-creates a temporary external table and deletes it after the actual tables are loaded.
  5. The data in the fact table was uploaded in phases using regular data uploads.
  6. Uploading the 168 million records took 45 minutes, and the procedure that moves the data into the fact table took another 35 minutes.
  7. The entire data upload was handled using SSIS.

 

Observations

Detailed benchmark data was captured with various queries tested on this system at both DWU100 and DWU400:

  1. Query performance for intensive queries improved by at least 40% when scaled up from DWU100 to DWU400.
  2. Scaling was effective and took less than 10 minutes to scale up or down.
  3. Azure SQL DWH has some limitations in DDL and doesn’t support primary keys.
  4. The compatibility with SSIS is very tight and the entire process can be orchestrated using SSIS.
  5. There are some limitations, such as the lack of primary keys and the unavailability of certain functions such as @@ROWCOUNT; a separate limitations document has been created to capture these.  Even so, this still seems like a strong solution: an upgrade over conventional databases without drifting too far from them.
  6. The data warehouse provides an option to pause the system when not in use, drastically minimizing operations costs.
  7. Azure SQL DWH enjoys great compatibility with SSIS and the entire implementation above was completed using SSIS.
  8. Polybase does not support DML and depends on PDW external tables.  It also does not skip header rows.
  9. In external tables, default values and constraints are not allowed.
  10. In regular tables, default values in column definitions are limited to literals and constants.
  11. Non-deterministic expressions or functions such as GETDATE() and CURRENT_TIMESTAMP are not supported.
  12. There is significant caching, and re-running the queries back-to-back shows improved performance.
  13. The following T-SQL operations are limited or not supported.  This can require substantial T-SQL code changes, including the use of temp tables or CTAS statements.
  • ANSI JOINs on UPDATEs
  • ANSI JOINs on DELETEs
  • WITH clauses / Common table expressions followed by INSERT, UPDATE, DELETE or MERGE
  14. Identity / sequence fields on tables are not supported in REPLICATED tables.

Recommendations

Azure SQL DWH was a perfect solution for the above requirement, and scaling helped control expenses: the system was set to DWU100 for data uploads and dimension queries and scaled up to DWU400 when running large queries.  It would also suit organizations that want to test a move to the cloud while still keeping a backup option for recovery scenarios.

Our Experience with Azure HD Insight

Data Warehouse Consultants is constantly investing in our staff to learn new technologies and offerings.  The summary below was published by our analyst Mukesh Thangadurai regarding the use of the Azure HD Insight product, which is Microsoft’s current Hadoop offering.  For more information on this prototype or similar work we can do for your company, please use our Contact Form.

Environment

  • Microsoft Azure HDInsight
  • Microsoft SQL Server Integration Services

Data Set details:

  • 8 Dimension Tables
  • 1 Fact Table (168 million records)

Pricing

Excluding data storage costs

2 master, 2 slave architecture: $4.01/hour

2 master, 4 slave architecture: $6.59/hour

Implementation Notes

  1. The flat files for dimension and fact data were already in a structured, delimited format.
  2. An Azure HDInsight cluster gives the advantage of accessing files straight from the Azure blob, without requiring any actual loading onto the servers in the cluster.  This saves a great deal of time, as loading terabytes of data onto the cluster would be a time-consuming task.
  3. The data for the fact and dimension tables was already uploaded using SSIS.
  4. To upload 168 million records to the blob, it took 45 minutes.
  5. The entire data upload was done using SSIS.
  6. An HDInsight cluster can be created in two ways.
    1. Option 1: Create it from the Azure portal using the UI.
    2. Option 2: Create it from SSIS using the Create HDInsight Cluster function.
  7. We tried both methods, and they took almost the same time to create the cluster.
  8. Microsoft sets the default number of master nodes to 2 and gives the flexibility of choosing the number of slave nodes.  In our implementation, we tried 2 slave nodes and 4 slave nodes in separate implementations.
  9. Once the clusters were created, the DDL for creating the Hive tables was run.  These DDL scripts specify the location of the files in the Azure blob (a sketch follows this list).
  10. We also had the option of creating user-defined partitions, which can be established by marking each folder of data as a separate partition.
  11. We implemented the prototype using both styles; however, most testing was done with unpartitioned data.
  12. Once the dataset was created, we used HiveQL to query the data.
  13. We could query this data either from SQL Server or from the cluster dashboard.
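
The Hive DDL mentioned in item 9 follows the usual external-table pattern: point a table definition at files that already sit in the blob container. A hedged sketch is below; the table, columns, and storage location are hypothetical, and while the prototype ran these scripts from the cluster dashboard and SSIS, the statement is shown here submitted through the HDInsight Hive ODBC driver so the example stays self-contained.

// Illustrative only; requires using System.Data.Odbc; and the Microsoft Hive ODBC driver.
// hiveOdbcConnectionString is an assumption (DSN or connection string for the cluster).
string hiveDdl = @"
    CREATE EXTERNAL TABLE fact_sample (
        rec_id INT,
        col1   STRING,
        col2   STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'wasbs://container@account.blob.core.windows.net/fact_sample/'";

using (OdbcConnection conn = new OdbcConnection(hiveOdbcConnectionString))
using (OdbcCommand cmd = new OdbcCommand(hiveDdl, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();   // the table is only metadata; the data stays in the blob

    // The data can then be queried with HiveQL, for example:
    cmd.CommandText = "SELECT COUNT(*) FROM fact_sample";
    object rowCount = cmd.ExecuteScalar();
}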

Key Observations:

  1. Query performance significantly improved when upgraded from 2 slave nodes to 4 slave nodes.
  2. Unlike the conventional Hadoop architecture, Azure defaults the number of master nodes to 2 instead of 1.
  3. Azure offers two options for storing metadata.
    1. Option 1: In a separate database of your choice.
    2. Option 2: In the same location where the cluster is built.
  4. The second option is the default; its drawback is that the metadata is purged when the cluster is deleted.
  5. Unlike Azure SQL Data Warehouse, there is no option to pause the cluster when it is not in use, and charges accrue for every hour the cluster exists.
  6. Hence, the best way to use the cluster is to have all the required data already uploaded to the blob and the DDL scripts and analysis queries complete, then create the cluster, run the DDL and analysis scripts, and delete the cluster.
  7. The entire task of uploading data to the blob, creating the cluster, running the required DDL, and deleting the cluster can be orchestrated using SSIS.
  8. There were some expected restrictions, such as DML not being supported, and some unexpected limitations, such as the inability to use subqueries for comparison in the WHERE clause.

Recommendations

Azure HDInsight feels like a true big data solution, more suited for high-end analytics than for a traditional data warehouse. The clusters cost a minimum of roughly $4/hour for the 2-slave-node configuration and $6/hour for the more efficient 4-slave-node configuration. This cost structure seems better suited for high-end, temporary data analytics than for a traditional, frequently queried data warehouse.