
Loading Flat Files into a Database (Method Overview)

Here we feature a few methods for loading files into an Oracle or SQL Server database. There are many ways to accomplish this, but over the past few projects, after implementing most of the methods discussed here, I documented some of my experiences. The focus is on developing an application or process to load data from files, on a schedule or on demand, when a more sophisticated tool like SSIS or Informatica is not already in use.

Oracle External Tables: When working with Oracle, I have often used external tables to assist with file loading. They are easy to set up, provide good error handling, and allow the data to be queried and validated before it is loaded. This has always been a favorite method for Oracle. However, on my last few projects it was not a great option. Without access to the database server, it requires coordination to set up the directory and to get the files into that directory on the server. It also requires a secondary process to stage, rename, archive, and log the files on the file server. Additionally, on a project using Amazon RDS, external tables were simply not available.
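
For reference, a bare-bones external table definition looks something like the sketch below. The directory, table, column, and file names are illustrative assumptions, not from a specific project; creating the directory object and placing the file in it is exactly the server access and coordination described above.

CREATE OR REPLACE DIRECTORY file_load_dir AS '/data/inbound';

CREATE TABLE demo_file_upload_ext (
    rec_id NUMBER,
    col1   VARCHAR2(32),
    col2   VARCHAR2(32),
    col3   VARCHAR2(32)
)
ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY file_load_dir
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        BADFILE file_load_dir:'demo_file_upload.bad'
        LOGFILE file_load_dir:'demo_file_upload.log'
        FIELDS TERMINATED BY ','
        MISSING FIELD VALUES ARE NULL
    )
    LOCATION ('demo_file_upload.csv')
)
REJECT LIMIT UNLIMITED;

-- The data can then be validated and loaded with plain SQL, for example:
-- INSERT INTO demo_file_upload SELECT * FROM demo_file_upload_ext;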

SQL Server Bulk Methods (BULK INSERT / OPENROWSET): Similar to Oracle external tables, these methods provide a fast and easy way to get data into the database. Again, however, they require special permissions and/or access to the database server. After going down this path on a recent project, the method worked fairly well, but without access to the database server it was difficult to coordinate getting the files there. When using a file share on a remote server, it was also difficult to obtain the permissions needed on both the remote server (file share) and the database server. It is possible, but with sensitive data it requires extra coordination and permissions, some of which are not desirable. I also find it difficult to do detailed error handling and logging if the file does not arrive in the expected format. Like external tables, this approach requires a secondary process to stage, rename, archive, and log the files on the file server.
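
For reference, a minimal BULK INSERT call is sketched below. The table, share, and file names are illustrative; the key point is that the path must be readable from the database server itself, which is the access and permissions issue described above.

BULK INSERT dbo.DemoFileUpload
FROM '\\fileserver\inbound\demo_file_upload.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,   -- skip the header row
    TABLOCK
);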

SSIS and other tools: If SSIS (or a similar tool) is available, I find it a great way to load files quickly into the database. Most of my experience is with SSIS and SQL Server, but I know other tools have similar capabilities. Typically, if a client is already working with SSIS and/or Informatica, the client has processes in place for handling the files. If not, there is often a lot of cost and overhead to set up this process from scratch. For our purposes, let's assume the developer is not already working with one of these tools at the client site.

I have had success loading data with each method described thus far. Other than SSIS (or other third-party tools), they all require a separate application or script to handle the actual files: something that reads the file, performs validations, logs the results, moves and renames files, and archives them when complete. This is sometimes done through scripting, but it often requires a more in-depth approach for logging and auditing the actions. Additionally, all of the methods discussed so far require additional coordination, permissions, and/or access. There are other bulk methods I have seen or used that are not mentioned here; most of them were for a one-time load of large data, an initial migration, a large data transfer, or imports/exports.

C# Application: This leads me to an application-based method. My preference over the last couple of projects has been to use a C# application to handle the file moving/renaming/logging, file parsing, and data loading. This method is easily scheduled, configured, and migrated. It typically requires little to no additional coordination, software, or permissions. One fairly simple, self-contained application can handle the processing. Additionally, the application is very similar whether it runs against Oracle or SQL Server (cloud or on-premises). The data-loading pieces are also easily converted to web applications that let users upload data to the database.

Once an application template is designed for logging and error handling, it is fairly simple to expand the application to handle numerous formats in the same manner. This method has been efficient, easy to implement, and easy to reuse across projects. Within the C# application I have used various options to load the data and found that some are much more efficient than others. The next article describes the methods we have used within our C# applications.

Loading Flat Files into a Database with C#

This article features some methods that I have used to load data within a C# application. The methods discussed are an insert loop (with bind parameters), the DataAdapter.Update() method, and a custom method of passing a table to a stored procedure. For this article, a simple demo website was set up that accepted a six-column CSV file upload and parsed it into a DataTable (int, string, string, string, string, string). Every X records, that DataTable was inserted into a cloud-hosted remote database.

For the purpose of comparison, the SQL Server BULK INSERT method was also used (NOTE: the BULK INSERT times below were measured AFTER moving the files to the database server). The SQL Server portion of this demo was done on an Amazon EC2 instance, so BULK INSERT was an option. It does take a small amount of time to get the file to the remote server, but that time is negligible for the purpose of the comparison. If access and permissions to the database server are not an issue, the comparison certainly highlights that the bulk method(s) are typically much faster than the methods discussed in this article. If you are processing very large files, processing them often, have access to the database server, and can write a repeatable process to move the files to that server, this simple demo illustrates that the bulk methods are much more efficient. However, there are a lot of "ifs" in that statement.

For fairly large files, repeatable processes, and simple C# solutions, this comparison also highlights that the custom table to stored procedure method is a pretty decent and cost-effective solution in many scenarios.

The timings from my demonstration, for each of the methods, are shown in the table below:

Method                           Record Counts                   Estimated Time
Insert Loop                      20K records, 5K batches         9 minutes
DataAdapter.Update()             20K records, 5K batches         35 seconds
DataAdapter.Update()             100K records, 5K batches        3 minutes
Stored Procedure (Input Table)   20K records, 5K batches         7 seconds
Stored Procedure (Input Table)   100K records, 5K batches        35 seconds
Stored Procedure (Input Table)   500K records, 5K batches        3 minutes
Stored Procedure (Input Table)   1 million records, 5K batches   6 minutes
BULK INSERT                      20K records                     < 1 second
BULK INSERT                      100K records                    2 seconds
BULK INSERT                      500K records                    8 seconds
BULK INSERT                      1 million records               15 seconds

Insert Loop: A very common method that I have seen and used is a simple insert loop with bind variables. It provides easy control over the data and easy error handling. It is also very slow if the database is remote. I have seen many developers (including myself) use this method in a local development environment, only to have the process grind to a halt in another environment where the database is remote or cloud-based. I have successfully used this method to load hundreds, and even thousands, of rows. However, it does not scale well when loading tens of thousands, hundreds of thousands, or millions of rows to a remote database. For small files and/or a local database, this method is easy to implement and provides good control over the data.


string mySql = @"
    INSERT INTO DemoFileUpload (RecID, Col1, Col2, Col3, Col4, Col5)
    VALUES (@RecID, @Col1, @Col2, @Col3, @Col4, @Col5)";

using (SqlConnection myConn = new SqlConnection())
using (SqlCommand myCmd = new SqlCommand())
{
    myConn.ConnectionString = db.GetDBConn1();
    myCmd.CommandText = mySql;
    myCmd.CommandType = CommandType.Text;
    myCmd.Connection = myConn;
    myConn.Open();

    // Add the command parameters
    myCmd.Parameters.Add("@RecID", SqlDbType.Int);
    myCmd.Parameters.Add("@Col1", SqlDbType.NVarChar, 32);
    myCmd.Parameters.Add("@Col2", SqlDbType.NVarChar, 32);
    myCmd.Parameters.Add("@Col3", SqlDbType.NVarChar, 32);
    myCmd.Parameters.Add("@Col4", SqlDbType.NVarChar, 32);
    myCmd.Parameters.Add("@Col5", SqlDbType.NVarChar, 32);

    foreach (DataRow myRow in p_DT.Rows)
    {
        // Set the params
        myCmd.Parameters["@RecID"].Value = Convert.ToInt32(myRow["RecID"]);
        myCmd.Parameters["@Col1"].Value = myRow["Col1"].ToString();
        myCmd.Parameters["@Col2"].Value = myRow["Col2"].ToString();
        myCmd.Parameters["@Col3"].Value = myRow["Col3"].ToString();
        myCmd.Parameters["@Col4"].Value = myRow["Col4"].ToString();
        myCmd.Parameters["@Col5"].Value = myRow["Col5"].ToString();

        // Execute the insert
        myCmd.ExecuteNonQuery();
    }

    myConn.Close();
}

Data Adapter: The DataAdapter.Update() method is a very simple way to transfer data from the DataTable to the database. However, in more complex scenarios or for larger datasets, I often spend a lot of time researching the various options to tune it so that it works efficiently. I also spend a lot of time unraveling the magic of this method, and I prefer to have greater control. While I have had success using it, as the data grows larger I find that I must customize the process, whereas the next method works for both larger and smaller sets. Overall, the DataAdapter.Update() method greatly outperforms the insert loop and greatly underperforms the methods described next.

using (SqlConnection myConn = new SqlConnection())
{
    myConn.ConnectionString = db.GetDBConn1();

    // Initialize with an empty set structure: All inserts
    SqlDataAdapter myDA = new SqlDataAdapter("SELECT * FROM DemoFileUpload WHERE 1=2", myConn);
    SqlCommandBuilder myCB = new SqlCommandBuilder(myDA);
    myDA.MissingSchemaAction = MissingSchemaAction.Add;
    myDA.UpdateBatchSize = myUpdateBatchSize;
    myDA.Update(p_DT);
}


DataTable to Stored Procedure: The methods I have used on the last couple of projects pass larger amounts of data to the database, while giving control over how much data is sent and logging those actions. I find they perform somewhat comparably to the bulk methods for reasonable amounts of data (albeit noticeably slower), but they are easier to integrate into a self-contained application. While these methods certainly aren't as fast as most of the bulk methods I have used, they are fairly efficient, especially given the simplicity of setup and migration. I also find them simple to understand, set up, implement, and replicate across multiple projects.

First, the input file is parsed, line by line, into a DataTable; this is typical of every method I use. Every X rows (as set by configuration), the DataTable is sent to the database and processed as a set.

If working with SQL Server, I maintain a user-defined table type in the database with that same structure, and a stored procedure that takes in that table type and does an insert/select from the input table into the target table. I then pass the DataTable from the C# application to the stored procedure to insert the data.

string mySql = "[dbo].[spDemoFileUploadInsert]";
using (SqlConnection myConn = new SqlConnection())
using (SqlCommand myCmd = new SqlCommand())
{
    myConn.ConnectionString = db.GetDBConn1();
    myCmd.CommandText = mySql;
    myCmd.CommandType = CommandType.StoredProcedure;
    myCmd.Connection = myConn;
    myConn.Open();

    // Add the command parameters
    myCmd.Parameters.Add("@tblInput", SqlDbType.Structured).Value = p_DT;
    myCmd.ExecuteNonQuery();
    myConn.Close();
}
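
For completeness, the database-side objects behind this call look roughly like the sketch below. The procedure name, parameter name, and DemoFileUpload columns come from the demo above; the name of the table type is an assumption for this sketch.

-- User-defined table type matching the structure of the parsed DataTable
CREATE TYPE dbo.DemoFileUploadTableType AS TABLE
(
    RecID INT,
    Col1  NVARCHAR(32),
    Col2  NVARCHAR(32),
    Col3  NVARCHAR(32),
    Col4  NVARCHAR(32),
    Col5  NVARCHAR(32)
);
GO

-- Stored procedure that accepts the table type and inserts it as a single set
CREATE PROCEDURE dbo.spDemoFileUploadInsert
    @tblInput dbo.DemoFileUploadTableType READONLY
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.DemoFileUpload (RecID, Col1, Col2, Col3, Col4, Col5)
    SELECT RecID, Col1, Col2, Col3, Col4, Col5
    FROM @tblInput;
END;
GO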


If working with Oracle, the process is similar. Rather than taking in a table type parameter, I create a stored procedure that accepts a CLOB. In the C# application, I convert the DataTable to an XML string and pass that string as a CLOB into the stored procedure. In the stored procedure, I again do a simple insert/select from the input CLOB using Oracle XMLTYPE.

string myXMLRec;
using (StringWriter sw = new StringWriter())
{
    p_DT.WriteXml(sw);
    myXMLRec = sw.ToString();
}

using (OracleConnection myConn = new OracleConnection())
using (OracleCommand myCmd = new OracleCommand())
{
    myConn.ConnectionString = db.GetDBConn1();

    myCmd.Connection = myConn;
    myCmd.CommandText = "MyPackage.MyProcedure";
    myCmd.CommandType = CommandType.StoredProcedure;

    // Pass the serialized DataTable as a single CLOB parameter
    myCmd.Parameters.Add("p_XmlRec", OracleDbType.Clob).Value = myXMLRec;

    myConn.Open();
    myCmd.ExecuteNonQuery();
    myConn.Close();
}
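
On the Oracle side, the procedure is essentially a one-statement insert/select over the incoming CLOB. The sketch below uses the package and parameter names from the snippet above, but the target table, its columns, and the XPath are assumptions; in particular, it assumes the DataTable was named "Row" before WriteXml was called, since WriteXml wraps each row in an element named after the table inside a DocumentElement root.

CREATE OR REPLACE PACKAGE MyPackage AS
    PROCEDURE MyProcedure (p_XmlRec IN CLOB);
END MyPackage;
/

CREATE OR REPLACE PACKAGE BODY MyPackage AS
    PROCEDURE MyProcedure (p_XmlRec IN CLOB) IS
    BEGIN
        -- Shred the XML produced by DataTable.WriteXml and insert it as a set.
        -- Table, column, and element names here are illustrative assumptions.
        INSERT INTO demo_file_upload (rec_id, col1, col2, col3, col4, col5)
        SELECT x.rec_id, x.col1, x.col2, x.col3, x.col4, x.col5
        FROM XMLTABLE('/DocumentElement/Row'
                      PASSING XMLTYPE(p_XmlRec)
                      COLUMNS rec_id NUMBER       PATH 'RecID',
                              col1   VARCHAR2(32) PATH 'Col1',
                              col2   VARCHAR2(32) PATH 'Col2',
                              col3   VARCHAR2(32) PATH 'Col3',
                              col4   VARCHAR2(32) PATH 'Col4',
                              col5   VARCHAR2(32) PATH 'Col5') x;
    END MyProcedure;
END MyPackage;
/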


While these methods are slower than the built-in bulk methods, I have found them to be simple, repeatable, and fairly efficient, while supporting the logging and control that most of my projects require. They are also easily configurable with respect to batch size: the batch can be increased or decreased through application configuration based on the performance of a specific environment. While these methods do depend on a stored procedure, I typically have stored procedures that do the subsequent processing of the staged data in the database anyway, so they integrate nicely with the overall process.

Header/Detail XML Query in Oracle

On a client project today we had the need to create a table from an XML data file sent in a hierarchical format. The XML file contains multiple orders. Each order is for a particular store, which is provided inside a header tag. Each store can order multiple items, which are provided via sub-tags of each header. A simplified sample of the file looks like this:

[box type="shadow"]

<orders>
    <order_header>
        <id>{6315E9FF}</id>
        <account_number>12345</account_number>
        <account_name>JOES CONVENIENCE</account_name>
        <store_address>123 MAIN STREET</store_address>
        <store_address_city>Pittsburgh</store_address_city>
        <store_address_state>PA</store_address_state>
        <store_address_postcode>15235</store_address_postcode>
        <order_line>
            <product_code>123456</product_code>
            <product_externalid>123456</product_externalid>
            <product_name>Peg Counter Spinner</product_name>
            <order_quantity>1</order_quantity>
        </order_line>
        <order_line>
            <product_code>723678</product_code>
            <product_externalid>723678</product_externalid>
            <product_name>Skittles</product_name>
            <order_quantity>1</order_quantity>
        </order_line>
    </order_header>
    <order_header>
        <id>{3CB06C14}</id>
        <account_number>23456</account_number>
        <account_name>SMITH LIMITED</account_name>
        <store_address>132 WEST AVENUE</store_address>
        <store_address_city>Pittsburgh</store_address_city>
        <store_address_state>PA</store_address_state>
        <store_address_postcode>15213</store_address_postcode>
        <order_line>
            <product_code>123456</product_code>
            <product_externalid>123456</product_externalid>
            <product_name>Peg Counter Spinner</product_name>
            <order_quantity>1</order_quantity>
        </order_line>
    </order_header>
</orders>

[/box]

The challenge was to write an Oracle query that could easily return both the header and detail information on a single row. After some quick research, I found that this can be done by using two separate XMLTables that hit the same file. The first XMLTable contains store-level information. The second XMLTable contains item-level information. When used together, the second XMLTable starts from the XMLTYPE column (order_lines) that is defined in the first. The outer query simply selects the fields you need. It's essentially a Cartesian product, but it only intersects where appropriate given the "passing" clause. Here is a sample of the query to read the data. This query assumes the file is named orders.xml and is sitting in the previously defined XMLDIR Oracle directory.

[box type="shadow"]

SELECT stores.order_id
      ,stores.account_number
      ,stores.account_name
      ,stores.retail_number
      ,stores.store_name
      ,stores.store_address
      ,stores.store_state
      ,stores.store_zip
      ,items.*
FROM XMLTable('/orders/order_header'
         passing xmltype( bfilename('XMLDIR','orders.xml'), nls_charset_id('AL32UTF8') )
         columns  order_id varchar2(2000) path 'id'
                 ,account_number varchar2(50)  path 'account_number'
                 ,account_name   varchar2(50)  path 'account_name'
                 ,retail_number  varchar2(12)  path 'store_externalid'
                 ,store_name     varchar2(50)  path 'store'
                 ,store_address  varchar2(100) path 'store_address'
                 ,store_state    varchar2(5)   path 'store_address_state'
                 ,store_zip      varchar2(10)  path 'store_address_postcode'
                 ,order_lines    XMLTYPE       path 'order_line'
     ) stores,
     XMLTable('/order_line'
	     passing stores.order_lines
         columns  product_code   varchar2(100) path 'product_code'
                 ,product_name   varchar2(100) path 'product_name'
                 ,order_quantity varchar2(100) path 'order_quantity'
     ) items

[/box]

This methodology can easily be extended to query an XML structure n levels deep to return information in one result set.
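
For example, if each <order_line> itself contained a repeating child element (say a hypothetical <allocation> tag, which is not part of the sample above), a third XMLTable could be chained off an XMLTYPE column exposed by the second, exactly as items is chained off stores:

[box type="shadow"]

SELECT stores.order_id
      ,items.product_code
      ,allocs.*
FROM XMLTable('/orders/order_header'
         passing xmltype( bfilename('XMLDIR','orders.xml'), nls_charset_id('AL32UTF8') )
         columns  order_id    varchar2(2000) path 'id'
                 ,order_lines XMLTYPE        path 'order_line'
     ) stores,
     XMLTable('/order_line'
         passing stores.order_lines
         columns  product_code varchar2(100) path 'product_code'
                 ,allocations  XMLTYPE       path 'allocation'
     ) items,
     XMLTable('/allocation'
         passing items.allocations
         columns  warehouse_code varchar2(50) path 'warehouse_code'
                 ,allocated_qty  varchar2(50) path 'allocated_qty'
     ) allocs

[/box]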

ASP.NET Issue: The SqlDataSource control does not have a naming container…

I have received the following .NET error several times in the past when developing nested data-bound controls, and I have found that the solution does not have good coverage on the Internet. The curious thing about this error is that it does not appear until the second or third postback on the page.

[box]The SqlDataSource control does not have a naming container. Ensure that the control is added to the page before calling DataBind[/box]

The cause is simple, however. It occurs when you use a ControlParameter on a DataSource that is nested inside another data control; I ran into it again today with a GridView inside a DataList, for example. The error occurs because the inner control cannot locate the control that contains the parameter.

The solution is simple as well. You must remove the ControlParameter from the DataSource that is inside the other control. This can be accomplished by using a normal <asp:Parameter> and assigning a DefaultValue to it when the outer control binds.