SSIS - XML Source Script

Posted by simonsabin on SQL Blogcasts See other posts from SQL Blogcasts or by simonsabin
Published on Thu, 11 Mar 2010 20:20:55 GMT Indexed on 2010/03/11 21:49 UTC
Read the original article Hit count: 763

The XML Source in SSIS is great if you have a 1 to 1 mapping between entity and table. You can do more complex mapping but it becomes very messy and won't perform. What other options do you have?

The challenge with XML processing is to not need a huge amount of memory. I remember using the early versions of Biztalk with loaded the whole document into memory to map from one document type to another. This was fine for small documents but was an absolute killer for large documents.

You therefore need a streaming approach.

For flexibility however you want to be able to generate your rows easily, and if you've ever used the XmlReader you will know its ugly code to write.

That brings me on to LINQ. The is an implementation of LINQ over XML which is really nice. You can write nice LINQ queries instead of the XMLReader stuff. The downside is that by default LINQ to XML requires a whole XML document to work with. No streaming.

Your code would look like this. We create an XDocument and then enumerate over a set of annoymous types we generate from our LINQ statement

XDocument x = XDocument.Load("C:\\TEMP\\CustomerOrders-Attribute.xml");

 

foreach (var xdata in (from customer in x.Elements("OrderInterface").Elements("Customer")

                       from order in customer.Elements("Orders").Elements("Order")

                       select new { Account = customer.Attribute("AccountNumber").Value

                                  , OrderDate = order.Attribute("OrderDate").Value }

                       ))

{

    Output0Buffer.AddRow();

    Output0Buffer.AccountNumber = xdata.Account;

    Output0Buffer.OrderDate = Convert.ToDateTime(xdata.OrderDate);

}

As I said the downside to this is that you are loading the whole document into memory.

I did some googling and came across some helpful videos from a nice UK DPE Mike Taulty http://www.microsoft.com/uk/msdn/screencasts/screencast/289/LINQ-to-XML-Streaming-In-Large-Documents.aspx. Which show you how you can combine LINQ and the XmlReader to get a semi streaming approach. I took what he did and implemented it in SSIS. What I found odd was that when I ran it I got different numbers between using the streamed and non streamed versions. I found the cause was a little bug in Mikes code that causes the pointer in the XmlReader to progress past the start of the element and thus

foreach (var xdata in (from customer in StreamReader("C:\\TEMP\\CustomerOrders-Attribute.xml","Customer")

                               from order in customer.Elements("Orders").Elements("Order")

                               select new { Account = customer.Attribute("AccountNumber").Value

                                          , OrderDate = order.Attribute("OrderDate").Value }

                               ))

        {

            Output0Buffer.AddRow();

            Output0Buffer.AccountNumber = xdata.Account;

            Output0Buffer.OrderDate = Convert.ToDateTime(xdata.OrderDate);

        }

These look very similiar and they are the key element is the method we are calling, StreamReader. This method is what gives us streaming, what it does is return a enumerable list of elements, because of the way that LINQ works this results in the data being streamed in.

static IEnumerable<XElement> StreamReader(String filename, string elementName)

{

    using (XmlReader xr = XmlReader.Create(filename))

    {

        xr.MoveToContent();

        while (xr.Read()) //Reads the first element

        {

            while (xr.NodeType == XmlNodeType.Element && xr.Name == elementName)

            {

                XElement node = (XElement)XElement.ReadFrom(xr);

 

                yield return node;

            }

        }

        xr.Close();

    }

}

This code is specifically designed to return a list of the elements with a specific name. The first Read reads the root element and then the inner while loop checks to see if the current element is the type we want. If not we do the xr.Read() again until we find the element type we want. We then use the neat function XElement.ReadFrom to read an element and all its sub elements into an XElement. This is what is returned and can be consumed by the LINQ statement. Essentially once one element has been read we need to check if we are still on the same element type and name (the inner loop) This was Mikes mistake, if we called .Read again we would advance the XmlReader beyond the start of the Element and so the ReadFrom method wouldn't work.

So with the code above you can use what ever LINQ statement you like to flatten your XML into the rowsets you want. You could even have multiple outputs and generate your own surrogate keys.

 

 

 

 


© SQL Blogcasts or respective owner

Related posts about SQL Server 2005

Related posts about Tips and Tricks