My first #Biml project: Data flow tasks

In the previous post we added some tasks to our package, and the only thing left to do is to add a data flow task that will extract the data from our source table and populate the staging table. Here’s what our Biml code looks like so far:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
 
    <Connections>
        <OleDbConnection Name="BI_Admin" CreateInProject="true" ConnectionString="Data Source=<>;Initial Catalog=BI_Admin;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
        <OleDbConnection Name="Staging" CreateInProject="true" ConnectionString="Data Source=<>;Initial Catalog=Staging;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
        <OdbcConnection Name="RLM" CreateInProject="true" ConnectionString="<>" DelayValidation="true" />
    </Connections>
 
    <Packages>
 
        <Package Name="Staging_PD8001" DelayValidation="true" ProtectionLevel="DontSaveSensitive">
            
            <Variables> 
                <Variable Name="Data_Source_Load_Start_Date" DataType="String">2022-01-01</Variable> 
                <Variable Name="Data_Source_Load_Start_Day" DataType="String">1</Variable> 
                <Variable Name="Data_Source_Load_Start_Month" DataType="String">1</Variable> 
                <Variable Name="Data_Source_Load_Start_Year" DataType="String">2022</Variable> 
            </Variables>

            <Tasks> 

                <Container Name="Sequence Container - Package" DelayValidation="true" ConstraintMode="Linear"> 
                    <Tasks>
                        <ExecuteSQL Name="Execute SQL Task - Set Load Start Date Variables" ConnectionName="BI_Admin" DelayValidation="true" ResultSet="SingleRow"> 
                            <DirectInput>exec [BI_Admin].[ETL].[Get_Derived_Date_Parameters];</DirectInput> 
                            <Results> 
                                <Result Name="Data_Source_Load_Start_Date" VariableName="User.Data_Source_Load_Start_Date" /> 
                                <Result Name="Data_Source_Load_Start_Day" VariableName="User.Data_Source_Load_Start_Day" /> 
                                <Result Name="Data_Source_Load_Start_Month" VariableName="User.Data_Source_Load_Start_Month" /> 
                                <Result Name="Data_Source_Load_Start_Year" VariableName="User.Data_Source_Load_Start_Year" /> 
                            </Results> 
                        </ExecuteSQL>

                        <ExecuteSQL Name="Execute SQL Task - RLM - Truncate Staging Table" ConnectionName="Staging" DelayValidation="true" ResultSet="None"> 
                            <DirectInput>truncate table [Staging].[RLM].[PD8001];</DirectInput> 
                        </ExecuteSQL>
                    </Tasks>
                </Container> 

            </Tasks>

        </Package>
 
    </Packages>
 
</Biml>
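As an aside, the ResultSet="SingleRow" mapping in the first Execute SQL Task only works if [ETL].[Get_Derived_Date_Parameters] returns exactly one row whose column names line up with the Result elements. The real procedure isn’t shown in this series, but a hypothetical sketch of its required shape could look like this:

```sql
-- Hypothetical sketch only: the real procedure is not shown in this series.
-- Whatever it does internally, it must return a single row with these four
-- columns for the ResultSet="SingleRow" mapping in the package to succeed.
create procedure [ETL].[Get_Derived_Date_Parameters]
as
begin
    set nocount on;

    -- In reality this date would be derived, e.g. from an ETL log table.
    declare @start date = '2022-01-01';

    select convert(varchar(10), @start, 120)        as Data_Source_Load_Start_Date
         , cast(datepart(day,   @start) as varchar) as Data_Source_Load_Start_Day
         , cast(datepart(month, @start) as varchar) as Data_Source_Load_Start_Month
         , cast(datepart(year,  @start) as varchar) as Data_Source_Load_Start_Year;
end
```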

Before we start adding the code for our data flow task, there are a few things worth pointing out:

  • Our data source is a DB2 database and we’re connecting through an ODBC connection (System DSN in this case), resulting in a few nuances when using Biml to generate the package. I’ll point those out as we go through the steps.
  • We’re going to use an expression to dynamically filter the incoming data, based on the variables we’ve extracted from our Admin database, which will in turn introduce some complexity.

The data flow task we’re aiming to produce will look like this when we’re done:

[Image: the completed data flow task in the SSIS designer]

There are of course a lot of hidden details you cannot see from the image above, but it will all become apparent as we walk through the steps.

Step 1: Add the data flow task itself

Creating the actual data flow task (without any transformations) is very easy, and the Biml code looks like this:

<Dataflow Name="Data Flow Task - RLM - PD8001" DelayValidation="true">
    <Transformations>

    </Transformations>
</Dataflow>

Since data flow tasks contain one or more transformations, we have already added the Transformations Biml tag.

Step 2: Add the data flow source

To add the ODBC source to our transformation, we use the following code:

<OdbcSource Name="ODBC Source - RLM - PD8001" Connection="RLM" BatchSize="50000" ValidateExternalMetadata="true" UseBatch="true">
    <DirectInput>
        select cast(CPY001 as varchar(10)) as CPY001
            ,  cast(DIV001 as varchar(10)) as DIV001
            ,  cast(ACT001 as varchar(10)) as ACT001
            ,  cast(STR001 as varchar(10)) as STR001
            ,  cast(CTL001 as varchar(10)) as CTL001
            ,  cast(STY001 as varchar(20)) as STY001
            ,  cast(CLR001 as varchar(20)) as CLR001
        from   PD8001 
    </DirectInput>
</OdbcSource>

The only properties required for an ODBC source are Name and Connection; the rest really depend on the type of source connection you’re adding, as some connection types expose more (or different) properties you can set. In my specific case I wanted to explicitly set the batch size to 50,000 rows and the fetch mode to “Batch”, which brings up one of a few (very minor) gripes I have with Biml: property names in Biml do not always match what you see in the SSIS designer (FetchMode in the SSIS designer vs. UseBatch in Biml).

If you are used to working in the SSIS designer, it will take some time to work through the language reference and find the matching Biml properties. I understand why they aren’t, and cannot always be, the same, but I’d like to see the matching SSIS property name added to the Biml documentation to make life easier for developers. See the language reference of the ODBC source for the full list of properties you can set on this element.

Note that even though we are planning on using an expression to dynamically set the query against our source database, we still have to provide a query within the DirectInput tag, which is the equivalent of the SQL command text you would provide in the designer if you choose SQL Command as the data access mode. (nuance #1)
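Since the expression we add in step 4 overrides SqlCommand at run time, this design-time query really only needs to expose the correct column metadata. One option (my own assumption, not something from this post, so test it against your driver first) is to add a predicate that prevents any rows from being fetched during validation:

```xml
<!-- Assumption: "where 1 = 0" returns an empty result set but still exposes
     the column metadata SSIS needs at design time; the expression on the
     data flow task replaces this whole query at run time. -->
<OdbcSource Name="ODBC Source - RLM - PD8001" Connection="RLM" BatchSize="50000" ValidateExternalMetadata="true" UseBatch="true">
    <DirectInput>
        select cast(CPY001 as varchar(10)) as CPY001
             , cast(DIV001 as varchar(10)) as DIV001
             , cast(ACT001 as varchar(10)) as ACT001
             , cast(STR001 as varchar(10)) as STR001
             , cast(CTL001 as varchar(10)) as CTL001
             , cast(STY001 as varchar(20)) as STY001
             , cast(CLR001 as varchar(20)) as CLR001
        from   PD8001
        where  1 = 0
    </DirectInput>
</OdbcSource>
```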

Side Note: If you’ve had to deal with implicit conversion issues in SSIS as many times as I have, you’ll understand why I am converting all source fields to varchar in the query :-/

Step 3: Add the data flow destination

Adding the destination to our data flow task transformation is super easy:

<OleDbDestination Name="OLE DB Destination - Staging - PD8001" ConnectionName="Staging" BatchSize="50000" MaximumInsertCommitSize="50000">
    <ExternalTableOutput Table="[RLM].[PD8001]" />
</OleDbDestination>

A standard practice of mine is to name staging table fields exactly as they are in the source. Apart from the primary benefit of an easy reference back to the source system, the default behavior of SSIS data flows adds another: columns with matching names are mapped automatically. That means no additional code and no need to explicitly map the source and destination fields <happy dance />.
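To illustrate (this DDL is my own sketch, not from the post): the staging table’s columns mirror the aliases and casts in the source query one-for-one, so the designer maps them by name with no extra work.

```sql
-- Hypothetical staging DDL: column names and types match the casts in the
-- ODBC source query exactly, so SSIS auto-maps source to destination by name.
create table [RLM].[PD8001] (
    CPY001 varchar(10),
    DIV001 varchar(10),
    ACT001 varchar(10),
    STR001 varchar(10),
    CTL001 varchar(10),
    STY001 varchar(20),
    CLR001 varchar(20)
);
```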

As a best practice, you should also always set the values of the BatchSize and MaximumInsertCommitSize properties explicitly, as I have done above.

Step 4: Add the expression

Another nuance (#2) of ODBC sources is that you cannot add an expression to the source component directly. Instead, you have to add the expression to the data flow task itself.

The net effect of this is that your Biml code (for the expression) will be nested as a direct child of the data flow task instead of the ODBC source and it will look something like this:

<Expressions>
    <Expression ExternalProperty="[ODBC Source - RLM - PD8001].[SqlCommand]">
        "select cast(CPY001 as varchar(10)) as CPY001
              , cast(DIV001 as varchar(10)) as DIV001
              , cast(ACT001 as varchar(10)) as ACT001
              , cast(STR001 as varchar(10)) as STR001
              , cast(CTL001 as varchar(10)) as CTL001
              , cast(STY001 as varchar(20)) as STY001
              , cast(CLR001 as varchar(20)) as CLR001
 
          from  PD8001
 
          where SYR001 = " + @[User::Data_Source_Load_Start_Year]
    </Expression>
</Expressions>

Note that in order to associate the expression with your ODBC source, the first part of the ExternalProperty value must match the name of your ODBC source component. This is no different from what you would see in the SSIS designer. The rest of the expression is self-explanatory, so I won’t bore you with a detailed explanation.

Here’s what the entire data flow’s Biml code will look like after all of our additions:

<Dataflow Name="Data Flow Task - RLM - PD8001" DelayValidation="true">
 
    <Expressions>
        <Expression ExternalProperty="[ODBC Source - RLM - PD8001].[SqlCommand]">
            "select cast(CPY001 as varchar(10)) as CPY001
                  , cast(DIV001 as varchar(10)) as DIV001
                  , cast(ACT001 as varchar(10)) as ACT001
                  , cast(STR001 as varchar(10)) as STR001
                  , cast(CTL001 as varchar(10)) as CTL001
                  , cast(STY001 as varchar(20)) as STY001
                  , cast(CLR001 as varchar(20)) as CLR001
 
             from   PD8001
 
             where  SYR001 = " + @[User::Data_Source_Load_Start_Year]
        </Expression>
    </Expressions>
 
    <Transformations>

        <OdbcSource Name="ODBC Source - RLM - PD8001" Connection="RLM" BatchSize="50000" ValidateExternalMetadata="true" UseBatch="true">
            <DirectInput>
                select cast(CPY001 as varchar(10)) as CPY001
                     , cast(DIV001 as varchar(10)) as DIV001
                     , cast(ACT001 as varchar(10)) as ACT001
                     , cast(STR001 as varchar(10)) as STR001
                     , cast(CTL001 as varchar(10)) as CTL001
                     , cast(STY001 as varchar(20)) as STY001
                     , cast(CLR001 as varchar(20)) as CLR001
                from   PD8001 
            </DirectInput>
        </OdbcSource>
 
        <OleDbDestination Name="OLE DB Destination - Staging - PD8001" ConnectionName="Staging" BatchSize="50000" MaximumInsertCommitSize="50000">
            <ExternalTableOutput Table="[RLM].[PD8001]" />
        </OleDbDestination>
 
    </Transformations>
 
</Dataflow>

You can download the entire Biml file here.

We did it!! We now have Biml code that will generate a fully functional SSIS package. Let’s give it a spin…

Step 5: Generate & test the package

To generate the package, all you have to do is right-click on the Biml file in SSDT (or Visual Studio) and select the Generate SSIS Packages option. Assuming a successful compilation, you should see your new package in the relevant project folder.

The last nuance (#3) I’d like to point out is a Biml compiler issue that seems to be related only to ODBC sources; you’ll see an error similar to this if you deploy and run the newly generated package:

[Image: package execution error message]

At the time of writing this blog post there is unfortunately no fix yet, but you can resolve the error by opening the ODBC source in the SSIS designer. You don’t need to change anything; just opening the source will refresh its metadata and correct whatever went wrong during the Biml generation process.

What’s next?

The next (logical) step in our Biml progression is to reuse the same Biml code to generate multiple similar packages, without copying and pasting the code over and over again. The next post in this series will start looking at some ways in which we can do that.

Happy Biml-ing!!

 

Other posts in this series:

My first #Biml project

My first #Biml project: Starting small

My first #Biml project: Adding tasks

 
