Category: SSIS

I cannot believe it took me almost three months to recover from the house move. At last, I feel like I am back to a routine. Now I have my home office; I am quite proud of it, as it involved a lot of DIY work. I wish I were one of those YouTube stars who show off their interior decorating skills 😉 I know none of this excites you, so let’s look at the actual blog post…

Validating Data using SSIS:

Recently I got a requirement to validate flat files using SSIS, and I wondered why I had never thought of doing this before. It would have saved the hours I wasted in emails or checking data to understand a problem. We face problems with dirty data very often; it could be because of invalid characters, lookup errors or something else. Depending on the ETL process, you identify these data problems at various stages. However, having a data validation task enables you to accept or reject the data at the very first stage. I have to thank my client for giving me this task, because it is something I had not done before, and I used the Script Component (which I do not use very often, so I definitely need to write this blog post before I forget it).

The goal of our SSIS package is to find all the validation failures. And the validations we want to check are:

  1. Only Yellow and Orange colour products are valid
  2. The Product column is mandatory
  3. Product Name must be in capitals
  4. Amount is a mandatory field
  5. Only two decimal places are allowed in the Amount field
  6. No commas (,) allowed in the Amount field
  7. No special characters in the Product column
  8. No 0s allowed in the Amount field
  9. No NULL values in the Amount field
  10. No duplicate products with the same colour

I used the Script Component as a transformation in the Data Flow Task. This is where all the logic resides. In the Script Component, I added a new output called Validation with two new output columns, FailedColumn and ValidationDescription. I also included all the source columns.

In the Script Component, under Input0_ProcessInputRow, I added my code. Mainly I used an If condition per validation; as I wanted to capture all validation errors, I added each failing row to the Validation buffer. Note that the duplicate check compares each row with the previous one, so the input should be sorted by Product and Colour for it to work.

Sample Script:

string key1 = String.Concat(Row.Product, Row.Color);
if (key1.Equals(previousKeyValue))
{
    ValidationBuffer.AddRow();
    ValidationBuffer.ValidationDescription = "No duplicate products with the same colour";
    ValidationBuffer.FailedColumn = "Product and Color";

    /*Map existing Columns*/
    ValidationBuffer.Amount = Row.Amount;
    ValidationBuffer.Product = Row.Product;
    ValidationBuffer.Color = Row.Color;
}

I populated ValidationDescription and FailedColumn with relevant information. This enables me to understand exactly which column failed and why. My output looks like this.

Now I can easily see where and what the issues are. I can also redirect these rows to a table, send them in an email, ignore these rows and load the rest, or abort the ETL process and notify the relevant person to correct the file.
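If you redirect the validation output to a table, the landing table only needs the source columns plus the two audit columns. A minimal sketch (the table name and column sizes are my assumptions, not part of the original package):

CREATE TABLE dbo.ValidationErrors   -- hypothetical landing table
(
    Product               NVARCHAR(100) NULL,
    Color                 NVARCHAR(50)  NULL,
    Amount                NVARCHAR(50)  NULL,  -- kept as text because the incoming value may be invalid
    FailedColumn          NVARCHAR(50)  NOT NULL,
    ValidationDescription NVARCHAR(200) NOT NULL,
    LoadedAt              DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
);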

Prathy

You can download the SSIS project here, or see the complete code of the Script Component for all the above-mentioned validations:

 
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
 
/*Added*/
using System.Reflection;
using System.Text;
using System.Globalization;
using System.Collections.Generic;
using System.Windows.Forms;
 
#endregion
 
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void PreExecute()
    {
        base.PreExecute();
        /* Add your code here */
    }

    public override void PostExecute()
    {
        base.PostExecute();
        /* Add your code here */
    }
    
    // Characters treated as invalid in the Product column
    private static readonly char[] SpecialChars = "!@#$%^&*()".ToCharArray();
    // Holds the Product+Color key of the previous row; assumes input sorted by Product and Color
    private string previousKeyValue = string.Empty;
 
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
 
        /* Duplicate check: compare this row's Product+Color key with the previous row's */
        string key1 = String.Concat(Row.Product, Row.Color);
        if (key1.Equals(previousKeyValue))
        {
            ValidationBuffer.AddRow();
            ValidationBuffer.ValidationDescription = "No duplicate products with the same colour";
            ValidationBuffer.FailedColumn = "Product and Color";

            /*Map existing Columns*/
            ValidationBuffer.Amount = Row.Amount;
            ValidationBuffer.Product = Row.Product;
            ValidationBuffer.Color = Row.Color;
        }

        previousKeyValue = key1;
 
        /* Only two decimal places are allowed in the Amount field */
        if (!string.IsNullOrEmpty(Row.Amount))
        {
            string[] arr = Row.Amount.Split('.');
            if (arr.Length > 1 && arr[1].Length > 2)
            {
                ValidationBuffer.AddRow();
                ValidationBuffer.ValidationDescription = "Only two decimal places are allowed in the Amount field";
                ValidationBuffer.FailedColumn = "Amount";

                /*Map existing Columns*/
                ValidationBuffer.Amount = Row.Amount;
                ValidationBuffer.Product = Row.Product;
                ValidationBuffer.Color = Row.Color;
            }
        }
        /* No special characters in the Product column */
        if (!String.IsNullOrEmpty(Row.Product))
        {
            int indexOf = Row.Product.IndexOfAny(SpecialChars);
            if (indexOf != -1)
            {
                ValidationBuffer.AddRow();
                ValidationBuffer.ValidationDescription = "No special characters in the Product column";
                ValidationBuffer.FailedColumn = "Product";

                /*Map existing Columns*/
                ValidationBuffer.Amount = Row.Amount;
                ValidationBuffer.Product = Row.Product;
                ValidationBuffer.Color = Row.Color;
            }
        }
        /* No 0s allowed in the Amount field */
        if (string.Equals(Row.Amount, "0"))
        {
            ValidationBuffer.AddRow();
            ValidationBuffer.ValidationDescription = "0 is not a valid value in the Amount field";
            ValidationBuffer.FailedColumn = "Amount";

            /*Map existing Columns*/
            ValidationBuffer.Amount = Row.Amount;
            ValidationBuffer.Product = Row.Product;
            ValidationBuffer.Color = Row.Color;
        }
        /* No NULL values in the Amount field */
        if (string.IsNullOrEmpty(Row.Amount))
        {
            ValidationBuffer.AddRow();
            ValidationBuffer.ValidationDescription = "NULL or blank is not a valid value in the Amount field";
            ValidationBuffer.FailedColumn = "Amount";

            /*Map existing Columns*/
            ValidationBuffer.Amount = Row.Amount;
            ValidationBuffer.Product = Row.Product;
            ValidationBuffer.Color = Row.Color;
        }
        /* No commas allowed in the Amount field */
        if (!string.IsNullOrEmpty(Row.Amount) && Row.Amount.Contains(","))
        {
            ValidationBuffer.AddRow();
            ValidationBuffer.ValidationDescription = "Comma (,) is not a valid character in the Amount field";
            ValidationBuffer.FailedColumn = "Amount";

            /*Map existing Columns*/
            ValidationBuffer.Amount = Row.Amount;
            ValidationBuffer.Product = Row.Product;
            ValidationBuffer.Color = Row.Color;
        }
 
        if (String.IsNullOrEmpty(Row.Product))
        {
            ValidationBuffer.AddRow();
            ValidationBuffer.ValidationDescription = "Product is a mandatory field";
            ValidationBuffer.FailedColumn = "Product";
 
            /*Map existing Columns*/
            ValidationBuffer.Amount = Row.Amount;
            ValidationBuffer.Product = Row.Product;
            ValidationBuffer.Color = Row.Color;
        }
        /* Only Yellow and Orange colour products are valid */
        if (!string.Equals(Row.Color, "Yellow") && !string.Equals(Row.Color, "Orange"))
        {
            ValidationBuffer.AddRow();
            ValidationBuffer.ValidationDescription = "Only Yellow and Orange colour products are valid";
            ValidationBuffer.FailedColumn = "Color";

            /*Map existing Columns*/
            ValidationBuffer.Amount = Row.Amount;
            ValidationBuffer.Product = Row.Product;
            ValidationBuffer.Color = Row.Color;
        }
        /* Product Name must be in capitals */
        if (!String.IsNullOrEmpty(Row.Product))
        {
            if (String.Compare(Row.Product, Row.Product.ToUpper()) != 0)
            {
                ValidationBuffer.AddRow();
                ValidationBuffer.ValidationDescription = "Product Name must be capital";
                ValidationBuffer.FailedColumn = "Product";

                /*Map existing Columns*/
                ValidationBuffer.Amount = Row.Amount;
                ValidationBuffer.Product = Row.Product;
                ValidationBuffer.Color = Row.Color;
            }
        }
               
 
 
    }
 
    public override void CreateNewOutputRows()
    {
        // Not used: this component only processes incoming rows
    }
 
}

 


This is the second part of the SSIS Package Parts blog post; I recommend reading the first part first!

In the previous post we saw the basics of SSIS Control Flow Package Parts; now let’s look a bit more in detail.

Package Part in Code Mode:

In design mode, an empty package part looks like a regular SSIS package, except that a Package Part has only the Control Flow and Data Flow tabs. When we compare the source code of a new SSIS package (.dtsx) and a new control flow package part (.dtsxp), both look pretty similar, with the obvious differences of package type, refId and CreationName. So using BIML to create package parts won’t be much different compared to SSIS packages.


One of the first use cases that came to my mind when I first saw the control flow template was reusing code. Whenever I create an SSIS package to populate a star schema, like many others I use a separate framework; I try to capture a lot of metadata related to SSIS packages and execution, so I can monitor how many rows got inserted or updated, along with other information related to executions and performance. However, one of the tedious things with SSIS is its minimal support for code reusability. Whenever you copy an SSIS task to a new package or a new solution, it loses all the existing connections (and sometimes parsing too) and you need to remap the connections again. SSIS Control Flow Package Parts directly address this issue.

Control flow Package Part for SSIS Audit:

So I want to create a package part to monitor SSIS Audit Begin and Audit End events for each execution. For this purpose, I want to use some system variables and user variables. Once I created the first variable in the Control Flow Part, the first thing I noticed was that all the variables were task scoped.


I did not understand the purpose of not having the flexibility to change a variable’s scope in the SSIS Control Flow Package Part designer. The next thing I tried was adding another task, assuming I could choose which task to execute at run time based on the task’s enable/disable property. Immediately I got an error saying “Control flow package part need to contain exactly one executable.” This error explains why variables are task scoped only: as only one executable is allowed in a package part, task-scoped variables will not lead to any confusion when variables with the same name already exist in the SSIS packages.


Well, for my purpose I need two Execute SQL Tasks, one at the beginning and one at the end. I thought I could just add two Execute SQL Tasks, but given the limitations, the only way to achieve this is by creating two different package parts. Hence, I created two SSIS Package Parts. Then I was curious: why is the entire SSIS toolbox available in the package part designer when you cannot add more than one executable? E.g., containers. Later I placed a Sequence Container, and surprisingly I could add more than one Execute SQL Task, and it executes perfectly fine. So if you want to use more than one task in a package part, use a Sequence Container. It nicely differentiates each executable in the Package Part configuration.


I thought I would anyway create two separate Execute SQL Tasks, because a package part then clearly defines what it is meant to do. Both Package Parts I created have task-scoped variables and the SQL Statement Source as an expression.


I can see my two SSIS Package Parts in the SSIS Toolbox, and I can just drag and drop them into the Control Flow. I can quickly identify an SSIS Package Part by the little P icon on the top right, and I can also see it uses an expression from the little fx icon on the top left. Initially, I just wanted to see how my package works with the new package parts.

And basic execution worked perfectly fine. I needed another SSIS Audit End package part with a failed execution result; as I can use package parts inside SSIS packages as many times as I want, and I was using user variables for the execution result value, I just changed the User Variable Execution Result value to get the desired result. It is better to use variables whenever possible to make package parts more dynamic.


Next, the tricky bit: how can I pass the row count? I have package-scoped variables which hold the row count. The only way I could think of doing this was manually changing the scope of the variables. (I am so glad it was not restricted 🙂 )


Once I manually changed the scope, I was able to use these two variables in the Data Flow Task, and the job was done. My package worked just the way I wanted.

If I want, I can change the scope of the variables of all control flow parts. If a variable with the same name already exists in the package, SSIS automatically renames the newly moved variable, or we can just delete the control flow part variables which we already have under package scope.


Well, it works; with a few workarounds, I did manage to achieve what I wanted. But there were a few hiccups while developing even this small solution, so imagine creating a robust one.

Blank Screen:

The first and most annoying thing was the blank screens. I don’t know whether it was my laptop or my installation, but a blank screen with no error information, really!?


I first got this error when I opened a package part which was used as a reference in another SSIS project. There were several other occasions when I saw this blank screen, and I really hope it was my PC rather than SSIS. Every time I got this blank screen error, I deleted the package part from the solution and recreated it.

Package Parts Execution:

Another unbelievable thing is that there is no execution or debugging option for package parts. Really? We can see the breakpoints option available for SSIS package parts; however, it is of no use. There is no way to execute a package part within the Package Part designer; the only way to execute it is by placing it in an SSIS package.


I am sure everyone wants to test their package part before they put it in an SSIS package, and there are so many other scenarios where you need to execute your task or executable before you use it somewhere…

Reusability:

As I mentioned earlier, the primary use case of package parts is code reusability. But if there is no record of where a part has been used, then controlling changes becomes difficult. If I make changes to a package part in a solution while a package which uses it is still open, I get a notification about the reference; apart from that, there is no other way to see where the package part has been used.


When a package part is added to a project as a reference and I accidentally make changes, the other packages which use it get the blank screen (at least on my PC!). When a referenced package part is changed or renamed, the packages where it has been used will show errors. You can modify the part’s name once it is in an SSIS package, but you should not change the file name!

Expressions:

Package parts do allow expressions, and you can see that a package part uses them, but these expressions are visible at the executable level only. For example, if you have a sequence container with two Execute SQL Tasks in a package part, you can see the expressions used at the sequence container level but not those used at the Execute SQL Task level; however, if your package part has only one executable, you can see all the underlying expressions. (It is better not to change expressions when a package part is used as, or added to the project as, a reference.)

[Image: Package part with a Sequence Container and multiple Execute SQL Tasks with expressions]

[Image: Package part with a single Execute SQL Task using expressions]

Connection Strings:

A package part brings all of its connections along with it. These connections are not visible in the Connection Managers window, but you can see them under the connection managers of any task, or in the configure/execute windows of the SSISDB catalog. All package part connections are prefixed with the package part name, so to avoid confusion it is better to give everything a relevant name.


Visual Studio Crashes:

Visual Studio crashed, and when I reopened the project I lost all my references to package parts. None of my package parts had the P symbol; after the crash, all package parts looked and behaved just like regular tasks. In a way, that is nice, because my package still works, but I lost all the references. (If your package documentation says “to edit the package part, go to location x and make changes”, that will no longer work!)

Some points:

  1. No option to change the connection string.

  2. The only warning you get is the reference notification shown when saving.

  3. We cannot use a package part inside a package part.
  4. Deployment will not deploy Package Parts to SSISDB.
  5. No debugging options are available for package parts in the package part designer.
  6. Each package part should contain exactly one executable. We cannot create a control flow package part with two executables; SSIS will not allow saving the package part. We can create a package part with no executable, but SSIS will not let us place it in the Control Flow. In both scenarios, it raises the same error.

Summary:

Do I like it? I don’t know. Will I use it? It depends.
It can be a great addition to any SSIS package, but version 1 needs a lot of improvement. If you are thinking of using it, I believe you need strict standards in place: create a solution with all the package parts, make sure to put them under source control, and have strict rules around change control; annotations will probably make it easier. Well, it’s just my opinion after spending a couple of days with Control Flow Package Parts; I would love to hear what you think.

Next blog post in the series will be Logging…

Till next time,

Prathy 🙂


This blog post became longer than I thought, so I decided to post it in two parts. The first part covers the introduction, and the second is a bit more detailed; hope that makes it easier to read!

Control Flow Package Parts is the SSIS 2016 feature which excited me most, ever since Matt Masson did a session about what’s new in SSIS and MDS at the PASS Summit. Whenever there is something new, it is quite common to get positive and negative comments, but it seems like SSIS Control Flow Templates, or under their new name, SSIS Control Flow Package Parts (does that mean we can expect other types of package parts in the future?), received more negative comments than positive. I used a Control Flow template briefly in my previous blog post example; however, this blog post gives me a chance to look at them in detail, to learn and understand what is missing from my expectations.

Creating an SSIS Control Flow Package Part:

For me, the only thing which is straightforward with Package Parts is creating one. Creating a new package part is very easy: go to the Package Parts section under the project where you want to create one, right-click, and when you click on New Control Flow Package Part, it creates a new .dtsxp file. You can not only create one but also import an existing one. Once created, it looks exactly the same as an SSIS package, with the same SSIS Toolbox.

[Image: Package Part in Solution Explorer]

[Image: Creating a new Control Flow Package Part]

Adding an Existing Control Flow Package Part:

You can add an existing package part in two ways: as a reference or without a reference.


When you add it as a reference, the package part still references the original location of the file; when you add it as a non-reference, it copies the file to the new project location. We can see the difference by looking at the source file path or the icon of the package part; it is a tiny icon, though.


Obviously, when there is a reference, all the underlying changes take effect. Which option to choose is entirely based on the use case. I will explain a bit more about this in the next blog post, because each option has its own pros and cons.

Using Package Part:

Once a Package Part is created or added, you can see it in the SSIS Toolbox. We can simply drag and drop it into the Control Flow like any other SSIS Control Flow task.

Configuring Control Flow Package Part:

In the configuration window of a package part, you can see two tabs: one to configure connection managers and one for properties. If you want to change the value of a property, you can click on the property and change the property path. If the underlying package part uses any expressions, it brings those expressions too; it is just the same as copying and pasting an existing task, but here it brings the connection with itself, plus a better GUI to change the properties. A package part allows only one executable, but if you want to add more than one, you can use a Sequence Container. It is useful in certain cases, like when you don’t want to create multiple package parts; by using a sequence container you can add more than one executable and use the enable/disable property of each executable to control execution.

[Image: Package Part properties configuration window]

[Image: Package Part connection managers configuration window]

[Image: Package Part configuration for a Sequence Container with multiple executables]

That’s all; you can just drag and drop these pre-created SSIS Control Flow Package Parts into any SSIS package you want. If required, you can configure the properties of the package part. Easy 🙂

But unfortunately, it is not as straightforward as that. Variables don’t work the way you expect, you get an unexpected blank screen as an error, and more. I will address all of these in the next blog post…

Till Then,

Prathy


SSIS has always been my favourite among all the MSBI tools; I think it is ETL that interests me more than SSIS specifically. As an SSIS fan girl, the SQL Server 2012 release was my favourite: it was a huge release with numerous improvements (don’t forget rounded corners). Comparing the SQL Server 2012 and 2016 releases, there are not as many exciting enhancements this time, but some noteworthy ones are there. Hence, this blog post looks into the new features of SSIS 2016 in detail, giving me a chance to explore them. I am going to publish it as a series, so it will be convenient to read and write 🙂

Package Deployment:

One of the most complained-about or requested features after the SSIS 2012 release was package deployment. I am not a huge fan of it, but I can see why people were missing this feature. I have seen people debating about SSIS, saying “SSIS is the same as any other development, so you deploy the entire solution at a time, not individual packages”. Probably Microsoft also considered SSIS development like any other development and introduced project deployment in SSIS 2012. I totally agree with the argument and would like to do it that way, but most SSIS developers deploy individual packages…

Well, it’s nice to see package deployment back in SSIS 2016. So what happens when you decide to deploy individual packages?

We all know how to do project deployment; the new incremental package deployment in SSIS 2016 is the same as project deployment, but you can right-click on an individual package, or multiple selected packages, and choose Deploy.


All the rest of the steps are the same as the previous deployment. However, if you want to see which packages are selected, or if you changed your mind after starting the deployment wizard, you can click on the Previous button, which brings up a new window in the deployment wizard where you can choose the packages you want to deploy. One thing I noticed here is that you can deselect packages, but you are not allowed to select new ones; well, you can select new ones, but the wizard will not let you go to the next step (the Next button is greyed out).

Usually, SSIS does project deployment from an .ispac (Integration Services Project Deployment File) file. When you choose package deployment, instead of using the .ispac file it selects the packages you wanted to deploy. You can see in the following screenshot that only the selected packages were loaded into the deployment wizard. Also, instead of the .ispac location, the deployment wizard uses the Packages folder path. If you change the deployment model to project deployment, it will ask for the .ispac file path.

[Image: Package Deployment]

[Image: Project Deployment]

When we look at the deployed packages in SSISDB, surprisingly we can see that all packages are deployed, irrespective of the incremental deployment.

SELECT [package_id]
      ,[project_version_lsn]
      ,[name]
      ,[package_guid]
      ,[description]
      ,[version_build]
      ,[version_comments]
  FROM [SSISDB].[internal].[packages]

In the following screenshot, the yellow highlight is the deployment where ChildPackage2 and Master Package were selected, and in the green highlighted one, only the master package was chosen to deploy. When we look at the version_build column values, we get the assurance that it is indeed incremental package deployment. All packages have a row in [SSISDB].[internal].[packages] because of [project_version_lsn], which is the column used to relate to the project table. So all the relationships between different tables in SSISDB work smoothly.

[Image: [SSISDB].[internal].[packages]]
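A query along these lines, using the public catalog views, shows the same relationship (a hedged sketch from memory of the view columns; check them on your instance):

SELECT pr.name AS project_name
      ,pr.object_version_lsn
      ,pk.name AS package_name
      ,pk.version_build
FROM SSISDB.catalog.projects AS pr
JOIN SSISDB.catalog.packages AS pk
    ON pk.project_version_lsn = pr.object_version_lsn
ORDER BY pr.object_version_lsn DESC;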

Is the new package deployment model addressing existing issues?

  1. We still cannot reference packages from a different project, or packages from SSISDB, using the Execute Package Task. Before SSIS 2012, we were able to reference the SSIS Package Store and execute packages from a different solution or from deployed solutions. There are various workarounds to achieve this; I prefer using an Execute SQL Task to call the stored procedures that execute a deployed package (see the sketch after this list), but an easy reference in EPT is something still missing for me.
    [Image: (EPT) Execute Package Task]

  2. Parameter bindings in the Execute Package Task:

    The Execute Package Task still does not support variables. If you want to send a parent package variable value to a child package variable, you first need to create a parameter in the child package and then use that parameter value in an expression on the variable where you want the parent’s value.
    No biggie, but most tasks still do not support parameters. E.g., we cannot use SQL Command from a parameter in an OLEDB Source, an Execute SQL Task result set cannot be a parameter, and For Each Loop variable mappings do not support parameters.

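For completeness, this is roughly the stored-procedure route I mentioned in point 1: an Execute SQL Task running something like the snippet below against SSISDB (a sketch; the folder, project and package names are made up):

DECLARE @execution_id BIGINT;

EXEC SSISDB.catalog.create_execution
     @folder_name     = N'MyFolder'          -- hypothetical folder
    ,@project_name    = N'MyProject'         -- hypothetical project
    ,@package_name    = N'ChildPackage.dtsx'
    ,@use32bitruntime = 0
    ,@execution_id    = @execution_id OUTPUT;

/*SYNCHRONIZED = 1 makes the call wait until the package finishes*/
EXEC SSISDB.catalog.set_execution_parameter_value
     @execution_id
    ,@object_type     = 50                   -- system parameter
    ,@parameter_name  = N'SYNCHRONIZED'
    ,@parameter_value = 1;

EXEC SSISDB.catalog.start_execution @execution_id;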

Wrap-Up:

As I mentioned earlier, it is nice to see incremental package deployment. It’s great that Microsoft listened to users’ voices and implemented this feature. I don’t know what’s going on behind the scenes, but when packages are deployed individually to a server, SSISDB nicely references the previously deployed packages; probably it just checks the version build of all packages and, where it is the same, copies the existing package metadata. If they managed package deployment, I don’t think it will be difficult for Microsoft to develop an SSISDB reference under the Execute Package Task; maybe in the next release…

Next blog post in this series will be my favourite feature, Control Flow Templates.

Till next time,

Prathy 🙂


While I am in the flow of blogging, I want to write about an issue which troubled me recently; it definitely deserves a blog post, so that I will know where to look when I face a similar problem again. Recently I started working with Teradata. Obviously, being a Microsoft person, I have not enjoyed it much, but luckily my job mostly involved writing select queries rather than anything more complicated.

However, I came across a scenario where I needed to bring a few thousand rows to SQL Server, which involved complex operations on millions of rows of data on Teradata, and then matching/joining/lookups with different data sources using SSIS. It was a case where Teradata volatile tables come in handy, as I wanted to do all the complex joins and operations on the Teradata box and then extract only the small set of data which required further transformations and lookups. Well, I thought I could easily do this with an ADO.NET connection and variables. Unfortunately, it was not that easy with my limited C# skills, which ultimately led to this blog post.

So in this post, I am going to cover two topics: how to use Teradata volatile tables, and how to use an SSIS Object variable as a source with managed ADO.NET connections.

Teradata and SSIS:

The Attunity Teradata components are popular for extracting data from Teradata databases. I have read that these components perform better than the ADO.NET source component. I have not made any comparisons; however, as per the Microsoft document, “Attunity Teradata Components uses the Teradata Parallel Transporter API and internal buffering with Teradata PT operators directly rather than requiring access through a managed interfaces”. The paper also concludes that “The Microsoft Connector for Teradata by Attunity provides a high-performance means of loading and unloading data to and from Teradata databases”. In my scenario, I wanted to work with volatile tables, so I could not use the Attunity Teradata components; hence, I went for the ADO.NET connection type. The very first thing I need to do when working with Teradata is set the query band.

Setting The Query Band:

I used an Execute SQL Task to set the query band, with an ADO.NET connection type and a System DSN. I want this query band to apply to all the queries in the session, which is why I set the RetainSameConnection property of the connection to TRUE.
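The statement itself is plain Teradata SQL; a minimal sketch (the name/value pairs are just my examples):

SET QUERY_BAND = 'ApplicationName=SSIS;PackageName=TeradataExtract;' FOR SESSION;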


Creating a Volatile Table:

In the next step, I created a volatile table using another Execute SQL Task. If you want, you can put all the statements under one Execute SQL Task, but I prefer to keep them separate for readability. Hola, we are done with creating our first volatile table using SSIS, and we can use this table in any of our further operations… I wanted to select data from the volatile table I just created and use it as a source in my Data Flow Task for further transformations.
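For reference, this is the general shape of a volatile CREATE TABLE AS statement in Teradata (a sketch; the table and column names are made up):

CREATE VOLATILE TABLE vt_sales AS
(
    SELECT product
          ,color
          ,SUM(amount) AS amount
    FROM sales_history
    GROUP BY product, color
)
WITH DATA
PRIMARY INDEX (product)
ON COMMIT PRESERVE ROWS;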


I selected the data using an Execute SQL Task and returned the full result set to use as a source in the Data Flow Task, pretty much the same as in my previous post, which was based on Tim Mitchell’s blog post. But here comes the problem: when I used the same script as in my last blog post, I got the following error.

[Image: the error message]

ADO.NET Connection:

I thought I must have typed something wrong, so I went back to my script, checked it a few times and changed a few things with my minimal C# knowledge, but no success. I tried the same example with SQL Server and an OLEDB connection, which worked fine. I thought something was wrong with Teradata (probably the SSIS Execute SQL Task did not support Teradata). But when I looked at the error in detail, it clearly says Object is not an ADODB recordset. With a bit of googling, I found that this error is common in the .NET world, and also found an interesting blog post by Valentino Vranken explaining the SSIS Object package variable and how it depends on the connection type. For the first time, I realised the Execute SQL Task result set depends on the connection type. As Valentino’s blog post explains, in debug mode I can clearly see that the Value type property of the two System.Object variables is different.


ADO.NET provides System.Data.DataSet and OLEDB provides System.__ComObject. Basically, with OLEDB I used the Fill method to extract data from the variable, but with ADO.NET I was getting an error. Surprisingly, I can use the Fill method in the Control Flow with a Script Task but not in the Data Flow Task with a Script Component. When I used Fill in the Script Component as a source, I was getting the following error:

‘DataTable’ does not contain a definition for ‘Fill’ and no extension method ‘Fill’ accepting a first argument of type ‘DataTable’ could be found (are you missing a using directive or an assembly reference?)

But it works fine in the Control Flow with a Script Task. Frankly, I do not know why, but I got around it by casting the variable to a DataSet. Here is the C# code I used with the two connection types.

OLEDB Connection:

using System;
using System.Data;
using System.Data.OleDb;

public override void CreateNewOutputRows()
{
    // Fill a DataTable from the SSIS Object variable holding the OLEDB result set
    OleDbDataAdapter oleDA = new OleDbDataAdapter();
    DataTable dt = new DataTable();
    oleDA.Fill(dt, Variables.OLEDB);

    foreach (DataRow dr in dt.Rows)
    {
        OLEDBBuffer.AddRow();
        OLEDBBuffer.Firstname = dr["Firstname"].ToString();
        OLEDBBuffer.Lastname = dr["Lastname"].ToString();
    }
}

ADO.NET Connection:

using System;
using System.Data;

public override void CreateNewOutputRows()
{
    // The ADO.NET result set arrives as a DataSet, so cast the variable instead of using Fill
    DataSet ds = (DataSet)this.Variables.ADO;
    DataTable dt = ds.Tables[0];

    foreach (DataRow dr in dt.Rows)
    {
        ADOBuffer.AddRow();
        ADOBuffer.Firstname = dr["Firstname"].ToString();
        ADOBuffer.Lastname = dr["Lastname"].ToString();
    }
}

Finally, it worked. This may not be the proper way of doing it, but it worked for my scenario. ADO.NET connections usually don’t perform as well as other connection types like OLEDB, but in my scenario there was no noticeable impact on performance. If someone knows why the Fill method works with a Script Task but not with a Script Component, please let me know 🙂

You can download the SSIS package from here: http://tinyurl.com/SSISTeradata (for security reasons I did not include the Teradata part, but it has the two connection types with SQL Server).

Enjoy,
Prathy


After years of hesitation, I finally think I am ready to write my first blog post, or I should say I am inclined to publish a public blog post. I chose SSIS & R as the subject because I like SSIS, and since Microsoft released Power Query I have been curious about extracting Twitter data, but it was never easy for me. Now SQL Server R Services has made it quite easy.

One of my favourite features among all the new SQL Server 2016 features is the integration with R. I don’t want to mislead you by saying I understand it very well, but I can proudly say I am trying to learn, and trust me, it is a good language to learn 🙂 . So note, you may find poorly written R script in this post. Also, Microsoft has decided to integrate R wherever feasible; hence, there are different methods of using R, but I chose SSIS so I can demonstrate how we can use R in SSIS.

To achieve our purpose, first things first: we need to enable R Services on our SQL Server 2016 instance. An excellent blog post by Henk van der Valk explains how to install and configure SQL Server R Services: http://henkvandervalk.com/introducing-microsoft-sql-server-2016-r-services . Once SQL Server R Services is enabled and ready to run, you can execute the following script to list all pre-installed packages and check whether R Services works.

EXECUTE  sp_execute_external_script
@language=N'R'
,@script = N'str(OutputDataSet);
packagematrix <- installed.packages();
NameOnly <- packagematrix[,1];
OutputDataSet <- as.data.frame(NameOnly);'
,@input_data_1 = N'SELECT 1 as col'
WITH RESULT SETS ((PackageName nvarchar(250)));
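If that call fails, external scripts may simply not be enabled yet; the server-level switch is a single sp_configure setting (run it as sysadmin, and the SQL Server and Launchpad services may need a restart):

EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE;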

Once we are convinced that R Services is working, we are ready to proceed! The only challenge I faced was figuring out how to install new packages. I used RStudio to install packages and moved them across to the SQL 2016 folder. Obviously, that was not the exact process; this blog post explains how to install additional packages: http://henkvandervalk.com/introducing-microsoft-sql-server-2016-r-services. As we saw in the above code, we are doing nothing but executing a stored procedure which has R script in it. So, I’ll use an Execute SQL Task in SSIS to run the SQL Server R script and extract data using one of the available open-source R libraries, simple 🙂 . I will be using the twitteR package/library; there are many other free libraries available, but I think it’s a good way to start, just because it is simple. To extract data using the Twitter API, first we need to create a Twitter app.
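A quick way to confirm a specific package made it into the instance library is a small variation of the script above (a sketch):

EXECUTE sp_execute_external_script
@language = N'R'
,@script = N'OutputDataSet <- data.frame(ok = "twitteR" %in% rownames(installed.packages()));'
WITH RESULT SETS ((twitteRInstalled BIT));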

Twitter API:

As per the Twitter documentation, there are four different objects that you’ll encounter in the API: Tweets, Users, Entities and Places; each object has several properties. For the sake of simplicity, I am looking at Tweets only. I just want to search Twitter based on a hashtag and extract a certain number of tweets, just to show how it works. I created a Twitter app called SQLServerRServices. Once the app is created, we can get the api_key, api_secret, access_token and access_token_secret which are required to access the Twitter API. The next step is writing the R script.

R Script:

I have to admit, writing R in SQL Server Management Studio is not fun. I felt it lacked debugging capabilities, which made troubleshooting a tough task. To overcome this, I used RStudio to write the R script until I was sure it was working; it was a lot easier compared to SSMS! Here is the script I used (I replaced my API keys with xxxx).

R Script using twitteR:

/*Stored procedure to execute R Script in SQL Server 2016*/
execute sp_execute_external_script
@language = N'R'

/*R Script*/
, @script = N'

#Load the necessary packages
library(base64enc);
library(twitteR);
library(ROAuth);
library(bitops);
library(RCurl);
library(httr);

#OAUTH
oauth_endpoints("twitter");

#API keys from Twitter APP
api_key <- "xxxxxxxxxxxxxxx";
api_secret <- "xxxxxxxxxxxxxxxxxxx";
access_token <- "xxxxxxxxxxxxxxxxxxxxxxxx";
access_token_secret <- "xxxxxxxxxxxxxxxxxxxxxxxxxx";
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret);

#searchTwitter - is TwitterR function to search Twitter
PowerBITweets = searchTwitter("PowerBI", n=10);

#Create a data frame with extracted data
df <- do.call("rbind", lapply(PowerBITweets , as.data.frame))

#Choose the selected columns
frame<-as.data.frame(df[,c(11,5,3,12,15,16)])
', @input_data_1 = N''

--Send created data frame as output result set
, @output_data_1_name = N'frame'
with result sets (([Handle] NVARCHAR(max),[DateTime] DATETIME,[FavCount] int,[RetweetCount] int,[Longitude] NVARCHAR(100),[Latitude] NVARCHAR(100) ));

Script in Detail:

We can execute R in SQL Server 2016 by using the stored procedure sp_execute_external_script. @language represents the language we want to use in the script (it looks like we will probably be able to run Python or other languages shortly 🙂 ). The next step is to load all the required packages.

After loading all the packages, it’s the OAuth handshake. With the supplied API access tokens and keys, we can authenticate with Twitter using the following bit of code.

oauth_endpoints("twitter");

#API keys from Twitter APP
api_key <- "xxxxxxxxxxxxxx";
api_secret <- "xxxxxxxxxxxxxx";
access_token <- "xxxxxxxxxxxxxx-xxxxxxxxxxxxxx";
access_token_secret <- "xxxxxxxxxxxxxx";
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret);

The next step is to search Twitter using the searchTwitter function, which works based on a supplied string. There are several arguments we can supply; here I am providing two: the search string ("PowerBI") and n, the maximum number of tweets.

PowerBITweets = searchTwitter("PowerBI", n=10);

searchTwitter returns a list of status objects. In the next step, I use three functions to convert the returned list into a data frame.

#Create a data frame with extracted data
df <- do.call("rbind", lapply(PowerBITweets, as.data.frame))

This data frame contains many columns, but I am interested in only a few, so I select them in the next step.

#Choose the selected columns
frame <- as.data.frame(df[,c(11,5,3,12,15,16)])

I am not supplying any input data to the query; hence, my @input_data_1 value is blank.

', @input_data_1 = N''

For the output, I want to return the selected data, so I gave the data frame name, and in the result sets I gave the column names and data types.

, @output_data_1_name = N'frame'
with result sets (([Handle] NVARCHAR(max),[DateTime] DATETIME,[FavCount] int,[RetweetCount] int,[Longitude] NVARCHAR(100),[Latitude] NVARCHAR(100)));

It’s impossible to know all the column names and data types beforehand; that’s when RStudio comes in handy. We can do this with SSMS too, but RStudio is easier.

Identifying data types:

One of the beauties of RStudio is that we can see the environment values and the data. In the following image, on the right side, you can see Data and Values. Under Data, df is the data frame created when we executed df <- do.call("rbind", lapply(PowerBITweets, as.data.frame)).

When we click on the little table icon next to df, it opens the data frame in a new window, where we can see the data and column names.


To identify the data types, we can use class() or typeof() to get the list of column names and their respective data types. This worked for me in this example, but if you want to know more, there is a debate on Stack Overflow; beware, it’s quite an old post, but I found it interesting: http://tinyurl.com/jj9c3e9


Finally, we are sure that the R script works with the Twitter API. The next step is to execute and schedule this extract using SSIS.

Creating SSIS Package:

I created an SSIS control flow template to retrieve Twitter data. This template has one Execute SQL Task with four task-scoped variables:

TwitterHashTag – the search string

NoOfTweets – the number of tweets I want to obtain for the search

Script – I want to make the R script dynamic using SSIS variables, so I use another variable, Script, with an expression. I could use the SQL statement as an expression on the Execute SQL Task itself, but using a variable makes it easier to document, and I can use BIDS Helper to create and modify variables.


I am going to use this template in my SSIS package. Creating an SSIS Control Flow Template is another blog post; there are some excellent posts about control flow templates by Koen Verbeeck. Editing a control flow template is easy; here I wanted to use the extracted result set in the Data Flow Task as a source, and for that reason I changed the scope of the TwitterExtract control flow template variable to package level.


The next step is to prepare the Data Flow Task. In the Data Flow Task, I used a Script Component as a source.


There is no point in me repeating how to use a System.Object variable as the source for a Data Flow Task when Tim Mitchell has explained it all. I created all the columns in the result set as output columns in my source component, with the same data types I used for the R script output parameter.


And here is the C# script I used in the Script Component source:

public override void CreateNewOutputRows()
{
    // Fill a DataTable from the SSIS Object variable holding the R result set
    OleDbDataAdapter oleDA = new OleDbDataAdapter();
    DataTable dt = new DataTable();
    oleDA.Fill(dt, Variables.RResultSet);

    foreach (DataRow dr in dt.Rows)
    {
        ScriptOutputBuffer.AddRow();
        ScriptOutputBuffer.Handle = dr["Handle"].ToString();
        ScriptOutputBuffer.DateTime = DateTime.Parse(dr["DateTime"].ToString());
        ScriptOutputBuffer.FavCount = int.Parse(dr["FavCount"].ToString());
        ScriptOutputBuffer.RetweetCount = int.Parse(dr["RetweetCount"].ToString());
        ScriptOutputBuffer.Latitude = dr["Latitude"].ToString();
        ScriptOutputBuffer.Longitude = dr["Longitude"].ToString();
    }
}

The Script Component generates the output buffer, which can be connected to an OLEDB destination over a SQL Server connection. That’s all; now we can schedule it and watch the data arrive. We don’t have to use SSIS to populate one simple SQL Server table, but I am using this as an example.
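For reference, a destination table matching the result set could look like this (a sketch; the table name is made up, and the column types mirror the RESULT SETS clause above):

CREATE TABLE dbo.PowerBITweets   -- hypothetical destination table
(
    [Handle]       NVARCHAR(MAX) NULL,
    [DateTime]     DATETIME NULL,
    [FavCount]     INT NULL,
    [RetweetCount] INT NULL,
    [Longitude]    NVARCHAR(100) NULL,
    [Latitude]     NVARCHAR(100) NULL
);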

Here we go: now we can see who is tweeting about Power BI, and whose tweets got retweeted and favourited the most. I have not tested by going to each profile to see how many of their tweets were Power BI tweets; my only test is that I can see familiar Twitter handles which I regularly follow ;). For this example, I extracted only 1000 tweets, but we can extract tweets based on tweet time and schedule the package to get all tweets in a given window. Using other twitteR functions, we can also get user profiles and profile images, which could lead to an excellent interactive Power BI report. I tried the publish-to-web feature with my WordPress blog; for some reason, WordPress decided not to insert the iframe, so until I figure out a workaround, here is the URL where you can see my quick and dirty report – https://app.powerbi.com/view?r=eyJrIjoiYzg5YTUwN2EtYmMzZi00YzRlLTlkZTgtNzQzZjJiOTNiMjUyIiwidCI6ImExMjczNjNlLTcxNzAtNDg0Yi1iZDY4LTAyYjgwZjIzZDI4MiIsImMiOjh9

PowerBI File

Have fun,

Prathy

 
