I cannot believe it took me almost three months to recover from the house move. At last, I feel like I am back to a routine. Now I have my home office; I am quite proud of it, as it involved a lot of DIY work. I wish I were one of those YouTube stars who show off their interior decorating skills 😉 I know none of this excites you, so let’s look at the actual blog post…
Validating Data using SSIS:
Recently I got a requirement to validate flat files using SSIS, and I realised I had never thought of doing this before. It would have saved the hours I wasted on emails or checking data to understand a problem. We face problems with dirty data very often; it could be because of invalid characters, lookup errors or something else. Depending on the ETL process, you identify these data problems at various stages. However, having a data validation task enables you to accept or reject the data at the very first stage. I have to thank my client for giving me this task, because it is something I had not done before, and I used the Script Component (which I do not use very often, so I definitely need to write this blog post before I forget it).
The goal of our SSIS package is to find all the validation failures. And the validations we want to check are:
- Only Yellow and Orange colour products are valid
- Product Column is mandatory
- Product Name must be in capitals
- Amount is a mandatory field
- Only two decimals values are allowed in the Amount field
- No commas (,) allowed in the Amount field
- No special characters in the Product column
- No 0’s allowed in the Amount field
- No NULL values in the Amount Field
- No duplicated products with the same colour
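Before wiring these rules into the Script Component, it can help to express them as small, pure functions that can be tested outside SSIS. The sketch below is mine, not part of the package; the function names are illustrative and the rules follow the list above:

```csharp
using System;

// Illustrative helpers mirroring the validation rules above.
char[] specialChars = "!@#$%^&*()".ToCharArray();

// Only Yellow and Orange colour products are valid
bool IsValidColor(string color) =>
    color == "Yellow" || color == "Orange";

// Product is mandatory and must not contain special characters
bool IsValidProduct(string product) =>
    !string.IsNullOrEmpty(product) && product.IndexOfAny(specialChars) == -1;

// Product Name must be in capitals
bool IsUpperCase(string product) =>
    product == product?.ToUpper();

// Amount: mandatory, no commas, not zero, at most two decimal places
bool IsValidAmount(string amount)
{
    if (string.IsNullOrEmpty(amount) || amount.Contains(",") || amount == "0")
        return false;
    var parts = amount.Split('.');
    return parts.Length == 1 || parts[1].Length <= 2;
}

Console.WriteLine(IsValidAmount("12.34"));   // True
Console.WriteLine(IsValidAmount("1.234"));   // False: three decimal places
Console.WriteLine(IsValidColor("Blue"));     // False
```

Keeping the rules in one place like this also makes it easier to add or change a rule later without touching the buffer-mapping code.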
I used the Script Component as a transformation in the Data Flow task. This is where all the logic resides. In the Script Component, I added a new output called Validation with two new output columns, FailedColumn and ValidationDescription. I also included all the source columns.
In the script component, under Input0_ProcessInputRow, I included my code. Mainly I used an If condition for each validation; as I wanted to capture all validation errors, I added the failing rows to the Validation buffer.
Sample Script:
string key1 = String.Concat(Row.Product, Row.Color);
if (key1.Equals(previousKeyValue))
{
    ValidationBuffer.AddRow();
    ValidationBuffer.ValidationDescription = "No duplicated products with the same colour and different amount";
    ValidationBuffer.FailedColumn = "Product and Color";
    /* Map existing columns */
    ValidationBuffer.Amount = Row.Amount;
    ValidationBuffer.Product = Row.Product;
    ValidationBuffer.Color = Row.Color;
}
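One thing worth noting about this duplicate check: because each row is only compared with the immediately preceding row's key (previousKeyValue), it only catches duplicates when the input is sorted by Product and Color. A minimal illustration of the pattern outside SSIS, with made-up rows in place of the Input0 buffer:

```csharp
using System;
using System.Collections.Generic;

// Illustrative rows; in the package these values come from Input0Buffer.
var rows = new[]
{
    (Product: "APPLE", Color: "Yellow"),
    (Product: "APPLE", Color: "Yellow"),  // duplicate of the previous row
    (Product: "PEAR",  Color: "Orange"),
};

var duplicates = new List<string>();
string previousKeyValue = string.Empty;
foreach (var row in rows)
{
    string key = string.Concat(row.Product, row.Color);
    if (key.Equals(previousKeyValue))
        duplicates.Add(key);   // in SSIS, this is where ValidationBuffer.AddRow() fires
    previousKeyValue = key;
}

Console.WriteLine(duplicates.Count);   // 1
```

If the file is not guaranteed to arrive sorted, add a Sort transformation (or sort at the source) before the Script Component, or track all seen keys in a HashSet instead of just the previous one.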
I populated ValidationDescription and FailedColumn with relevant information. This enables me to understand exactly which column failed and why. My output looks like this.
Now I can easily see where and what the issues are. I can also redirect these rows to a table, send them in an email, ignore them and load the rest, or abort the ETL process and notify the relevant person to correct the file.
Prathy
You can download the SSIS project here, or see the complete code of the Script Component for all the above-mentioned validations:
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
#endregion

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private static readonly char[] SpecialChars = "!@#$%^&*()".ToCharArray();
    private string previousKeyValue = string.Empty;

    /* Adds a row to the Validation output and maps the existing source columns */
    private void AddValidationRow(Input0Buffer Row, string description, string failedColumn)
    {
        ValidationBuffer.AddRow();
        ValidationBuffer.ValidationDescription = description;
        ValidationBuffer.FailedColumn = failedColumn;
        ValidationBuffer.Amount = Row.Amount;
        ValidationBuffer.Product = Row.Product;
        ValidationBuffer.Color = Row.Color;
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        /* No duplicated products with the same colour.
           Note: this compares each row with the previous one only,
           so the input must be sorted by Product and Color. */
        string key1 = String.Concat(Row.Product, Row.Color);
        if (key1.Equals(previousKeyValue))
        {
            AddValidationRow(Row, "No duplicated products with the same colour and different amount", "Product and Color");
        }
        previousKeyValue = key1;

        /* Amount is a mandatory field */
        if (string.IsNullOrEmpty(Row.Amount))
        {
            AddValidationRow(Row, "NULL or blank is not a valid value in the Amount field", "Amount");
        }
        else
        {
            /* Only two decimal values are allowed in the Amount field */
            string[] arr = Row.Amount.Split('.');
            if (arr.Length > 1 && arr[1].Length > 2)
            {
                AddValidationRow(Row, "Only two decimal values are allowed in the Amount field", "Amount");
            }

            /* No 0's allowed in the Amount field */
            if (string.Equals(Row.Amount, "0"))
            {
                AddValidationRow(Row, "0 is not a valid value in the Amount field", "Amount");
            }

            /* No commas allowed in the Amount field */
            if (Row.Amount.Contains(","))
            {
                AddValidationRow(Row, ", is not a valid character in the Amount field", "Amount");
            }
        }

        /* Product column is mandatory */
        if (String.IsNullOrEmpty(Row.Product))
        {
            AddValidationRow(Row, "Product is a mandatory field", "Product");
        }
        else
        {
            /* No special characters in the Product column */
            if (Row.Product.IndexOfAny(SpecialChars) != -1)
            {
                AddValidationRow(Row, "No special characters in the Product column", "Product");
            }

            /* Product Name must be capital */
            if (String.Compare(Row.Product, Row.Product.ToUpper()) != 0)
            {
                AddValidationRow(Row, "Product Name must be capital", "Product");
            }
        }

        /* Only Yellow and Orange colour products are valid */
        if (!string.Equals(Row.Color, "Yellow") && !string.Equals(Row.Color, "Orange"))
        {
            AddValidationRow(Row, "Only Yellow and Orange colour products are valid", "Color");
        }
    }
}
6 comments
This is genius! Thanks for posting this, it will help me as I am trying to do data validations on incoming CSV files now.
Another good approach would be to use the DQS task for validation. You could define rules for each field, or connected rules for multiple fields. Afterwards you would have validation information for each field with a rule. The disadvantage: you need the Enterprise edition of SQL Server to use it.
Great job! This helped us a lot!
Really useful and well-presented! Thank you very much!
I am using your solution to get validation data/lists from a source as to make the process less hard-coded. Any tips I could use?
Hi Prathy,
Thanks for sharing such nice post on applying validation rules on SSIS package.
I am exploring SSIS package solution to validate and implement custom business rules in my project. Can you provide suggestions on this?
Scenario:
1. Solution will read a flat file with 50-100 rows (records)
2. Before importing these rows into database tables, each field in every column has to be validated against certain business rules. If a value does not satisfy a rule, that row will be skipped and the next row read.
3. Can “For each loop” container and script component be handy in this case?
Thank you,
Binoy