Using Regular Expressions in SSIS

Introduction

In this short article, you will learn how to write Regular expressions in SSIS (i.e. Regex) and what tool to use to test them. You will also find helpful resources on how to write more sophisticated expressions and learn more about them. For demo purposes, we will use FREE SSIS Regex Parser Task to parse and extract the text using Regex.

 

You can use Regular expressions in several SSIS PowerPack connectors:

Writing Regular Expressions in SSIS

Based on the SSIS component you use it will include the filename in filtering options or match the text: See below Syntax to write Regex in ZappySys tools. We support additional construct {{X, Y}} at the end of Regex to control two parameters. This additional construct is useful for data extracted from matching regex. If you want to Test patterns and not worry about data extraction then no need to use last {{ Occurance_Index, Group_Index_Or_Name}}

Syntax:
<your Regular Expression>[{{Occurance_Index|*[,Group_Index_Or_Name]}}]

Where:
Occurance_Index=Occurrence index you want to extract (X=0 means the first match) and * means all matches. Use a minus sign to get an occurrence from reverse (e.g. {{-0}} returns the last match)
Group_Index_Or_Name=Group index/name within your search pattern (Groups are indicated by parentheses in regular expression, Y=0 means first group). If you named your group in the pattern then you can use the Group name. To use Group Name you must use the new version (the old version doesn’t support this)

How to name a group?

E.g. (\w+)@(?<domain>\w+.com)

How to use the group name in the match extract?

E.g. (\w+)@(?<domain>\w+.com){{0,domain}}

Example Input:

Let’s assume we have the following input text. We will test various Expressions.

Sample Regex Expressions

Expression Description
(?s).* Match anything including new lines. To match anything without new line just use (.*)
\w+([-+.’]\w+)*@(?<domain>\w+([-.]\w+)*\.\w+([-.]\w+)*) Get first email id from text ({{0}} is omitted from end because {{0}} id default)
\w+([-+.’]\w+)*@(?<domain>\w+([-.]\w+)*\.\w+([-.]\w+)*){{-0}} Get last email id from text
\w+([-+.’]\w+)*@(?<domain>\w+([-.]\w+)*\.\w+([-.]\w+)*){{*}} Get all email addresses (separate them with new line). When you suffix Regular expression with {{*}} it will return all matches.
\w+([-+.’]\w+)*@(?<domain>\w+([-.]\w+)*\.\w+([-.]\w+)*){{2}} Get third email id from text (i.e. ends with {{X}} where X is occurrence index starting from 0)
\w+([-+.’]\w+)*@(?<domain>\w+([-.]\w+)*\.\w+([-.]\w+)*){{0,2}} Get first email pattern match (i.e. Index=0) and extract domain (i.e. 2nd group). Index starting from 0 for occurrence and group
(\d*)-(\d*)-(\d*) Get first phone number from text (If you don’t include {{X,Y}} at the end then it will be always [0,0])
^((?!demo|test).)*$ Match whole input text if it does not contain words like demo or test. If word found then No Match
<tag>((.|\n)*?)</tag>{{0,1}} Extract anything between <tag>…</tag> (Include new line char i.e. \n)
<tag>(.*)</tag>{{0,1}} Extract anything between <tag>…</tag> (Exclude new line char i.e. \n)
<!\[CDATA\[((.|\n)*?)\]\]\>{{0,1}} Extract content from CData section of XML Data (This can be CSV, JSON or nested XML )
^$ Match blank string

More Regular Expression Examples

Input Text Regex Matched text Comment
<row id=”123″ process=”Y”> id=”([^”]*)”{{0,1}} 123 This expression shows how to extract group value (i.e. {{0,1}} – first match and 2nd group ). It extracts text between double quotes using [^”]* pattern , match anything until double quote is found.  {{0,1}} syntax is ZappySys specific so it may not work with other Regex engines.
<data>123</data> <data>([^<]*)<\/data>{{0,1}} 123 This expression shows how to extract group value (i.e. {{0,1}} – first match and 2nd group ). It extracts text between double quotes using [^<]* pattern , match anything until <  is found.  {{0,1}} syntax is ZappySys specific so it may not work with other Regex engines.
File_20180930_source.txt File File Will match text/filename that has “File” keyword in it.
File_20180930_SOURCE.dat
File_20180930_source.dat
source|SOURCE SOURCE and source Will match text/filenames that contain either “source” or “SOURCE” keyword.
File_20180930_source.txt File.+source File_20180930_source Will match text/filename that contains keyword that starts with “File” and ends with “source”.
Basically, you can use this pattern if you want to match two keywords in the text that appear in particular order.
File_20180930_source.txt
File_20180830_source.dat
\.txt$|\.dat$ .txt and .dat Will match text/all filenames that end with “.txt” or “.dat”.
File_20180930_source.txt
file_20190102_source.txt
^(F|f)ile_\d{8} File_20180930
file_20190102
Will match text/filename that starts with “File_” or “file_” and then followed by 8 digits.
File_20180930_source.txt
File_20190101_none.txt
(.+)_(.+)_(.+){{0,2}} 20180930 Will match text that has three groups of text strings, separated by “_”.
Non-Regex {{0,2}} notation will bring back second group (index “2”) of first match (index “0”).
File_20180930_source.txt
File_20190101_none.txt
(.+)_(.+)_(.+){{1,2}} 20190101 Will match text that has three groups of text strings, separated by “_”.
Non-Regex {{1,2}} notation will bring back second group (index “2”) of second match (index “1”).
File_20180930_source.txt
File_20190101_none.txt
(.+)_(.+)_(.+){{*,2}} 20180930
20190101
Will match text that has three groups of text strings, separated by “_”.
Non-Regex {{*,2}} notation will bring back second group (index “2”) of all matches (index “*”). Returned matches are separated by \r\n
<html>
<img src=”/img-1.png” />
<img src=”/img-2.png” />
<img src=”/img-3.png” />
</html>
<img[^>]+src=”([^”>]+)”{{*,1}} /img-1.png
/img-2.png
/img-3.png
Will return image URLs from HTML content. We used {{*,1}} means it will pull all occurrences and for each match it will extract first group (which is just src attribute value).
null ^((?!null\b).)*$ <blank> Returns blank if null word found (match all except null)
black white ^((?!red|blue|orange).)*$ black white Returns full string as is if any of those 3 words (i.e. red, blue, orange) not found anywhere in the string
black white red ^((?!red|blue|orange).)*$ <blank> Returns blank if any of those 3 words (i.e. red, blue, orange) found anywhere in the string

Regex Examples (Using SSIS Regular Expression Parser Task)

Here is an example how Regex (.+)_(.+)_(.+){{1,2}} works in Regular Expression Parser Task (FREE):

Using Regular Expressions in SSIS Regex Parser Task (Extract Groups)

Using Regular Expressions in SSIS Regex Parser Task (Extract Groups)

Using Groups / Occurrence Index

Some tasks like SSIS Regex Parser Task (FREE) supports extracting value from specific occurrence and specific part of matched pattern using special syntax at the end of your pattern (see below).

Where :
OccuranceIndex is 0 based (0=extract first occurrence)
GroupIndex is 0 based (0=extract first matching group from pattern. First group is always entire text. )

See above screenshot in previous section for example.

Tools

The best tool we’ve found to write and test Regex is Regex Hero (will require IE with Silverlight if you want to use it online, in the browser):

Another great site for Regex testing is https://regex101.com (Works in any browser unlike previous one)

and few more sites as below

http://www.regexr.com/
http://www.regexlib.com/
http://www.regular-expressions.info/

Resources

Regular Expressions cheat-sheet to hang on the wall

Regular Expressions quick reference

 

Posted in SSIS Advanced File System Task, SSIS Amazon Storage Task, SSIS Azure Blob Storage Task, SSIS Regex Parser Task, SSIS SFTP Task, SSIS Tasks, SSIS Tips & How-Tos, Tools and tagged , .