Using Regular Expressions in SSIS

Contents hide

1 Introduction

2 Writing Regular Expressions in SSIS

3 More Regular Expression Examples

4 Regex Examples (Using SSIS Regular Expression Parser Task)

5 Using Groups / Occurrence Index

6 Tools

7 Resources

Introduction

In this short article, you will learn how to write Regular expressions in SSIS (i.e. Regex) and what tool to use to test them. You will also find helpful resources on how to write more sophisticated expressions and learn more about them. For demo purposes, we will use FREE SSIS Regex Parser Task to parse and extract the text using Regex.

You can use Regular expressions in several SSIS PowerPack connectors:

Writing Regular Expressions in SSIS

Based on the SSIS component you use it will include the filename in filtering options or match the text: See below Syntax to write Regex in ZappySys tools. We support additional construct {{X, Y}} at the end of Regex to control two parameters. This additional construct is useful for data extracted from matching regex. If you want to Test patterns and not worry about data extraction then no need to use last {{ Occurance_Index, Group_Index_Or_Name}}

Syntax:
<your Regular Expression>[{{Occurance_Index|*[,Group_Index_Or_Name]}}]

Where:
Occurance_Index=Occurrence index you want to extract (X=0 means the first match) and * means all matches. Use a minus sign to get an occurrence from reverse (e.g. {{-0}} returns the last match)
Group_Index_Or_Name=Group index/name within your search pattern (Groups are indicated by parentheses in regular expression, Y=0 means first group). If you named your group in the pattern then you can use the Group name. To use Group Name you must use the new version (the old version doesn’t support this)

How to name a group?

E.g. (\w+)@(?<domain>\w+.com)

How to use the group name in the match extract?

E.g. (\w+)@(?<domain>\w+.com){{0,domain}}

Example Input:

Let’s assume we have the following input text. We will test various Expressions.

Customer => AAA
Email => aaa@google.com
Phone => 101-222-3333
========
Customer => BBB
Email => bbb@yahoo.com
Phone => 102-222-3333
========
Customer => CCC
Email => ccc@hotmail.com
Phone => 103-222-3333
========
Customer => DDD
Email => ddd@outlook.com
Phone => 104-222-3333

Customer => AAA

Email => aaa@google.com

Phone => 101-222-3333

========

Customer => BBB

Email => bbb@yahoo.com

Phone => 102-222-3333

========

Customer => CCC

Email => ccc@hotmail.com

Phone => 103-222-3333

========

Customer => DDD

Email => ddd@outlook.com

Phone => 104-222-3333

Sample Regex Expressions

Expression	Description
(?s).*	Match anything including new lines. To match anything without new line just use (.*)
\w+([-+.’]\w+)@(?<domain>\w+([-.]\w+)\.\w+([-.]\w+)*)	Get first email id from text ({{0}} is omitted from end because {{0}} id default)
\w+([-+.’]\w+)@(?<domain>\w+([-.]\w+)\.\w+([-.]\w+)*){{-0}}	Get last email id from text
\w+([-+.’]\w+)@(?<domain>\w+([-.]\w+)\.\w+([-.]\w+)){{}}	Get all email addresses (separate them with new line). When you suffix Regular expression with {{*}} it will return all matches.
\w+([-+.’]\w+)@(?<domain>\w+([-.]\w+)\.\w+([-.]\w+)*){{2}}	Get third email id from text (i.e. ends with {{X}} where X is occurrence index starting from 0)
\w+([-+.’]\w+)@(?<domain>\w+([-.]\w+)\.\w+([-.]\w+)*){{0,2}}	Get first email pattern match (i.e. Index=0) and extract domain (i.e. 2nd group). Index starting from 0 for occurrence and group
(\d)-(\d)-(\d*)	Get first phone number from text (If you don’t include {{X,Y}} at the end then it will be always [0,0])
^((?!demo\|test).)*$	Match whole input text if it does not contain words like demo or test. If word found then No Match
<tag>((.\|\n)*?)</tag>{{0,1}}	Extract anything between <tag>…</tag> (Include new line char i.e. \n)
<tag>(.*)</tag>{{0,1}}	Extract anything between <tag>…</tag> (Exclude new line char i.e. \n)
<!\[CDATA\[((.\|\n)*?)\]\]\>{{0,1}}	Extract content from CData section of XML Data (This can be CSV, JSON or nested XML )
^$	Match blank string

More Regular Expression Examples

Input Text	Regex	Matched text	Comment
<row id=”123″ process=”Y”>	id=”([^”]*)”{{0,1}}	123	This expression shows how to extract group value (i.e. {{0,1}} – first match and 2nd group ). It extracts text between double quotes using [^”]* pattern , match anything until double quote is found. {{0,1}} syntax is ZappySys specific so it may not work with other Regex engines.
<data>123</data>	<data>([^<]*)<\/data>{{0,1}}	123	This expression shows how to extract group value (i.e. {{0,1}} – first match and 2nd group ). It extracts text between double quotes using [^<]* pattern , match anything until < is found. {{0,1}} syntax is ZappySys specific so it may not work with other Regex engines.
File_20180930_source.txt	File	File	Will match text/filename that has “File” keyword in it.
File_20180930_SOURCE.dat File_20180930_source.dat	source\|SOURCE	SOURCE and source	Will match text/filenames that contain either “source” or “SOURCE” keyword.
File_20180930_source.txt	File.+source	File_20180930_source	Will match text/filename that contains keyword that starts with “File” and ends with “source”. Basically, you can use this pattern if you want to match two keywords in the text that appear in particular order.
File_20180930_source.txt File_20180830_source.dat	\.txt$\|\.dat$	.txt and .dat	Will match text/all filenames that end with “.txt” or “.dat”.
File_20180930_source.txt file_20190102_source.txt	^(F\|f)ile_\d{8}	File_20180930 file_20190102	Will match text/filename that starts with “File_” or “file_” and then followed by 8 digits.
File_20180930_source.txt File_20190101_none.txt	(.+)_(.+)_(.+){{0,2}}	20180930	Will match text that has three groups of text strings, separated by “_”. Non-Regex {{0,2}} notation will bring back second group (index “2”) of first match (index “0”).
File_20180930_source.txt File_20190101_none.txt	(.+)_(.+)_(.+){{1,2}}	20190101	Will match text that has three groups of text strings, separated by “_”. Non-Regex {{1,2}} notation will bring back second group (index “2”) of second match (index “1”).
File_20180930_source.txt File_20190101_none.txt	(.+)_(.+)_(.+){{*,2}}	20180930 20190101	Will match text that has three groups of text strings, separated by “_”. Non-Regex {{,2}} notation will bring back second group (index “2”) of all matches (index “”). Returned matches are separated by \r\n
<html> <img src=”/img-1.png” /> <img src=”/img-2.png” /> <img src=”/img-3.png” /> </html>	<img[^>]+src=”([^”>]+)”{{*,1}}	/img-1.png /img-2.png /img-3.png	Will return image URLs from HTML content. We used {{*,1}} means it will pull all occurrences and for each match it will extract first group (which is just src attribute value).
null	^((?!null\b).)*$	<blank>	Returns blank if null word found (match all except null)
black white	^((?!red\|blue\|orange).)*$	black white	Returns full string as is if any of those 3 words (i.e. red, blue, orange) not found anywhere in the string
black white red	^((?!red\|blue\|orange).)*$	<blank>	Returns blank if any of those 3 words (i.e. red, blue, orange) found anywhere in the string

Regex Examples (Using SSIS Regular Expression Parser Task)

Here is an example how Regex (.+)_(.+)_(.+){{1,2}} works in Regular Expression Parser Task (FREE):

Using Regular Expressions in SSIS Regex Parser Task (Extract Groups)

Using Groups / Occurrence Index

Some tasks like SSIS Regex Parser Task (FREE) supports extracting value from specific occurrence and specific part of matched pattern using special syntax at the end of your pattern (see below).

Your Regex Pattern Here{{OccuranceIndex,GroupIndex}}

1	Your Regex Pattern Here{{OccuranceIndex,GroupIndex}}

Where :
OccuranceIndex is 0 based (0=extract first occurrence)
GroupIndex is 0 based (0=extract first matching group from pattern. First group is always entire text. )

See above screenshot in previous section for example.

Tools

The best tool we’ve found to write and test Regex is Regex Hero (will require IE with Silverlight if you want to use it online, in the browser):

Another great site for Regex testing is https://regex101.com (Works in any browser unlike previous one)

and few more sites as below

http://www.regexr.com/
http://www.regexlib.com/
http://www.regular-expressions.info/

Resources

Regular Expressions cheat-sheet to hang on the wall

Regular Expressions quick reference