Introduction
In this short article, you will learn how to write Regular expressions in SSIS (i.e. Regex) and what tool to use to test them. You will also find helpful resources on how to write more sophisticated expressions and learn more about them. For demo purposes, we will use FREE SSIS Regex Parser Task to parse and extract the text using Regex.
You can use Regular expressions in several SSIS PowerPack connectors:
- SSIS Regex Parser Task (FREE),
- Azure Blob Storage Task,
- Advanced File System Task,
- Amazon S3 Storage Task and others.
Writing Regular Expressions in SSIS
Based on the SSIS component you use it will include the filename in filtering options or match the text: See below Syntax to write Regex in ZappySys tools. We support additional construct {{X, Y}} at the end of Regex to control two parameters. This additional construct is useful for data extracted from matching regex. If you want to Test patterns and not worry about data extraction then no need to use last {{ Occurance_Index, Group_Index_Or_Name}}
Syntax:
<your Regular Expression>[{{Occurance_Index|*[,Group_Index_Or_Name]}}]
Where:
Occurance_Index=Occurrence index you want to extract (X=0 means the first match) and * means all matches. Use a minus sign to get an occurrence from reverse (e.g. {{-0}} returns the last match)
Group_Index_Or_Name=Group index/name within your search pattern (Groups are indicated by parentheses in regular expression, Y=0 means first group). If you named your group in the pattern then you can use the Group name. To use Group Name you must use the new version (the old version doesn’t support this)
How to name a group?
E.g. (\w+)@(?<domain>\w+.com)
How to use the group name in the match extract?
E.g. (\w+)@(?<domain>\w+.com){{0,domain}}
Example Input:
Let’s assume we have the following input text. We will test various Expressions.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
Customer => AAA Email => aaa@google.com Phone => 101-222-3333 ======== Customer => BBB Email => bbb@yahoo.com Phone => 102-222-3333 ======== Customer => CCC Email => ccc@hotmail.com Phone => 103-222-3333 ======== Customer => DDD Email => ddd@outlook.com Phone => 104-222-3333 |
Sample Regex Expressions
Expression | Description |
(?s).* | Match anything including new lines. To match anything without new line just use (.*) |
\w+([-+.’]\w+)*@(?<domain>\w+([-.]\w+)*\.\w+([-.]\w+)*) | Get first email id from text ({{0}} is omitted from end because {{0}} id default) |
\w+([-+.’]\w+)*@(?<domain>\w+([-.]\w+)*\.\w+([-.]\w+)*){{-0}} | Get last email id from text |
\w+([-+.’]\w+)*@(?<domain>\w+([-.]\w+)*\.\w+([-.]\w+)*){{*}} | Get all email addresses (separate them with new line). When you suffix Regular expression with {{*}} it will return all matches. |
\w+([-+.’]\w+)*@(?<domain>\w+([-.]\w+)*\.\w+([-.]\w+)*){{2}} | Get third email id from text (i.e. ends with {{X}} where X is occurrence index starting from 0) |
\w+([-+.’]\w+)*@(?<domain>\w+([-.]\w+)*\.\w+([-.]\w+)*){{0,2}} | Get first email pattern match (i.e. Index=0) and extract domain (i.e. 2nd group). Index starting from 0 for occurrence and group |
(\d*)-(\d*)-(\d*) | Get first phone number from text (If you don’t include {{X,Y}} at the end then it will be always [0,0]) |
^((?!demo|test).)*$ | Match whole input text if it does not contain words like demo or test. If word found then No Match |
<tag>((.|\n)*?)</tag>{{0,1}} | Extract anything between <tag>…</tag> (Include new line char i.e. \n) |
<tag>(.*)</tag>{{0,1}} | Extract anything between <tag>…</tag> (Exclude new line char i.e. \n) |
<!\[CDATA\[((.|\n)*?)\]\]\>{{0,1}} | Extract content from CData section of XML Data (This can be CSV, JSON or nested XML ) |
^$ | Match blank string |
More Regular Expression Examples
Input Text | Regex | Matched text | Comment |
<row id=”123″ process=”Y”> | id=”([^”]*)”{{0,1}} | 123 | This expression shows how to extract group value (i.e. {{0,1}} – first match and 2nd group ). It extracts text between double quotes using [^”]* pattern , match anything until double quote is found. {{0,1}} syntax is ZappySys specific so it may not work with other Regex engines. |
<data>123</data> | <data>([^<]*)<\/data>{{0,1}} | 123 | This expression shows how to extract group value (i.e. {{0,1}} – first match and 2nd group ). It extracts text between double quotes using [^<]* pattern , match anything until < is found. {{0,1}} syntax is ZappySys specific so it may not work with other Regex engines. |
File_20180930_source.txt | File | File | Will match text/filename that has “File” keyword in it. |
File_20180930_SOURCE.dat File_20180930_source.dat |
source|SOURCE | SOURCE and source | Will match text/filenames that contain either “source” or “SOURCE” keyword. |
File_20180930_source.txt | File.+source | File_20180930_source | Will match text/filename that contains keyword that starts with “File” and ends with “source”. Basically, you can use this pattern if you want to match two keywords in the text that appear in particular order. |
File_20180930_source.txt File_20180830_source.dat |
\.txt$|\.dat$ | .txt and .dat | Will match text/all filenames that end with “.txt” or “.dat”. |
File_20180930_source.txt file_20190102_source.txt |
^(F|f)ile_\d{8} | File_20180930 file_20190102 |
Will match text/filename that starts with “File_” or “file_” and then followed by 8 digits. |
File_20180930_source.txt File_20190101_none.txt |
(.+)_(.+)_(.+){{0,2}} | 20180930 | Will match text that has three groups of text strings, separated by “_”. Non-Regex {{0,2}} notation will bring back second group (index “2”) of first match (index “0”). |
File_20180930_source.txt File_20190101_none.txt |
(.+)_(.+)_(.+){{1,2}} | 20190101 | Will match text that has three groups of text strings, separated by “_”. Non-Regex {{1,2}} notation will bring back second group (index “2”) of second match (index “1”). |
File_20180930_source.txt File_20190101_none.txt |
(.+)_(.+)_(.+){{*,2}} | 20180930 20190101 |
Will match text that has three groups of text strings, separated by “_”. Non-Regex {{*,2}} notation will bring back second group (index “2”) of all matches (index “*”). Returned matches are separated by \r\n |
<html> <img src=”/img-1.png” /> <img src=”/img-2.png” /> <img src=”/img-3.png” /> </html> |
<img[^>]+src=”([^”>]+)”{{*,1}} | /img-1.png /img-2.png /img-3.png |
Will return image URLs from HTML content. We used {{*,1}} means it will pull all occurrences and for each match it will extract first group (which is just src attribute value). |
null | ^((?!null\b).)*$ | <blank> | Returns blank if null word found (match all except null) |
black white | ^((?!red|blue|orange).)*$ | black white | Returns full string as is if any of those 3 words (i.e. red, blue, orange) not found anywhere in the string |
black white red | ^((?!red|blue|orange).)*$ | <blank> | Returns blank if any of those 3 words (i.e. red, blue, orange) found anywhere in the string |
Regex Examples (Using SSIS Regular Expression Parser Task)
Here is an example how Regex (.+)_(.+)_(.+){{1,2}} works in Regular Expression Parser Task (FREE):
Using Groups / Occurrence Index
Some tasks like SSIS Regex Parser Task (FREE) supports extracting value from specific occurrence and specific part of matched pattern using special syntax at the end of your pattern (see below).
1 |
Your Regex Pattern Here{{OccuranceIndex,GroupIndex}} |
Where :
OccuranceIndex is 0 based (0=extract first occurrence)
GroupIndex is 0 based (0=extract first matching group from pattern. First group is always entire text. )
See above screenshot in previous section for example.
Tools
The best tool we’ve found to write and test Regex is Regex Hero (will require IE with Silverlight if you want to use it online, in the browser):
Another great site for Regex testing is https://regex101.com (Works in any browser unlike previous one)
and few more sites as below
http://www.regexr.com/
http://www.regexlib.com/
http://www.regular-expressions.info/
Resources
Regular Expressions cheat-sheet to hang on the wall
Regular Expressions quick reference