Introduction
In this post you will learn how to use the FREE SSIS Regex Parser Task together with the REST API Task to extract HTML content in a few clicks.
Scenario
Assume that you want to search for certain keywords on Bing or Google and want to know how many pages were found for each keyword. The search URL would look something like http://www.bing.com/search?q=regex where regex is our search word.
When the page is returned, view its source code and you will find a tag like the one below.
<span class="sb_count" data-bm="4">21,00,000 results</span>
What we want is the number 21,00,000, which we will extract with a regular expression pattern search.
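To make the goal concrete before moving to SSIS, here is a minimal Python sketch of the same extraction applied to the sample tag. It is only an illustration outside SSIS; the pattern and the variable names (`html`, `pattern`, `count_text`, `count`) are assumptions made for this example, not part of the Regex Parser Task.

```python
import re

# Sample tag copied from the page source shown above.
html = '<span class="sb_count" data-bm="4">21,00,000 results</span>'

# Assumed pattern: tolerate any extra attributes around class="sb_count",
# then capture the run of digits, commas and dots after the closing '>'.
pattern = re.compile(r'<span[^>]*class="sb_count"[^>]*>\s*([0-9,.]+)')

match = pattern.search(html)
count_text = match.group(1) if match else None

# Strip the thousands separators to get a plain integer.
count = int(count_text.replace(",", "")) if count_text else None

print(count_text)  # 21,00,000
print(count)       # 2100000
```

Capturing only the digits and separators keeps the word "results" out of the extracted value, which is the same idea the capture group serves in the SSIS expression.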
Step-By-Step: Extract HTML Tag Value Using a Regular Expression
- Download and install SSIS PowerPack (it includes the FREE SSIS Regex Parser Task)
- Create a new SSIS Package
- Drag the ZS REST API Task from the SSIS Toolbox onto the Control Flow designer
- Double-click the task to configure it. Enter the URL you want to fetch, e.g. http://www.bing.com/search?q=regex
- Click the Response tab and check the Save response option. Select Save to Variable. If needed, create a new variable.
- Click Test (scroll to the bottom to see the HTML content)
- Now drag the ZS Regex Parser Task onto the designer and connect it to the REST API Task
- Select the Variable that holds the HTML text you want to parse.
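- Enter the following expression and map the target to a Variable if you want to save the extracted value. The expression ends with {{0,1}}, which means: take the first match (match index 0) and, within that match, the group at index 1. Both indexes are 0-based and group 0 is the whole match, so index 1 is the 2nd group, which holds the actual count of search results. If you omit {{x,y}} at the end, {{0,0}} is used. Note that this pattern expects > immediately after the class attribute; if the tag carries extra attributes (such as data-bm in the sample above), replace the \s*> that follows class="sb_count" with [^>]*>.
\<span\s*\w*\s*class="sb_count"\s*>\s*(?<p2>[0-9,.]*){{0,1}}
- In the above step you can select a Variable as input or use a placeholder in a Direct string (e.g. {{User::varHtml}})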
- You can also connect a ZS Logging Task to show the extracted value
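The {{match,group}} suffix described in the steps above can be sketched in Python to show how the two indexes select a value. This is not how the Regex Parser Task is implemented; `extract` is a hypothetical helper written for this example, and the pattern is a Python adaptation of the one above (Python spells the named group `(?P<p2>...)`, and `[^>]*` is added to tolerate extra attributes such as data-bm).

```python
import re

def extract(expression, text):
    """Hypothetical helper imitating the {{match,group}} suffix."""
    # Look for an optional trailing {{x,y}}; default to {{0,0}} when absent.
    suffix = re.search(r'\{\{(\d+),(\d+)\}\}$', expression)
    if suffix:
        match_index = int(suffix.group(1))
        group_index = int(suffix.group(2))
        expression = expression[:suffix.start()]
    else:
        match_index, group_index = 0, 0

    matches = list(re.finditer(expression, text))
    if match_index >= len(matches):
        return None  # fewer matches than requested
    # Group 0 is the whole match, group 1 the first capture group.
    return matches[match_index].group(group_index)

html = '<span class="sb_count" data-bm="4">21,00,000 results</span>'
expr = r'<span[^>]*class="sb_count"[^>]*>\s*(?P<p2>[0-9,.]*)'

whole = extract(expr, html)             # same as {{0,0}}: the whole match
count = extract(expr + '{{0,1}}', html)  # first match, group index 1
missing = extract(expr + '{{1,1}}', html)  # there is no second match

print(whole)    # <span class="sb_count" data-bm="4">21,00,000
print(count)    # 21,00,000
print(missing)  # None
```

The comparison between `whole` and `count` shows why the suffix matters: with the default {{0,0}} the saved value would include the tag text, while {{0,1}} keeps only the captured number.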
Here is the final flow.