Introduction
In this post you will learn how to extract data from web pages using SSIS. In other words, read HTML Table in SSIS, then Loop through extracted links and finally download files. To achieve this scenario we will use SSIS HTML Source and REST API Task.
Prerequisites
Before we perform the steps listed in this article, you will need to make sure the following prerequisites are met:- SSIS designer installed. Sometimes it is referred to as BIDS or SSDT (download it from the Microsoft site).
- Basic knowledge of SSIS package development using Microsoft SQL Server Integration Services.
- Make sure ZappySys SSIS PowerPack is installed (download it, if you haven't already).
- (Optional step). Read this article, if you are planning to deploy packages to a server and schedule their execution later.
Step-By-Step
To achieve desired extraction from web pages / download files we have to use few different components. Let’s get started.
Read HTML Table from Web Page
First, let’s read desired content from web URL using SSIS HTML Source. This component can read data from URL(s), Local HTML File(s) or Direct HTML String stored in a SSIS variable.
- Create a new SSIS Project
- In the control flow, Right click anywhere in the designer and click Variables option.
- Create following 3 variables which we will use later on.
FileList (Object Type)
ZipFileName (String Type)
ZipFileUrl (String Type) – Default Value must not be blank (Enter anything like http://a.com ). - Drag Data Flow Task from SSIS Toolbox to designer
- Go to data flow designer by clicking on the data flow task.
- Drag and drop ZS HTML Table Source and double click to edit it. Enter the HTML Web Page URL from where you like to read information. For example purpose, we will use below URL. On that page we have a table with files we need to download.
1http://www.patentsview.org/download - Now let’s go to Extract Links and Images tab and check Output Links for each column. This will give us link for Zip File URL which we want to download.
- Click Preview to see data.
- Now click on OK to save the UI.
- In next section we will configure Recordset Destination
Save Records to SSIS Variable
Now let’s look at how to save HTML Table into in memory recordset which we will loop through later on. You can save data directly to relational destination such as SQL Server or Oracle but in our scenario we are extracting links and downloading them one by one.
- In continuation of previous step, Once done with HTML Table Source configuration, Drag Recordset Destination from Data flow toolbox
- Connect HTML Table Source to Recordset Destination
- Select Variable to hold recordset (Must be Object Datatype – in our case we have User::FileList)
- Select Input columns you like to use. In our case we will Pick two columns as below (Name and Link).
- Click OK to save Recordset destination.
- In our next step we will look at how to loop through these records.
Loop through records in SSIS
Once you done with previous step, switch to control flow tab and follow below steps.
- Drag ForEach Loop Task from SSIS Toolbox.
- Connect Data Flow Task to ForEach Loop Task
- Double click ForEach Loop task to configure
- On the Collection Pane, select Enumerator as Foreach ADO Enumerator as below.
- On the Variable Mappings tab you can map columns you care about. For example in the previous section we selected two input columns for recordset destination. We can map same columns here in same the same order.
- Click OK to save our settings.
Download File from URL in SSIS
Now, after Loop configuration we can define what we want to do in each iteration. For example we can simply drag Logging Task and display Variables we extracted in ForEach Loop Variable Mappings tab. For downloading links we extracted in previous step you can use REST API Task as below.
- Drag ZS REST API Task from SSIS Toolbox and drop inside ForEach Loop.
- You can now set Dynamic URL using placeholder {{User::ZipFileUrl}}
- Now one more thing we need to do is define Target path where we will save downloaded file.
- Go to, Response Settings Tab, Check Save Response option and Select Save to File.
- Also, check Treat Response As Binary
- Click OK to save REST API Task.
That’s it. You can now run SSIS Package and see things in action.
Conclusion
In this article, we saw how easy it is to extract unstructured from web in few clicks. Using components such as HTML Table SOurce and REST API Task we can read / parse HTML Pages. Download SSIS PowerPack to explore many other options including 70+ tasks / components we have not discussed in this article.