Introduction
The article shows how to download images from a web page using SSIS. Sometimes we need to download the images from a web page. In this post, we will show how to do this.
Prerequisites
Before we perform steps listed in this article, you will need to make sure following prerequisites are met:
- SSIS designer installed. Sometimes it is referred as BIDS or SSDT (download it from Microsoft site).
- Basic knowledge of SSIS package development using Microsoft SQL Server Integration Services.
- Make sure ZappySys SSIS PowerPack is installed (download it).
- Optional (If you want to Deploy and Schedule ) – Deploy and Schedule SSIS Packages
Steps-by-step process to download images from HTML using SSIS
Use REST API task to get the HTML body
1. Drag and drop the REST API Task from SSIS toolbox, select the html page you want and save it in a variable.
2. Go to Response Settings. Check the option Save the response content. In Save Mode, select Save to File. In the option Enter File Path, write the path for the html file.
Parse the image with Regular Expression parser task
3. From the SSIS toolbox drag and drop Regular Expression Parser Task on the Control flow designer surface.
4. The next step is to save the source path from the images. You need to use Regex and here are two example you can use, in this page Regex101 you can check more details about the expressions we are using:
Expression 1: <img.*?src="(.*?)"{{*}} Expression 2: src="([a-z\-_0-9\/\:\.]*\.(png|jpg|jpeg|gif|png))"{{*}}
Read the image source with CSV source in order to download images from a Web page
5. Now, Drag and Drop SSIS Data Flow Task from SSIS Toolbox.
6. Double click on the DataFlow task to see DataFlow designer surface.
7. From the SSIS toolbox drag and drop CSV source and insert the variable you are using from the previous task
Get the image name, image full path and the destination folder to download images from a Web Page
8. From the SSIS toolbox drag and drop Derived Column transform to remove the HTML code and get the image name.
Expression for the Image name:
1 |
(DT_WSTR,1000)REVERSE(LEFT(REVERSE(REPLACE(REPLACE(Column1,"src=\"",""),"\"","")),(FINDSTRING(REVERSE(REPLACE(REPLACE(Column1,"src=\"",""),"\"","")),"/",1) - 1))) |
Remove the HTML code:
1 |
REPLACE(REPLACE(Column1,"src=\"",""),"\"","") |
9. Now we will drag and drop another Derived Column, this one is for the file path for the images and validate if the image URL is valid. If the image URL has “http” in the stringm then it is valid otherwise you need to add the rest of the URL. Here is our example for the filePath and validate the URL image. If everything is OK, we can send the image URL.
1 2 3 4 5 |
File path: (DT_WSTR,2500)@[User::vDestinationFolder] + ImageName Validate image URL: ((FINDSTRING(Column1,"http",1) > 0) ? Column1 : (DT_WSTR,2500)("https://www.urlexample/image/" + Column1)) Note: @[User::vDestinationFolder] is a variable we used for the local path |
Send the request and save the image In the folder TO download images from a Web Page
10. Now we will drag and drop a web API destination and select the column in the input column to URL. If the request is right, we can save the image in the local file.
11. Now drag and drop an Export Column transformation and select the ResponseText from the request and the file path.
12. Finally, we use a trash destination to close the flow.
Conclusion
If everything is OK, you will be able to download the images from your HTML page. To do that, we read the list of URLs from a list. Then we get the name of the images using expressions. Then we add a path for each image to store them. Finally, we save the images.