How to download images from a web page using SSIS

Introduction

This article shows how to download the images referenced by a web page using SSIS. The package fetches the HTML of the page, extracts the image URLs with a regular expression, and saves each image to a local folder.

Prerequisites

Before you follow the steps in this article, make sure the following prerequisites are met:

  1. SSIS designer installed. It is sometimes referred to as BIDS or SSDT (download it from the Microsoft site).
  2. Basic knowledge of SSIS package development using Microsoft SQL Server Integration Services.
  3. Make sure ZappySys SSIS PowerPack is installed (download it).
  4. Optional (if you want to deploy and schedule) – Deploy and Schedule SSIS Packages

Step-by-step process to download images from HTML using SSIS

Use REST API task to get the HTML body

1. Drag and drop the REST API Task from the SSIS toolbox, select the HTML page you want, and save it in a variable.

Select the page you want to get the images from

2. Go to Response Settings. Check the option Save the response content. In Save Mode, select Save to File. In the option Enter File Path, enter the path for the HTML file.

Save the page in a file
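For comparison, the same two steps (request the page, then save the response body to a file) can be sketched outside SSIS in Python. This is only an illustration with the standard library, not a description of how the REST API Task works internally; the function name save_page and its parameters are hypothetical.

```python
import urllib.request
from pathlib import Path


def save_page(url: str, html_path: str) -> str:
    """Fetch `url`, save the response body to `html_path`, and return the HTML text."""
    with urllib.request.urlopen(url) as response:
        body = response.read()  # raw bytes of the page
    Path(html_path).write_bytes(body)
    # Assumption: the page is UTF-8; bytes that cannot be decoded are replaced.
    return body.decode("utf-8", errors="replace")
```

A call such as save_page("https://example.com/", "page.html") would write the page to page.html and return its text, which plays the role of the variable used in the next steps.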

Parse the image with Regular Expression parser task

3. From the SSIS toolbox, drag and drop the Regular Expression Parser Task onto the Control Flow designer surface.

4. The next step is to extract the source path of each image. You need a regular expression for this, and here are two examples you can use. On Regex101 you can test them and see more details about each expression:

Expression 1: <img.*?src="(.*?)"{{*}}

Expression 2: src="([a-z\-_0-9\/\:\.]*\.(png|jpg|jpeg|gif|png))"{{*}}

Regex expression to get the image code from the page
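To see what the two expressions capture, they can be tried in Python. This is a rough sketch, not the Regular Expression Parser Task itself: the trailing {{*}} is left out on the assumption that it is task-specific syntax rather than part of the regular expression, and the sample HTML and the helper name extract_sources are made up for illustration.

```python
import re

# The two expressions from step 4, without the trailing {{*}} suffix
# (assumed to be Regular Expression Parser Task syntax, not regex).
EXPRESSION_1 = r'<img.*?src="(.*?)"'
EXPRESSION_2 = r'src="([a-z\-_0-9\/\:\.]*\.(png|jpg|jpeg|gif|png))"'


def extract_sources(html: str, pattern: str) -> list[str]:
    """Return capture group 1 (the src value) for every match in `html`."""
    return [match.group(1) for match in re.finditer(pattern, html)]


# Hypothetical sample page with one absolute and one relative image URL.
sample = (
    '<p><img class="logo" src="https://example.com/img/logo.png" alt="Logo">'
    '<img src="/static/photo.jpg"></p>'
)
print(extract_sources(sample, EXPRESSION_1))
print(extract_sources(sample, EXPRESSION_2))
```

In this sketch both expressions return the same two values for the sample. As written, Expression 2 only accepts lowercase characters, so in Python a source such as Logo.PNG is skipped unless the match is made case-insensitive.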

Read the image source with CSV source in order to download images from a Web page

5. Now drag and drop the SSIS Data Flow Task from the SSIS toolbox.

6. Double-click the Data Flow Task to open the Data Flow designer surface.

7. From the SSIS toolbox, drag and drop a CSV Source and set it to read the variable you used in the previous task.

CSV Source configuration

Use CSV source to read the variable
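Conceptually, this step turns the text held in the variable into rows with one image source per row. A minimal Python sketch of that idea follows. It assumes the variable contains one match per line, which the article does not state, and the column name ImageSrc and the function name rows_from_variable are hypothetical.

```python
import csv
import io


def rows_from_variable(value: str, column: str = "ImageSrc") -> list[dict]:
    """Parse text with one value per line into rows with a single named column."""
    reader = csv.reader(io.StringIO(value))
    # Skip empty lines; keep only the first field of each line.
    return [{column: fields[0]} for fields in reader if fields]
```

For example, rows_from_variable("https://example.com/a.png\n/static/b.jpg\n") yields two rows, one per image source.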

Get the image name, image full path and the destination folder to download images from a Web Page

8. From the SSIS toolbox drag and drop Derived Column transform to remove the HTML code and get the image name.

Expression for the Image name:

Remove the HTML code:

Image name

Add a new column and remove HTML code

9. Now drag and drop another Derived Column. This one builds the file path for each image and checks whether the image URL is valid. If the image URL contains “http”, it is valid; otherwise you need to add the rest of the URL. Here is our example for the filePath column and the URL check. If everything is OK, we can send the image URL.

Expression in derived column

Add a new column and verify the HTML path
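The logic of the two Derived Column steps can be sketched in Python as follows. These are not the SSIS expressions used in the package; the function names are hypothetical, the check tests whether the source starts with "http" (slightly stricter than the "contains" check described above), and relative sources are resolved against the page URL with urljoin.

```python
import os
import posixpath
from urllib.parse import urljoin, urlparse


def image_name(src: str) -> str:
    """Return the last path segment of the image URL, for example 'logo.png'."""
    return posixpath.basename(urlparse(src).path)


def full_url(src: str, page_url: str) -> str:
    """Keep absolute http(s) URLs as they are; resolve anything else against the page URL."""
    if src.lower().startswith("http"):
        return src
    return urljoin(page_url, src)


def file_path(src: str, folder: str) -> str:
    """Build the local destination path for the image inside `folder`."""
    return os.path.join(folder, image_name(src))
```

With a page at https://example.com/blog/post.html, the source /static/photo.jpg resolves to https://example.com/static/photo.jpg, and its image name is photo.jpg.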

Send the request and save the image in the folder to download images from a web page

10. Now drag and drop a Web API Destination and map the image URL column to the URL input column. If the request succeeds, we can save the image to a local file.

Web API destination for images URL

Send a request for all images you get

11. Now drag and drop an Export Column transformation and select the ResponseText from the request and the file path.

Select the image response and the local file path

12. Finally, we use a Trash Destination to close the flow.

The final result
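The request-and-save part of the flow (steps 10 and 11) amounts to fetching each image URL and writing the response to the file path built earlier. A minimal Python sketch of that idea, using only the standard library, is shown here; the function name download_image is hypothetical and error handling is left out.

```python
import urllib.request
from pathlib import Path


def download_image(url: str, file_path: str) -> int:
    """Download `url`, write the raw bytes to `file_path`, and return the byte count."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    destination = Path(file_path)
    destination.parent.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    destination.write_bytes(data)  # binary write keeps the image intact
    return len(data)
```

Images are binary, so the sketch writes bytes rather than text. Calling it once per row, with the image URL and the local file path, mirrors what the data flow does for every image found on the page.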

Conclusion

If everything is OK, you will be able to download the images from your HTML page. To do that, we saved the HTML of the page and extracted the list of image URLs with a regular expression. Then we got the name of each image using expressions and added a local path to store it. Finally, we requested each image and saved it.
