SSIS HTML Table Source
PreviousNext

SSIS HTML Table Source can be used for HTML content scraping (e.g. extract data placed inside TABLE tags). Many times valuable information is stored on web pages and not available to download as traditional data file format. With this component you can extract HTML table data in few clicks without coding.

Download SSIS PowerPack

Content

Video Tutorial

Coming soon...

Step-By-Step

In this tutorial we are going to Extract Data from Wikipedia. HTML Table scraping from any URL or HTML files stored locally (Direct URL mode or Read from SSIS variable). We can extract Data from Various Method(using Table Number, By CSS, By XPath).
  1. Firstly, You need to Download and Install SSIS ZappySys PowerPack.
  2. Once you finished first step, Open Visual Studio and Create New SSIS Package Project.
  3. Now, Drag and Drop SSIS Data Flow Task from SSIS Toolbox.
    SSIS Data Flow Task - Drag and Drop
  4. Double click on the Data Flow task to see Data Flow designer surface.
  5. Here, In Visual Studio, drag and drop the ZS HTML Table Source in the design surface.
    SSIS HTML Table Source - Drag and Drop

How to Extract Table data Using Table Number.

  1. Double click on HTML Table Source to configure it.
  2. Set Direct Access Mode, Click on Example1, Select ByNumber Table Extract Mode. Change Table Number You want to extract Data.
    https://en.wikipedia.org/wiki/World_population
    SSIS HTML Table Source - Table Number
  3. Click On Preview button to see Data Preview.
  4. That's all, now just click on OK button to save HTML Table Source Setting UI.

How to Extract Table data Using ByCSS Mode.

  1. We can also read HTML Table using ByCSS Access Mode, Double click on HTML Table Source to configure it.
  2. Set ByCss Table Extract Mode, use your Table CSS Class.
    infobox
    SSIS HTML Table Source - CSS Class
  3. Click On Preview button to see Data Preview.
  4. That's all, now just click on OK button to save HTML Table Source Setting UI.

How to Extract Table data Using XPath Mode.

  1. We can also read HTML Table using XPath Access Mode, Double click on HTML Table Source to configure it.
  2. Set ByXPath Table Extract Mode, Use your table CSS Class in Table XPath. You can replace class name.
    //table[@class='infobox']
    SSIS HTML Table Source - XPath
  3. Click On Preview button to see Data Preview.
  4. That's all, now just click on OK button to save HTML Table Source Setting UI.

How to Extract Links / Images from Table cell for each row and column.

  1. Here, you can extract links and images too, just go to Extract Link and Images Tab for configure it.
  2. Check on Output links and images for each column.
    SSIS HTML Table Source - Extract Links/Images
  3. Click On Preview button to see Data Preview.
  4. You can select column you want to Extract Data and you can rename even change DataType it.
    SSIS HTML Table Source - Select Columns
  5. That's all, now just click on OK button to save HTML Table Source Setting UI.
  6. Now, Just Drag and Drop Our Free ZS Trash Destination from SSIS Toolbox.
    SSIS Trash Destination - Drag and Drop
  7. Now single click on the HTML Table Source, once you see blue arrow from source ... connect it to Trash Destination.
  8. Lets Double click on ZS Trash Destination to Configure it.
    SSIS Trash Destination - Configure
  9. That’s it we are ready to load HTml Table data.
    SSIS HTML Table Source - Execute

Properties

Property Name Description
LoggingMode LoggingMode determines how much information is logged during Package Execution. Set Logging mode to Debugging for maximum log.

Available Options (Use numeric value listed in bracket if you have to define expression on this property (for dynamic behavior).

Option Description
Normal [0] Normal
Medium [1] Medium
Detailed [2] Detailed
Debugging [3] Debugging
PrefixTimestamp When you enable this property it will prefix timestamp before Log messages.
AccessMode Specifies where url(s)/path(s) stored. URLs or Paths can be provided as direct input, from ssis variable or file

Available Options (Use numeric value listed in bracket if you have to define expression on this property (for dynamic behavior).

Option Description
Direct [0] Direct
Variable [1] Variable
Connection [2] Connection
DirectValue URL(s) or HTML File Path(s) from where you want to extract your HTML Tables. For multiple URLs or Paths enter them on new line. Only applicable when AccessMode is DirectValue.
VariableName SSIS variable name which holds URL(s) or html file paths(s) from where you want to extract data. Only applicable when AccessMode is Variable.
TableExtractMethod Table extract method

Available Options (Use numeric value listed in bracket if you have to define expression on this property (for dynamic behavior).

Option Description
ByNumber [0] ByNumber
ByCss [1] ByCss
ByXPath [2] ByXPath
MaxGuessRows How many rows to scan in order to determine datatype for each column
TableNumber Table Number inside web page which you want to extract. Number starts with 1
TableXPath XPath expression to detect table you want to parse (e.g. //table[@class='mydata'])
TableClass CSS Class name used in table. This is helpful if you don't know exact number of table in HTML code but you know the CSS class name used for table.
HasHeaderRow True is table has header row else set to false. If header row is found then column name is extracted from header row.
HeaderRowNumber Header row number. Row number starting from 1
SkipRowsTop Total rows to skip from top of the table before reading data. By default data starts after header row (if specified HasHeaderRow).
SkipRowsBottom Total rows to skip from bottom of the table.
OutputLinks Set this option to true if you want to extract hyper links from each cell. When this option is checked you will see new column with _Links suffix. If you have multiple links in a single table cell then links are separated by vertical bar. To extract only one link per column set MaxLinksPerColumn = 1
OutputImages Set this option to true if you want to extract images from each cell. When this option is checked you will see new column with _Images suffix. If you have multiple images in a single table cell then images are separated by vertical bar. To extract only one image per column set MaxImagesPerColumn = 1
MaxLinksPerColumn Maximum number of hyper links you want to extract for each cell. This option is ignored if OutputLinks is set to false. Set 0 to extract all links from cell.
MaxImagesPerColumn Maximum number of images you want to extract for each cell. This option is ignored if OutputImages is set to false. Set 0 to extract all images from cell.
TrimWhiteSpace Trim whitespaces from front and end of images you want to extract for each cell. This option is ignored if OutputImages is set to false. Set 0 to extract all images from cell.
MaxRows Specifies maximum number of data rows to output (Similar as TOP N in SQL query)
EnableGroupDetect Specifies whether you want to enable group detection. Sometime you have table with Grouping rows (using colspan=N) in that case enable this option so you can get output in __groupName column.
TreatInputAsHtmlString Treat input as raw HTML string rather than URL.
CharacterSet Character set for text (e.g. windows-1250 )

Setting UI

SSIS HTML Table Source - Setting UI
SSIS HTML Table Source - Setting UI
SSIS HTML Table Source - Setting UI
SSIS HTML Table Source - Setting UI

Articles / Tutorials

Click here to see all articles for [SSIS HTML Table Source] category
Read HTML Table in SSIS – Extract / Download Links / Images

Read HTML Table in SSIS – Extract / Download Links / Images

Introduction In this post you will learn how to extract data from web pages using SSIS. In other words, read HTML Table in SSIS, then Loop through extracted links and finally download files. To achieve this scenario we will use SSIS HTML Source and REST API Task.     Step-By-Step To achieve desired extraction from web pages […]


How to read and write data to HTML in SSIS

How to read and write data to HTML in SSIS

Introduction to read and write data to HTML in SSIS In this article, we will show how to send values from an SSIS Variable to an HTML file. We will use the SSIS Advanced File System Task to store the list of system files of a folder into a variable and then we will use […]



Copyrights reserved. ZappySys LLC.