How to read large XML / JSON file in SSIS (3 Million Rows in 3 Mins)

Introduction

In this post we will learn how to use ZappySys SSIS XML Source or ZappySys SSIS JSON Source  to read large XML or JSON File (Process 3 Million rows in 3 minutes – 1.2 GB file).

If you use default settings to read data then it may result into OutOfMemory Exception so we will outline few techniques which will enable high performance Streaming Mode rather than In-memory load of entire file.

Prerequisites

Before we parse very large XML or JSON files using SSIS , you will need to make sure following prerequisites are met.

  1. SSIS designer installed. Sometimes it is referred as BIDS or SSDT (download it from Microsoft site).
  2. Basic knowledge of SSIS package development using Microsoft SQL Server Integration Services.
  3. You have at least two sample files… One file must be small dataset (less than 10 MB if possible). We will  use small dataset file during design mode to get metadata and see data preview on the Component UI.
  4. Second XML file is the big file with full dataset you like to parse at runtime.
  5. Make sure SSIS PowerPack is installed. Click here to download.

 

Step-By-Step : Reading large XML file (SSIS XML Source)

Now let’s look at how to read large XML file (e.g. 3 Million rows or more) using ZappySys XML Source in SSIS. Below steps are identical for  ZappySys JSON Source  too (Except Step#7).

  1. Open SSIS Designer and drag Data Flow from SSIS Toolbox
  2. Double click Data Flow Task to switch to Data flow designer
  3. Now drag ZS XML Source on the surface from SSIS Toolbox.
  4. Double click ZS XML Source and specify small dataset file  path you like to parse. (e.g.  c:\data\customer_small.xml )
  5. Click on Select Filter button to find Node which will be treated as Array. Once you close the Filter Browse Dialog. Append --FAST  Your Expression may look like below.
    $.Root.Row--FAST
  6. Now uncheck   Include Parent Columns option
  7. Check  Enable Performance Option  (For JSON Source Skip this step)
    NOTE: Try to enable Performance mode setting after you Select Filter (by using smaller dataset file). Once filter is set you can check enable performance mode and make sure following two settings are correctly set (Option#2 needs new version – SSIS v4.1+ / ODBC v1.4+).
    (1) On Array Handling Tab – Set Array node name (must be only one entry). For example your Filter is $.DATA.ORDER[*] then you can enter ORDER (2) On Advanced Filter Options Tab – Enter all unwanted Tag names you like to Skip. For example if you have other arrays like PRODUCT or CUSTOMER then enter PRODUCT,CUSTOMER

  8. Your setting may look like below

    Configure XML source or JSON Source for Very Large Data File (Streaming mode for High Performance)

    Configure XML source or JSON Source for Very Large Data File (Streaming mode for High Performance)

  9. Click Preview to verify data (Adjust Filter if needed to extract correct Hierarchy)
    Preview XML or JSON File data using SSIS XML Source or JSON Source

    Preview XML or JSON File data using SSIS XML Source or JSON Source

  10. Click on Columns Tab
  11. Change Scan Row count to 3000 or more and Click on Refresh Column .
  12. Select Guess 4x , Check Lock, Check Reset and Click OK like below. At runtime if you ever get error about Data Type Issue you can always adjust this later on too. Make sure Lock column is set to avoid setting reset for manually changed columns. For more information on metadata changes check this article
    SSIS Metadata Options - JSON / XML File Parsing

    SSIS Metadata Options – JSON / XML File Parsing

  13. Click OK to save settings.
  14. Now Right click on XML Component > Click Properties. Change DirectPath  property to original file path (large file) (e.g. c:\data\customers_large.xml  ). Save Package.
  15. Now you can run your SSIS Package from Designer or Command line. As you see in the below screenshot that there is virtually no memory pressure when you enable stream mode. Thanks to ZappySys unique XML / JSON Parsing Engine. When streaming mode is enabled file is not loaded into memory for parsing rather than that it only reads record by record to process very large JSON or XML file.In our below example we used Windows 7 Desktop, 16GB RAM, 4 Core i7 64 bit CPU. It took around 3 Minutes to Parse 3 Million Records (1.2 GB big XML file) . If you Parse JSON file then it can be even faster due to compact size.
    Reading Very Large XML or JSON File using SSIS (Stream Mode for High Performance)

    Reading Very Large XML or JSON File using SSIS (Stream Mode for High Performance)

Step-By-Step : Reading very large JSON file (SSIS JSON Source)

Reading very large JSON file using ZappySys JSON Source  has exact same steps described in above section except two changes. You have to use ZS JSON Source and skip Step#7 (Check Enable Performance Mode – This option is not available JSON Source).

Parsing very large XML File with Multiple Arrays

Now let’s discuss a scenario which can result in OutOfMemory Exception unless you tweak some extra options. Assume you have file structure like below.
NOTE: This will only work in version 4.1.0 or later (in SSIS PowerPack) or 1.4.0 or later for ODBC PowerPack

Notice that in above XML it has 3 different nodes (For Product, Customer and Order). If you try to extract Orders it might fail with OutOfMemory Exception because it needs to scan large XML before it can hit First Order node. To solve this issue you can adjust following 2 settings.

  1. On Array Handling Tab you have following
  2. On Advanced Filter Options tab enter Following two nodes which we dont want to extract.

Thats it. This will avoid excessive Memory pressure to find very first node before it can start Stream.

Parse XML Array - Performance Mode Setting

Parse XML Array – Performance Mode Setting

 

Parse Large XML - Exclude nodes by name

Parse Large XML – Exclude nodes by name

Conclusion

As you saw in this article that ZappySys SSIS PowerPack is designed to handle very large dataset in JSON or XML. We also support very large CSV and Excel files too which are not covered in this article. Download SSIS PowerPack to explore 70+ more components by yourself to make your ETL simple and Fast.

 

 

Posted in SSIS JSON Source (File/REST), SSIS XML Source (File / SOAP) and tagged , , .