Advanced topics

Metadata for ODBC and Data Gateway

Introduction

In this post we will learn how to fix some metadata / performance related issues in ODBC PowerPack Drivers using Caching / Metadata / Streaming Mode features. By default ZappySys API Drivers issues atleast two API request (First one to obtain Metadata and second, third… for Fetching Data Pages). Most of the times this should be OK and user wont even notice this unless you use tools like fiddler to see how many requests sent by driver. Sometimes its necessary to avoid extra requests to obtain metadata (For example you doing POST to create new record or API has strict Throttling). In this post we will learn various techniques how to avoid extra POST requests or how to speed up query by reading from Cache if your data doesn’t change often.

How to Speedup Performance

ZappySys Drivers may provide following features (Some options may be available only for API drivers). These features can be used to speed up query performance and solve some metadata issues.

Data Caching Option
Pre-generated Metadata (META option in WITH clause of SQL Query)
Streaming Mode for large XML / JSON files

Data Caching Options in ODBC Drivers

ZappySys drivers come with very useful Data Caching feature. This can be very useful feature to speedup performance in many cases.

If your data doesn’t change often and you need to issue same query multiple times then enabling data caching may speedup data retrieval significantly. By default ZappySys driver enables Caching for just Metadata (60 Seconds Expiration). So metadata for each query issued by ZappySys Driver is cached for 60 seconds (See below screenshot).

Here is how you can enable caching options.

Enable Caching Options for Metadata / Data for ZappySys ODBC Drivers

New version of ODBC PowerPack now supports Caching Options in WITH clause (see below). Per query cache by supplying file name.

SELECT * FROM $
WITH
(  SRC='https://myhost.com/some-api'
,CachingMode='All'  --cache metadata and data rows both
,CacheStorage='File' --or Memory
,CacheFileLocation='c:\temp\myquery.cache'
,CacheEntryTtl=300 --cache for 300 seconds
)

Handling POST requests to create / update records

As we mention earlier in some cases you might be calling POST requests to Create new records. In such case API request must be sent exactly once. By default Driver sends first request to Get metadata and then sends second request to get data using same parameters used for metadata request. This is usually fine if we reading data and not creating new row on server… (e.g. create new Customer Row). If you have case where you must call API request precisely once then you have to use META clause in the WITH query to avoid Metadata request by supplying static metadata from File or Storage. We discussed this one usecase here.

See next 2-3 sections how to use META option in your SQL Query.

Metadata Options in SQL Query

Now let’s talk about Metadata handling. Most ETL / Reporting tool need to know column type, size and precision before getting actual data from driver. If you are dealing with JSON / XML or CSV format you may realize that there is no metadata stored in file itself to describe columns / data types.

However metadata must be sent to most Reporting / ETL tool when they use ODBC Driver. ZappySys driver does intelligent scan from your local file or API response to guess datatypes for each column. In most cases driver does accurate guess but sometimes it’s necessary to adjust metadata (Specially Column Length) to avoid truncation related errors from your ETL /Reporting tool.

Issue with this automatic metadata scan is, it can be expensive (slow performance) or inaccurate (e.g. invalid datatype for some columns)

Let’s look at how to take complete control on your Metadata so you can avoid metadata related errors and speedup query performance.

Generate Metadata Manually

Let’s look at how to generate SQL query metadata using ODBC Driver UI.

We are assuming you have downloaded and installed ODBC PowerPack
Open ODBC DSN by typing “ODBC” in your start menu and select ODBC Data Sources 64 bit
Now create Add and select “ZappySys JSON Driver” for test

On the UI enter URL like below

https://services.odata.org/V3/Northwind/Northwind.svc/Invoices?$format=json

Now you can go to Preview tab and enter query like below and click Preview Data
```
select * from value
```
Once query is executed you can click Save Metadata button and select Save to File option like below. You can also Save to DSN internal storage by just giving name. If you save to internal storage by name then you can view it later under Advanced View on Properties tab (Grid Mode) > Metadata Settings > User defined metadata.

Create Metadata for ODBC Driver Query – ZappySys API Drivers

Metadata file may look like below if you used previous sample URL. You can edit this metadata as per your need.


    /*
    Available column types:
    Default, String, Int64, Long, Int, Int32, Short, Byte,
    Decimal, Double, Float, DateTime, Date, Boolean
    */
    [
    {
    "Name": "p_odata_metadata",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "p_odata_nextLink",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "ShipName",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "ShipPostalCode",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "ShipCountry",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "CustomerID",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "CustomerName",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "Address",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "Salesperson",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "OrderID",
    "Type": "Int64",
    "Length": 16777216
    },
    {
    "Name": "OrderDate",
    "Type": "DateTime",
    "Length": 16777216
    },
    {
    "Name": "RequiredDate",
    "Type": "DateTime",
    "Length": 16777216
    },
    {
    "Name": "ShippedDate",
    "Type": "DateTime",
    "Length": 16777216
    },
    {
    "Name": "ShipperName",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "ProductID",
    "Type": "Int64",
    "Length": 16777216
    },
    {
    "Name": "ProductName",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "UnitPrice",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "Quantity",
    "Type": "Int64",
    "Length": 16777216
    },
    {
    "Name": "Discount",
    "Type": "Double",
    "Length": 16777216
    },
    {
    "Name": "ExtendedPrice",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "Freight",
    "Type": "String",
    "Length": 16777216
    }
    ]

Compact Format Metadata

Version 1.4 introduced a new format for metadata. Here is an example. Each pair can be its own line or you can put all in one line. Whitespaces around any value / name is ignored. string type without length assume 2000 chars long string.

Syntax: col_name1 : type_name[(length)] [; col_name2 : type_name[(length)] ] …. [; col_nameN : type_name[(length)] ]

col1: int32;
col2: string(10);
col3: boolean;
col4: datetime;
col5: int64;
col6: double;

Example usage in SQL

SELECT * FROM tbl WITH( META='col1: int32; col2: string(10); col3: boolean; col4: datetime; col5: int64;col6: double' )

Using Cached Metadata in SQL Query

Now it’s time to use Metadata and speedup our queries. There are 3 ways you can use metadata in SQL query.

Using Metadata from File

To use metadata which is saved to a file (like our previous screenshot) use below SQL query for example. Table name may be different in your case if you didn’t use previous example URL. You can Edit Metadata file as per your need in any text editor.

select * from value WITH( meta='c:\temp\meta.txt' )

Using Metadata from DSN Storage

To use metadata which is saved to DSN Storage use below SQL query for example.

select * from value WITH( meta='My-Invoice-Meta')

Save Metadata to DSN Storage

We mentioned briefly how to save metadata to DSN Storage but in case you missed see below screenshot.

Save Metadata to DSN Storage

Edit Metadata saved to DSN Storage

Once you save Metadata to DSN Storage, here is how you can view and edit.

edit-view-odbc-metadata-stored-in-dsn-storage

View / Edit Metadata stored in DSN Storage

Using Metadata from Direct Setting (Embedded Metadata)

Sometimes its also convenient to embed metadata rather than relying on file location or DSN metadata storage. Here is how to supply metadata using embedded approach. Possible datatypes are String, Int64, Long, Int, Int32, Short, Byte, Decimal, Double, Float, DateTime, Date, Boolean.

select * from value WITH( meta='[
    {
    "Name": "p_odata_metadata",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "p_odata_nextLink",
    "Type": "String",
    "Length": 16777216
    },
    {
    "Name": "ShipName",
    "Type": "String",
    "Length": 16777216
    },
    ...........
    ...........
    ...........
    ]' )

Dynamically adding a column in a SQL query

If you need an additional field for reading in the EndPoint, which is not supported by Google Drive Connector, you can add a new column by mapping it to Google Drive API response field with META clause:

Original SQL query

Suppose you have this Hello-World Connector EndPoint get_friends which calls Sandbox World API and returns a list of friends with their IDs and names:

SELECT Id, Name FROM get_friends

Id	Name
1	Alice
2	Bob
3	Charlie

API response

But you know that the API response contains more information about each friend, e.g. their birthdate and the exact date and time when they were born, which is not returned by the EndPoint:

{  
        "totalFriends": 3,  
        "friends": [  
           {  
               "id": 1,
               "name": "Alice",  
               "birthdate": "2000-01-02",  
               "extras": {  
                   "bornAt": "2000-01-02T12:23:45.0001Z"
               }  
           },  
           {  
               "id": 2,
               "name": "Bob",  
               "birthdate": "2001-02-03",  
               "extras": {  
                   "bornAt": "2001-02-03T12:23:45.0001Z"
               }  
           },  
           {  
               "id": 3,
               "name": "Charlie",  
               "birthdate": "2002-03-04",  
               "extras": {  
                   "bornAt": "2002-03-04T12:23:45.0001Z"
               }  
           }  
        ]  
    }

Updated SQL query

You can add all the additional fields you need directly in a SQL query by using META clause and @OverrideMode:1 option within it:

SELECT Id, Name, [birthdate] as Birthdate, [extras.bornAt] as BornAt 
FROM get_friends
WITH (Meta='@OverrideMode:1;id:int;Name:string(200);[birthdate]:date;[extras.bornAt]:datetime')

@OverrideMode option is supported by ODBC PowerPack v2.0.1 or higher.

Id	Name	Birthdate	BornAt
1	Alice	2000-01-02	2000-01-02T12:23:45.0001Z
2	Bob	2001-02-03	2001-02-03T12:23:45.0001Z
3	Charlie	2002-03-04	2002-03-04T12:23:45.0001Z

For more advanced scenarios visit the Customize Google Drive Connector page.

Reading Large Files (Streaming Mode for XML / JSON)

There will be a time when you need to read very large JSON / XML files from local disk or URL. ZappySys engine by default process everything in memory, which may work fine upto certain size but if you have file size larger than OS allowed memory internal limit then you have to tweak some settings.

First lets understand the problem. Try to create new blank DSN and run below query and watch your Memory Graph in Task Manager. You will see RAM graph spikes… and query takes around 10-15 seconds to return 10 rows.

Slow Version (Fully load In memory then parse)

SELECT * FROM $
    LIMIT 10
    WITH(
    Filter='$.LargeArray[*]'
    ,SRC='https://zappysys.com/downloads/files/test/large_file_100k_largearray_prop.json.gz'
    --,SRC='c:\data\large_file.json.gz'
    ,IncludeParentColumns='True'
    ,FileCompressionType='GZip' --Zip or None (Zip format only available for Local files)
    )

Now let’s modify query little bit. Add –FAST, Turn off IncludeParentColumns and run again below modified query. You will notice it takes less than a second for same result.

FAST Version (Streaming Mode – Parse as you go)

SELECT * FROM $
    LIMIT 10
    WITH(
        Filter='$.LargeArray[*]--FAST' --//Adding --FAST option turn on STREAM mode (large files)
        ,SRC='https://zappysys.com/downloads/files/test/large_file_100k_largearray_prop.json.gz'
        --,SRC='c:\data\large_file.json.gz'
        ,IncludeParentColumns='False'  --//This Must be OFF for STREAM mode (read very large files)
        ,FileCompressionType='GZip' --Zip or None (Zip format only available for Local files)
    )

Understanding Streaming Mode

Now let’s understand step-by-step what we did and why we did. By default if you’re reading JSON / XML data, entire Document is loaded into Memory for processing. This is fine for most cases but some API returns very large Document like below.

Sample JSON File

{
    rows:[
            {..},
            {..},
            ....
            .... 100000 more rows
            ....
            {..}
            ]
    }

To read from above document without getting OutOfMemory exception change following settings. For similar problem in SSIS check this article.

In the filter append –FAST (prefix dash dash)
Uncheck IncludeParentColumn option (This is needed for stream mode)
Enable Performance Mode (not applicable for JSON Driver)
Write your query and execute see how long it takes ( Table name must be $ in FROM clause, Filter must have –FAST suffix, Parent Columns must be excluded as below)

configure-streaming-mode-performance-setting-xml-json-driver

Configure Settings to read very large XML /JSON Files

odbc-json-xml-driver-read-large-file-url-gzip-compressed

Reading Very Large JSON / XML Files – Streaming Option

SQL Query for reading Large JSON File (Streaming Mode)

Here is a sample query which enables very large JSON file reading using Stream Mode using ZappySys JSON Driver

Notice Three settings.

Table name must be $ in FROM clause, Filter must have –FAST suffix, Parent Columns must be excluded (IncludeParentColumns=false) as below.

SELECT * FROM $
    --LIMIT 10
    WITH(
        Filter='$.LargeArray[*]--FAST' --//Adding --FAST option turn on STREAM mode (large files)
        ,SRC='https://zappysys.com/downloads/files/test/large_file_100k_largearray_prop.json.gz'
        --,SRC='c:\data\large_file.json.gz'
        ,IncludeParentColumns='False'  --//This Must be OFF for STREAM mode (read very large files)
        ,FileCompressionType='GZip' --Zip or None (Zip format only available for Local files)
    )

SQL Query for reading Large XML File (Streaming Mode)

Here is a sample query which enables very large JSON file reading using Stream Mode using ZappySys XML Driver

Notice one extra option EnablePerformanceMode = True for Large XML File Processing and following three changes.

Table name must be $ in FROM clause, Filter must have –FAST suffix, Parent Columns must be excluded (IncludeParentColumns=false) as below.

SELECT * FROM $ 
--LIMIT 10
WITH(
     Filter='$.doc.Customer[*]--FAST' --//Adding --FAST option turn on STREAM mode (large files)
    ,SRC='https://zappysys.com/downloads/files/customer_10k.xml'
    --,SRC='c:\data\customer_10k.xml'
    ,IncludeParentColumns='False'  --//This Must be OFF for STREAM mode (read very large files)
    ,FileCompressionType='None' --GZip, Zip or None (Zip format only available for Local files)
    ,EnablePerformanceMode='True'  --try to disable this option for simple files
)

SQL Query for reading large files with parent columns or 2 levels deep

So far we saw one level deep array with Streaming mode. Now assume a scenario where you have a very large XML or JSON file which requires filter more than 2 level deep. (e.g. $.Customers[*].Orders[*] or $.Customers[*].Orders[*].Items[*] ) , and also you need parent columns (e.g. IncludeParentColumns=True).

If you followed previous section, we mentioned that for Streaming mode you must set IncludeParentColumns=False. So what do you do in that case?

Well, you can use JOIN Query as below to support that scenario. You may notice how we extracting Branches for each record and passing to child Query query. Notice that rather than SRC we are using DATA in child query.

SELECT a.RecID,a.CustomerID, b.* FROM $
    LIMIT 10
    WITH(
        Filter='$.LargeArray[*]--FAST' --//Adding --FAST option turn on STREAM mode (large files)
        ,SRC='https://zappysys.com/downloads/files/test/large_file_100k_largearray_prop.json.gz'
        --,SRC='c:\data\large_file.json.gz'
        ,IncludeParentColumns='False'  --//This Must be OFF for STREAM mode (read very large files)
        ,FileCompressionType='GZip' --Zip or None (Zip format only available for Local files)
        ,Alias='a'
        ,JOIN1_Data='[$a.Branches$]'
        ,JOIN1_Alias='b'
        ,JOIN1_Filter=''
    )