YouTube Connector - Metadata in ODBC Driver / Performance Options
Introduction
In this post we will learn how to fix some metadata and performance related issues in ODBC PowerPack drivers using the Caching, Metadata and Streaming Mode features. By default, ZappySys API drivers issue at least two API requests (the first one to obtain metadata, and the second, third… to fetch data pages). Most of the time this is OK and you won’t even notice it unless you use a tool like Fiddler to see how many requests the driver sends. Sometimes, however, it’s necessary to avoid the extra request for metadata (for example, when you are doing a POST to create a new record, or when the API has strict throttling). In this post we will learn various techniques to avoid extra POST requests and to speed up queries by reading from cache if your data doesn’t change often.
How to Speed Up Performance
ZappySys drivers may provide the following features (some options may be available only for API drivers). These features can be used to speed up query performance and solve some metadata issues.
- Data Caching Option
- Pre-generated Metadata (META option in WITH clause of SQL Query)
- Streaming Mode for large XML / JSON files
Data Caching Options in ODBC Drivers
ZappySys drivers come with a very useful Data Caching feature, which can speed up performance in many cases.
If your data doesn’t change often and you need to issue the same query multiple times, then enabling data caching may speed up data retrieval significantly. By default, the ZappySys driver enables caching for metadata only (60 seconds expiration). So the metadata for each query issued by the ZappySys driver is cached for 60 seconds (see the screenshot below).
Here is how you can enable caching options.
The new version of ODBC PowerPack supports caching options in the WITH clause (see below). You can enable a per-query cache by supplying a file name.
SELECT * FROM $
WITH
( SRC='https://myhost.com/some-api'
,CachingMode='All' --cache metadata and data rows both
,CacheStorage='File' --or Memory
,CacheFileLocation='c:\temp\myquery.cache'
,CacheEntryTtl=300 --cache for 300 seconds
)
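To make the effect of an expiration setting such as CacheEntryTtl more concrete, here is a minimal Python sketch of a time-based cache. It is not the driver's implementation; the class name TtlCache, the fake_api function and the injected clock are all hypothetical and exist only to show that repeated identical queries are served from cache until the entry expires.

```python
import time


class TtlCache:
    """Tiny time-to-live cache (illustration only, not the driver's code)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the demo can fake time
        self._entries = {}          # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # still fresh: no call to the source
        value = fetch()             # missing or expired: call the source
        self._entries[key] = (now + self.ttl, value)
        return value


calls = []


def fake_api():
    """Hypothetical data source that records how often it is called."""
    calls.append(1)
    return [{"id": 1}]


fake_now = [0.0]
cache = TtlCache(ttl_seconds=300, clock=lambda: fake_now[0])

cache.get_or_fetch("my query", fake_api)   # t=0:   miss, calls the source
fake_now[0] = 299
cache.get_or_fetch("my query", fake_api)   # t=299: served from cache
fake_now[0] = 301
cache.get_or_fetch("my query", fake_api)   # t=301: expired, calls again

print(len(calls))  # prints 2
```

Three identical queries result in only two calls to the source, because the second one falls inside the 300 second window.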
Handling POST requests to create / update records
As we mentioned earlier, in some cases you might be calling POST requests to create new records. In such a case the API request must be sent exactly once. By default the driver sends a first request to get metadata and then sends a second request to get data, using the same parameters used for the metadata request. This is usually fine if we are reading data and not creating a new row on the server (e.g. creating a new Customer row). If you have a case where you must call the API request precisely once, then you have to use the META clause in the WITH query to avoid the metadata request by supplying static metadata from a file or from DSN storage. We discussed this use case here.
See the next few sections for how to use the META option in your SQL query.
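The double-request problem can be modeled with a small Python simulation. This is a simplified picture of the behavior described above, not the driver's code: FakeCustomerApi and run_query are hypothetical names, and the "metadata probe" is reduced to sending the same POST once more just to learn the column names.

```python
class FakeCustomerApi:
    """Hypothetical API where every POST creates a customer row."""

    def __init__(self):
        self.customers = []

    def post(self, payload):
        self.customers.append(payload)
        return {"id": len(self.customers), **payload}


def run_query(api, payload, static_meta=None):
    """Return one result row, probing for metadata only if none is supplied."""
    if static_meta is None:
        # No static metadata: send the request once just to discover columns.
        probe = api.post(payload)
        columns = list(probe.keys())
    else:
        # Static metadata supplied: column names are known up front.
        columns = [col["Name"] for col in static_meta]
    row = api.post(payload)  # the request that actually returns the data
    return [{name: row.get(name) for name in columns}]


api_without_meta = FakeCustomerApi()
run_query(api_without_meta, {"name": "Alice"})
print(len(api_without_meta.customers))  # prints 2 (duplicate customer created)

api_with_meta = FakeCustomerApi()
meta = [{"Name": "id", "Type": "Int64"}, {"Name": "name", "Type": "String"}]
run_query(api_with_meta, {"name": "Alice"}, static_meta=meta)
print(len(api_with_meta.customers))  # prints 1
```

With static metadata the simulated server receives exactly one POST, which is the outcome the META option is meant to achieve.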
Metadata Options in SQL Query
Now let’s talk about metadata handling. Most ETL / reporting tools need to know the column type, size and precision before getting actual data from the driver. If you are dealing with JSON, XML or CSV format, you may realize that there is no metadata stored in the file itself to describe columns and data types.
However, metadata must be sent to most reporting / ETL tools when they use an ODBC driver. The ZappySys driver does an intelligent scan of your local file or API response to guess the datatype of each column. In most cases the driver guesses accurately, but sometimes it’s necessary to adjust the metadata (especially column length) to avoid truncation related errors from your ETL / reporting tool.
The issue with this automatic metadata scan is that it can be expensive (slow performance) or inaccurate (e.g. an invalid datatype for some columns).
Let’s look at how to take complete control of your metadata so you can avoid metadata related errors and speed up query performance.
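To see why a scan-based guess can go wrong, consider this small Python sketch. It is not the algorithm the driver uses; scan_metadata, guess_type and the sample rows are hypothetical. It only shows that a scan limited to the first few rows can pick a type or length that later rows violate.

```python
def guess_type(values):
    """Guess a column type from sample values (deliberately simplistic)."""
    present = [v for v in values if v is not None]
    if not present:
        return "String"
    if all(isinstance(v, bool) for v in present):
        return "Boolean"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "Int64"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "Double"
    return "String"


def scan_metadata(rows, sample_size):
    """Infer column metadata from the first `sample_size` rows only."""
    sample = rows[:sample_size]
    names = []
    for row in sample:
        for name in row:
            if name not in names:
                names.append(name)
    columns = []
    for name in names:
        values = [row.get(name) for row in sample]
        column = {"Name": name, "Type": guess_type(values)}
        if column["Type"] == "String":
            # Length is the longest value seen in the sample, nothing more.
            column["Length"] = max(
                (len(str(v)) for v in values if v is not None), default=0
            )
        columns.append(column)
    return columns


rows = [
    {"OrderID": 1, "ShipName": "Alfa", "Freight": 10},
    {"OrderID": 2, "ShipName": "Bravo", "Freight": 12},
    {"OrderID": 3, "ShipName": "A much longer ship name", "Freight": "12.50"},
]

partial = scan_metadata(rows, sample_size=2)  # misses the third row
full = scan_metadata(rows, sample_size=3)
print(partial)
print(full)
```

The two-row scan reports Freight as Int64 and ShipName with a length of 5, both of which the third row breaks. Scanning more rows fixes the guess but costs more time, which is the trade-off that user-supplied metadata avoids.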
Generate Metadata Manually
Let’s look at how to generate SQL query metadata using ODBC Driver UI.
- We are assuming you have downloaded and installed ODBC PowerPack
- Open ODBC DSN by typing “ODBC” in your start menu and select ODBC Data Sources 64 bit
- Now click Add and select “ZappySys JSON Driver” for the test
- On the UI enter a URL like below
https://services.odata.org/V3/Northwind/Northwind.svc/Invoices?$format=json
- Now you can go to the Preview tab, enter a query like below and click Preview Data
select * from value
- Once the query is executed you can click the Save Metadata button and select the Save to File option like below. You can also save to DSN internal storage by just giving a name. If you save to internal storage by name, then you can view it later under Advanced View on the Properties tab (Grid Mode) > Metadata Settings > User defined metadata.
The metadata file may look like below if you used the previous sample URL. You can edit this metadata as per your need.
/*
Available column types:
Default, String, Int64, Long, Int, Int32, Short, Byte,
Decimal, Double, Float, DateTime, Date, Boolean
*/
[
{
"Name": "p_odata_metadata",
"Type": "String",
"Length": 16777216
},
{
"Name": "p_odata_nextLink",
"Type": "String",
"Length": 16777216
},
{
"Name": "ShipName",
"Type": "String",
"Length": 16777216
},
{
"Name": "ShipPostalCode",
"Type": "String",
"Length": 16777216
},
{
"Name": "ShipCountry",
"Type": "String",
"Length": 16777216
},
{
"Name": "CustomerID",
"Type": "String",
"Length": 16777216
},
{
"Name": "CustomerName",
"Type": "String",
"Length": 16777216
},
{
"Name": "Address",
"Type": "String",
"Length": 16777216
},
{
"Name": "Salesperson",
"Type": "String",
"Length": 16777216
},
{
"Name": "OrderID",
"Type": "Int64",
"Length": 16777216
},
{
"Name": "OrderDate",
"Type": "DateTime",
"Length": 16777216
},
{
"Name": "RequiredDate",
"Type": "DateTime",
"Length": 16777216
},
{
"Name": "ShippedDate",
"Type": "DateTime",
"Length": 16777216
},
{
"Name": "ShipperName",
"Type": "String",
"Length": 16777216
},
{
"Name": "ProductID",
"Type": "Int64",
"Length": 16777216
},
{
"Name": "ProductName",
"Type": "String",
"Length": 16777216
},
{
"Name": "UnitPrice",
"Type": "String",
"Length": 16777216
},
{
"Name": "Quantity",
"Type": "Int64",
"Length": 16777216
},
{
"Name": "Discount",
"Type": "Double",
"Length": 16777216
},
{
"Name": "ExtendedPrice",
"Type": "String",
"Length": 16777216
},
{
"Name": "Freight",
"Type": "String",
"Length": 16777216
}
]
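If you prefer to adjust column lengths with a script rather than by hand, a sketch along these lines may help. It assumes the saved file has the shape shown above (an optional /* ... */ comment followed by a JSON array); load_metadata and set_lengths are hypothetical helper names, and the length of 100 is just an example value.

```python
import json
import re


def load_metadata(text):
    """Parse a metadata file, ignoring /* ... */ comments.

    Naive: a comment marker inside a string value would also be stripped.
    """
    without_comments = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    return json.loads(without_comments)


def set_lengths(columns, lengths):
    """Return a copy of `columns` with Length overridden for the named columns."""
    updated = []
    for column in columns:
        column = dict(column)  # do not mutate the caller's data
        if column["Name"] in lengths:
            column["Length"] = lengths[column["Name"]]
        updated.append(column)
    return updated


sample = """/*
Available column types:
Default, String, Int64, Long, Int, Int32, Short, Byte,
Decimal, Double, Float, DateTime, Date, Boolean
*/
[
  {"Name": "ShipName", "Type": "String", "Length": 16777216},
  {"Name": "OrderID", "Type": "Int64", "Length": 16777216}
]"""

columns = set_lengths(load_metadata(sample), {"ShipName": 100})
print(json.dumps(columns, indent=2))
```

Note that the printed output no longer contains the leading comment. To work on a real file, read its text with open(path, encoding="utf-8") and write the result back the same way.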
Compact Format Metadata
Version 1.4 introduced a new format for metadata. Here is an example. Each pair can be on its own line, or you can put all pairs on one line. Whitespace around any value / name is ignored. A string type without a length is assumed to be a 2000 characters long string.
Syntax: col_name1 : type_name[(length)] [; col_name2 : type_name[(length)] ] …. [; col_nameN : type_name[(length)] ]
col1: int32;
col2: string(10);
col3: boolean;
col4: datetime;
col5: int64;
col6: double;
Example usage in SQL
SELECT * FROM tbl WITH( META='col1: int32; col2: string(10); col3: boolean; col4: datetime; col5: int64;col6: double' )
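The syntax rules above can be checked with a short Python parser. This is only an illustration of the rules as stated in this post (pairs separated by semicolons, optional length in parentheses, surrounding whitespace ignored, string defaulting to 2000); parse_compact_meta is a hypothetical helper and not part of the driver.

```python
import re

# name : type [ (length) ], with optional whitespace around every part
PAIR = re.compile(r"^\s*([^:;]+?)\s*:\s*([A-Za-z0-9]+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_compact_meta(text):
    """Parse 'col: type(length); ...' into a list of column dictionaries."""
    columns = []
    for pair in text.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon or blank line
        match = PAIR.match(pair)
        if match is None:
            raise ValueError(f"Cannot parse column definition: {pair!r}")
        name, type_name, length = match.groups()
        column = {"Name": name, "Type": type_name}
        if length is not None:
            column["Length"] = int(length)
        elif type_name.lower() == "string":
            column["Length"] = 2000  # default stated in the text above
        columns.append(column)
    return columns


parsed = parse_compact_meta(
    "col1: int32; col2: string(10); col3: boolean; "
    "col4: datetime; col5: int64;col6: double"
)
for column in parsed:
    print(column)
```

Running it on the META value from the example query yields six columns, with col2 carrying a length of 10.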
Using Cached Metadata in SQL Query
Now it’s time to use the metadata and speed up our queries. There are 3 ways you can use metadata in a SQL query.
Using Metadata from File
To use metadata which is saved to a file (like in our previous screenshot), use a SQL query like the one below. The table name may be different in your case if you didn’t use the previous example URL. You can edit the metadata file as per your need in any text editor.
select * from value WITH( meta='c:\temp\meta.txt' )
Using Metadata from DSN Storage
To use metadata which is saved to DSN Storage, use a SQL query like the one below.
select * from value WITH( meta='My-Invoice-Meta')
Save Metadata to DSN Storage
We mentioned briefly how to save metadata to DSN Storage, but in case you missed it, see the screenshot below.
Edit Metadata saved to DSN Storage
Once you save metadata to DSN Storage, here is how you can view and edit it.
Using Metadata from Direct Setting (Embedded Metadata)
Sometimes it’s also convenient to embed metadata rather than relying on a file location or DSN metadata storage. Here is how to supply metadata using the embedded approach. Possible datatypes are String, Int64, Long, Int, Int32, Short, Byte, Decimal, Double, Float, DateTime, Date, Boolean.
select * from value WITH( meta='[
{
"Name": "p_odata_metadata",
"Type": "String",
"Length": 16777216
},
{
"Name": "p_odata_nextLink",
"Type": "String",
"Length": 16777216
},
{
"Name": "ShipName",
"Type": "String",
"Length": 16777216
},
...........
...........
...........
]' )
Reading Large Files (Streaming Mode for XML / JSON)
There will be a time when you need to read very large JSON / XML files from local disk or a URL. By default the ZappySys engine processes everything in memory, which may work fine up to a certain size, but if your file is larger than the memory the OS allows then you have to tweak some settings.
First let’s understand the problem. Create a new blank DSN, run the query below and watch your memory graph in Task Manager. You will see the RAM graph spike, and the query takes around 10-15 seconds to return 10 rows.
Slow Version (Fully load In memory then parse)
SELECT * FROM $
LIMIT 10
WITH(
Filter='$.LargeArray[*]'
,SRC='https://zappysys.com/downloads/files/test/large_file_100k_largearray_prop.json.gz'
--,SRC='c:\data\large_file.json.gz'
,IncludeParentColumns='True'
,FileCompressionType='GZip' --Zip or None (Zip format only available for Local files)
)
Now let’s modify the query a little bit. Add --FAST, turn off IncludeParentColumns and run the modified query below. You will notice it takes less than a second for the same result.
FAST Version (Streaming Mode – Parse as you go)
SELECT * FROM $
LIMIT 10
WITH(
Filter='$.LargeArray[*]--FAST' --//Adding --FAST option turn on STREAM mode (large files)
,SRC='https://zappysys.com/downloads/files/test/large_file_100k_largearray_prop.json.gz'
--,SRC='c:\data\large_file.json.gz'
,IncludeParentColumns='False' --//This Must be OFF for STREAM mode (read very large files)
,FileCompressionType='GZip' --Zip or None (Zip format only available for Local files)
)
Understanding Streaming Mode
Now let’s understand step by step what we did and why. By default, if you’re reading JSON / XML data, the entire document is loaded into memory for processing. This is fine for most cases, but some APIs return a very large document like below.
Sample JSON File
{
rows:[
{..},
{..},
....
.... 100000 more rows
....
{..}
]
}
To read from the above document without getting an OutOfMemory exception, change the following settings. For a similar problem in SSIS check this article.
- In the filter append --FAST (prefixed with two dashes)
- Uncheck the IncludeParentColumns option (this is needed for stream mode)
- Enable Performance Mode (not applicable for JSON Driver)
- Write your query, execute it and see how long it takes (table name must be $ in the FROM clause, filter must have the --FAST suffix, and parent columns must be excluded as below)
SQL Query for reading Large JSON File (Streaming Mode)
Here is a sample query which enables reading a very large JSON file in Stream Mode using the ZappySys JSON Driver.
Notice three settings.
Table name must be $ in the FROM clause, Filter must have the --FAST suffix, and parent columns must be excluded (IncludeParentColumns=false) as below.
SELECT * FROM $
--LIMIT 10
WITH(
Filter='$.LargeArray[*]--FAST' --//Adding --FAST option turn on STREAM mode (large files)
,SRC='https://zappysys.com/downloads/files/test/large_file_100k_largearray_prop.json.gz'
--,SRC='c:\data\large_file.json.gz'
,IncludeParentColumns='False' --//This Must be OFF for STREAM mode (read very large files)
,FileCompressionType='GZip' --Zip or None (Zip format only available for Local files)
)
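The "parse as you go" idea behind streaming mode can be illustrated outside the driver with a few lines of Python. The sketch below is not how the driver is implemented; stream_array_items is a hypothetical function that uses only the standard library and assumes the array elements are JSON objects. It decodes one element at a time and reads more input only when the buffer runs out, so asking for the first few rows does not require loading the whole document.

```python
import io
import json
from itertools import islice


def stream_array_items(stream, key, chunk_size=8192):
    """Yield the objects of the array stored under `key`, reading in chunks.

    Assumes array elements are JSON objects and that the first occurrence of
    '"key"' in the text is the property name (a naive search, sketch only).
    """
    decoder = json.JSONDecoder()
    marker = f'"{key}"'
    buf = ""
    # Phase 1: read until the opening bracket of the target array is buffered.
    while True:
        idx = buf.find(marker)
        bracket = buf.find("[", idx + len(marker)) if idx != -1 else -1
        if bracket != -1:
            buf = buf[bracket + 1:]
            break
        chunk = stream.read(chunk_size)
        if not chunk:
            return  # key never found
        buf += chunk
    # Phase 2: decode one element at a time, topping up the buffer when needed.
    while True:
        buf = buf.lstrip(" \t\r\n,")
        if buf.startswith("]"):
            return  # end of the array
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            chunk = stream.read(chunk_size)
            if not chunk:
                return  # truncated document
            buf += chunk
            continue
        yield item
        buf = buf[end:]


document = json.dumps(
    {"LargeArray": [{"RecID": i, "CustomerID": f"C{i}"} for i in range(1000)]}
)
source = io.StringIO(document)
first_three = list(islice(stream_array_items(source, "LargeArray", chunk_size=64), 3))
print(first_three)
print(source.tell(), "of", len(document), "characters read")
```

Only a small fraction of the document is read before the first three rows are available. For a local .json.gz file, a text stream from gzip.open(path, "rt", encoding="utf-8") could be passed in the same way.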
SQL Query for reading Large XML File (Streaming Mode)
Here is a sample query which enables reading a very large XML file in Stream Mode using the ZappySys XML Driver.
Notice one extra option, EnablePerformanceMode = True, for large XML file processing, and the following three changes.
Table name must be $ in the FROM clause, Filter must have the --FAST suffix, and parent columns must be excluded (IncludeParentColumns=false) as below.
SELECT * FROM $
--LIMIT 10
WITH(
Filter='$.doc.Customer[*]--FAST' --//Adding --FAST option turn on STREAM mode (large files)
,SRC='https://zappysys.com/downloads/files/customer_10k.xml'
--,SRC='c:\data\customer_10k.xml'
,IncludeParentColumns='False' --//This Must be OFF for STREAM mode (read very large files)
,FileCompressionType='None' --GZip, Zip or None (Zip format only available for Local files)
,EnablePerformanceMode='True' --try to disable this option for simple files
)
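For comparison, the same streaming idea for XML can be sketched with Python's standard xml.etree.ElementTree.iterparse, which emits elements as they are completed instead of building the whole tree first. This is an analogy, not the driver's internals; stream_customers is a hypothetical function and the child element names in the sample data are made up.

```python
import io
import xml.etree.ElementTree as ET
from itertools import islice


def stream_customers(source, tag="Customer"):
    """Yield one dict per <Customer> element without loading the full tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # drop the children we no longer need


# Sample document with hypothetical child elements.
xml_bytes = (
    b"<doc>"
    + b"".join(
        f"<Customer><CustomerID>{i}</CustomerID><Name>Name {i}</Name></Customer>".encode()
        for i in range(1000)
    )
    + b"</doc>"
)

first_two = list(islice(stream_customers(io.BytesIO(xml_bytes)), 2))
print(first_two)
```

The call to elem.clear() releases each record's children after it has been yielded, which keeps memory roughly flat while iterating. The emptied elements still hang off the root, so a very long-running reader may also want to detach them.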
SQL Query for reading large files with parent columns or 2 levels deep
So far we saw a one level deep array with Streaming mode. Now assume a scenario where you have a very large XML or JSON file which requires a filter two or more levels deep (e.g. $.Customers[*].Orders[*] or $.Customers[*].Orders[*].Items[*]), and you also need parent columns (e.g. IncludeParentColumns=True).
If you followed the previous section, we mentioned that for Streaming mode you must set IncludeParentColumns=False. So what do you do in that case?
Well, you can use a JOIN query as below to support that scenario. Notice how we extract Branches for each record and pass it to the child query. Notice that rather than SRC we are using DATA in the child query.
SELECT a.RecID,a.CustomerID, b.* FROM $
LIMIT 10
WITH(
Filter='$.LargeArray[*]--FAST' --//Adding --FAST option turn on STREAM mode (large files)
,SRC='https://zappysys.com/downloads/files/test/large_file_100k_largearray_prop.json.gz'
--,SRC='c:\data\large_file.json.gz'
,IncludeParentColumns='False' --//This Must be OFF for STREAM mode (read very large files)
,FileCompressionType='GZip' --Zip or None (Zip format only available for Local files)
,Alias='a'
,JOIN1_Data='[$a.Branches$]'
,JOIN1_Alias='b'
,JOIN1_Filter=''
)
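Conceptually, the join above streams parent records one at a time and expands each record's child array into rows that carry the selected parent columns. The Python sketch below models that idea only. join_children is a hypothetical function, the City field inside Branches is invented for the example, and skipping parents that have no children is an assumption of this sketch rather than documented driver behavior.

```python
def join_children(parent_rows, child_key, parent_columns):
    """Yield one row per child item, prefixed with the chosen parent columns."""
    for parent in parent_rows:
        for child in parent.get(child_key) or []:
            row = {name: parent.get(name) for name in parent_columns}
            row.update(child)  # on a name clash the child value wins here
            yield row


# Hypothetical parent records; in practice these would arrive from a stream.
parents = [
    {"RecID": 1, "CustomerID": "C1", "Branches": [{"City": "Atlanta"}, {"City": "Boston"}]},
    {"RecID": 2, "CustomerID": "C2", "Branches": []},
    {"RecID": 3, "CustomerID": "C3", "Branches": [{"City": "Chicago"}]},
]

joined = list(join_children(iter(parents), "Branches", ["RecID", "CustomerID"]))
for row in joined:
    print(row)
```

Because parent_rows can be any iterator, only one parent record and its children need to be in memory at a time, which is what makes this pattern compatible with streaming.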