How to read / write data in Google BigQuery using SSIS

Introduction

Google BigQuery is a fully managed Big Data platform that lets you run queries against large-scale data. In this article you will learn how to integrate Google BigQuery data into Microsoft SQL Server using SSIS. We will leverage the highly flexible JSON-based REST API Connector and OAuth Connection to import / export data from the Google BigQuery API in just a few clicks.

If you are looking for a similar product inside the Amazon AWS Cloud, check out our article about Amazon Athena.

Prerequisites

Before we get into the step-by-step section on extracting and loading data from BigQuery to SQL Server, let’s make sure you meet the following requirements:

  1. SSIS designer installed. It is sometimes referred to as BIDS or SSDT (download it from the Microsoft site).
  2. Basic knowledge of SSIS package development using Microsoft SQL Server Integration Services.
  3. ZappySys SSIS PowerPack installed. Click on the link to download a FREE trial.
  4. You have basic familiarity with REST API concepts and the Google BigQuery API. This post uses a free Public Dataset to query a BigQuery table, so no billing is required to get your hands dirty 🙂 However, if you would like to query your own dataset, make sure you have at least one BigQuery dataset and one table with some sample data created (this part does require billing enabled). Read the Google Quickstart article for more information on how to create a new BigQuery dataset and table.

Step-By-Step: Google BigQuery data extract using SSIS

Basically, you can query Google BigQuery data in two ways:

  • Method-1: Query data using the jobs/query method of the BigQuery API. Use it if you expect to get a result in a fairly short amount of time. This API method generates a temp table which gets deleted after 24 hours; you can read the data within that time frame using the newly created JobId reference.
    NOTE: If your records sum up to more than 10MB of data or ~10,000 rows (assuming 1 row uses 1KB of data), you need to proceed with the two-step process explained below.
  • Method-2: Export the SQL query result using the jobs/insert method of the BigQuery API. Use this method if you expect a query to take a long time to finish.

Now let’s see how to query Google BigQuery data using SSIS 🙂

Create Google API Project and Register OAuth App

To use any Google API, you first have to finish two tasks:

  1. Create a Google API Project and obtain the ProjectId (see the next section).
  2. Register your own Google OAuth App and obtain the ClientId and Client Secret.

Check the step-by-step instructions on how to create a Google API Project and create an OAuth App for Google. While following the instructions, make sure you enable the BigQuery API (screenshots may differ).

NOTE: For the BigQuery API you must use the Custom App option on the OAuth connection. Google used to allow the Default App, but recent policy changes require BigQuery to use a Custom App with your own ClientId and Client Secret (some old screenshots may still show the Default App for the time being, so please adjust your settings to use the Custom App).

Get Google API Project ID

Once you have a Google API Project created you can grab its Project ID (see screenshot below). You will need the Project ID in the next step when building the API URL.

How to find Google API Project ID

Configure OAuth Connection for Google BigQuery API

In order to call most Google APIs you will need an OAuth App. As noted above, BigQuery requires your own Custom OAuth App, so refer to the how to register Google OAuth Application (Get ClientID and ClientSecret) article to obtain a ClientId and Client Secret before continuing:

  1. Right-click inside the Connection Managers area and click “New Connection…”
  2. From the connection type list select the “ZS-OAUTH” connection type.
    Create new SSIS OAuth API Connection Manager

  3. For OAuth Provider select “Google”.
  4. For OAuth App type select the “Use Custom OAuth App” option and enter the ClientId and Client Secret of your registered app (older screenshots may show “Use Default OAuth App”, but as noted earlier BigQuery now requires a Custom App).
  5. For Scopes enter or select the following URLs (a URL is just a permission name). Each URL must be on a separate line. For demo purposes we use only the first 3 scopes, but we included a few more in case you would like to test the API in depth with a different permission set (see the scope list after these steps):
  6. Click Generate Token. Log in with the correct account if needed, and you will then be prompted to click “Approve” for the OAuth App.
  7. Once you click OK on the login screen you will see new tokens populated on the connection screen.
  8. Click Test to make sure the connection / tokens are valid and then click OK to save the connection.
    Create SSIS OAuth API Connection for Google BigQuery API
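For reference, a typical scope list for this scenario looks like the following (these are standard Google OAuth scope URLs; the exact set in the original screenshots may differ slightly, so treat this as a reasonable default – the first three are enough for the read demo):

    https://www.googleapis.com/auth/bigquery
    https://www.googleapis.com/auth/bigquery.insertdata
    https://www.googleapis.com/auth/cloud-platform
    https://www.googleapis.com/auth/devstorage.read_write
    https://www.googleapis.com/auth/userinfo.email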

Send BigQuery SQL Request (wait until finished) and get JobId – (Method#1)

Once you have the SSIS OAuth connection created for the BigQuery API, it’s time to read data from BigQuery. Basically, there are two ways you can read BigQuery data: using the query method or the insert method. For demo purposes we will use the jobs/query method. If you want to fire complex queries which can run for many minutes, refer to the jobs/insert method instead.

If you are expecting more than 10MB of data (~10K records) then you have to split the data extract into two steps:

  1. Step-1 : Call the jobs/query method using the REST API Task and wait until the SQL is executed. Once it is finished you will get a JobId in the response. Save that JobId in a variable for later use. The job result is valid for 24 hours only.
  2. Step-2 : Read the job result using the JSON Source in a Data Flow Task.
NOTE : If you are expecting fewer than ~10K rows then you can skip this step and go straight to the next section – Read BigQuery SQL result (Method#1 or Method#2).

Now let’s see how to configure the REST API Task to submit the query and extract the JobId (which will be used in the next section).

  1. In the Control Flow designer drag and drop ZS REST API Task from the SSIS toolbox.
  2. Double click the task to configure it.
  3. Change Request URL Access Mode to [Url from connection].
  4. From the connection dropdown select OAuth connection created in the previous section.
  5. Enter the URL as shown in the example after this list (replace YOUR-PROJECT-ID with the valid Project ID obtained in the previous section).
  6. Select Method as POST.
  7. Enter the Body (Request Data) as shown in the example after this list (change the query if needed; we used a public dataset for the demo).
    To use a Standard SQL query instead of Legacy SQL, add the useLegacySql property and set it to false (also shown in the example).
    For all possible parameters refer to the jobs/query method documentation.
  8. Select Body Content Type as application/json.
  9. Click Test Request/Response. If things go well you will see JSON content. Copy the JobId from the response and save it for later use:
  10. Now go to Response Settings tab.
  11. Select ContentType=Json.
  12. Enter the expression shown after this list to extract the JobId from the response. You will need this Id to read the SQL query output (the JobId is valid for 24 hrs):
  13. Choose the save response content option to save it into a variable. Select the <New Variable…> option. When prompted, give your variable a name (e.g. vJobId) and for the value field paste the JobId copied in the step above:

    NOTE: This JobId expires after 24 hrs. While you are designing a package (and not running it) the design-time value is used for testing/previewing, so make sure your JobId is still valid. If needed, click Test Request/Response to grab a new JobId and update the variable with the new value so you can preview data without running the full package.
  14. Click OK to save the UI.
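For reference, here is a sketch of the jobs/query request configured above (the URL and body fields follow the public BigQuery API v2 documentation; the sample query against a public dataset is our assumption, so substitute your own SQL):

Request URL (step 5):

    https://www.googleapis.com/bigquery/v2/projects/YOUR-PROJECT-ID/queries

Body (step 7) – Legacy SQL:

    {
      "query": "SELECT title, id, language FROM [publicdata:samples.wikipedia] LIMIT 1000",
      "timeoutMs": 60000
    }

Body – Standard SQL (useLegacySql set to false):

    {
      "query": "SELECT title, id, language FROM `bigquery-public-data.samples.wikipedia` LIMIT 1000",
      "useLegacySql": false,
      "timeoutMs": 60000
    }

The response carries the job reference, so the expression to extract the JobId (step 12) would be:

    $.jobReference.jobId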

Read BigQuery SQL result (Method#1 or Method#2)

Now let’s look at how to read data from BigQuery using Method#1 or Method#2. As mentioned before, if you get more than 10MB of data (~10K records) then you must use Method#1.

In the next sections we will see how to configure the JSON Source for both methods. All steps are identical except for the URL / Body and Pagination. Method#2 does not need Pagination because fewer than ~10K records can be retrieved in a single response; Pagination is only needed when more than ~10K records are found.

Configure Google BigQuery Web Request (URL, Method, ContentType, Body etc.)

For Method#1
  1. In the Control Flow designer drag and drop Data Flow Task from SSIS toolbox.
  2. Double click Data Flow Task and drag and drop ZS JSON Source (For API/File) from SSIS toolbox.
  3. Double click JSON Source to edit.
  4. On the JSON Source UI configure as follows:
    1. Enter the URL as shown in the example after this list. Replace YOUR-API-PROJECT-ID with the API Project ID obtained in the previous section.
    2. Check Use Credentials option and select OAuth Connection Manager created in the previous section.
    3. For HTTP Request Method select GET.
Read data from Google BigQuery from Temp Job result
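A sketch of the Method#1 URL (it reads the temp job result via the jobs/getQueryResults endpoint; the {{User::vJobId}} placeholder syntax for the SSIS variable is our assumption based on typical ZappySys URL placeholders):

    https://www.googleapis.com/bigquery/v2/projects/YOUR-API-PROJECT-ID/queries/{{User::vJobId}}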

For Method#2

If you are using Method#2 (retrieving fewer than ~10K rows) then:

  1. In the Control Flow designer drag and drop Data Flow Task from SSIS toolbox.
  2. Double click Data Flow Task and drag and drop ZS JSON Source (For API/File) from SSIS toolbox.
  3. Double click JSON Source to edit.
  4. On the JSON Source UI configure as follows:
    1. Enter the URL as shown in the example after this list. Replace YOUR-API-PROJECT-ID with the API Project ID obtained in the previous section.
    2. Check Use Credentials option and select OAuth Connection Manager created in the previous section.
    3. For HTTP Request Method select POST.
    4. For Body (Request Data) enter a JSON string like the one shown in the example after this list.
      To use a Standard SQL query instead of Legacy SQL, add the useLegacySql property and set it to false (also shown in the example).
      For all possible parameters refer to the jobs/query method documentation.
    5. For Body Content Type select application/json option.
Configure SSIS JSON Source – Get data from Google BigQuery using SSIS
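A sketch of the Method#2 settings (a direct jobs/query call; the sample query is our assumption, so substitute your own SQL):

URL:

    https://www.googleapis.com/bigquery/v2/projects/YOUR-API-PROJECT-ID/queries

Body (Request Data) – Legacy SQL:

    { "query": "SELECT title, id, language FROM [publicdata:samples.wikipedia] LIMIT 1000" }

Body – Standard SQL (useLegacySql set to false):

    { "query": "SELECT title, id, language FROM `bigquery-public-data.samples.wikipedia` LIMIT 1000", "useLegacySql": false }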

Configure Filter (Method #1 and #2)

We need to transform the single JSON response into multiple rows, so we need to apply the correct filter.

  1. On JSON Source go to Filter Options tab.
  2. In the Filter field enter $.rows[*] or click [Select Filter] button to browse hierarchy you want to extract.
Select Filter for JSON Response

Configure BigQuery API Pagination Settings (suitable only for Method #1)

Most modern APIs implement some sort of pagination technique so you get part of the data with each request rather than everything in one go; if you want more, you keep paginating through pages until the last page is reached. You can also read the Understanding REST API Pagination in SSIS article to learn more about pagination in SSIS.

The BigQuery API returns a pageToken attribute in the response JSON if more data is available for the requested query result. You can then pass pageToken in the next URL in this format: http://my-api-url/?pageToken=xxxxxxxxxx

Now let’s configure the JSON Source to automate this pagination for us. On the JSON Source go to the Pagination tab and enter the following two settings:

  1. Set Next Link as $.pageToken.
  2. Set Suffix for Next URL as &pageToken=<%nextlink%>.

See the screenshot below to get more clarity:

Configure BigQuery API Pagination on SSIS JSON Source

Configure Array Transformation

Now the last thing we have to configure is the special 2-dimensional JSON array format used by the BigQuery API (see the sample response after this list):

  1. On the JSON Source UI go to 2D Array Transformation tab.
  2. Enter the following settings:
    1. For Transformation Type select Transform complex 2-dimensional array.
    2. For Column Name filter enter $.schema.fields[*].name.
    3. For Row Values Filter enter $.f[*].v.
JSON Array Transformation Options
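For context, here is a trimmed sample of the 2-dimensional response BigQuery returns (values are illustrative). Column names live under schema.fields[*].name while each row is an array of f/v pairs, which is exactly what the two filters above pick apart:

    {
      "kind": "bigquery#getQueryResultsResponse",
      "schema": {
        "fields": [
          { "name": "title", "type": "STRING" },
          { "name": "id", "type": "INTEGER" }
        ]
      },
      "rows": [
        { "f": [ { "v": "Sample Title" }, { "v": "123" } ] },
        { "f": [ { "v": "Another Title" }, { "v": "456" } ] }
      ]
    }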

Configure Target – Load Google BigQuery into SQL Server

Now you can connect your JSON Source to any target, such as the ZS Trash Destination or a real database destination such as the OLEDB Destination.

If you wish to dump data from Google BigQuery to a SQL Server table then just perform the following steps:

  1. Drag and drop OLEDB Destination from SSIS Toolbox. Rename it to something like SQL Server Table.
  2. Double click on OLEDB Destination.
  3. Click New to create a new connection > Configure connection > Click OK.
  4. Click on New to create a new table > Rename default table name > Click OK.
  5. Click on Mappings tab and click OK to save UI with default mappings.
Configure SSIS OLEDB Destination – Google BigQuery to SQL Server Import

Execute Package – Loading BigQuery data into SQL Server

SSIS Package Execution – Loading Google BigQuery Data into SQL Server

Write data to Google BigQuery (Import JSON File or SQL data)

Now let’s look at how easy it is to import data into Google BigQuery using SSIS. We will use the same OAuth connection we created before. To learn more about inserting data into BigQuery, check the tabledata/insertAll method documentation.

If you want to perform a data insert operation then include the following scopes in your OAuth Connection Manager and regenerate the token:
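The following two standard scopes should cover the insert scenario (a reasonable default – add more if your scenario needs them):

    https://www.googleapis.com/auth/bigquery
    https://www.googleapis.com/auth/bigquery.insertdata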

For this demo, first create a test dataset and one table under it as shown below (billing must be enabled on your Google API Project). To do that, go to https://bigquery.cloud.google.com/welcome and configure them:

Create sample dataset and table for Google BigQuery Load
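For the rest of this demo we will assume a hypothetical dataset named TestDataset with a table named Customers defined roughly like this (the names and schema are our assumption, chosen to match the Customer Template source used later):

    CustomerId    INTEGER
    Name          STRING
    City          STRING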

And here is a REST API example of inserting data into a BigQuery table (the URL, headers and body below are a sketch based on the tabledata/insertAll documentation; replace the placeholders and sample rows with your own values):

Request URL:
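    POST https://www.googleapis.com/bigquery/v2/projects/MY_PROJECT_ID/datasets/MY_DATASET_ID/tables/MY_TABLE_ID/insertAll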

Request Headers:
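    Authorization: Bearer <your-access-token>
    Content-Type: application/json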

Request Body:
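    {
      "kind": "bigquery#tableDataInsertAllRequest",
      "rows": [
        { "json": { "CustomerId": 1, "Name": "Alice", "City": "New York" } },
        { "json": { "CustomerId": 2, "Name": "Bob", "City": "Chicago" } }
      ]
    }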

Configure SSIS JSON Generator – Generate JSON for BigQuery Table Insert

Now let’s see how to build an HTTP request with a JSON body and send it to BigQuery:

  1. Drag Data Flow Task and double click on it.
  2. Drag and configure your Source (for this demo we use Dummy Data Source with Customer Template).
  3. Drag and drop the ZS JSON Generator Transform to produce JSON from database or file records. If your source already emits valid JSON (e.g. a SQL query returning JSON) then you can skip this step. You can also read raw JSON from a very large file (new-line separated JSON) using the JSON Source with the Output as Raw Data option checked.
  4. Double click JSON Generator Transform to start configuring it.
  5. First, right-click on the Mappings node and select Add Unbound Nested Element:
    SSIS JSON Generator – Add Unbound Nested Element

  6. Enter the element alias as json and click OK.
  7. Right-click on the json element we just created and click Add Element(s). Select the columns you would like to insert into the target. Click OK to save:
    SSIS JSON Generator – Add data bound columns

    SSIS JSON Generator – Select multiple columns

  8. Now let’s preview our JSON (copy the preview JSON to try in the next step – the Web API Destination; a sample row is shown after this list):
    Preview JSON – Generate for BigQuery Table Insert

    NOTE: Table and column names are case-sensitive, so make sure your JSON attributes match them exactly.
  9. Click OK to save UI.
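For reference, a single row coming out of the JSON Generator would look roughly like this (column names follow the hypothetical Customers table above):

    { "json": { "CustomerId": 1, "Name": "Alice", "City": "New York" } }

The Body Header / Footer configured in the next section wrap many of these rows into one valid insertAll request.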

Configure SSIS Web API destination – Insert data into BigQuery Table

Once you have the input JSON prepared, let’s configure the destination for BigQuery.

  1. Drag and drop ZS Web API Destination.
  2. Connect your JSON Generator Transform to Web API destination.
  3. Configure general properties:
    SSIS Web API Destination – Configure for BigQuery Data load

  4. Make sure to enter the URL in the format shown in the example after this list.
    Make sure to replace the 3 parts in that URL (MY_PROJECT_ID, MY_DATASET_ID, MY_TABLE_ID) with actual values from your Google Project and BigQuery dataset/table configuration.
  5. Now go to the Batch Settings (For Body) tab, enable the batch option, and enter the settings below:
    SSIS Web API Destination – Configure Batch Request settings for Google BigQuery Data Load

  6. Enter the Body Header (see the example after this list).
  7. Enter the Body Footer (see the example after this list).
  8. Enter the Row Separator as a comma ( , ).
  9. Adjust the batch size to an appropriate value. Try values between 500 and 1000 to get optimal performance; you will have to test a few settings to find the right one for your hardware and load size.
    NOTE: We ran a small test on a machine with a 4-core CPU, 16GB RAM and Windows 7 installed. It took 37 seconds to load 100,000 records with Batch Size=500; with Batch Size=1000 it took 24 seconds to load the same number of records.
  10. If you want to test the insert before running the full package, go back to the first tab and edit the Body (use the sample JSON generated by the JSON Generator Transform). Click Test Request/Response and confirm Success as below. You can then go back to your BigQuery portal and check that one row was inserted by the test click. If everything looks good, run the full package to insert all records.
    Test Google BigQuery Table Insert – SSIS Web API Destination UI

  11. Hit OK to save UI.
  12. Run the package and verify data in Google BigQuery Console:
    Loading data into Google BigQuery using SSIS
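Here is a sketch of the Web API Destination settings referenced in steps 4, 6, 7 and 8 above (derived from the tabledata/insertAll request format; replace the placeholders with your own values):

URL:

    https://www.googleapis.com/bigquery/v2/projects/MY_PROJECT_ID/datasets/MY_DATASET_ID/tables/MY_TABLE_ID/insertAll

Body Header:

    {"kind": "bigquery#tableDataInsertAllRequest","rows":[

Body Footer:

    ]}

Row Separator:

    ,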

Handling Errors in BigQuery Insert

There will be times when some records you insert do not go down well in Google BigQuery. In such cases you can read the output from the Web API Destination and parse it further with the JSON Parser Transform, checking for certain values in the output. You must place the JSON Parser after the Web API Destination (connect the Blue arrow from the Web API Destination – since this is a soft error it won’t be redirected down the Red arrow).

For example, here is a sample JSON POST body for testing which produces an error due to a bad column name. When a bad row is found in a batch, all records in that batch are rejected. Notice that the error returns the index of the record within the batch so you can identify which row went bad; it also returns the column name in the location attribute.

NOTE: The bad column name is in the 2nd record.

Test Body (Bad):
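Below is a sketch of such a bad body (column names follow our hypothetical Customers table; BadColumnName does not exist in the table, so the 2nd record fails):

    {
      "kind": "bigquery#tableDataInsertAllRequest",
      "rows": [
        { "json": { "CustomerId": 1, "Name": "Alice", "City": "New York" } },
        { "json": { "CustomerId": 2, "BadColumnName": "Bob", "City": "Chicago" } }
      ]
    }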


Response (For Bad Input):
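The response for such bad input looks something like this (structure per the insertAll documentation; exact messages may vary):

    {
      "kind": "bigquery#tableDataInsertAllResponse",
      "insertErrors": [
        {
          "index": 1,
          "errors": [
            { "reason": "invalid", "location": "BadColumnName", "message": "no such field." }
          ]
        }
      ]
    }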


To configure error detection perform the following steps:

  1. Drag and drop the ZS JSON Parser Transform after the Web API Destination.
  2. Click on the Web API Destination and connect the Blue arrow to the JSON Parser Transform.
  3. Configure the JSON Parser Transform as shown below.
  4. Connect the JSON Parser Transform to some destination to save the error information (e.g. a SQL table or Trash Destination).
Handling BigQuery Insert Errors in SSIS
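TIP: To turn each failed record into a row, a JSON Parser filter along the lines of $.insertErrors[*] should work (an assumption based on the insertAll response format shown earlier). You can then map index, plus reason / location / message from the nested errors array, to output columns.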

Debugging Web API Requests

If you need to debug the actual requests made to the Google server, you can use a tool like Fiddler. It’s a very handy tool for troubleshooting JSON format issues; it allows you to see exactly how a request is made to the server.

Using Fiddler to debug Google BigQuery API requests in SSIS

Common Errors

Truncation related error

The most common error you may face when you run an SSIS package is a truncation error. At design time only 300 rows are scanned from the source (a file or a REST API call response) to detect datatypes, but at runtime you will likely retrieve far more records, so it is possible that you will get longer strings than initially expected. For detailed instructions on how to fix common metadata-related errors, read the article "How to handle SSIS errors (truncation, metadata issues)".

Authentication related error

Another frequent error is an authentication error, which happens when you deploy/copy a package to another machine and run it there. Check the paragraph below to see why this happens and how to solve the problem.

Deployment to Production

In an SSIS package, sensitive data such as tokens and passwords are by default encrypted by SSIS using the Windows account with which you created the package. SSIS will therefore fail to decrypt tokens/passwords when you run the package from another machine using a different Windows account. To circumvent this, when you create an SSIS package which uses authentication components (e.g. an OAuth Connection Manager or an HTTP Connection Manager with credentials), consider using parameters/variables to pass tokens/passwords. This way you won’t face authentication-related errors when the package is deployed to a production server.

Check our article on how to configure packages with sensitive data on your production or development server.

Download Sample Package

Click here to download the SSIS sample package for SSIS 2012 or higher.

Conclusion. What’s next?

In this article we learned how to load data from Google BigQuery into SQL Server using SSIS (a drag-and-drop approach, no coding). We used the SSIS JSON / REST API Connector to extract data from the Google BigQuery REST API using OAuth. The JSON Source Connector makes it super simple to parse complex/large JSON files or any Web API response into rows and columns so you can load the data into a database such as SQL Server. Download SSIS PowerPack to try many other automation scenarios that were not discussed in this article.

Keywords:

Google BigQuery Integration with SQL Server | How to extract data from google bigquery in SSIS? | How to read data from Google BigQuery API? | Loading BigQuery Data into SQL Server | BigQuery to SQL Server | SSIS Google Big Query Integration | SSIS Google BigQuery Import JSON File | SSIS Google BigQuery Export data

