Import / Export data from Amazon Athena using SSIS

Introduction

In our previous post we explored how to call the Amazon AWS API using SSIS. In this post we will learn how to import / export data from Amazon Athena using SSIS. Amazon Athena is a service very similar to Google BigQuery, which we have already documented. To read data from Amazon Athena we will use the ZappySys JSON / REST API Source, which offers strong JSON parsing capability.

Prerequisites

Before we start the demo of calling the Amazon AWS API, make sure the following prerequisites are met.

  1. SSIS designer installed. It is sometimes referred to as BIDS or SSDT (download it from the Microsoft site).
  2. Basic knowledge of SSIS package development using Microsoft SQL Server Integration Services.
  3. Access to valid AWS credentials (Access Key and Secret Key for your IAM user). Click here to learn more about IAM users and Access Key / Secret Key.
  4. Make sure SSIS PowerPack is installed. Click here to download.

Steps needed to read data from Amazon Athena

To read data from Amazon Athena you need at least 3 steps at a very high level. Later in this article we will look at each step in depth.

  1. Call the StartQueryExecution API to send the SQL query (it is a job-style API, so this call starts a job)
  2. Keep checking the job status by calling the GetQueryExecution API (in the response, check the State field for the status; we need to look for the word SUCCEEDED)
  3. Once the job is finished it produces a CSV file in the S3 bucket (the output location is supplied in Step 1)

Read Amazon Athena Query output in SSIS (S3 CSV files)

AWS Console – Athena Query Editor / Configure Output Settings

There are a few things you may have to set up before you can call the Athena API in SSIS:

  1. Make sure your IAM user has the correct permissions (we used existing policies for ease of use, but you can use a custom policy for a more precise permission set)
  2. Configure the default output location for your Athena queries

Configure IAM user permission to call Amazon Athena API

Set Athena Query Output Location

Testing Athena Query inside Amazon AWS Console (Query Editor)

Creating Connection to call Athena API in SSIS

Any API call made to the Amazon Athena service using ZappySys JSON / REST API Source or ZS REST API Task needs an OAuth connection.

For the REST API Task, configure it as below (more details about using the REST API Task are in the next section).

  1. Drag ZS REST API Task from the Control Flow Toolbox
    SSIS REST Api Task - Drag and Drop
  2. Double click to edit.
  3. Change Request URL Access Mode to [Url from Connection]
  4. Now select <<New OAuth Connection>> from the connection drop down
  5. On the OAuth connection select Amazon AWS API v4 as Provider
  6. Enter your Access Key and Secret Key
    Note: The Amazon AWS account must have access to Athena API calls and the S3 bucket.

Create SSIS Connection for Amazon AWS API Call (Athena REST API Example)
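
For context, the Amazon AWS API v4 provider corresponds to AWS Signature Version 4 request signing. The connection takes care of this for you, so nothing below is required in SSIS. The sketch only illustrates the publicly documented signing-key derivation step; the function names and the sample inputs are illustrative assumptions, not part of SSIS PowerPack.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """Return the raw HMAC-SHA256 digest of message under key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str,
                       service: str = "athena") -> bytes:
    """Derive a Signature Version 4 signing key.

    date_stamp is the request date as YYYYMMDD. The key is scoped to one
    date, one region and one service.
    """
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


# Hypothetical inputs, for illustration only (never hard-code real secrets).
signing_key = derive_signing_key("EXAMPLE-SECRET-KEY", "20240101", "us-east-1")
```

Because the key depends on the date, region and service, a request signed for one region or service is not valid for another, which is why the connection needs credentials with access to both Athena and S3.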

Creating Table in Amazon Athena using API call

For this demo we assume you have already created a sample table in Amazon Athena. If you wish to automate creating an Amazon Athena table using SSIS, you need to run the CREATE TABLE DDL command using ZS REST API Task.

  1. In the previous ZS REST API Task select OAuth connection (See previous section)
  2. Enter Request Url as below

  3. In the HTTP Headers enter below 2 headers
  4. Change Request method to POST
  5. In the Body enter below
    Note: Rather than hard-coding the DDL SQL statement, we can use a variable placeholder (e.g. "QueryString": "{{User::Query,FUN_JSONENCODE}}"). Use FUN_JSONENCODE to escape new lines as \r\n; otherwise the JSON becomes invalid.
  6. Now click Test Request / Response. This creates a new Athena table, which we will query in a later section.
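
The request URL, headers and body from the original screenshots are not reproduced in this text. As a rough sketch based on the public Athena API (which uses the AWS JSON 1.1 protocol), the request could be assembled as follows. The region, bucket names, sample DDL and the helper build_start_query_request are hypothetical; json.dumps plays the role that FUN_JSONENCODE plays in the placeholder, escaping line breaks so the body stays valid JSON.

```python
import json


def build_start_query_request(region, query, output_location,
                              client_request_token, database=None):
    """Return (url, headers, body) for an Athena StartQueryExecution call.

    This only builds the request; signing and sending are left to the
    HTTP client (in SSIS, the OAuth connection and REST API Task).
    """
    url = f"https://athena.{region}.amazonaws.com/"
    headers = {
        "X-Amz-Target": "AmazonAthena.StartQueryExecution",
        "Content-Type": "application/x-amz-json-1.1",
    }
    payload = {
        "QueryString": query,
        "ClientRequestToken": client_request_token,
        "ResultConfiguration": {"OutputLocation": output_location},
    }
    if database:
        payload["QueryExecutionContext"] = {"Database": database}
    # json.dumps escapes embedded line breaks, keeping the body valid JSON.
    return url, headers, json.dumps(payload)


# Hypothetical values, for illustration only.
sample_ddl = (
    "CREATE EXTERNAL TABLE IF NOT EXISTS demo.customers (\n"
    "  id INT,\n"
    "  name STRING\n"
    ")\n"
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
    "LOCATION 's3://my-bucket/customers/'"
)
url, headers, body = build_start_query_request(
    "us-east-1", sample_ddl, "s3://my-bucket/athena-output/",
    "0123456789abcdef0123456789abcdef",
)
```

The same builder works for a SELECT statement; only the QueryString value changes.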

Notice that in the Body above, ClientRequestToken must be a unique value each time you call the StartQueryExecution API. To keep it simple we used the SSIS system variable System::ExecutionInstanceGUID, which is unique for each execution. You can also use the newid() function from SQL Server and save the result to a variable if you like.
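
Outside SSIS, a GUID-based token can be produced in a similar way. The following is a minimal sketch, assuming the length limits given in the Athena API reference for ClientRequestToken (32 to 128 characters); the function name is hypothetical.

```python
import uuid

# Assumed limits for ClientRequestToken, per the Athena API reference.
MIN_TOKEN_LENGTH = 32
MAX_TOKEN_LENGTH = 128


def new_client_request_token() -> str:
    """Return a fresh GUID string suitable as a ClientRequestToken."""
    token = str(uuid.uuid4())  # 36 characters including hyphens
    if not MIN_TOKEN_LENGTH <= len(token) <= MAX_TOKEN_LENGTH:
        raise ValueError("token length is outside the accepted range")
    return token


token = new_client_request_token()
```

The token makes the call idempotent: resending the same request with the same token should not start a second query, so generate a new token for every new query.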

Here is the screenshot of the above request:

REST API Task – Create table in Amazon Athena using SSIS (StartQueryExecution API Call)

Uploading data for Amazon Athena (Source files)

Once you have the Athena table created, we need some data to query. Loading data into an Amazon Athena table is nothing more than uploading files to S3. If you noticed, the CREATE TABLE statement in the previous section included a LOCATION clause. This means all source files are located at that S3 location.
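
The original DDL is not reproduced in this text. As a hypothetical illustration of how the LOCATION clause ties a table to an S3 folder, the sketch below builds a simple CSV-backed table definition. The table name, columns, bucket path and the helper build_create_table_ddl are all assumptions made for this example.

```python
def build_create_table_ddl(table, columns, s3_location):
    """Build a CREATE EXTERNAL TABLE statement for comma-delimited files.

    columns is a list of (name, type) pairs. s3_location must be an S3
    folder (ending in '/'); every file under it becomes table data.
    """
    if not s3_location.startswith("s3://") or not s3_location.endswith("/"):
        raise ValueError("s3_location must look like s3://bucket/folder/")
    column_list = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {column_list}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical table and bucket, for illustration only.
ddl = build_create_table_ddl(
    "demo.customers",
    [("id", "INT"), ("name", "STRING")],
    "s3://my-bucket/customers/",
)
```

Uploading a new file to that folder makes its rows visible to the next query without any separate load step.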

We have written a blog post that explains this process (see How to load data from SQL Server to S3 files).

Extract data from Amazon Athena using SSIS (Query Athena)

Now it’s time to read some data by writing a SQL query against Athena data (i.e. S3 files). As mentioned earlier, reading data from Athena can be done using the following steps.

  1. Call the StartQueryExecution API using REST API Task to execute your query (it returns a QueryExecutionId). Use the same technique as the Create Table example. Replace the CREATE TABLE command in the body with your SQL query (new lines must be replaced by \r\n in JSON).
  2. Use REST API Task to call the GetQueryExecution API and configure a periodic status check loop (the task keeps waiting until the Success or Failure value is found in the response).
  3. Once the data is available, read it using S3 CSV File Source.
  4. Finally, you can clean up the query output files, or keep them if some other process wants to read them (you can also clean up S3 files automatically at a scheduled interval using a bucket retention policy).

Now let’s look at each step in depth.

Step1 – Start Amazon Athena Query Execution

The first thing is to execute the Athena query by calling the StartQueryExecution API. The steps below are almost the same as those in the section Creating Table in Amazon Athena using API call.

The only difference is that the SQL query is a SELECT query rather than CREATE TABLE.

  1. Drag and drop ZS REST API Task from SSIS toolbox
    SSIS REST Api Task - Drag and Drop
  2. Enter Request Url as below

  3. In the HTTP Headers enter below 2 headers
  4. Change Request method to POST
  5. In the Body enter below
    Note: Rather than hard-coding the SQL query, we can use a variable placeholder (e.g. "QueryString": "{{User::Query,FUN_JSONENCODE}}"). Use FUN_JSONENCODE to escape new lines as \r\n; otherwise the JSON becomes invalid.

    REST API Task - Create table in Amazon Athena using SSIS (StartQueryExecution API Call)

  6. Now go to the Response Settings tab and configure it as below to extract QueryExecutionId (we will need it later).

    Set Extract Type to Json.
    Enter Expression as $.QueryExecutionId
    Check Save output to Variable and create a new variable (we can name it QueryExecutionId).

    Call Athena SQL, Save Amazon Athena QueryExecutionId in SSIS Variable

  7. That’s it. Now we can move to the next step.
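
The $.QueryExecutionId expression simply picks the top-level QueryExecutionId property out of the JSON response. A minimal sketch of the same extraction follows; the sample response text and the function name are assumptions for illustration.

```python
import json


def extract_query_execution_id(response_text: str) -> str:
    """Return the QueryExecutionId from a StartQueryExecution response."""
    data = json.loads(response_text)
    try:
        return data["QueryExecutionId"]
    except KeyError:
        # An error response has a different shape; surface it clearly.
        raise ValueError(f"no QueryExecutionId in response: {response_text}")


# Hypothetical response body, for illustration only.
sample_response = '{"QueryExecutionId": "11111111-2222-3333-4444-555555555555"}'
query_execution_id = extract_query_execution_id(sample_response)
```

The extracted value is what the next two steps depend on: it identifies the job to poll and, later, the output file to read.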

Step2 – Wait until Athena Query Execution is done

Newer versions of ZS REST API Task include a Status Check feature, which avoids complex looping logic. Here is how to configure it to check the job status (wait until query execution is finished). The steps are almost the same as in the previous section except for three things (Headers, Body and the Status Check tab).

For the status check we will call the GetQueryExecution API.

  1. Drag and drop ZS REST API Task from SSIS toolbox
    SSIS REST Api Task - Drag and Drop
  2. Enter Request Url as below

  3. In the HTTP Headers enter below 2 headers
  4. Change Request method to POST
  5. In the Body enter below
    Note: Here we used a variable in the body. We extracted this variable in the previous step; it contains the QueryExecutionId for the job we submitted.

    SSIS Amazon Athena – Call GetQueryExecution API (Status Check)

  6. Now go to the Status Check tab and configure it as below to implement the status check loop. We want to stop checking once the status is SUCCEEDED, FAILED or CANCELLED. If we find FAILED or CANCELLED in the State field, we must fail the step with an error.
    1. Check Enable Status Check
    2. Enter Success Value as SUCCEEDED
    3. Check the Fail Task option and enter this regex in the text box: FAILED|CANCELLED
    4. In the Interval, enter 5 to add a 5-second delay after each iteration.
  7. That’s it. The Status Check tab will look like below. Now we can move to the next step.

    REST API Status check Loop – Keep checking until JOB is done
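
The status check loop configured above behaves roughly like the following sketch. Here get_state is a hypothetical callable standing in for the GetQueryExecution call, and max_checks is an added safety limit that is not part of the settings described above. In the Athena response the state is found at QueryExecution.Status.State.

```python
import re
import time

# Same pattern as the Fail Task regex in the Status Check tab.
FAIL_PATTERN = re.compile(r"FAILED|CANCELLED")


def state_from_response(response: dict) -> str:
    """Pull the state out of a parsed GetQueryExecution response."""
    return response["QueryExecution"]["Status"]["State"]


def wait_for_query(get_state, interval_seconds=5.0, max_checks=120,
                   sleep=time.sleep):
    """Poll get_state() until the query succeeds, fails or is cancelled.

    Returns "SUCCEEDED" on success, raises RuntimeError on FAILED or
    CANCELLED, and raises TimeoutError after max_checks attempts.
    """
    for _ in range(max_checks):
        state = get_state()
        if state == "SUCCEEDED":
            return state
        if FAIL_PATTERN.search(state):
            raise RuntimeError(f"Athena query ended in state {state}")
        sleep(interval_seconds)  # still QUEUED or RUNNING, wait and retry
    raise TimeoutError("Athena query did not finish within the check limit")


# Simulated sequence of states, for illustration only (no real API call).
simulated = iter(["QUEUED", "RUNNING", "SUCCEEDED"])
final_state = wait_for_query(lambda: next(simulated), sleep=lambda seconds: None)
```

Passing the sleep function as a parameter keeps the loop easy to test without real delays.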

Step3 – Read data from Athena Query output files (CSV / JSON stored in S3 bucket)

When you create an Athena table you have to specify the query output folder, the data input location and the file format (e.g. CSV, JSON, Avro, ORC, Parquet); source files can be GZip or Snappy compressed. Once you execute a query it generates a CSV file. ZappySys can read CSV, TSV or JSON files using the S3 CSV File Source or S3 JSON File Source connectors. In this section we will use the CSV connector to read Athena output files.

Here is how to read the Athena output data:

  1. Drag and drop a Data Flow Task from the SSIS Toolbox
    SSIS Data Flow Task - Drag and Drop
  2. Double click the data flow, then drag and drop ZS Amazon S3 CSV File Source
    SSIS Amazon S3 CSV Source - Drag and Drop
  3. Double click to configure it.
  4. Click New Connection and configure the S3 connection.
  5. Once the connection is created, browse to the S3 bucket / folder location where Athena writes its output files.
  6. Replace the file name with a variable like below

Reading data from Amazon Athena CSV files (Stored in S3)
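
Athena names each result file after the query execution, so the file to read is normally the output location followed by the QueryExecutionId and a .csv extension. The sketch below shows how that path can be derived and how the CSV content (a header row plus quoted values) can be parsed. The bucket name, the sample CSV text and both helper names are assumptions for illustration.

```python
import csv
import io


def result_file_uri(output_location: str, query_execution_id: str) -> str:
    """Build the S3 URI of the CSV file Athena writes for one query."""
    return output_location.rstrip("/") + "/" + query_execution_id + ".csv"


def parse_athena_csv(text: str) -> list:
    """Parse Athena CSV output into a list of row dictionaries.

    The first line is treated as the header row; all values stay strings.
    """
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical values, for illustration only.
uri = result_file_uri("s3://my-bucket/athena-output/",
                      "11111111-2222-3333-4444-555555555555")
rows = parse_athena_csv('"id","name"\n"1","Alice"\n"2","Bob"\n')
```

This is why the file name in the S3 CSV File Source is replaced with the QueryExecutionId variable: the name is only known after Step1 has run.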

Putting it all together – Athena Data Extract Example

Here is the final package with all the pieces we discussed.

Read Amazon Athena Query output in SSIS (S3 CSV files)

Conclusion

You have seen how quickly you can integrate with Amazon Athena and other AWS cloud services using ZappySys SSIS PowerPack. Download SSIS PowerPack and try it out yourself.

Keywords: Extract data from Amazon Athena | Read from Amazon Athena Query file | Import / Export data from Amazon Athena | Fetch data from Amazon Athena