Introduction
In our previous post we explored how to call the Amazon AWS API using SSIS. In this post we will learn how to import / export data from Amazon Athena using SSIS. Amazon Athena is a service very similar to Google BigQuery, which we have already documented. To read data from Amazon Athena we will use ZappySys JSON / REST API Source, which has strong JSON parsing capabilities.
Prerequisites
Before we start the hello-world demo for calling the Amazon AWS API, make sure the following prerequisites are met.
- SSIS designer installed. It is sometimes referred to as BIDS or SSDT (download it from the Microsoft site).
- Basic knowledge of SSIS package development using Microsoft SQL Server Integration Services.
- Access to valid AWS credentials (Access Key, Secret Key for your IAM User). Click here to learn more about IAM users and Access Key/Secret Key
- Make sure SSIS PowerPack is installed. Click here to download.
Steps needed to read data from Amazon Athena
To read data from Amazon Athena you need at least 3 steps at a very high level. Later in this article we will see each step in depth.
- Call the StartQueryExecution API to send the SQL query (it is a job-style API, so this call starts a job)
- Keep checking the job status by calling the GetQueryExecution API (in the response, check the State field for the status; we need to look for the word SUCCEEDED)
- Once the job is finished it produces a CSV file in the S3 bucket (the output location is supplied in Step-1)
AWS Console – Athena Query Editor / Configure Output Settings
A few things you might have to set up before you can call the Athena API in SSIS:
- Make sure your IAM user has correct permission (We used existing policies for ease of use but you can use custom policy for precise permission set)
- Configure default output location for your Athena queries
Creating Connection to call Athena API in SSIS
Any API call made to the Amazon Athena service using ZappySys JSON / REST API Source or ZS REST API Task needs an OAuth connection.
For REST API Task, configure it as below (more details about using REST API Task are in the next section).
- Drag ZS REST API Task from Control Flow Toolbox
- Double click to edit.
- Change Request URL Access Mode to [Url from Connection]
- Now select <<New OAuth Connection>> from connection drop down
- On OAuth connection select Amazon AWS API v4 as Provider
- Enter your Access Key and Secret Key
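The connection takes care of signing every request for you. For readers curious about what happens behind the scenes, here is a minimal sketch of how a signing key is derived under AWS Signature Version 4, assuming that is what the "Amazon AWS API v4" provider refers to. The function names and the secret key below are placeholders for illustration, not part of SSIS PowerPack.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """Return the raw HMAC-SHA256 digest of message under key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive the Signature Version 4 signing key for one day, region and service."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


# Placeholder secret (not a real credential); region matches the Athena URL used in this post.
signing_key = derive_signing_key("EXAMPLE-SECRET-KEY", "20240101", "us-east-1", "athena")
print(signing_key.hex())
```

The Secret Key itself is never sent with the request; only a signature computed with a key derived like this is sent, which is why the connection needs both the Access Key and the Secret Key.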
Creating Table in Amazon Athena using API call
For this demo we assume you have already created a sample table in Amazon Athena. If you wish to automate creating an Amazon Athena table using SSIS, you need to run the CREATE TABLE DDL command using ZS REST API Task.
- In the previous ZS REST API Task select OAuth connection (See previous section)
- Enter Request Url as below
    https://athena.us-east-1.amazonaws.com/
- In the HTTP Headers enter below 2 headers
    X-Amz-Target: AmazonAthena.StartQueryExecution
    Content-Type: application/x-amz-json-1.1
- Change Request method to POST
- In the Body enter below
    {
      "ResultConfiguration": {
        "OutputLocation": "s3://my-bucket/output-files/"
      },
      "QueryString": "create external table tbl01 (CustomerID STRING, CompanyName STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://my-bucket/input-files/';",
      "ClientRequestToken": "{{System::ExecutionInstanceGUID}}"
    }
- Now click Test Request / Response. This will create a new Athena table which we will query in a later section.
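Notice that ClientRequestToken in the above Body must be a unique value each time you call the StartQueryExecution API. To keep it simple we used the SSIS system variable System::ExecutionInstanceGUID, which is unique for each package execution. Because that value stays the same within one execution, a package that calls StartQueryExecution more than once should use a different token for each call. You can also use the newid() function from SQL Server and save the result to a variable if you like.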
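To make the shape of this request easier to see outside the SSIS designer, here is a minimal Python sketch that assembles the same URL, headers and Body. The helper name build_start_query_request is our own invention for illustration, a random UUID stands in for System::ExecutionInstanceGUID, and the request is only built, not signed or sent.

```python
import json
import uuid

ATHENA_URL = "https://athena.us-east-1.amazonaws.com/"


def build_start_query_request(sql: str, output_location: str) -> dict:
    """Assemble URL, headers and JSON body for a StartQueryExecution call."""
    body = {
        "ResultConfiguration": {"OutputLocation": output_location},
        "QueryString": sql,
        # A fresh UUID plays the role of System::ExecutionInstanceGUID here.
        "ClientRequestToken": str(uuid.uuid4()),
    }
    return {
        "method": "POST",
        "url": ATHENA_URL,
        "headers": {
            "X-Amz-Target": "AmazonAthena.StartQueryExecution",
            "Content-Type": "application/x-amz-json-1.1",
        },
        "body": json.dumps(body),
    }


request = build_start_query_request(
    "create external table tbl01 (CustomerID STRING, CompanyName STRING) "
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
    "LOCATION 's3://my-bucket/input-files/';",
    "s3://my-bucket/output-files/",
)
print(request["body"])
```

Generating the token inside the helper means every call gets its own value, which sidesteps the question of reusing one GUID for several queries.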
Here is the screenshot of above Request
Uploading data for Amazon Athena (Source files)
Once you have the Athena table created, we will need some data to query. Loading data into an Amazon Athena table is nothing more than uploading files to S3. You may have noticed that we added the line below to the CREATE TABLE statement in the previous section. It means all source files are located at this S3 location.
    LOCATION 's3://my-bucket/input-files/'
We have written a blog post that explains this process (Click Here – How to load data from SQL Server to S3 files).
Extract data from Amazon Athena using SSIS (Query Athena)
Now it's time to read some data by writing a SQL query for Athena data (i.e. S3 files). As we mentioned earlier, reading data from Athena can be done using the following steps.
- Call the StartQueryExecution API using REST API Task to execute your query (it returns a QueryExecutionId). Use the same technique as in the Create Table example: replace the CREATE TABLE command in the Body with your SQL query (a new line in the query must be written as \r\n in JSON).
- Use REST API Task to call the GetQueryExecution API and configure a periodic status check loop (the task keeps waiting until the Success or Failure value is found in the response)
- Once the data is available, read it using S3 CSV File Source
- Finally, you can clean up the query output files, or keep them if some other process wants to read them (you can also clean up S3 files automatically at a scheduled interval using a bucket retention policy).
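The note about \r\n in the first step is easy to get wrong by hand. As a small illustration, the Python sketch below turns a query written on several lines into a JSON string with the line breaks escaped; sql_to_json_string is a made-up helper name, not something SSIS PowerPack provides.

```python
import json


def sql_to_json_string(sql: str) -> str:
    """Return sql as a JSON string literal with line breaks written as \\r\\n."""
    # Normalize any line-ending style to CR LF first, then let json.dumps escape it.
    normalized = "\r\n".join(sql.splitlines())
    return json.dumps(normalized)


query = """select *
from sampledb.elb_logs
limit 100000"""

# The printed value can be pasted as the QueryString value in the request Body.
print(sql_to_json_string(query))
```

The result also includes the surrounding double quotes and escapes any double quotes inside the query, so it is safe to drop in as the QueryString value.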
Now let's look at each step in depth.
Step1 – Start Amazon Athena Query Execution
The first thing is to execute the Athena query by calling the StartQueryExecution API. The steps below are almost the same as the ones we saw in the section Creating Table in Amazon Athena using API call.
The only difference here is that the SQL query is a SELECT query rather than CREATE TABLE.
- Drag and drop ZS REST API Task from SSIS toolbox
- Enter Request Url as below
    https://athena.us-east-1.amazonaws.com/
- In the HTTP Headers enter below 2 headers
    X-Amz-Target: AmazonAthena.StartQueryExecution
    Content-Type: application/x-amz-json-1.1
- Change Request method to POST
- In the Body enter below
    {
      "ResultConfiguration": {
        "OutputLocation": "s3://my-bucket/output-files/"
      },
      "QueryString": "select * from sampledb.elb_logs\r\nlimit 100000",
      "ClientRequestToken": "{{System::ExecutionInstanceGUID}}"
    }
- Now go to the Response Settings tab and configure it as below to extract QueryExecutionId (we will need it later).
- Set Extract Type as Json.
- Enter Expression as $.QueryExecutionId
- Check Save output to Variable and create a new variable (we can name it QueryExecutionId)
- That’s it. Now we can move to the next step.
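The expression $.QueryExecutionId simply picks the top-level QueryExecutionId property out of the JSON response. A rough Python equivalent is shown below; the response text and the ID in it are made up for illustration.

```python
import json

# Hypothetical response body; the ID below is invented for this example.
response_text = '{"QueryExecutionId": "11111111-2222-3333-4444-555555555555"}'


def extract_query_execution_id(text: str) -> str:
    """Rough equivalent of the expression $.QueryExecutionId."""
    return json.loads(text)["QueryExecutionId"]


query_execution_id = extract_query_execution_id(response_text)
print(query_execution_id)
```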
Step2 – Wait until Athena Query Execution is done
The new version of ZS REST API Task includes a Status Check feature, which avoids complex looping logic. Here is how to configure it to check the job status (i.e. wait until query execution is finished). The steps are almost the same as in the previous section except for three things (Headers, Body and the Status Check tab).
For the status check we will call the GetQueryExecution API.
- Drag and drop ZS REST API Task from SSIS toolbox
- Enter Request Url as below
    https://athena.us-east-1.amazonaws.com/
- In the HTTP Headers enter below 2 headers
    X-Amz-Target: AmazonAthena.GetQueryExecution
    Content-Type: application/x-amz-json-1.1
- Change Request method to POST
- In the Body enter below
    {"QueryExecutionId": "{{User::QueryExecutionId}}"}
- Now go to the Status Check tab and configure it as below to implement the status check loop. We want to stop checking once the status is SUCCEEDED, FAILED or CANCELLED. If we find FAILED or CANCELLED in the State field then we must fail the step with an error.
- Check Enable Status Check
- Enter Success Value as SUCCEEDED
- Check Fail Task option and enter this regex in the text box FAILED|CANCELLED
- In the Interval enter 5 seconds delay after each iteration.
- That’s it. Status check Tab will look like below. Now we can move to next step
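For comparison, the looping logic that the Status Check tab replaces could be sketched in Python roughly as follows. This is an illustration under assumptions, not how the task is implemented: wait_for_query, state_from_response and the max_checks limit are our own hypothetical names, and the responses are simulated rather than fetched from Athena.

```python
import json
import re
import time
from typing import Callable

FAIL_PATTERN = re.compile(r"FAILED|CANCELLED")


def state_from_response(response_text: str) -> str:
    """Read the State field from a GetQueryExecution response body."""
    return json.loads(response_text)["QueryExecution"]["Status"]["State"]


def wait_for_query(fetch_response: Callable[[], str],
                   interval_seconds: float = 5.0,
                   max_checks: int = 120,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the state is SUCCEEDED; raise on FAILED or CANCELLED."""
    for _ in range(max_checks):
        state = state_from_response(fetch_response())
        if state == "SUCCEEDED":
            return state
        if FAIL_PATTERN.search(state):
            raise RuntimeError(f"Athena query ended in state {state}")
        sleep(interval_seconds)
    # max_checks is a safety limit added for this sketch only.
    raise TimeoutError(f"Query not finished after {max_checks} checks")


def fake_response(state: str) -> str:
    """Build a simulated response body containing only the State field."""
    return json.dumps({"QueryExecution": {"Status": {"State": state}}})


# Simulated responses stand in for real GetQueryExecution calls; no real waiting.
responses = iter([fake_response(s) for s in ["QUEUED", "RUNNING", "SUCCEEDED"]])
final_state = wait_for_query(lambda: next(responses), sleep=lambda seconds: None)
print(final_state)
```

The Success Value, the FAILED|CANCELLED regex and the 5 second interval map directly onto the three settings entered in the Status Check tab.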
Step3 – Read data from Athena Query output files (CSV / JSON stored in S3 bucket)
When you create an Athena table you specify the data input location and the file format (e.g. CSV, JSON, Avro, ORC, Parquet …); the input files can also be GZip or Snappy compressed. The query output folder is supplied when you start the query. Once you execute a query it generates a CSV file in that folder. ZappySys can read CSV, TSV or JSON files using the S3 CSV File Source or S3 JSON File Source connectors. In this section we will use the CSV connector to read Athena output files.
Here is how to read Athena output data
- Drag and drop Data flow task from SSIS Toolbox
- Double click data flow and drag and drop ZS Amazon S3 CSV File Source
- Double click to configure it.
- Click New Connection and configure S3 Connection
- Once connection is created, browse to S3 bucket / folder location where Athena outputs files.
- Replace file name with variable like below
    my-bucket/output-files/{{User::QueryExecutionId}}.csv
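The file name pattern above works because the output file is named after the QueryExecutionId. The Python sketch below builds that path and parses a small CSV sample the way a CSV reader would; output_key is a hypothetical helper, and the sample rows are made up rather than real query output.

```python
import csv
import io


def output_key(bucket: str, folder: str, query_execution_id: str) -> str:
    """Build the bucket/folder/file path of the CSV written for one query."""
    return f"{bucket}/{folder.strip('/')}/{query_execution_id}.csv"


print(output_key("my-bucket", "output-files", "11111111-2222-3333-4444-555555555555"))

# Made-up sample content; a real result file would be downloaded from S3.
sample_csv = '"CustomerID","CompanyName"\n"ALFKI","Alfreds Futterkiste"\n'
rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    print(row["CustomerID"], row["CompanyName"])
```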
Putting all together – Athena Data Extract Example
Here is the final package with all the pieces we talked about.
Conclusion
You have seen how quickly you can integrate with Amazon Athena and other AWS cloud services using ZappySys SSIS PowerPack. Download SSIS PowerPack and try it out yourself.