Introduction
In our previous blog, we saw how to call REST in SSIS including concepts of Authentication / Error Handling. Now in this post, we will cover API Pagination. You will learn concepts and patterns about REST API Pagination (for JSON / XML / SOAP or CSV API).
We will describe methods and steps to implement REST API Pagination in SSIS PowerPack or ODBC PowerPack API Drivers (for apps like Excel, Power BI, Informatica…) without coding. Both Products share a similar user interface with minor differences. For example purpose, we will show screenshots from SSIS PowerPack UI but concepts should remain same. You will also learn some techniques about how to implement an infinite loop in SSIS. Sometimes your REST API requires you to loop through all pages until no more data found then you can use pattern described in this article (sample package attached at the end).
In this article, we will use SSIS REST API Task and JSON Source Connector to process data from RESTful web service. If you have to process XML (or SOAP) data from REST API then simply use XML Source Connector for REST API, SOAP or File rather than JSON Source.
Debugging REST API Requests
Before we look at various aspects of REST API Pagination, we highly recommend you install FREE Web Debugging Tool Fiddler. Look at this article to understand how to debug REST API requests.
Method-1: REST API Pagination via URL Parameter
If you have paging requirement by specifying page number in URL then you can loop until no more pages found. Your REST API can indicate no more page found by any of the following ways.
Example URL: http://myserver.com/v1/api/getcustomers/?page=1
If no more data found for specified page number then
- server may return an empty response
- server may return the response with specific error and response status code
JSON Source Connector comes with flexible paging options to handle this scenario. See below screenshot how to configure this.
URL with offset and limit
Sometimes you may have URLs with offset + limit parameters. as below. In this case, you can use the same technique as the above screenshot but just specify Page Num Indicator as offset and Increment by the same as your limit parameter (e.g. 1000 in this case). At runtime, your offset parameter will increment and generate URLs like below to fetch the next page.
Example URLs:
http://myserver.com/v1/api/getcustomers/?offset=1&limit=1000
http://myserver.com/v1/api/getcustomers/?offset=1001&limit=1000
http://myserver.com/v1/api/getcustomers/?offset=2001&limit=1000
Method-2: REST API Pagination via URL Path
Sometimes your URL may contain a page or other identifier using path format.
Example URLs:
http://myserver.com/v1/api/getcustomers/1/
http://myserver.com/v1/api/getcustomers/2/
http://myserver.com/data/file1.json
http://myserver.com/data/file2.json
If no more data is found for the specified page number then
- the server may return a 404 error (page not found)
- server may return a response with a specific error and response status code
JSON Source Connector comes with flexible paging options to handle this scenario. See the below screenshot of how to configure this.
URL with Skip Parameter (Offset starts at zero)
Now let’s look at another common pattern just like the offset we discussed earlier but in the Skip pattern, it starts at zero rather than offset=1. In the previous mode, we didn’t have the option to supply the initial value to start the counter but if you use Path mode then you can start the counter at some custom value. See the below screen to show how we can setup pagination with the skip method with custom increment.
Example URLs
1 2 3 4 5 |
http://mysite.com/api/getorders?skip=0&pagesize=100 http://mysite.com/api/getorders?skip=100&pagesize=100 http://mysite.com/api/getorders?skip=200&pagesize=100 http://mysite.com/api/getorders?skip=300&pagesize=100 .... |
Method-3 : REST API Pagination via Response Attribute (Next Link or Cursor)
This method is not discussed in this article but click here to read more
There are mainly 3 ways you can get next link or cursor.
- Full Link (e.g. nextPage : “https://mysite.com/api/v1/1/?filter=AAAA” )
- Partial Link (e.g. nextPage : “/api/v1/getcustomer/1/?filter=AAAA” )
- Cursor Continuation Token (e.g. nextCursor : “xxxxAAAbbbbCccc112233==” )
Next Link Method (Partial or Full URL)
Example response with next URL would be
1 2 3 4 |
{ "value" : [....], "odata.nextLink" : "http://abc.com/customers/2/" } |
In this method you don’t specify page number in URL. If response include many records then you will get partial response along with indicator of next link.
For more information about this REST API Pagination please check below links. It explains in depth how to paginate REST API response with next page link found in response.
OData Paging using SSIS – REST API Paging Example
Read Twitter data in SSIS using REST API Task and JSON Source – OAuth2 Protocol
Next Link – Detect the last page based on empty rows (Regular Expression)
Now let’s look at a little bit different scenario where the Next Link is never empty even on the last page. This can cause an infinite loop if we configure it like the last section. So in such a case (see below example) when the last page still has the next link URL but the data row array is empty then use settings like the screenshot below.
First Page Example
1 2 3 4 |
{ "value" : [....], "odata.nextLink" : "http://abc.com/customers/2/" } |
Last Page Example
1 2 3 4 |
{ "value" : [], "odata.nextLink" : "http://abc.com/customers/NNNN/" } |
Example: Last Page
Basically, configure the following options
- Set Stop indicator attribute to the same node which contains data rows (without array brackets). E.g. $.items
- Set Stop indicator value / Regular Expression to something like [] assuming the last page has the same way. Check Fiddler Trace for exact text on the last page.
Cursor Method
Many API returns cursor rather than link. Cursor method usually used to support a large number of records. Cursor is opened at server side with initial query information you submit and if cursor has more records you may get next cursor indicator in your response. In cursor method you have to supply Suffix for Next URL so after First URL your next URL will contain extra attribute which will include pagination information.
Suffix for next URL
Suffix for next URL can include special placeholders. There are two placeholders allowed. If you wish to append Next URL value in raw text then use <%nextlink%> but if your API needs value in encoded format e.g. A%20B rather than A+B then you have to use <%nextlink_encoded%>
Example URLs for Cursor:
Below is few sample URLs in the log when pagination occurs via Cursor method. As you see the first URL has Cursor=* means its first call, after first request server may send next cursor in the response body (see example)
1 2 3 4 |
{ data: [....], nextCursor: "Aooo/MxxxxxxxxxxxtdA==" } |
First URL: http://api.crossref.org/members/311/works?filter=type:journal-article&rows=100&cursor=*
Second URL: http://api.crossref.org/members/311/works?filter=type:journal-article&rows=100&cursor=Aooo/MxxxxxxxxxxxtdA==
Third URL: http://api.crossref.org/members/311/works?filter=type:journal-article&rows=100&cursor=Booo/MxxxxxxxxxxxtdA==
…..
…..
Method-4: Pass Page Number in Body Attribute
Some API allows you to pass your PageSize and Page number inside Request Body (i.e. POST data) in that case you may configure your options like below (We used Google AdWords Api as an example)
Method-5: REST API Pagination via multiple URL parameters (e.g. Start and End)
In the previous section, you saw how to pass page numbers via URL (Single Parameter: http://abc.com/?page=1). However, sometimes you have a pagination requirement where you have to specify multiple parameters that need increment. See below example URL.
1 2 3 4 |
http://abc.com/?startRow=1&endRow=100 http://abc.com/?startRow=101&endRow=200 http://abc.com/?startRow=201&endRow=300 ... |
Method-6: REST API Pagination via Header (Next Link in Header)
Another popular pagination method is paginate via Header. In this method, each API response includes special Header which indicates next link for your response. In header method there are two approaches. [1] Custom Method [2] RFC 5988 Method. Lets look at both.
Next Link in Custom Header
In this method you have to specify which response header contains next URL. When no more NEXT url found in response reader stops.
Next Link in Header as per RFC 5988
In this method, HTTP response contains a special header (Link) which contains next link for our response. If no link found to read more data then reader stops. This method is described well here in RFC 5988. Check this GitHub API Example to understand more on this pagination method.
Method-7: REST API Pagination using Cursor / Continuation Token from Response Header
There will be a time when your API sends some sort of Continuation Token in Response Header and you have to use that token to call next request. Some API needs this token passed as query string in Next URL and in some API you have to pass as request header. We will cover both cases in the following section.
Your API might call these tokens as “Cursor” too so we will use these both terms in the following section (They both mean same thing).
There are few ways these APIs pass Cursor to the next request
- Pass Cursor via Request Header with same header name
- Pass Cursor via Request Header with different header name (introduced in v4.2.3)
- Pass multiple Response Headers via Request Headers (Azure DocumentDB style usecase)
- Pass Cursor via URL Query String Parameter (introduced in v3.0 or later)
Lets see all approaches.
Pass Token via Request Header (Same header name)
Consider the API pagination use case in which you get some sort of continuation token in the response header and you take it and pass in the next request via request header (same header name).
Example:
First Request / Response:
1 2 3 4 5 6 7 |
------- Request ------ GET https://abc.com/api/getdata ------- Response ------ x-token: Nxdfabcd11111 { some response data } |
Second Request / Response:
1 2 3 4 5 6 7 8 |
------- Request ------ GET https://abc.com/api/getdata x-token: Nxdfabcd11111 ------- Response ------ x-token: Nxdfabcd22222 { some response data } |
……
……
Last Request / Response:
1 2 3 4 5 6 7 |
------- Request ------ GET https://abc.com/api/getdata x-token: Nxdfabcd22222 ------- Response ------ { some response data } |
For above pagination pattern that just enter the response header name in Page Num Indicator field as below
All possible formats for Page Num Indicator is below
x-page-token
x-token1, x-token2
RequestUrlQueryName=ResponseHeaderName
header::RequestHeaderName=ResponseHeaderName
RequestUrlQueryName=ResponseHeaderName(regular-expression)
Pass Token via Request Header (Different header name)
Consider the API pagination use case just like previous one but only different is request header name is not same as response header name. For example rather than x-token in request we need to call it x-next-token.
Example:
First Request / Response:
1 2 3 4 5 6 7 |
------- Request ------ GET https://abc.com/api/getdata ------- Response ------ x-token: Nxdfabcd11111 { some response data } |
Second Request / Response:
1 2 3 4 5 6 7 8 |
------- Request ------ GET https://abc.com/api/getdata x-next-token: Nxdfabcd11111 ------- Response ------ x-token: Nxdfabcd22222 { some response data } |
……
……
Last Request / Response:
1 2 3 4 5 6 7 |
------- Request ------ GET https://abc.com/api/getdata x-next-token: Nxdfabcd22222 ------- Response ------ { some response data } |
For above pagination pattern that just enter the response header name in Page Num Indicator field as below
Pass multiple Response Headers via Request Headers (Azure DocumentDB style usecase)
Some API like Azure Cosmos DB uses Continuation Token via Header approach. In that case you can enter single header name or multiple headers like below (For multiple headers use comma)
You can use use tools like Fiddler to see paginated requests as below.
First Request
Second Request
Pass Token via URL Parameter (Query String)
In the new version we added a feature where you can actually read Continuation Token (i.e. Cursor) from Response Header and then pass to next request in the URL rather than Request Header. Here is the example API which uses such method.
First Request
1 2 3 4 5 6 7 8 |
--------- Request --------- GET http://test.com/api/getdata --------- Response --------- X-RESPONSE-TOKEN : Pg2xxxxxxxxxxxxxxxxx { data : [ 1,2,3.... 10 ] } |
Second Request
1 2 3 4 5 6 7 8 |
--------- Request --------- GET http://test.com/api/getdata?cursor=Pg2xxxxxxxxxxxxxxxxx --------- Response --------- X-RESPONSE-TOKEN : Pg3xxxxxxxxxxxxxxxxx { data : [ 11,12,13.... 20 ] } |
Here is how to configure above pagination pattern
Pass Token via URL Parameter (Query String) – with custom Regular expression (Usecase: Salesforce BULK query API)
Now consider previous scenario but with slightly different requirement. By default previous pagination pattern stops when specified response header is missing or response header value is blank. However is some cases header value contains custom string to indicate last page. For example x-token: NULL. In such case we need to use Page Num Indicator like below (must use v4.2.3 or higher in SSIS PowerPack / 1.5.2 or higher in ODBC PowerPack)
1 |
nextUrlParameterName=responseHeadername(Some_Regular_Expression) |
For example to stop pagination in below case you have to enter Page Num Indicator as below. It says stop pagination when “NULL” word is detected
1 |
cursor=X-RESPONSE-TOKEN(^((?!NULL\b).)*$) |
First Request
1 2 3 4 5 6 7 8 |
--------- Request --------- GET http://test.com/api/getdata --------- Response --------- X-RESPONSE-TOKEN : Pg2xxxxxxxxxxxxxxxxx { data : [ 1,2,3.... 10 ] } |
Second Request
1 2 3 4 5 6 7 8 |
--------- Request --------- GET http://test.com/api/getdata?cursor=Pg2xxxxxxxxxxxxxxxxx --------- Response --------- X-RESPONSE-TOKEN : Pg3xxxxxxxxxxxxxxxxx { data : [ 11,12,13.... 20 ] } |
……
……
Last Request
1 2 3 4 5 6 7 8 |
--------- Request --------- GET http://test.com/api/getdata?cursor=Pg3xxxxxxxxxxxxxxxxx --------- Response --------- X-RESPONSE-TOKEN : NULL { data : [ 21,22,23.... 30 ] } |
Here is how to configure above pagination pattern
Method-8: Pass Pagination Token in Request Body
In previous section (Method-4: Pass Page Number in Body Attribute ) we saw how to pass simple numeric page counter in Body. This is fine for some API but if your API has base64 style text token then counter based method doesn’t work. Also if you have requirement to hide page attribute node in first request and send page counter node in second request body (Changing Request Body Pattern for First vs Next) then check next two sections.
Next two sections covers two API Pagination patterns to fetch 100 rows (1000 in each page) . Both requires to send token in Body but second section is more complex need (Different Body Attributes in first vs next requests )
Passing Page Token in Body (Blank Token in first request allowed)
Lets look at simple pattern where first request has blank token in Body but all other requests until we find last page we keep sending non-empty token. This approach works as long as API Provider doesn’t mind having Blank Token in first request. If you get error then check next section to handle more complex way.
Here is the example of API which needs token in body
First API call
1 2 3 4 5 6 7 |
--------- Request --------- POST http://test.com/api/getdata { searchFor: "Account", continueToken:"" } --------- Response --------- {recordsFound: 5000, data: ["Acct1", "Acct2"..., "Acct1000"], nextToken:"AxBxCx112233001==" } |
Second API call
1 2 3 4 5 6 7 |
--------- Request --------- POST http://test.com/api/getdata { searchFor: "Account", continueToken:"AxBxCx112233001==" } --------- Response --------- {data: ["Acct1001", "Acct1002"..., "Acct2000"], nextToken:"AxBxCx112233002==" } |
….. many more calls …
…. many more calls ….
Last API call (no more nextToken in response so STOP pagination)
1 2 3 4 5 6 7 |
--------- Request --------- POST http://test.com/api/getdata { searchFor: "Account", continueToken:"AxBxCx112233009==" } --------- Response --------- {data: ["Acct9001", "Acct9002"..., "Acct10000"] } |
So if you have such pattern then you can set few properties
For such issue you can use Changing few properties like below. Use Property Grid if you cant find on UI
- Set EnablePageTokenForBody = True
- Set PagePlaceholders = header=|;filter=|;
- Set HttpRequestData = { searchFor: "Account", continueToken:"[$pagetoken$]" }
Passing Page Token in Body (Blank Token in first request)
Now let’s look at more complex pattern. Here our first request body is different than other requests.
First API call
1 2 3 4 5 6 7 |
--------- Request --------- POST http://test.com/api/getdata { searchFor: "Account" } --------- Response --------- {recordsFound: 5000, data: ["Acct1", "Acct2"..., "Acct1000"], nextToken:"AxBxCx112233001==" } |
Second API call
1 2 3 4 5 6 7 |
--------- Request --------- POST http://test.com/api/getdataNext { continueToken:"AxBxCx112233001==" } --------- Response --------- {data: ["Acct1001", "Acct1002"..., "Acct2000"], nextToken:"AxBxCx112233002==" } |
….. many more calls …
…. many more calls ….
Last API call (no more nextToken)
1 2 3 4 5 6 7 |
--------- Request --------- POST http://test.com/api/getdataNext { continueToken:"AxBxCx112233009==" } --------- Response --------- {data: ["Acct9001", "Acct9002"..., "Acct10000"] } |
So notice few things
- First Request Body / URL is different than second / third and Nth API call
- For first request we do not have token attribute but after first request we do have token attribute
We have this article and this article which describes similar pattern.
Now lets look at how to solve both problems
Passing PageToken or Counter in Next API Call Body
For such issue you can use Changing few properties like below. Use Property Grid if you cant find on UI
- Set EnablePageTokenForBody = True
- Set PagePlaceholders = header=|;filter=|;
- Set HttpRequestData = [$tag$]
- Set HasDifferentNextPageInfo = True
- Set FirstPageBodyPart = { searchFor: "Account"}
- Set NextPageBodyPart = { token:"[$pagetoken$]" }
So notice how we defined Body in two different properies as template. And in HttpRequestData we used [$tag$]
Output Page Number as column
Sometimes you have use case where you want to Start Page Number from the last page loaded successfully. By default, you will see one additional column __FileName in the output. This column contains PageNumber from URL for each row. You can save this into database so when next time you run your SSIS you can start at MAX(PageNumber) + 1.
Outputting PageNumber option is not visible by default so Use Property Grid and look for OutputFileName property (Default=true)
Last Page Detection
Every API may have different way of telling you that you reached last page or no more data found so you should stop looping. SSIS PowerPack comes with many options so you can implement desired logic for last page detection.
Following approaches supported by last page detection. Below options are not applicable if you are using Method#3 (Response contains Next Link / Cursor)
- Detect the last page by row count
- Detect the last page by response status code
- Detect the last page by error message (substring search)
- Detect the last page by response size
- Detect the last page based on response value or regular expression pattern
Now lets look at each setting. You should choose correct one based on your API behavior
Detect the last page by row count
If your API returns blank document or blank Array (if its JSON) then you can use this method to detect the last page. So when no content found in Response document it will stop pagination.
See below example of JSON API .. It shows sample first page and Sample last page (Assuming you entered Filter as $.Records[*] to extract data from Records array.
First Page
1 2 3 |
{ Records : [ {...}, {...}, {...} ......... ] } |
Last Page
1 2 3 |
{ Records : [ ] } |
Detect the last page by response status code
Sometimes you get error with status code when the last page is reached (Most common is 404). So rather than failure, you can treat that status code as last page indicator using this option (see below).
Detect the last page by the error message
If you cant use above two methods and want to detect the last page based on a certain error string in the response then use below approach.
Detect the last page by response size
Sometimes you can detect the last page by response size. For example, your response contains “–” only if no more data is found then you cannot use any of the above approaches and you have to detect based on response size (e.g. less than 3 bytes). See below
Detect the last page based on the response attribute value
There will be a time when you want to detect the last page based on some indicator value from your response. In that case, you can use the following settings to detect the last page.
- Select Paging mode =Response Attribute Mode
- Select Next Link Attribute (Example: $.next_page )
- Select Stop Indicator Attribute (Example: $.hasMorePages )
- Enter a value for Stop Indicator (Example: false )
If you don’t want to use Hardcoded value for stop indicator and want to use expression then see next section.
Detect the last page based on regular expression pattern
There will be a time when you want to detect the last page based on some indicator value from your response. In that case, you can use the following settings to detect the last page.
- Select Paging mode =Response Attribute Mode
- Select Next Link Attribute (Example: $.next_page )
- Select Stop Indicator Attribute (Example: $.count )
- Enter regex pattern (regular expression) for Stop Indicator (Example: regex=^\d{1,3}$ ). Stop when count attribute is less than 1000 in other word less than 4 digits.
Here is an example from Zendesk API (Stop pagination when count attribute from the response is less than 1000)
Handling API Rate Limit - Retry on too many API requests
Many API restricts you from calling too many API calls too soon (just to save their servers from being overwhelmed) - Example. Let's think this way if you fetching a million rows via paginated response and the very last page fails then you won't be happy because now you have to fetch everything again. To solve this ZappySys provides retry options on HTTP / OAuth and REST API connection managers.
Generally, when you exceed the rate limit, the server sends you error code 429 but it can be something else too or multiple codes are returned. In the below example, we use 503. Check your API documentation to know the exact code you should retry.
To set the retry options go to connection settings (e.g. HTTP / OAuth / REST connection). The below example is for OAuth connection.
To retry on multiple codes use Pipe as a delimiter. Example: 429|503
Performance Tips
You can improve performance of your API calls by doing a few changes (again not every API may support this)
- Use the maximum page size possible (e.g. if you doing pagination you can set max possible page size – refer to your API documentation whats the default size and what is the max page size you can supply)
e.g. /myapi?pagesize=100 (parameter name can be something else like limit=100)
SSIS Infinite Loop
So now you know how to paginate automatically in JSON Source Connector but what if you want to use REST API Task for pagination? Some custom scenarios like web scraping require you to download all HTML pages until no more results found? In that case, you most likely need infinite looping pattern (Loop until the last page detected). To loop through pages until the last page is reached you can perform the following logic.
The last page can be detected based on any of the following ways
1. Check Response StatusCode
2. Check Error Message
3. Check Response Size (If less than X bytes then consider as the last page)
REST API Task support #1 and #2 scenarios. If you want to check bytes then use the expression in Expression Task or Script Task
(e.g. @exitloop = LEN(@responseData) <= 3 )
URL for Rest API Task (On Request Settings Tab)
When you configure REST API Task enter URL as below so its dynamic. You can use PageCounter variable anywhere in the URL.
1 |
//zappysys.com/downloads/files/test/cust-{{User::pagecounter}}.json |
Download Sample SSIS Package
SSIS-Loop-Rest-API-2012.dtsx (Zip file)
Video Tutorial – Reading data from API in SSIS
In case you missed our previous article about calling REST API is SSIS then check below video.
Conclusion
It can be time consuming and steep learning curve to write C#/JAVA code to handle your rest api integration but using some Tasks and Connectors listed below can help you to solve big headache and provide clean drag and drop REST API integration in SSIS for virtually any REST API Source (e.g. Twitter, Facebook, Salesforce, office 365). Download SSIS PowerPack and try it out by yourself.
SSIS Components for REST API/JSON/XML/SOAP
JSON Source Connector (File, REST API) | |
Export JSON File Task | |
JSON Parser Task | |
XML Source Connector (File, SOAP, REST API) |
Pingback: Calling SOAP Web Service in SSIS (XML Source) | ZappySys Blog