May 23, 2016 at 7:48 pm #1601
Which approach is faster/better for Redshift data transfer? Should I use the SSIS Redshift Data Transfer Task, or the manual approach where I export, split, and compress data into multiple files using the Export CSV Task, then use the Amazon S3 Task to upload the files in parallel, and then issue the COPY command manually using the Redshift ExecuteSQL Task?
First of all, let's understand how many steps are needed to perform a successful data transfer to Amazon Redshift as per best practices. Assume a scenario where you are loading data into Amazon Redshift from SQL Server.
If you are using the SSIS Redshift Data Transfer Task, it will handle many of these steps automatically for you. See below the list of common things handled automatically when you use the SSIS Redshift Data Transfer Task.
Steps for Amazon Redshift Data Load from On-Premise files or RDBMS (e.g. MySQL, SQL Server)
- Export local RDBMS data to flat files (make sure you remove invalid characters and apply escape sequences during export)
- Split files into 10-15 MB each to get optimal performance during upload and the final data load
- Compress files to *.gz format so you don't end up with a $1000 surprise bill 🙂 In my case, text files compressed 10-20 times.
- List all file names in a manifest file so that when you issue the COPY command to Redshift, the files are treated as one unit of load
- Upload manifest file to Amazon S3 bucket
- Upload local *.gz files to Amazon S3 bucket
- Issue the Redshift COPY command with the appropriate options
- Schedule file archiving for the on-premises files and the S3 staging area on AWS
- Capture errors and set up restartability if something fails
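The export/split/compress steps above can be sketched in plain Python. This is only a minimal illustration of what the task automates for you; the function names, file-name pattern, and exact chunk size are assumptions, not part of the product:

```python
import csv
import gzip
import io
import os

def split_and_compress(rows, header, out_dir, prefix="export",
                       chunk_bytes=10 * 1024 * 1024):
    """Write rows across multiple *.csv.gz files, starting a new file once
    the uncompressed chunk passes chunk_bytes (~10 MB, per the guideline).
    Returns the list of file paths written (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    paths, buf, writer = [], None, None
    for row in rows:
        # Start a new chunk when none exists yet or the current one is full.
        if buf is None or buf.tell() >= chunk_bytes:
            if buf is not None:
                _write_gz(paths[-1], buf)  # flush the finished chunk
            paths.append(os.path.join(out_dir, f"{prefix}.{len(paths) + 1:03d}.csv.gz"))
            buf = io.StringIO()
            writer = csv.writer(buf)
            writer.writerow(header)  # every chunk carries the header
        writer.writerow(row)
    if buf is not None:
        _write_gz(paths[-1], buf)  # flush the last partial chunk
    return paths

def _write_gz(path, buf):
    # Compress the accumulated CSV text to gzip on disk.
    with gzip.open(path, "wt", newline="") as f:
        f.write(buf.getvalue())
```

The resulting *.gz file names would then be listed in the manifest and uploaded to S3 as described above.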
If, for whatever reason, you decide to take complete control of the Redshift data load process, you can do that too.
SSIS PowerPack has many tasks that make data integration with Amazon AWS services very easy.
For the manual approach, consider the following.
- Export local RDBMS data into flat files using the Export CSV Task (enable the Split option and check the option to compress to a gzip file)
- Use the Amazon S3 Task to upload files in parallel (by default, a single S3 Task automatically decides how many files to upload in parallel based on the settings you specify on the S3 Connection). The Amazon S3 Task also performs a multithreaded upload of a single file if it is too big (usually more than 16 MB): it splits the bigger file into multiple chunks and then uploads those chunks in parallel.
- You can then issue the COPY command manually using the Redshift ExecuteSQL Task.
- To perform S3 file archival you can use the Amazon S3 Task, and to perform local file archival you can use the Advanced File System Task.
- If any error occurs, you have to manually query the correct system table (e.g. stl_load_errors) in the Redshift database to find the exact error message.
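A minimal sketch of the COPY side of the manual approach is below. The bucket name, table name, and IAM role ARN are placeholders; the manifest format, the COPY options shown, and the stl_load_errors system table are standard Redshift features, but the helper functions themselves are hypothetical:

```python
import json

def build_manifest(s3_urls, mandatory=True):
    """Redshift manifest: all listed files are treated as one unit of load."""
    return json.dumps(
        {"entries": [{"url": u, "mandatory": mandatory} for u in s3_urls]},
        indent=2,
    )

def build_copy_sql(table, manifest_url, iam_role):
    """COPY command for gzipped CSV files listed in a manifest
    (this is the statement you would run via the Redshift ExecuteSQL Task)."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role}' "
        "MANIFEST GZIP CSV;"
    )

# If the load fails, the exact error is in Redshift's stl_load_errors table:
ERROR_QUERY = (
    "SELECT filename, line_number, colname, err_reason "
    "FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;"
)
```

You would upload the manifest JSON to S3 alongside the data files, then run the generated COPY statement against Redshift.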
Check this similar article, which explains the manual approach for downloading Redshift data as a file and loading it into a local database. It is not exactly the same as uploading data into Redshift, but it will give you an idea of how to use some of the tasks mentioned on this page.
Things to tweak for performance:
- Make sure your S3 threads are set correctly on the Amazon S3 Connection Manager
- Too many threads may be counterproductive, so test different settings and figure out which one gives the maximum speed.
- When too many upload/download threads are used for S3, you may get HTTP timeout errors. In that case, increase the timeout setting in the S3 Connection Manager UI.
- Change how many threads are used to compress files (this setting is found on the Advanced tab of the SSIS Redshift Data Transfer Task)
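To get a feel for why the thread count matters, here is a small Python sketch of bounded parallel uploads. The upload function is a stub standing in for a real S3 client call, and the helper name is an assumption; the point is only that the pool size (like the S3 Threads setting above) caps concurrency and should be benchmarked:

```python
from concurrent.futures import ThreadPoolExecutor

def upload_all(paths, upload_one, max_threads=8):
    """Upload files in parallel with a bounded thread pool, mirroring the
    'S3 Threads' setting on the S3 Connection Manager. Results come back
    in input order. Too many threads can be counterproductive (and can
    trigger HTTP timeouts), so benchmark max_threads for your link."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(upload_one, paths))
```

With a real client you would pass the library's upload call (for example, a boto3 upload function) as upload_one.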
Redshift data load is a complex process, but the SSIS Redshift Data Transfer Task can handle most of those steps automatically for you. However, in certain cases you will have to take complete control and use the manual approach. In that case, use the SSIS tasks listed in the section above, but make sure you test the various options and benchmark them, weighing speed against reliability.