Azure Synapse, formerly known as Azure SQL Data Warehouse, is an analytics service that gives you the best of both worlds: big data and data warehouse analytics. According to Microsoft, it is the first and only analytics service to have run all TPC-H queries at petabyte scale.
Leading enterprises are embracing Azure Synapse and migrating their huge data sets to the platform. This creates a pressing need for a recommended approach that performs the migration at relatively low cost and with little effort.
In this article we will look at how you can migrate data from AWS Redshift to Azure Synapse. Please note that we discuss only data migration here, not schema or code migration. The schema and tables should be created beforehand as a prerequisite, with any data type changes needed to make them compatible with Azure Synapse; for example, AWS Redshift allows VARCHAR columns up to 65,535 bytes, whereas Azure Synapse limits varchar columns to 8,000.
So let’s begin.
Prerequisites
- An ODBC connection to AWS Redshift configured on the server
- Schema and tables created in the destination database with appropriate distribution (see the sketch after this list)
- An S3 bucket to store the data extracted from Redshift as gzip files
- Azure Blob Storage to store the gzip files copied from S3 via Azure Data Factory
- Azure Data Factory to copy the data from S3 to Blob Storage
- An external data source and file format created in the destination database, used to create the external table (also shown in the sketch after this list)
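For reference, below is a minimal T-SQL sketch of how these prerequisites could be created in a Synapse dedicated SQL pool. All object names ([SchemaName].[Table1], [BlobCredential], [datasource], [fileformat]) are hypothetical placeholders, and the script assumes a database master key already exists for the scoped credential.

-- Destination table with an explicit distribution (hash on a key column here)
-- and a Redshift VARCHAR(65535) column trimmed to Synapse's 8000 limit.
CREATE TABLE [SchemaName].[Table1]
(
    [Col1] bigint NOT NULL,
    [Col2] varchar(8000) NULL
)
WITH (DISTRIBUTION = HASH([Col1]), CLUSTERED COLUMNSTORE INDEX);

-- Credential for the Blob Storage account (assumes a database master key exists).
CREATE DATABASE SCOPED CREDENTIAL [BlobCredential]
WITH IDENTITY = 'user', SECRET = '<Storage Account Access Key>';

-- External data source pointing at the Blob Storage container used by the Copy step.
CREATE EXTERNAL DATA SOURCE [datasource]
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<Container>@<StorageAccount>.blob.core.windows.net',
    CREDENTIAL = [BlobCredential]
);

-- Delimited, gzip-compressed file format matching the Unload output ('^' delimiter).
CREATE EXTERNAL FILE FORMAT [fileformat]
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '^'),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);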
Steps

1. Unload: Prepare and execute an UNLOAD query against Redshift and store the data in AWS S3 as gzip files.
2. Copy: Invoke Azure Data Factory to copy the gzip files produced by the Unload step from AWS S3 to Azure Blob Storage.
3. Load: Use PolyBase to load the data into Azure Synapse via a stored procedure.
Unload
Below is a sample UNLOAD query that extracts data from an AWS Redshift table to an S3 bucket as gzip files with a '^' delimiter. The MAXFILESIZE parameter caps the size of each extracted file, and ALLOWOVERWRITE permits overwriting existing files in the bucket.
unload ($$ SELECT Col1, Col2 FROM schemaname.tablename $$)
to 's3://<BucketName>/<Prefix>'
credentials 'aws_access_key_id=<Access Key>;aws_secret_access_key=<Secret Access Key>'
parallel on gzip
maxfilesize 3 gb allowoverwrite delimiter as '^';
There may be a scenario where the administrator has set a query execution timeout in Redshift. In such cases, unloading a large number of rows may time out, and you will need to adjust the query to unload data in batches at a Year/Month/Day granularity.
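As an illustration, a batched unload could filter on a date column and write each batch to its own prefix; the column name order_date and the date range below are hypothetical:

unload ($$ SELECT Col1, Col2 FROM schemaname.tablename
           WHERE order_date >= '2020-01-01' AND order_date < '2020-02-01' $$)
to 's3://<BucketName>/<Prefix>/2020/01/'
credentials 'aws_access_key_id=<Access Key>;aws_secret_access_key=<Secret Access Key>'
parallel on gzip
maxfilesize 3 gb allowoverwrite delimiter as '^';

Running one such unload per month (or day) keeps each execution within the timeout and produces separate prefixes that can be copied and loaded batch-wise.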
Copy
You can invoke Azure Data Factory through either the Portal or PowerShell. Here the source is the AWS S3 bucket and the destination is the Azure Blob Storage container. Below is the code that can be used to invoke the pipeline via PowerShell; you will need to use a service principal when invoking Azure Data Factory this way.
$parameter = @{
    "source"      = "<S3 Source Container>"
    "destination" = "<Blob Storage Container>"
}
$password = "<Password>" | ConvertTo-SecureString -AsPlainText -Force
$username = "<User Name>"
$credential = New-Object System.Management.Automation.PSCredential($username, $password)

# Sign in with the service principal; this sets the default Azure context.
Connect-AzAccount -Credential $credential -ServicePrincipal -Tenant "<Tenant ID>"

# Trigger the pipeline and capture the run ID.
$id = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "<Resource Group Name>" -DataFactoryName "<Data Factory Name>" -PipelineName "<Pipeline Name>" -Parameter $parameter

# Poll the run status every 60 seconds until the pipeline succeeds or fails.
while ($True) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<Resource Group Name>" -DataFactoryName "<Data Factory Name>" -PipelineRunId $id
    $run.Status
    if ($run.Status -eq 'Succeeded' -or $run.Status -eq 'Failed') {
        break
    }
    Start-Sleep -Seconds 60
}
Load
Once the files are copied to Azure Blob Storage, create an external table and load the data into the destination table. You can also load rejected rows into a separate location in Blob Storage. [datasource] is the external data source created for the Blob Storage account and [fileformat] is the file format with the '^' delimiter, both created as prerequisites. The TRUNCATE command can be used when you are performing a full data load of the table; for a partial or batch-wise load, skip it.
IF OBJECT_ID('[ExtTable].[Table1]') IS NOT NULL
    DROP EXTERNAL TABLE [ExtTable].[Table1];

CREATE EXTERNAL TABLE [ExtTable].[Table1] (
    [Col1] bigint,
    [Col2] varchar(100))
WITH (
    LOCATION = '<blob storage>',
    DATA_SOURCE = [datasource],
    FILE_FORMAT = [fileformat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0);

TRUNCATE TABLE [SchemaName].[Table1];

INSERT INTO [SchemaName].[Table1]
SELECT [Col1], [Col2]
FROM [ExtTable].[Table1];
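Since the Load step is meant to run via a stored procedure (so it can be invoked per table from PowerShell), one possible way to wrap the insert is sketched below. This is a minimal sketch that assumes the external table has already been (re)created as above; the procedure name and the @FullLoad parameter are hypothetical.

CREATE PROCEDURE [SchemaName].[usp_Load_Table1]
    @FullLoad bit = 1
AS
BEGIN
    -- Empty the destination only for a full load; skip for partial or batch-wise loads.
    IF @FullLoad = 1
        TRUNCATE TABLE [SchemaName].[Table1];

    -- Load from the external table that sits over the gzip files in Blob Storage.
    INSERT INTO [SchemaName].[Table1]
    SELECT [Col1], [Col2]
    FROM [ExtTable].[Table1];
END;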
You can parameterize the complete process in PowerShell to automate the migration of multiple tables at a larger scale. If inserts fail for some rows, add a rejected row location and increase the reject value on the external table to troubleshoot the root cause for the affected tables, as shown below.
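For example, a variant of the external table above could tolerate a number of bad rows and write them to a dedicated path for inspection; the threshold of 100 and the '/rejected' path are arbitrary illustrative choices, and REJECTED_ROW_LOCATION assumes a Hadoop-type external data source:

CREATE EXTERNAL TABLE [ExtTable].[Table1] (
    [Col1] bigint,
    [Col2] varchar(100))
WITH (
    LOCATION = '<blob storage>',
    DATA_SOURCE = [datasource],
    FILE_FORMAT = [fileformat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 100,                 -- tolerate up to 100 bad rows instead of failing on the first one
    REJECTED_ROW_LOCATION = '/rejected' -- bad rows and reject reasons are written under this path in the data source
);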
So go ahead, explore, and migrate your data to Azure Synapse Analytics with the above approach, and do share your feedback.