Migrating Data from AWS Redshift to Azure Synapse Analytics

Azure Synapse, formerly known as Azure SQL Data Warehouse, is an analytics service that gives you the best of both worlds: enterprise data warehousing and Big Data analytics. It has posted leading TPC-H benchmark results for queries executed at petabyte scale.

Leading enterprises are embracing Azure Synapse and migrating their huge data sets to the platform. This creates a need for a recommended approach that performs the migration at relatively low cost and with little effort.

In this article we will look at how you can migrate data from AWS Redshift to Azure Synapse. Note that we discuss only data migration here, not schema or code migration. The schema and tables should be created as a prerequisite, with any data type changes needed to make them compatible with Azure Synapse; for example, AWS Redshift allows a maximum column character length of 65,535, whereas in Azure Synapse it is only 8,000.
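
As an illustration, a Redshift table with a wide varchar column might be recreated in the dedicated SQL pool along these lines; the table name, column names, and the HASH distribution choice below are illustrative placeholders, not prescriptions:

-- Redshift source table (for reference):
--   CREATE TABLE schemaname.tablename (col1 BIGINT, col2 VARCHAR(65535));

-- Equivalent Azure Synapse table, with the varchar width reduced to the
-- 8000-character limit and an explicit distribution chosen up front:
CREATE TABLE [SchemaName].[Table1]
(
    [Col1] bigint NOT NULL,
    [Col2] varchar(8000)
)
WITH
(
    DISTRIBUTION = HASH([Col1]),
    CLUSTERED COLUMNSTORE INDEX
);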

So let's begin.

Prerequisites

  1. An ODBC connection to AWS Redshift configured on the server.
  2. Schema and tables created on the destination database with an appropriate distribution.
  3. An AWS S3 bucket to store the data extracted from Redshift as .gzip files.
  4. Azure Blob Storage to store the .gzip files copied from S3 via Azure Data Factory.
  5. An Azure Data Factory pipeline to copy data from S3 to Blob Storage.
  6. An external data source and file format created in the destination database for creating the external table (a sample definition is shown after this list).
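
For reference, the external data source and file format referenced later as [datasource] and [fileformat] could be created roughly as follows; the credential name, container, and storage account are placeholders, and the database needs a master key before a database scoped credential can be created:

-- One-time setup; skip if a master key already exists in the database.
CREATE MASTER KEY;

-- Credential holding the storage account access key (placeholder values).
CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'user', SECRET = '<Storage Account Access Key>';

-- External data source pointing at the blob container that ADF copies into.
CREATE EXTERNAL DATA SOURCE [datasource]
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<Container>@<StorageAccount>.blob.core.windows.net',
    CREDENTIAL = BlobCredential
);

-- File format matching the unloaded files: '^' delimited, gzip compressed.
CREATE EXTERNAL FILE FORMAT [fileformat]
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '^'),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);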

Steps

1. Unload

Prepare and execute an UNLOAD query against Redshift and store the data in AWS S3 as .gzip files.

2. Copy

Invoke Azure Data Factory to copy the .gzip files produced by the Unload step from AWS S3 to Azure Blob Storage.

3. Load

Use PolyBase to load the data into Azure Synapse via a stored procedure.

Unload

Below is a sample UNLOAD query that extracts data from an AWS Redshift table to an S3 bucket in .gzip format with a "^" delimiter. The maxfilesize parameter caps the size of each extracted file, and allowoverwrite permits overwriting an existing file in the bucket.

unload ($$ SELECT Col1, Col2 FROM schemaname.tablename $$)
to 's3://<BucketName>'
credentials 'aws_access_key_id=<Access Key>;aws_secret_access_key=<aws_secret_access_key>'
gzip
parallel on
maxfilesize 3 gb
allowoverwrite
delimiter as '^';

There may be a scenario where the administrator has set a timeout for query execution in Redshift. In such cases, unloading a large number of rows may time out, and you will need to adjust your query to unload the data in batches at year, month, or day granularity, as in the sketch below.
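
A minimal sketch of a month-by-month unload, assuming the table has a date column (here a hypothetical order_date) to slice on and a per-month prefix in the S3 path:

-- Hypothetical example: unload one month at a time to stay within the
-- query timeout; repeat (or script) this for each month in the table.
unload ($$ SELECT Col1, Col2
           FROM schemaname.tablename
           WHERE order_date >= '2021-01-01'
             AND order_date <  '2021-02-01' $$)
to 's3://<BucketName>/tablename/2021-01/'
credentials 'aws_access_key_id=<Access Key>;aws_secret_access_key=<aws_secret_access_key>'
gzip
parallel on
maxfilesize 3 gb
allowoverwrite
delimiter as '^';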

Copy

You can invoke Azure Data Factory through either the Portal or PowerShell. Here, the source is the AWS S3 container and the destination is the Azure Blob Storage container. Below is the code that can be used to invoke the pipeline via PowerShell. You will need to use a service principal when invoking Azure Data Factory from PowerShell.

    $parameter = @{
        "source"      = "<S3 Source Container>"
        "destination" = "<Blob Storage Container>"
    }

    # Service principal credentials (placeholders).
    $password   = "<Password>" | ConvertTo-SecureString -AsPlainText -Force
    $username   = "<User Name>"
    $credential = New-Object System.Management.Automation.PSCredential($username, $password)

    Connect-AzAccount -Credential $credential -ServicePrincipal -Tenant "<Tenant ID>"

    # Start the pipeline run and capture its run ID.
    $id = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "<Resource Group Name>" `
              -DataFactoryName "<Data Factory Name>" -PipelineName "<Pipeline Name>" `
              -Parameter $parameter

    # Poll the run status every minute until the pipeline finishes.
    while ($true) {
        $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<Resource Group Name>" `
                   -DataFactoryName "<Data Factory Name>" -PipelineRunId $id
        $run.Status
        if ($run.Status -eq 'Succeeded' -or $run.Status -eq 'Failed') {
            break
        }
        Start-Sleep -Seconds 60
    }

Load

Once the files are copied to Azure Blob Storage, create an external table and load the data into the destination table. You can also load rejected rows into a separate location in blob storage. [datasource] is an external data source created for the blob storage, and [fileformat] is a file format with the '^' delimiter. The TRUNCATE command can be used when you are performing a full data load of the table; for a partial or batch-wise load, skip this command.

IF OBJECT_ID('[ExtTable].[Table1]') IS NOT NULL
    DROP EXTERNAL TABLE [ExtTable].[Table1];

CREATE EXTERNAL TABLE [ExtTable].[Table1] (
    [Col1] bigint,
    [Col2] varchar(100)
)
WITH (
    LOCATION = '<blob storage>',
    DATA_SOURCE = [datasource],
    FILE_FORMAT = [fileformat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);

TRUNCATE TABLE [SchemaName].[Table1];

INSERT INTO [SchemaName].[Table1]
SELECT [Col1], [Col2]
FROM [ExtTable].[Table1];
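
Since the Load step is meant to run via a stored procedure, the truncate-and-insert portion can be wrapped in one. A minimal sketch, assuming a hypothetical procedure name usp_Load_Table1 and the external table created above:

CREATE PROCEDURE [dbo].[usp_Load_Table1]
AS
BEGIN
    -- Full load: clear the destination, then insert from the external table.
    TRUNCATE TABLE [SchemaName].[Table1];

    INSERT INTO [SchemaName].[Table1]
    SELECT [Col1], [Col2]
    FROM [ExtTable].[Table1];
END;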

You can parameterize the complete process in PowerShell to automate the migration of multiple tables at a larger scale. If inserts fail for some rows, add a rejected-row location and increase REJECT_VALUE to troubleshoot individual tables and find the root cause.
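
As a rough sketch of that parameterization, a loop over table names could call one load stored procedure per table; Invoke-Sqlcmd comes from the SqlServer PowerShell module, and the server, database, credential, and procedure names below are hypothetical placeholders:

    # Hypothetical list of tables to migrate; each has a matching load procedure.
    $tables = @('Table1', 'Table2', 'Table3')

    foreach ($table in $tables) {
        Invoke-Sqlcmd -ServerInstance "<Server Name>.database.windows.net" `
                      -Database "<Dedicated SQL Pool>" `
                      -Username "<SQL User>" -Password "<SQL Password>" `
                      -Query "EXEC [dbo].[usp_Load_$table];" `
                      -QueryTimeout 0
    }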

So go ahead, explore, and migrate your data to Azure Synapse Analytics with the above approach, and do share your feedback.



