Database Migration from Microsoft Azure to Snowflake on AWS: Part 1

Advances in technology have increased the demand for improved infrastructure with rapid deployments. Public cloud providers are constantly updating their policies and infrastructure to meet ever-increasing business demands. This competition gives enterprises the opportunity to choose the cloud provider(s) that best meets their specific governance and cost-efficiency needs.

In this blog, we discuss migrating databases from SQL Server on Azure VM to Snowflake on AWS.

Understand the problem

Microsoft Azure is Microsoft's public cloud computing platform, run out of Microsoft-managed data centers, while Snowflake is a cloud-based data warehousing solution delivered as software as a service (SaaS) on top of several public cloud providers.

In this scenario, we needed to migrate SQL Server databases to Snowflake. The migrated data also had to land in Snowflake organized into one schema per source database, with the data itself arriving complete and accurate.

While Azure supports ingesting data from various sources and clouds, it does not support writing output directly to other cloud providers. Finding a viable way to move the exported data to AWS-hosted Snowflake was the first challenge.

Our approach to migration

According to the Snowflake documentation, Snowflake has a Stage feature that could solve the above problem. A stage is essentially a path where the data files to be ingested are stored, similar in concept to an SMB/Samba mount. Creating a Snowflake stage on top of Azure Blob storage allowed Snowflake to read the flat files and ingest the data. The next step, then, was to move the data from SQL Server to Azure Blob storage.

The migration process

The migration process included the following steps:

  • Replicate the database schema in Snowflake based on the source SQL Server databases on Azure
  • Configure an Azure Data Factory pipeline to create Parquet-format flat files on Blob storage
    • Use Parquet files for data compression and fast data loading into Snowflake
  • Create a file format in Snowflake
    • create or replace file format <format_name> type = 'parquet';
  • Create a stage in Snowflake (see the consolidated sketch after this list)
    • create or replace stage <stage_name>
    • url = '<azure_blob_container_url>'
    • credentials = (azure_sas_token = '<sas_token>')
    • file_format = <format_name>;
  • List the stage to check that the files have been staged
  • Finally, load the data into the Snowflake table from the stage
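
To make those fragments concrete, the following is a minimal consolidated sketch of the Snowflake side; the file format name, stage name, container URL, and SAS token are hypothetical placeholders, not values from the actual project.

-- 1. File format for the Parquet files exported by Data Factory
create or replace file format my_parquet_format type = 'parquet';

-- 2. External stage pointing at the Azure Blob storage container (placeholder URL and token)
create or replace stage azure_blob_stage
  url = 'azure://<storage_account>.blob.core.windows.net/<container>'
  credentials = (azure_sas_token = '<sas_token>')
  file_format = my_parquet_format;

-- 3. Confirm that the exported files are visible to Snowflake
list @azure_blob_stage;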

(Note that Snowflake exposes all of the Parquet data through a single column, referenced as $1.)

copy into TEST1
from (
    select
        $1:ClientID::varchar,
        $1:NameStyle::varchar,
        $1:Title::varchar,
        $1:FirstName::varchar,
        $1:SecondName::varchar,
        $1:Name::varchar,
        $1:Suffix::varchar,
        $1:CompanyName::varchar,
        $1:SalesPerson::varchar,
        $1:EmailAddress::varchar,
        $1:Phone::varchar,
        $1:PasswordHash::varchar,
        $1:PasswordSalt::varchar,
        $1:rowguid::varchar,
        $1:DateModified::datetime
    from @<stage_name>
);
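
As a quick usage check once the COPY completes, the row count of the loaded table can be compared against the count of the corresponding source table in SQL Server:

-- Row count in Snowflake, to compare with the source table in SQL Server
select count(*) from TEST1;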

Let’s break down the details of each step of the process.

1. Leverage Azure Data Factory

Azure Data Factory is a GUI-based tool that provides an end-to-end ETL solution and a step-by-step guide for building the pipeline. It exposes the source (SQL Server), the target (Blob storage), and the parameters needed to tune the pipeline's performance. That made it a good fit for the custom needs of this migration project, which required shaping the data before exporting it to Blob storage. Data Factory handled this transparently, as covered in detail in the last section of this blog.

2. Data Factory Pipeline Performance Tuning

The tricky part of this migration was that even though the pipeline was easy to build, it was difficult to exploit its full potential and achieve optimal performance.

There were terabytes of data to export from SQL Server to Blob storage, which would have taken weeks to transfer without tuning. After a proof of concept (POC), we found that Data Factory supports dynamic range partitioning when reading data from the source.

Say there is an 800 GB table, XYZ. Following the approach described above, Data Factory has to move this large volume of data into Blob storage. With the traditional method, the GUI writes the data to the Blob serially by default, which is slow.

If the XYZ table has a “date” column, however, the 800 GB of data can be partitioned into smaller sets by month or year. Each partition then has no direct dependency on the other date partitions and can be written in parallel, which is faster and more resource-efficient.

This is achieved with the dynamic range filter, which can only be applied by writing a select query for the source rather than ticking the checkbox for an existing table; a sketch of such a query follows.
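
As a rough illustration only, the query below shows what one monthly slice of the hypothetical XYZ table could look like; the table and column names are assumptions. Data Factory's dynamic range option injects the partition bounds at runtime (its documentation describes a ?AdfDynamicRangePartitionCondition placeholder for the WHERE clause), so the literal dates here stand in for those generated bounds.

-- One monthly slice of the hypothetical XYZ table (illustrative names and dates)
select *
from dbo.XYZ
where DateModified >= '2020-01-01'
  and DateModified <  '2020-02-01';

-- With the dynamic range option, Data Factory substitutes the bounds itself:
-- select * from dbo.XYZ where ?AdfDynamicRangePartitionCondition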

3. Using the parquet file

The exported data needed to be stored in flat files while maintaining integrity and compression. CSV was the first choice, but during the POC there were many challenges when writing the files: stray spaces and newline characters corrupted the data. Data Factory also offers the Parquet file format, which has a high compression ratio (around 75%) and maintains data integrity. Parquet is optimized for complex bulk data and was therefore well suited to this project; in our case, for example, 40 GB of data compressed down to 11 GB.
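
Before running the full COPY, it can be useful to sanity-check the staged Parquet files directly from Snowflake; in the sketch below, the stage name, file name, and file format name are hypothetical placeholders.

-- List the staged Parquet files and their compressed sizes
list @<stage_name>;

-- Peek at a few rows of one file without loading it
select $1
from @<stage_name>/customer_0001.parquet (file_format => '<format_name>')
limit 10;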

4. Integration Execution

Running the Data Factory pipeline required additional computing power, which could be provided in the following ways:

  • Auto-Resolve Integration Runtime

As the name suggests, compute resources are provisioned and managed in Microsoft data centers, and cost is incurred on a per-unit basis. The region for these resources is decided automatically based on availability. This runtime is selected by default when running a Data Factory pipeline.

  • Self-hosted integration runtime

This runtime uses resources that already exist. For the self-hosted IR, we downloaded a client program onto our own machine, registered it as a service, and coupled it with the Data Factory so the pipeline could use that machine's resources.

5. Configure the self-hosted integration runtime

This was the best option available because the SQL Server was already hosted on a standalone Azure VM, which provided the freedom to use the full capacity of the resources attached to it. It included the following steps:

1. Configuring Self-Hosted IR

  • For Azure Data Factory to work with the Azure VM, a self-hosted integration runtime had to be configured
    • In Azure Data Factory, select “Manage”, then “Integration runtimes”
    • Select “+ New”, then “Azure, Self-Hosted”
    • Then select “Network environment -> Self-Hosted”
    • Next, give the self-hosted IR a suitable name
    • Once the IR is created, the authentication keys are presented; copy these keys
  • The last screen provides a link to download the Microsoft Integration Runtime
      • Download and install the integration runtime from the Microsoft link
      • Once installed, enter the Auth Key1 value and select “Launch Configuration Manager”
      • Once registered, the self-hosted IR is bound to the Data Factory
      • Finally, install the 64-bit Java Runtime, which the self-hosted IR needs in order to write Parquet files; refer to Microsoft’s documentation for details

2. Create a linked service in Data Factory

  • Switch to the Data Factory
    • Select “Manage”, then “Linked services”
    • Select SQL Server as the service type and give it an appropriate name
    • Under “Connect via integration runtime”, select the self-hosted IR created earlier
    • Enter the server name. It is important that the server name is the same one already used to connect to the SQL Server successfully
    • Enter the credentials and test the connection

In the next blog, we’ll look at some of the challenges faced during the migration, cost reduction actions, and our approach to data validation.

Maria H. Underwood