Simplify the Spark-to-Snowflake Transition

In the age of AI, enterprises are increasingly looking to extract value from their data at scale but often find it difficult to establish a scalable data engineering foundation that can process the large amounts of data required to build or improve models. Designed for processing large data sets, Spark has been a popular solution, yet it is one that can be challenging to manage, especially for users who are new to big data processing or distributed systems. 

To empower organizations to build the secure and scalable data foundation required for AI, but without the operational complexity, Snowflake launched Snowpark. With familiar DataFrame-style programming and custom code execution, Snowpark lets teams process their data in Snowflake using Python and other programming languages, while Snowflake automatically handles scaling and performance tuning. Snowflake customers see an average of 4.6x faster performance and 35% cost savings with Snowpark over managed Spark.
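
To give a flavor of that programming model, here is a minimal Snowpark for Python sketch; the connection values, table and column names are placeholders, not part of the tool or this announcement:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

# Placeholder connection details -- substitute your own account values.
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# DataFrame operations are translated to SQL and pushed down to Snowflake,
# so the heavy lifting runs on Snowflake's elastic compute.
orders = session.table("ORDERS")  # hypothetical table
shipped_by_region = (
    orders.filter(col("STATUS") == "SHIPPED")
          .group_by("REGION")
          .agg(avg(col("AMOUNT")).alias("AVG_AMOUNT"))
)
shipped_by_region.show()
```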

To help teams with existing Spark codebases get up and running with Snowpark faster, we are excited to launch the Snowpark Migration Accelerator, a free, self-service code assessment and conversion tool from Snowflake. The tool serves two primary functions: assessment and conversion.

The assessment scans any codebase written in Python or Scala and outputs a readiness score for conversion to Snowpark. From that assessment, the accelerator can automatically convert references from the Spark API to the Snowpark API.

“The Snowpark Migration Accelerator really helped us make the decision on whether to move to Snowflake. It provided us insights as to code compatibility and allowed us to better estimate our migration time.” —Alan Feuerlein, CTO of Travelpass

How does it work?

The Snowpark Migration Accelerator builds an internal model representing the functionality present in the codebase. This model is an Abstract Syntax Tree (AST) that is not dependent on a specific source language. As a result, the tool can take in both code files and notebooks with multiple languages (such as Scala, Python and SQL) at the same time. No source data is ever analyzed by the tool (code is the only input), and it does not connect to any source platform. 
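
To illustrate the general idea (this is not the tool's actual implementation, which is language-independent and far more complete), a first approximation of inventorying Spark references in Python code can be sketched with the standard library's ast module:

```python
import ast
from collections import Counter

def inventory_spark_refs(source: str) -> Counter:
    """Count pyspark import references in Python source code."""
    tree = ast.parse(source)
    refs = Counter()
    for node in ast.walk(tree):
        # Catch `import pyspark.sql` style imports.
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.startswith("pyspark"):
                    refs[alias.name] += 1
        # Catch `from pyspark.sql.functions import col` style imports.
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.startswith("pyspark"):
                for alias in node.names:
                    refs[f"{node.module}.{alias.name}"] += 1
    return refs

sample = (
    "from pyspark.sql import SparkSession\n"
    "from pyspark.sql.functions import col\n"
)
print(inventory_spark_refs(sample))
# Counter({'pyspark.sql.SparkSession': 1, 'pyspark.sql.functions.col': 1})
```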

Step 1: Assessment

Once the model is built, the assessment generates a series of reports designed to explain what is present in the source code. Some of these are high-level summaries that report on how “ready” a codebase is for Snowpark. Others are complete inventories showing where each reference to a given API, SQL statement or internal dependency can be found. 

From these inventories, the Snowpark Migration Accelerator will identify exactly what can be converted.
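
Conceptually, a readiness score can be thought of as the share of identified Spark API references that have a known Snowpark equivalent; this simplified formula is an illustration, not the tool's exact calculation:

```python
def readiness_score(supported_refs: int, total_refs: int) -> float:
    """Fraction of Spark API references with a known Snowpark mapping."""
    return supported_refs / total_refs if total_refs else 0.0

# e.g., if 188 of 200 identified Spark API references map to Snowpark:
print(f"{readiness_score(188, 200):.0%}")  # 94%
```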

Step 2: Conversion

For each element of the Spark API that is identified, the conversion engine in the Snowpark Migration Accelerator will usually perform one of the following:

  • Identify what can be directly mapped to Snowpark and update the import call.
  • Attempt to replicate the functionality present in the source code into Snowflake with Snowpark, when there is no direct mapping. 
  • Report the element in an issue inventory and write a comment in the output code.

Note that the conversion capability of the tool is not a silver bullet. It does not execute a complete migration; rather, it identifies and converts what it can into functionally equivalent output that is compatible with Snowflake.
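
To make the direct-mapping case concrete, here is a hedged before/after sketch of the kind of change involved; the table name is a placeholder, and the exact output for any codebase depends on the APIs it uses:

```python
# Before: PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("etl").getOrCreate()
high_value = spark.table("ORDERS").filter(col("AMOUNT") > 100)

# After: Snowpark -- the imports and session creation change, while most
# DataFrame calls carry over because the two APIs are intentionally similar.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# (connection_parameters as defined in the earlier sketch)
session = Session.builder.configs(connection_parameters).create()
high_value = session.table("ORDERS").filter(col("AMOUNT") > 100)
```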

Get the most out of the Snowpark Migration Accelerator

The tool has been designed to optimize data processing pipelines, specifically those built on Spark and Hive codebases, including:

  1. ETL: The accelerator is particularly useful for ETL processes that extract data from various sources; clean, filter, join and aggregate it; and load the transformed data into a target system.
  2. Batch Processing Pipelines: Pipelines that process large volumes of data on a schedule also convert well. This is ideal for tasks such as data aggregation, reporting or batch predictions.
  3. Ingestion Pipelines: Pipelines that ingest data from cloud storage and handle a variety of file formats are another strong fit.
  4. Feature Engineering: Creating and deriving features from raw data to enhance model performance in machine learning tasks is another area where the Snowpark Migration Accelerator excels (see the sketch after this list).
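
As one example of the last category, a small feature-engineering step in Snowpark might look like the following hedged sketch (the table and column names are hypothetical, and it reuses the session from the earlier sketch):

```python
from snowflake.snowpark.functions import col, datediff

# Derive a tenure feature from two date columns of a hypothetical table.
customers = session.table("CUSTOMERS")
features = customers.with_column(
    "TENURE_DAYS", datediff("day", col("SIGNUP_DATE"), col("LAST_ORDER_DATE"))
)
features.show()
```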

While the Snowpark Migration Accelerator has been purpose-built to accelerate migrations from Spark and Hive codebases, Snowpark supports Python code in general. Although no acceleration is provided for non-Spark code, the tool can still offer valuable information about code size, the technologies in use, supported libraries and pandas usage. This data can be used to guide orchestration within Snowflake and to optimize performance and efficiency across various workloads.

Here are a few helpful ideas to keep in mind when using the Accelerator:

  • Know your purpose. The tool can be used on any codebase with Python or Scala code, but to get the most out of it, there should be references to the Spark API. If it’s all pure Python or Scala, then the assessment may still be useful (all references to any API will be cataloged), but the conversion will not do much and you may not get a readiness score.
  • The input matters. If the accelerator cannot read the source code files or if they do not run in the source language, it will not be able to properly analyze the files. Pay attention to the source codebase being run through the tool. Ensure that files are identified with the correct extension and that they actually work in the source. Check the documentation on what is required before running the Snowpark Migration Accelerator.
  • Understand the readiness scores. The readiness scores indicate what can be converted and should not be interpreted as an effort estimate. 
  • Keep size in mind. For the assessment, it’s recommended to run your entire codebase through the tool to get a complete picture of your migration. If there are a substantial number of extraneous code files, or the total size (in bytes) of the files being scanned is extremely large, the Snowpark Migration Accelerator could run out of internal resources on the machine processing the code. For conversion, if you’re just getting started, start small. 
  • Deal with issues programmatically. When working with the output code from conversion, resolve each issue pattern programmatically across your entire codebase before moving on to the next one, rather than fixing files one at a time (see the sketch after this list).
  • No silver bullets here. While the tool has seen an average automation rate greater than 95%, no conversion is ever 100% automated. There will be some measure of manual work to be done.
  • Reach out! If you run into complications, take advantage of the “report an issue” capability built into the tool and don’t hesitate to post in the Snowflake Community. If you have found a use case for the output reports/inventories generated by the tool, share it!
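
As an example of the programmatic approach recommended above, the sketch below groups converter-issue comments across output files so you can fix one issue class everywhere before moving to the next; the "SMA-ISSUE" marker is hypothetical, so adjust the pattern to match the comments the tool actually writes:

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical marker: assume unconverted elements are flagged with a
# comment like "# SMA-ISSUE <code>: <message>" in the converted output.
ISSUE = re.compile(r"#\s*SMA-ISSUE\s+(\S+)")

def group_issues(root: str) -> dict[str, list[str]]:
    """Map each issue code to the file:line locations where it appears."""
    issues = defaultdict(list)
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            match = ISSUE.search(line)
            if match:
                issues[match.group(1)].append(f"{path}:{lineno}")
    return issues

for code, locations in sorted(group_issues("converted_output").items()):
    print(code, len(locations), "occurrences")
```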

Following these best practices will help you get the most out of your experience with the Snowpark Migration Accelerator. Try it today to see how smooth the on-ramp to Snowpark can be.

Get started 

The Snowpark Migration Accelerator is available now for free; just download the installer onto your local machine or container. You can find more information in the following resources: