Run pandas on 1TB+ Enterprise Data Directly in Snowflake
Our benchmark studies have shown that pandas on Snowflake scales to more than a terabyte of data, whereas the standard pandas library runs out of memory on data sets smaller than 100GB. On average across representative workloads, we find that pandas on Snowflake performs around 6x faster at the 1GB scale and around 30x faster at the 10GB scale than vanilla pandas in memory.
Minimal tuning or rewriting required
With the introduction of pandas on Snowflake, users can work with their familiar pandas API and semantics. This feature enables developers to run pandas directly on their data in Snowflake, while queries are translated to SQL to run natively in Snowflake.
pandas on Snowflake is part of the Snowpark Python library, which enables scalable processing of Python code within the Snowflake platform. By changing just a few import statements, developers get the same pandas experience they know and love with the scalability and security benefits of Snowflake. As a result, migrations to Snowflake are easy, and data teams avoid the time and expense of rewriting their pandas pipelines for other big data frameworks or provisioning expensive high-memory machines.
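To make the import change concrete, here is a minimal sketch of a small pandas pipeline. It uses the standard pandas import so it runs on any machine. The comments show the import lines that, as we understand the Snowflake documentation, replace it for pandas on Snowflake. The `orders` data and the `revenue_by_region` helper are hypothetical and exist only for illustration. Running on Snowflake also requires an active Snowpark session, which is not shown.

```python
# Local pandas import so this sketch runs anywhere. For pandas on Snowflake,
# the documented replacement (assumed here, not executed) is:
#   import modin.pandas as pd
#   import snowflake.snowpark.modin.plugin
# together with a Snowpark session created beforehand.
import pandas as pd

# Hypothetical sample data; on Snowflake this would typically come from a table.
orders = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west", "north"],
        "amount": [120.0, 80.0, 30.0, 45.0, 60.0],
    }
)


def revenue_by_region(df):
    """Return total order amount per region, largest total first."""
    return (
        df.groupby("region", as_index=False)["amount"]
        .sum()
        .sort_values("amount", ascending=False)
        .reset_index(drop=True)
    )


summary = revenue_by_region(orders)
print(summary)
```

The pipeline body itself uses only ordinary pandas calls (`groupby`, `sum`, `sort_values`), which is the point of the design: the import lines change and the rest of the code is intended to stay as written.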
Secure access within Snowflake removes sensitive data risks on local machines
The in-memory design of pandas has created problems for organizations, notably the security and governance concerns that result from pulling enterprise data to laptops to process with pandas. Because pandas on Snowflake is part of the Snowpark Python library, compute is pushed down to Snowflake, so data stays within Snowflake’s secure, governed perimeter.
Built on the Modin open source project
At Snowflake, we are committed to meeting developers where they are by integrating open source tools and standards with the powerful capabilities of the Snowflake AI Data Cloud. pandas on Snowflake is built on the Modin open source project. Modin is a distributed pandas library that joined the family of open source projects at Snowflake through an acquisition in October 2023. Modin is used by hundreds of thousands of data scientists and developers to seamlessly scale their pandas workflows. Snowflake actively contributes to and supports both the open source project and its vibrant community.