Build and Manage ML Features for Production-Grade Pipelines

Posted on October 9, 2024 by cloudmatrix.website

When scaling data science and ML workloads, organizations frequently encounter challenges in building large, robust production ML pipelines. Common issues include redundant efforts between development and production teams, as well as inconsistencies between the features used in training and those in the serving stack, which can lead to decreased performance. Many teams turn to feature stores to create a centralized repository that maintains a consistent and up-to-date set of ML features. However, this often introduces the complexity of managing additional infrastructure for feature authoring, building and maintaining update pipelines, and establishing workflows to access consistent and fresh features. As a result, teams often end up spending more time than expected on makeshift or customized solutions.

Today we are announcing the general availability of the Snowflake Feature Store. This native solution lives on the same platform as your end-to-end workflows in Snowflake ML, with seamless integration to your data, features and models, so that large-scale ML pipelines can be productionized easily and efficiently. The Feature Store helps you eliminate redundancy and duplication of pipelines, ensuring that you have updated, consistent and accurate features available with enterprise-grade security and governance.

Key capabilities of Snowflake Feature Store are:

Easy authoring of common feature transformations in Python or SQL
Automated and efficient feature refresh on new data from both batch and streaming sources
Simple API for retrieving time-consistent features using ASOF JOIN and generating training datasets
Fine-grained role-based access control (RBAC) and governance
Support for user-maintained feature pipelines in tools such as dbt
Full integration with Model Registry and other Snowflake ML capabilities
Centralized view of features and entities from the Snowsight UI for easy search and discoverability
Built-in end-to-end ML Lineage (preview feature)

Snowflake Feature Store is fully integrated with Snowflake Model Registry and other Snowflake ML capabilities to enable a complete end-to-end ML development and operations solution in Snowflake. A high-level schematic of this workflow is shown below:

Customers productionize MLOps with Snowflake Feature Store

Many customers are already using Feature Store in their ML workflows across various industries and use cases.

Scene+ is a Canadian loyalty program that uses Snowflake Feature Store on large data sets, with notable performance improvements over its previous solution.

In the retail industry, Feature Store is also being implemented by our partner Kubrick to productionize models that improve customer experience.

We’re also seeing Feature Store used in gen AI use cases. Stride, a leader in remote, online, and in-person learning, partnered with phData to implement Snowflake Feature Store in a RAG app that provides accurate, safe assistance to students and teachers.

Using Snowflake Feature Store

A simplified ML workflow powered by Feature Store is depicted below:

Let’s look at the main components of this workflow.

Creating Feature Stores

You can easily create a Feature Store, or connect to an existing one, by providing a Snowpark session, database name, schema name and default warehouse. Under the hood, a Feature Store is simply a schema in Snowflake.
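A minimal sketch of that step, assuming the `snowflake-ml-python` package and an already-open Snowpark session. The database, schema and warehouse names are placeholders, and `feature_store_kwargs` and `open_feature_store` are illustrative helpers rather than part of the library. The `CREATE_IF_NOT_EXIST` mode is used here so that the same call either creates the store or connects to an existing one.

```python
from typing import Any, Dict


def feature_store_kwargs(session: Any, database: str, schema: str,
                         warehouse: str) -> Dict[str, Any]:
    """Map the four inputs named in the text onto constructor keyword arguments."""
    return {
        "session": session,
        "database": database,
        "name": schema,  # the Feature Store's name is the schema that backs it
        "default_warehouse": warehouse,
    }


def open_feature_store(session: Any, database: str, schema: str, warehouse: str) -> Any:
    """Create the Feature Store if it is missing, otherwise connect to it."""
    # Imported lazily so the helper above can be used without the package installed.
    from snowflake.ml.feature_store import CreationMode, FeatureStore

    return FeatureStore(
        creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
        **feature_store_kwargs(session, database, schema, warehouse),
    )


# Usage, given an active Snowpark session (all names are placeholders):
# fs = open_feature_store(session, "ML_DB", "CUSTOMER_FEATURES", "ML_WH")
```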

Creating Feature Views

Feature Views are the primary abstraction in a Feature Store. They consist of a collection of logically related features that are computed and maintained on the same schedule. In the Snowflake Feature Store, Feature Views can be created from any source data (e.g., tables, views, shares) by using Snowpark DataFrames or SQL transformations. Columns in the source tables or views, or in the transformation DataFrame, are recognized as features. Additionally, every Feature View must reference an Entity, which holds the join keys used for feature lookup at training or inference time, and can optionally name a timestamp column that captures how feature values change over time.

Define an Entity:
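As an illustration, an entity for customer-level features might be defined as follows. This is a sketch that assumes the `snowflake-ml-python` package; the entity name, join key and description are placeholders, and `register_customer_entity` is an illustrative helper.

```python
from typing import Any


def register_customer_entity(fs: Any) -> Any:
    """Define an entity keyed on CUSTOMER_ID and register it with the Feature Store."""
    # Imported lazily so this file can be loaded without the package installed.
    from snowflake.ml.feature_store import Entity

    entity = Entity(
        name="CUSTOMER",              # placeholder entity name
        join_keys=["CUSTOMER_ID"],    # column(s) used to look features up
        desc="One row per customer",
    )
    return fs.register_entity(entity)
```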

Define a Feature View:

feature_df is a Snowpark DataFrame object containing your feature definition. Snowpark provides helper functions that make it easy to define many common feature transformations. For example, they can express 3-month and 6-month aggregations of customer order sum and count over a 1-day sliding window.
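To make the shape of such a feature concrete, here is a plain-Python illustration of what a trailing-window aggregation computes. It does not use the Snowpark helpers; the function name, the sample orders and the approximation of 3 and 6 months as 90 and 180 days are all assumptions made for this sketch.

```python
from datetime import date, timedelta

# Assumption: months are approximated as 90 and 180 days to keep the sketch short.
WINDOWS_IN_DAYS = {"3m": 90, "6m": 180}


def trailing_order_features(orders, as_of_days, windows=WINDOWS_IN_DAYS):
    """Return one feature row per (customer, as-of day).

    orders is a list of (customer_id, order_date, amount) tuples. For each
    as-of day, every window covers the interval (day - length, day].
    """
    rows = []
    for customer in sorted({cust for cust, _, _ in orders}):
        for day in as_of_days:
            row = {"customer_id": customer, "ts": day}
            for label, length in windows.items():
                start = day - timedelta(days=length)
                amounts = [amt for cust, when, amt in orders
                           if cust == customer and start < when <= day]
                row[f"order_sum_{label}"] = sum(amounts)
                row[f"order_cnt_{label}"] = len(amounts)
            rows.append(row)
    return rows


orders = [
    ("c1", date(2024, 1, 10), 20.0),
    ("c1", date(2024, 5, 1), 30.0),
    ("c2", date(2024, 6, 15), 5.0),
]
# A 1-day slide: one feature row per customer per consecutive day.
as_of_days = [date(2024, 7, 1) + timedelta(days=i) for i in range(2)]
features = trailing_order_features(orders, as_of_days)
```

Each output row carries the customer key and a timestamp, which is the shape a Feature View needs for point-in-time lookups.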

timestamp_col is the name of a timestamp column that is used to join with a table containing the required entity keys for training to retrieve point-in-time correct feature values.
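Putting these pieces together, a Feature View could be defined and registered roughly as follows. The sketch assumes the `snowflake-ml-python` package, an already-registered entity and a Snowpark DataFrame of features; the view name, the `ORDER_TS` column, the version string and the `register_order_features` helper are all hypothetical. The optional `refresh_freq` argument is discussed in the paragraphs that follow.

```python
from typing import Any, Optional


def register_order_features(fs: Any, entity: Any, feature_df: Any,
                            refresh_freq: Optional[str] = "1 day") -> Any:
    """Wrap feature_df in a Feature View and register it as version "1"."""
    # Imported lazily so this file can be loaded without the package installed.
    from snowflake.ml.feature_store import FeatureView

    feature_view = FeatureView(
        name="CUSTOMER_ORDER_FEATURES",  # hypothetical view name
        entities=[entity],               # supplies the join keys
        feature_df=feature_df,           # Snowpark DataFrame defining the features
        timestamp_col="ORDER_TS",        # hypothetical timestamp column in feature_df
        refresh_freq=refresh_freq,       # pass None for a user-maintained view
        desc="Trailing order sums and counts per customer",
    )
    return fs.register_feature_view(feature_view=feature_view, version="1")
```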

A key benefit of Snowflake Feature Store is its use of Dynamic Tables to automate feature engineering pipelines and backfills and to hide the complexity of managing them. In many feature store solutions, the user is responsible for writing all the data and feature engineering logic that performs the initial population and each subsequent update of feature values. These steps then need to be scheduled and managed manually outside of the feature store.

In a Snowflake-managed Feature View, all of this is handled declaratively. You define the logic that computes features across all history, using a DataFrame or SQL, and Snowflake handles the incrementalization of that declarative logic. To use a managed Feature View, simply specify refresh_freq, which defines how often the features are refreshed and therefore how up to date they are relative to their source tables. Snowflake-managed Feature Views can be monitored from the Snowsight UI via the new Feature Store support.

While in most cases you will want to use such managed Feature Views, there may be scenarios where you want to use feature pipelines that you maintain yourself and run with external tools. In this case, create the Feature View without specifying refresh_freq. This creates a user-maintained Feature View that is computed at retrieval time.

Generating training data

A key purpose of feature stores is to simplify generation of consistent training data sets. Feature Store provides APIs to generate training data in two formats depending on your workflow. In either case, Snowflake Feature Store retrieves point-in-time correct values by using the timestamp column and ASOF JOIN to join features from multiple views efficiently and at scale, yielding time-consistent results.
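The idea behind a point-in-time correct lookup can be sketched in plain Python. This is only an illustration of the as-of semantics, not how Snowflake implements ASOF JOIN; the `asof_join` function and the sample rows are made up for the example. For every spine row, it picks the most recent feature value whose timestamp is not later than the spine timestamp, so no information from the future leaks into training data.

```python
from bisect import bisect_right
from collections import defaultdict


def asof_join(spine, feature_rows):
    """Attach to each (key, ts) spine row the latest feature value at or before ts.

    feature_rows is a list of (key, ts, value) tuples. Spine rows with no
    earlier feature value get None.
    """
    history = defaultdict(list)
    for key, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        history[key].append((ts, value))

    joined = []
    for key, ts in spine:
        rows = history.get(key, [])
        # Index of the first feature timestamp strictly greater than ts.
        idx = bisect_right([row_ts for row_ts, _ in rows], ts)
        joined.append((key, ts, rows[idx - 1][1] if idx else None))
    return joined


# ISO date strings compare correctly as plain strings.
feature_rows = [
    ("c1", "2024-01-01", 10),
    ("c1", "2024-03-01", 25),
    ("c2", "2024-02-15", 7),
]
spine = [("c1", "2024-02-10"), ("c1", "2024-03-01"), ("c2", "2024-01-01")]
training_rows = asof_join(spine, feature_rows)
```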

Snowflake Dataset is a new schema-level object specially designed for machine learning workflows. Snowflake Datasets hold collections of data organized into versions, where each version holds a materialized snapshot of your data with guaranteed immutability, efficient data access and interoperability with popular deep learning frameworks, such as PyTorch and TensorFlow. Datasets can be conveniently created from Feature Store as shown below:
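A sketch of that call, assuming the `generate_dataset` method of the `snowflake-ml-python` FeatureStore class. The spine DataFrame holds the entity keys, timestamps and labels to train on; the dataset name, version, `EVENT_TS` and `CHURNED` column names and the `build_versioned_dataset` helper are hypothetical.

```python
from typing import Any, Sequence


def build_versioned_dataset(fs: Any, spine_df: Any, feature_views: Sequence[Any]) -> Any:
    """Materialize an immutable, versioned Dataset from a spine plus Feature Views."""
    return fs.generate_dataset(
        name="CUSTOMER_CHURN_TRAINING",   # hypothetical dataset name
        spine_df=spine_df,                # entity keys, timestamps and labels
        features=list(feature_views),     # Feature Views to join onto the spine
        version="v1",
        spine_timestamp_col="EVENT_TS",   # enables the point-in-time lookup
        spine_label_cols=["CHURNED"],     # hypothetical label column
    )
```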

Training data can also be created as a Snowpark DataFrame for training with classic ML libraries, such as scikit-learn or Snowpark ML, or to load into external machine learning frameworks:
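A sketch of the DataFrame path, assuming the `generate_training_set` method of the `snowflake-ml-python` FeatureStore class. As before, the `EVENT_TS` and `CHURNED` column names and the `build_training_frame` helper are hypothetical.

```python
from typing import Any, Sequence


def build_training_frame(fs: Any, spine_df: Any, feature_views: Sequence[Any]) -> Any:
    """Return a Snowpark DataFrame of spine rows joined with point-in-time features."""
    return fs.generate_training_set(
        spine_df=spine_df,
        features=list(feature_views),
        spine_timestamp_col="EVENT_TS",   # hypothetical timestamp column on the spine
        spine_label_cols=["CHURNED"],     # hypothetical label column
    )
```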

Similarly, Feature Store supports retrieving feature data directly for model inference using retrieve_feature_values, enabling production-ready incremental batch inference pipelines to be authored and scheduled easily.
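For inference the spine carries only entity keys and timestamps, with no labels. A minimal sketch, assuming the same FeatureStore object as above; the `EVENT_TS` column and the `fetch_inference_features` helper are hypothetical.

```python
from typing import Any, Sequence


def fetch_inference_features(fs: Any, spine_df: Any, feature_views: Sequence[Any]) -> Any:
    """Look up current feature values for the entities listed in spine_df."""
    return fs.retrieve_feature_values(
        spine_df=spine_df,                # entity keys (and timestamps) to score
        features=list(feature_views),
        spine_timestamp_col="EVENT_TS",   # hypothetical; omit if the spine has no timestamp
    )
```

Scheduling this function over each new batch of entity keys is one way to build the incremental batch inference pipeline the text describes.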

Discovering and exploring features

Feature Store is available within the Snowsight UI and can be used to conveniently browse, search and manage feature views and their versions, underlying entities, individual feature columns and associated feature metadata.

Governance

Snowflake Feature Store uses standard database objects, like schemas, dynamic tables and views. Snowflake object tagging is used to denote these database objects as belonging to a Feature Store, and to maintain the relationships between them. Standard Snowflake RBAC is used to control access to the Feature Store and the objects within. In a typical Feature Store implementation, two roles are commonly defined: Producers and Consumers. Producers can create and modify Feature Views. Consumers can read Feature Views. Refer to this page for more details about privileges of each role. We also provide a simple utility API and a SQL script to easily configure these roles. Feature publishers can also share features within and across accounts using Snowflake Data Sharing.

Snowflake ML includes built-in ML Lineage capabilities (in preview) with integration to Feature Store, which allows you to visualize the lineage of all ML artifacts in your pipeline, such as source data tables, feature views, data sets and ML models, along with all the data lineage and governance that Snowflake Horizon Catalog provides.

Getting started

The Snowflake Feature Store is generally available for all Enterprise Edition (or higher) customers, and you can get started today with an introductory quickstart. For additional details, more end-to-end examples, and API reference, visit our documentation.