Feature Store as a Service (FSaaS) with Data Virtualization

A guest article by Ali Aghatabar on what a Feature Store is and why Data Virtualization is one of the best options for Feature Store as a Service (FSaaS).

Feature Store is a hot topic in today's AI and advanced analytics, and many vendors are actively looking into it, working on solutions and products to fulfill the requirements. Before explaining why data virtualization is a good fit for FSaaS, let's explain a feature first.

Feature

Even though “feature” is a more common word in AI and advanced analytics, a feature is essentially a form of data that is built from raw data or existing feature(s). New forms of data used to be generated by integration, ETL, RPA tools, etc.; however, the common understanding today is that features are those forms of data that are generated by feature engineering processes and are meant to be used by AI services. When we look at some real feature implementations, we can see that not all features are complex or specific to AI, and most of the time, the same features can also be used for integration, analytics, or reporting.

A feature, regardless of its feature engineering complexity, can be an enterprise data asset and may have consumers beyond AI. Therefore, if it turns out to be an enterprise asset, it still needs a centralized repository, a data catalog, and governance, and it must be sharable with authorized people/systems.

Now let’s look at some examples of feature generation:
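As a minimal sketch of what feature generation can look like, the snippet below derives a per-customer average-spend feature from raw transaction rows (the customer IDs and amounts are invented for illustration):

```python
from collections import defaultdict

def avg_transaction_amount(transactions):
    """Derive a per-customer average-spend feature from raw (customer_id, amount) rows."""
    totals, counts = defaultdict(float), defaultdict(int)
    for customer_id, amount in transactions:
        totals[customer_id] += amount
        counts[customer_id] += 1
    return {c: totals[c] / counts[c] for c in totals}

raw = [("c1", 10.0), ("c1", 30.0), ("c2", 5.0)]
feature = avg_transaction_amount(raw)  # {'c1': 20.0, 'c2': 5.0}
```

The derived dictionary is itself a new data asset: useful to an AI model, but equally usable for reporting or integration.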

More about the feature

Feature purpose

The reason for generating a feature is important, as it explains the position of the feature in the feature spectrum (next item). A feature is generated for the following purposes:

Data Enrichment
Data enrichment is a feature generation process that produces a baseline data asset.
Analytics & Reporting
Summary tables, snapshot tables, materialized views, semantic views, etc. are all examples of analytics & reporting features. In analytics, snapshot tables are different versions of a feature along the snapshot DateTime dimension.
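A snapshot table can be sketched as a summary feature versioned along its snapshot DateTime dimension; the summary columns below (`row_count`, `total`) are illustrative assumptions, not a standard schema:

```python
from datetime import date

def take_snapshot(store, snapshot_date, rows):
    """Record one dated version of a summary feature, keyed by the snapshot date dimension."""
    store[snapshot_date] = {"row_count": len(rows), "total": sum(rows)}
    return store

snapshots = {}
take_snapshot(snapshots, date(2021, 1, 1), [10, 20])        # first snapshot version
take_snapshot(snapshots, date(2021, 2, 1), [10, 20, 30])    # later snapshot version
```

Each dated entry is a distinct version of the same feature, which is exactly what reporting queries along a DateTime dimension consume.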
Integration
Systems integrate through interface files, ODBC/JDBC connections, APIs, streaming, etc. In most of these methods, the data shared from one party to another is a form of integration feature.
AI
AI needs features for three main reasons: AI training, AI scoring, and AI algorithm accuracy.
Training
AI models often need a large volume of historical (and often stale) data. AI training features are sometimes called offline (cold) features.
Scoring
AI scoring needs smaller but more recent sets of data. AI scoring features are also called online (hot) features.
Accuracy
The accuracy of AI models/algorithms is often increased by adding more iterations and automatically generating new features (columns and/or cells) from existing features, trying to produce new algorithms with fewer errors. In today's market, many existing and emerging AI tools generate these features automatically. Reference: Automated Feature Engineering in Python.
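Such automated feature generation can be sketched, in a very simplified form, as deriving a ratio feature from every pair of numeric columns; the column names below are invented, and real tools such as those in the referenced article are far more sophisticated:

```python
from itertools import combinations

def auto_ratio_features(row):
    """Automatically derive new ratio features from every pair of numeric columns in a row."""
    out = dict(row)
    for a, b in combinations(row, 2):
        if row[b] != 0:  # skip divisions by zero
            out[f"{a}_over_{b}"] = row[a] / row[b]
    return out

enriched = auto_ratio_features({"income": 100.0, "debt": 25.0})
# adds an 'income_over_debt' column worth trying in the next training iteration
```

Each generated column is a candidate feature; an AutoML loop would keep the ones that reduce model error.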

Feature spectrum

Depending on what we do with existing data/features and what feature we generate, the new form of data (the feature) could be a change to existing data, an analytics/reporting object, or a more AI-related object. When we simply add a column to existing data, it is still a simple form of data. When we de-normalize data, we transform it into an object that is a good fit for integration, analytics & reporting use cases; but when we generate a vector object, it is only useful for AI use cases. Below is an example of the feature spectrum.
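The three points on the spectrum can be illustrated with a toy example (the customer record and order counts are made up): adding a column keeps the data ordinary, de-normalizing produces a reporting/integration object, and vectorizing yields an AI-only object:

```python
# A raw record and a related table, invented for illustration.
customer = {"id": 1, "first": "Ada", "last": "Lovelace"}
orders_by_customer = {1: 3}

# 1) Add a column: still a simple form of data.
customer["full_name"] = f"{customer['first']} {customer['last']}"

# 2) De-normalize: join in the order count, a good fit for
#    integration, analytics & reporting use cases.
customer["order_count"] = orders_by_customer[customer["id"]]

# 3) Vectorize: a numeric vector that only an AI model can really use.
vector = [float(customer["order_count"]), float(len(customer["full_name"]))]
```

The further right on the spectrum a transformation sits, the fewer non-AI consumers the resulting feature has.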

Feature storage

A feature may end up with a different data type from the raw data or feature(s) it was derived from. As an example, we take an audio file and generate an audio transcript feature using an AI cognitive service. The new feature (the transcript) has changed from an unstructured blob format to a semi-structured NoSQL format (NoSQL just because I like it; otherwise someone could drop it into a text file!). Furthermore, we can generate a “Bag of Words” feature from that transcript, where a bag of words is more like a key-value or JSON type feature. Finally, a bag of words feature can also be represented as a matrix feature, which could be stored in an in-memory multidimensional array. A feature, depending on its data type, needs a specific storage type, which may differ from the storage type of the raw data or feature(s) it originated from. Below is an example of features, their storage types, and the storage technology options. Keep in mind that there is no definite data type or storage type for a feature: one person may store the transcript as a text blob in AWS S3 while another stores it as NoSQL data in MongoDB; it is partly a matter of personal architecture.
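The transcript-to-matrix progression can be sketched as follows, assuming a toy transcript string; a real pipeline would obtain the transcript from an AI cognitive service:

```python
def bag_of_words(transcript):
    """Key-value (JSON-like) feature: word -> occurrence count."""
    counts = {}
    for word in transcript.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def to_matrix(bags, vocabulary):
    """Matrix feature: one row per transcript, one column per vocabulary word."""
    return [[bag.get(word, 0) for word in vocabulary] for bag in bags]

bag = bag_of_words("the cat saw the dog")   # semi-structured key-value feature
vocab = sorted(bag)                          # ['cat', 'dog', 'saw', 'the']
matrix = to_matrix([bag], vocab)             # multidimensional-array feature
```

Each step changes the natural storage type: the transcript suits a blob or document store, the bag of words a key-value/NoSQL store, and the matrix an in-memory array.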

Feature engineering tool

Depending on the feature spectrum, we may choose different feature engineering tools. Data-oriented features can commonly be generated by T-SQL-supported tools, whereas AI features are normally generated through sophisticated AI services (like cognitive APIs) or AI code (like Python, R, etc.). Analytics features are somewhere in between; they may simply need T-SQL, a programming tool, or a mix of both. Therefore, Feature Store as a Service (FSaaS) must support both data engineering and feature engineering tools.

Feature configuration

The size of a feature, how often it gets updated, and the different versions of a feature are also very important. As an example, when we train a deep learning model, we need a big set of data, but it doesn't need to be fresh data (most of the time, of course). When we do system integration through an API, we need smaller but more recent and accurate data. For stream analytics, as an example, we are talking about just the most recent frame of data. On the other hand, features evolve with an organization over time, and there are two reasons to version them: A) a change in the schema or logic of a feature; B) a change in the data a feature is consuming and/or exposing.
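A feature's configuration and the two versioning reasons can be sketched as a small metadata record; the field names below are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureConfig:
    name: str
    refresh: str          # e.g. "daily" for training, "streaming" for scoring
    schema_version: int   # bumped on a schema/logic change (reason A)
    data_version: str     # tracks the underlying data the feature exposes (reason B)

cfg = FeatureConfig(
    name="avg_spend",
    refresh="daily",
    schema_version=2,
    data_version="2021-06-01",
)
```

Keeping both version fields separate lets a consumer pin the feature logic while still receiving fresher data, or vice versa.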

What we learn from all of the above is:

  • A feature is not necessarily an AI object and can be a widespread enterprise data asset
  • Feature store is better to be decoupled from any storage technology and feature engineering tooling
  • Feature store is better to provide capabilities like configuration, governance, security, sharing, etc
Below is a high-level view of a good FSaaS.

Data virtualization

Looking at the requirements for FSaaS, it is quite clear that a Data Virtualization platform is one of the best options. Below are some highlights about the Data Virtualization platform that help us evaluate it for FSaaS:
Storage decoupling
  • Data Virtualization generates virtual data objects (VDOs) based on any data source from anywhere and it decouples itself from storage technologies, whether source or target!
Tooling decoupling
  • Data Virtualization is essentially a no-ETL concept, but it can be used/accessed by all T-SQL tools, ETLs, programming languages/scripts/tools, visualization tools, and AI platforms, either through the virtualization platform, an API, or ODBC, which makes it decoupled from tooling.
Sharing and data hub
  • Data Virtualization acts as a real data hub and all VDOs are accessible by authorized systems/people.
Governance/security
  • All VDOs are governed and secured through a centralized process.
Data catalogue
  • As VDOs are logical, data catalog and metadata get automatically generated by creating/updating a VDO.
and more
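A VDO is, in essence, a governed logical view that consumers query through standard interfaces. The sketch below uses SQLite's Python DB-API as a stand-in for a virtualization platform's ODBC/JDBC endpoint; the table and view names are invented for illustration:

```python
import sqlite3

# sqlite3 stands in for the virtualization endpoint; in practice the consumer
# would open an ODBC/JDBC connection to the Data Virtualization platform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                 [("east", 10.0), ("east", 5.0)])

# A VDO behaves like a governed logical view over source data,
# with no ETL copy of the data itself.
conn.execute("""CREATE VIEW vdo_sales_by_region AS
                SELECT region, SUM(amount) AS total
                FROM raw_sales GROUP BY region""")

rows = conn.execute("SELECT * FROM vdo_sales_by_region").fetchall()
```

Any authorized T-SQL tool, script, or AI platform that can speak SQL over the connection consumes the same VDO, which is what decouples the feature from both storage and tooling.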
For a better comparison, we can look at two more options: FEAST as a feature store solution, and Snowflake as a generic data/analytics solution with the capability to be used as a feature store.
Looking at FEAST as one of the best feature store solutions in the market, we see that:
  • It is AI-oriented
  • It has local storage, which can be an issue from a data security and governance point of view, and it does not support all data types
  • It is not part of an enterprise data platform, which makes enterprise-wide use of its features impractical
Snowflake uses DataRobot for automated AI features, but there are some limitations with Snowflake as FSaaS:
  • It is a centralized data repository, not a distributed platform, so it is not able to generate cross-platform features
  • It binds its engine to one of the cloud providers and it is not a cloud-agnostic platform
Here is a FSaaS comparison among Data Virtualization, typical data platforms, and some of the existing feature store solutions:

Disclaimer: Bear in mind that I tried to make a fair comparison based on my own experiences, my definition of a feature store, and the common market tools & platforms I have seen and worked with. If I missed something or new work is underway, I would be happy to get your feedback through my LinkedIn.

About the author
Ali Aghatabar, founder and director of Intelicosmos®, has been helping clients across the globe, particularly in the APAC region, for over two decades. With a consulting background, he has helped a wide range of clients and industries with their IT needs, especially data & analytics, cloud architecture and computing, AI and process automation, digital transformation, IoT and smart devices, etc.
