Feature Store as a Service (FSaaS) with Data Virtualization

A guest article by Ali Aghatabar on what a Feature Store is and why Data Virtualization is one of the best options for Feature Store as a Service (FSaaS).

Feature Store is a hot topic in today's AI and advanced analytics, and many vendors are actively looking into it, working on solutions and products to fulfill the requirements. Before explaining why data virtualization is a good fit for FSaaS, let's explain a feature first.

Feature

Even though “feature” is a more common word in AI and advanced analytics, a feature is essentially a form of data that is built from raw data or existing feature(s). New forms of data used to be generated by integration, ETL, RPA tools, etc.; however, the common understanding today is that features are those forms of data that are generated by feature engineering processes and are meant to be used by AI services. When we look at some real feature implementations, we can see that not all features are complex or specific to AI, and most of the time, the same features can also be used for integration, analytics, or reporting.

A feature, regardless of its feature engineering complexity, can be an enterprise data asset and may have consumers beyond AI. Therefore, if it turns out to be an enterprise asset, it still needs a centralized repository, a data catalog, and governance, and it must be sharable with authorized people/systems.

Now let’s look at some examples of feature generation:
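As a minimal sketch of what feature generation can look like, the snippet below derives a per-customer average-spend feature from raw transaction rows (the customer IDs and amounts are invented for illustration):

```python
from collections import defaultdict

def avg_transaction_amount(transactions):
    """Derive a per-customer average-spend feature from raw (customer_id, amount) rows."""
    totals, counts = defaultdict(float), defaultdict(int)
    for customer_id, amount in transactions:
        totals[customer_id] += amount
        counts[customer_id] += 1
    return {c: totals[c] / counts[c] for c in totals}

raw = [("c1", 10.0), ("c1", 30.0), ("c2", 5.0)]
feature = avg_transaction_amount(raw)  # {'c1': 20.0, 'c2': 5.0}
```

The derived dictionary is itself a new data asset: useful to an AI model, but equally usable for reporting or integration.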

More about the feature

Feature purpose

The reason for generating a feature is important, as it explains the position of the feature in the feature spectrum (next item). A feature is generated for the following purposes:

Data Enrichment
Data enrichment is a feature generation process that produces a baseline data asset.
Analytics & Reporting
Summary tables, snapshot tables, materialized views, semantic views, etc. are all examples of analytics & reporting features. In analytics, snapshot tables are different versions of a feature along the snapshot DateTime dimension.
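A snapshot table can be sketched as a summary feature versioned along its snapshot DateTime dimension; the summary columns below (`row_count`, `total`) are illustrative assumptions, not a standard schema:

```python
from datetime import date

def take_snapshot(store, snapshot_date, rows):
    """Record one dated version of a summary feature, keyed by the snapshot date dimension."""
    store[snapshot_date] = {"row_count": len(rows), "total": sum(rows)}
    return store

snapshots = {}
take_snapshot(snapshots, date(2021, 1, 1), [10, 20])        # first snapshot version
take_snapshot(snapshots, date(2021, 2, 1), [10, 20, 30])    # later snapshot version
```

Each dated entry is a distinct version of the same feature, which is exactly what reporting queries along a DateTime dimension consume.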
Integration
Systems integrate through interface files, ODBC/JDBC connections, APIs, streaming, etc. In most of these methods, the data shared from one party to another is a form of integration feature.
AI
AI needs features for three main reasons: AI training, AI scoring, and AI algorithm accuracy.
Training
AI models often need a large volume of historical (and often stale) data. AI training features are sometimes called offline (cold) features.
Scoring
AI scoring needs smaller but more recent sets of data. AI scoring features are also called online (hot) features.
Accuracy
The accuracy of AI models/algorithms is often increased by adding more iterations and automatically generating new features (columns and/or cells) from existing features, trying to produce new algorithms with fewer errors. In today's market, many existing and emerging AI tools generate these features automatically. Reference: Automated Feature Engineering in Python.
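Such automated feature generation can be sketched, in a very simplified form, as deriving a ratio feature from every pair of numeric columns; the column names below are invented, and real tools such as those in the referenced article are far more sophisticated:

```python
from itertools import combinations

def auto_ratio_features(row):
    """Automatically derive new ratio features from every pair of numeric columns in a row."""
    out = dict(row)
    for a, b in combinations(row, 2):
        if row[b] != 0:  # skip divisions by zero
            out[f"{a}_over_{b}"] = row[a] / row[b]
    return out

enriched = auto_ratio_features({"income": 100.0, "debt": 25.0})
# adds an 'income_over_debt' column worth trying in the next training iteration
```

Each generated column is a candidate feature; an AutoML loop would keep the ones that reduce model error.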

Feature spectrum

Depending on what we do with existing data/features and what feature we generate, the new form of data (the feature) could be a change to existing data, an analytics/reporting object, or a more AI-related object. When we simply add a column to existing data, it is still a simple form of data. When we de-normalize data, we transform it into an object that is a good fit for integration, analytics & reporting use cases; but when we generate a vector object, it is only useful for AI use cases. Below is an example of the feature spectrum.
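The three points on the spectrum can be illustrated with a toy example (the customer record and order counts are made up): adding a column keeps the data ordinary, de-normalizing produces a reporting/integration object, and vectorizing yields an AI-only object:

```python
# A raw record and a related table, invented for illustration.
customer = {"id": 1, "first": "Ada", "last": "Lovelace"}
orders_by_customer = {1: 3}

# 1) Add a column: still a simple form of data.
customer["full_name"] = f"{customer['first']} {customer['last']}"

# 2) De-normalize: join in the order count, a good fit for
#    integration, analytics & reporting use cases.
customer["order_count"] = orders_by_customer[customer["id"]]

# 3) Vectorize: a numeric vector that only an AI model can really use.
vector = [float(customer["order_count"]), float(len(customer["full_name"]))]
```

The further right on the spectrum a transformation sits, the fewer non-AI consumers the resulting feature has.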

Feature storage

A feature may end up with a different data type from the raw data or feature(s) it was derived from. As an example, we take an audio file and generate an audio transcript feature using an AI cognitive service. The new feature (the transcript) has changed from an unstructured blob format to a semi-structured NoSQL format (NoSQL just because I like it; otherwise someone could drop it into a text file!). Furthermore, we can generate a “Bag of Words” feature from that transcript, where a bag of words is more like a key-value or JSON type feature. Finally, a bag of words feature can also be represented as a matrix feature, which could be stored in an in-memory multidimensional array. A feature, depending on its data type, needs a specific storage type, which may differ from the storage type of the raw data or feature(s) it originated from. Below is an example of features, their storage types, and the storage technology options. Keep in mind that there is no definite data type or storage type for a feature: one person may store the transcript as a text blob in AWS S3 while another stores it as NoSQL data in MongoDB; it is partly a matter of personal architecture.
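The transcript-to-matrix progression can be sketched as follows, assuming a toy transcript string; a real pipeline would obtain the transcript from an AI cognitive service:

```python
def bag_of_words(transcript):
    """Key-value (JSON-like) feature: word -> occurrence count."""
    counts = {}
    for word in transcript.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def to_matrix(bags, vocabulary):
    """Matrix feature: one row per transcript, one column per vocabulary word."""
    return [[bag.get(word, 0) for word in vocabulary] for bag in bags]

bag = bag_of_words("the cat saw the dog")   # semi-structured key-value feature
vocab = sorted(bag)                          # ['cat', 'dog', 'saw', 'the']
matrix = to_matrix([bag], vocab)             # multidimensional-array feature
```

Each step changes the natural storage type: the transcript suits a blob or document store, the bag of words a key-value/NoSQL store, and the matrix an in-memory array.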

Feature engineering tool

Depending on the feature spectrum, we may choose different feature engineering tools. Data-oriented features can commonly be generated by T-SQL-supported tools, whereas AI features are normally generated through sophisticated AI services (like cognitive APIs) or AI code (like Python, R, etc.). Analytics features are somewhere in between; they may simply need T-SQL, a programming tool, or a mix of both. Therefore, Feature Store as a Service (FSaaS) must support both data engineering and feature engineering tools.

Feature configuration

The size of a feature, how often it gets updated, and the different versions of a feature are also very important. As an example, when we train a deep learning model, we need a big set of data, but it doesn't need to be fresh data (most of the time, of course). When we do system integration through an API, we need smaller but more recent and accurate data. For stream analytics, as an example, we are talking about just the most recent frame of data. On the other hand, features evolve with an organization over time, and there are two reasons to version them: A) a change in the schema or logic of a feature; B) a change in the data a feature is consuming and/or exposing.
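A feature's configuration and the two versioning reasons can be sketched as a small metadata record; the field names below are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureConfig:
    name: str
    refresh: str          # e.g. "daily" for training, "streaming" for scoring
    schema_version: int   # bumped on a schema/logic change (reason A)
    data_version: str     # tracks the underlying data the feature exposes (reason B)

cfg = FeatureConfig(
    name="avg_spend",
    refresh="daily",
    schema_version=2,
    data_version="2021-06-01",
)
```

Keeping both version fields separate lets a consumer pin the feature logic while still receiving fresher data, or vice versa.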

What we learn from all of the above is:

  • A feature is not necessarily an AI object and can be a widespread enterprise data asset
  • Feature store is better to be decoupled from any storage technology and feature engineering tooling
  • Feature store is better to provide capabilities like configuration, governance, security, sharing, etc
Below is a high-level view of a good FSaaS.

Data virtualization

Looking at the requirements for FSaaS, it is quite clear that a Data Virtualization platform is one of the best options. Below are some highlights about the Data Virtualization platform that help us evaluate it for FSaaS:
Storage decoupling
  • Data Virtualization generates virtual data objects (VDOs) based on any data source from anywhere and it decouples itself from storage technologies, whether source or target!
Tooling decoupling
  • Data Virtualization is essentially a no-ETL concept, but it can be used/accessed by all T-SQL tools, ETLs, programming languages/scripts/tools, visualization tools, and AI platforms, either through the virtualization platform, an API, or ODBC, which makes it decoupled from tooling.
Sharing and data hub
  • Data Virtualization acts as a real data hub and all VDOs are accessible by authorized systems/people.
Governance/security
  • All VDOs are governed and secured through a centralized process.
Data catalogue
  • As VDOs are logical, data catalog and metadata get automatically generated by creating/updating a VDO.
and more
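A VDO is, in essence, a governed logical view that consumers query through standard interfaces. The sketch below uses SQLite's Python DB-API as a stand-in for a virtualization platform's ODBC/JDBC endpoint; the table and view names are invented for illustration:

```python
import sqlite3

# sqlite3 stands in for the virtualization endpoint; in practice the consumer
# would open an ODBC/JDBC connection to the Data Virtualization platform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                 [("east", 10.0), ("east", 5.0)])

# A VDO behaves like a governed logical view over source data,
# with no ETL copy of the data itself.
conn.execute("""CREATE VIEW vdo_sales_by_region AS
                SELECT region, SUM(amount) AS total
                FROM raw_sales GROUP BY region""")

rows = conn.execute("SELECT * FROM vdo_sales_by_region").fetchall()
```

Any authorized T-SQL tool, script, or AI platform that can speak SQL over the connection consumes the same VDO, which is what decouples the feature from both storage and tooling.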
For a better comparison, we can look at two more options: FEAST as a feature store solution, and Snowflake as a generic data/analytics solution with the capability to be used as a feature store.
Looking at FEAST as one of the best feature store solutions in the market, we see that:
  • It is AI-oriented
  • It has local storage, which can be an issue from a data security and governance point of view, and it does not support all data types
  • It is not part of an enterprise data platform, which makes enterprise-wide use of its features impractical
Snowflake uses DataRobot for automated AI features, but there are some limitations with Snowflake as FSaaS:
  • It is a centralized data repository, not a distributed platform, so it is not able to generate cross-platform features
  • It binds its engine to one of the cloud providers and it is not a cloud-agnostic platform
Here is a FSaaS comparison among Data Virtualization, typical data platforms, and some of the existing feature store solutions:

Disclaimer: Bear in mind that I tried to make a fair comparison based on my own experiences, my definition of a feature store, and the common market tools & platforms I have seen and worked with. If I missed something or new work is underway, I would be happy to get your feedback through my LinkedIn.

About the author
Ali Aghatabar, founder and director of Intelicosmos®, has been helping clients across the globe, particularly in the APAC region, for over two decades. With a consulting background, he has helped a wide range of clients and industries with their IT needs, especially data & analytics, cloud architecture and computing, AI and process automation, digital transformation, IoT and smart devices, etc.
