What is Change Data Capture?

Change Data Capture (CDC) is an approach to data integration based on identifying, capturing, and delivering the changes made to enterprise data sources. A CDC software component provides a robust data movement platform: it can capture and propagate changes, updating records in a target system using batch or micro-batch (continuous) processing, and it can determine the lowest incremental point from which to resume processing. CDC uses checksum values to identify only the incremental changes at the row level, based on the primary key, making it easy for applications to subscribe to specific events and receive only the data that changed. By leveraging CDC's capture capabilities, customers can build near-real-time aggregates and operational dashboards as well as overnight roll-up batch processing. Combined with the speed of the Snowflake data warehouse, this makes it easy for businesses to audit and analyze their data history.
The Change Data Capture feature of SQL Server, for example, is designed to capture all the changes applied to a publisher's tables and make those changes available in an easily consumable relational format. Changes are tracked through change tables, which the SQL Server Database Engine manages automatically for each tracked table once change data capture is enabled on it. The change tables contain columns that mirror the column structure of the tracked source table, plus metadata columns that describe the nature of each change and when it was made.
For each source table whose data we want to capture, SQL Server creates a corresponding change table.
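
As an illustration, here is a minimal sketch of enabling and reading change data capture in SQL Server. The dbo.Orders table is a hypothetical example; the stored procedures and functions shown are SQL Server's built-in CDC interface.

```sql
-- Enable CDC at the database level (run once per database).
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for a hypothetical dbo.Orders source table.
-- SQL Server creates the change table cdc.dbo_Orders_CT automatically.
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;       -- NULL: no gating role required to read changes

-- Read all changes captured between two log sequence numbers (LSNs).
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT __$operation,   -- 1 = delete, 2 = insert, 3 = before update, 4 = after update
       __$start_lsn,   -- position of the change in the transaction log
       *               -- plus the mirrored source-table columns
FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
```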

These are the main types of Change Data Capture mechanisms:

1- Row Versioning Change capture through row versioning is used to identify changes in normalized data, where each unit of data is stored only once in a table. The technique is to mark each changed row with a version number, while a current version of the table, or perhaps a group of tables, is maintained in a supporting component such as a reference table. When a change capture runs, all rows carrying a version newer than the recorded one are considered to have changed. (A minimal SQL sketch follows this list.)
2- Update Timestamps A last-modified timestamp column can drive change capture. A column defined with CURRENT_TIMESTAMP as its default receives the current timestamp when a row is inserted, and many databases can also refresh it automatically on every UPDATE (for example, MySQL's ON UPDATE CURRENT_TIMESTAMP). A capture run then selects only the rows whose timestamp is newer than the previous extraction. (See the timestamp sketch after this list.)
3- Publish and Subscribe Queues Publish and subscribe messaging provides instant event notifications for distributed applications. Publish/subscribe topics broadcast messages to different parts of a system asynchronously: to publish a message, you simply push it to the topic, and any number of interested subscribers can receive it or filter out messages they are not interested in. (A lightweight SQL sketch follows this list.)
4- Database Log Scanners The database log, or transaction log, is a fundamental component of a database management system. It records every change to application data, so the database can track which transaction made which change. A log scanner reads these records to derive the change stream. (See the logical-decoding sketch after this list.)
5- Publish/Subscribe Sources A subscription represents the stream of messages from a specific topic, delivered to the subscribing application when it is ready to receive them. A subscription can deliver messages to only one endpoint (for example, one application running on one compute instance); to deliver messages simultaneously to multiple endpoints, you must create a separate subscription for each endpoint.
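
For row versioning (mechanism 1), a minimal ANSI SQL sketch might look like the following; the customer and capture_watermark tables are hypothetical names introduced only for illustration.

```sql
-- Hypothetical source table carrying an explicit row version.
CREATE TABLE customer (
    customer_id  INT PRIMARY KEY,
    name         VARCHAR(100),
    row_version  BIGINT NOT NULL          -- bumped on every change
);

-- Reference table holding the last version already captured per table.
CREATE TABLE capture_watermark (
    table_name   VARCHAR(100) PRIMARY KEY,
    last_version BIGINT NOT NULL
);

-- A capture run picks up every row changed since the stored watermark...
SELECT c.*
FROM customer c
JOIN capture_watermark w ON w.table_name = 'customer'
WHERE c.row_version > w.last_version;

-- ...and then advances the watermark to the highest version it saw.
UPDATE capture_watermark
SET last_version = (SELECT MAX(row_version) FROM customer)
WHERE table_name = 'customer';
```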
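For update timestamps (mechanism 2), here is a MySQL-flavored sketch; the orders table and the :last_extracted_at watermark placeholder are assumptions for illustration.

```sql
-- MySQL-flavored sketch: the column defaults to the current timestamp on
-- INSERT and is refreshed automatically on every UPDATE.
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    status      VARCHAR(20),
    updated_at  TIMESTAMP NOT NULL
                DEFAULT CURRENT_TIMESTAMP
                ON UPDATE CURRENT_TIMESTAMP
);

-- A capture run selects only the rows touched since the last extraction;
-- :last_extracted_at stands in for the watermark kept by the CDC job.
SELECT *
FROM orders
WHERE updated_at > :last_extracted_at;
```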
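For publish and subscribe queues (mechanisms 3 and 5), one lightweight way to express the pattern in SQL is PostgreSQL's built-in LISTEN/NOTIFY; the order_changes channel name is made up for the example.

```sql
-- Subscriber side: each interested session subscribes to the topic.
LISTEN order_changes;

-- Publisher side: push a message to the topic; every session currently
-- listening on 'order_changes' receives it, others never see it.
NOTIFY order_changes, '{"order_id": 42, "op": "UPDATE"}';

-- Equivalent function form, convenient to call from application code.
SELECT pg_notify('order_changes', '{"order_id": 42, "op": "UPDATE"}');
```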
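For database log scanners (mechanism 4), PostgreSQL exposes its transaction log through logical decoding, which can be driven entirely from SQL. This sketch assumes a server configured with wal_level = logical; the cdc_slot name is illustrative.

```sql
-- Create a replication slot using the built-in test_decoding output plugin.
SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'test_decoding');

-- Peek at decoded changes without consuming them...
SELECT * FROM pg_logical_slot_peek_changes('cdc_slot', NULL, NULL);

-- ...or consume them, advancing the slot so each change is delivered once.
SELECT * FROM pg_logical_slot_get_changes('cdc_slot', NULL, NULL);
```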

Change Data Capture Methods

  • Log-based CDC

    This is the most efficient way to implement CDC. Log-based Change Data Capture (CDC) lets you react to data changes in near real time without polling the database or altering the application: every change is already recorded in the transaction log, and a log reader streams those records onward, for example to a Kafka topic for ingestion by other systems. Kafka Connect's out-of-the-box connectors make this straightforward. (The logical-decoding sketch above shows the same idea expressed directly in SQL.)

  • Query-based CDC

    Here you query the data in the source to pick up changes: the capture job periodically polls the source tables, typically using an update timestamp or version column, and extracts the rows that changed since the last run. This is simple to set up and requires no special database privileges, but it adds query load to the source, cannot see intermediate changes between polls, and misses hard deletes unless rows are soft-deleted. (The timestamp sketch above is exactly this pattern.)

  • Trigger-based CDC

    In this method you alter the source side so that triggers write each change to a change table, which is then moved to the target. A trigger is procedural code, stored in the database catalog, that “triggers” an action whenever data affecting it is changed. Triggers can be either row-level or statement-level: row-level triggers fire once for each affected row, whereas statement-level triggers run only once per SQL statement regardless of how many rows it affects. (A trigger sketch follows this list.)
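
As referenced above, here is a MySQL-flavored sketch of trigger-based CDC; the customer table, the customer_changes change table, and the trigger name are all hypothetical.

```sql
-- Change table populated by triggers on the source table.
CREATE TABLE customer_changes (
    change_id   INT AUTO_INCREMENT PRIMARY KEY,
    customer_id INT,
    operation   CHAR(1),                              -- 'I', 'U', or 'D'
    changed_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Row-level trigger: fires once for every row the UPDATE touches.
CREATE TRIGGER trg_customer_update
AFTER UPDATE ON customer
FOR EACH ROW
INSERT INTO customer_changes (customer_id, operation)
VALUES (NEW.customer_id, 'U');
```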

Methodology

Change Data Capture (CDC), often used for data replication, can be the solution when the volume of changes to data is too great for traditional programming methods. CDC mechanisms track all changes made on a record-by-record basis over time and capture those after-the-fact change records as they are generated. As each new change record is captured, it can be passed on to any system that requires the change information.

How Lyftrondata empowers data capture

Lyftrondata is a modern data platform that provides real-time access to any data and enables users to query it with simple ANSI SQL. With Lyftrondata, enterprises can build data pipelines in minutes and shorten the time to insight by 75% with the power of the modern cloud compute of Snowflake and Spark.
The platform eliminates the time engineers spend building data pipelines by hand and makes data instantly accessible through real-time access with simple ANSI SQL. Lyftrondata's pre-built connectors automatically deliver data to warehouses in normalized, ready-to-query schemas and provide full search over the data catalogue. Whether you want to build a governed data lake or data warehouse, or simply migrate from a traditional database to a modern cloud data warehouse, you can do it all with Lyftrondata: create and manage all your data workloads on one platform by automatically building your pipeline and warehouse, analyze data instantly with ANSI SQL and BI/ML tools, and share it without writing any custom code. This boosts the productivity of your data professionals and shortens your time to value. Lyftrondata also automatically normalizes API/JSON/XML/S3/Blob/NoSQL sources into a ready-to-query relational format.
Connect with our experts today to explore how Lyftrondata can help modernize your data stack with an agile, automatic columnar ELT pipeline and deliver 95% faster performance.
