Myths and Misconceptions – Data Mesh and Data Warehousing

Several of the ideas behind Data Mesh are quite impressive. As an example, its proponents have really innovative approaches to data modeling and data architecture. However, when it comes to the technical history of data warehousing, or to conducting seminars on the subject, my thoughts diverge from some of their conclusions about how we got here. I also do not fully agree with them on future plans with respect to current trends. There is a clear disconnect between mesh architects and data warehouse architects, and it could lead Data Mesh down a dangerous path in its understanding of the past, present, and future of modern data architectures.

Myth #1 The data warehouse as a place to copy OLTP exhaust data

Data has existed since long before the birth of computer systems, or even dinosaurs, but only recently have people started thinking about it as a resource.
Many people view a data warehouse as a place to dump data. This is a misreading of the past, and today’s data warehouses are far more complex than that. Those who see data warehouses only as places to put data do not understand what a data warehouse does.
What some people need to realize is that data warehousing has borrowed from many areas of information management, digital and non-digital, including:

  • Data migration tools and techniques
  • Database analysis and design
  • Decision Support Systems / Executive Information Systems
  • Distributed data processing
  • End User Computing
  • Entity relationship modelling / dimensional modelling
  • Function decomposition and business data domains
  • Information Centre architectures
  • Iterative development and delivery
  • Joint application development / rapid application development
  • MPP, SMP and hybrid SMP platforms
  • Relational database management systems
  • Reusable designs
  • The subject orientation of data
  • Time slicing, time-variance and time series as well as time-invariant data
  • Timebox methodologies
In addition, there is a longer list of notable contributors.
Although data warehousing has borrowed from many sources, Bill Inmon, whom some consider the “Father of Data Warehousing”, defines it as a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making.
Subject Oriented: The data stored in a Data Warehouse is organized by the business’ subjects of interest, that is, its data domains, such as customer, product, and sales. This organization makes it easier to remember because you are dealing with a limited number of subjects or data domains. Your understanding of these subjects should be deeper than your understanding of other IT requirements as a result of many years of experience in your industry.
Integrated: All data entering the data warehouse must be normalized and integrated according to defined rules and constraints to ensure that any reproduced data is consistently, unambiguously, and contextually whole.
Time Variance: Every record in the data warehouse is tied to a point or period in time, so the warehouse keeps history rather than only the current state. This makes it possible to compare and contrast the same data at different points in time.
Non-Volatile: Once data has been loaded into the data warehouse, it is not updated or deleted in place; new data is added alongside the old. That stable record is what allows a business to analyze past decisions and make better future ones based on the same data.
Management Decision Making: Data Warehousing is commonly described as being the infrastructure that supports operational reporting and analysis; however, it also has ancillary uses within corporations.
Demand driven: Updating a data warehouse should be based on business demand. In other words, we shouldn’t update the data warehouse hoping to get back data just because we might need it at some point in time in the future (yes, that is what preemptive loading of data means). We should only load data when we already know how it will be used and how it benefits us.
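To make the time-variant and non-volatile properties in Inmon's definition a little more concrete, here is a minimal Python sketch. It assumes that "non-volatile" means rows are appended and never changed in place, and that "time-variant" means every row carries the date from which it applies. The names `CustomerVersion`, `CustomerHistory`, `load`, and `as_of` are hypothetical and chosen only for illustration; they do not come from any particular product.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a loaded row cannot be modified afterwards
class CustomerVersion:
    customer_id: int
    city: str
    valid_from: date  # first day on which this version applies


class CustomerHistory:
    """Append-only store: versions are added, never updated or deleted."""

    def __init__(self) -> None:
        self._versions: List[CustomerVersion] = []

    def load(self, version: CustomerVersion) -> None:
        # Non-volatile: loading only ever appends a new row.
        self._versions.append(version)

    def as_of(self, customer_id: int, when: date) -> Optional[CustomerVersion]:
        # Time-variant: return the latest version that applied on the given date.
        candidates = [
            v for v in self._versions
            if v.customer_id == customer_id and v.valid_from <= when
        ]
        return max(candidates, key=lambda v: v.valid_from, default=None)


history = CustomerHistory()
history.load(CustomerVersion(1, "Leeds", date(2020, 1, 1)))
history.load(CustomerVersion(1, "York", date(2022, 6, 1)))

print(history.as_of(1, date(2021, 3, 15)).city)  # Leeds
print(history.as_of(1, date(2023, 1, 1)).city)   # York
```

Because the earlier row is kept when the customer moves, a question asked about 2021 still gets the 2021 answer, which an operational system that overwrites the city in place could not give.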
Conclusion
Data warehousing manages far more than the mere storage of data that originates from the operational level of an organization. It is a rather technical process in which one is able to
  1. determine, through a persistent and consistent planning effort, which data needs to be integrated into the data warehouse;
  2. migrate (or transform or load) this data into it through software tools; and then
  3. access it effectively with reporting and business intelligence applications.

Myth #2 Relational database management systems were first used for OLTP

If someone says that relational database management systems were first used with operational applications, doubt everything the person says. The first RDBMS products were used for reporting on data in dimensional database designs—there was good reason for this, as none of the implementations even came with a usable audit trail facility. So, not at all OLTP-friendly at the beginning.

Myth #3 Data Warehousing necessarily means monolithic databases

Another myth is that data warehouses always require one giant database. In practice, many businesses have run data warehouses on clusters of many nodes, with tons of disk and memory and ultra-fast backplanes. The myth overlooks the fact that we have long been able to isolate data at various levels of abstraction, and that the data storage and compute technology needed to do so has been available for decades.

Myth #4 Data Warehousing necessarily means monolithic and siloed teams

This is a problem with the way IT companies and their customers have reframed the idea of data warehousing development and infrastructure. Before IT got its hands dirty with data warehouses, cross-functional, high-performance teams would work closely with the business to build highly flexible solutions that could be modified quickly in response to evolving needs. In those cases, teams often developed entire data marts iteratively and incrementally, starting small and adding capabilities over time. This approach is not limited to data warehousing; look at the way so many of us use a software versioning strategy for software projects or agile processes for product development. The point here is that this approach fits the nature of how business works better than an attempt to fully understand requirements up-front, design a solution based on some rigid methodology from the outset, build it all out and then accommodate change over time as a separate activity.

Myth #5 Data warehouse databases must be fully normalized

In data warehousing, a multidimensional model is typically implemented as a dimensional schema, not as a 3NF schema. Many data warehouse projects use a dimensional model in which the dimensions are not in third normal form (3NF) or fifth normal form (5NF). Although such designs are sometimes referred to as dimensional fourth normal form (4NF) or dimensional fifth normal form (5NF), this is not the same as the 4NF and 5NF definitions used in database theory.
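As an illustration of a dimensional schema whose dimensions are deliberately not in 3NF, here is a small sketch using Python's built-in `sqlite3` module. The table and column names (`dim_product`, `dim_date`, `fact_sales`, and so on) and the sample rows are assumptions made up for this example, not taken from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT,
    department   TEXT   -- determined by category, so this table is not in 3NF
);
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date TEXT,
    year          INTEGER  -- derivable from calendar_date, stored for convenience
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    amount      INTEGER
);
""")

conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?, ?)",
    [(1, "Kettle", "Kitchen", "Home"),
     (2, "Toaster", "Kitchen", "Home"),
     (3, "Lamp", "Lighting", "Electrical")],
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(20230101, "2023-01-01", 2023), (20240101, "2024-01-01", 2024)],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 20230101, 30), (2, 20230101, 20), (3, 20240101, 45), (1, 20240101, 35)],
)

# One join per dimension, then aggregate the facts by dimension attributes.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()

for row in rows:
    print(row)
```

A fully normalized design would move `category` and `department` into their own tables. The dimensional version accepts the repetition in exchange for simpler queries, which is the trade-off the paragraph above describes.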

Myth #6 Central ownership of data

Yet another ill-informed criticism from the pro-privacy, anti-status quo faction, who really need to brush up on their history and correct their lack of historical perspective. For one, there was never generalized central ownership of data, even where there was central custodianship of it.

Myth #7 Data warehouses get queried directly

Myth #8 Data warehousing is out of date

As Data Mesh proponents point out, distributed computing techniques go back as far as the 1980s. For example, a client-server model in which one server processes requests from many client machines was popular at that time. Many data-mesh advocates have pointed out that data mesh is actually an extension of this distributed computing model. They also argue that distributed computing and data mesh are not mutually exclusive: you can use both simultaneously.

In the end, there are many positive aspects of data mesh, and perhaps I see them because I’ve worked with it before. In addition, its proponents tend to be polite and well-informed people, unlike some of the rough-necks in the big data industry. However, what bothers me is how often people who champion data mesh take potshots at how long data warehouses take to build and how long they take to improve once built. We saw that with Big Data and Hadoop; then we saw it with data lakes and lake houses/outhouses; now we’re seeing it again. It’s like when some people make fake news just for attention: irritating, time-wasting, and unnecessary. We don’t expect the champions of data mesh to read this blog post, but if they do, we hope they reconsider their stance on data warehousing and realize that it’s not from the Stone Age, but an amazing platform for storing petabytes of critical corporate data that enables enterprise BI, analytics, Big Data projects, and even cutting-edge AI algorithms and machine learning processes.
