Introduction to The World of Data — (OLTP, OLAP, Data Warehouses, Data Lakes and more)

In this article, I hope to paint a picture of the modern data world and when done you should have a decent understanding of the roles of Data Engineers, Data Analysts and Data Scientists and the technologies that empower them.

The Transactional World (OLTP)

Essentially, the data access patterns of applications are often row driven:

  • When we create a user, we create a new row
  • When we retrieve a user, we get one particular row
  • When we delete a user, one row is deleted

This is generally what is called “Transactional Processing”, as we are working with data not to analyze it but to facilitate transactions/interactions with our application. The databases we use for our applications like Postgresql, MySQL, MariaDB and others are designed for these kinds of patterns. Systems designed for these kinds of workloads are referred to as OLTP (online transaction processing) systems.

The Need for OLAP

  • Data Analysts will begin analyzing our data to create data visualizations and dashboards that help different business units make better decisions.
  • Data Scientists will create and train machine learning models to help generate predictions that can be used for making decisions, better-targeted advertising, making useful suggestions and more.
  • Data Engineers are responsible for making sure data is available for data consumers and its quality can be trusted.

Eventually, as the size of our data grows especially into the multi-petabyte scale, OLTP systems are going to start to feel slow for analytical queries for several reasons:

  • In analytics, you are often not grabbing one row but large subsets of rows
  • In analytics, you often only need particular fields/columns, but with row-based data you’ll load the entire row data before narrowing it down, then repeat for every row.
  • Analytical queries are often more complicated adding logic to transform, aggregate and clean data.

To a point, we can improve the performance of OLTP systems using techniques you can use to optimize any database such as:

  • Indexing: Creating a data structure similar to the index in the back of a textbook to more quickly lookup values. (Speeds up Reads, Slows down writes)
  • Partitioning: Allow the database to break up the data into subdivisions based on the value of certain fields. For certain queries, the database can now only search those subdivisions. (Not good for fields with lots of possible values/high cardinality)
  • Sorting: Sorting the data so that finding data based on the sorted fields can be faster.
  • Materialization: Making a physical copy of a subset of the data (can be challenging to keep materialized tables in sync with the original, and consumers need to know to query the materialized table, not the original)

Even then, at the end of the day it may be better to use systems built with analytical queries in mind, referred to as OLAP (Online Analytical Processing) systems which can also benefit from the techniques mentioned above. These have traditionally come in the flavor of specialized systems/platforms called “Data Warehouses”, which provide the performance needed but often at very large price tags for storage and use of the platform.

To use a data warehouse, our startup will have to move data currently sitting in our OLTP systems to our OLAP data warehouse using tools and code:

  • We will export the data from our OLTP database
  • We will run any logic to transform our data to the ideal format for our analytical needs
  • We will then load the transformed data into the OLAP data warehouse

This is the role of the Data Engineer to create and maintain these ETL (EXPORT, TRANSFORM, LOAD) workloads so data consumers (Data Analysts and Data Scientists) can make use of reasonably up-to-date data in OLAP systems.

Swimming in a Lake of Data

The solution is to create an intermediary dumping ground for all of our data whether structured or unstructured, and this is known as the data lake.

In the early days, data warehouses and data lakes had to be on-prem (on-premise) so companies would have to buy several large and powerful computers to power the data warehouse which added to the licensing costs of data warehouse software and support. With the creation of the open-source Hadoop framework, businesses could store large amounts of data on a network of cheaper less powerful computers, the data lake. Fast forward to modern times, more and more data warehouse and data lake storage is now cloud-based with the main cloud providers AWS, Azure, and Google Cloud.

So now Data Engineers will ETL the data from OLTP systems into the data lake and then do another ETL for a subset of mission-critical data in the data warehouse.

Data Sprawl & Data Drift

  • The original data in our OLTP sources
  • The copy in our data lake
  • The copy in our data warehouse

This is actually only the beginning of the number of copies that will get made that the data engineer needs to track and keep in sync.

The problem comes in that you just can’t give all consumers access to all the data to maintain all the security, regulatory and governance responsibilities of the enterprise.

  • Some departments probably shouldn’t have access to certain datasets
  • Some should have access, but not to certain columns (SSN, Private Health Info) or rows (customers that have opted out of information use)
  • Some departments need the data to have different column names and other changes that other departments don’t need.

To facilitate this, the data engineer will create data marts which are essentially sub-data warehouses. So for example, for the marketing department we may create a data mart and inside it load some data:

  • We may create a “logical view” of customers that don’t include credit card numbers or SSNs, this isn’t a copy but a pre-filtered view of the existing data accessible from their mart.
  • Marketing often likes to run analytics on sales data by month so we may do monthly extract into their data mart, this is actually a physical copy of a subset of the original. (it will be faster to query the subset than to query the full dataset)

This can result in a large network of marts and extracts adding to our storage and platform costs. Although… it gets worse…

A data analyst on the marketing team may be creating a dashboard for the last week of sales data so to improve performance they query the data and export an extract to their computer for faster performance on their favorite BI Dashboard tool (Tableau, Power BI, Hex, Preset). Now we have a copy of the data that the data engineer isn’t aware of, can’t manage access to and has no idea whether is in sync. This non-accountable copy is being used to create dashboards that critical business decisions are based on. This problem with the growth of data copies is referred to as data drift.

Along with data drift, anytime consumers need a previously hidden column to be visible or other data changes they have to make a request with the data engineer who is constantly working through a backlog of requests, maintaining existing ETL pipelines and creating new ones resulting in simple requests taking weeks or months slowing down the time to business insights.

One may say, just hire more data engineers. Unfortunately, there just aren’t enough data engineers to keep up with needs resulting in a multi-billion dollar market for tools to help make this whole process become easier.

The Data Lakehouse Dream

  • We reduce the cost of storage and the data warehouse
  • We reduce data drift, as the data lake becomes the single source of truth

This architecture is referred to as the Data Lakehouse, essentially using the data lake as your data warehouse. In the past, this often did not go well resulting in “data swamps” for several reasons.

  • The structure and format of the data wasn’t conducive to fast processing at scale (lots of csv and json files)
  • It wasn’t easy to use
  • Processing frameworks weren’t fast enough to load and query the data
  • It was tricky to track what datasets existed and to control who has access

Luckily, innovation across a lot of high-value open-souce projects have alleviated many of these concerns making Data Lakehouses practical and desirable.

The Components of Data Lakehouse

Storage and Storage Format

  • Newer binary columnar file formats such as Apache Parquet make storing data in standard ways optimized for analytics easy

Table Formats

Read a Comparison of different table formats

We will also need a catalog to track all of these tables, with the open-source Project Nessie we can do just that, and also get great versioning features similar to using Git when developing applications allowing data engineers to practice “data as code” and “write-audit-publish” patterns on their data.

Video on Dremio’s Project Nessie-based Arctic Service

In-Memory Processing and Data Transport

Openness

  • Vendor lock-in, once your data is in the data warehouse it can be a lot of work to get it back out if you switch providers.
  • Tool lock-out, if new data engines and tools are created, since your data is in a proprietary format you can’t use them.

The Final Piece of the Puzzle

With Dremio the data lakehouse comes to life as it addresses all the problems we’ve explored throughout this article.

Performance on the data lake

  • Dremio is built top to bottom with Apache Arrow allowing to process queries fast and at scale
  • Dremio uses its C3(columnar cloud cache) technology to help speed up queries on cloud object storage and reduce costs.
  • With Dremio’s reflections, you can create reusable materializations that automatically sync to speed up the process on multiple datasets and queries. It is as easy as flipping a switch. Now you can have sub-second BI dashboards without the need for extracts.

Governance and Collaboration

  • You make data self-serve using Dremio’s semantic layers allowing you to distribute datasets into spaces and folders.
  • You can control which users and roles can access each space, folder and dataset for easy and granular governance to maintain regulatory compliance and security.
  • Control access to rows and columns using pre-filtered virtual datasets or row/column access policies.

Ease of Use

  • Robust SQL Editor for writing the queries you need

Cost Optimization

  • Free tier, no software or licensing costs to get started. Dremio provides powerful data analytics tools available to enterprises of all sizes.

Conclusion

To follow advancements in the data world follow the following Twitter accounts:

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alex Merced Coder

Alex Merced is a Developer Advocate for Dremio and host of the Web Dev 101 and Datanation Podcasts.