Introduction to The World of Data — (OLTP, OLAP, Data Warehouses, Data Lakes and more)

The Transactional World (OLTP)

Imagine we are creating the newest social media startup. Our initial focus will be on hiring developers to build our application with languages like JavaScript, Python, or Ruby, and on getting it into the hands of potential users. Like any app, data is key: users need to register and log in, and they need to be able to add content like photos, blog posts, and so forth.

In this transactional (OLTP) world, the database works row by row:

  • When we retrieve a user, we get one particular row
  • When we delete a user, one row is deleted
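
These row-by-row operations can be sketched with Python's built-in sqlite3 module; the `users` table and its columns are illustrative, not from any real app:

```python
import sqlite3

# In-memory database standing in for our app's OLTP store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com')")
conn.execute("INSERT INTO users VALUES (2, 'Grace', 'grace@example.com')")

# Retrieving a user touches one particular row
row = conn.execute("SELECT * FROM users WHERE id = ?", (1,)).fetchone()

# Deleting a user removes one row
conn.execute("DELETE FROM users WHERE id = ?", (2,))
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Each statement reads or writes whole rows keyed by the primary key, which is exactly the access pattern OLTP databases are optimized for.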

The Need for OLAP

In the early stages, the data in our database may be several hundred megabytes or even dozens of gigabytes. We are going to want to use that data to improve our business outcomes, so we begin hiring data professionals:

  • Data Scientists will create and train machine learning models to help generate predictions that can be used for making decisions, better-targeted advertising, making useful suggestions, and more.
  • Data Engineers are responsible for making sure data is available for data consumers and that its quality can be trusted.

Analytical workloads put different demands on a database than transactional ones:

  • In analytics, you often need only particular fields/columns, but with row-based storage the database loads each entire row before narrowing it down to the columns you need, repeating this for every row.
  • Analytical queries are often more complicated, adding logic to transform, aggregate, and clean data.

OLAP systems such as data warehouses use several techniques to speed up these workloads:

  • Partitioning: the database breaks the data into subdivisions based on the values of certain fields, so for qualifying queries it only needs to search the relevant subdivisions. (Not a good fit for high-cardinality fields with many possible values.)
  • Sorting: sorting the data so that lookups on the sorted fields are faster.
  • Materialization: making a physical copy of a subset of the data. (It can be challenging to keep materialized tables in sync with the original, and consumers need to know to query the materialized table, not the original.)

To get data into the warehouse, we run an ETL (extract, transform, load) process:

  • We extract the data from our transactional systems
  • We run any logic to transform the data into the ideal format for our analytical needs
  • We then load the transformed data into the OLAP data warehouse
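
The row-versus-column point above can be sketched in plain Python; the order data is made up for illustration:

```python
# Toy comparison of row-based vs columnar layouts for an
# analytical query (average order amount).
rows = [
    {"order_id": 1, "customer": "a", "amount": 10.0, "region": "US"},
    {"order_id": 2, "customer": "b", "amount": 30.0, "region": "EU"},
    {"order_id": 3, "customer": "c", "amount": 20.0, "region": "US"},
]

# Row-based: every full row is touched just to read one field.
avg_row_based = sum(r["amount"] for r in rows) / len(rows)

# Columnar: the same data stored column by column; the query
# reads only the one column it needs and skips the rest.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "c"],
    "amount": [10.0, 30.0, 20.0],
    "region": ["US", "EU", "US"],
}
avg_columnar = sum(columns["amount"]) / len(columns["amount"])
```

Both layouts give the same answer, but at warehouse scale the columnar layout reads a fraction of the bytes, which is why OLAP systems favor it.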

Swimming in a Lake of Data

While we now have the performance needed for large-scale analytics, we run into issues with cost. It is generally cost-prohibitive to store a copy of ALL our data in the data warehouse, and some data, particularly unstructured data (emails and other text, audio/video, sensor data), can't be loaded into the data warehouse at all. This leads us to the data lake: low-cost storage where we can keep a copy of all our data, structured or not, in its raw form.

Data Sprawl & Data Drift

Now it is looking like we have three copies of our data:

  • The original data in our source OLTP systems
  • The copy in our data lake
  • The copy in our data warehouse

On top of that, different consumers need different views of the data:

  • Some should have access, but not to certain columns (SSNs, private health information) or rows (customers who have opted out of information use).
  • Some departments need the data to have different column names and other changes that other departments don't need.
  • Marketing often likes to run analytics on sales data by month, so we may do a monthly extract into their data mart, which is actually a physical copy of a subset of the original. (It will be faster to query the subset than the full dataset.)

Every additional copy is another chance for the data to drift out of sync with the original.
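
The data mart extract described above is just a materialized subset; here is a minimal sketch using sqlite3, with hypothetical table names:

```python
import sqlite3

# Toy "data mart": materialize a monthly subset of a sales table
# so Marketing can query the smaller copy instead of the full set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2023-01-05", 100.0), ("2023-01-20", 50.0), ("2023-02-03", 75.0)],
)

# Physical copy of just January's rows: the marketing data mart.
conn.execute(
    "CREATE TABLE sales_jan AS "
    "SELECT * FROM sales WHERE sale_date LIKE '2023-01-%'"
)
jan_total = conn.execute("SELECT SUM(amount) FROM sales_jan").fetchone()[0]
```

Note that `sales_jan` is a snapshot: if new January rows land in `sales` later, the mart is stale until someone re-runs the extract, which is exactly how data drift creeps in.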

The Data Lakehouse Dream

One approach embraced in recent times to simplify all of this is to run more, or even all, of our analytical workloads directly on the data lake instead of the data warehouse:

  • We reduce data drift, as the data lake becomes the single source of truth

Historically, though, running analytics directly on the lake ran into problems:

  • It wasn't easy to use
  • Processing frameworks weren't fast enough to load and query the data
  • It was tricky to track what datasets existed and to control who had access

The Components of a Data Lakehouse

Storage and Storage Format

  • Cloud object storage from providers like AWS, Azure, and Google Cloud is super cheap, making file storage a trivial cost
  • Newer binary columnar file formats such as Apache Parquet make it easy to store data in standard, analytics-optimized ways

Table Formats

We may have our sales data broken down in our object storage into thousands of Parquet files, but when we run analytics we want those files to be recognized as one table. This is the role of table formats like Apache Iceberg, which lets us group files in our data lake into tables and also maintains metadata that processing tools (query engines) can use to increase their performance when scanning those tables.
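
To make the idea concrete, here is a toy sketch of what a table format's metadata accomplishes: a manifest maps many data files to one logical table and records per-file statistics an engine can use to skip files entirely. This illustrates the concept only; it is not Iceberg's actual metadata layout:

```python
# Hypothetical manifest for a "sales" table spread across files.
manifest = {
    "table": "sales",
    "files": [
        {"path": "s3://lake/sales/part-001.parquet",
         "min_date": "2023-01-01", "max_date": "2023-01-31"},
        {"path": "s3://lake/sales/part-002.parquet",
         "min_date": "2023-02-01", "max_date": "2023-02-28"},
    ],
}

def files_to_scan(manifest, query_date):
    # Prune files whose min/max stats prove they cannot
    # contain rows matching the query's date predicate.
    return [
        f["path"]
        for f in manifest["files"]
        if f["min_date"] <= query_date <= f["max_date"]
    ]
```

A query filtering on a single date only has to open the files whose statistics overlap it, which is the kind of scan pruning table-format metadata enables.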

In-Memory Processing and Data Transport

While Parquet is a great file format for analytics, if an engine loads the file into a row-based in-memory format we lose many of its benefits. The Apache Arrow project defines an in-memory columnar format, and Arrow Flight lets us move data from system to system in that format, eliminating the speed hits from serialization/deserialization when loading files or sending data across the wire. Apache Arrow Gandiva also allows query expressions to be compiled to native code prior to execution for even more performance gains.

Openness

All these components are open source, and any engine can incorporate and use them, so you avoid two other problems that come with data warehouses and proprietary formats:

  • Vendor lock-in: once your data is in a proprietary format, migrating it to another platform is slow and costly.
  • Tool lock-out: when new data engines and tools are created, you can't use them, since your data is in a proprietary format.

The Final Piece of the Puzzle

Hearing about all these components sounds great, but nobody wants to set up and configure each of them by hand. What everyone wants is a platform that brings it all together in an easy-to-use package, and that platform is Dremio. With Dremio you can work with the data directly from your data lake. No copies, easy access, high performance.

Performance on the data lake


  • Dremio uses its C3 (Columnar Cloud Cache) technology to help speed up queries on cloud object storage and reduce costs.
  • With Dremio's reflections, you can create reusable materializations that sync automatically and speed up queries across multiple datasets; enabling them is as easy as flipping a switch. You can have sub-second BI dashboards without the need for extracts.

Governance and Collaboration


  • You can control which users and roles can access each space, folder and dataset for easy and granular governance to maintain regulatory compliance and security.
  • Control access to rows and columns using pre-filtered virtual datasets or row/column access policies.
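
A pre-filtered virtual dataset is essentially a view layered over the raw table. Here is a minimal sketch of the idea using sqlite3; the table, column names, and view are illustrative, not Dremio's API:

```python
import sqlite3

# A view that hides a sensitive column (SSN) and filters out
# opted-out rows; consumers query the view, never the raw table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, ssn TEXT, opted_out INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ada", "111-11-1111", 0), ("Grace", "222-22-2222", 1)],
)
conn.execute(
    "CREATE VIEW customers_analytics AS "
    "SELECT name FROM customers WHERE opted_out = 0"
)
visible = conn.execute("SELECT * FROM customers_analytics").fetchall()
```

Because the filtering lives in the view definition, granting access to the view instead of the table enforces the row and column policy for every query automatically.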

Ease of Use

  • Dremio has a straightforward, easy-to-use UI
  • A robust SQL editor for writing the queries you need

Cost Optimization

  • You can auto-scale and granularly control cloud computing power to pay only for what you need, significantly reducing your cloud compute costs compared to other platforms.
  • A free tier with no software or licensing costs to get started, making powerful data analytics tools available to enterprises of all sizes.

Conclusion

Engineering data to unlock the value it can provide has come a long way as we've tried different ways to analyze larger and larger amounts of data. We are now at a point where this all becomes a bit simpler with the data lakehouse architecture, which can address the data engineer shortage while enabling analysts and scientists to do their jobs without long waits for data or the risk of data drift.

Alex Merced is a Developer Advocate for Dremio and host of the Web Dev 101 and Datanation Podcasts.