Big data is defined by the three Vs: velocity - the speed at which the data is generated; variety - the different types of data; and volume - the size of the data. Because of these characteristics, some of the daily tasks of a data engineer are to ensure that pipelines run correctly, that incoming data is stored in the system, and that data is accessible to whoever needs to use it. In this article we are going to dive into the most common strategies for pipelines (ETL - extract, transform, load - and ELT - extract, load, transform) and for storage (data warehouse and data lake).
Data goes through different stages. It starts with creation, where the job is to collect and centralize data generated by various sources. After collecting it, you need to process the data by applying different transformations (formatting, labeling, filtering, and so on). The processed data is then stored in a destination, where it is ready to be used by different visualization applications.
Because of all the stages that data goes through, engineers started to create pipelines. A pipeline is a series of steps where the output of one step is the input of the next, and independent steps can run in parallel. A pipeline has a single direction and contains no cycles, so you can picture it as a directed acyclic graph (DAG). Each step in the pipeline represents an operation applied to the incoming data. The following image shows an example of a pipeline for data collected from sensors.
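To complement the diagram, here is a minimal, framework-free sketch of such a pipeline in Python. The step names, the reading format, and the unit conversion are assumptions made purely for illustration, not part of any particular tool.

```python
# A minimal sketch of a sensor pipeline expressed as a chain of steps:
# each step consumes the output of the previous one and there are no cycles.
# Step names, the reading shape, and the threshold logic are assumptions.

def extract(readings):
    """Collect the raw readings coming from the sensors."""
    return list(readings)

def clean(raw):
    """Drop malformed readings (missing value or sensor id)."""
    return [r for r in raw if r.get("sensor_id") and r.get("value") is not None]

def transform(cleaned):
    """Convert values to a common unit (Fahrenheit to Celsius here) and label them."""
    return [{**r, "value_c": (r["value"] - 32) * 5 / 9, "label": "ok"} for r in cleaned]

def load(processed, sink):
    """Store the processed readings in the destination."""
    sink.extend(processed)
    return sink

if __name__ == "__main__":
    sink = []
    raw = [{"sensor_id": "s1", "value": 98.6}, {"sensor_id": None, "value": 70.0}]
    load(transform(clean(extract(raw))), sink)
    print(sink)
```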
There are two main strategies for ingesting data into a big data platform: batch and streaming. Batch ingestion is used when a large amount of data arrives in the system at once (for example, a set of files). To store this chunk of data, a pipeline is scheduled to run at a suitable interval over the incoming information.
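As a rough illustration, the snippet below sketches a batch job that sweeps a landing folder and loads every file it finds. The folder name, the CSV format, and the in-memory sink are assumptions for the example; in practice the job would be triggered by a scheduler.

```python
# A minimal batch-ingestion sketch: one scheduled run processes every CSV file
# currently sitting in a landing folder. Folder name and format are assumptions.
import csv
from pathlib import Path

def ingest_batch(landing_dir: str, sink: list) -> int:
    """Read every CSV file in the landing folder and append its rows to the sink."""
    count = 0
    for path in Path(landing_dir).glob("*.csv"):
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                sink.append(row)
                count += 1
    return count

if __name__ == "__main__":
    sink = []
    # In a real deployment a scheduler (e.g. cron or an orchestrator) calls this.
    print(f"Ingested {ingest_batch('landing', sink)} rows")
```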
Streaming ingestion, on the other hand, is used when working with real-time data (for example, messages pushed into the system every couple of seconds), typically following a publisher/subscriber design pattern. In this case, the processing pipelines run continuously to ensure that the incoming data is stored in the system.
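The sketch below imitates that pattern with an in-memory queue standing in for a real message broker; the message shape and the sentinel-based shutdown are simplifications made for the example.

```python
# A minimal publisher/subscriber sketch: a queue stands in for a broker topic.
import queue
import threading

broker = queue.Queue()  # stands in for a topic on a real message broker

def publisher(n_messages: int):
    """Push a message into the system each time a sensor emits a reading."""
    for i in range(n_messages):
        broker.put({"sensor_id": "s1", "value": 20 + i})
    broker.put(None)  # sentinel: no more messages (a simplification)

def subscriber(sink: list):
    """Continuously consume incoming messages and store them."""
    while (msg := broker.get()) is not None:
        sink.append(msg)

if __name__ == "__main__":
    sink = []
    t = threading.Thread(target=publisher, args=(5,))
    t.start()
    subscriber(sink)
    t.join()
    print(sink)
```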
| ETL - Extract, Transform, Load | ELT - Extract, Load, Transform |
| --- | --- |
| Data is stored after different transformations have been applied to it | Data is stored directly in the sink; the transformation happens after saving it |
| Commonly used with a data warehouse | Commonly used with a data lake |
| Needs a lower-capacity destination, because the data is already structured and ready to use | Needs a higher-capacity destination, to save all the raw data |
ETL and ELT are both data integration processes used to move data from multiple sources into a single, centralized target data store. The main difference between them is when the data is loaded into the sink and when the transformation step happens. Because of these differences, they are preferred for different use cases.
ETL pipelines are commonly used for ingesting data into a data warehouse, because the raw data can be transformed into a structured format before it is stored. ELT pipelines, on the other hand, are commonly used with cloud-based data lakes, which can take advantage of the storage and processing power offered by the provider.
Extract, Transform, and Load (ETL) is a data integration methodology that extracts raw data from the sources, transforms it, and then loads it into a sink. The method emerged in the 1970s and remains prevalent among on-premise databases with finite memory and processing power.
The extracted data only moves from the processing server to the data warehouse once it has been successfully transformed. This strategy helps when you want to shrink the volume of data that needs to be stored by applying the transformations first.
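A compact, hypothetical ETL example might look like the following, with SQLite standing in for the warehouse; the table name, columns, and cleaning rules are assumptions for illustration.

```python
# A compact ETL sketch: the transformation happens on the processing side and
# only the structured result is loaded into the warehouse (SQLite here).
import sqlite3

def extract():
    """Pull raw records from a source (hard-coded here for the example)."""
    return [{"name": " Alice ", "amount": "10.5"}, {"name": "Bob", "amount": "bad"}]

def transform(raw):
    """Clean and type the data before it ever reaches the warehouse."""
    rows = []
    for r in raw:
        try:
            rows.append((r["name"].strip(), float(r["amount"])))
        except ValueError:
            continue  # discard records that cannot be structured
    return rows

def load(rows, conn):
    """Write only the transformed, structured rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(transform(extract()), conn)
    print(conn.execute("SELECT * FROM sales").fetchall())
```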
A data warehouse is a central data hub used for reporting and analysis. Data in a data warehouse is typically highly formatted and structured for analytics use cases. It’s among the oldest and most well-established data architectures.
The key features of this approach are:
- Data is organized, possibly already transformed, and stored in a structured way.
- Data exists in the warehouse for a defined purpose, in a format where it is ready to be consumed.
- Data may be easier to understand, but less up-to-date.
- Structures are hard to change.
You can imagine a data warehouse like a store where every product is well organized on a shelf, ready to be picked up.
Unlike ETL, Extract, Load, and Transform (ELT) does not require the data to be transformed before the loading process. ELT loads raw data directly into the target data hub instead of waiting for a transformation phase to finish first.
ELT is relatively new, made possible by the emergence of scalable cloud-based data lakes. Cloud providers supply both the storage for raw data repositories and the processing power to run transformations inside the target system.
Although the ELT pattern is not yet used universally, it is becoming more popular as companies adopt cloud infrastructure.
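For contrast with the ETL sketch above, here is a hypothetical ELT flow in which the raw data is loaded first and the transformation runs inside the destination as SQL. SQLite again stands in for the target store, and the table names and cast logic are assumptions.

```python
# A compact ELT sketch: raw, untyped data is loaded as-is, then transformed
# inside the destination afterwards (expressed here as SQL over SQLite).
import sqlite3

raw_records = [("Alice", "10.5"), ("Bob", "not-a-number"), ("Carol", "7")]

conn = sqlite3.connect(":memory:")

# Load: dump the raw data exactly as it arrived.
conn.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_records)

# Transform: runs inside the destination, after the load has finished.
conn.execute("""
    CREATE TABLE sales AS
    SELECT name, CAST(amount AS REAL) AS amount
    FROM raw_sales
    WHERE amount GLOB '[0-9]*'
""")
print(conn.execute("SELECT * FROM sales").fetchall())
```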
Instead of imposing tight structural limitations on data, the idea of a data lake is to dump all of your data, structured and unstructured, into a central location. The key features of this approach are:
- The data lake contains all the raw, unprocessed data, before any kind of transformation or organization.
- Data is more up-to-date, but may require more advanced tools for analysis.
- No structure is enforced, so new types of data can be added at any time.
You can imagine a data lake like a giant pool into which any form of data can be thrown.
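To make that picture concrete, the sketch below "dumps" heterogeneous records into a lake-style folder layout partitioned by source and ingestion date; the directory structure and file naming are assumptions, not a specific product's convention.

```python
# A minimal sketch of writing raw, schema-free data into a lake-style layout.
import json
from datetime import date
from pathlib import Path

def write_to_lake(lake_root: str, source: str, payload) -> Path:
    """Store the payload exactly as it arrived, without enforcing a schema."""
    target_dir = Path(lake_root) / source / date.today().isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / f"part-{len(list(target_dir.iterdir()))}.json"
    target_file.write_text(json.dumps(payload))
    return target_file

if __name__ == "__main__":
    # Any shape of data can land in the lake, structured or not.
    print(write_to_lake("lake", "sensors", {"sensor_id": "s1", "value": 21.3}))
    print(write_to_lake("lake", "clickstream", ["page_view", {"url": "/home"}]))
```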
| Data Warehouse | Data Lake |
| --- | --- |
| Data is highly structured | Data can be in any format (raw or structured) |
| Data is ready to be used for analysis | Data needs to be processed before it can be used |
| Needs a lower disk capacity for storing data | Needs a higher disk capacity for storing data |
| Commonly used for on-premise infrastructure | Commonly used in cloud solutions |
A data warehouse and a data lake are both data storage and management systems, but they have different architectures and are used for different purposes.
A data warehouse satisfies the need for a centralized repository for storing and analyzing structured data, supporting business intelligence (BI) and decision-making.
A data lake, on the other hand, is a centralized repository for storing and processing vast amounts of raw, unstructured data for big data and data science analytics.
The techniques presented here are some of the most common patterns for ingesting and storing data in a system. Each has its pros and cons, and the right choice depends on the needs you have to fulfill. Maybe you have an on-premise system that can't store large amounts of raw data, so you need to apply some transformations before saving it in order to shrink its volume. Or maybe you need all the unprocessed data stored in a central repository so you can process it later, for example when training a machine learning (ML) model.
As you can see, different business scenarios and technical requirements call for different frameworks. Since the choice depends on each individual project, it is always best to analyze your existing infrastructure and business needs to see which of the above best suits your business plan and can help you overcome your data analysis challenges. The first steps in a big data project implementation are auditing and technical consultancy.
With a diploma in Computer Science from the Politehnica University of Bucharest, Erhan Mamut worked as a Full Stack Engineer for three years before embarking on his data engineering journey. He enjoys stepping outside his comfort zone and learning new things, which led him to a new career path in big data, a decision he is thoroughly excited about one year later.