Big data is defined by the three Vs: velocity - the speed at which the data is generated; variety - the different types of data; and volume - the size of the data. Because of these characteristics, some of the daily tasks of a data engineer are to ensure that pipelines run correctly, that incoming data is stored in the system, and that data is accessible to whoever needs to use it. In this article we are going to dive into the most common strategies for pipelines (ETL - extract, transform, load - and ELT - extract, load, transform) and for storage (data warehouse and data lake).
Data goes through different stages. It starts with creation, where the job is to collect and centralize data generated by various sources. After collecting it, you need to process the data by applying different transformations (formatting, labeling, filtering, and so on). The processed data is then stored in a destination, where it is ready to be used by different visualization applications.
Because of all the stages that data goes through, engineers started to create pipelines. A pipeline is a series of steps where the output of one step is the input of the next, and independent steps can run in parallel. A pipeline has a single direction and contains no cycles, so you can picture it as a directed acyclic graph (DAG). Each step in the pipeline represents an operation applied to the incoming data. The following image shows an example of a pipeline for data collected from sensors.
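To complement the diagram, here is a minimal, framework-free sketch of such a pipeline in Python. The step names, the reading format, and the unit conversion are assumptions made purely for illustration, not part of any particular tool.

```python
# A minimal sketch of a sensor pipeline expressed as a chain of steps:
# each step consumes the output of the previous one and there are no cycles.
# Step names, the reading shape, and the threshold logic are assumptions.

def extract(readings):
    """Collect the raw readings coming from the sensors."""
    return list(readings)

def clean(raw):
    """Drop malformed readings (missing value or sensor id)."""
    return [r for r in raw if r.get("sensor_id") and r.get("value") is not None]

def transform(cleaned):
    """Convert values to a common unit (Fahrenheit to Celsius here) and label them."""
    return [{**r, "value_c": (r["value"] - 32) * 5 / 9, "label": "ok"} for r in cleaned]

def load(processed, sink):
    """Store the processed readings in the destination."""
    sink.extend(processed)
    return sink

if __name__ == "__main__":
    sink = []
    raw = [{"sensor_id": "s1", "value": 98.6}, {"sensor_id": None, "value": 70.0}]
    load(transform(clean(extract(raw))), sink)
    print(sink)
```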
There are two main strategies for ingesting data into a big data platform: batch and streaming. Batch ingestion is used when a large amount of data arrives in the system at once (for example, a set of files). To store this chunk of data, a pipeline is scheduled to run at a suitable interval over the incoming information.
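As a rough illustration, the snippet below sketches a batch job that sweeps a landing folder and loads every file it finds. The folder name, the CSV format, and the in-memory sink are assumptions for the example; in practice the job would be triggered by a scheduler.

```python
# A minimal batch-ingestion sketch: one scheduled run processes every CSV file
# currently sitting in a landing folder. Folder name and format are assumptions.
import csv
from pathlib import Path

def ingest_batch(landing_dir: str, sink: list) -> int:
    """Read every CSV file in the landing folder and append its rows to the sink."""
    count = 0
    for path in Path(landing_dir).glob("*.csv"):
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                sink.append(row)
                count += 1
    return count

if __name__ == "__main__":
    sink = []
    # In a real deployment a scheduler (e.g. cron or an orchestrator) calls this.
    print(f"Ingested {ingest_batch('landing', sink)} rows")
```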
Streaming ingestion, on the other hand, is used when working with real-time data (for example, messages pushed into the system every couple of seconds), typically following a publisher/subscriber design pattern. In this case, the processing pipelines run continuously to ensure that the incoming data is stored in the system.
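The sketch below imitates that pattern with an in-memory queue standing in for a real message broker; the message shape and the sentinel-based shutdown are simplifications made for the example.

```python
# A minimal publisher/subscriber sketch: a queue stands in for a broker topic.
import queue
import threading

broker = queue.Queue()  # stands in for a topic on a real message broker

def publisher(n_messages: int):
    """Push a message into the system each time a sensor emits a reading."""
    for i in range(n_messages):
        broker.put({"sensor_id": "s1", "value": 20 + i})
    broker.put(None)  # sentinel: no more messages (a simplification)

def subscriber(sink: list):
    """Continuously consume incoming messages and store them."""
    while (msg := broker.get()) is not None:
        sink.append(msg)

if __name__ == "__main__":
    sink = []
    t = threading.Thread(target=publisher, args=(5,))
    t.start()
    subscriber(sink)
    t.join()
    print(sink)
```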
| ETL - Extract, Transform, Load | ELT - Extract, Load, Transform |
| --- | --- |
| Data is stored after different transformations have been applied to it | Data is stored directly in the sink; the transformation happens after saving it |
| Commonly used with a data warehouse | Commonly used with a data lake |
| Needs a lower-capacity destination, because the data is already structured and ready to use | Needs a higher-capacity destination, to save all the raw data |
ETL and ELT are both data integration processes used to move data from multiple sources into a single, centralized target data store. The main difference between them is when the data is loaded into the sink and when the transformation step happens. Because of these differences, they are preferred for different use cases.
ETL pipelines are commonly used for ingesting data into a data warehouse, because the raw data can be transformed into a structured format before it is stored. ELT pipelines, on the other hand, are commonly used with cloud-based data lakes, which can take advantage of the storage and processing power offered by the provider.
Extract, Transform, and Load (ETL) is a data integration methodology that extracts raw data from the sources, transforms it, and then loads it into a sink. The method emerged in the 1970s and remains prevalent among on-premise databases with finite memory and processing power.
The extracted data only moves from the processing server to the data warehouse once it has been successfully transformed. This strategy helps when you want to shrink the volume of data that needs to be stored by applying the transformations first.
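A compact, hypothetical ETL example might look like the following, with SQLite standing in for the warehouse; the table name, columns, and cleaning rules are assumptions for illustration.

```python
# A compact ETL sketch: the transformation happens on the processing side and
# only the structured result is loaded into the warehouse (SQLite here).
import sqlite3

def extract():
    """Pull raw records from a source (hard-coded here for the example)."""
    return [{"name": " Alice ", "amount": "10.5"}, {"name": "Bob", "amount": "bad"}]

def transform(raw):
    """Clean and type the data before it ever reaches the warehouse."""
    rows = []
    for r in raw:
        try:
            rows.append((r["name"].strip(), float(r["amount"])))
        except ValueError:
            continue  # discard records that cannot be structured
    return rows

def load(rows, conn):
    """Write only the transformed, structured rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(transform(extract()), conn)
    print(conn.execute("SELECT * FROM sales").fetchall())
```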
A data warehouse is a central data hub used for reporting and analysis. Data in a data warehouse is typically highly formatted and structured for analytics use cases. It’s among the oldest and most well-established data architectures.
The key features of this approach are:
- Data is organized, possibly already transformed, and stored in a structured way.
- Data exists in the warehouse for a defined purpose, in a format where it is ready to be consumed.
- Data may be easier to understand, but less up-to-date.
- Structures are hard to change.
You can imagine a data warehouse like a store where every product is well organized on a shelf, ready to be picked up.
Unlike ETL, Extract, Load, and Transform (ELT) does not require the data to be transformed before the loading process. ELT loads raw data directly into the target data hub instead of waiting for a transformation phase to finish first.
ELT is relatively new, made possible by the emergence of scalable cloud-based data lakes. Cloud providers supply both the storage for raw data repositories and the processing power to run transformations inside the target system.
Although the ELT pattern is not yet used universally, it is becoming more popular as companies adopt cloud infrastructure.
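For contrast with the ETL sketch above, here is a hypothetical ELT flow in which the raw data is loaded first and the transformation runs inside the destination as SQL. SQLite again stands in for the target store, and the table names and cast logic are assumptions.

```python
# A compact ELT sketch: raw, untyped data is loaded as-is, then transformed
# inside the destination afterwards (expressed here as SQL over SQLite).
import sqlite3

raw_records = [("Alice", "10.5"), ("Bob", "not-a-number"), ("Carol", "7")]

conn = sqlite3.connect(":memory:")

# Load: dump the raw data exactly as it arrived.
conn.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_records)

# Transform: runs inside the destination, after the load has finished.
conn.execute("""
    CREATE TABLE sales AS
    SELECT name, CAST(amount AS REAL) AS amount
    FROM raw_sales
    WHERE amount GLOB '[0-9]*'
""")
print(conn.execute("SELECT * FROM sales").fetchall())
```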
Instead of imposing tight structural limitations on data, the idea of a data lake is to dump all of your data, structured and unstructured, into a central location. The key features of this approach are:
- The data lake contains all the raw, unprocessed data, before any kind of transformation or organization.
- Data is more up-to-date, but may require more advanced tools for analysis.
- No structure is enforced, so new types of data can be added at any time.
You can imagine a data lake like a giant pool into which any form of data can be thrown.
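To make that picture concrete, the sketch below "dumps" heterogeneous records into a lake-style folder layout partitioned by source and ingestion date; the directory structure and file naming are assumptions, not a specific product's convention.

```python
# A minimal sketch of writing raw, schema-free data into a lake-style layout.
import json
from datetime import date
from pathlib import Path

def write_to_lake(lake_root: str, source: str, payload) -> Path:
    """Store the payload exactly as it arrived, without enforcing a schema."""
    target_dir = Path(lake_root) / source / date.today().isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / f"part-{len(list(target_dir.iterdir()))}.json"
    target_file.write_text(json.dumps(payload))
    return target_file

if __name__ == "__main__":
    # Any shape of data can land in the lake, structured or not.
    print(write_to_lake("lake", "sensors", {"sensor_id": "s1", "value": 21.3}))
    print(write_to_lake("lake", "clickstream", ["page_view", {"url": "/home"}]))
```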
| Data Warehouse | Data Lake |
| --- | --- |
| Data is highly structured | Data can be in any format (raw or structured) |
| Data is ready to be used for analysis | Data needs to be processed before it can be used |
| Needs a lower disk capacity for storing data | Needs a higher disk capacity for storing data |
| Commonly used for on-premise infrastructure | Commonly used in cloud solutions |
A data warehouse and a data lake are both data storage and management systems, but they have different architectures and are used for different purposes.
A data warehouse satisfies the need for a centralized repository for storing and analyzing structured data, supporting business intelligence (BI) and decision-making.
A data lake, on the other hand, is a centralized repository for storing and processing vast amounts of raw, unstructured data for big data and data science analytics.
The techniques presented here are some of the most common patterns for ingesting and storing data in a system. Each has its pros and cons, and the right choice depends on the needs you have to fulfill. Maybe you have an on-premise system that can't store large amounts of raw data, so you need to apply some transformations before saving it in order to shrink its volume. Or maybe you need all the unprocessed data stored in a central repository so you can process it later, for example when training a machine learning (ML) model.
As you can see, different business scenarios and technical requirements call for different frameworks. Since the choice depends on each individual project, it is always best to analyze your existing infrastructure and business needs to see which of the above best suits your business plan and can help you overcome your data analysis challenges. The first steps in a big data project implementation are auditing and technical consultancy.
With a diploma in Computer Science from the Politehnica University of Bucharest, Erhan Mamut worked as a Full Stack Engineer for three years before embarking on his data engineering journey. He enjoys stepping outside his comfort zone and learning new things, which led him to a new career path in big data, a decision he is thoroughly excited about one year later.