What Is Data Engineering?

In the age of big data, companies collect and process vast quantities of data relating to their customers and internal processes. This data is usually scattered and unstructured, and therefore hard to analyse.

Data engineering is the process of building data pipelines to clean, store and organise that data in a structured format and making it accessible to data scientists and other teams within the organisation. These pipelines are subject to throughput, cost and scalability constraints.

[Figure: a diagram depicting a data pipeline]

What Does a Data Engineer Do?

Data engineers are like city planners. They have a bird's-eye view of the journey of every piece of data within the company. They understand where data comes in, where it is stored and where it needs to go. They use this knowledge to design and build systems that ensure the right pieces of data reach the right team, at the right time, in the right format.

They work on a wide variety of tasks to accomplish this. A typical data engineer job description will have the following responsibilities:

  • Building pipelines that ingest data and route it between different parts of the system
  • Ensuring that data reaches the desired location
  • Designing data storage solutions that are cost-effective, performant, and able to support analytics
  • Ensuring that software applications can scale to handle huge volumes of data
  • Ensuring that data is accessible to those who need it
  • Ensuring compliance with data privacy and security regulations
  • Ensuring data quality
  • Building ETL (Extract, Transform, Load) pipelines to process, clean, or move the data
  • Designing and maintaining databases
  • Writing high-quality code
  • Keeping up to date with new technology to avoid building on legacy technology
  • Interacting with stakeholders including management, data scientists, and DevOps engineers
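
To make the ETL responsibility above more concrete, here is a minimal sketch of an Extract, Transform, Load pipeline using only the Python standard library. The CSV contents, the `orders` table and the function names are all hypothetical, chosen for illustration. A production pipeline would typically read from files, APIs or message queues and load into a managed database rather than an in-memory SQLite instance.

```python
import csv
import io
import sqlite3

# Hypothetical raw export. Note the stray whitespace, inconsistent casing
# and a missing amount, which are typical of uncleaned source data.
RAW_CSV = """order_id,customer,amount
1, alice ,12.50
2,BOB,
3,carol,7.25
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount and normalise the remaining fields."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a structured table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
load(transform(extract(RAW_CSV)), conn)
stored = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(stored)
```

Keeping the three stages as separate functions is a common design choice: each stage can be tested, retried or replaced on its own, which matters more as the pipeline grows.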

Working with databases is a fundamental part of being a data engineer, since most data is simple enough to be stored in a structured table. Today, this usually means working with the cloud, as far fewer companies run their own servers on-premises than in the past.

Beyond databases, varied types of data from multiple sources are aggregated in raw format into data lakes. This raw data is then formatted before being sent to a data warehouse, which gives other teams a more accessible way to query it.

Overall, the responsibility is about writing code and using cloud services to develop back-end features and infrastructure.

An Example of Data Engineering

Let's think through an example to see where data engineers have had an impact in your daily life.

Imagine you walk into your local Tesco and buy baby food using your Clubcard. Later that day, you visit the Tesco website and are magically recommended wipes and other baby products. How did that happen?

It's not magic. When you made that purchase in-store, the data from the sales register was uploaded to the cloud in real time and aggregated with the rest of your shopping behaviour data, using the Clubcard to identify you. Your shopping history was formatted into features that were routed to the recommendation systems built by the data science team. Finally, the recommendations from that system were directed to your home page when you visited the website.

If you consider that this process needs to run for millions of customers a day, you start to understand how challenging it can be to build performant and robust data pipelines at scale.
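
The step where shopping history is "formatted into features" can be sketched roughly as follows. This is an illustrative assumption, not a description of how any retailer actually does it: the event fields (`card_id`, `category`, `price`) and the `build_features` function are hypothetical, and a real system would read events from a stream or warehouse rather than a Python list.

```python
from collections import Counter, defaultdict

# Hypothetical purchase events, keyed by a loyalty-card ID.
events = [
    {"card_id": "c-001", "category": "baby food", "price": 3.20},
    {"card_id": "c-001", "category": "baby food", "price": 2.80},
    {"card_id": "c-002", "category": "coffee", "price": 4.50},
    {"card_id": "c-001", "category": "nappies", "price": 9.00},
]


def build_features(events):
    """Aggregate raw purchase events into per-customer features."""
    category_counts = defaultdict(Counter)
    total_spend = defaultdict(float)
    for event in events:
        card_id = event["card_id"]
        category_counts[card_id][event["category"]] += 1
        total_spend[card_id] += event["price"]
    return {
        card_id: {
            "category_counts": dict(category_counts[card_id]),
            "total_spend": round(total_spend[card_id], 2),
        }
        for card_id in category_counts
    }


features = build_features(events)
print(features["c-001"])
```

A recommendation model could then consume these per-customer summaries instead of the raw event log. The data engineer's job is to keep them fresh and correct for every customer.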

When Is a Data Engineer Needed?

Data engineers are needed when the amount of data you have outgrows basic data storage or processing solutions. This is why they are also sometimes referred to as Big Data Engineers.

Beyond just storage, data engineering is the bedrock of scalable data analytics systems. If you have billions of data points, and you want to answer a simple question such as "What is the average purchase price this month?", you can't just run a regular old Python script. Firstly, all of that data won't fit in your computer memory. And even if you had a computer big enough, there are drastically more efficient ways to run such operations, like using Spark to distribute the computation across a cluster of computers.
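
To give a feel for why this works, here is a minimal single-machine sketch of the idea behind distributed aggregation. It is not Spark itself. The data is split into chunks, each chunk is reduced to a small partial result (a sum and a count), and the partial results are combined at the end. The function names and the synthetic price data are hypothetical. In a real cluster, each chunk would live on a different machine and the partial results would be sent over the network.

```python
from functools import reduce


def chunked(iterable, size):
    """Yield successive lists of at most `size` items, one chunk at a time."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def partial_aggregate(chunk):
    """Summarise one chunk as a (sum, count) pair."""
    return sum(chunk), len(chunk)


def combine(left, right):
    """Merge two (sum, count) pairs into one."""
    return left[0] + right[0], left[1] + right[1]


# Synthetic prices produced lazily, so the full dataset is never held in memory.
prices = (float(p) for p in range(1, 1_000_001))

partials = (partial_aggregate(chunk) for chunk in chunked(prices, 10_000))
total, count = reduce(combine, partials, (0.0, 0))
average = total / count
print(average)
```

The key property is that `combine` only needs the small (sum, count) pairs, never the raw data, so the expensive work can happen wherever each chunk is stored.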

These kinds of problems appear more frequently as the amount of data grows. If a data science team wants to ask questions like this, they'll need a data engineer. If a machine learning engineer wants to deploy a model that makes predictions on streaming data, they will need a data engineer too. In fact, there is a lot of overlap between data engineering and ML engineering.

Skills and Tools Used by Data Engineers

Don't assume that because you know Python you have all the skills you need to be a data engineer. When you start dealing with big data, there are some rather advanced tools that are ubiquitous in industry AI and data use cases that you should know about. Of course, the toolkit always depends on the problem, but here are some of the key tools you should be aware of.

  • Apache Airflow
  • Apache Spark Core
  • Apache Spark Streaming
  • Apache Kafka
  • Apache Hive
  • Web requests and APIs
  • Cloud services from providers such as AWS or Azure
  • Git & GitHub
  • NoSQL databases
  • SQL databases
  • Python

The stack of tools used varies across companies and even across applications within the same companies. For an example of a real industry system, check out this breakdown of Uber's Ad Processing System.

How Much Does a Data Engineer Make?

When starting out, junior data engineers in the UK can expect a salary of at least £35,000. With a few years of experience, mid-level data engineer salaries are around £67,500. If you stay in the industry, keep building on your skills and gain experience managing teams, senior positions pay upwards of £87,500.

What Is the Demand for Data Engineers?

Data engineers are in high demand: data engineering is the most in-demand role in the AI stack. As data and AI have continued to touch more industries, it's no longer just tech companies hiring data engineers. Companies in sectors ranging from accounting to waste management need skilled engineers to make use of the vast quantities of data at their disposal. In fact, the number of open roles citing data engineering doubled between 2019 and 2021.

With demand growing that fast, talent is not able to keep up. A huge talent shortage means data engineers can take their pick of companies to work for and can demand high salaries, especially as they move into senior positions. Data engineers are in such short supply that the median salary has been increasing by approximately 5% per year.

Data Engineer vs. Data Scientist vs. Machine Learning Engineer - What’s the Difference?

In the current state of the industry, there is a bit of overlap between data science and data engineering as they are both built on top of the same essential tools of software development such as Python, Git, and the command line. Both roles are going to have to do some level of data cleaning and processing. But the similarities end there.

Data scientists interpret the data. They are often concerned with building proofs of concept (PoCs), which they might then hand over to a machine learning engineer to put into production, with the help of a data engineer.

Data engineering actually has a larger overlap with ML engineering, as both are concerned with storing, accessing, and processing data in some way. As a result, both roles make significant use of the cloud and other data processing tools, and have to consider real-world constraints such as latency, throughput and cost.

How To Become A Data Engineer

Data engineers do not need strong maths skills, so there is less stigma against coming from unconventional, even non-technical, backgrounds than in a field like data science. The key requirements for any data engineer are to be analytical and to have strong problem-solving skills.

There is no single path to becoming a data engineer. However, the fastest way we know to launch your career in data engineering is through the AiCore programme.

The 18-week programme delivers the most industry-informed, hands-on education in data engineering. You will learn from established experts how production-grade data engineering systems work, then gain experience by building systems that are currently deployed at companies such as Uber and Pinterest.

Launch Your Data Engineering Career Today With AiCore!

Are you considering becoming a data engineer and trying to figure out if it's right for you? Maybe you are certain you want to become a data engineer and are trying to find the fastest route into this lucrative career. In either case, the AiCore team would be delighted to help you. Book a 15-minute call here.
