The opportunity to get involved with putting AI into real products is huge. And guess what, data scientist isn’t the only role available! I spoke to over 50 industry experts from companies like Twitter, Amazon & ASOS who are doing just that to find out exactly what they’re ACTUALLY looking for. This is what I learnt 👇👇👇
Putting AI into the real world requires all of the following:
- Data engineers
- Data scientists
- Data analysts (who I will only touch upon briefly)
- ML Engineers
- MLOps Engineers
- Project managers (who I don’t focus on here)
In this article, I'll explain the most technical roles in depth, share what tools they're using and try to give an overall idea of what it takes for companies to put AI into the real world. I’ll write about data analysts and project managers later, but for now this is where I think the exciting stuff is.
Let's first take a look at an overview of what these different roles might include.
Note: in reality the boundaries between roles are fuzzy
Larger companies tend to have more clearly defined boundaries between roles. At early-stage companies, one person often covers all of these roles, which saves on salary costs and management overhead. As companies scale up, however, systems become too complex for one person to manage, and specialists are increasingly required for each part of the AI pipeline.
Before we get into the details of each of those roles, we should mention some things which are essential across all of them.
Python is the native language for working in AI, and programming in it will be necessary for anyone in an AI team (some teams accept Scala and R, but this is far less common and is falling out of favour). Data scientists will write code in Jupyter Notebooks and will often be required to write tests using Python libraries like pytest to ensure things work before they get deployed.
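As a rough illustration of the kind of test mentioned above, here is a minimal sketch. The function `normalise_prices` and its behaviour are hypothetical, invented purely for this example. pytest collects any function whose name starts with `test_` and reports a failure if an `assert` inside it fails.

```python
# Hypothetical helper (not from any real codebase): scale prices into [0, 1].
def normalise_prices(prices):
    lowest, highest = min(prices), max(prices)
    if highest == lowest:
        # Avoid dividing by zero when every price is identical.
        return [0.0 for _ in prices]
    return [(p - lowest) / (highest - lowest) for p in prices]


# pytest collects functions named test_* and runs the assertions inside them.
def test_normalise_prices_spans_zero_to_one():
    result = normalise_prices([10.0, 20.0, 30.0])
    assert result == [0.0, 0.5, 1.0]


def test_normalise_prices_handles_constant_input():
    assert normalise_prices([5.0, 5.0]) == [0.0, 0.0]
```

Saved in a file such as `test_prices.py` (an assumed name), these tests would be picked up by running `pytest` in the same directory.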
AI is built by teams of teams, so communication is important. Everyone will need to use the command line, especially for Git and GitHub, so that they can checkpoint their code and collaborate with others. It's also surprisingly popular to pair program production code (at companies like ASOS & Thomson Reuters, for example), where one person writes code whilst both talk through what they need to write and why. This gives programmers the chance to learn from more senior team members, share ideas and collaborate on projects. Being able to present and communicate complex ideas & concepts to your team, and potentially to non-technical stakeholders, is important too.
Another key skill for anyone developing software is being able to find and read the online documentation that explains how 3rd party code and tools work. It sounds obvious, but because many schools don't teach this explicitly, many people don't realise that programming is largely about getting good at looking up what you don't know.
Using APIs and making requests (using Python or Bash) is another key skill for understanding how you can get access to data and use all kinds of 3rd party tools. As more and more of the technology stack moves to the cloud, it becomes imperative to be confident with at least one leading cloud provider (Amazon Web Services (AWS), Google Cloud Platform (GCP) or Microsoft Azure), at least for basic services like file and database storage, building APIs & virtual compute. I’ll dig into the most commonly used of these in the following sections. Being able to use a cloud provider means being familiar with the graphical console (the cloud provider’s website), the software development kit (SDK) (probably a Python library) and the command line interface (CLI) (the thing you use from the terminal).
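To give a feel for what "making a request" looks like in Python, here is a minimal sketch using only the standard library. The URL, the `user_id` field and the bearer-token header are all assumptions made up for illustration, so the request is built but deliberately not sent.

```python
import json
import urllib.request


def build_recommendation_request(user_id, api_key):
    """Build (but do not send) a JSON POST request to a fictional API."""
    payload = json.dumps({"user_id": user_id}).encode("utf-8")
    return urllib.request.Request(
        url="https://api.example.com/recommendations",  # fictional endpoint
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_recommendation_request(user_id=42, api_key="not-a-real-key")
# With a real endpoint, urllib.request.urlopen(request) would send it.
```

Every real API documents its own URLs, fields and authentication, which is one reason reading documentation is such a core skill.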
Now let's dive deeper into each of the key roles and take a look at what they consist of, and what tools they use.
Data engineers handle ingesting, storing and preparing data as well as making sure that it can be accessed by other teams (mainly data scientists).
A key part of any AI project is data - you need data for your AI system to learn how to make predictions. Once the data has been collected, data engineers need to decide how to store it so that multiple teams (particularly data scientists) can access it. Selecting the best setup requires understanding the formats that the data comes in, which boils down to what problem you’re tackling. Complex (“unstructured”) data like images, video or audio is best stored in a “data lake” built on a service like AWS S3, which provides file-like storage in the cloud (you wouldn't want a column in your table for each pixel colour, so it's best to just store the file). It’s also common to dump raw data, regardless of its type, into a data lake, as you may not know which parts of the data are important, or what problems you will use it to tackle, until later. Tabular ("structured") data, which has meaningful rows and columns, is best stored in a database; when that database is built for analytics, it is known as a data warehouse. Traditionally, this data has been stored in a central database like PostgreSQL but, as I mentioned earlier, it is becoming the de facto standard to store it in the cloud, using a managed database service like AWS RDS or a cloud data warehouse like GCP BigQuery.
Almost all companies use SQL to query this data and understand what it tells them. For very large datasets (too big to fit in a single computer's memory), companies use Spark (or PySpark, its Python API) for distributed computing and data processing, which can allow for massive speedups through parallelisation. Databricks is a cloud tool that makes it easy to get started with Spark by setting up the compute resources in the background, and it is becoming a popular tool for data engineering (and to a lesser extent AI - more later). Databricks can also sit on top of AWS S3 and allows for data versioning - keeping track of data as it changes over time. This is important for reproducing results (“what data did I use?”) and interpretability (“what data caused X?”). Snowflake is another 3rd party database service that is growing in popularity, but it is certainly not as widely used as the competing AWS or GCP database services.
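For readers who have not written SQL before, here is a small sketch of the kind of query involved. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for a real warehouse. The `sales` table and its columns are invented for the example.

```python
import sqlite3

# In-memory SQLite database standing in for a real data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, units INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("trainers", 3, 50.0),
        ("trainers", 2, 50.0),
        ("boots", 1, 120.0),
    ],
)

# Total revenue per product, highest first.
query = """
    SELECT product, SUM(units * price) AS revenue
    FROM sales
    GROUP BY product
    ORDER BY revenue DESC
"""
revenue_by_product = conn.execute(query).fetchall()
conn.close()

print(revenue_by_product)  # [('trainers', 250.0), ('boots', 120.0)]
```

The same `SELECT ... GROUP BY` pattern carries over to cloud databases, although each one has its own connection library and SQL dialect.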
Data scientists are concerned with understanding the data by visualising it, building models that explain it and evaluating the quality of those models.
Data scientists need to be able to confidently use SQL to query data. They need to be comfortable using the Python library pandas for data cleaning, preprocessing, feature engineering & data manipulation in general. If they’re working with big data, then PySpark is the go-to tool. Being able to visualise data is also important for exploratory data analysis (EDA), and a Python library like Plotly is usually used for this.
A key method for understanding data is to use machine learning to build models. These “models” are simply mathematical functions that take in some data and make a prediction from it, e.g. predicting the best price for a shoe given the historical sales and product data. They tell you about the underlying structure of the data (unsupervised models) or the relationship between its features and some target you want to predict (supervised models). A key skill for any data scientist is to understand when to use ML and, more importantly, when NOT to use it.
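To make the idea of a model as "just a function" concrete, here is a minimal sketch that fits a straight line by ordinary least squares in plain Python. The discount and sales figures are made up. In practice you would more likely reach for a library such as scikit-learn than write this by hand.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations (unnormalised).
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: discount (%) against units sold.
discounts = [0.0, 10.0, 20.0, 30.0]
units_sold = [100.0, 120.0, 140.0, 160.0]

slope, intercept = fit_line(discounts, units_sold)


def predict(discount):
    # The trained "model" is nothing more than this function of its input.
    return slope * discount + intercept
```

Here "training" means finding the `slope` and `intercept` that best fit the data, and "prediction" means calling `predict` on an input the model has not seen, such as `predict(40.0)`.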
The supervised machine learning models most widely used in industry include linear regression, logistic regression & decision trees. Popular unsupervised learning techniques include Principal Component Analysis (PCA) & K-means clustering. Before you apply these kinds of machine learning to a problem, you need to be very comfortable with the intuition & mathematical theory of how machine learning works. Being able to use the right model in the right situation, and being able to debug and tune it so that it works, depends on having a solid mathematical understanding of what it does and how. The key mathematical areas to be confident with are 1) linear algebra, 2) statistics and 3) calculus.
More and more often, deep learning models (neural networks) are replacing traditional machine learning models. The main deep learning framework used in industry is TensorFlow. However, PyTorch is gaining popularity and is leading in terms of the rate of development of new features, which are also being readily adopted by industry. Most companies use TensorBoard to track their deep learning experiments, which allows them to view a model's performance in various ways during the training process. AWS SageMaker provides a cloud-hosted Jupyter notebook connected to cloud computers (as powerful as you ask for) and is commonly used by data scientists for training models. Another consistently valuable skill is building upon the work of others. Taking open-source pretrained ML models and applying them as an “off the shelf” solution can help to make things work quickly in the real world. A related (less commonly applied) technique is transfer learning: fine-tuning a pretrained deep learning model to a specific use case. Beyond building the models, data scientists need to understand how to evaluate them - what metrics to use, what statistical significance means and how to run A/B tests.
Data scientists need to be able to interact with other people across the stack. They sit between data engineers, who provide them with access to the data; ML engineers, who will optimise their proofs of concept (POCs) and put them into production; data analysts, who will use their work to create dashboards and find insights into the problem; and stakeholders, to whom they will present technical approaches and results. Again, this emphasises how important data visualisation is for clear communication.
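As one example of the evaluation step mentioned above, here is a sketch of a two-proportion z-test, a common way to check whether the difference between two variants in an A/B test is statistically significant. The visitor and conversion numbers are invented, and this particular test is my choice for illustration rather than a method the article prescribes.

```python
import math


def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(
        pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)
    )
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up experiment: variant B converts 130/1000 against A's 100/1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
```

A small p-value suggests the difference is unlikely to be down to chance alone. Where to draw the line (0.05 is a common convention) is a decision to make before running the experiment.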
ML engineers take the models of data scientists and handle the practical considerations like training/prediction efficiency, scaling & deployment.
The first step in handling these practical concerns is making sure that code runs efficiently. This means accelerating the training and prediction of models through GPU parallelisation. To this end, it's useful to be able to use AWS EC2 (Elastic Compute Cloud) to rent high-powered servers in the cloud for faster training. To track and run experiments at a large scale, tools like Neptune or MLflow are used. MLflow is also used for “registering” models in a “store” so that they can be tracked and moved in and out of production.
ML engineers are designers of end-to-end solutions. They decide how to connect data to models to customers. This, as you might imagine, can heavily involve connecting different cloud services. The system you need to architect depends on the problem you’re trying to solve: different problems have different speed, accuracy, cost, security and load constraints, to name a few. Despite this, there are a few major questions you can ask to group the types of systems. Here I’ll just ask one, which leads to a major branch of solutions and allows me to talk about a highly popular set of tools.
Do you need real-time predictions? If so, you’d be making “online” predictions; otherwise, you’d be making “offline” predictions. For example, product recommendations need to be made in real time, literally whilst the user is scrolling the page - this requires online predictions. Pricing products, however, might need to happen only once a month, after which all the prices will be updated. This is an offline job. Online predictions are typically deployed by creating an API (a service running on a server that knows how to respond to requests made over the internet, like “what products should I recommend to this user?”).
There are many ways to build an API. And yes, you guessed it, there’s a cloud service for that. But you could also build one yourself. In reality, the best solution is somewhere along a spectrum between these two options. At one end, you have services which are fully managed by a 3rd party (like AWS) but have limited flexibility. At the other end is your custom-built solution, which you need to design, manage and maintain, but which offers endless customisation. The best choice depends on your problem.
AWS SageMaker is one of the fully managed services you can turn to for a simple solution that scales. On top of being used for training, as I mentioned earlier, it can also be used for deploying models by exposing them through an API endpoint (a URL which requests can be made to). SageMaker is attempting to be a one-stop shop for integrating ML into your product, but in reality there are often too many constraints, which push companies to develop their own custom solutions.
A basic way to deploy your own fully custom ML API would be to serve predictions through a Flask API running inside a Docker container on an AWS EC2 instance. Although all of the right ingredients are there, this simple architecture won’t scale to handle the large volume of traffic that some companies serve. Kubernetes comes to the rescue. It is a widely used tool which allows you to scale such applications up and down according to demand. I’ve found that most custom ML infrastructure is built on Kubernetes.
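To show roughly what the prediction API at the heart of such a setup looks like, here is a minimal sketch. To keep it dependency-free it uses the standard library's WSGI interface rather than Flask (Flask apps are themselves WSGI applications). The `/predict` route, the JSON fields and the stand-in `predict_price` "model" are all assumptions for illustration.

```python
import io
import json


def predict_price(features):
    # Hypothetical stand-in for a trained model's predict() call.
    return 20.0 + 0.5 * features["historical_sales"]


def app(environ, start_response):
    """A WSGI application exposing the model at POST /predict."""
    if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != "/predict":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [json.dumps({"error": "not found"}).encode("utf-8")]

    length = int(environ.get("CONTENT_LENGTH") or 0)
    features = json.loads(environ["wsgi.input"].read(length))
    body = json.dumps({"price": predict_price(features)}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def call_locally(payload):
    """Call the app in-process, the way a WSGI server would, with no network."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "PATH_INFO": "/predict",
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    statuses = []
    chunks = app(environ, lambda status, headers: statuses.append(status))
    return statuses[0], json.loads(b"".join(chunks))


status, response = call_locally({"historical_sales": 10})
```

For local experimentation, `wsgiref.simple_server.make_server` from the standard library can serve `app` over HTTP. A production deployment would put it behind a proper server inside the Docker container.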
Between “whole solution” services like SageMaker and fully custom solutions built on Kubernetes, there are many other options. AWS provides a huge variety of cloud services, which can be connected together to build end-to-end AI solutions. Common services used in these kinds of pipelines include API Gateway (for building and securing APIs) and AWS Lambda (for highly available compute without constantly running or managing a server). There are countless other companies offering specific solutions that tie in with other tools too. Choosing the best tools for the job and stitching them together is often the responsibility of ML engineers.
At this point, you’ve got a working system, but there are many extra cherries that can go on top. One of them is monitoring: making sure you are able to keep track of how the system is performing when it’s deployed. How many predictions are we making? What kind of predictions are we making? Are things working? Is our model drifting away from the data? Do we need to retrain our model? Prometheus and Grafana are popular tools which can help answer questions like this. Prometheus is a monitoring tool that is gaining popularity for pulling in metrics from different parts of your system. Grafana can then be used to produce graphs and dashboards for visualising that data.
MLOps engineers are concerned with improving the processes for deploying ML models used in the real world, and with ensuring their reliability.
Building a successful AI product is fundamentally different to traditional software engineering because you can’t guarantee that your system will continue to work, even if the code never changes. To win, you need to continuously optimise some metric (like revenue) by making good predictions that help the end-user. As the world changes, the data changes and the best solutions change too (different trends = different solutions). This is known as data drift.
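One simple way to put a number on drift is the Population Stability Index (PSI), which compares how a feature is distributed in the training data with how it is distributed in live data. The sketch below is a plain-Python illustration under assumed bin edges and made-up prices. PSI is only one of several drift measures and is my choice for the example.

```python
import math


def population_stability_index(expected, actual, bin_edges):
    """Compare two samples of one feature; 0 means identical binned shapes."""

    def bin_fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # The number of edges at or below v gives the bin index.
            index = sum(1 for edge in bin_edges if v >= edge)
            counts[index] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected_frac = bin_fractions(expected)
    actual_frac = bin_fractions(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_frac, actual_frac)
    )


# Made-up prices seen at training time versus in production.
training_prices = [10, 12, 15, 18, 20, 22, 25, 30]
live_prices = [20, 25, 28, 30, 35, 38, 40, 45]

no_drift = population_stability_index(training_prices, training_prices, [15, 25])
drift = population_stability_index(training_prices, live_prices, [15, 25])
```

Larger values mean the live data looks less like the training data. The score at which to raise an alarm is a judgment call for each team.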
As you can probably see by now, putting ML into real products requires lots of things to work together. Pipelines from data to model deployment can easily contain dozens of steps. As the world changes, we’ll need to go through this pipeline again to produce new solutions in the form of new models trained on new data. Running each of these steps manually just isn’t going to work at scale. This is where MLOps comes in. MLOps = ML + DevOps.
This process of making sure that every step of the pipeline happens in the right order, automatically, and with nothing breaking (or retrying in the case that it does), is called orchestration. Airflow and Kubeflow Pipelines are popular tools which enable you to create automated orchestration pipelines.
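The core idea of orchestration (run steps in dependency order, retry on failure) can be sketched in a few lines. This toy runner uses `graphlib` from the standard library (Python 3.9+). The step names and the simulated transient failure are invented, and real tools like Airflow add scheduling, logging, distributed execution and much more.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies, max_retries=2):
    """Run each step after its dependencies, retrying a failed step."""
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(max_retries + 1):
            try:
                steps[name]()
                break
            except Exception:
                # Give up only once the retry budget is spent.
                if attempt == max_retries:
                    raise
        completed.append(name)
    return completed


attempts = {"train": 0}


def ingest():
    pass


def train():
    # Simulate a transient failure on the first attempt only.
    attempts["train"] += 1
    if attempts["train"] == 1:
        raise RuntimeError("transient failure")


def deploy():
    pass


steps = {"ingest": ingest, "train": train, "deploy": deploy}
# Each key lists the steps that must finish before it can start.
dependencies = {"train": {"ingest"}, "deploy": {"train"}}

order = run_pipeline(steps, dependencies)
```

Declaring dependencies separately from the steps themselves is the same pattern orchestration tools use to describe a pipeline as a graph.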
But what if you want to change some part of your application code? Do you have to stop the system, replace something and start it up again manually? You guessed it, there’s a better way. MLOps engineers are also responsible for setting up continuous integration and continuous deployment (CI/CD) through a service like GitHub Actions, where every time new code is added to the main branch, it triggers a redeployment.
I’ll mention one final thing to add a final layer of complexity into your ML stack: continuous training (so now we have CI/CD/CT). This is the automatic retraining of your model based upon some criteria (e.g. surpassing some data drift or performance decrease threshold) which kicks off earlier processes in your pipeline. Maybe I’m going overboard with the automation here though… A lot of the time you can just use a cron job for this.
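A retraining trigger of the kind described above can be as simple as a threshold check. In this sketch the function name, the two thresholds and the use of accuracy as the performance metric are all assumptions. `drift_score` stands for whatever drift measure your monitoring produces.

```python
def should_retrain(
    drift_score,
    current_accuracy,
    baseline_accuracy,
    drift_threshold=0.25,    # assumed value; tune for your own system
    max_accuracy_drop=0.05,  # assumed value; tune for your own system
):
    """Return True when drift or a performance drop crosses its threshold."""
    drifted = drift_score > drift_threshold
    degraded = (baseline_accuracy - current_accuracy) > max_accuracy_drop
    return drifted or degraded
```

A scheduled job could call this check and kick off the training pipeline whenever it returns `True`.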
Once the system is tested, running, scaling and retraining automatically, we have a pretty mature ML system in place.
In summary, it takes a wide range of skills, and a huge variety of tools to put AI into real products. It's hard! And while the roles are yet to be clearly defined, we’re starting to see these emerge as the industry matures.
AiCore exists to bridge the gap between industry demands and your interest in AI, and the advice which we got from our friends in industry to inform this memo is what our entire new curriculum is based upon. We hope that this helps you to launch your career in AI, and if you’d like to join a group of passionate people looking to do the same thing, supported by a world class faculty, you should apply to join the Fellowship today.