Data Engineering and MLOps

This course gives students the skills and knowledge needed to implement MLOps and deployment practices in a business or industry setting. Students will learn about different types of databases and how to use them in ML projects, techniques for processing large datasets with Big Data workflows, and how to build and deploy APIs for machine learning models. These skills are essential for organizations that rely on machine learning to improve their products or services, for example a retail company personalizing customer recommendations or a healthcare organization improving patient diagnosis and treatment.

Students will also learn about using modern tooling for experiment tracking, packaging and deploying machine learning models, and setting up pipelines for continuous integration and deployment. These tools are critical for ensuring the quality and reliability of machine learning models in production environments. By the end of the course, students will be able to confidently apply these techniques to improve the speed and reliability of their machine learning projects.

Course lectures will be complemented by guest presentations with industry examples of how MLOps technologies are used in different businesses.

Session Overview

Databases in ML projects

  • Session 1: Introduction to databases for ML. Overview of different database types.

  • Session 2: Working with SQL databases.

  • Session 3: Working with NoSQL databases, such as TinyDB.

  • Exercise 1: Building a PoC with a DB backend + Group Portfolio Assignment

Big Data workflows

  • Session 4: Introduction to Big Data workflows using Spark.

  • Industry case 1: Guest lecture TBC.

  • Exercise 2: Performing a Big Data workflow with Spark + Group Portfolio Assignment

From notebook to API

  • Session 5: Code refactoring for production.

  • Session 6: API workflows with FastAPI.

  • Industry case 2: Guest lecture TBC.

  • Exercise 3: API workflows + Group Portfolio Assignment

MLOps with MLflow

  • Session 7: Introduction to MLOps with MLflow.

  • Session 8: Model deployment with MLflow.

  • Industry case 3: Guest lecture TBC.

  • Exercise 4: MLOps workflows + Group Portfolio Assignment

Packaging and deployment

  • Session 9: Introduction to packaging and deployment.

  • Session 10: Introduction to Docker and deploying scalable ML.

  • Industry case 4: Guest lecture TBC.

  • Exercise 5: Packaging and deployment + Group Portfolio Assignment

Literature

  • John, M. M., Olsson, H. H., & Bosch, J. (2021, September). Towards MLOps: A framework and maturity model. In 2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA) (pp. 1-8). IEEE.

  • Calefato, F., Lanubile, F., & Quaranta, L. (2022, September). A preliminary investigation of MLOps practices in GitHub. In Proceedings of the 16th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (pp. 283-288).

  • Mäkinen, S., Skogström, H., Laaksonen, E., & Mikkonen, T. (2021, May). Who needs MLOps: What data scientists seek to accomplish and how can MLOps help? In 2021 IEEE/ACM 1st Workshop on AI Engineering-Software Engineering for AI (WAIN) (pp. 109-112). IEEE.

  • Openja, M., Majidi, F., Khomh, F., Chembakottu, B., & Li, H. (2022). Studying the Practices of Deploying Machine Learning Projects on Docker. arXiv preprint arXiv:2206.00699.

  • Granlund, T., Kopponen, A., Stirbu, V., Myllyaho, L., & Mikkonen, T. (2021, May). MLOps challenges in multi-organization setup: Experiences from two real-world cases. In 2021 IEEE/ACM 1st Workshop on AI Engineering-Software Engineering for AI (WAIN) (pp. 82-88). IEEE.

  • How Netflix works: the (hugely simplified) complex stuff that happens every time you hit Play (2017)

  • Bowles, M. (2019). Machine Learning with Spark and Python: Essential Techniques for Predictive Analytics. John Wiley & Sons.

  • Made with ML (2023)

  • SQLite vs TinyDB (2021)

  • How to Handle Large Datasets in Python (2022)

  • What is MLOps (2019)

  • MLOps Best Practices for Machine Learning Model Development, Deployment, and Maintenance (2022)

  • Build and Run a Docker Container for your Machine Learning Model (2021)

  • [Step-by-Step MLflow Implementations](https://medium