Group assignment 2

Portfolio Exercise 2:

Introduction

Welcome to the Apache Spark or Polars group assignment! In this assignment, you will take the dataset from your previous project (or another large dataset of your choice) and carry out the exploratory data analysis (EDA) with Apache Spark or Polars. The goal is to give you hands-on experience in processing and analyzing big data with these tools. You will learn how to load and manipulate data and how to perform various transformations and actions on it. This will give you a better understanding of how Spark and Polars work and how they can be used to handle large datasets efficiently. Good luck and have fun!

Task

Select one of your previous projects or a large dataset of your choice (for example, one with millions of rows) and use either Apache Spark or Polars to produce a complete EDA report. Perform the following tasks:

Load the dataset into the platform and perform basic exploratory data analysis (EDA) to understand the structure of the data. This includes checking the dimensions of the dataset, examining the data types, and identifying missing values.
Filter the data to include only the relevant observations. This can be done by removing missing values or filtering based on certain criteria or conditions that are specific to your dataset.
Aggregate the data to obtain summary statistics or metrics. Combine operations such as filter(), select(), and groupby() (spelled groupBy() in Spark and group_by() in recent Polars releases) to run several different aggregations. For example, you can calculate the mean, median, or mode of certain variables to gain a better understanding of the data.
Join the data with another dataset (if available) to perform more complex analysis. This can be done by merging two tables using a common key.
Visualize the data using the plotting support available in Spark (for example, through the pandas API on Spark) or any other visualization tool of your choice. You can plot a histogram, scatter plot, line plot, or any other type of chart that is relevant to your dataset. This will help you identify trends and patterns in the data and gain insight into the relationships between variables.

Data

Select one of your previous projects or a large dataset of your choice, such as a dataset with millions of rows.

Delivery

Create a GitHub repository (or reuse and adapt an existing one).
Save the Colab notebook in the GitHub repository.
Provide a readme.md with a brief description.
Submissions can be made in groups of up to 3.
Submit by emailing a link to the repository to Hamid (hamidb@business.aau.dk), with Daniel and Roman in cc (dsh@…, roman@…).