FIT5202 Data processing for Big Data

Monash University
Assignment 2A: Analysing Flight-delays Data (Preprocessing, Visualization, Data Modelling)
Due Date: Sunday, 26th September 2021, 11:00 pm. Weight: 10% of the total mark
Background
The flight-delays prediction dataset has become a popular dataset in the aviation industry for predicting delays from historical flight data. Learning from data can benefit companies such as airlines, helping them minimise delays and improve customer satisfaction. Insight into the data can be obtained through several steps, including pre-processing, visualization, and data modelling.
The datasets are unchanged from Assignment 1. The details of the dataset are given below.
Required datasets (available in Moodle):
– A compressed file flight-delays.zip.
– This zip file consists of 21 csv files and a metadata file:
o 20 flight*.csv files
o airports.csv (we do not use it)
o metadata.pdf
– Note that in this assignment, the original flights.csv has been sliced and
reduced into several csv files in order to accommodate the hardware
limitations.
– The complete dataset also can be downloaded publicly at
https://www.kaggle.com/usdot/flight-delays
Additional packages allowed in this assignment:

– numpy, pandas, scipy, and matplotlib
– If you are unsure or need to use other packages, please consult the tutors or ask on the Ed Forum
Information on Dataset
The flight-delays and cancellation data was collected and published by the U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics. It records flights operated by large air carriers and tracks the on-time performance of domestic flights. The data summarises various flight information, such as the number of on-time, delayed, cancelled, and diverted flights, as published in DOT's monthly reports for 2015.
Assignment Information
This assignment consists of two parts:
● Part 1: Data Loading, Cleaning, Labelling, and Exploration
● Part 2: Feature Extraction and ML Training
In Assignment 1, our focus was on manipulating data using Spark RDD, Spark DataFrames, and Spark SQL. In this assignment, we only use Spark SQL, and you need to carry out the further pre-processing, visualization, and data modelling steps. The data modelling task aims to create machine learning models using the MLlib/ML APIs to predict both departure and arrival delays on the testing data as classification tasks. You will need to build models to predict the classes.
The class labels are generated from the range of departure and arrival delay values. To define the categorical label in the dataset, you need to conduct a labelling task: define thresholds that split the data into two or more category labels. For binary classification, a threshold can be the mean or median of the departure/arrival delay values. For example, instances whose arrival delay value lies between the minimum and the median are labelled 1, representing a late condition, otherwise 0, representing a not-late condition. In multiclass classification, the labels can be defined by splitting the range into several parts (e.g. using a bin size), where each part has its own label.
Some parts of the assignment are open questions, so you might take a different approach for each step. For example, in the multiclass labelling task, you can define how to categorize the data based on the range of arrival and departure delay values. Likewise, you may define your own steps in the Spark ML transformers.
Getting Started
● Download the dataset flight-delays.zip from Moodle.
● There is no template for this assignment; please organize your answers in order so that they are easy to mark.
● You will be using Python 3+ and PySpark 3+ for this assignment.
1. Data Loading, Cleaning, Labelling, and Exploration (45%)
In this section, you will prepare the data (loading and cleaning), perform data exploration, and label the data.
1.1 Data loading (5%)
1. Write the code to create a SparkSession object, which tells Spark how to access a cluster. To create a SparkSession you first need to build a SparkConf object that contains information about your application. Give an appropriate name to your application and run Spark locally with as many working processors as logical cores on your machine.
2. Read the 20 "flight*.csv" files into a single Spark DataFrame named flightsRawDf using spark.read.csv with both the header and inferSchema attributes set to True. Display the total number of rows in flightsRawDf.
3. Obtain the list of columns from flightsRawDf and name this variable as allColumnFlights.
1.2 Data cleaning (15%)
1. Check for missing values (NaN and Null) in all columns, display the number of missing values for each column. Do you see any pattern in how values are missing in the data? E.g. You can take the top two columns with the highest percentage of missing values and examine if the values are missing completely at random?
2. Remove some columns and rows in flightsRawDf using threshold values. Please do the following tasks:
a. Write a python function to automatically obtain the names of all columns in flightsRawDf whose number of missing values is greater than x percent of the number of rows (x is a threshold in percent, e.g. x=10). These columns are deemed unworthy due to the abundance of missing values. An example of the function is as follows:
b. Once the list variable removedColumns is obtained, write a python function named eliminate_columns so that flightsRawDf is updated by removing the columns listed in the removedColumns variable. An example of the function is as follows:
Then, display the number of rows and columns in flightsRawDf.
c. Drop rows with Null and Nan values from flightsRawDf. Please name it as
flightsDf. Display the number of rows and columns in flightsDf.

1.3 Data Labelling (15%)
1. Generate labels in flightsDf using predetermined values as follows:
a. The new binary labels are generated from the arrival delay and departure delay columns for the classification task. The new column names are binaryArrDelay and binaryDeptDelay, generated from arrival delay and departure delay respectively. Label the data as 1 (late) if the delay value is positive, otherwise label it as 0 (not late).
b. The new multiclass labels are generated from the arrival delay and departure delay columns for the classification task. The new column names are multiClassArrDelay and multiClassDeptDelay, generated from arrival delay and departure delay respectively. Please create three classes, early, on time, and late, represented by the values 0, 1, and 2. For example: a value below 5 is regarded as early, 5 to 20 as on time, and above 20 as late.
2. Auto-labelling flightsDf using a function
a. This is the same task as the multiclass labelling task 1b above. However, in this task please write a python function to execute the data labelling automatically (Task 1b uses a hardcoded method). You may determine the range for each category (early, on time, and late) yourself (e.g. using a bin size). Make sure to study the distribution of the data and comment on why you think your choice of bin size is appropriate, in not more than 200 words.
1.4 Data Exploration / Exploratory Analysis (10%)
1. Show the basic statistics (count, mean, stddev, min, max, 25th and 75th percentiles) for all columns of flightsDf. Observe the mean value of each column in flightsDf. You may see that a column like 'AIRLINE' does not have any mean value, or that a column has a mean value equal to zero (0). Keep in mind that this analysis is needed later to determine the features selected for building the machine learning models. You can restrict your analysis to numerical columns only.
2. For categorical columns, identify the total number of unique categories, and the name and frequency of the most frequent category in the data.
3. Write code to display a histogram (only for the binary labels) for each of the following:
a. Percentage of flights that arrive late each month.
b. Percentage of flights that arrive late each day of the week.
c. Percentage of delayed flights by airline.
2. Feature extraction and ML Training (55%)
In this section, you will need to use PySpark DataFrame functions and ML packages for data preparation, model building, and evaluation. Using other ML packages such as scikit-learn will receive zero marks. Excessive usage of Spark SQL is discouraged.
2.1 Discuss the feature selection and prepare the feature columns (10%)
1. Related to question 1.4.1, discuss which features are likely to be feature columns and which ones are not. Further, obtain the correlation table (use the default 'pearson' correlation) showing the correlation between features and labels. How would you analyze the correlation for categorical columns? For the discussion, please provide 300-500 words of description.
2. Write the code to create the analytical dataset consisting of relevant columns based on your discussion above.
2.2 Preparing any Spark ML Transformers/Estimators for features and models (10-15%)
1. Write code to create Transformers/Estimators for transforming/assembling the columns you selected above in 2.1.1.
2. Bonus task: (5%)
Create a Custom Transformer that allows you to map Months to Seasons. Note that the Custom Transformer must be usable within a pipeline. E.g. Months 3 to 5 are mapped to "Spring", Months 6 to 8 to "Summer", Months 9 to 11 to "Autumn", and Months 12, 1, and 2 to "Winter".
3. Create ML model Estimators for the Decision Tree and Gradient Boosted Tree models for binary classification for both arrival and departure delays (PLEASE DO NOT fit/transform the data yet).
4. Create ML model Estimators for Naive Bayes model for multiclass classification for both arrival and departure delays from the labels that you have created at task 1.3.2 (PLEASE DO NOT fit/transform the data yet).
5. Write code to include the above Transformers/Estimators into pipelines for all tasks (PLEASE DO NOT fit/transform the data yet).
2.3 Preparing the training and testing data (5%)
1. Write code to randomly split the data into an 80 percent and 20 percent proportion as training and testing data. You may use a seed for reproducibility. This training and testing data will be used for model evaluation in all tasks in 2.4.
2.4 Training and evaluating models (30%)
For three use cases below, please follow the instructions.
1. Binary classification task (Using Decision Tree and Gradient Boosted Tree) for both arrival and departure delay classification (4 models).
a. Write code to use the corresponding ML Pipelines to train the models on the training data.
b. For both models and for both delays, write code to display the count of each combination of late/not late label and prediction label in the format shown in Fig.1 (example).
c. Compute the AUC, accuracy, recall, and precision for the late/not late label from each model testing result using pyspark MLlib/ML APIs.
d. Discuss which metric is more appropriate for measuring the model performance on predicting late/not late events, in order to provide good recommendations.
e. Discuss which is the better model, and persist the better model.
f. Write code to print out the leaf node splitting criteria and the top-3 features with each corresponding feature importance. Describe the result in a way that it could be understood by your potential users.
g. Discuss the ways the performance can be improved for both classifiers (500-700 words in total)
Fig.1. Format output for binary classification task (example)
2. Multiclass classification task (using Naive Bayes) for only arrival delay classification (1 model).
a. Write code to use the corresponding ML Pipelines to train the model on the training data. This label is automatically generated from task 1.3.2.
b. Write code to display the count of each combination of early/on-time/late label and prediction label in the format shown in Fig.2 (example).
c. Compute the AUC, accuracy, recall, and precision for the early/on-time/late labels from the Naive Bayes model testing result using the pyspark MLlib/ML APIs.
d. Discuss which metric is more appropriate for measuring the model performance on predicting early/on-time/late events, in order to provide good recommendations.
e. Discuss the ways the performance can be improved for the Naive Bayes classifier (500-700 words in total)
Fig.2. Format output for multiclass classification task (example)

Assignment Marking
The marking of this assignment is based on the quality of the work that you have submitted, rather than just quantity. The marking starts from zero and goes up based on the tasks you have successfully completed and their quality: for example, how well the submitted code follows programming standards, code documentation, presentation of the assignment, readability of the code, reusability of the code, and organisation of code.
Submission Requirements
You should submit your final version of the assignment solution online via Moodle. You must submit the following:
● A PDF file (created from the notebook) to be submitted through the Turnitin submission link. Use the browser's print function to save the notebook as PDF. Please name this pdf file based on your authcate name (e.g. psan002.pdf)
● A zip file of your Assignment-2A folder, named based on your authcate name (e.g. psan002.zip). This should be a ZIP file and not any other kind of compressed folder (e.g. .rar, .7zip, .tar). Please do not include the data files in the ZIP file. Your ZIP file should only contain Assignment-2A.ipynb
Where to Get Help
You can ask questions about the assignment in the Assignments section of the Ed Forum, accessible from the unit's Moodle page. This is the preferred venue for assignment clarification-type questions. You should check this forum regularly, as the responses of the teaching staff are "official" and can constitute amendments or additions to the assignment specification. You can also visit the consultation sessions if your questions are still unresolved.

Plagiarism and Collusion
Plagiarism and collusion are serious academic offences at Monash University. Students must not share their work with any other students. Students should consult the policy linked below for more information.
https://www.monash.edu/students/academic/policies/academic-integrity
See also the links under Academic Integrity Resources in the Assessments block on Moodle.
Students involved in collusion or plagiarism will be subject to disciplinary penalties, which can include:
● The work not being assessed
● A zero grade for the unit
● Suspension from the University
● Exclusion from the University
Late submissions
Late Assignments or extensions will not be accepted unless you submit a special consideration form. ALL Special Consideration, including within the semester, is now to be submitted centrally. This means that students MUST submit an online Special Consideration form via Monash Connect. For more details please refer to the Unit Information section in Moodle.
There is a 10% penalty per day, including weekends, for late submission.