Coding Task
COMP0058 – Machine Learning Scientist
Background
Ensemble methods are a cornerstone of machine learning, where multiple models are trained to solve the same problem and combined to get better results. They are designed to improve the robustness and accuracy of Machine Learning algorithms.
Your task is to demonstrate your understanding and ability to implement ensemble techniques to solve a classification problem.
Task Description
You are given a synthetic dataset train.csv which contains anonymized features X1, X2, …, X20 and a binary target variable Y. Your task is to train Machine Learning models for predicting the target variable Y. You should start with some base models, and then use ensemble techniques to combine the base models to get better results. You are also provided with a validation.csv dataset, which you should use to validate the performance of your models. You’ll need to write a report explaining what you did with the dataset and the models, and your reasoning.
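A first look at the data could be sketched as follows. This is only a minimal sketch: it assumes pandas is installed and that the target column is named Y as described above. The helper name `summarize` is hypothetical, and the small demo frame is made-up stand-in data, not the real train.csv.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "Y") -> dict:
    """Collect basic facts worth checking before any modelling."""
    features = [c for c in df.columns if c != target]
    return {
        "n_rows": len(df),
        "n_features": len(features),
        # Missing-value counts, restricted to columns that have any.
        "missing": df.isna().sum().loc[lambda s: s > 0].to_dict(),
        # Share of each class; a skewed split is one reason to look at F1
        # in addition to accuracy.
        "class_balance": df[target].value_counts(normalize=True).to_dict(),
    }


# Stand-in data for illustration only; in the task this would be
# pd.read_csv("train.csv").
demo = pd.DataFrame(
    {"X1": [0.1, 0.4, None, 0.9], "X2": [1.0, 2.0, 3.0, 4.0], "Y": [0, 1, 1, 1]}
)
summary = summarize(demo)
print(summary)
```

Running the same helper on both train.csv and validation.csv is one way to check that the two files share columns and a similar class balance.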
The task is divided into the following steps:
1. Data Exploration: Load the dataset and perform data analysis to understand the data.
2. Base Model Development: Develop some simple base models of your choice. You should use the training set for training the models, and use the validation set to report their performance. You can use standard machine learning libraries like scikit-learn, TensorFlow or PyTorch.
3. Ensemble Model Development: Develop one or more ensemble models. Compare the results with the base models, and explain the results.
4. Code Documentation and Report: Your code should be well-documented with comments explaining your reasoning and approach. You should also write a report about the approaches you take, the reasons for your design choices, and an explanation of why certain models perform better than others.
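Steps 2 and 3 could be sketched roughly as below. This is one possible setup under stated assumptions, not a recommended solution: the three base models, the soft-voting ensemble, the hyperparameters, and the helper name `evaluate` are all illustrative choices, and `make_classification` stands in for train.csv and validation.csv so that the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 20 features and a binary target, split into
# a training part and a validation part. Replace with the provided CSVs.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Three deliberately simple base models (illustrative choices).
base_models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "nb": GaussianNB(),
}

# Soft voting averages the predicted class probabilities of the base models.
# VotingClassifier fits clones, so the originals above are left untouched.
ensemble = VotingClassifier(estimators=list(base_models.items()), voting="soft")


def evaluate(model, X_tr, y_tr, X_va, y_va):
    """Fit on the training split and score on both splits."""
    model.fit(X_tr, y_tr)
    scores = {}
    for split, (X_s, y_s) in {"train": (X_tr, y_tr), "val": (X_va, y_va)}.items():
        pred = model.predict(X_s)
        scores[f"{split}_acc"] = accuracy_score(y_s, pred)
        scores[f"{split}_f1"] = f1_score(y_s, pred, pos_label=1)
    return scores


results = {
    name: evaluate(model, X_train, y_train, X_val, y_val)
    for name, model in {**base_models, "soft_vote": ensemble}.items()
}
for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Other ensemble techniques, such as bagging, boosting, or stacking, would slot into the same `evaluate` loop, which keeps the comparison with the base models like for like.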
You should report both accuracy and F1 score of your models. You should report their performance on both the training set and the validation set, as a way of understanding overfitting. You can assume “1” in the target is the positive label.
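For reference, both metrics can be written out from the confusion counts with 1 as the positive label: accuracy is the share of correct predictions, and F1 is 2·TP / (2·TP + FP + FN). The plain-Python sketch below is only an illustration; the function name is hypothetical, and returning 0.0 for F1 when there are no true or predicted positives is an assumed convention. In practice a library routine such as scikit-learn's `accuracy_score` and `f1_score` would normally be used.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the positive class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # assumed convention when undefined
    return accuracy, f1


# Made-up labels: most rows are negative, so accuracy looks better than F1.
acc, f1 = accuracy_and_f1([1, 1, 1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 1])
print(f"Accuracy: {acc:.3f}  F1: {f1:.3f}")  # Accuracy: 0.625  F1: 0.400
```

A large gap between the training and validation values of either metric is the overfitting signal the comparison is meant to expose.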
A test.csv is withheld from all participants and will be used to evaluate the performance of your best model. The result will be used, together with other factors, to rank all the participants.
Deliverables
1. All your source code, tests, and scripts. The source code should be well documented.
2. A requirements.txt file to help us install all the dependencies in your project. We will run `pip install -r requirements.txt` in your project.
3. A detailed report discussing your approaches, the results, any challenges you faced and how you overcame them. The report should also include any assumptions you made while tackling the problem. You should report the performance of all your models, on both the training set and the validation set.
4. A test.py script that can be run to evaluate the performance of your best model (which should be pickled and submitted as part of the deliverable package). We will run `python3 test.py test.csv` to run the script, and it should output the results in the following format:
Accuracy: 0.886
The test file is in the same format as the validation file, so you can use the validation file to make sure the test script is working.
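A test.py along these lines might look like the sketch below. Several details are assumptions rather than requirements of the task: the pickled model is assumed to be stored as `model.pkl` next to the script, to expose a scikit-learn style `predict` method, and to have been trained on the same feature columns as the CSV, and the target column is assumed to be named Y.

```python
import pickle
import sys

import pandas as pd

MODEL_PATH = "model.pkl"  # assumed file name for the pickled best model
TARGET = "Y"              # target column, as in train.csv and validation.csv


def evaluate(model, df: pd.DataFrame) -> float:
    """Return the fraction of rows whose prediction matches the target column."""
    X = df.drop(columns=[TARGET])
    y = df[TARGET]
    predictions = pd.Series(model.predict(X), index=y.index)
    return float((predictions == y).mean())


def main(csv_path: str) -> None:
    with open(MODEL_PATH, "rb") as fh:
        model = pickle.load(fh)
    accuracy = evaluate(model, pd.read_csv(csv_path))
    # Same layout as the required output line, three decimal places.
    print(f"Accuracy: {accuracy:.3f}")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python3 test.py test.csv", file=sys.stderr)
    else:
        main(sys.argv[1])
```

Running `python3 test.py validation.csv` before submission is a quick check that the pickle loads in a fresh environment created from requirements.txt.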
Please submit the deliverables in a package by email no later than 11:59am GMT (noon) on Sunday March 3rd 2024.
• You are free to use any additional libraries or tools you find useful.
• Please don’t confine your report to the approaches that worked well; include all the approaches you tried, with an explanation of why the unsuccessful ones didn’t perform well. Your reasoning and explanation are as important as the model performance.