QBUS2820 Assignment 1 (30 marks)
August 23, 2024
1 Background
Developing a predictive model for building heating load is essential in energy efficiency management. Suppose you work for an energy efficiency consulting firm, and your task is to optimize the heating system operations of buildings by predicting their daily heating load requirements.
The variable HeatingLoad in the dataset HeatingLoad_training.csv represents the daily energy required to maintain comfortable indoor temperatures in buildings. This data includes several predictors that influence heating load, such as building characteristics, environmental conditions, and occupancy. The response variable and covariates are detailed in the table below.
Variable | Description |
---|---|
HeatingLoad | Total daily heating energy required (in kWh) |
BuildingAge | Age of the building (in years) |
BuildingHeight | Height of the building (in meters) |
Insulation | Insulation quality (1 = Good, 0 = Poor) |
AverageTemperature | Average daily temperature (in °C) |
SunlightExposure | Solar energy received per unit area (in W/m²) |
WindSpeed | Wind speed at the building’s location (in m/s) |
OccupancyRate | Proportion of the building that is occupied (percentage) |
Table 1: Description of Variables
Your task is to develop a regression model to predict HeatingLoad based on these covariates. Additionally, you are provided with the dataset HeatingLoad_test_without_HL.csv, which is the real test dataset HeatingLoad_test.csv with the HeatingLoad column removed. The test dataset HeatingLoad_test.csv (not provided) has the same structure as the training data HeatingLoad_training.csv.
1.1 Test Error
To measure prediction accuracy, please use the mean squared error (MSE) on the test data. Let $\hat{y}_i$ be the prediction of $y_i$, where $y_i$ is the $i$-th HeatingLoad value in the test data. The test error is computed as follows:
Test error = $\frac{1}{n_{\text{test}}}\sum_{y_i \in \text{test data}}(\hat{y}_i - y_i)^2$
where $n_{\text{test}}$ is the number of observations in the test data.
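As a minimal sketch of the test error defined above, the MSE can be computed in a few lines of plain Python. The function name `compute_test_error` and the example numbers are illustrative assumptions and are not part of the assignment materials.

```python
from typing import Sequence


def compute_test_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error: the average of (y_pred_i - y_true_i)^2."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up values for illustration only (not taken from the assignment data).
y_true = [10.0, 12.0, 15.0]
y_pred = [11.0, 12.0, 13.0]
test_error = compute_test_error(y_true, y_pred)
print(round(test_error, 4))  # prints 1.6667
```

Here the squared errors are 1, 0, and 4, so the MSE is 5/3, which rounds to 1.6667 at four decimal places.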
2 Submission Instructions
- Please submit THREE files (or more if necessary) via the Canvas site:
  - A document file named SID_Assignment1_document.pdf, reporting your data analysis procedure and results. You should replace “SID” with your student ID.
  - A Python file named SID_Assignment1_implementation.ipynb that implements your data analysis procedure and produces the test error. You may submit additional files if needed, following the format SID_Assignment1_.
  - A CSV file SID_Assignment1_HL_prediction.csv containing the predictions of HeatingLoad for the dataset HeatingLoad_test_without_HL.csv. This CSV file should have only one column, named HeatingLoad, which holds the predicted values.
- Regarding your document file SID_Assignment1_document.pdf:
  - Detail your data analysis procedure: how the Exploratory Data Analysis (EDA) was conducted, the methods/predictors used, and the reasoning behind them. The description should be thorough enough for other data scientists in your field to understand and replicate the task. All numerical results should be reported to four decimal places.
  - Present relevant graphs and tables clearly and appropriately.
  - The page limit is 15 pages, including everything: appendices, computer output, graphs, tables, etc.
- The Python file must be written using Jupyter Notebook, assuming all necessary data files (HeatingLoad_training.csv and HeatingLoad_test.csv) are in the same folder as the Python file.
  - The Python file SID_Assignment1_implementation.ipynb must include the following code in the last code cell:

        import pandas as pd
        HeatingLoad_test = pd.read_csv("HeatingLoad_test.csv")
        # YOUR CODE HERE: code that produces the test error test_error
        print(test_error)

    The marker expects to see the same test error you would obtain if you were provided with the complete test data. The file should contain enough explanations for the marker to run your code.
  - Use only the methods covered in the lectures and tutorials. You are free to use any Python libraries to implement your models as long as they are publicly available.
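The single-column prediction file described in the submission list could be written along the following lines. This is a sketch using only Python's standard `csv` module; the helper name `write_prediction_csv` and the placeholder predictions are assumptions made for illustration, and "SID" in the file name still needs to be replaced with your student ID.

```python
import csv
from typing import Iterable


def write_prediction_csv(predictions: Iterable[float], path: str) -> None:
    """Write predictions to a CSV file with a single column named HeatingLoad."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["HeatingLoad"])  # header row: the only column name
        for value in predictions:
            writer.writerow([value])  # one predicted value per row


# Placeholder predictions for illustration; in practice these would come from
# the fitted model applied to HeatingLoad_test_without_HL.csv.
example_predictions = [21.5, 18.25, 30.0]
write_prediction_csv(example_predictions, "SID_Assignment1_HL_prediction.csv")
```

If the predictions are already held in a pandas DataFrame with a single HeatingLoad column, calling `to_csv` with `index=False` gives an equivalent single-column file.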
3 Marking Criteria
This assignment is worth 30 marks in total, with 18 marks allocated to the content of SID_Assignment1_document.pdf and 12 marks to the Python implementation. The marking breakdown is as follows:
- Prediction accuracy: Your test error will be compared against the smallest test error among all submissions, including the teaching team's.
  - The marker first runs SID_Assignment1_implementation.ipynb.
    - If the file runs smoothly and produces a test error, up to 12 marks will be awarded based on prediction accuracy relative to the smallest MSE and the appropriateness of your implementation.
    - If the marker cannot run SID_Assignment1_implementation.ipynb or if no test error is produced, partial marks (maximum 4) may be awarded based on the appropriateness of the file.
- Report described in SID_Assignment1_document.pdf: Up to 18 marks are allocated based on:
  - The appropriateness of the chosen prediction method.
  - The detail, discussion, and explanation of your data analysis procedure. See the Marking Criteria for more details.
- CSV File Submission: Up to 2 marks will be deducted if you fail to upload the CSV file in the correct format.
4 Errors
If you believe there are errors in this assignment, please contact the teaching team.