
QBUS2820 Assignment 1 (30 marks)

August 23, 2024

1 Background

Developing a predictive model for building heating load is essential in energy efficiency management. Suppose you work for an energy efficiency consulting firm, and your task is to optimize the heating system operations of buildings by predicting their daily heating load requirements.

The variable HeatingLoad in the dataset HeatingLoad_training.csv represents the daily energy required to maintain comfortable indoor temperatures in buildings. This data includes several predictors that influence heating load, such as building characteristics, environmental conditions, and occupancy. The response variable and covariates are detailed in the table below.

Variable	Description
`HeatingLoad`	Total daily heating energy required (in kWh)
`BuildingAge`	Age of the building (in years)
`BuildingHeight`	Height of the building (in meters)
`Insulation`	Insulation quality (1 = Good, 0 = Poor)
`AverageTemperature`	Average daily temperature (in °C)
`SunlightExposure`	Solar energy received per unit area (in W/m²)
`WindSpeed`	Wind speed at the building’s location (in m/s)
`OccupancyRate`	Proportion of the building that is occupied (percentage)

Table 1: Description of Variables

Your task is to develop a regression model to predict HeatingLoad based on these covariates. Additionally, you are provided with the dataset HeatingLoad_test_without_HL.csv, which is the real test dataset HeatingLoad_test.csv with the HeatingLoad column removed. The test dataset HeatingLoad_test.csv (not provided) has the same structure as the training data HeatingLoad_training.csv.

1.1 Test Error

To measure prediction accuracy, please use mean squared error (MSE) on the test data. Let ŷ_i be the prediction of y_i, where y_i is the i-th HeatingLoad in the test data. The test error is computed as follows:

Test error = $\frac{1}{n_{\text{test}}}\sum_{y_i \in \text{test data}}(\hat{y}_i - y_i)^2$

where n_test is the number of observations in the test data.
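
As an illustration of the formula above, the following is a minimal sketch of the MSE computation using NumPy. The function name `mean_squared_error` and the example values are assumptions made for illustration only; they are not part of the assignment data or of any required interface.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between predictions and observed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    # (1 / n_test) * sum over i of (y_hat_i - y_i)^2
    return float(np.mean((y_pred - y_true) ** 2))

# Made-up values for illustration only (not taken from the assignment data).
y_example = [10.0, 12.0, 15.0]
y_hat_example = [11.0, 12.0, 13.0]
example_error = mean_squared_error(y_example, y_hat_example)
```

Here the prediction errors are 1, 0 and -2, so the squared errors sum to 5 and the MSE is 5/3.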

2 Submission Instructions

Please submit THREE files (or more if necessary) via the Canvas site:
- A document file named SID_Assignment1_document.pdf, reporting your data analysis procedure and results. You should replace “SID” with your student ID.
- A Python file named SID_Assignment1_implementation.ipynb that implements your data analysis procedure and produces the test error. You may submit additional files if needed, following the format SID_Assignment1_.
- A CSV file SID_Assignment1_HL_prediction.csv containing the predictions of HeatingLoad for the dataset HeatingLoad_test_without_HL.csv. This CSV file should have only one column, named HeatingLoad, which holds the predicted values.
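
The single-column CSV format described in the last item can be produced with pandas roughly as sketched below. The helper name `predictions_to_csv` and the example values are hypothetical; the round-trip through an in-memory buffer is only there to check the layout.

```python
import io
import pandas as pd

def predictions_to_csv(predictions, path_or_buffer):
    """Write predictions as a CSV with one column named HeatingLoad."""
    frame = pd.DataFrame({"HeatingLoad": list(predictions)})
    # index=False prevents pandas from writing the row index as a second column
    frame.to_csv(path_or_buffer, index=False)
    return frame

# Round-trip check on made-up values, using an in-memory buffer instead of a file.
buffer = io.StringIO()
predictions_to_csv([21.5, 18.25, 30.0], buffer)
buffer.seek(0)
check = pd.read_csv(buffer)
```

In the notebook, the same helper could be called with a file name, for example `predictions_to_csv(y_hat, "SID_Assignment1_HL_prediction.csv")` with "SID" replaced by your student ID, where `y_hat` stands for your own predictions.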
Regarding your document file SID_Assignment1_document.pdf:
- Detail your data analysis procedure: how the Exploratory Data Analysis (EDA) was conducted, the methods/predictors used, and the reasoning behind them. The description should be thorough enough for other data scientists in your field to understand and replicate the task. All numerical results should be reported to four decimal places.
- Present relevant graphs and tables clearly and appropriately.
- The page limit is 15 pages, including everything: appendices, computer output, graphs, tables, etc.
The Python file must be a Jupyter Notebook and must assume that all necessary data files (HeatingLoad_training.csv and HeatingLoad_test.csv) are in the same folder as the notebook.
- The Python file SID_Assignment1_implementation.ipynb must include the following code in the last code cell:
```
import pandas as pd
HeatingLoad_test = pd.read_csv("HeatingLoad_test.csv")
# YOUR CODE HERE: code that produces the test error test_error
print(test_error)
```
The marker expects to see the same test error you would obtain if you were provided with the complete test data. The file should contain enough explanations for the marker to run your code.
- Use only the methods covered in the lectures and tutorials. You are free to use any Python libraries to implement your models as long as they are publicly available.
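
To illustrate how a final cell might produce `test_error`, the following is a rough sketch that fits ordinary least squares on a training frame and evaluates MSE on a test frame. It is not a recommended or expected model: the function name `ols_test_error`, the choice of predictors, and the synthetic data are all assumptions made so that the example runs without the assignment files.

```python
import numpy as np
import pandas as pd

def ols_test_error(train, test, predictors, response="HeatingLoad"):
    """Fit least squares on train and return the MSE on test."""
    # Design matrices with an intercept column prepended
    X_train = np.column_stack(
        [np.ones(len(train)), train[predictors].to_numpy(dtype=float)]
    )
    X_test = np.column_stack(
        [np.ones(len(test)), test[predictors].to_numpy(dtype=float)]
    )
    y_train = train[response].to_numpy(dtype=float)
    y_test = test[response].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    y_hat = X_test @ beta
    return float(np.mean((y_hat - y_test) ** 2))

# Synthetic, noise-free data for illustration only (not the assignment data).
rng = np.random.default_rng(0)

def make_frame(n):
    temp = rng.uniform(-5, 25, n)
    wind = rng.uniform(0, 10, n)
    load = 100.0 - 2.0 * temp + 1.5 * wind
    return pd.DataFrame(
        {"AverageTemperature": temp, "WindSpeed": wind, "HeatingLoad": load}
    )

demo_train = make_frame(50)
demo_test = make_frame(20)
test_error = ols_test_error(
    demo_train, demo_test, ["AverageTemperature", "WindSpeed"]
)
```

Because the synthetic response is an exact linear function of the two predictors, the resulting error is zero up to floating-point rounding. In the notebook, the two frames would instead come from `pd.read_csv("HeatingLoad_training.csv")` and `pd.read_csv("HeatingLoad_test.csv")`.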

3 Marking Criteria

This assignment is worth 30 marks in total, with 18 marks allocated to the content of SID_Assignment1_document.pdf and 12 marks to the Python implementation. The marking breakdown is as follows:

Prediction accuracy: Your test error will be compared against the smallest test error among all submissions, including the teaching team's.
- The marker first runs SID_Assignment1_implementation.ipynb.
  - If the file runs smoothly and produces a test error, up to 12 marks will be awarded based on prediction accuracy relative to the smallest MSE and the appropriateness of your implementation.
  - If the marker cannot run SID_Assignment1_implementation.ipynb or if no test error is produced, partial marks (maximum 4) may be awarded based on the appropriateness of the file.
Report described in SID_Assignment1_document.pdf: Up to 18 marks are allocated based on:
- The appropriateness of the chosen prediction method.
- The detail, discussion, and explanation of your data analysis procedure. See the Marking Criteria for more details.
CSV File Submission: Up to 2 marks will be deducted if you fail to upload the CSV file in the correct format.

4 Errors

If you believe there are errors in this assignment, please contact the teaching team.