RESEARCH SCHOOL OF FINANCE, ACTUARIAL STUDIES AND STATISTICS

REGRESSION MODELLING

(STAT4038/STAT6038)

Assignment 1 for Semester 1, 2023

Due date: 3:00 pm (Canberra time) on Friday, 31 March 2023

INSTRUCTIONS:

• This assignment is worth 15% of your overall marks for this course.

• You must complete this assignment by yourself. If you copy someone else’s work or allow your work to be copied, you will receive a mark of zero for the assignment and risk very severe academic consequences.

• Your report should be submitted to Turnitin on Wattle as a single PDF document (less than 25 MB), including the following:

1. The assignment cover sheet (available to download from Wattle).

2. Your assignment (a maximum of 10 pages).

3. An appendix including the R code you used. Failure to upload the R code will result in a penalty.

• Please ensure that your assignment is typed. You may include some edited R output, such as graphs and tables, showing the results of your data analysis and a discussion of these results. You may also include carefully selected code to support your analysis. Be selective about what you present: include only as many pages and as much R output as necessary to justify your solution. Label each part of your report with the corresponding question number to make it easier for the reader to follow your work.

• Unless otherwise advised, use a significance level of 5%. Round numeric answers to 4 decimal places (e.g., 0.0012).

• To ensure that your report is concise and meets the expectations of the assignment, please follow the instructions carefully. Marks may be deducted if the instructions are not strictly followed. Specifically, the report should be no longer than 10 pages, including graphs and tables. The appendix does not count towards the 10-page limit. The appendix will not be assessed and will be checked only if there is a question about what you have actually done.

• Name your report “Course code-Uid”, e.g., “STAT2008-u1234567”.

• To avoid any unexpected issues, such as internet connectivity problems, please aim to submit your assignment at least 15 minutes before the deadline.

• Late submissions will NOT be accepted. If you require an extension, it will be granted on medical or compassionate grounds, provided that appropriate evidence is produced. However, you must obtain the lecturer’s permission for an extension at least 24 hours before the deadline.

Analysing Used Car Prices with Simple Linear Regression

You have been provided with a dataset containing information on 818 used Toyota cars. The data includes the year of manufacture (Year), model name (Model), odometer reading in kilometres (Odometer), transmission type (Transmission), power type (Power), and the asking price in $AUD for each car (AskPrice). Your goal is to use simple linear regression to analyse the data and answer the following questions:

1. Is a seller asking a reasonable price for the car?

2. Is the seller asking a lower price than the market price for the car?

3. Is the seller asking a higher price than the market price for the car?

Therefore, we will be using the asking price of a vehicle as the response variable and the odometer reading and the year it was manufactured as the predictor variables in this assignment.
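As a rough illustration of this setup, the sketch below fits a simple linear regression in R on simulated stand-in data. The column names (AskPrice, Odometer, Year) follow the description above. The data frame name `cars_demo`, the simulated values, and the coefficients used to generate them are assumptions made purely for illustration and say nothing about the real dataset.

```r
# Simulated stand-in data (hypothetical); the real dataset is not reproduced here.
set.seed(1)
n <- 50
cars_demo <- data.frame(
  Odometer = runif(n, min = 10000, max = 250000),   # kilometres
  Year     = sample(2005:2022, n, replace = TRUE)   # year of manufacture
)
# Assumed data-generating relationship, chosen only so the example runs.
cars_demo$AskPrice <- 40000 - 0.1 * cars_demo$Odometer + rnorm(n, sd = 3000)

# Simple linear regression: AskPrice as response, Odometer as the single predictor.
fit <- lm(AskPrice ~ Odometer, data = cars_demo)

# Estimated coefficients, standard errors and t-tests, rounded to 4 decimal places.
round(coef(summary(fit)), 4)

# ANOVA table with the overall F-test.
anova(fit)
```

The same pattern applies with Year as the predictor by changing the model formula to `AskPrice ~ Year`.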

To complete the assignment, you need to perform the following tasks:

(a) [5 marks] Exploring a predictor variable through appropriate graphical diagnostics and descriptive statistics can help identify outliers and, from the concentration and range of its values, indicate the range of validity for the regression analysis. Explore the predictor variables, Odometer and Year, one at a time, using appropriate graphical diagnostics and descriptive statistics. Based on your analysis, what conclusions can you draw about these variables?

(b) [5 marks] Diagnostic plots for the response variable are usually not very informative in regression analysis because the values of the response variable are influenced by the predictor variable, making it challenging to evaluate the goodness of fit of the regression model by examining the response variable on its own. Nonetheless, exploring the response variable can still yield useful insights, particularly in identifying outliers or anomalies. Explore the response variable, AskPrice, using appropriate graphical diagnostics and descriptive statistics. Based on your analysis, what conclusions can you draw?

(c) [5 marks] Perform an exploratory data analysis to assess the correlation between the two variables, AskPrice and Odometer. Based on your prior knowledge or assumptions, what was your expectation regarding the relationship between the two variables? Does the observed relationship match your expectation?

(d) [10 marks] Fit a simple linear regression (SLR) model to the data, with AskPrice as the response variable and Odometer as the predictor variable. Then, create several plots to assess the model’s assumptions and identify any unusual data points. These plots should include a plot of the residuals against the fitted values, a normal Q-Q plot of the residuals, a bar plot of the leverages for each observation, and a bar plot of Cook’s distances for each observation. Use these plots, along with any other relevant means, to comment on the model assumptions and identify any potential outliers or influential observations.

(e) [5 marks] Produce the ANOVA table for the fitted model in part (d) and conduct the F-test based on the output. Please provide your interpretation of the ANOVA results and the F-test. What is the coefficient of determination (R-squared) for this model, and how should it be interpreted as a summary measure?

(f) [10 marks] What are the estimated coefficients and their standard errors for the fitted model in part (d)? Provide an interpretation of these coefficient values and perform t-tests to determine if they differ significantly from zero. Based on the results of these tests, what conclusions can you draw?

(g) [5 marks] Perform an exploratory data analysis to assess the correlation between the two variables, AskPrice and Year. Based on your prior knowledge or assumptions, what was your expectation regarding the relationship between the two variables? Does the observed relationship match your expectation?

(h) [10 marks] Fit a simple linear regression (SLR) model to the data, with AskPrice as the response variable and Year as the predictor variable. Then, create several plots to assess the model’s assumptions and identify any unusual data points. These plots should include a plot of the residuals against the fitted values, a normal Q-Q plot of the residuals, a bar plot of the leverages for each observation, and a bar plot of Cook’s distances for each observation. Use these plots, along with any other relevant means, to comment on the model assumptions and identify any potential outliers or influential observations.

(i) [5 marks] Produce the ANOVA table for the fitted model in part (h) and conduct the F-test based on the output. Please provide your interpretation of the ANOVA results and the F-test. What is the coefficient of determination (R-squared) for this model, and how should it be interpreted as a summary measure?

(j) [10 marks] What are the estimated coefficients and their standard errors for the fitted model in part (h)? Provide an interpretation of these coefficient values and perform t-tests to determine if they differ significantly from zero. Based on the results of these tests, what conclusions can you draw?

(k) [10 marks] Do you think that the simple linear regression models fitted in parts (d) and (h) are suitable? If not, what variable transformations could you apply to enhance the models? Please provide an explanation for your choice of transformation. Using the transformation of your choice, refit the model(s) and select the best one for interpretation and prediction. Please provide reasoning for your selection and interpret its coefficients.

(l) [10 marks] After analysing the data, select the best model from parts (d), (h), and (k) and explain your reasoning for choosing that model. Based on your analysis of the data, recommend a car for your friend to purchase and explain why you chose that particular car. Additionally, suggest a car that you would advise your friend not to buy and provide reasons for your decision.

(m) [10 marks] Your friend is selling a Toyota Corolla with a 2015 year of manufacture and an odometer reading of 100,000 km. Using the model of your choice, construct a 95% confidence interval for the expected selling price and a 95% prediction interval for the potential selling price of the car. Explain how these intervals can assist in determining an appropriate asking price for the car. Based on your analysis, what would be your recommended asking price for