HW2 Introduction to Time Series

Homework 2: [YOUR NAME HERE]
Introduction to Time Series, Fall 2023
Due Thursday September 21 at 5pm
The total number of points possible for this homework is 35. The number of points for each question is written below, and questions marked as “bonus” are optional. Submit the knitted html file from this Rmd to Gradescope.
If you collaborated with anybody for this homework, put their names here:
Simple regression
1. (2 pts) Derive the population least squares coefficients $\beta_0, \beta_1$, which solve
$$\min_{\beta_0, \beta_1} \; \mathbb{E}\big[(y - \beta_0 - \beta_1 x)^2\big],$$
by differentiating the criterion with respect to each $\beta_j$, setting it equal to zero, and solving. Repeat the calculation but without intercept (without the $\beta_0$ coefficient in the model).
SOLUTION GOES HERE
2. (2 pts) As in Q1, now derive the sample least squares coefficients $\hat\beta_0, \hat\beta_1$, which solve
$$\min_{\beta_0, \beta_1} \; \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.$$
Again, repeat the calculation but without intercept (no $\beta_0$ in the model).
SOLUTION GOES HERE
3. (2 pts) Prove or disprove: in the model without intercept, is the regression coefficient of $y$ on $x$ the inverse of that from the regression of $x$ on $y$? Answer the question for each of the population and sample versions.
SOLUTION GOES HERE
4. (3 pts) Consider the following hypothetical. Let $y$ be the height of a child and $x$ be the height of their parent, and consider a regression of $y$ on $x$, performed in a large population. Suppose that we estimate the regression coefficients separately for male and female parents (two separate regressions), and we find that the slope coefficient from the former regression is smaller than that from the latter. Suppose however that we find (in this same population) that the sample correlation between a father's height and their child's height is larger than that between a mother's height and their child's height. What is a plausible explanation for what is happening here?
SOLUTION GOES HERE
Multiple regression
5. (2 pts) In class, we claimed that the multiple regression coefficients $\hat\beta$, with respect to responses $y_i$ and feature vectors $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, can be written in two ways: the first is
$$\hat\beta = \Big(\sum_{i=1}^n x_i x_i^T\Big)^{-1} \sum_{i=1}^n x_i y_i.$$
The second is
$$\hat\beta = (X^T X)^{-1} X^T y,$$
where $X$ is a feature matrix, with $i$th row $x_i$, and $y$ is a response vector, with $i$th component $y_i$. Prove that these two expressions are equivalent.
SOLUTION GOES HERE
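Not part of the required proof, but as a quick numerical sanity check of the claimed equivalence, the two expressions can be compared on simulated data (the setup below is illustrative, not from the lecture code):

```r
# Sanity check (illustrative setup): compare the two expressions for the
# sample least squares coefficients on randomly generated data.
set.seed(1)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), n, p)   # feature matrix, with ith row x_i
y <- rnorm(n)                     # response vector

# First form: (sum_i x_i x_i^T)^{-1} sum_i x_i y_i
A <- matrix(0, p, p); b <- rep(0, p)
for (i in 1:n) {
  A <- A + X[i, ] %*% t(X[i, ])   # accumulate x_i x_i^T
  b <- b + X[i, ] * y[i]          # accumulate x_i y_i
}
beta1 <- solve(A, b)

# Second form: (X^T X)^{-1} X^T y
beta2 <- solve(t(X) %*% X, t(X) %*% y)

max(abs(beta1 - beta2))           # should be numerically zero
```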
6. (Bonus) Derive the population and sample multiple regression coefficients by solving the corresponding least squares problem (differentiating the criterion with respect to each coefficient, setting equal to zero, and solving). For the sample least squares coefficients, deriving either representation in Q5 will be fine.
SOLUTION GOES HERE
Marginal-multiple connection
7. (1 pt) Consider the simple linear regression of a generic response $y$ on a constant predictor $x$ (that is, $x_i$ equal to the same constant for each $i = 1, \ldots, n$), without intercept. Give the exact form of the sample regression coefficient.
SOLUTION GOES HERE
8. (3 pts) Recall the connection between multiple and marginal regression coefficients, as covered in lecture: the multiple regression coefficient $\hat\beta_j$ can be written in general as
$$\hat\beta_j = \big((\hat{x}^j)^T \hat{x}^j\big)^{-1} (\hat{x}^j)^T \hat{y}^j,$$
which we interpret as the simple linear regression coefficient of $\hat{y}^j$ on $\hat{x}^j$. These are $y$ and the $j$th feature, respectively, after we regress out the contributions of all other features. (See the lecture notes for the precise details.)
Now note that we can treat a simple linear regression with an intercept term as a multiple regression with two features, with the first feature just equal to the constant 1. Using the above formula, and the answer from Q7, re-derive the expression for the slope in the simple linear model with intercept:
$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$
SOLUTION GOES HERE
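Not required for the derivation, but here is a small, hedged numerical illustration of the marginal-multiple connection on simulated data (all names below are my own, not from the lecture code): the coefficient of a feature in a multiple regression agrees with the simple, no-intercept regression coefficient computed after residualizing that feature and the response on the remaining features.

```r
# Illustrative check of the marginal-multiple (residualization) identity.
set.seed(2)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X  <- cbind(intercept = 1, x1 = x1, x2 = x2)

fit_full <- lm(y ~ X - 1)                   # multiple regression (intercept is a column of X)
j <- 2                                      # look at the coefficient on x1
xj_res <- resid(lm(X[, j] ~ X[, -j] - 1))   # x1 with the other features regressed out
y_res  <- resid(lm(y ~ X[, -j] - 1))        # y with the other features regressed out

c(multiple     = unname(coef(fit_full)[j]),
  residualized = sum(xj_res * y_res) / sum(xj_res^2))   # the two should agree
```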
Covariance calculations
9. (3 pts) Let $x$ and $y$ be random vectors, and let $A$ and $B$ be fixed matrices. Prove that
$$\mathrm{Cov}(Ax, By) = A \, \mathrm{Cov}(x, y) \, B^T.$$
Prove as a consequence that $\mathrm{Cov}(Ax) = A \, \mathrm{Cov}(x) \, A^T$. Hint: you may use the rule for covariances of linear combinations (as reviewed in the lecture from week 2, “Measures of dependence and stationarity”).
SOLUTION GOES HERE
10. (2 pts) Suppose that $y = X\beta + \varepsilon$, with $X$ and $\beta$ fixed, and where $\varepsilon$ is a vector with white noise entries, with variance $\sigma^2$. Use the rule in Q9 to prove that for the sample least squares coefficients, namely
$$\hat\beta = (X^T X)^{-1} X^T y,$$
it holds that
$$\mathrm{Cov}(\hat\beta) = \sigma^2 (X^T X)^{-1}.$$
SOLUTION GOES HERE
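A proof is what is asked for here; as an optional, hedged empirical check (my own simulation setup, not from lecture), the formula can be compared against the empirical covariance of $\hat\beta$ over repeated draws of the noise:

```r
# Optional Monte Carlo check of Cov(beta_hat) = sigma^2 (X^T X)^{-1};
# the design and coefficients below are illustrative only.
set.seed(3)
n <- 60; sigma <- 1.5
X    <- cbind(1, rnorm(n))   # fixed design: intercept plus one feature
beta <- c(1, 2)              # fixed true coefficients

betahat <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)        # y = X beta + white noise
  as.vector(solve(t(X) %*% X, t(X) %*% y))      # least squares coefficients
})

cov(t(betahat))              # empirical covariance across replications
sigma^2 * solve(t(X) %*% X)  # theoretical covariance; should be close
```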
11. (4 pts) An equivalent way to state the Gauss-Markov theorem is as follows. Under the model from Q10, if $\tilde\beta$ is any other unbiased linear estimator of $\beta$ (where linearity means that $\tilde\beta = My$ for a fixed matrix $M$), then
$$\mathrm{Cov}(\hat\beta) \preceq \mathrm{Cov}(\tilde\beta),$$
where $\preceq$ means less than or equal to in the PSD (positive semidefinite) ordering. Precisely, $A \preceq B$ if and only if $B - A$ is a PSD matrix, which recall, means $z^T (B - A) z \geq 0$ for all vectors $z$. Prove that this is indeed equivalent to the statement of the Gauss-Markov theorem given in lecture.
SOLUTION GOES HERE
Cross-validation
12. (3 pts) Recall the R code from lecture that performs time series cross-validation to evaluate the mean absolute error (MAE) of predictions from the linear regression of cardiovascular mortality on 4-week lagged particulate levels. Adapt this to evaluate the MAE of predictions from the regression of cardiovascular mortality on 4-week lagged particulate levels and 4-week lagged temperature (2 features). Fit each regression model using a trailing window of 200 time points (not all past). Plot the predictions, and print the MAE on the plot, following the code from lecture.
Additionally (all drawn on the same figure), plot the fitted values on the training set. By the training set here, we mean what is also called the “burn-in set” in the lecture notes, and indexed by times 1 through
t0 in the code. The fitted values should come from the initial regression model that is fit to the burn-in set. These should be plotted in a different color from the predictions made in the time series cross-validation pass. Print the MAE from the fitted values on the training set somewhere on the plot (and label this as “Training MAE” to clearly differentiate it).
13. (2 pts) Repeat the same exercise as in Q12 but now with multiple lags per variable: use lags 4, 8, 12 for each of particulate level and temperature (thus 6 features in total). Did the training MAE go down? Did the cross-validated MAE go down? Discuss. Hint: you may find it useful to know that lm() can take a predictor matrix, as in lm(y ~ x) where x is a matrix; in this problem, you can form the predictor matrix by calling cbind() on the lagged feature vectors.
14. (2 pts) Repeat once more the same exercise as in the last question, but now with many lags per variable: use lags 4, 5, …, through 50 for each of particulate level and temperature (thus 47 x 2 = 94 features in total). Did the training MAE go down? Did the cross-validated MAE go down? Are you surprised? Discuss.
# CODE GOES HERE
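I don't have the lecture code at hand, so the following is only a hedged sketch of one way Q12 could be set up. It assumes the weekly series cmort, part, and tempr from the astsa package, and a burn-in (training) set of size t0 = 200, which may differ from the values used in lecture.

```r
# Hedged sketch for Q12: time series CV with a trailing window of 200 points,
# regressing mortality on 4-week lagged particulates and temperature.
library(astsa)

dat <- ts.intersect(mort = cmort,
                    p4 = stats::lag(part, -4),    # particulates, lagged 4 weeks
                    t4 = stats::lag(tempr, -4),   # temperature, lagged 4 weeks
                    dframe = TRUE)

n  <- nrow(dat)
t0 <- 200   # burn-in (training) set size; assumed, may differ from lecture
w  <- 200   # trailing window length

# Fit on the burn-in set and record the training fitted values
fit0      <- lm(mort ~ p4 + t4, data = dat[1:t0, ])
train_fit <- fitted(fit0)
train_mae <- mean(abs(dat$mort[1:t0] - train_fit))

# Predictions over the remaining time points, refitting on a trailing window
pred <- rep(NA, n)
for (t in (t0 + 1):n) {
  fit     <- lm(mort ~ p4 + t4, data = dat[(t - w):(t - 1), ])
  pred[t] <- predict(fit, newdata = dat[t, ])
}
cv_mae <- mean(abs(dat$mort[(t0 + 1):n] - pred[(t0 + 1):n]))

plot(dat$mort, type = "l", col = 8, xlab = "Time index", ylab = "Mortality")
lines(1:t0, train_fit, col = 4)                # fitted values on the burn-in set
lines((t0 + 1):n, pred[(t0 + 1):n], col = 2)   # CV predictions
text(n * 0.35, max(dat$mort), paste("Training MAE:", round(train_mae, 2)))
text(n * 0.75, max(dat$mort), paste("CV MAE:", round(cv_mae, 2)))
```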
# CODE GOES HERE
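For Q13, only the feature construction changes relative to the sketch above; a minimal hedged version (same assumptions about the astsa series) is:

```r
# Hedged sketch for Q13: lags 4, 8, 12 of particulates and temperature
# (6 features). The CV loop is the same as in the Q12 sketch, with the
# model formula replaced by mort ~ . on this data frame.
library(astsa)

dat6 <- ts.intersect(mort = cmort,
                     p4  = stats::lag(part,  -4),
                     p8  = stats::lag(part,  -8),
                     p12 = stats::lag(part,  -12),
                     t4  = stats::lag(tempr, -4),
                     t8  = stats::lag(tempr, -8),
                     t12 = stats::lag(tempr, -12),
                     dframe = TRUE)

# Example burn-in fit (assumed t0 = 200, as in the Q12 sketch)
fit0 <- lm(mort ~ ., data = dat6[1:200, ])
summary(fit0)
```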
# CODE GOES HERE
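For Q14, one hedged way to build the 94-column lagged feature set, using the cbind()/matrix-predictor idea from the hint (again, illustrative names and an assumed burn-in of 200 points):

```r
# Hedged sketch for Q14: lags 4 through 50 of particulates and temperature
# (47 x 2 = 94 features); the CV loop from the Q12 sketch can be reused.
library(astsa)

lags <- 4:50
args <- c(list(mort = cmort),
          setNames(lapply(lags, function(k) stats::lag(part,  -k)), paste0("p", lags)),
          setNames(lapply(lags, function(k) stats::lag(tempr, -k)), paste0("t", lags)),
          list(dframe = TRUE))
dat94 <- do.call(ts.intersect, args)   # align all series on common times

y <- dat94$mort
x <- as.matrix(dat94[, -1])            # 94 lagged features as a matrix
fit0 <- lm(y[1:200] ~ x[1:200, ])      # example fit on the (assumed) burn-in set
mean(abs(y[1:200] - fitted(fit0)))     # training MAE on the burn-in set
```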

More features, the merrier?
15. (2 pts) Let $y_i$ be an arbitrary response, and $x_i \in \mathbb{R}^p$ be an arbitrary feature vector, for $i = 1, \ldots, n$. Let
$$\tilde{x}_i = (x_{i1}, \ldots, x_{ip}, \tilde{x}_{i,p+1}), \quad i = 1, \ldots, n,$$
be the result of appending one more feature. Let $\hat{y}_i$ denote the fitted values from the regression of $y$ on $x$, and let $\tilde{y}_i$ denote the fitted values from the regression of $y$ on $\tilde{x}$. Prove that
$$\sum_{i=1}^n (y_i - \tilde{y}_i)^2 \leq \sum_{i=1}^n (y_i - \hat{y}_i)^2.$$
In other words, the training MSE will never get worse as we add features to a given sample regression problem.
SOLUTION GOES HERE
16. (2 pts) How many linearly independent features do we need (how large should $p$ be) in order to achieve perfect training accuracy, i.e., a training MSE of zero? Why?
SOLUTION GOES HERE
17. (Bonus) Implement an example in R in order to verify your answer to Q16 empirically. Extra bonus points if you do it on the cardiovascular mortality data, using enough lagged features. You should be able to plot the fitted values from the training set and see that they match the observations perfectly (and the CV predictions should look super wild).
# CODE GOES HERE
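A hedged sketch of how Q16/Q17 could be verified on simulated data (the bonus version on the mortality data would follow the same pattern, using enough lagged features); the setup below is illustrative only:

```r
# Hedged sketch for Q17: with n observations and n linearly independent
# features (an intercept plus n - 1 random features), the training fit is
# exact, so the training MSE is zero.
set.seed(4)
n <- 30
x <- matrix(rnorm(n * (n - 1)), n, n - 1)   # n - 1 features; lm() adds the intercept
y <- rnorm(n)

fit <- lm(y ~ x)
mean((y - fitted(fit))^2)       # training MSE: numerically zero

plot(y, type = "o", pch = 19, ylab = "Value")
lines(fitted(fit), col = 2, lty = 2)   # fitted values overlay the observations exactly
```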