BAX 442 Final Exam
Course Instructor: Dr. Rahul Makhijani
Note: You have 24 hours for the exam and must submit your solution on Canvas. Late submissions will not be accepted.
Comparison of Model Selection Methods 15 pts
Permeability is the measure of a molecule's ability to cross a cell membrane. Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have sufficient permeability to become a drug. We will compare various model selection methods for predicting permeability. The data can be obtained in R using the following commands:

library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints (a binary sequence of numbers that represents the presence or absence of a specific molecular substructure) contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
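For reference, a minimal sketch (assuming the AppliedPredictiveModeling package is installed) that loads the data and confirms the dimensions described above:

library(AppliedPredictiveModeling)
data(permeability)        # loads both fingerprints and permeability
dim(fingerprints)         # 165 compounds x 1107 binary predictors
dim(permeability)         # 165 x 1 permeability response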
(a) Split the data set into a training set and a test set using a 70:30 split. 1 pt (A minimal R sketch covering parts (a)-(g) appears after part (h).)
(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling? 2 pts
(c) Fit a linear model using least squares on the training set, and report the test error obtained. 2 pts
(d) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained. 2 pts
(e) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates. 2 pts
(f) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation. 2 pts
(g) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation. 2 pts
(h) Comment on the results obtained. How many latent variables are optimal in PLS vs. PCR? Is there much difference among the test errors resulting from these approaches? 2 pts
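A minimal sketch of one possible workflow for parts (a)-(g), assuming the objects fingerprints and permeability have been loaded as above; the specific settings (seed, CV defaults, lambda.min) are illustrative choices, not requirements:

set.seed(1)
library(caret)     # createDataPartition, nearZeroVar
library(glmnet)    # ridge and lasso
library(pls)       # pcr and plsr

# (a) 70:30 train/test split on the response
train_id <- createDataPartition(permeability[, 1], p = 0.7, list = FALSE)

# (b) drop near-zero-variance fingerprint columns
nzv <- nearZeroVar(fingerprints)
X   <- fingerprints[, -nzv]
ncol(X)                                            # predictors left after filtering

x_tr <- X[train_id, ];  y_tr <- permeability[train_id, 1]
x_te <- X[-train_id, ]; y_te <- permeability[-train_id, 1]
train_df <- data.frame(y = y_tr, x_tr)
test_df  <- data.frame(x_te)

# (c) least squares (expect rank-deficiency warnings when p is close to n)
lm_fit <- lm(y ~ ., data = train_df)
mean((predict(lm_fit, newdata = test_df) - y_te)^2)

# (d) ridge and (e) lasso, with lambda chosen by cross-validation
ridge_cv <- cv.glmnet(x_tr, y_tr, alpha = 0)
lasso_cv <- cv.glmnet(x_tr, y_tr, alpha = 1)
mean((predict(ridge_cv, newx = x_te, s = "lambda.min") - y_te)^2)
mean((predict(lasso_cv, newx = x_te, s = "lambda.min") - y_te)^2)
sum(coef(lasso_cv, s = "lambda.min") != 0) - 1     # non-zero coefficients (minus intercept)

# (f) PCR and (g) PLS, number of components M chosen by cross-validation
pcr_fit <- pcr(y ~ ., data = train_df, validation = "CV")
pls_fit <- plsr(y ~ ., data = train_df, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")         # read off the M with lowest CV error
validationplot(pls_fit, val.type = "MSEP")
# then, e.g., mean((predict(pcr_fit, test_df, ncomp = M) - y_te)^2) for the chosen M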
Predicting Probability of Default 35 pts
A loan offering company would like to build a default risk model so that it can target high-risk customers early and perhaps preempt the default event, which ends up costly for all involved. The data in this problem consist of historical loan records for a case-control sample of 2,400 past customers. The variables characterize some aspects of the loan, such as duration, amount, interest rate, and many other more technical features of the loans. There are also some qualitative variables such as the reason for the loan, a quality score, and so on. One of the variables is default, a 0/1 variable indicating whether or not the borrower has defaulted on their loan payments.

You are provided with a training set loan_train_final.csv, which represents a sample of 2,400 customers and contains 30 features and the binary outcome "default" (in the first column). There is also a file loan_test_final.csv, which consists of a random sample of 600 other customers from the general pool. For these you are provided only the 30 features. Your job is to build a risk score (probability of default) for each customer in the test set. You may use any of the tools discussed in the lectures in this class. You may not use tools not discussed in this class, such as deep learning, random forests, or boosting. You should produce a writeup describing what you did and how you selected your final model. Give some indication of which variables were important in the calculation of your risk score. You should also provide the mean absolute error (MAE) of the loss on the test dataset, where the MAE is defined as
MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
where
• n is the number of rows (customers) in the test set
• y_i is the actual loss
• ŷ_i is the predicted loss.
For a customer in the test set who does not default, y_i is 0; for a customer who defaults, y_i is the loan amount. The predicted loss ŷ_i is the predicted probability of default multiplied by the amount.
Note: If you want to apply probability calibration in R, you can use the calibration function probability.calibration from the package rfUtilities.
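A minimal sketch of one way to produce the risk scores and predicted losses; the column name amount, the use of a logistic lasso, and the assumption that factor levels match across the two files are illustrative choices to be adapted to the actual data:

library(glmnet)

train <- read.csv("loan_train_final.csv")
test  <- read.csv("loan_test_final.csv")

# Design matrices; "default" is assumed to be the outcome column in the
# training file, and factor levels are assumed to match across the two files.
x_tr <- model.matrix(default ~ ., data = train)[, -1]
y_tr <- train$default
x_te <- model.matrix(~ ., data = test)[, -1]

# One possible probability model: a cross-validated logistic lasso.
cv_fit <- cv.glmnet(x_tr, y_tr, family = "binomial")
p_hat  <- as.numeric(predict(cv_fit, newx = x_te, s = "lambda.min", type = "response"))
coef(cv_fit, s = "lambda.min")          # variables driving the risk score

# Optionally calibrate p_hat, e.g. with rfUtilities::probability.calibration (see note above).

# Predicted loss = probability of default x loan amount ("amount" is an assumed column name);
# the actual loss y_i is 0 for non-defaulters and the loan amount for defaulters.
pred_loss <- p_hat * test$amount
# mae <- mean(abs(actual_loss - pred_loss))   # MAE once actual losses are available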
PCA vs. Least Squares 20 pts
PCA and least squares regression can be viewed as algorithms for inferring (linear) relationships among data variables. In this question, we will look at how effective they are under different amounts of noise in the variables. We will consider a simple example with two variables, x and y, where the true relationship between the variables is y = 2x. Our goal is to recover this relationship, namely to recover the coefficient "2", under the following two cases:
1. In the first case, we consider the setting where our data consist of the actual values of x and noisy estimates of y.
2. In the second case, we consider the setting where our data consist of noisy measurements of both x and y.
Exercises:
1. Write functions to return the PCA and least squares parameters. (A minimal R sketch of these functions and the simulation set-up appears after exercise 5.)
(a) Write a function pca-recover that takes a vector X of x_i's and a vector Y of y_i's and returns the slope of the first component of the PCA (namely, the second coordinate divided by the first).
(b) Write a function ls-recover that takes X and Y and returns the slope of the least squares fit.
(c) Set X = [.001, .002, .003, ..., 1] and Y = 2X. Make sure both functions return 2 as a sanity check. 2 pts
2. IID random variables. Consider the case where the elements of X and Y are chosen independently and identically at random from the square [0, 1] × [0, 1]. What would PCA recover, and what would LS recover? 4 pts
3. Consider the case where x is an independent (a.k.a. explanatory) variable, and we get noisy measurements of y. Fix X = [x_1, x_2, ..., x_1000] = [.001, .002, .003, ..., 1]. For a given noise level c, let ŷ_i = 2x_i + N(0, c) = 2i/1000 + N(0, c), and Ŷ = [ŷ_1, ŷ_2, ..., ŷ_1000]. Make a scatter plot with c on the horizontal axis and the output of pca-recover and ls-recover on the vertical axis. For each c in [0, 0.05, 0.1, ..., 0.45, 0.5], take a sample Ŷ, plot the output of pca-recover as a red dot and the output of ls-recover as a blue dot. Repeat 30 times. You should end up with a plot of 660 dots, in 11 columns of 60, half red and half blue. 5 pts
4. We now examine the case where our data consist of noisy estimates of both x and y. For a given noise level c, let x̂_i = x_i + N(0, c) = i/1000 + N(0, c) and ŷ_i = y_i + N(0, c) = 2i/1000 + N(0, c). For each c in [0, 0.05, 0.1, ..., 0.45, 0.5], take a sample X̂ and Ŷ, plot the output of pca-recover as a red dot and the output of ls-recover as a blue dot. Repeat 30 times. You should have a plot with 330 red dots and 330 blue dots. 5 pts
5. In which case does PCA do better? Why? 4 pts
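A minimal R sketch of the two recover functions (renamed pca_recover and ls_recover, since hyphens are not legal in R names) and of the simulation in exercise 3; exercise 4 is analogous with noise also added to X. Reading N(0, c) as a normal with standard deviation c is an assumption; adjust if your convention is variance c:

pca_recover <- function(X, Y) {
  # Slope of the first principal component: second coordinate of the
  # leading loading vector divided by the first coordinate.
  v <- prcomp(cbind(X, Y))$rotation[, 1]
  unname(v[2] / v[1])
}

ls_recover <- function(X, Y) {
  # Slope of the least squares fit of Y on X.
  unname(coef(lm(Y ~ X))[2])
}

# (c) Sanity check: both should return 2 on noiseless data.
X <- seq(0.001, 1, by = 0.001)
Y <- 2 * X
pca_recover(X, Y); ls_recover(X, Y)

# Exercise 3: noisy Y only, 30 repetitions per noise level c.
cs <- seq(0, 0.5, by = 0.05)
plot(NULL, xlim = range(cs), ylim = c(0, 5),
     xlab = "c", ylab = "recovered slope")
for (c in cs) {
  for (rep in 1:30) {
    Y_hat <- 2 * X + rnorm(length(X), mean = 0, sd = c)
    points(c, pca_recover(X, Y_hat), col = "red", pch = 16)
    points(c, ls_recover(X, Y_hat), col = "blue", pch = 16)
  }
}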
Causal Inference 17 pts
1. The table below lists two potential outcomes (Y(0), Y(1)), where 0 denotes the placebo and 1 denotes the active treatment, for a set of 6 individuals, along with an indication of the treatment assignment T for each. 5 pts
Assume that we are interested in an additive effect for the treatment.
Unit  T  Y(0)  Y(1)
 1    1   80    95
 2    0   70    85
 3    1   70    70
 4    0   82    62
 5    0   60    72
(a) Explain the meaning of Y(0) and Y(1). 1 pt
(b) What is the causal effect of receiving treatment for the first unit? 1 pt
(c) What is the average treatment effect in this population? 1 pt
(d) Explain the meaning of the following statement: In a randomized study, T is independent of Y(0) and T is independent of Y(1), but T is not independent of Y_obs. 2 pts
2. The table below lists the results of a hypothetical experiment on 200 people. Each row identifies a category of people with the same values of X, T, Y(0), Y(1), and Y_obs. Thus, for example, there are 30 people in category 1, and each of these people has X = 0, T = 0, Y(0) = 4, Y(1) = 6, and, because they are in the control group, Y_obs = 4.
Category  #people  X  Y(0)  Y(1)  T  Y_obs
   5        20     0   10    12   0   10
   6        20     0   10    12   1   12
   7        15     1   10    12   0   10
   8        45     1   10    12   1   12
(a) Do you believe these data came from a randomized experiment? Justify your answer. 2 pts
(b) Describe what it means for the treatment assignment mechanism to be unconfounded given X. 2 pts
(c) Do you believe that treatment assignment is unconfounded given X for these data? Justify your answer. 2 pts
(d) Give the definition of a propensity score p(X). 1 pt
(e) The naive estimate of the average treatment effect compares the observed mean value of Y for those assigned to treatment (T = 1) and those assigned to control (T = 0). In this case the naive estimate is 9.12 − 6.80 = 2.32. Carry out a propensity score analysis (any suitable propensity score analysis is fine) to estimate the average treatment effect and show that it yields the correct estimate. Be specific about your approach. (One possible set-up is sketched below.) 5 pts
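One possible propensity score analysis, sketched for data expanded to one row per person; only categories 5-8 from the table above are reconstructed here, and categories 1-4 would be appended in the same way before computing the estimate:

# Person-level data built from the category table (categories 5-8 shown).
d <- data.frame(
  X     = rep(c(0, 0, 1, 1), times = c(20, 20, 15, 45)),
  T     = rep(c(0, 1, 0, 1), times = c(20, 20, 15, 45)),
  Y_obs = rep(c(10, 12, 10, 12), times = c(20, 20, 15, 45))
)

# Propensity score p(X) = P(T = 1 | X), here via logistic regression
# (with a binary X this equals the treated fraction within each stratum).
ps_fit   <- glm(T ~ X, family = binomial, data = d)
d$pscore <- predict(ps_fit, type = "response")

# Inverse-probability-weighted (IPW) estimate of the average treatment effect.
w1 <- d$T / d$pscore
w0 <- (1 - d$T) / (1 - d$pscore)
ate_ipw <- sum(w1 * d$Y_obs) / sum(w1) - sum(w0 * d$Y_obs) / sum(w0)

# Equivalent stratified estimate: within-X treated-minus-control difference,
# weighted by the share of people in each stratum of X.
ate_strat <- sum(sapply(split(d, d$X), function(s) {
  (mean(s$Y_obs[s$T == 1]) - mean(s$Y_obs[s$T == 0])) * nrow(s) / nrow(d)
}))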
Conceptual questions
1. A political scientist wants to understand the predictors of household income. She has collected a set of 2,000 features on 5,000 observations. To fit her model, the researcher has come up with the following plan. She wants to use random forests (a method that models interactions), but her software implementation is too slow, taking more than 20 minutes to run. So she will first run the lasso using glmnet (with cross-validation) and extract the set P of predictors with nonzero coefficients in the selected model. Then she will use just the set P in the random forest. Do you think this is a reasonable idea? Give details. 4 pts
2. You have training data on 1,200 software engineers from 5 different companies, and a separate test set of 100 engineers from 3 different companies. You wish to predict their salary Y from 37 different predictors, using the lasso. You use standard 10-fold cross-validation and obtain a CV error rate of 2%. But when you apply the model to the test set, the error rate is 15%. Explain what has happened and offer a fix, if possible. 5 pts
3. You have a two-class classification problem with n = 2000 observations and p = 200 features, and you are told by your collaborators that most of the features are likely to be uninformative (but they don't know which features are informative). Which method(s) are likely to work best on these data: (a) ridge regression, (b) all-subset selection, or (c) the lasso? Give reasons. 4 pts