MAST90083 2022 S2 exam paper

Semester 2 Assessment, 2022
School of Mathematics and Statistics
MAST90083 Computational Statistics & Data Science
Reading time: 30 minutes. Writing time: 3 hours. Upload time: 30 minutes.
This exam consists of 9 pages (including this page) with 8 questions and 55 total marks.
Permitted Materials
• This exam and/or an offline electronic PDF reader, blank loose-leaf paper and a Casio FX-82 calculator.
• No books or other material are allowed. Only one double-sided A4 page of notes (handwritten or printed) is permitted.
• No headphones or earphones are permitted.
Instructions to Students
• Wave your hand right in front of your webcam if you wish to communicate with the supervisor at any time (before, during or after the exam).
• You must not be out of webcam view at any time without supervisor permission.
• You must not write your answers on an iPad or other electronic device.
• Offline PDF readers (i) must have the screen visible in Zoom; (ii) must only be used to read exam questions (do not access other software or files); (iii) must be set in flight mode or have both internet and Bluetooth disabled as soon as the exam paper is downloaded.
• There are 8 questions with marks as shown. The total number of marks available is 55.
• Write your answers on A4 paper. Page 1 should only have your student number, the subject code and the subject name. Write on one side of each sheet only. Each question should be on a new page. The question number must be written at the top of each page.
Scanning and Submitting
• You must not leave Zoom supervision to scan your exam. Put the pages in question order and all the same way up. Use a scanning app to scan all pages to PDF. Scan directly from above. Crop pages to A4.
• Submit your scanned exam as a single PDF file and carefully review the submission in Gradescope. Scan again and resubmit if necessary. Do not leave Zoom supervision until you have confirmed orally with the supervisor that you have received the Gradescope confirmation email.
• You must not submit or resubmit after having left Zoom supervision.
© University of Melbourne 2022. Page 1 of 9 pages. Do not place in Baillieu Library.
Student number

MAST90083 Computational Statistics & Data Science Semester 2, 2022
Question 1 (7 marks)
Consider a set of i.i.d. samples x1, …, xn from a Pareto distribution which has the probability density function

f(x|θ) = θ x^{−(θ+1)}, x ≥ 1; θ > 0.

Suppose we have observed x1 = y1, …, xm = ym and xm+1 > c, …, xn > c, where m < n is given and y1, …, ym and c are given numerical values. This implies that x1, …, xm are completely observed, whereas xm+1, …, xn are partially observed in that they are right-censored. We want to use the EM algorithm to find the maximum likelihood estimate of θ.

(a) Find the complete-data log-likelihood function l(θ) = ln L(θ).

(b) In the E-step, we calculate

Q(θ, θ^{(k)}) = E[ln L(θ) | x1 = y1, …, xm = ym, xm+1 > c, …, xn > c; θ^{(k)}]

where θ^{(k)} is the current estimate of θ. Show that

Q(θ, θ^{(k)}) = n ln θ − (θ + 1) [ Σ_{i=1}^{m} ln yi + (n − m) ln c + (n − m)/θ^{(k)} ].

(Note: The result E(ln X | X > ξ) = ln ξ + 1/θ for the Pareto random variable X considered here can be used without proof.)

(c) In the M-step, we maximize Q(θ, θ^{(k)}) with respect to θ to find an update θ^{(k+1)} from θ^{(k)}. Show that

θ^{(k+1)} = n [ Σ_{i=1}^{m} ln yi + (n − m) ln c + (n − m)/θ^{(k)} ]^{−1}.

(d) Assuming the sequence θ^{(k)}, k = 1, 2, … converges to the MLE θ̂ as k → ∞, derive an expression for θ̂.
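As a numerical illustration, the update in (c) can simply be iterated to a fixed point; the minimal Python sketch below assumes hypothetical values for y, c and n.

```python
import numpy as np

# Illustrative EM iteration for the right-censored Pareto model.
# The observed values y, censoring point c and sample size n are hypothetical.
y = np.array([1.2, 1.5, 2.3, 1.1, 3.0])   # fully observed x_1, ..., x_m
m, n, c = len(y), 8, 2.5                   # n - m observations censored at c

theta = 1.0                                # initial estimate theta^(0)
for _ in range(100):
    # M-step update from (c):
    # theta^(k+1) = n / (sum(ln y_i) + (n-m) ln c + (n-m)/theta^(k))
    theta_new = n / (np.log(y).sum() + (n - m) * np.log(c) + (n - m) / theta)
    if abs(theta_new - theta) < 1e-10:     # stop once at a fixed point
        theta = theta_new
        break
    theta = theta_new

print(theta)  # the fixed point is the MLE theta-hat asked for in (d)
```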
Page 2 of 9 pages

MAST90083 Computational Statistics & Data Science Semester 2, 2022
Question 2 (2 marks)
Consider a Gaussian mixture model in which the marginal probability density for the latent variable z, p(z), is given by

p(z) = ∏_{k=1}^{K} π_k^{z_k}

where z is a K-dimensional binary random variable whose elements z_k ∈ {0, 1} satisfy Σ_{k=1}^{K} z_k = 1, and the conditional distribution p(x|z) for the observed variable is given by

p(x|z) = ∏_{k=1}^{K} N(x|μ_k, Σ_k)^{z_k}

where

N(x|μ_k, Σ_k) = (2π)^{−d/2} |Σ_k|^{−1/2} exp( −(1/2) (x − μ_k)⊤ Σ_k^{−1} (x − μ_k) ),

assuming x ∈ R^d, μ_k ∈ R^d is a mean vector and Σ_k is a d × d covariance matrix.

Show that the marginal density function of x, p(x), is a Gaussian mixture and derive its expression (as a function of π_k and N(x|μ_k, Σ_k), k = 1, …, K only).
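The marginalization over z can also be checked numerically; in the minimal Python sketch below, the component weights, means and covariances are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component mixture in d = 2 dimensions.
pis = np.array([0.3, 0.7])
mus = [np.zeros(2), np.array([2.0, -1.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.3], [0.3, 2.0]])]
x = np.array([0.5, 0.5])

# Marginalize: p(x) = sum_z p(x|z) p(z), where z ranges over the K one-hot
# vectors, so each term reduces to pi_k * N(x | mu_k, Sigma_k).
p_x = sum(pi * multivariate_normal(mu, S).pdf(x)
          for pi, mu, S in zip(pis, mus, Sigmas))
print(p_x)  # the Gaussian-mixture density of x
```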
Page 3 of 9 pages

MAST90083 Computational Statistics & Data Science Semester 2, 2022
Question 3 (7 marks)
Consider scatterplot data (xi, yi), 1 ≤ i ≤ n, such that

yi = f(xi) + εi

where yi ∈ R, xi ∈ R, and the εi ∈ R are i.i.d. N(0, σ²). The function f(x) = E(y|x) characterizing the underlying trend in the data is some unspecified smooth function that needs to be estimated from (xi, yi), 1 ≤ i ≤ n. For approximating f we propose to use a quartic spline basis with truncated quartic functions 1, x, x², x³, x⁴, (x − k1)⁴₊, …, (x − kK)⁴₊.
(a) Provide the quartic spline model for f and define the set of unknown parameters that need to be estimated.
(b) Derive the matrix form of the model and the associated penalized spline fitting criterion.
(c) Derive the expression for the penalized least squares estimator of the model's unknown parameters and the associated expression for the best fitted values.
(d) Find the degrees of freedom of the fit (effective number of parameters) obtained with the proposed model, and its limiting values as the regularization parameter λ varies from 0 to +∞.
(e) Find the optimism of the fit and its relation to the degrees of freedom.
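For intuition about parts (b)–(d), here is a minimal Python sketch of a penalized fit with the truncated quartic basis; the data, knot placement, λ value and the ridge-type penalty restricted to the truncated-power coefficients are all illustrative assumptions, not prescribed by the question.

```python
import numpy as np

# Penalized quartic truncated-power-basis spline fit (hypothetical data).
rng = np.random.default_rng(0)
n, K, lam = 100, 10, 1.0
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
knots = np.quantile(x, np.linspace(0.1, 0.9, K))

# Design matrix C = [1, x, x^2, x^3, x^4, (x-k_1)_+^4, ..., (x-k_K)_+^4]
C = np.column_stack([x**p for p in range(5)] +
                    [np.clip(x - k, 0, None)**4 for k in knots])

# Penalty D acts only on the truncated-power (knot) coefficients.
D = np.diag([0] * 5 + [1] * K)

# Penalized least squares: beta-hat = (C'C + lambda D)^{-1} C'y
beta = np.linalg.solve(C.T @ C + lam * D, C.T @ y)
fitted = C @ beta                       # best fitted values y-hat

# Degrees of freedom of the fit: tr( C (C'C + lambda D)^{-1} C' )
df = np.trace(C @ np.linalg.solve(C.T @ C + lam * D, C.T))
print(df)  # between 5 (lambda -> infinity) and 5 + K (lambda -> 0)
```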
Page 4 of 9 pages

MAST90083 Computational Statistics & Data Science Semester 2, 2022
Question 4 (9 marks)
Let X = {x1, …, xn} be a set of independent and identically distributed variables from a population distribution Fx0, and let μx denote the mean of this population, assumed to be a scalar. Let Y = {y1, …, yn} be a set of independent and identically distributed variables from a population distribution Fy0, and let μy denote the mean of this population, assumed to be a scalar. We are interested in estimating θ0 = θ(Fx0, Fy0) = (μx − μy)² when the samples X and Y are independent and both populations are taken to be exponential with means μx and μy, having densities

f_{μx}(x) = (1/μx) exp(−x/μx), and f_{μy}(y) = (1/μy) exp(−y/μy).

(Note: For the exponential probability density, σ² = E(X − μ)² = μ² and γ = E(X − μ)³ = 2μ³.)
(a) Using the assumed population parametric probability densities for X and Y, derive the maximum likelihood estimators of μx and μy. What is the relation between the obtained maximum likelihood estimators μ̂xML, μ̂yML and the nonparametric estimators μ̂x and μ̂y obtained from the empirical distributions?
(b) Provide the form of the nonparametric estimator θ̂ of θ0 obtained from the empirical distributions Fx1 and Fy1.
(c) Derive the expression of the bias b1 = E(θ̂) − θ0.
(d) Derive the expression of the bootstrap estimate of b1.
(e) Use the expression of b1 to derive the bootstrap bias-reduced estimate θ̂1 of θ0.
(f) Derive the expression of the bias b2 = E(θ̂1) − θ0.
(g) Compare b1 and b2.
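Parts (d)–(e) can be mimicked by simulation; in the minimal Python sketch below, the sample sizes, population means and number of bootstrap replicates are hypothetical.

```python
import numpy as np

# Bootstrap bias estimation for theta-hat = (xbar - ybar)^2 (hypothetical data).
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=50)   # sample from Exp with mean mu_x = 2
y = rng.exponential(scale=1.0, size=50)   # sample from Exp with mean mu_y = 1

theta_hat = (x.mean() - y.mean())**2      # plug-in (nonparametric) estimator

B = 2000
boot = np.empty(B)
for b in range(B):
    xs = rng.choice(x, size=x.size, replace=True)   # resample from F_x1
    ys = rng.choice(y, size=y.size, replace=True)   # resample from F_y1
    boot[b] = (xs.mean() - ys.mean())**2

b1_boot = boot.mean() - theta_hat   # bootstrap estimate of b1 = E(theta-hat) - theta0
theta_1 = theta_hat - b1_boot       # bias-reduced estimate theta-hat_1 of (e)
print(b1_boot, theta_1)
```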
Page 5 of 9 pages

MAST90083 Computational Statistics & Data Science Semester 2, 2022
Question 5 (8 marks)
Let Y = (y1, …, yn) be a set of n vector observations of dimension m such that yi = (y1i, …, ymi)⊤ ∈ R^m. For modeling these observations we propose to use the parametric model given by

yi = Φ1 y_{i−1} + Φ2 y_{i−2} + … + Φp y_{i−p} + Θ1 x_{i−1} + … + Θq x_{i−q} + εi

where the εi are independent identically distributed normal random vectors with mean vector zero and m × m variance-covariance matrix Σ modeling the approximation errors, X = (x1, …, xn) is another set of n vector observations of dimension m such that xi = (x1i, …, xmi)⊤ ∈ R^m, and the Φj, j = 1, …, p, and Θi, i = 1, …, q, are m × m matrices of unknown coefficients or parameters.
(a) How many vector observations need to be lost to work with this model? What is the effective number of observations?
(b) Provide a matrix form for the model in which the parameters are represented in a ((p + q)m) × m matrix Φ = [Φ1, …, Φp, Θ1, …, Θq]⊤; derive the least squares estimator of Φ and the maximum likelihood estimate of Σ.
(c) What could you describe as an inconvenience of this model? Find the number of parameters involved in the model.
(d) Derive the expression of the log-likelihood for this model.
(e) Use the obtained log-likelihood expression to obtain the expressions of AIC and BIC.
(f) What consequences does this model have on selection criteria?
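A minimal Python sketch of the least-squares fit in (b) follows; the dimensions m, n, p, q and the stand-in series are hypothetical (real use would require actual data).

```python
import numpy as np

# Least-squares fit of the vector model above (hypothetical dimensions/data).
rng = np.random.default_rng(2)
m, n, p, q = 2, 200, 2, 1
Y = rng.normal(size=(n, m))               # stand-in observed series y_1..y_n
X = rng.normal(size=(n, m))               # stand-in exogenous series x_1..x_n

s = max(p, q)                             # the first s observations are lost to lags
rows = []
for i in range(s, n):
    lags = [Y[i - j] for j in range(1, p + 1)] + [X[i - j] for j in range(1, q + 1)]
    rows.append(np.concatenate(lags))
Z = np.array(rows)                        # (n - s) x ((p+q)m) regressor matrix
T = Y[s:]                                 # effective observations

# Least squares: Phi-hat = (Z'Z)^{-1} Z'T, stacking [Phi_1..Phi_p, Theta_1..Theta_q]^T
Phi = np.linalg.solve(Z.T @ Z, Z.T @ T)

E = T - Z @ Phi                           # residuals
Sigma = E.T @ E / (n - s)                 # ML estimate of Sigma
print(Phi.shape, Sigma.shape)
```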
Page 6 of 9 pages
MAST90083 Computational Statistics & Data Science Semester 2, 2022
Question 6 (9 marks)
Given the model
y = Xβ + ε
where y ∈ R^n and X ∈ R^{n×p} is of rank r ≤ p
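As background for the rank condition r ≤ p above, here is a minimal Python sketch of least squares with a rank-deficient design, using the SVD-based pseudoinverse; the data and the induced rank deficiency are hypothetical.

```python
import numpy as np

# Minimum-norm least squares when X has rank r <= p (hypothetical data).
rng = np.random.default_rng(3)
n, p = 30, 5
X = rng.normal(size=(n, p))
X[:, 4] = X[:, 0] + X[:, 1]               # force rank r = 4 < p
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.1, n)

beta = np.linalg.pinv(X) @ y              # minimum-norm least-squares solution
print(np.linalg.matrix_rank(X), beta)
```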
MAST90083 Computational Statistics & Data Science Semester 2, 2022
Question 7 (8 marks)
Suppose we have a three-layer neural network with r input nodes xm, m = 1, …, r, two hidden layers (L = 2) with t1 nodes zi, i = 1, …, t1, and t2 nodes wj, j = 1, …, t2, and s output nodes yk, k = 1, …, s. Let βmi be the weight of the connection xm → zi with bias β0i, let αij be the weight of the connection zi → wj with bias α0j, and let γjk be the weight of the connection wj → yk with bias γ0k. The functions fi(.), i = 1, …, t1, hj(.), j = 1, …, t2 and gk(.), k = 1, …, s are the activation functions for the hidden and output layer nodes respectively (so that zi = fi(.), wj = hj(.) and yk = gk(.)).
(a) Derive the expression for the value of the kth output node of the network as a function of gk(.), γ0k, γjk, hj(.), α0j, αij, fi(.), β0i, βmi and xm.
(b) Derive the matrix form for the vector of outputs of the network.
(c) Under which conditions does this network become equivalent to a single-layer perceptron?
(d) What is the special-case model obtained when the activation functions for the hidden and output nodes are taken to be identity functions?
(e) Let D = {(x1, y1), …, (xn, yn)} be a set of input-output pairs, where xi ∈ Rm, i = 1, …, n are the inputs and yi ∈ Rs, i = 1, …, n are noisy outputs obtained from an unknown function corrupted by i.i.d. additive multivariate Gaussian noise εi ∼ N(0, Σ), where Σ is an s × s covariance matrix. Assuming we want to approximate the unknown function that generated this data set using the three-layer neural network obtained in (a), derive the likelihood loss function to be used for learning the neural network parameters θ (the vector containing all the model's unknown parameters).
(f) Derive the modified likelihood loss function to be used for learning the neural network parameters θ so that the connections between the second hidden layer and the output layer are sparse.
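The forward pass asked for in (a)–(b) can be written compactly in matrix form; the Python sketch below uses hypothetical layer sizes and weights, with tanh hidden activations and an identity output activation as illustrative choices.

```python
import numpy as np

# Forward pass of the three-layer network (hypothetical sizes and weights).
rng = np.random.default_rng(4)
r, t1, t2, s = 3, 5, 4, 2
f = h = np.tanh                            # hidden activations f_i, h_j
g = lambda u: u                            # output activation g_k (identity)

B = rng.normal(size=(t1, r)); b0 = rng.normal(size=t1)   # beta_mi, beta_0i
A = rng.normal(size=(t2, t1)); a0 = rng.normal(size=t2)  # alpha_ij, alpha_0j
G = rng.normal(size=(s, t2)); g0 = rng.normal(size=s)    # gamma_jk, gamma_0k

x = rng.normal(size=r)                     # one input vector
z = f(b0 + B @ x)                          # z_i = f_i(beta_0i + sum_m beta_mi x_m)
w = h(a0 + A @ z)                          # w_j = h_j(alpha_0j + sum_i alpha_ij z_i)
y = g(g0 + G @ w)                          # y_k = g_k(gamma_0k + sum_j gamma_jk w_j)
print(y)
```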
Page 8 of 9 pages

MAST90083 Computational Statistics & Data Science Semester 2, 2022
Question 8 (5 marks)
Let D = {(xi, yi), i = 1, …, n}, where xi ∈ Rp and yi ∈ {−1, +1}, be a training data set. We are interested in finding a hyperplane (linear decision boundary) f(x) = β0 + x⊤β that approximately separates the training observations according to their class labels, i.e.

yi (β0 + xi⊤β) ≥ 1 − εi, i = 1, …, n,

where the {εi} are slack variables that allow individual observations to be misclassified, with Σ_{i=1}^{n} εi ≤ t. The quantity t is a parameter that helps control the level of misclassification. This problem can be seen as that of finding the optimal hyperplane, namely the hyperplane that maximizes the margin 2/∥β∥ subject to these constraints, or

Minimize (1/2)∥β∥² + τ Σ_{i=1}^{n} εi
Subject to εi ≥ 0, yi (β0 + xi⊤β) ≥ 1 − εi, i = 1, …, n.
(a) Define the Lagrange function (primal functional) that can be used to solve this problem and the associated primal variables.
(b) Derive the dual functional of the optimization problem in matrix form and define the associated parameters.
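As a numerical companion to (a)–(b), the soft-margin problem can be solved with an off-the-shelf dual solver; the Python sketch below uses scikit-learn's SVC on hypothetical toy data, with its cost parameter C playing the role of τ.

```python
import numpy as np
from sklearn.svm import SVC

# Soft-margin linear SVM on hypothetical two-class toy data.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1, 1, size=(20, 2)), rng.normal(1, 1, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# SVC solves the dual: max sum(alpha) - 1/2 sum alpha_i alpha_j y_i y_j x_i'x_j
# subject to 0 <= alpha_i <= C and sum alpha_i y_i = 0.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

beta = clf.coef_.ravel()                  # beta = sum_i alpha_i y_i x_i
beta0 = clf.intercept_[0]
print(beta, beta0, clf.support_)          # support vectors have alpha_i > 0
```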
End of Exam — Total Available Marks = 55
Page 9 of 9 pages