MTHM017 Advanced Topics in Statistics Assignment
Please make sure that the submitted work is your own. This is NOT a group assignment, therefore approaches and solutions shouldn’t be discussed with other students. Plagiarism and collusion with other students are examples of academic misconduct and will be reported. More information on academic honesty can be found here.
The assignment has two main parts. Part A involves (i) fitting a Poisson regression model to assess the effect of using different priors, and (ii) fitting a fixed effects and a random effects model in order to compare posterior distributions under different models. Part B involves using different methods for classification of data into two groups.
A. Bayesian Inference [66 marks]
1. The first question of part A involves fitting a Poisson regression model using the Scotland dataset, which contains the observed and expected counts of lip cancer cases for the local authority administrative areas in Scotland between 1975 and 1986.
a. [3 marks] Calculate the Standard Mortality Ratios (SMR=observed/expected) for each administrative area and plot the distribution of the SMRs. Then plot a map of the SMRs by administrative area. To map the SMRs you may use the ScotlandMap function readily available on the ELE page.
The code below uses random numbers to demonstrate how the ScotlandMap function works. Note that in order for this to run you need to add the shapefiles for Scotland (.shx, .shp, .prj, .dbf) to your working directory. The ScotlandMap.R file and the shapefiles are available on the ELE page.
We are interested in estimating the relative risk (RR) for each area. The relative risk is the parameter that quantifies whether an area i has a higher (RRi > 1) or lower (RRi < 1) occurrence of cases than what we would expect from the reference rates. In order to estimate the relative risk we are going to fit a Poisson model of the following form:
Obsi ∼ Pois(μi)
log(μi) = log(Expi) + β0 + θi
RRi = exp(β0)exp(θi)
Where θi ∼ Normal(0,σ2). Here, the Exp(ected) numbers are an ‘offset’, which are treated as fixed
data (i.e. no coefficient is estimated for this term).
b. [3 marks] Describe the role (interpretation) of β0 and the set of θ’s in this model and how they contribute to the estimation of RR. Focus on the meaning of these parameters in the estimation of the relative risk.
library(tidyverse)
library(RColorBrewer)
library(sf)
library(rgdal)
source("ScotlandMap.R") # need to read in the ScotlandMap function testdat <- runif(56) # generate random numbers to use as observations ScotlandMap(testdat,figtitle="Scotland random numbers")
c. [14 marks] Code up this Poisson(-Lognormal) model in JAGS to analyse the Scotland data. For the parameter β0 use a Normal prior with mean 0 and variance 1000. And for the precision parameter τ(= 1/σ2) of θi use a prior τ ∼ Gamma(1,0.05) (which is parametrised so as to have a mean of 20). Initialise 2 chains and run the model with these two chains. You will have to decide on the appropriate values of n.iter and burnin. Produce trace plots for the chains and summaries of all the model parameters. Investigate whether the chains for all the parameters have converged.
d. [4 marks] Extract the posterior means for the RR and map them. Compare the posterior mean relative risk to the SMR for each area. Note that we can think about the SMR as the maximum likelihood estimate of the relative risk.
e. [7 marks] Now calculate the posterior probabilities that the relative risk in each area exceeds 1. Extract these probabilities and map them. For which areas are you at least 95% certain that there is an excess risk of lip cancer (i.e. relative risk > 1)?
f. [6 marks] Repeat the analysis with different priors for β0 and τ. The exact choice is yours, but explain why you have chosen them and what they mean. Map the two sets of RRs and explain any differences you see in the map and the summaries of the posteriors for the parameters of the model.
2. In this question we will analyse the effect different models have on the posterior distribution. We will use the surgical.csv file, which contains data on 12 hospitals, where the column n gives the total number of heart surgery operations carried out on children under 1 year old by each centre between April 1991 and March 1995, and the column r gives the number of these operations where the patient died within 30 days of the operation.
A binomial model seems reasonable for these data, so we will focus on comparing different models for the hospital specific mortality rates.
We will first fit a logistic random effects model of the following form:
ri ∼ Binom(ni, θi), logit(θi) ∼ N(μ,σ2).
This model assumes that failure rates (θi) across hospitals are similar in some way. This similarity is represented by the assumption that the mortality rates for the hospitals are drawn from the same (parent) distribution.
a. [9 marks] Code this model in JAGS to analyse the surgical data. Use vague priors for the parameters of the shared normal distribution. In particular you can use the following model definition:
jags.mod <- function(){ for(i in 1:12){
r[i] ~ dbin(theta[i],n[i])
logit(theta[i]) <- logit.theta[i]
logit.theta[i] ~ dnorm(mu, tau)
mu ~ dnorm(0,1.0E-3)
tau <- 1/sigmaˆ2
sigma ~ dunif(0,100)
Run the model for 10,000 iterations, with 2 chains, discarding the first 5,000 as ‘burn-in’. Produce traceplots for the chains and summaries for the fitted parameters. Comment on whether the chains for all the parameters have converged.
b. [8 marks] We want to know whether the assumptions of the above random effects model are reasonable, or whether there is any evidence that the mortality rates for one or more of the hospitals are not drawn from the same parent distribution. We will assess this by comparing the observed and predicted number of deaths in each hospital.
Programming Help, Add QQ: 749389476
Edit your JAGS model so that it (i) finds the posterior prediction of the number of deaths (rpred) in i
each hospital, and (ii) computes the following posterior probability for each hospital: pi =P(rpred >ri)+1P(rpred =ri).
The above probability is called the ‘mid p-value’. This allows us to summarise conflict between discrete quantities.
Examining the mid p-values, are there any hospitals that appear to have an unusual mortality rate? Confirm your conclusion by producing kernel density plots of the predicted number of deaths in each hospital, and visually comparing these to the observed number of deaths in each hospital.
c. [8 marks] As an alternative model we will consider the independent (fixed) effects model, which has the following hospital-specific mortality rates:
θi ∼ Unif(0,1),
for i = 1,2,…,12.
Fit this alternative model, and compare the posterior distributions of the hospital mortality rates under the random effects and the fixed effects models. Interpret the differences. (Hint: the comparison is easiest to do by producing box plots of the mortality rates θ under each model).
d. [4 mark] Compute the mid p-values for the fixed effects model and compare these to the mid p-values of the random effects model. Does the fixed effects model better explain the unusual mortality rates?
B. Classification [34 marks]
The following figure shows the information in the dataset Classification.csv – it shows two different groups, plotted against two explanatory variables. This is simulated data – the aim is to find a suitable method for classifying the 1000 datapoints into two groups from a selection of possible approaches.
1. [5 marks] Summarise the two groups (separately) in terms of the variables X1 and X2, and for each group plot the distribution of both variables. Describe your findings.
Programming Help
2. [4 marks] Considering the plot showing the observations, your density plots and the numerical summaries, which of the following classification methods do you think are suitable for classifying this data? For each method, you should explain the reasons behind your answer (i.e. why is the given method suitable/why it might not be suitable)?
a. Linear discriminant analysis.
b. Quadratic discriminant analysis.
c. K-nearest neighbour classification. d. Support vector machines.
e. Random forests.
3. [1 marks] Select 80% of the data to act as a training set, with the remaining 20% for testing/evaluation.
4. [22 marks] Choose three of the methods listed in Q2 that might be suitable to classify the data. Perform classification using these methods. (If you use more than three of the listed methods, only the first three will be considered for marking). In each case, briefly describe how the given classification method classifies the data, present the results of an evaluation of the model fit (highlighting different aspects of the model performance) and describe your findings. Where appropriate optimise the
(hyper)parameters of the method.
5. [2 marks] Compare the results from your chosen three approaches and select what you think is the best method for classification in this case, explaining your reasoning.
Total for paper = 100 marks
The deadline for submission is Noon (12pm), 31st March. Note that late submissions will be penalised.
You should submit a pdf that contains your answers (and relevant output/plots!) to the questions via eBart. In Part A you should use the R programming language, but in Part B you can choose to use R or Python (or both).
浙大学霸代写 加微信 cstutorcs