(6 credits)
Question Nr. 4HR86EZ11LNA05GE44NY1
The olive dataset from the dslabs package contains the % of 8 fatty acids found in Italian olive oils.
library(dslabs)
data(olive)
head(olive)
##           region         area palmitic palmitoleic stearic oleic linoleic
## 1 Southern Italy North-Apulia    10.75        0.75    2.26 78.23     6.72
## 2 Southern Italy North-Apulia    10.88        0.73    2.24 77.09     7.81
## 3 Southern Italy North-Apulia     9.11        0.54    2.46 81.13     5.49
## 4 Southern Italy North-Apulia     9.66        0.57    2.40 79.52     6.19
## 5 Southern Italy North-Apulia    10.51        0.67    2.59 77.71     6.72
## 6 Southern Italy North-Apulia     9.11        0.49    2.68 79.24     6.78
##   linolenic arachidic eicosenoic
## 1      0.36      0.60       0.29
## 2      0.31      0.61       0.29
## 3      0.31      0.63       0.29
## 4      0.50      0.78       0.35
## 5      0.50      0.80       0.46
## 6      0.51      0.70       0.44
Write R code using ggplot2 to plot the distribution of the % of the fatty acids except oleic using density plots with different line colors to distinguish each fatty acid. Give meaningful labels to both axes.
library(data.table)
library(ggplot2)
dt <- as.data.table(olive)
dt[, oleic := NULL]  # drop oleic, as the question asks
# long format: one row per (sample, fatty acid) pair
olive_melted <- melt(dt, id.vars = c("region", "area"), variable.name = "fatty_acid", value.name = "value")
ggplot(olive_melted, aes(x = value, color = fatty_acid)) +
  geom_density() +
  labs(x = "Fatty acid content (%)", y = "Density", color = "Fatty acid")
IN-DataViz-1-20200629-E5115-04 – Page 4 / 24
Question Nr. 3QM41LNA09GZ73DB55AQ5
The mpg dataset from the ggplot2 package contains various features of cars, mostly related to fuel.
library(ggplot2)
data(mpg)
head(mpg)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
Write R code that produces the following plot.
[Figure: multi-panel plot, each panel with its own axis scales; only the axis tick labels survive.]
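library(data.table)
data <- as.data.table(mpg)
# drv is a character column, so it is kept as an id variable rather than melted
data <- melt(data, id.vars = c("manufacturer", "model", "year", "trans", "drv", "fl", "class"))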
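The target plot itself is not recoverable here, so the following is only a sketch under an assumption: that the panels show density estimates of the numeric columns displ, cyl, cty and hwy, one panel per variable with free axis scales. The names `mpg_dt`, `mpg_long` and `num_cols` are illustrative.

```r
library(data.table)
library(ggplot2)

mpg_dt <- as.data.table(mpg)

# Assumption: the panels correspond to these four numeric columns.
num_cols <- c("displ", "cyl", "cty", "hwy")

# Convert to a common type so that melt() does not have to coerce.
mpg_dt[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]

# Long format: one row per (car, variable) pair.
mpg_long <- melt(mpg_dt, measure.vars = num_cols)

# One density panel per variable, each with its own x and y scale.
p <- ggplot(mpg_long, aes(x = value)) +
  geom_density() +
  facet_wrap(~variable, scales = "free")
print(p)
```

If the original figure used a different geometry or a different set of variables, only `num_cols` and the `geom_*` layer would need to change.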
Question Nr. 0GW52IK81LNA06UY23CW8
The admissions dataset from the dslabs package provides the number of applicants and admitted students to 6 different majors stratified by gender. Write R code using ggplot2 to plot the difference between applicants and admitted students on each major, using bars and stratified by gender using facets. Give meaningful labels to both axes. Do not mind if you obtain negative values.
library(dslabs)
data(admissions)
admissions
##    major gender admitted applicants
## 1      A    men       62        825
## 2      B    men       63        560
## 3      C    men       37        325
## 4      D    men       33        417
## 5      E    men       28        191
## 6      F    men        6        373
## 7      A  women       82        108
## 8      B  women       68         25
## 9      C  women       34        593
## 10     D  women       35        375
## 11     E  women       24        393
## 12     F  women        7        341
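One possible way to approach this question is sketched below. To keep the example self-contained, the data frame is typed in from the table printed in the question instead of being loaded from dslabs; the names `adm` and `difference` are illustrative, and the difference is taken as applicants minus admitted, as the question words it.

```r
library(ggplot2)

# Values copied from the admissions table printed in the question.
adm <- data.frame(
  major = rep(LETTERS[1:6], times = 2),
  gender = rep(c("men", "women"), each = 6),
  admitted = c(62, 63, 37, 33, 28, 6, 82, 68, 34, 35, 24, 7),
  applicants = c(825, 560, 325, 417, 191, 373, 108, 25, 593, 375, 393, 341)
)

# Difference between applicants and admitted students (may be negative).
adm$difference <- adm$applicants - adm$admitted

# One bar per major, one facet per gender, labelled axes.
p <- ggplot(adm, aes(x = major, y = difference)) +
  geom_col() +
  facet_wrap(~gender) +
  labs(x = "Major", y = "Applicants minus admitted students")
print(p)
```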
Problem 2 (2 credits)
Which operation has been applied to table A and table B to return the result table? Justify your answer. Write one line of R code that would produce the result table, assuming a data table A and a data table B in the working environment.

Table A:
id CreditCard       CCV type
15 1837655746651971 582 l
21 5927428911423246 221 i
14 7393954899774435 142 r
23 7844437946592947 479 l
16 7364376521545978 881 l
13 3923818281216234 698 o
 8 1764682661721638 566 o
24 2622321425978251 528 o
19 7271112241595296 393 o
18 4225693846619738 421 r

Table B:
firstName lastName   customer_id
Aamina    el-Sinai   16
Marcus    Hendrix    13
Derek     Martinez    8
Muntasir  al-Sharifi 24
Alexis    Smith      19
Alexis    Arreola    18
Khanea    Forrest     1
Julia     Deronde     6
Keith     Hart       25
Tiana     Ramirez     9
Adam      Highman     2

Result table:
id CreditCard       CCV type firstName lastName
 1 NA               NA  NA   Khanea    Forrest
 2 NA               NA  NA   Adam      Highman
 6 NA               NA  NA   Julia     Deronde
 8 1764682661721638 566 o    Derek     Martinez
 9 NA               NA  NA   Tiana     Ramirez
13 3923818281216234 698 o    Marcus    Hendrix
14 7393954899774435 142 r    NA        NA
15 1837655746651971 582 l    NA        NA
16 7364376521545978 881 l    Aamina    el-Sinai
18 4225693846619738 421 r    Alexis    Arreola
19 7271112241595296 393 o    Alexis    Smith
21 5927428911423246 221 i    NA        NA
23 7844437946592947 479 l    NA        NA
24 2622321425978251 528 l    Muntasir  al-Sharifi
25 NA               NA  NA   Keith     Hart
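A result in which unmatched rows from both inputs survive, padded with NA, is what a full outer join produces. The sketch below illustrates that operation on three rows from each table in the question. It uses base data frames to stay self-contained, on the assumption that the same `merge()` arguments apply to data.table objects; the cut-down tables are illustrative, not the full exam data.

```r
# Three rows from table A (id and CCV only) and three from table B.
A <- data.frame(id = c(8, 13, 21), CCV = c(566, 698, 221))
B <- data.frame(firstName = c("Khanea", "Derek", "Marcus"),
                customer_id = c(1, 8, 13))

# Full outer join: the key is called id in A and customer_id in B,
# and all = TRUE keeps rows that have no partner in the other table.
res <- merge(A, B, by.x = "id", by.y = "customer_id", all = TRUE)
print(res)
```

Here id 1 exists only in B, so its CCV is NA, and id 21 exists only in A, so its firstName is NA. Setting `all.x = TRUE` or `all.y = TRUE` instead would give a left or right join.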
Problem 3 (2 credits)
Question Nr. 6CK12JZ6CD9OQ44JL0
Which normal Q-Q plot (i.e. Q-Q plot against the standard Normal distribution), A, B, C, or D, corresponds to the distribution drawn as a solid line in the plot below? The standard Normal distribution, i.e. the Gaussian distribution with mean 0 and variance 1, is shown in the plot below as a dashed line.
[Figure: density of the distribution in question (solid line) and of the standard Normal distribution (dashed line); x-axis ticks from −15 to 15.]
[Figure: four normal Q-Q plots (candidates A, B, C, D), each with x-axis "theoretical" (ticks from −4 to 4) and y-axis "sample".]
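Reading a normal Q-Q plot comes down to comparing sample quantiles with standard Normal quantiles. As a minimal sketch on simulated data (the sample and its standard deviation of 5 are hypothetical, not taken from the exam), a Gaussian sample that is wider than the standard Normal gives points on a line steeper than the diagonal:

```r
set.seed(1)

# Hypothetical sample: Gaussian, centred at 0, but with standard deviation 5.
wide <- rnorm(1000, mean = 0, sd = 5)

# qqnorm() pairs theoretical N(0, 1) quantiles (x) with sorted sample values (y).
qq <- qqnorm(wide, plot.it = FALSE)

# Slope of the Q-Q points; a standard Normal sample would give roughly 1.
slope <- unname(coef(lm(qq$y ~ qq$x))[2])
print(slope)
```

Heavy tails, skewness or a shifted mean change the shape or position of the point cloud in other ways, which is what distinguishes the four candidate panels.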
Question Nr. 3JN34PA3DD3YU35RP0
Which normal Q-Q plot (i.e. Q-Q plot against the standard Normal distribution), A, B, C, or D, corresponds to the distribution drawn as a solid line in the plot below? The standard Normal distribution, i.e. the Gaussian distribution with mean 0 and variance 1, is shown in the plot below as a dashed line.
[Figure: density of the distribution in question (solid line) and of the standard Normal distribution (dashed line); x-axis ticks from −15 to 15.]
[Figure: four normal Q-Q plots (candidates A, B, C, D), each with x-axis "theoretical" (ticks from −4 to 4) and y-axis "sample".]
Problem 7 (2 credits)
Question Nr. 8ZF48OQ3WP45KH55GQ0
We consider a linear regression model parameterized as
$y_i = \alpha + \beta \cdot x_i + \varepsilon_i$
where $i = 1 \dots N$ denotes the data point indices, $y_i$ is the response variable, $\alpha$ and $\beta$ are the coefficients, $x_i$ is the explanatory variable, and $\varepsilon_i$ is the error term. Let $\hat{y}_i$ be the fitted value.
Does the following plot provide evidence against the assumptions of the linear regression? Justify.
[Figure: plot with x-axis "theoretical quantiles" and y-axis "response quantiles".]
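A Q-Q plot of the response is easy to confuse with a Q-Q plot of the residuals. The sketch below uses simulated data (all names and values are hypothetical, not part of the exam) to illustrate that the response can be clearly non-Gaussian even when the error term is Gaussian by construction. A Shapiro-Wilk test serves here as a numeric stand-in for reading the two Q-Q plots.

```r
set.seed(42)

# Hypothetical data: skewed explanatory variable, Gaussian errors.
x <- rexp(500)
y <- 1 + 2 * x + rnorm(500)

fit <- lm(y ~ x)

# Normality check on the response versus on the residuals.
p_response  <- shapiro.test(y)$p.value
p_residuals <- shapiro.test(residuals(fit))$p.value

print(c(response = p_response, residuals = p_residuals))
```

The response inherits the skew of `x`, so its normality is rejected, while the residuals are the quantity on which the Gaussian-error assumption would be checked.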
Question Nr. 0JU8SP106HN41KM58HL9
We consider a linear regression model parameterized as
$y_i = \alpha + \beta \cdot x_i + \varepsilon_i$
where $i = 1 \dots N$ denotes the data point indices, $y_i$ is the response variable, $\alpha$ and $\beta$ are the coefficients, $x_i$ is the explanatory variable, and $\varepsilon_i$ is the error term. Let $\hat{y}_i$ be the fitted value.
Does the following plot provide evidence against the assumptions of the linear regression? Justify.
[Figure: only the axis tick labels 0, 20, 40, 60 survive.]
Problem 8 (2 credits)
Question Nr. 4LY90YX24AQ71F66EX2
library(dslabs)
Consider the “brca” dataset from the dslabs package. Fit a logistic regression model that predicts the response variable brca$y given the feature perimeter_se. Assume that all assumptions of the logistic regression model are met. Starting from an original probability of 10% of malignancy (cancer), by how much does the probability of malignancy increase when the feature perimeter_se increases by 0.8?
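In a logistic regression the coefficient acts additively on the log-odds, so a change in the feature has to be applied on the logit scale and then mapped back to a probability. The sketch below shows one way to do this. The helper `shift_probability` and the data frame `df` are illustrative names, and the dslabs part runs only if the package is installed.

```r
# New probability after the feature increases by delta, given a starting
# probability p0 and a logistic regression coefficient beta.
shift_probability <- function(p0, beta, delta) {
  plogis(qlogis(p0) + beta * delta)
}

if (requireNamespace("dslabs", quietly = TRUE)) {
  data("brca", package = "dslabs")
  df <- data.frame(malignant = brca$y == "M",
                   perimeter_se = brca$x[, "perimeter_se"])
  fit <- glm(malignant ~ perimeter_se, data = df, family = binomial)
  beta <- unname(coef(fit)["perimeter_se"])

  # Increase in probability when starting from 10% and adding 0.8.
  p1 <- shift_probability(0.10, beta, 0.8)
  print(p1 - 0.10)
}
```

As a sanity check of the helper, a coefficient of log(2) doubles the odds per unit: starting from odds 1/9 (probability 0.1), a one-unit increase gives odds 2/9, i.e. probability 2/11.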
Problem 9 (2 credits)
QuestionId: 4YF2RH7TD00JE84ZN0
Consider the features smoothness_mean and radius_mean from the brca dataset. Provide R code that plots a ROC curve of both features as predictors of malignancy (variable brca$y == “M”), and indicate the feature that has the highest true positive rate when the false positive rate is 0.4.
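For the ROC question, one possible sketch is given below. It computes the ROC points by hand, assuming that larger feature values point towards malignancy, and reads off the highest true positive rate reachable at a false positive rate of at most 0.4. The helpers `roc_points` and `tpr_at` are illustrative, and the dslabs and ggplot2 part runs only if both packages are installed.

```r
# ROC points obtained by thresholding the score at each observed value.
roc_points <- function(score, is_positive) {
  thresholds <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(thresholds, function(t) mean(score[is_positive] >= t))
  fpr <- sapply(thresholds, function(t) mean(score[!is_positive] >= t))
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

# Highest true positive rate with a false positive rate of at most fpr_target.
tpr_at <- function(roc, fpr_target) {
  max(roc$tpr[roc$fpr <= fpr_target])
}

if (requireNamespace("dslabs", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  data("brca", package = "dslabs")
  malignant <- brca$y == "M"
  features <- c("smoothness_mean", "radius_mean")

  # Stack the ROC points of both features into one data frame.
  roc <- do.call(rbind, lapply(features, function(f) {
    cbind(feature = f, roc_points(brca$x[, f], malignant))
  }))

  p <- ggplot(roc, aes(x = fpr, y = tpr, color = feature)) +
    geom_path() +
    geom_abline(linetype = "dashed") +
    labs(x = "False positive rate", y = "True positive rate")
  print(p)

  # True positive rate of each feature at a false positive rate of 0.4.
  print(sapply(features, function(f) tpr_at(roc[roc$feature == f, ], 0.4)))
}
```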
Problem 10 (2 credits)
library(dslabs)
library(data.table)
Question Nr. 4IL88XL37CT72GN68OW7
Consider the variable ‘concavity_worst’ of the ‘brca’ dataset. A researcher states that no other variable from the matrix ‘brca$x’ associates with the variable ‘concavity_worst’ according to Spearman’s correlation. Do you reject this hypothesis using the significance level of 1%? Provide code and justify your answer. Do not mind warnings, if any, about exact p-values with ties.
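The researcher's statement covers many variables at once, so each Spearman test has to be corrected for multiple testing before comparing with the 1% level. The sketch below is one way to set this up. The Bonferroni correction is an assumption, since the exam does not name a method, and `spearman_screen` is an illustrative helper. The dslabs part runs only if the package is installed.

```r
# Spearman test of `target` against every other column of matrix x,
# returning Bonferroni-adjusted p-values (assumed correction method).
spearman_screen <- function(x, target) {
  others <- setdiff(colnames(x), target)
  p <- sapply(others, function(v) {
    # Warnings about exact p-values with ties are silenced on purpose.
    suppressWarnings(
      cor.test(x[, target], x[, v], method = "spearman")$p.value
    )
  })
  p.adjust(p, method = "bonferroni")
}

if (requireNamespace("dslabs", quietly = TRUE)) {
  data("brca", package = "dslabs")
  adj <- spearman_screen(brca$x, "concavity_worst")
  # Number of variables still significant at the 1% level after correction.
  print(sum(adj < 0.01))
}
```

If at least one adjusted p-value falls below 0.01, the global statement that no variable associates with the target would be rejected at that level.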
Problem 11 (1 credit)
Question Nr. 0PJ29LOW8LMHZ4060
Assume 5 elements v, w, x, y and z. A first clustering gave the clusters {v} and {w,x,y,z}. One then runs k-means, which yields two clusters: {v,w,x} and {y,z}. Applying hierarchical clustering also yields two clusters: {v,w} and {x,y,z}. Which of the k-means clustering and the hierarchical clustering is most similar to the first clustering? Support your answer with a metric learned in the lecture.
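One metric that fits this comparison is the Rand index: the fraction of element pairs on which two clusterings agree, either both placing the pair together or both placing it apart. Whether this is the metric meant by the lecture is an assumption. The sketch below computes it for the three clusterings in the question; the function name and the integer cluster labels are illustrative.

```r
# Rand index: share of element pairs treated the same way by both clusterings.
rand_index <- function(a, b) {
  pairs <- combn(length(a), 2)
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]
  mean(same_a == same_b)
}

# Cluster labels for v, w, x, y, z, taken from the question.
first     <- c(v = 1, w = 2, x = 2, y = 2, z = 2)
kmeans_cl <- c(v = 1, w = 1, x = 1, y = 2, z = 2)
hclust_cl <- c(v = 1, w = 1, x = 2, y = 2, z = 2)

print(rand_index(first, kmeans_cl))  # 4 of 10 pairs agree: 0.4
print(rand_index(first, hclust_cl))  # 6 of 10 pairs agree: 0.6
```

Under this metric the hierarchical clustering is closer to the first clustering than the k-means result.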