(6 credits)
Question Nr. 4HR86EZ11LNA05GE44NY1
The olive dataset from the dslabs package contains the % of 8 fatty acids found in Italian olive oils.
library(dslabs)
data(olive)
head(olive)
##           region         area palmitic palmitoleic stearic oleic linoleic
## 1 Southern Italy North-Apulia    10.75        0.75    2.26 78.23     6.72
## 2 Southern Italy North-Apulia    10.88        0.73    2.24 77.09     7.81
## 3 Southern Italy North-Apulia     9.11        0.54    2.46 81.13     5.49
## 4 Southern Italy North-Apulia     9.66        0.57    2.40 79.52     6.19
## 5 Southern Italy North-Apulia    10.51        0.67    2.59 77.71     6.72
## 6 Southern Italy North-Apulia     9.11        0.49    2.68 79.24     6.78
##   linolenic arachidic eicosenoic
## 1      0.36      0.60       0.29
## 2      0.31      0.61       0.29
## 3      0.31      0.63       0.29
## 4      0.50      0.78       0.35
## 5      0.50      0.80       0.46
## 6      0.51      0.70       0.44
Write R code using ggplot2 to plot the distribution of the % of the fatty acids except oleic using density plots with different line colors to distinguish each fatty acid. Give meaningful labels to both axes.
library(data.table)
library(ggplot2)
dt <- as.data.table(olive)
# Drop oleic (:= modifies dt by reference, no reassignment needed)
dt[, oleic := NULL]
# Long format: one row per sample and fatty acid
olive_melted <- melt(dt, id.vars = c("region", "area"), variable.name = "fatty_acid", value.name = "value")
# One density curve per fatty acid, distinguished by line color
ggplot(olive_melted, aes(x = value, color = fatty_acid)) +
  geom_density() +
  labs(x = "Fatty acid content (%)", y = "Density", color = "Fatty acid")
IN-DataViz-1-20200629-E5115-04 – Page 4 / 24
Question Nr. 3QM41LNA09GZ73DB55AQ5
The mpg dataset from the ggplot2 package contains various features of cars, mostly fuel-related.
library(ggplot2)
data(mpg)
head(mpg)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
Write R code that produces the following plot.
[Figure not reproduced: four panels, each with its own axis scale; only the axis tick labels survive in this copy.]
library(data.table)
data <- as.data.table(mpg)
# drv is a character column: keep it as an id variable so that only the
# numeric features (displ, cyl, cty, hwy) are melted into the value column
data <- melt(data, id.vars = c("manufacturer", "model", "year", "trans", "drv", "fl", "class"))
Question Nr. 0GW52IK81LNA06UY23CW8
The admissions dataset from the dslabs package provides the number of applicants and admitted students to 6 different majors stratified by gender.
library(dslabs)
data(admissions)
admissions
##    major gender admitted applicants
## 1      A    men       62        825
## 2      B    men       63        560
## 3      C    men       37        325
## 4      D    men       33        417
## 5      E    men       28        191
## 6      F    men        6        373
## 7      A  women       82        108
## 8      B  women       68         25
## 9      C  women       34        593
## 10     D  women       35        375
## 11     E  women       24        393
## 12     F  women        7        341
Write R code using ggplot2 to plot the difference between applicants and admitted students on each major, using bars and stratified by gender using facets. Give meaningful labels to both axes. Do not mind if you obtain negative values.
Problem 2 (2 credits)
Which operation has been applied to table A and table B to return the result table? Justify your answer. Write one line of R code that would produce the result table, assuming a data table A and a data table B in the working environment.
Table A:
id  CreditCard        CCV  type
15  1837655746651971  582  l
21  5927428911423246  221  i
14  7393954899774435  142  r
23  7844437946592947  479  l
16  7364376521545978  881  l
13  3923818281216234  698  o
8   1764682661721638  566  o
24  2622321425978251  528  o
19  7271112241595296  393  o
18  4225693846619738  421  r
Table B:
firstName  lastName    customer_id
Aamina     el-Sinai    16
Marcus     Hendrix     13
Derek      Martinez    8
Muntasir   al-Sharifi  24
Alexis     Smith       19
Alexis     Arreola     18
Khanea     Forrest     1
Julia      Deronde     6
Keith      Hart        25
Tiana      Ramirez     9
Adam       Highman     2
Result table:
id  CreditCard        CCV  type  firstName  lastName
                      NA   NA    Khanea     Forrest
                      NA   NA    Adam       Highman
                      NA   NA    Julia      Deronde
8   1764682661721638  566  o     Derek      Martinez
                      NA   NA    Tiana      Ramirez
13  3923818281216234  698  o     Marcus     Hendrix
14  7393954899774435  142  r     NA         NA
15  1837655746651971  582  l     NA         NA
16  7364376521545978  881  l     Aamina     el-Sinai
18  4225693846619738  421  r     Alexis     Arreola
19  7271112241595296  393  o     Alexis     Smith
21  5927428911423246  221  i     NA         NA
23  7844437946592947  479  l     NA         NA
24  2622321425978251  528  l     Muntasir   al-Sharifi
                      NA   NA    Keith      Hart
[The id and CreditCard entries of rows 1, 2, 3, 5 and 15 of the result table are missing in this copy.]
Problem 3 (2 credits)
Question Nr. 6CK12JZ6CD9OQ44JL0
Which normal Q-Q plot (i.e. Q-Q plot against the standard Normal distribution) A, B, C, or D corresponds to the distribution depicted with a solid line in the plot below? The standard Normal distribution, i.e. the Gaussian distribution with mean 0 and variance 1, is shown in the plot below with a dashed line.
[Figure not reproduced: density plot with x-axis from −15 to 15, followed by four Q-Q plots, each with x-axis "theoretical" (−4 to 4) and y-axis "sample".]
Question Nr. 3JN34PA3DD3YU35RP0
Which normal Q-Q plot (i.e. Q-Q plot against the standard Normal distribution) A, B, C, or D corresponds to the distribution depicted with a solid line in the plot below? The standard Normal distribution, i.e. the Gaussian distribution with mean 0 and variance 1, is shown in the plot below with a dashed line.
[Figure not reproduced: density plot with x-axis from −15 to 15, followed by four Q-Q plots, each with x-axis "theoretical" (−4 to 4) and y-axis "sample".]
Problem 7 (2 credits)
Question Nr. 8ZF48OQ3WP45KH55GQ0
We consider a linear regression model parameterized as $y_i = \alpha + \beta \cdot x_i + \varepsilon_i$, where $i = 1 \dots N$ denotes the data point indices, $y_i$ is the response variable, $\alpha$ and $\beta$ the coefficients, $x_i$ the explanatory variable and $\varepsilon_i$ the error term. Let $\hat{y}_i$ be the fitted value. Does the following plot provide evidence against the assumptions of the linear regression? Justify.
[Figure not reproduced: Q-Q plot with x-axis "theoretical quantiles" and y-axis "response quantiles".]
Question Nr. 0JU8SP106HN41KM58HL9
We consider a linear regression model parameterized as $y_i = \alpha + \beta \cdot x_i + \varepsilon_i$, where $i = 1 \dots N$ denotes the data point indices, $y_i$ is the response variable, $\alpha$ and $\beta$ the coefficients, $x_i$ the explanatory variable and $\varepsilon_i$ the error term. Let $\hat{y}_i$ be the fitted value. Does the following plot provide evidence against the assumptions of the linear regression? Justify.
[Figure not reproduced: only the axis ticks 0, 20, 40, 60 survive in this copy.]
Problem 8 (2 credits)
Question Nr. 4LY90YX24AQ71F66EX2
library(dslabs)
Consider the “brca” dataset from the dslabs package. Fit a logistic regression model which predicts the response variable brca$y given the feature perimeter_se. Assume that all assumptions of the logistic regression model are met. Starting from an original probability of 10% of being malignant (cancer), by how much does the probability of being malignant increase when the feature perimeter_se increases by 0.8?
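One possible way to approach the logistic regression question is sketched below. It is an illustration, not the official solution. It assumes that "malignant" corresponds to brca$y == "M" and that perimeter_se is a column of the matrix brca$x; the names dat, fit, beta, p0, p1 and increase are introduced here for illustration only. The idea is that a change of 0.8 in the feature shifts the log-odds by 0.8 times the fitted coefficient, so the new probability is obtained by moving from the logit of 10% and transforming back.

```r
library(dslabs)
data(brca)

# Response (1 = malignant, 0 = benign) and the single feature in one data frame.
# Assumption: perimeter_se is a column name of the matrix brca$x.
dat <- data.frame(
  malignant    = as.integer(brca$y == "M"),
  perimeter_se = brca$x[, "perimeter_se"]
)

# Logistic regression of malignancy on perimeter_se
fit  <- glm(malignant ~ perimeter_se, data = dat, family = binomial)
beta <- unname(coef(fit)["perimeter_se"])

# Start at a probability of 10%, shift the log-odds by 0.8 * beta,
# then map back to the probability scale.
p0 <- 0.10
p1 <- plogis(qlogis(p0) + 0.8 * beta)

# Increase in probability (absolute difference)
increase <- p1 - p0
increase
```

Working on the log-odds scale avoids having to pick a concrete starting value of perimeter_se: the intercept cancels out once the starting probability is fixed at 10%.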
Problem 9 (2 credits)
Question Nr. 4YF2RH7TD00JE84ZN0
Consider the features smoothness_mean and radius_mean from the brca dataset. Provide R code that plots a ROC curve of both features as predictors of malignancy (variable brca$y == "M"), and indicate the feature that has the highest true positive rate when the false positive rate is 0.4.
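A minimal sketch for the ROC question follows, under stated assumptions: higher feature values are taken to indicate malignancy, and the ROC points are computed by hand in base R rather than with a dedicated ROC package. The helpers roc_points and tpr_at are hypothetical names introduced here, not part of dslabs or ggplot2.

```r
library(dslabs)
library(ggplot2)
data(brca)

# ROC points for one score: predict "malignant" when score >= threshold,
# sweeping the threshold over all observed values (hypothetical helper).
roc_points <- function(score, is_pos) {
  thr <- sort(unique(score), decreasing = TRUE)
  data.frame(
    fpr = sapply(thr, function(t) mean(score[!is_pos] >= t)),
    tpr = sapply(thr, function(t) mean(score[is_pos] >= t))
  )
}

# Highest true positive rate reachable without exceeding the target FPR
tpr_at <- function(roc, fpr_target) max(c(0, roc$tpr[roc$fpr <= fpr_target]))

is_malignant <- brca$y == "M"
features <- c("smoothness_mean", "radius_mean")

# One block of ROC points per feature, stacked into a single data frame
roc <- do.call(rbind, lapply(features, function(f) {
  data.frame(feature = f, roc_points(brca$x[, f], is_malignant))
}))

p <- ggplot(roc, aes(x = fpr, y = tpr, color = feature)) +
  geom_line() +
  geom_abline(linetype = "dashed") +                 # random-classifier diagonal
  geom_vline(xintercept = 0.4, linetype = "dotted") + # FPR of interest
  labs(x = "False positive rate", y = "True positive rate", color = "Feature")
print(p)

# True positive rate of each feature at a false positive rate of 0.4
tpr_at_04 <- sapply(features, function(f) tpr_at(roc[roc$feature == f, ], 0.4))
tpr_at_04
names(which.max(tpr_at_04))
```

The last line names the feature with the higher true positive rate at FPR 0.4; reading the two curves at the dotted vertical line in the plot should give the same answer.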

Problem 10 (2 credits)
Question Nr. 4IL88XL37CT72GN68OW7
library(dslabs)
library(data.table)
Consider the variable ‘concavity_worst’ of the ‘brca’ dataset. A researcher states that no other variable from the matrix ‘brca$x’ associates with the variable ‘concavity_worst’ according to Spearman’s correlation. Do you reject this hypothesis using the significance level of 1%? Provide code and justify your answer. Do not mind warnings, if any, about exact p-values with ties.
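The question above can be approached as a multiple testing problem: one Spearman test per other column of brca$x, followed by a correction. The sketch below is one possible approach, not the official solution. The Bonferroni correction is an assumption made here (the question does not name a correction method), and the names target, others, p_values and p_adj are illustrative.

```r
library(dslabs)
data(brca)

target <- "concavity_worst"
others <- setdiff(colnames(brca$x), target)

# Spearman test of concavity_worst against every other feature.
# Warnings about exact p-values with ties are silenced, as the question allows.
p_values <- sapply(others, function(v) {
  suppressWarnings(
    cor.test(brca$x[, target], brca$x[, v], method = "spearman")$p.value
  )
})

# Assumption: control the family-wise error rate with Bonferroni,
# since the hypothesis is "no variable at all associates".
p_adj <- p.adjust(p_values, method = "bonferroni")

# The global hypothesis is rejected at the 1% level if any adjusted p-value is below 0.01
any(p_adj < 0.01)
sort(p_adj)[1:5]
```

Testing each variable at the raw 1% level would inflate the chance of a false rejection across the whole family of tests, which is why a correction is applied before comparing with 0.01.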
Problem 11 (1 credit)
Question Nr. 0PJ29LOW8LMHZ4060
Assume 5 elements v, w, x, y and z. A first clustering gave the clusters {v} and {w,x,y,z}. One then runs k-means, which yields two clusters: {v,w,x} and {y,z}. Applying hierarchical clustering also yields two clusters: {v,w} and {x,y,z}. Which of the k-means clustering and the hierarchical clustering yielded the clustering most similar to the first clustering? Support your answer with a metric learned in the lecture.
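A possible way to compare the clusterings is sketched below. It assumes that the metric meant in the question is the Rand index, i.e. the fraction of element pairs on which two clusterings agree (both put the pair in the same cluster, or both separate it). The function rand_index and the cluster label vectors are written here for illustration.

```r
# Rand index: share of element pairs treated the same way by both clusterings
rand_index <- function(a, b) {
  pairs  <- combn(length(a), 2)               # all unordered pairs of elements
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]    # pair together in clustering a?
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]    # pair together in clustering b?
  mean(same_a == same_b)
}

# Cluster labels for the elements v, w, x, y, z
first     <- c(v = 1, w = 2, x = 2, y = 2, z = 2)  # {v}, {w,x,y,z}
kmeans_cl <- c(v = 1, w = 1, x = 1, y = 2, z = 2)  # {v,w,x}, {y,z}
hclust_cl <- c(v = 1, w = 1, x = 2, y = 2, z = 2)  # {v,w}, {x,y,z}

rand_index(first, kmeans_cl)  # 0.4 (4 of 10 pairs agree)
rand_index(first, hclust_cl)  # 0.6 (6 of 10 pairs agree)
```

Under this metric the hierarchical clustering is the more similar of the two to the first clustering.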