Section A: Short Answer Questions
Questions 1-10 – Total 25 marks
*** Answer ALL questions in this section ***
(Answers should be fewer than five (5) sentences)
Question 1.
Briefly define what data science is (one sentence), and name two aspects of data science. [2 marks]
Question 2.
Explain briefly the difference between ordinal and interval data. [2 marks]
Question 3.
Briefly explain what “data sampling” means in data pre-processing. Why is it useful? [2 marks]
Question 4.
Mean and median are both typical ways to summarise the “average” of a data set. List two ways to explore the “spread” of data. [2 marks]
Question 5.
Briefly explain the overfitting issue in classification. [2 marks]
Question 6.
We need to divide the total dataset into three sets of data for classification: ‘Training data’, ‘Validation data’, and ‘Test data’. Briefly explain what purpose each dataset is
Question 7.
What are the roles of k in k-NN (k-nearest neighbours) classification and in k-means clustering? [2 marks]
Question 8.
Briefly compare and contrast ‘lazy learning’ and ‘eager learning’. Give an example method for each type. [4 marks]
Question 9.
Define the terms ‘support’ and ‘confidence’ as they relate to association rule mining. Are ‘support’ and ‘confidence’ for a rule X → Y the same as for the rule Y → X? Why or why not? [4 marks]
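The two measures in Question 9 can be sketched in a few lines of Python. This is a minimal illustration only: the transactions and the helper names `support` and `confidence` are hypothetical and are not part of the examination data.

```python
# Hypothetical market-basket transactions (illustrative only).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread"},
    {"bread", "milk", "eggs"},
    {"milk"},
    {"bread"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y, transactions):
    """Fraction of all transactions that contain both X and Y."""
    return count(x | y, transactions) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    return count(x | y, transactions) / count(x, transactions)

bread, milk = {"bread"}, {"milk"}
print(support(bread, milk, TRANSACTIONS), support(milk, bread, TRANSACTIONS))
print(confidence(bread, milk, TRANSACTIONS), confidence(milk, bread, TRANSACTIONS))
```

Support depends only on the union of X and Y, while confidence divides by the count of the rule's left-hand side, so swapping the two sides can change it.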
Question 10.
Briefly describe the “bootstrap” data partition method in classification. [2 marks]
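As a rough sketch of the idea behind Question 10, the snippet below draws a bootstrap training sample by sampling record indices with replacement and keeps the never-drawn records as a test set. The function name `bootstrap_split` and its `seed` parameter are assumptions made for illustration.

```python
import random

def bootstrap_split(n_records, seed=0):
    """Sample n_records indices with replacement for training.

    Records that are never drawn (the "out-of-bag" records) form the test set.
    """
    rng = random.Random(seed)
    train_idx = [rng.randrange(n_records) for _ in range(n_records)]
    drawn = set(train_idx)
    test_idx = [i for i in range(n_records) if i not in drawn]
    return train_idx, test_idx

train_idx, test_idx = bootstrap_split(100, seed=1)
print(len(train_idx), len(set(train_idx)), len(test_idx))
```

For large data sets, roughly 63.2% of the distinct records are expected to appear in the training sample, with the remainder left for testing.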
SECTION B BEGINS NEXT PAGE
Section B: Application Questions
Questions 13-15 – Total 30 marks
*** Answer ALL questions in this section ***
Question 13.
The performance of a classifier is given in the confusion matrix below. The training dataset consisted of data for 220 patients, collected in a medical research study to identify whether a patient has a rare disease. [8 marks]
90 40 20 70
(a) Calculate the following statistics for the above confusion matrix:
1. Accuracy of the classification
2. Overall, how often was it wrong?
3. When it predicts yes, how often is it correct?
4. When it’s actually no, how often does it predict yes?
5. When it’s actually no, how often does it predict no?
6. When it’s actually yes, how often does it predict yes?
(b) Which of the values you calculated above are the sensitivity and specificity?
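The six quantities in part (a) all follow from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts rather than the matrix above, and the argument names `tp`, `fn`, `fp`, `tn` (true positive, false negative, false positive, true negative) are assumptions about how the cells are labelled.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard rates derived from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,         # overall, how often correct
        "error_rate": (fp + fn) / total,       # overall, how often wrong
        "precision": tp / (tp + fp),           # predicted yes, actually yes
        "false_positive_rate": fp / (fp + tn), # actually no, predicted yes
        "specificity": tn / (fp + tn),         # actually no, predicted no
        "sensitivity": tp / (tp + fn),         # actually yes, predicted yes
    }

# Hypothetical counts, chosen only to illustrate the formulas.
metrics = confusion_metrics(tp=50, fn=10, fp=5, tn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```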
Question 14.
Draw the dendrogram based on the distance matrix given below by applying the agglomerative hierarchical clustering algorithm with the single-link technique. The distances in the table below were calculated using the Manhattan distance function. Show all steps up to the final stage. [8 marks]
Model Prediction
Distance A B C D E F
A B C D E F
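The merge order that a dendrogram records can be sketched as follows. This is a naive illustration of single-link agglomerative clustering, not an efficient implementation; the four points P, Q, R, S and their pairwise distances are hypothetical and unrelated to the matrix above.

```python
from itertools import combinations

def single_link(labels, dist):
    """Return the merge sequence of single-link agglomerative clustering.

    `dist` maps frozenset({a, b}) to the distance between points a and b.
    """
    clusters = [frozenset([x]) for x in labels]
    merges = []

    def link(pair):
        # Single link: cluster distance is the smallest point-to-point distance.
        c1, c2 = pair
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=link)
        merges.append((sorted(c1), sorted(c2), link((c1, c2))))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

# Hypothetical pairwise distances (illustrative only).
DIST = {
    frozenset("PQ"): 1, frozenset("PR"): 4, frozenset("PS"): 6,
    frozenset("QR"): 3, frozenset("QS"): 5, frozenset("RS"): 2,
}
for left, right, height in single_link("PQRS", DIST):
    print(left, "+", right, "at distance", height)
```

Each printed line corresponds to one join in the dendrogram, drawn at the height given by the merge distance.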
Question 15.
The table shows the golf-playing data set, consisting of 14 instances. Each row has four attributes, namely Outlook, Temp, Humidity and Windy, and a class attribute Play. These attributes are used to predict whether people will play golf. Use Naïve Bayesian classification to predict the class of the last data point. Perform any necessary adjustments to the data in your evaluation. [14 marks]
outlook  temp  humidity  windy  play
-------------------------------------
cloudy   cool  high      false  yes
rain     hot   high      false  no
rain     mild  high      false  yes
sunny    cool  normal    false  no
rain     hot   normal    true   yes
cloudy   mild  normal    false  no
cloudy   cool  high      true   no
rain     mild  normal    false  no
cloudy   cool  high      true   yes
rain     hot   normal    true   yes
cloudy   cool  high      true   yes
rain     cool  high      false  yes
cloudy   cool  normal    false  yes
rain     mild  high      true   yes

New instance to classify:
sunny    mild  high      false  ??
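One possible way to check a hand calculation for this question is sketched below. It assumes that the attribute triples and the windy/play pairs listed above pair up row by row in the order given, and it takes add-one (Laplace) smoothing with `alpha=1` as the "necessary adjustment"; both are assumptions, and the function names are illustrative.

```python
from collections import Counter

# Rows as (outlook, temp, humidity, windy, play); the row pairing is an assumption.
ROWS = [
    ("cloudy", "cool", "high", "false", "yes"),
    ("rain", "hot", "high", "false", "no"),
    ("rain", "mild", "high", "false", "yes"),
    ("sunny", "cool", "normal", "false", "no"),
    ("rain", "hot", "normal", "true", "yes"),
    ("cloudy", "mild", "normal", "false", "no"),
    ("cloudy", "cool", "high", "true", "no"),
    ("rain", "mild", "normal", "false", "no"),
    ("cloudy", "cool", "high", "true", "yes"),
    ("rain", "hot", "normal", "true", "yes"),
    ("cloudy", "cool", "high", "true", "yes"),
    ("rain", "cool", "high", "false", "yes"),
    ("cloudy", "cool", "normal", "false", "yes"),
    ("rain", "mild", "high", "true", "yes"),
]

def train(rows):
    """Count classes, (attribute, value, class) triples, and attribute domains."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = Counter()
    domains = [set() for _ in rows[0][:-1]]
    for *features, cls in rows:
        for i, value in enumerate(features):
            value_counts[(i, value, cls)] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains

def scores(instance, model, alpha=1):
    """Unnormalised P(class) * product of smoothed P(value | class)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    result = {}
    for cls, n in class_counts.items():
        p = n / total
        for i, value in enumerate(instance):
            p *= (value_counts[(i, value, cls)] + alpha) / (n + alpha * len(domains[i]))
        result[cls] = p
    return result

s = scores(("sunny", "mild", "high", "false"), train(ROWS))
print(s, "->", max(s, key=s.get))
```

Without smoothing, any attribute value that never occurs with a class would force that class's score to zero, which is why an adjustment of this kind is usually applied.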
END OF EXAMINATION