
BDML – Final project – WiSe 23/24
December 15, 2023
1 Information about the data set
• The data set is available twice in Ilias: once compressed as .7z and once as .zip. It is only necessary to download one of them. Anyone who has problems unpacking the former can use the .zip file instead; apart from the compression, the folders do not differ. Extract the data into its own folder in the same directory in which the Python scripts will also be executed.
• The data set contains acceleration measurements on ball bearings.
• For each measurement there is a directory Bearingx_y containing several CSV files acc_#.csv as well as a file Bearingx_y_health_state.csv.
• The CSV files acc_#.csv contain the horizontal and vertical acceleration as well as the measurement time, recorded according to the following scheme:

  Hour | Minute | Second | Microsecond (µs) | Horiz. acc. | Vert. acc.
  ...  | ...    | ...    | ...              | ...         | ...
• The CSV file Bearingx_y_health_state.csv specifies the wear state for each file acc_#.csv. We will use this as the label. The possible values are:
0: Bearing new
1: Bearing worn
2: Bearing severely worn
All scripts and notebooks must be commented so that the work steps can be followed.
A total of 100 points can be achieved.

2.1 Task 1: Data viewing and pre-processing (50 points)
In this task you should familiarize yourself with the data set. Save the script used under the following name: “datainspection_firstname_lastname.ipynb”
2.1.1 Data review (13 points)
• Read in 3 different series of measurements using Pandas. According to the table Bearingx_y_health_state.csv, the first time series should have state 0, the second time series state 1 and the third time series state 2. Add the column names described above to the data frames.
• Does the data contain missing values? If yes, how many?
• Fill in the missing data. Choose a method suitable for time series data and explain why it was chosen over other alternatives.
• Plot the horizontal and vertical accelerations over time for the 3 data series. Use axis labels, a legend, and a
title for each.
• What do you notice?
Tip: Write a function that combines the individual time columns into a datetime object so that the time data can be used sensibly. Also pay attention to the separators in the CSV files. A possible starting point is sketched below.
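The following is a minimal sketch of such a reader. It assumes the files have no header row and that the column names, folder layout (Bearing1_1/acc_00001.csv) and separator are as described above; adjust these to the actual data set.

```python
import pandas as pd

# Assumed column names for the acc_#.csv files (no header row in the files).
COLUMNS = ["hour", "minute", "second", "microsecond", "acc_horizontal", "acc_vertical"]

def read_acc_csv(path, sep=","):
    """Read one acc_#.csv file and add a combined datetime column."""
    # sep=None together with engine="python" would let pandas guess the separator,
    # which helps because different bearings use different separators.
    df = pd.read_csv(path, sep=sep, header=None, names=COLUMNS)
    # Combine the separate time columns into a single timestamp. The date itself
    # is not recorded, so an arbitrary reference day is used.
    df["time"] = pd.to_datetime({
        "year": 2000, "month": 1, "day": 1,
        "hour": df["hour"], "minute": df["minute"],
        "second": df["second"], "microsecond": df["microsecond"],
    })
    return df

# Example usage (hypothetical path):
df = read_acc_csv("Bearing1_1/acc_00001.csv")
print(df.isna().sum())                 # number of missing values per column
df = df.interpolate(method="linear")   # one possible imputation for time series
```

Linear interpolation is only one option; whichever method you choose, justify it as asked above.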
2.1.2 Visualization of the data (14 points)
In the following, the vibration behavior of the ball bearings is to be examined over the period of an entire measurement:
• Read all acceleration data for a bearing.
• Then create a data frame containing the means and standard deviations of the horizontal and vertical accelerations as well as the corresponding label from Bearingx_y_health_state.csv. Each row of the data frame should therefore contain the features and the label of one CSV file acc_#.csv. Fill in missing values again.
• Then plot the means and standard deviations of the horizontal and vertical accelerations and the corresponding label over time. Again add axis labels, a legend and a title. What tendencies can be recognized? (A sketch of the aggregation follows below.)
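A minimal sketch of this aggregation, reusing the read_acc_csv helper from above and assuming the hypothetical directory name Bearing1_1 and a label file with one row per acc file and the label in the last column:

```python
import glob
import pandas as pd
import matplotlib.pyplot as plt

bearing_dir = "Bearing1_1"                                   # hypothetical directory
labels = pd.read_csv(f"{bearing_dir}/Bearing1_1_health_state.csv")

rows = []
for i, path in enumerate(sorted(glob.glob(f"{bearing_dir}/acc_*.csv"))):
    acc = read_acc_csv(path).interpolate()                   # fill missing values
    rows.append({
        "file_index": i,
        "hor_mean": acc["acc_horizontal"].mean(),
        "hor_std": acc["acc_horizontal"].std(),
        "ver_mean": acc["acc_vertical"].mean(),
        "ver_std": acc["acc_vertical"].std(),
    })

stats = pd.DataFrame(rows)
stats["label"] = labels.iloc[:, -1].values                   # assumes one label row per acc file

# Plot the aggregated statistics and the label over the measurement index.
stats.set_index("file_index").plot(subplots=True, figsize=(8, 10),
                                   title="Mean/std of accelerations over one measurement")
plt.xlabel("Measurement file index")
plt.tight_layout()
plt.show()
```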

2.1.3 Creating a data set (23 points)
• Now create a data set from all measurements that can be used for training a classification model. For both acceleration time series, the data set should contain several statistical features: maximum, minimum, standard deviation, median, lower and upper quartile. Fill in missing values again before calculating the features. Each row now corresponds to one original series of measurements. Name the new columns “a_ver_[feature name]” and “a_hor_[feature name]”.
Tip: The CSV files from different bearings sometimes have different separators.
• In the data frame, add the state of the bearing from the table Bearingx_y_health_state.csv as a new column for each row.
• Using Seaborn, create a pairplot of the calculated features and color it by the three different bearing states. For reasons of clarity, do not include the calculated quantiles in the pairplot.
• Which features are worst suited for classification? Name two features and explain your choice.
• Which features are best suited for classification? Choose two features and explain your choice.
• One-hot encode the labels and remove the original labels from the data set.
• Save the data as a CSV file for later training. Name the file “dataset”. This data set should not be sent in at the end! (A sketch of the whole feature extraction follows below.)
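A possible end-to-end sketch, again reusing read_acc_csv and assuming the hypothetical directory and file naming from above (Bearing*_* folders, one label row per acc_#.csv with the state in the last column):

```python
import glob
import os
import pandas as pd
import seaborn as sns

def extract_features(series, prefix):
    """Statistical features for one acceleration channel."""
    return {
        f"{prefix}_max": series.max(),
        f"{prefix}_min": series.min(),
        f"{prefix}_std": series.std(),
        f"{prefix}_median": series.median(),
        f"{prefix}_q25": series.quantile(0.25),
        f"{prefix}_q75": series.quantile(0.75),
    }

rows = []
for bearing_dir in sorted(glob.glob("Bearing*_*")):
    health = pd.read_csv(glob.glob(os.path.join(bearing_dir, "*health_state.csv"))[0])
    for i, path in enumerate(sorted(glob.glob(os.path.join(bearing_dir, "acc_*.csv")))):
        acc = read_acc_csv(path).interpolate()
        row = {**extract_features(acc["acc_horizontal"], "a_hor"),
               **extract_features(acc["acc_vertical"], "a_ver"),
               "state": health.iloc[i, -1]}        # assumes one label row per acc file
        rows.append(row)

dataset = pd.DataFrame(rows)

# Pairplot without the quantile columns, colored by bearing state.
plot_cols = [c for c in dataset.columns if not c.endswith(("_q25", "_q75"))]
sns.pairplot(dataset[plot_cols], hue="state")

# One-hot encode the label, drop the original column and save the result.
dataset = pd.concat([dataset.drop(columns="state"),
                     pd.get_dummies(dataset["state"], prefix="state")], axis=1)
dataset.to_csv("dataset.csv", index=False)
```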
2.2 Task 2: Training (50 points)
In Task 2, various machine learning methods are to be applied to the previously saved data set “dataset”. Save the script under the following name: “train_firstname_lastname.ipynb”
2.2.1 Decision tree (10 points)
• Divide the data set into a training and a test data set. Use 75% of the data for the training data set.
• Train a decision tree on the two features that were identified in exercise 2.1.3 as the best features for the classification.
• Calculate the accuracy on the test data.
• Plot the decision boundaries of the trained classifier and explain what is shown in the plot. (A sketch follows below.)
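A minimal sketch, assuming the column naming from the data set sketch above and using “a_hor_std”/“a_ver_std” as stand-ins for whichever two features you judged best in 2.1.3 (DecisionBoundaryDisplay requires scikit-learn ≥ 1.1; a meshgrid plus contourf works as well):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("dataset.csv")
X = data[["a_hor_std", "a_ver_std"]]                        # placeholder feature choice
y = data[[c for c in data.columns if c.startswith("state_")]].to_numpy().argmax(axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))

# Decision regions of the tree with the test points on top.
disp = DecisionBoundaryDisplay.from_estimator(tree, X, response_method="predict", alpha=0.4)
disp.ax_.scatter(X_test.iloc[:, 0], X_test.iloc[:, 1], c=y_test, edgecolor="k")
disp.ax_.set_xlabel("a_hor_std")
disp.ax_.set_ylabel("a_ver_std")
plt.title("Decision tree decision boundaries")
plt.show()
```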
2.2.2 Random Forest Classifier (11 points)
A random forest classifier consists of a set of decision trees that are trained on random subsets of the features of the data set. Each tree within a random forest classifier makes a prediction, and the final prediction is determined by majority voting (classification) or by averaging the individual predictions (regression).
• What advantages and disadvantages could a random forest classifier have compared to a single decision tree? Name at least 2 advantages and 2 disadvantages and justify each point.
• Train a random forest classifier with sklearn. This time use all of the previously calculated features and again divide the data into a training and a test data set.
• Calculate the accuracy over all three classes as well as the precision and the recall for each of the three classes on the test data. (See the sketch below.)
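A minimal sketch continuing from the decision tree example above (same data, y and naming assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# All previously calculated features, not just the two best ones.
feature_cols = [c for c in data.columns if c.startswith(("a_hor_", "a_ver_"))]
X = data[feature_cols]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = forest.predict(X_test)

print("Overall accuracy:", accuracy_score(y_test, y_pred))
# classification_report lists precision and recall per class.
print(classification_report(y_test, y_pred, target_names=["new", "worn", "severely worn"]))
```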
2.2.3 Hyperparameter optimization (18 points)
• Scale the training and test data using a StandardScaler.
• Explain why scaling the data is necessary when training many machine learning models. Give at least 2 reasons. Why was scaling the data not necessary when using the random forest and the decision tree?
• Now carry out a hyperparameter optimization. For a support vector machine, at least 6 different combinations of values for the hyperparameter C and the type of kernel used should be tested. Tip: sklearn provides its own class “GridSearchCV” for hyperparameter optimization.
• Output the hyperparameters and the accuracy for the hyperparameter combination that provided the best accuracy.
• Explain the hyperparameter C. Why is this necessary?
• Now test the same procedure with a MinMaxScaler instead of a StandardScaler. Measured by the accuracy on the test data set, which method of scaling the data worked better here? (Both runs are combined in the sketch below.)
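A minimal sketch under the same data assumptions as above; wrapping the scaler and the SVC in a Pipeline is one possible design that keeps the scaler fitted on the training folds only:

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)

param_grid = {
    "svc__C": [0.1, 1.0, 10.0],          # 3 values of C ...
    "svc__kernel": ["linear", "rbf"],    # ... x 2 kernels = 6 combinations
}

for scaler in (StandardScaler(), MinMaxScaler()):
    pipe = Pipeline([("scaler", scaler), ("svc", SVC())])
    search = GridSearchCV(pipe, param_grid, cv=5)
    search.fit(X_train, y_train)
    print(type(scaler).__name__,
          "| best params:", search.best_params_,
          "| test accuracy:", round(search.score(X_test, y_test), 3))
```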
2.2.4 Neural Networks (11 points)
• Train a neural network using Keras (not sklearn!). Again use a training size of 75% of the data and a StandardScaler. Achieve an accuracy of at least 80% on the test data set and use at least one measure for regularization in the training.
• Explain the meaning of regularization and the regularization method you used. (A minimal Keras sketch follows below.)
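A minimal sketch, assuming the data and label setup from the earlier examples (X with all features, y as integer labels 0/1/2) and using dropout as one possible regularization measure; L2 weight penalties or early stopping would also qualify:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
from tensorflow.keras import layers

X_train, X_test, y_train, y_test = train_test_split(X.to_numpy(), y,
                                                    train_size=0.75, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),                      # dropout as the regularization measure
    layers.Dense(32, activation="relu"),
    layers.Dense(3, activation="softmax"),    # one output per bearing state
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",   # y holds integer labels 0/1/2
              metrics=["accuracy"])

model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1, verbose=0)
_, test_acc = model.evaluate(X_test, y_test, verbose=0)
print("Test accuracy:", round(test_acc, 3))
```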
If the preprocessing of all data is too computationally expensive for your computer, all tasks can also be carried out with fewer measurements. In general, it is advisable to first test all tasks and pre-processing steps on a small part of the entire data set to verify that everything works correctly before the entire data set is used.
Relative paths should be used to read the data instead of absolute paths so that we can execute the code after submission.
The two notebooks/scripts must be stored together in a folder named after the student (name: “bdml_delivery_firstname_lastname”). The saved and generated data must be removed beforehand.
Before submitting, it is advisable to close and reopen the code editor and then run all scripts again. The scripts should be sent to us as a .zip file; we will then send a confirmation of receipt as quickly as possible.
If there are any problems with packages, Python installations or anything else that is unclear, please email us as early as possible.
The grading depends on the respective course of study. If the BDML lecture is included in your degree program as a subject-related SQ course, the submission will be graded. This is the case, for example, for the mechanical engineering degree program. Otherwise, depending on the points achieved, only a distinction between passed and not passed is made. This applies, for example, to the Autonomous Systems degree program.
Good luck!