Large Scale Computing on the Cloud Homework 4: Spam Filtering Using Spark MLlib
Learning Goal: Use Spark MLlib to implement spam filtering, following the example in the lecture 4 notes, pages 50-53.
You need to complete the following steps:
1. Collect 20 spam text samples, 20 non-spam (ham) text samples, and 2 test samples. One potential source is your own email.
2. Following the lecture 4 notes, pages 50-53, implement spam filtering using PySpark.
3. Train the model on your own samples and test the trained model on your own test samples.
You can use the Logistic Regression model or another classification model from PySpark. For reference, see: https://spark.apache.org/docs/latest/ml-classification-regression.html
4. Answer the following questions within one page:
a. Does your trained model work on your test samples?
b. Briefly explain why it works/does not work.
Submission:
After completing the implementation, download the .ipynb file with all results included. Zip it together with your own spam/ham text files and a Word/PDF/text document containing your answers to Step 4. Submit the compressed file on Canvas.
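Keeping the spam/ham text files in a predictable layout makes it easier for the notebook to read them and for the zipped submission to be re-run. The sketch below is one possible approach, not a required format: it assumes a hypothetical layout with one sample per .txt file under <root>/spam and <root>/ham, and the helper name load_labeled_samples is made up for illustration. It returns (text, label) pairs with label 1.0 for spam and 0.0 for ham.

```python
from pathlib import Path


def load_labeled_samples(root):
    """Read <root>/spam/*.txt and <root>/ham/*.txt into (text, label) pairs."""
    samples = []
    for folder, label in (("spam", 1.0), ("ham", 0.0)):
        # Sort file names so the order is the same on every run.
        for path in sorted(Path(root, folder).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            # Collapse newlines and repeated spaces so each sample is one line.
            samples.append((" ".join(text.split()), label))
    return samples
```

The resulting list can be passed to spark.createDataFrame(samples, ["text", "label"]) to build a training DataFrame.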
Reference: https://github.com/databricks/learning-spark/blob/master/src/python/MLlib.py