Lecture 4

Reflections on last week
MapReduce: parallelism framework, divide and conquer
• Map: the divide step
• Reduce: the aggregate step
• Reason: we need multiple machines to speed up processing
Frequent itemset mining problem
• Association patterns
• Two products, two movies, two medicines, etc.
Achieved frequent singleton itemset mining in lab 3

AWS Steps vs On-Premise Steps
Do not use blank spaces in any file or folder names on AWS
• Spaces cause trouble in programming

Homework 2 Review
EMR cluster on AWS:
• Primary, Core and Task
• Refer to the Lab 2 instructions: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html
Changing core and task nodes was demoed in lab 3
• Changing the primary node is not required
Default is 1 node of each type
• A 3-node/machine cluster

Release and Due
Homework 3 hint:
• Reduce function is the same as in lab 3
• Differs in the map function: output pairs of items instead
• Reference: https://github.com/gautamdasika/Aprioiri-frequent-3-itemsets-with-Hadoop-MapReduce
• But output "(item1,item2) \t count" instead of "item1 \t item2 \t count"
• If you use f-strings, invoke the python3 interpreter instead of python
• Deadline extended until Friday 11:59 pm
Homework 4 (Spark) release after lab 4

Additional Python Learning Resource
Coursera-JHU Python course:
https://www.coursera.org/learn/python-genomics
• Free ChatGPT
• Free or $20 for GPT-4

Midterm Feedback

Today’s Agenda
Text mining
• Frequency analysis
• Co-occurrence analysis
• Topic modeling
Spark: Improved MapReduce
Lab 4: Simple text mining in Python Spark

User Generated Content (UGC)
Online users are creating and sharing images and words like never before
“any form of content such as blogs, wikis, discussion forums, posts, chats, tweets, podcasts, digital images, video, audio files, and other forms of media that were created by users of an online system or service”

UGC: Generate Trust
Consumers are more likely to trust recommendations by people they know over other forms of advertising (Nielsen)
Listing of a book on the New York Times bestseller list causes a modest increase in sales (Sorensen, 2007)
Willingness to pay of consumers is about $4.50 greater for a top ranked app than for the same unranked app (Carare, 2012)
https://www.statista.com/statistics/222698/consumer-trust-in-different-types-of-advertising/

UGC: Big Data for Mining
Limitless pool of content
Businesses that use customer content on their marketing channels see higher conversion, click-through rates to product pages, and average order values
Fake reviews detector
• https://streetfightmag.com/2022/10/13/5-tools-for-fake-review-detection/
• https://www.fakespot.com/

Text Mining
Text analytics, natural language processing (NLP)
Mine knowledge/information from huge amounts of (unstructured) text data
Many NLP systems are trained on very large collections of text (also called corpora) such as the Wikipedia corpus

Text Mining: Frequency Analysis
I would like to know what your biggest concern is…

Bag of Words Model
A text (such as a sentence or a document) is represented as the bag of its words
Disregard grammar and even word order, but keep multiplicity
Commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier
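A minimal sketch (not from the original slides) of a bag of words built in plain Python with collections.Counter:
# a bag of words is a word-count multiset; grammar and order are discarded
from collections import Counter
text = "the quick brown fox jumps over the lazy fox"
bag = Counter(text.lower().split())  # tokenize on whitespace, then count
print(bag["fox"])                    # 2 (multiplicity is kept)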

Punctuation removed, converted to all lowercase
Count occurrence of each word
• What model we can use?
• Wordcount in MapReduce (lab 2 and lecture 3)

Tokenization
Break a stream of text into meaningful units
Tokens: words, phrases, symbols
You can split text using space and/or special characters
• Such as apostrophe ('), hyphen (-), quotation marks (‘’ “”), comma (,), period (.), etc.
Depends on language, corpus, or even context
• Input: It’s easy to use Hadoop eco-system in BU.330.740.
• Output(1): ‘It’s’ ‘easy’ ‘to’ ‘use’ ‘Hadoop’ ‘eco-system’ ‘in’ ‘BU.330.740’
• Output(2): ‘It’ ‘’’ ‘s’ ‘easy’ ‘to’ ‘use’ ‘Hadoop’ ‘eco’ ‘system’ ‘in’ ‘BU’ ‘330’ ‘740’
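A minimal sketch of the two splitting strategies above; the second is a close variant of Output (2) that drops the standalone punctuation tokens (the exact rules are a design choice):
import re
text = "It's easy to use Hadoop eco-system in BU.330.740."
# Output (1): split on whitespace only (trailing period stripped first)
tokens1 = text.rstrip(".").split(" ")
# Output (2)-style: split on any run of non-alphanumeric characters
tokens2 = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]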

Term Frequency
Term: token in text
• Words, phrases, etc.
Term frequency: how often a term occurs in a document
• A term is more important if it occurs more frequently in a document
So what to do?
• Count the occurrence!
tf(t, d) = count of occurrences of term t in document d
Any issue with this approach?

TF Normalization
Documents have different length
• Doc 1 has 1000 words, and ‘Hadoop’ appears 5 times
• Doc 2 has 10 words, and ‘Hadoop’ appears 2 times
How to solve this?
• Normalization!
tf(t, d) = (frequency count of term t in doc d) / (total words in doc d)
There are other ways to do normalization, such as maximum TF normalization

Stop words
‘the’ and ‘to’ are high-frequency words in this document
But are they important?
Stop words: commonly used words in language
• If it appears frequently in all documents, then it is not specific to any particular doc
Filter out before or after processing of natural language data
Can be achieved in Python using NLP packages, as sketched below
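A minimal sketch with NLTK (one such package; the slides don't prescribe which one to use):
import nltk
nltk.download("stopwords")  # one-time download of the stop word lists
from nltk.corpus import stopwords
stops = set(stopwords.words("english"))
words = ["the", "class", "uses", "hadoop", "to", "count", "words"]
print([w for w in words if w not in stops])  # ['class', 'uses', 'hadoop', 'count', 'words']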

Inverse Document Frequency
Document frequency: a term is more discriminative if it occurs in fewer documents
Inverse document frequency: assign higher weights to rare terms
idf(t) = log(total number of documents / number of documents with term t)
Combining tf and idf: tf-idf(t, d) = tf(t, d) × idf(t)
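A minimal sketch of these formulas in plain Python, on illustrative toy documents:
import math
docs = [["hadoop", "class", "uses", "hadoop"],  # doc 0
        ["spark", "class"],                     # doc 1
        ["spark", "streaming"]]                 # doc 2
def tf(t, d):
    return d.count(t) / len(d)                  # normalized term frequency
def idf(t, docs):
    n = sum(1 for d in docs if t in d)          # number of docs containing t
    return math.log(len(docs) / n)
print(tf("hadoop", docs[0]) * idf("hadoop", docs))  # 0.5 * ln(3) ≈ 0.55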

Example (Cont.)
Punctuation removed, converted to all lowercase
Wordcount applied (tokenization)
Stop words excluded
What else?

Let me address your concerns
Temporary relief from AWS today
• More in weeks 5 & 6, but you can simply clone the cluster from lab 2
More coding today on PySpark
More explanations and examples will be provided
Focus: data mining/big data analytics problems widely used in business
• Distributed computing and cloud (AWS and Google Colab) are the tools
Readings and videos: http://www.mmds.org
• Chapters 1-2, 5-6, 9

Use Case: Word Cloud
Visualize text data so the most prominent terms stand out at a glance
• A bigger term means greater weight
Compute the frequency of each item and show the top n words/categories (a sketch follows)
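A minimal sketch assuming the third-party wordcloud and matplotlib packages (pip install wordcloud); the file name is hypothetical:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
text = open("ebook.txt").read()              # any plain-text corpus
wc = WordCloud(max_words=50).generate(text)  # keep only the top 50 terms
plt.imshow(wc, interpolation="bilinear")     # bigger term = greater weight
plt.axis("off")
plt.show()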

Use Case: Spam Filtering
Imagine there are two literal bags full of words
• “Spam” bag contains spam-related words
• “Ham” bag will contain more words found in legitimate e-mail
To classify an e-mail message
• Assume that the message is a pile of words that has been poured out randomly from one of the two bags
• Use Bayesian probability to determine which bag it is more likely to be in
Dada, Emmanuel Gbenga, et al. “Machine learning for email spam filtering: review, approaches and open research problems.” Heliyon 5.6 (2019): e01802.

Related Concepts
‘recording’ and ‘videos’ are related
Humans may know this from experience
Is there a way for a machine to learn it?

Text Mining: Co-occurrence Analysis
How to automatically figure out Elon Musk and Tesla are related?
Count the number of articles in which different named entities co-occur (a sketch follows)
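A minimal sketch of the counting step, assuming the named entities per article have already been extracted (the input below is hypothetical):
from itertools import combinations
from collections import Counter
articles = [{"Elon Musk", "Tesla", "SpaceX"},
            {"Elon Musk", "Tesla"},
            {"Tesla", "Panasonic"}]
pair_counts = Counter()
for entities in articles:
    for pair in combinations(sorted(entities), 2):  # every unordered pair
        pair_counts[pair] += 1
print(pair_counts[("Elon Musk", "Tesla")])          # 2 articles co-mention them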

Text Mining: Topic Modeling
Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently

Latent Dirichlet Allocation (LDA) Model
Most common topic model currently in use
Assumes that documents cover a small number of topics and that topics often use a small number of words
Other topic models are generally extensions on LDA
Example word/topic weights for a “Weather” topic:
• Cold 0.3, Hot 0.7, Apple 0, Pie 0
Example topic/document weights for one document:
• 0.1, 0.3, 0.5, 0.1 (across four topics)
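A minimal sketch using scikit-learn's LDA implementation (one option; the slides don't prescribe a library):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
docs = ["cold hot weather cold", "apple pie apple pie", "hot weather cold"]
X = CountVectorizer().fit_transform(docs)   # bag-of-words matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)           # topic/document weights, one row per doc
print(doc_topics[0])                        # e.g. mostly one topic
# lda.components_ holds the (unnormalized) word/topic weights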

The AI Tree
http://home.cse.ust.hk/~lzhang/topic/ai-tree.pdf

Text Mining Business Case
Mine your own business: Market-structure surveillance through text mining (Netzer et al., 2012, Marketing Science)
• Explore online user-generated content and “listen” to what customers write about their and their competitors’ products
• Convert the user-generated content to market structures and competitive landscape insights
• Compare with traditional sales and survey-based data

Discussion Time
What are the drawbacks of the Bag of Words model?

Spark: Improved MapReduce

Spark Introduction
Cluster computing platform designed to be fast and general purpose
• Speed: 10-20x faster than MapReduce for certain jobs
• Generality: combine SQL queries, ML, streaming…
Started in 2009 as a class project by Matei Zaharia
More widely supported than the original MapReduce engine, and easier to use

Spark Components
Spark Core
• Basic functionality: task scheduling, memory management, fault recovery, interaction with storage systems
• Resilient distributed datasets (RDDs): collections of items distributed across many compute nodes that can be manipulated in parallel
Spark SQL
• Package for structured data
Spark Streaming
• Process live streams of data, e.g., logfiles, queues of messages

Spark Components
MLlib
• Machine learning library
GraphX
• Graph mining
Cluster Managers
• Can run over YARN
• Also includes a simple Standalone Scheduler

Writing and Deploying Spark Applications
Will use Google CoLab today
• Easier than AWS
Can also work locally in Python IDLE or Anaconda
• Install via command: "pip install pyspark" or "conda install pyspark"
• https://docs.anaconda.com/anaconda-scale/spark/
• https://gist.github.com/dvainrub/b6178dc0e976e56abe9caa9b72f73d4a
• Note that Windows systems may encounter issues
Windows users: stick with Google CoLab

SparkContext
Every Spark program needs a SparkContext
Named sc by convention
Call sc.stop() when the program terminates
• DO NOT run any Spark steps after you call sc.stop() (a minimal sketch follows)
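A minimal sketch, assuming a local PySpark install (in Google CoLab the setup differs slightly):
from pyspark import SparkContext
sc = SparkContext(appName="lecture4")  # one SparkContext per program, named sc
# ... all Spark steps go here ...
sc.stop()                              # clean up; no Spark steps after this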

Resilient Distributed Dataset (RDD)
Resilient: if data in memory is lost, it can be recreated
Distributed: processed across the cluster
Dataset: initial data can come from a file or created programmatically
Fundamental unit of data in Spark
Most Spark programming consists of performing operations on RDDs

Two types of RDD Operations
Actions
• Return values
Transformations
• Define a new RDD based on the current one(s)
Lazy execution: data in RDDs is not processed until an action is performed

Creating RDDs
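The slide's code is not in this transcript; a typical sketch (file name hypothetical):
lines = sc.textFile("mydata.txt")    # RDD from a file: one element per line
nums = sc.parallelize([1, 2, 3, 4])  # RDD created programmatically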

CS Help, Email: tutorcs@163.com
Example: map and filter
lambda: Python in-line function definition in the format
"lambda arguments: return"
e.g. lambda line: line.split(" ")
is equivalent to
def split_word(line):
    return line.split(" ")
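The slide's code is not in this transcript; a typical map/filter sketch:
nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)           # transformation: [1, 4, 9, 16]
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: keep even values
print(evens.collect())                        # action: [4, 16]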

Chaining transformation
\: Python line continuation
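The slide's code is not in this transcript; a typical chaining sketch (file name hypothetical):
result = sc.textFile("mydata.txt") \
           .map(lambda line: line.lower()) \
           .filter(lambda line: "hadoop" in line)
print(result.count())  # the action triggers the whole chain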

Some Other RDD Operations
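The slide's table is not in this transcript; a few common operations as a sketch:
rdd = sc.parallelize([3, 1, 2, 3])
rdd.count()    # action: 4
rdd.first()    # action: 3
rdd.take(2)    # action: [3, 1]
rdd.collect()  # action: [3, 1, 2, 3]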

Example: flatMap and distinct
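The slide's code is not in this transcript; a typical sketch:
lines = sc.parallelize(["hadoop class", "hadoop lab"])
words = lines.flatMap(lambda line: line.split(" "))  # ['hadoop','class','hadoop','lab']
print(words.distinct().collect())                    # ['hadoop','class','lab'], order not guaranteed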

Example: multi-RDD transformations
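The slide's code is not in this transcript; a typical sketch:
a = sc.parallelize([1, 2, 3])
b = sc.parallelize([3, 4])
a.union(b).collect()         # [1, 2, 3, 3, 4] (duplicates kept)
a.intersection(b).collect()  # [3]
a.subtract(b).collect()      # [1, 2]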

Pair RDDs
RDDs consisting of key-value pairs (two-element tuples)
Special form of RDD
Keys and values can be any type
Commonly used functions to create pair RDDs
• map

Example: a simple pair RDD
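The slide's code is not in this transcript; a typical sketch using map:
lines = sc.parallelize(["alice 25", "bob 30"])
pairs = lines.map(lambda line: (line.split(" ")[0], line.split(" ")[1]))
print(pairs.collect())  # [('alice', '25'), ('bob', '30')]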

Other pair RDD Operations
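The slide's table is not in this transcript; a few common pair-RDD operations as a sketch:
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.reduceByKey(lambda x, y: x + y).collect()  # [('a', 4), ('b', 2)]
pairs.sortByKey().collect()                      # [('a', 1), ('a', 3), ('b', 2)]
pairs.keys().collect()                           # ['a', 'b', 'a']
pairs.groupByKey()                               # (key, iterable of values) per key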

Coding Examples

Example: Bag of Words Model
Pseudocode
Split line into words
Map each word to (word, 1)
Reduce for each word to summation of counts

Bag of Words (aka. wordcount) PySpark Code
#create RDD from a text file
text_file = sc.textFile("Google_CoLab_folder/ebook.txt")
#1st line: split line into words
#2nd line: map word to (word, 1)
#3rd line: reduce values to summation
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
#write output to a text file
#the output folder must not already exist!
counts.saveAsTextFile("Google_CoLab_folder/output")
sc: SparkContext (see the SparkContext slide)
flatMap: maps one line to multiple words (see Example: flatMap and distinct)
lambda: Python in-line function definition in the format
"lambda arguments: return"
e.g. lambda line: line.split(" ")
is equivalent to
def split_word(line):
    return line.split(" ")
\: Python line continuation

Example: Spam Filtering (seen earlier in Use Case: Spam Filtering)
Pseudocode
Training phase:
1. Get some spam and ham examples
2. Convert to words (features)
3. Use ML to identify spam-related words and ham-related words (classifier)
Testing phase:
1. Convert test case to words (features)
2. Use learnt classifier to predict case

Spam Filtering PySpark Code
#import mllib packages since we will use ML models
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.feature import HashingTF
#training: load spam and ham examples (two bags)
spam = sc.textFile("Google_CoLab_folder/spam.txt")
ham = sc.textFile("Google_CoLab_folder/ham.txt")
#initiate a feature vector
tf = HashingTF()
#each email is split into words, each word is mapped to one feature
#thus each email is mapped to a feature vector
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
hamFeatures = ham.map(lambda email: tf.transform(email.split(" ")))
HashingTF: transforms words into fixed-length feature vectors using a hashing function.
e.g. "Hadoop class uses Hadoop" can be converted to [(00,1),(01,2),(11,1)]
Here, "class" is hashed to 00, with count 1
"Hadoop" is hashed to 01, with count 2
"uses" is hashed to 11, with count 1
https://spark.apache.org/docs/latest/ml-features

Code Help
Spam Filtering PySpark Code (Cont.)
#for ML purpose (classifier), spam features are labeled as positive (1)
#non-spam features are labeled as negative (0)
positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = hamFeatures.map(lambda features: LabeledPoint(0, features))
#union spam and ham examples with labels now, for the classifier
training_data = positiveExamples.union(negativeExamples)
#cache data since logistic regression is an iterative algorithm
training_data.cache()
#use ML model logistic regression with SGD method
model = LogisticRegressionWithSGD.train(training_data)
SGD: stochastic gradient descent is an optimization method, details will be in deep learning course

Spam Filtering PySpark Code (Cont.)
#testing: first transform test cases to words (features)
#testMessage is assumed to hold the text of the e-mail to classify, e.g.:
testMessage = "You have won a free prize click now"
testExample = tf.transform(testMessage.split(" "))
#use classifier from training stage to predict the test case
print("Prediction for test example: %g" % model.predict(testExample))
#call stop function from SparkContext to clean up resources
sc.stop()
sc.stop(): called when the program terminates (see the SparkContext slide)

Lab 4: Simple Text Mining in Python Spark
Use the Bag of Words example from the lecture (the PySpark code above)
No AWS today; Google CoLab instead
Drawback with Google CoLab:
• You need to install the PySpark package in every new session

Sentiment analysis
Hive: distributed query language
Lab 5: Twitter sentiment analysis using Hive

Appendix: Python Packages for NLP
Natural Language Toolkit (NLTK)
TextBlob
spaCy
polyglot
scikit-learn
Pattern

Appendix: Other Techniques in NLP
Lemmatization
• Words in different tenses, forms…
Word Sense Disambiguation (WSD)
• Words have different meanings in context
Named Entity Recognition
• Treat phrases as single entities instead of separate words, such as “Johns Hopkins”
Part-of-speech tagging
• Distinguish nouns, verbs, adjectives…
Sentence recognition
• Figure out when sentences end, text reasoning
