Spam vs. Ham Sentiment Comparison

Jourdan Parham

Spam is taking over people’s email inboxes: reports from March 2018 classified 85% of all emails as spam. Spam does not only affect email inboxes; it also comes in the form of instant messages, forum and newsgroup posts, social network messages, and text messages. Having received suspicious and unwanted messages via text, email, and various social media platforms myself, I am interested in detecting the common trends among these messages.

I expect to find data showing that spam messages contain information that seems too good to be true, create a sense of urgency, use repetition, and have an overall positive sentiment. More specifically, I expect messages that contain the words “free”, “click here”, and “money”, that include shortened links, and that are written in all capital letters with exclamation marks.

Spam is defined as unsolicited messages with commercial or malicious intent, in contrast to legitimate messages, which are termed ham. While typically thought of as harmless, spam can carry serious consequences for the receiver because it can employ phishing tactics. An example of phishing is when a seemingly legitimate sender (such as “Elon University”) tries to obtain sensitive data such as passwords from its recipients. Here is an example from my own spam showing what a phishing message tends to look like:

The data set I used throughout this project to compare ham and spam messages is provided by UCI Machine Learning and contains over 5,000 text messages.
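The code in the following sections works on a data frame called txtSpam, with a matching one for ham. As a minimal sketch of how such a split might be made, the example below builds a tiny stand-in table and filters it by label. The label column name v1, the name txtHam, and the toy rows are assumptions for illustration; only v2, the message text column, appears in the project code.

```r
library(dplyr)

# Toy stand-in for the real file. v1 (label) is an assumed column name;
# v2 (message text) is the column the tokenizing code reads.
messages <- read.csv(
  text = 'v1,v2
ham,"Yup, no need. I will jus wait 4 e rain 2 stop."
spam,"WINNER!! Call 09061701461 to claim your prize."
ham,"Found it, where you at?"',
  stringsAsFactors = FALSE
)

# One data frame per category
txtSpam <- messages %>% filter(v1 == "spam")
txtHam  <- messages %>% filter(v1 == "ham")
```

With the real data, the inline text would be replaced by the path to the downloaded file.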

Most Frequent Words

The variable txtSpamWords is created from the data frame txtSpam, which contains all of the text messages in the dataset categorized as spam. The unnest_tokens function splits each message into one word per row, and an anti_join against stop_words removes the stop words. Using the count function, I found that “call” and “free” are the most frequently used words in spam text messages. Here are some example spam messages showing these words in context:

WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.

Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030

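library(dplyr)
library(tidytext)
library(ggplot2)

# Split each spam message into one word per row
txtSpamWords <- txtSpam %>% unnest_tokens(output = word, input = v2)

# Drop common stop words such as "the" and "to"
txtSpamWords <- txtSpamWords %>% anti_join(stop_words, by = "word")

# Count how often each remaining word appears, most frequent first
txtSpamWordsCount <- txtSpamWords %>% count(word, sort = TRUE)

# Bar chart of the 20 most frequent spam words
txtSpamWordPlot <- txtSpamWordsCount %>%
  top_n(20, n) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_bar(stat = "identity") + coord_flip()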

In comparison, “2” and “gt” are the most frequently used words in ham text messages. Neither is generally viewed as a legitimate word, but both can be read as slang, for “to” and “got to”. The messages below show how they appear in context:

Yup, no need. I'll jus wait 4 e rain 2 stop.

Found it, ENC <#> , where you at?

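At first, I assumed gt was slang for “got to”. On further investigation, however, what I was interpreting as a word is actually part of an encoded character: the <#> placeholder in the message above appears to be stored in the raw data as &lt;#&gt;, so tokenizing it leaves lt and gt behind as words.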
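To see how an encoded placeholder can end up in a word count, the sketch below tokenizes a message containing &lt;#&gt; and then repeats the step after stripping the two entities. It assumes the raw text stores the placeholder as &lt;#&gt;; the variable names raw, tokens, and cleaned are made up for this example.

```r
library(dplyr)
library(tidytext)

# Assumed raw form of the example message, with the placeholder still encoded
raw <- tibble(v2 = "Found it, ENC &lt;#&gt; , where you at?")

# Tokenizing as-is strips the punctuation and leaves "lt" and "gt" as words
tokens <- raw %>% unnest_tokens(output = word, input = v2)

# Removing the two entities first keeps them out of the word list
cleaned <- raw %>%
  mutate(v2 = gsub("&lt;|&gt;", " ", v2)) %>%
  unnest_tokens(output = word, input = v2)
```

An alternative with the same effect on the counts would be to filter the words lt and gt out after tokenizing.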

The process for determining the most frequent ham words is the same as the one used for the spam words above.
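As a rough sketch of that same process applied to ham, the example below runs the tokenize, filter, and count steps on three made-up messages. The data frame name txtHam and its toy rows are assumptions; the real ham subset would take their place.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the ham subset, one message per row in v2
txtHam <- tibble(v2 = c(
  "I'll jus wait 4 e rain 2 stop",
  "Ok lar, going 2 town now",
  "Found it, where you at?"
))

# Same steps as for spam: one word per row, drop stop words, count
txtHamWordsCount <- txtHam %>%
  unnest_tokens(output = word, input = v2) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```

Because the tokenizer keeps digits, the "2" used as shorthand in two of the toy messages is counted as a word, which is how it can reach the top of the ham list.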

Word Cloud

As predicted in my hypothesis, the word cloud for spam text contains words and phrases that seem too good to be true. These words include “free”, “claim”, and “prize”.

Spam

Ham

Ham vs. Spam Word Sentiments

When comparing the sentiment of the words used in ham and spam text messages, it is no surprise to me that spam messages use mostly positive words rather than overwhelmingly negative ones. I believe this is meant to appeal to the receiver of the message; as in my hypothesis, spammers typically want to create the illusion of an offer that is too good to be true.

Spam

Ham

# Label each spam word as positive or negative, then count by word
bing_spam_counts <- txtSpamWords %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

The variable bing_spam_counts starts from the spam texts that have been separated by word. Each word is then categorized as either positive or negative using the bing lexicon; words that do not appear in the lexicon are dropped by the inner join.

bing_spam_counts %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  coord_flip() +
  ggtitle("Spam Word Sentiment")

The process for plotting ham sentiment is the same as the one used for spam sentiment above.
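A minimal sketch of the ham version follows, using a handful of made-up words in place of the tokenized ham messages. The names txtHamWords, bing_ham_counts, and ham_sentiment_plot, along with the toy word list, are assumptions for illustration.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Hypothetical stand-in for the tokenized ham words (one word per row)
txtHamWords <- tibble(word = c("love", "good", "sorry", "love", "happy", "bad"))

# Label each word with the bing lexicon and count by word and sentiment
bing_ham_counts <- txtHamWords %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

# Top words per sentiment, drawn as one panel per sentiment
ham_sentiment_plot <- bing_ham_counts %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  coord_flip() +
  ggtitle("Ham Word Sentiment")
```

Printing ham_sentiment_plot draws the chart; with the real data, txtHamWords would come from the tokenizing and stop word steps rather than a hand-written list.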

Discussion

Based on the UCI Machine Learning dataset of spam and ham text messages, I can make a number of assumptions. One is that the data was collected from younger people, as suggested by the slang, abbreviated words, and foul language.

While the Federal Trade Commission recommends not responding to spam messages, if you are ever curious about what would happen if you did, watch this TED talk. In all seriousness, clicking on spam links or responding to spam messages poses a serious risk of a user’s device being infected with malware. The best way to protect yourself from spammers and hackers is to delete any suspicious messages.

With further research, I would like to investigate spam on the other media where I often see it, including email and web applications like Facebook, Twitter, and GroupMe.