Sentiment Analysis of Popular Media Publications

Reid Cobb

By far one of the most divisive candidates in recent times, President Trump was portrayed in a variety of lights depending on the publication. This dataset from Kaggle provides 142,570 articles from 15 different news outlets, mostly from 2016-2017 with a few from 2015. The outlets are: The Atlantic, Breitbart, Business Insider, Buzzfeed News, CNN, Fox News, The Guardian, National Review, New York Post, New York Times, NPR, Reuters, Talking Points Memo, Vox, and The Washington Post.

Looking at this data, I wanted to examine how these publications viewed the candidates from a data science point of view. I used the NRC sentiment tool available in R to attach overall emotions to articles. What it does is label each individual word with the emotions most commonly associated with it.
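For instance, syuzhet's get_nrc_sentiment() takes a character vector and returns, for each element, counts of words matching each of the NRC lexicon's eight emotions plus positive and negative. A minimal illustration (the sample sentence here is my own):

library(syuzhet)

#count NRC emotion words in a sample sentence
get_nrc_sentiment("The chaotic hearing sparked outrage and fear among supporters")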

In addition to this, I looked into n-gram associations with keywords across different publications and mediums. One particular area of interest is comparing predominantly print media to predominantly online media, and whether there are any differences in associations.
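An n-gram is just a run of n consecutive words; bigrams (n = 2) are what I use throughout. A quick toy example with tidytext, using a made-up sentence:

library(tidytext)
library(dplyr)

#split a sample sentence into bigrams
tibble(line = "the quick brown fox jumps") %>% unnest_tokens(bigram, line, token = "ngrams", n = 2)
#yields: "the quick", "quick brown", "brown fox", "fox jumps"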

My overall hypothesis is that sentiment toward Trump will be rather negative, particularly within online publications, because at times they appear not to be held to the same journalistic standards as print publications, and therefore have more leeway to be biased.

Organizing The Data

The data from Kaggle was split into three different CSVs, so in order to get them into one dataset, we can just do a simple rbind.

#load the csv files in (read_csv comes from readr, part of the tidyverse)

library(readr)

article1 <- read_csv("Final Project/articles1.csv")

article2 <- read_csv("Final Project/articles2.csv")

article3 <- read_csv("Final Project/articles3.csv")

#combine into one dataset

allarticles <- rbind(article1, article2, article3)
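As a quick sanity check (my addition), the combined data frame should contain all 142,570 articles:

#verify the combined row count
nrow(allarticles)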

From there, we can either use 'allarticles' for an overall analysis, or we can look at individual publications by filtering them as such:

library(tidyverse)

library(tidytext)

require(devtools)

library(wordcloud2)

library(wordcloud)

library(stringr)

library(dplyr)

library(syuzhet)

library(ggplot2)

#create separate datasets for each publication

atlantic <- allarticles %>% filter(publication %in% "Atlantic")

breitbart <- allarticles %>% filter(publication %in% "Breitbart")

business <- allarticles %>% filter(publication %in% "Business Insider")

buzz <- allarticles %>% filter(publication %in% "Buzzfeed News")

cnn <- allarticles %>% filter(publication %in% "CNN")

fox <- allarticles %>% filter(publication %in% "Fox News")

guardian <- allarticles %>% filter(publication %in% "Guardian")

nr <- allarticles %>% filter(publication %in% "National Review")

nyp <- allarticles %>% filter(publication %in% "New York Post")

nyt <- allarticles %>% filter(publication %in% "New York Times")

npr <- allarticles %>% filter(publication %in% "NPR")

reuters <- allarticles %>% filter(publication %in% "Reuters")

tpm <- allarticles %>% filter(publication %in% "Talking Points Memo")

vox <- allarticles %>% filter(publication %in% "Vox")

wp <- allarticles %>% filter(publication %in% "Washington Post")

#create separate datasets for print and web publications (note: 'print' masks base R's print() in this session)

print <- rbind(atlantic, guardian, nr, nyp, nyt, wp)

web <- rbind(breitbart, business, buzz, cnn, fox, npr, reuters, tpm, vox)

You will notice that the last two variables created are 'print' and 'web', even though all of our articles come from websites. I decided to split the publications by what they are predominantly known for, i.e. the NYT being a newspaper and Breitbart being a website. The two outliers are Fox and CNN: they are primarily cable news networks, so I listed them under 'web', as they do not produce a physical copy of their reports like those in 'print' do.
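Since the filters above match publication names as exact strings, one check worth running (my addition, not part of the original pipeline) is a per-outlet tally, which also shows how many articles each publication contributes:

#tally articles per publication to confirm the label spellings
allarticles %>% count(publication, sort = TRUE)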

There are, of course, drawbacks to the methodology presented here. One was just listed above. Another is the 'nrc' sentiment tool: it can only detect sentiment at face value, not in context, so sarcasm, jokes, or unusual scenarios may distort the data. In addition, the lexicon is largely web-based and may not be fully adept at analyzing journalism in terms of sentiment. Another thing to note is that there may be a lot of cross-contamination between articles about the candidates, as some mention both; in that case, an article could raise or lower the numbers for a candidate even though it is primarily not about them. I still believe this data is important to look at; you just have to recognize these pitfalls and take them for what they are.

Across All Articles

Let's first look at the overall sentiment of articles that contain the keyword, "Trump".

trumparticles <- allarticles %>% filter(str_detect(content, "Trump"))

get_sentiments("nrc")

#score every article with the NRC lexicon (slow on this many documents), then attach the scores to the articles

TRsentiment <- get_nrc_sentiment(trumparticles$content)

trumpsentiment <- cbind(trumparticles, TRsentiment)

View(trumpsentiment)

#columns 11:20 are the ten NRC score columns appended by the cbind

TRsentimentTotals <- data.frame(colSums(trumpsentiment[,c(11:20)]))

names(TRsentimentTotals) <- "count"

TRsentimentTotals <- cbind("sentiment" = rownames(TRsentimentTotals), TRsentimentTotals)

rownames(TRsentimentTotals) <- NULL

ggplot(data = TRsentimentTotals, aes(x = sentiment, y = count)) + geom_bar(aes(fill = sentiment), stat = "identity") + theme(legend.position = "none") + xlab("Sentiment") + ylab("Total Count") + ggtitle("Sentiment for Trump Articles")

Let's take this from top to bottom. I first filtered all the articles with str_detect to keep those containing the word 'Trump'. From there we compute the NRC sentiments of all the articles, and then cbind the sentiment scores onto the articles themselves.

Next, we total all of the sentiments together and put them in a data frame. The final piece of code is a ggplot of that data frame, which can be found below. I simply repeated this process for all of the sentiment analysis graphs, just with different initial datasets, e.g. CNN articles vs. all articles, and also with 'Clinton' as the string to search for.
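Because the steps repeat verbatim, they could be wrapped in a small helper. The following is a hypothetical refactor of the block above, not code from the original analysis; it assumes the same libraries are loaded and totals get_nrc_sentiment()'s ten columns directly rather than by position:

#hypothetical helper: filter -> score -> total -> plot for any dataset/keyword
plot_keyword_sentiment <- function(articles, keyword) {
  hits <- articles %>% filter(str_detect(content, keyword))
  sent <- get_nrc_sentiment(hits$content)
  totals <- data.frame(count = colSums(sent))
  totals$sentiment <- rownames(totals)
  ggplot(totals, aes(x = sentiment, y = count, fill = sentiment)) +
    geom_col() +
    theme(legend.position = "none") +
    labs(x = "Sentiment", y = "Total Count", title = paste0("Sentiment for ", keyword, " Articles"))
}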

Using the same method, we can also look at articles that contain the keyword, "Clinton".

As you can see, the distribution of sentiments is ultimately the same for both candidates, with Trump simply having higher totals because he is mentioned more often. The ratios of positive and negative sentiments are roughly equal between the two.

CNN vs FOX

Another interesting comparison is the sentiment toward both candidates across two of the most popular cable networks. Typically, CNN is regarded as left-leaning and Fox as right-leaning, which makes this an interesting juxtaposition. The drawback is that they are primarily cable outlets, meaning it would be more accurate to compare TV transcripts; but we only have their online articles, so those will have to do.
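With the hypothetical helper sketched earlier, the four charts in this section would come from calls along the lines of:

#one chart per network/candidate pairing
plot_keyword_sentiment(cnn, "Trump")
plot_keyword_sentiment(fox, "Trump")
plot_keyword_sentiment(cnn, "Clinton")
plot_keyword_sentiment(fox, "Clinton")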

Here, it largely remains the same as well. However, for Clinton there is slightly more sadness, and for Trump there is slightly more anticipation.

Ultimately, we reach the same conclusion here: both Fox and CNN appear to attach the same mix of sentiments to each candidate across their articles.

Web vs. Print

Once again, for the web articles, both candidates appear to have the same sentiments across the board.

This is again reflected in the print articles: there seems to be no discrepancy in sentiment levels between Trump and Clinton. Even comparing the print and web Trump graphs, they remain largely consistent.

By these analyses, we can conclude that the sentiments associated with each candidate in articles are ultimately even, and that outlets showed no favoritism in terms of negative or positive connotations.

N-Grams

I also thought it would be fun to compare the same publications as above, but with n-grams, or most common word pairings. For each keyword, two counts are computed: one of the most common words before the keyword, and one of the most common words after it.

#wrap the articles in a one-column frame (column names get prefixed, e.g. text.content)

articleDF <- data.frame(text = allarticles)

#split each article into sentences

documentLines <- unnest_tokens(articleDF, input = text.content, output = line, token = "sentences", to_lower = F)

documentLines$lineNo <- seq_along(documentLines$line)

#tokenize the sentences into bigrams (lowercased by default)

df_text_to_bigrams_tidy <- documentLines %>% unnest_tokens(output = bigram, input = line, token = "ngrams", n = 2)

head(df_text_to_bigrams_tidy)

df_text_to_bigrams_tidy %>% count(bigram, sort = TRUE)

bigrams_separated <- df_text_to_bigrams_tidy %>% separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>% filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word)

bigram_counts <- bigrams_filtered %>% count(word1, word2, sort = TRUE)

bigram_counts

#most common words before and after each keyword----

teerump <- bigrams_filtered %>% filter(word1 == "trump") %>% count(word2, sort = TRUE) #words after "trump"

tearump <- bigrams_filtered %>% filter(word2 == "trump") %>% count(word1, sort = TRUE) #words before "trump"

clinton <- bigrams_filtered %>% filter(word1 == "clinton") %>% count(word2, sort = TRUE) #words after "clinton"

cleanton <- bigrams_filtered %>% filter(word2 == "clinton") %>% count(word1, sort = TRUE) #words before "clinton"

OK, let's break that down. Essentially, we take our data frame, split it into sentences, split those sentences into bigrams, filter out stop words, and then count the most common word pairings with each keyword.

Some of these pairings will be self-evident, such as a first name, 'administration', or 'campaign', but it is still fun to look at particular differences between mediums.

Before Trump

1 donald 55418
2 president 10795
3 ivanka 1619
4 melania 1208
5 support 970
6 campaign 757
7 eric 558
8 stop 546
9 endorsed 512
10 time 491

After Trump

1 administration 10168
2 campaign 6310
3 tower 3289
4 supporters 3155
5 told 2717
6 organization 2179
7 won 2073
8 realdonaldtrump 1823
9 presidency 1549
10 administration’s 1387

Before Clinton

1 hillary 30633
2 bill 5807
3 secretary 1172
4 president 590
5 chelsea 516
6 rodham 363
7 support 266
8 top 235
9 endorsed 201
10 book 149

After Clinton

1 campaign 3978
2 foundation 2955
3 cash 691
4 won 681
5 administration 594
6 told 593
7 supporters 578
8 email 528
9 aide 489
10 wins 415

Here we can see that family members are often brought into the political fray. One particular thing to note is that outlets seem to have often quoted Trump's Twitter, as his handle appears in the top 10 of the 'after' category. For Clinton, emails were of particular importance, as were 'aide' and 'endorsed'.
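The per-outlet and per-medium tables below come from rerunning the same bigram pipeline on each subset. The original code for this is not shown, but a sketch of my reconstruction for the Fox subset would be:

#same bigram pipeline, scoped to Fox
foxLines <- unnest_tokens(data.frame(text = fox), input = text.content, output = line, token = "sentences", to_lower = F)
foxBigrams <- foxLines %>%
  unnest_tokens(output = bigram, input = line, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word)

#most common words before "Trump" in Fox articles
foxBigrams %>% filter(word2 == "trump") %>% count(word1, sort = TRUE)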

Fox vs. CNN

Fox: Before Trump

1 donald 1691
2 called 42
3 support 41
4 coverage 40
5 melania 37
6 endorse 29
7 president 28
8 stop 27
9 endorsed 25
10 campaign 24

Fox: After Trump

1 campaign 308
2 told 152
3 supporters 119
4 tweeted 80
5 university 74
6 tower 71
7 won 69
8 administration 67
9 called 45
10 supporter 45

Fox: Before Clinton

1 hillary 1482
2 bill 307
3 secretary 139
4 president 42
5 chelsea 40
6 top 33
7 coverage 31
8 trump 28
9 longtime 21
10 support 21

Fox: After Clinton

1 campaign 388
2 foundation 271
3 aide 86
4 told 60
5 email 58
6 emails 47
7 leads 35
8 aides 34
9 supporters 34
10 won 33

CNN: Before Trump

1 donald 3814
2 president 653
3 melania 131
4 ivanka 129
5 campaign 83
6 time 62
7 clinton 53
8 related 52
9 eric 48
10 support 48

CNN: After Trump

1 administration 814
2 campaign 729
3 told 323
4 tweeted 242
5 tower 234
6 administration’s 142
7 called 134
8 supporters 133
9 transition 124
10 organization 112

CNN: Before Clinton

1 hillary 2036
2 bill 464
3 secretary 78
4 president 42
5 chelsea 39
6 related 31
7 top 24
8 check 22
9 endorsed 22
10 accused 18

CNN: After Clinton

1 campaign 374
2 foundation 194
3 told 81
4 trump 53
5 email 48
6 aides 47
7 won 42
8 correctional 41
9 supporters 39
10 aide 34

Something I spotted here is that CNN seemed to focus more on Trump's business interests: the Trump Organization makes CNN's top ten after 'Trump' but not Fox's. Conversely, the Clinton Foundation ranks relatively higher in Fox's list than in CNN's. This could potentially point to inherent biases each network holds.

Another thing is that CNN has the word 'accused' in 'Before Clinton', meaning they may have been writing about the accusations against Clinton. Whether they were defending or attacking on this issue is unclear.

Web vs. Print

Web: Before Trump

1 donald 34294
2 president 5245
3 ivanka 767
4 melania 593
5 support 527
6 campaign 432
7 2016 315
8 stop 313
9 time 310
10 endorsed 277

Web: After Trump

1 administration 5723
2 campaign 3707
3 supporters 1784
4 told 1744
5 realdonaldtrump 1424
6 tower 1345
7 won 1249
8 tweeted 818
9 administration’s 755
10 presidency 721

Web: Before Clinton

1 hillary 17650
2 bill 3286
3 secretary 781
4 president 313
5 chelsea 271
6 rodham 187
7 support 174
8 top 148
9 endorsed 134
10 book 130

Web: After Clinton

1 campaign 2567
2 foundation 2013
3 cash 633
4 won 390
5 email 382
6 told 367
7 supporters 361
8 aide 326
9 administration 275
10 wins 259

Print: Before Trump

1 donald 21124
2 president 5550
3 ivanka 852
4 melania 615
5 support 443
6 campaign 325
7 eric 319
8 endorsed 235
9 stop 233
10 endorse 209

Print: After Trump

1 administration 4445
2 campaign 2603
3 tower 1944
4 organization 1485
5 supporters 1371
6 told 973
7 presidency 828
8 won 824
9 administration’s 632
10 university 573

Print: Before Clinton

1 hillary 12983
2 bill 2521
3 secretary 391
4 president 277
5 chelsea 245
6 rodham 176
7 support 92
8 top 87
9 endorsed 67
10 team 64

Print: After Clinton

1 campaign 1411
2 foundation 942
3 administration 319
4 won 291
5 told 226
6 supporters 217
7 aide 163
8 presidency 163
9 wins 156
10 comparing 148

These findings are similar to those in the previous categories. One interesting thing to note is that in the Web Before Trump list, number 8 is 'stop', suggesting a degree of opposition to Trump's candidacy among web publications. Another interesting thing is the word 'cash' in Web After Clinton, suggesting that a scandal or something of that sort may have been discussed among these websites.

Print is largely the same story; however, 'university' shows up in Print After Trump (as it also did for Fox), pointing to coverage of Trump University. This may reflect the more investigative journalism that print publications take part in more so than online publications. Another key term found in Clinton's groupings is 'team', suggesting that the print publications may have had a slightly more positive outlook on this candidate.

Conclusion

Overall, my hypothesis was wrong. Sentiment directed at Trump was not overwhelmingly more negative than sentiment directed at Clinton over the course of the timeline, as seen in the various graphs above. One important insight gained from this study was the implicit biases that different forms of publication tend to have. It would be interesting to take an even deeper look into each publication and see how they compare to one another in terms of overall sentiments and n-gram associations. Another thing I wish I could do, but could not figure out how, was an 'afinn' analysis of articles over time, plotting a keyword's scores over a year to see whether sentiment became more positive or negative. Still, I feel this study provided a relatively comprehensive look into how different media portray certain subjects, especially one as monumental as an election.
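For anyone wanting to attempt that follow-up, here is one hedged sketch of how the AFINN-over-time idea might look. It assumes the dataset's date column parses with as.Date(), and uses tidytext's AFINN lexicon, whose score column is named 'value' in recent releases (get_sentiments("afinn") may prompt you to download it via the textdata package):

#hypothetical sketch: average AFINN score of Trump articles by month
afinn <- get_sentiments("afinn")
trump_by_month <- trumparticles %>%
  select(date, content) %>%
  unnest_tokens(word, content) %>%
  inner_join(afinn, by = "word") %>%
  mutate(month = format(as.Date(date), "%Y-%m")) %>%
  group_by(month) %>%
  summarise(mean_score = mean(value))

ggplot(trump_by_month, aes(x = month, y = mean_score, group = 1)) + geom_line() + labs(x = "Month", y = "Mean AFINN score", title = "Trump Article Sentiment Over Time")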