By Erik Webb
Viewers of “The Office” are as loyal as they come. You either love the show or you hate it. Running for nine seasons on NBC, the show follows the employees of the Dunder Mifflin Paper Company branch in Scranton, Pennsylvania.
Because the show is filmed in a documentary style, the audience is able to connect directly with the characters and become invested in their lives. As a result, extreme fandom has developed among a large portion of the audience, including myself. That is why I chose to do this project. When we were given the assignment in class to do a text-based analysis, I wondered if it would be possible to analyze the scripts of “The Office.” After finding a data set online that had all of the information I needed, I formed a hypothesis and off I went.
Because of the nature of the show, I predict that there will be a much more conversational tone in the writing, which will result in a higher level of sentiment when each episode is analyzed separately.
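All of the analysis below is done in R. For reference, the rough setup looks something like this (the package list is inferred from the functions used throughout this post, and the data import here is the same one shown again later):
library(dplyr)      # filter(), count(), mutate() and the %>% pipe
library(tidytext)   # unnest_tokens(), get_sentiments(), stop_words
library(ggplot2)    # the sentiment bar charts
library(readr)      # read_csv()
library(wordcloud2) # the word clouds
originalData <- read_csv("the-office-lines.csv")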
In order to analyze the words of each character, I had to limit the original data set to the lines spoken by that specific character. I chose 18 characters that I considered main or secondary characters: those who appear in the show for more than one season or serve an important purpose in the show’s story line. For example, for Michael:
originalData %>% filter(speaker=="Michael") -> Michael
At the same time, I created a new variable, Michael2 for example, that splits all of that character’s lines into individual words and runs a count to determine the most common words said by that character throughout the show. This code can be seen below.
Michael2 <- Michael %>%
dplyr::select(line_text) %>%
unnest_tokens(word, line_text) %>%
count(word, sort = TRUE) %>%
ungroup()
After that had run, I took the Michael2 variable and gathered the sentiments of each of the words using the bing lexicon (which assigns each word a positive or negative sentiment) and assigned that to the Michael3 variable because I am going to need the Michael2 list later in this project. At the same time, I created Michael4, which grouped all of those words by sentiment.
Michael3 <- Michael2 %>% inner_join(get_sentiments("bing")) %>% ungroup()
Michael3 %>% group_by(sentiment) %>% ungroup() -> Michael4
The next step was to take the Michael4 variable and graph the top 20 most common words associated with positive sentiment and the top 20 most common words associated with negative sentiment spoken by Michael. The code below filters out the negative words and builds that graph.
Michael4 %>% filter(sentiment=="negative") -> MichaelNegative
MichaelNegative %>% head(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
coord_flip()
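The positive-sentiment graph is built the same way, presumably something along these lines (MichaelPositive is a hypothetical counterpart to MichaelNegative above):
Michael4 %>% filter(sentiment=="positive") -> MichaelPositive  # hypothetical counterpart to MichaelNegative
MichaelPositive %>% head(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
coord_flip()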
With those graphs, you can compare the most common positive- and negative-sentiment words spoken by each character.
The final step in my analysis of each character’s dialogue was to create a word cloud of the most common words. To do this, I took the Michael2 data set mentioned earlier (which contains every word and how many times it is used) and created a new data set called MichaelCloud. When I did that, I removed the stop words, because I knew that words such as “like” and “and” would be very common in the show’s conversational dialogue. This allowed me to analyze the words of importance and see what resulted. The code to create the word clouds is listed below, including how I had to rename the “n” count column to “freq” so the wordcloud2 package would recognize the frequency of the words used.
MichaelCloud <- Michael2 %>% anti_join(stop_words) %>% head(100)
names(MichaelCloud)[2] <- "freq"
wordcloud2(MichaelCloud, size=1, shape = "oval", fontFamily="Arial", color="random-dark")
Below are the word clouds of the 100 most popular words from each character analyzed, as well as the breakdown of the words by sentiment.
If you look at each character’s words by sentiment, both positive and negative, you can get an indication of which words are most common in conversation between characters, depending on the tone of the show.
Looking at the sentiment graphs, I am glad that I removed the stop words for the word cloud analysis. The most common word associated with positive sentiment was “like,” which was the top positive word for 15 of the 19 characters I examined. Other common words across the board are “well,” “right” and “good,” all very common conversational words.
When it comes to words associated with negative sentiment, the most common word across the board was “sorry,” which was the top negative word for 16 of the 19 characters I analyzed. Other common negative words were “bad,” “lost” and “weird,” which appeared fairly consistently near the top of the lists for the characters I analyzed.
When it comes to the word clouds, I was not sure what to expect after removing the stop words from the data set. What I saw surprised me, but it made sense. The names of other characters appeared frequently in the word clouds, with the most common being the main character, Michael. This makes sense given that the characters are constantly conversing with one another, and other characters’ names were especially prominent for characters with love interests in the show. For example, this can be seen with Jim and Pam, Dwight and Angela, Erin and Andy, and Michael and Jan.
It is fascinating to look at each character’s most common words and see how they differ depending on their roles in the show, yet remain so similar because of how conversational the show is and how much people talk about, or talk to, Michael.
In addition to the analysis by character, I also decided to analyze the sentiment of every episode of the show to determine which episodes had the highest sentiment and which had the lowest.
To do this, I filtered the data set by season and episode number and created a variable for each episode. I wrote code that separated out all of the words and counted them, using the afinn lexicon to assign each word a sentiment value on a -5 to +5 scale. From there, I created a new column by multiplying the number of times each word was used by its sentiment value, and then summed that column to get an overall sentiment score for the episode.
originalData <- read_csv("the-office-lines.csv")
originalData %>% filter(episode == 1 & season == 1) -> s01ep01
s01ep01 <- s01ep01 %>%
dplyr::select(line_text) %>%
unnest_tokens(word, line_text) %>%
inner_join(get_sentiments("afinn")) %>%
count(word, score, sort = TRUE) %>%
ungroup() %>%
mutate(total = score * n)
sum(s01ep01$total)
I feel like getting the sum of the total sentiment of every word in the episode using that lexicon is an accurate way to assess the level of sentiment. I got a sentiment score for every episode and put it in a Microsoft Excel spreadsheet as I went. I saved the spreadsheet as a .csv file and imported it back into R.
episodes <- read_csv("episode_sent.csv")
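Had I automated that step, the same per-episode totals could probably be computed in a single grouped pipeline instead of filtering each episode by hand. A rough sketch, assuming the same columns used above and the older afinn “score” column name (newer tidytext versions call it “value”):
# Sketch only: one pass over the full script instead of one variable per episode.
# allEpisodes is a hypothetical name; total_sent matches the column used below.
originalData %>%
unnest_tokens(word, line_text) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(season, episode) %>%
summarise(total_sent = sum(score)) %>%
ungroup() -> allEpisodes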
I sorted the list from highest to lowest score, took the top 10 and bottom 10 episodes by score, and graphed both of those data frames. The code and those graphs are listed below.
arrange(episodes, desc(total_sent)) -> episodesRearranged
head(episodesRearranged, n=10) -> episodesTop10
tail(episodesRearranged, n=10) -> episodesBottom10
ggplot(episodesTop10, aes(reorder(title, -total_sent), total_sent)) +
geom_bar(stat = "identity", fill = "#bf80ff") +
labs(x = "Episode", y = "Sentiment", title = "Top 10 Episodes by Sentiment") -> episodeTopBar
ggplot(episodesBottom10, aes(reorder(title, total_sent), total_sent)) +
geom_bar(stat = "identity", fill = "#bf80ff") +
labs(x = "Episode", y = "Sentiment", title = "Bottom 10 Episodes by Sentiment") -> episodeBottomBar
The episode with the highest sentiment level is “Dunder Mifflin Infinity,” in which Ryan returns from corporate to his old office and reveals his plan to bring new technology to the company. In addition, Jim and Pam reveal their relationship to the rest of the office. Both of these are very positive developments in the story line of the show, so it makes sense that this is the episode with the highest sentiment. I was surprised that the series finale was only sixth on the list; I thought it would be higher.
The episode with the lowest sentiment score, and the only negative total, is “Murder.” In this season-six episode, the company is going through serious financial trouble and the murder mystery game that Michael puts on does not go well. Because of this, I can understand why it is the only episode to receive a negative sentiment score.
If you want to see all of the results broken down by season, the graphs below show just that.
I feel like my results supported my hypothesis, because the writing did turn out to be very conversational, with frequent use of words such as “like,” “good” and “right.” To confirm this, I put the entire script through an online readability test that determines how simple the writing is and identifies the grade level a reader would need in order to be comfortable with the text. When I did that with several different calculators, I got a range of grade levels from second to fourth grade. Regardless of which grade you consider correct, this suggests that the scripts of the show are very simple and easy to read and watch. That can also be seen in the conversations between characters and the frequent use of each other’s names. When it comes to sentiment, every episode had a positive sentiment score except for one, which makes sense considering the comedic nature of the show and the overall plot.
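To reproduce that grade-level check inside R rather than with an online calculator, a package such as quanteda.textstats offers standard readability formulas. A rough sketch, assuming its textstat_readability() function and the line_text column from the original data set:
# Rough sketch, not the online calculators used above: Flesch-Kincaid reports
# an approximate U.S. grade level for the combined script text.
library(quanteda.textstats)
script_text <- paste(originalData$line_text, collapse = " ")
textstat_readability(script_text, measure = "Flesch.Kincaid")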
Overall, season six is the least sentimental season: four of the bottom six episodes by sentiment are from that season. The season finales proved interesting to me because they are among the episodes with the highest sentiment scores, except for season six, whose finale has one of the lowest sentiment scores of any episode. Other than that, there doesn’t seem to be a correlation between episode number and sentiment score.