Text Analysis of Commencement Speeches

Ivy League Schools - 2017

By: Natalie Wright | April 22, 2018

Hypothesis: Commencement speeches at Ivy League schools in 2017 included as many negative words as positive words because they focused heavily on the social issues, movements and national events that were prominent in the media over the course of the academic year.

Background

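What is the Ivy League? The Ivy League is a collegiate athletic conference composed of eight private universities located in the Northeastern United States. The eight schools are Brown University, Columbia University, Cornell University, Dartmouth College, Harvard University, Princeton University, the University of Pennsylvania and Yale University.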

The Ivy League represents more than 8,000 student-athletes each year. Combined, the schools have won 287 team national championships and 579 individual national championships since the beginning of intercollegiate competition.

Ivy League schools are known not only for their success in athletics but also as some of the most prestigious universities in the nation. For example, U.S. News & World Report ranks Princeton, Harvard and Yale, respectively, as the top three universities in its national rankings list, and all eight schools appear in the top fourteen spots.

Still curious? Check out the Ivy League website: ivyleague.com/index.aspx

Commencement Speeches

These speeches are given to graduating seniors during the ceremony at which they receive their diplomas and degrees. Schools usually invite politicians, celebrities or other prominent citizens to share their experiences and offer advice to the young professionals who will soon embark on their own unique journeys.

Here's a list of the commencement speakers at all eight Ivy League Schools in 2017:

School	Speaker	Title
Brown University	Christina Paxson	President of Brown University
Columbia University	Lee Bollinger	President of Columbia University
Cornell University	Joe Biden	former Vice President of the US
Dartmouth College	Jake Tapper	'91 alum, award-winning journalist
Harvard University	Mark Zuckerberg	CEO of Facebook
Princeton University	Christopher Eisgruber	President of Princeton University
University of Pennsylvania	Cory Booker	New Jersey Senator
Yale University	Theo Epstein	President of Chicago Cubs

*click on the Speaker's name to see full text transcript of speech

Data & Analysis

Required R Packages

tidyverse
tidytext
scales
wordcloud
reshape2
get_sentiments("bing")
data("stop_words")
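The list above mixes packages with two datasets that ship with tidytext. A minimal sketch of how this setup could be loaded, assuming all five packages are already installed, is shown here. The object name bing is hypothetical and does not appear in the original analysis.

```r
# Load the packages named in the list (assumes they are installed)
library(tidyverse)   # dplyr, tidyr, ggplot2 and the %>% pipe
library(tidytext)    # unnest_tokens(), get_sentiments(), stop_words
library(scales)
library(wordcloud)   # wordcloud(), comparison.cloud()
library(reshape2)    # acast()

# Bing lexicon: one row per word, labelled "positive" or "negative"
# (the name `bing` is an assumption made for this sketch)
bing <- get_sentiments("bing")

# Stop-word list bundled with tidytext
data("stop_words")
```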

Data Collection

To begin my analysis, I collected the full-text transcripts of all eight commencement speeches from the 2017 Ivy League graduations (these can be found in the links above). I pulled the text into a Microsoft Word document and cleaned it by removing sound cues and any text that was not said by the speaker. For example, one of the transcripts included "(Applause)" and "(Laugh)" every time the crowd responded to the speaker. These cues do not pertain to the analysis, so I deleted them from the text. I also deleted every em dash, because the character interfered with one of the later steps (separating text into words).

It was also very important to remove the contractions “it’s”, “I’m”, “I’ve”, “don’t”, “that’s” and “we’re” when cleaning the text, because they would otherwise rank among the most popular words later in the analysis. Using gsub later in the analysis is an option, but a plain pattern also matches inside longer words and can damage terms that matter to the analysis. For example, removing “I’ve” with gsub can also strip the letters “ive” out of “university”, leaving “unrsity”. So it is best to remove these contractions now.
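The gsub pitfall described above can be reproduced in base R. This is only a sketch: the toy sentence is made up, and it assumes the text has already been lowercased and stripped of apostrophes. It also shows one possible workaround, a word-boundary pattern, which is not something the original analysis used.

```r
# Toy sentence, already lowercased with apostrophes stripped (an assumption)
text <- "ive visited the university"

# A plain pattern matches anywhere, including inside longer words
naive <- gsub("ive", "", text)

# \\b anchors the pattern to word boundaries, so only the whole word is removed
bounded <- gsub("\\bive\\b", "", text, perl = TRUE)

print(naive)
print(bounded)
```

In the first result "university" loses its middle letters, while the word-boundary version leaves it intact.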

Once all of the speeches were cleaned, I saved the document as a .txt file, using the Unicode 9.0 UTF-8 text encoding instead of the Mac OS default. This file is referred to later in the analysis as ivyLeagueText3.txt.

Analysis

Most Popular Words

First, I read the ivyLeagueText3.txt file, which I had added to the working folder for this R project, into a variable called ivyLeague. That variable was listed as a value (readLines returns a character vector), so I turned it into a data frame called ivyLeagueDF, which would allow me to conduct further analysis. ivyLeagueDF lists the text line by line. Next, I used the unnest_tokens function to separate the text into sentences (the ivyLeagueLines variable) and into words (the ivyLeagueWords variable).

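# Read the cleaned transcript file, one element per line
ivyLeague <- readLines("ivyLeagueText3.txt")
# Wrap the character vector in a data frame (text kept as character, not factor)
ivyLeagueDF <- data.frame(text = ivyLeague, stringsAsFactors = FALSE)
# Split into sentences (case preserved) and into lowercase words
ivyLeagueLines <- unnest_tokens(ivyLeagueDF, input = text, output = line, token = "sentences", to_lower = FALSE)
ivyLeagueWords <- ivyLeagueDF %>% unnest_tokens(word, text)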

Next, to clean the data further, I removed the stop words. Stop words are very common words that carry little meaning in a text analysis (for example, “of”, “to” or “the”). I removed them by anti-joining the ivyLeagueWords variable against the stop_words dataset that is part of the tidytext package. The new variable was named ivyLeagueWords2.

ivyLeagueWords2 <- ivyLeagueWords %>% anti_join(stop_words, by = c("word" = "word"))
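To make the effect of the anti-join concrete, here is a small sketch on a made-up sentence. The names toy and toy_words are hypothetical. Only the tokenizing and stop-word steps mirror the ones described above.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-sentence "speech" used only for illustration
toy <- tibble(text = "The future of the world belongs to the graduates")

toy_words <- toy %>%
  unnest_tokens(word, text) %>%          # one lowercase word per row
  anti_join(stop_words, by = "word")     # drop rows whose word is a stop word

print(toy_words)
```

Words such as "the", "of" and "to" are dropped, while content words such as "world" remain.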

Next, I created a variable called ivyLeagueWords3, which adds a column called “n” holding the count for each word, that is, how many times the word appeared across the 2017 Ivy League commencement speeches. The new dataset is sorted by these counts from highest to lowest.

ivyLeagueWords3 <- ivyLeagueWords2 %>% count(word, sort = TRUE)

Finally, the data is ready to be made into a visualization. I used ivyLeagueWords3 to create a bar chart (that I flipped so the bars are horizontal) for the 20 words with the highest counts. Since there was a tie between the 20th and 21st word, the bar chart actually includes the top 21 words. I also made sure to order the words so that the most popular was on top, for ease of reading and understanding.

ivyLeagueWords3 %>%
  top_n(20, n) %>%                       # ties are kept, so 21 rows here
  mutate(word = reorder(word, n)) %>%    # order bars by count
  ggplot(aes(x = word, y = n)) +
  geom_bar(stat = "identity", fill = "#a72121") +
  coord_flip() +                         # horizontal bars
  labs(title = "Most Popular Words - Ivy League Commencement 2017")
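The extra word in the chart comes from how top_n treats ties. A minimal sketch with made-up counts illustrates this; the names toy_counts, top_two and strict are hypothetical. The slice_max call is shown only as one possible alternative if exactly the requested number of rows is wanted. It assumes dplyr 1.0.0 or later and is not part of the original analysis.

```r
library(dplyr)

# Hypothetical word counts with a tie for second place
toy_counts <- tibble(
  word = c("people", "world", "time", "day"),
  n    = c(10, 7, 7, 3)
)

# Ask for the top 2 by n; both words tied at 7 are kept
top_two <- toy_counts %>% top_n(2, n)

# Alternative: break ties arbitrarily and return exactly 2 rows
strict <- toy_counts %>% slice_max(order_by = n, n = 2, with_ties = FALSE)

print(nrow(top_two))
print(nrow(strict))
```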

People/Places	Measures/Time	Feelings/Emotions
people	time	purpose
world	day	love
class	life	change
president	generation	hard
university	remember
country	moment
players	2017
society
college
person

The words that appeared most often in the 2017 Ivy League commencement speeches can be divided into three sections (shown in the table above): People/Places, Measures/Time and Feelings/Emotions. The largest number of words falls in the People/Places section. I created these three sections myself, based on the types of words in the data.

It is possible that the words in the People/Places section were used to address the audience and to describe the occasion. They also help identify the members of the community who are involved with the graduation. The Measures/Time section makes a lot of sense for a commencement speech: graduating from a prestigious university is a milestone and a major accomplishment in one’s life, and the heavy use of these words suggests that the speakers focused on explaining the importance of this milestone for the graduates. Finally, the Feelings/Emotions section includes four words that are relevant to what a graduate may be experiencing at that moment. Further analysis will be conducted to understand the sentiment of these words.

wordcloud(ivyLeagueWords3$word,
          ivyLeagueWords3$n,
          colors = "#a72121",
          random.order = FALSE,
          max.words = 20)

*This word cloud (shown on the right) is made from the top 20 words in ivyLeagueWords3.

Sentiment Analysis

To begin the sentiment analysis, I decided to use the bing sentiment dataset because it labels each word as either negative or positive. This makes it the best option for my hypothesis.

I took the ivyLeagueWords2 variable, which holds all of the commencement speech words that are part of the analysis, and used the inner_join function to combine it with the bing dataset. I then added the count of each word to this data frame, which is called ivyLeagueBingCounts. From it, I created side-by-side bar graphs showing the top ten negative words and the top ten positive words in the analysis.

ivyLeagueBingCounts <- ivyLeagueWords2 %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

ivyLeagueBingCounts %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(title = "Sentiment of Ivy League Commencement Speech Words - 2017") +
  coord_flip()

Looking at the side-by-side bar charts of the negative and positive words (as classified by bing) in the 2017 Ivy League commencement speeches, “hard” and “love” are at the top. While “hard” is listed as a negative word, it is very possible that the commencement speakers were referring to “hard work”, which isn’t really a negative idea. Further analysis may determine whether this is correct. The other words in the top ten negative words all describe negative feelings that graduates might have.
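One way the “hard work” question could be checked is to look at which words follow “hard” in the text. The sketch below runs on made-up sentences rather than the real transcripts, so its output says nothing about the actual speeches. The names toy_lines and hard_followers are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical sentences standing in for the real transcripts
toy_lines <- tibble(line = c(
  "Hard work pays off.",
  "It was a hard year for many families.",
  "Keep up the hard work."
))

hard_followers <- toy_lines %>%
  unnest_tokens(bigram, line, token = "ngrams", n = 2) %>%  # two-word windows
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "hard") %>%                               # keep "hard ___"
  count(word2, sort = TRUE)                                 # most common follower first

print(hard_followers)
```

If “work” dominated such a table for the real data, that would support reading “hard” as something other than negative.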

In the list of top ten positive words, “freedom” and “free” are at the top. These are both patriotic terms that could also be referring to the idea that the graduates are now free and able to move on in their own individual journeys. Other words like “congratulations” and “success” are common words used when addressing a recent graduate.

ivyLeagueBingCounts %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#a72121", "#a9a9a9"), max.words = 200, scale = c(.8, 0.25))

This second word cloud was created from the ivyLeagueBingCounts variable and shows the top 200 words once the bing dataset was combined with the dataset of words from the commencement speeches. All of the negative words are at the top of the word cloud in red and all of the positive words are at the bottom in grey.

Not much can be communicated by a word cloud, but it provides an interesting visualization.
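Because the hypothesis is about whether negative words appear as often as positive ones, overall totals per sentiment are more informative than a top-ten list or a word cloud. The sketch below shows how such totals could be computed from a table shaped like ivyLeagueBingCounts. The counts are invented for illustration, and the names toy_bing_counts and sentiment_totals are hypothetical.

```r
library(dplyr)

# Hypothetical counts in the same shape as ivyLeagueBingCounts
toy_bing_counts <- tibble(
  word      = c("love", "free", "hard", "risk"),
  sentiment = c("positive", "positive", "negative", "negative"),
  n         = c(12, 9, 14, 4)
)

sentiment_totals <- toy_bing_counts %>%
  group_by(sentiment) %>%
  summarise(total = sum(n)) %>%          # total word occurrences per sentiment
  mutate(share = total / sum(total))     # proportion of all matched words

print(sentiment_totals)
```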

Bigrams

Using the data frame called ivyLeagueLines (the commencement speech text separated into sentences), I used the unnest_tokens function to separate the text into bigrams. Bigrams are pairs of consecutive words.

Then I separated the bigrams into two columns: the first column for word 1 and the second column for word 2.

Next, I removed any bigram containing a stop word, because stop words reveal very little about the text and would not support any findings.

Finally, I created a column for counts of each bigram. This provided the dataset below. (The dataset has been limited to only show the top ten bigrams.)

df_text_to_bigrams_tidy <- ivyLeagueLines %>%
  unnest_tokens(output = bigram, input = line, token = "ngrams", n = 2)

bigrams_seperated <- df_text_to_bigrams_tidy %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_seperated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)

As you can see in the dataset, the most common bigram is "world series". One likely reason is that Theo Epstein, President of the Chicago Cubs, was the commencement speaker at Yale University in 2017, so it is possible that he used this bigram multiple times to describe his experiences with the World Series.

The next most common bigram was "god bless". This is a common phrase for concluding a speech, and it makes sense considering that Joe Biden, the former Vice President of the United States of America, spoke at Cornell University.

One interesting observation is that the bigram "french fries" was used four times across the commencement speeches. This most likely came from a single speech. Notice also that none of the bigrams relate to social events or national headlines from the academic year being analyzed.
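Guesses about which speech a bigram came from could be checked directly if each line of text carried a speaker label. The combined text file described earlier has no such label, so the speaker column below is an assumption, and the sentences are invented. The names toy_speeches, bigram_by_speaker and french_fries are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical lines tagged with a (made-up) speaker label
toy_speeches <- tibble(
  speaker = c("Speaker A", "Speaker B", "Speaker A"),
  line    = c("We ordered french fries.",
              "God bless you all.",
              "More french fries arrived.")
)

bigram_by_speaker <- toy_speeches %>%
  unnest_tokens(bigram, line, token = "ngrams", n = 2) %>%  # speaker column is kept
  count(speaker, bigram, sort = TRUE)                       # bigram counts per speaker

# Which speaker(s) used a given bigram, and how often
french_fries <- bigram_by_speaker %>% filter(bigram == "french fries")

print(french_fries)
```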

Conclusion

The first key finding of this analysis was that the top 20 words in the 2017 Ivy League commencement speeches can be divided into three main sections (people/places, measures/time, feelings/emotions). The largest section was people/places, with a heavy emphasis on the members of the community who played a role in the steps that led to the graduation ceremonies. The second key finding was that the top negative words in the speeches all related to feelings that graduates may be experiencing when thinking about their future, or feelings that they had during their journey to graduation. These included words like "struggling", "risk" and "loss". The positive words, on the other hand, looked toward the future. For example, top words were "free", "freedom" and "love". These positive words could also be described as patriotic. The last key finding was that the most common bigrams were "world series" and "god bless", which again suggests a patriotic theme.

It is important to understand that these findings do not reflect all college and university commencement speeches. They reflect only the 2017 speeches at the prestigious Ivy League schools.

With regard to my hypothesis, I am surprised to see that social issues, movements and national events did not appear to play a major role in the content of the commencement speeches at these schools. Instead, I did see some patriotic influences. As for the number of negative words compared to positive words, the sentiment analysis showed that the counts were similar, but there were definitely more positive words than negative. Understanding what prominent social and political speakers say to graduates of Ivy League schools is an interesting way of thinking about how educated individuals are addressed.

Additional Resources

"Text Mining with R" by Julia Silge and David Robinson