High School Required Reading Analysis

By: Laurel Wind

Hypothesis 1: Many high school students do not enjoy required reading because the books all share a similar sentiment, even though that sentiment is not necessarily representative of each author's other work.

Hypothesis 2: High school required reading novels follow similar sentiment patterns over the course of the novel, but those patterns differ from the patterns in other novels by the same author.

For this analysis, I chose the three most popular required reading books. According to GoodReads, those three books are "To Kill A Mockingbird" by Harper Lee, "The Great Gatsby" by F. Scott Fitzgerald, and "Romeo and Juliet" by William Shakespeare.

To test my hypotheses, I compared each of these three books to another popular work by the same author: Harper Lee's most recent work, "Go Set A Watchman"; F. Scott Fitzgerald's "This Side of Paradise", published five years before "The Great Gatsby"; and Shakespeare's "Hamlet". According to Indy100, "Hamlet" is the fourth most popular Shakespeare play; I selected it because the first three have similar popularity scores.

Sources:

The six novels can be accessed from the following:

Required R Packages:

This analysis requires the following R packages: wordcloud, tm, ggplot2, stringr, syuzhet, tidytext, and dplyr (which provides the join and count functions used in the sentiment code).
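A minimal sketch for checking that these packages are available before running the analysis. The helper name `missing_packages` is hypothetical, and dplyr is included as an assumption because the sentiment code uses join and count functions that it provides.

```r
# Packages named in this analysis; dplyr is an assumption (see the note above)
required <- c("wordcloud", "tm", "ggplot2", "stringr", "syuzhet", "tidytext", "dplyr")

# Hypothetical helper: return the packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

The install line is left commented out so that running the check never changes the local library without a deliberate choice.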

Word Clouds:

To begin this analysis, I created a word cloud for each of the six books to compare the word choices of each pair of books by the same author.

To create the word clouds, I used the following code and repeated it for all six novels.

library(tm)

library(wordcloud)

# Read the novel, one element per line (pick romeo.txt in the file dialog)
romeo <- readLines(file.choose())

# Convert to UTF-8 and drop any lines that could not be converted
romeo <- iconv(romeo, to = "UTF-8")

romeo <- romeo[!is.na(romeo)]

romeo.vec <- VectorSource(romeo)

romeo.corpus <- Corpus(romeo.vec)

# Clean the text: punctuation, case, stop words, numbers, extra whitespace
romeo.analysis <- tm_map(romeo.corpus, removePunctuation)

# tolower is a base R function, so tm needs it wrapped in content_transformer()
romeo.analysis <- tm_map(romeo.analysis, content_transformer(tolower))

romeo.analysis <- tm_map(romeo.analysis, removeWords, stopwords("english"))

romeo.analysis <- tm_map(romeo.analysis, removeNumbers)

romeo.analysis <- tm_map(romeo.analysis, stripWhitespace)

# wordcloud() accepts a corpus directly and computes the word frequencies itself
wordcloud(romeo.analysis, max.words=400, colors=c("salmon", "olivedrab3", "orchid3", "seagreen3", "orange3", "cyan3", "green3", "deepskyblue2", "slateblue1", "maroon2"))
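Because the same steps are repeated for all six books, they could be wrapped in a function. The sketch below is one possible way to do that, not the code used to produce the word clouds in this analysis. The helper names `word_frequencies` and `plot_novel_cloud` are hypothetical, and the stop word removal is moved ahead of the punctuation removal as an assumption, so that contractions such as "don't" still match the stop word list.

```r
library(tm)
library(wordcloud)

# Hypothetical helper: clean a character vector of lines and return
# a named vector of word frequencies, most frequent first
word_frequencies <- function(lines) {
  lines <- iconv(lines, to = "UTF-8")
  lines <- lines[!is.na(lines)]
  # Treat the whole book as one document so the term matrix stays small
  corpus <- VCorpus(VectorSource(paste(lines, collapse = " ")))
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Stop words go first so contractions such as "don't" still match
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  tdm <- TermDocumentMatrix(corpus)
  sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
}

# Hypothetical helper: draw a word cloud for one text file
plot_novel_cloud <- function(path, max_words = 400) {
  freq <- word_frequencies(readLines(path))
  wordcloud(names(freq), freq, max.words = max_words,
            colors = c("salmon", "olivedrab3", "orchid3", "seagreen3"))
}
```

With the text files in the working directory, a call such as plot_novel_cloud("romeo.txt") would then replace the repeated block for each book.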

Harper Lee:

To Kill A Mockingbird

Go Set A Watchman

The word cloud on the left is "To Kill A Mockingbird" and the word cloud on the right is "Go Set A Watchman". Based on the two word clouds, it is clear the two novels are about the same characters and utilize similar word choices and writing styles.

F. Scott Fitzgerald:

The Great Gatsby

This Side of Paradise

The word cloud on the left is "The Great Gatsby" and the word cloud on the right is "This Side of Paradise". Based on the two word clouds, it is clear the two novels are about different characters; however, both utilize similar word choices and writing styles.

William Shakespeare:

Romeo and Juliet

Hamlet

The word cloud on the left is "Romeo and Juliet" and the word cloud on the right is "Hamlet". Based on the two word clouds, it is clear the two plays are about different characters; however, both utilize the standard Shakespearean writing style.
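The word cloud comparisons above are visual. One way to put a number on "similar word choices" is to measure how much the most frequent words of two texts overlap. The sketch below uses Jaccard similarity, which is a technique I am adding for illustration and not part of the original analysis. The helper names `top_words` and `jaccard` are hypothetical, and the input is toy text, not the actual books. In practice, stop words should be removed first (as in the cleaning code above), or they will dominate the overlap.

```r
# Hypothetical helper: the n most frequent words in a character vector of lines
top_words <- function(lines, n = 100) {
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[nchar(words) > 0]
  ranked <- names(sort(table(words), decreasing = TRUE))
  ranked[seq_len(min(n, length(ranked)))]
}

# Jaccard similarity: shared words divided by all distinct words in either set
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Toy input for illustration only, not taken from the actual texts
text_a <- c("the cat sat on the mat", "the cat slept")
text_b <- c("the dog sat on the rug")

overlap <- jaccard(top_words(text_a), top_words(text_b))
print(overlap)
```

A value near 1 means the two texts share most of their frequent vocabulary, and a value near 0 means they share almost none.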

Sentiment Analysis:

To begin this section of analysis, I compared the sentiments of the three most popular required reading novels in high school.

To analyze the sentiment of the novels, I used the following code and repeated it for each of the six novels.

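library(dplyr)

library(tidytext)

library(ggplot2)

# One row per line of the novel (tibble() replaces the deprecated data_frame())
romeo_df <- tibble(romeo = romeo)

# Split the text into one lowercase word per row
tidy_romeo <- romeo_df %>% unnest_tokens(word, romeo)

data(stop_words)

tidy_romeo <- tidy_romeo %>% anti_join(stop_words, by = "word")

# Count how often each word appears (stored in column n)
tidy_romeo <- tidy_romeo %>% count(word, sort = TRUE)

# NRC lexicon; recent tidytext versions fetch it through the textdata package
nrc <- get_sentiments("nrc")

# Total word occurrences per sentiment; wt = n weights each word by its frequency
tidy_romeo <- tidy_romeo %>% inner_join(nrc, by = "word") %>% count(sentiment, wt = n, sort = TRUE)

ggplot(data = tidy_romeo, aes(x = sentiment, y = n)) + geom_bar(aes(fill = sentiment), stat = "identity") + theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Sentiment") + ylab("Count") + ggtitle("Sentiment for Romeo and Juliet")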
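To make the join-and-count step easier to follow, here is a small self-contained sketch on toy data. The three-line text and the lexicon are hypothetical stand-ins that I made up for illustration; the lexicon is not the real NRC lexicon. The sketch assumes dplyr and tidytext are installed. It shows that a word listed under two sentiments contributes to both, and that weighting by the word count sums occurrences and does not just count distinct words.

```r
library(dplyr)
library(tidytext)

# Toy text and toy lexicon for illustration only (NOT the real NRC lexicon)
toy_text <- tibble(text = c("Love and joy fill the house",
                            "Death brings grief and fear",
                            "Love conquers fear"))
toy_lexicon <- tibble(
  word      = c("love", "love", "joy", "death", "grief", "fear"),
  sentiment = c("positive", "joy", "joy", "negative", "sadness", "fear")
)

# One row per distinct word, with its number of occurrences in column n
word_counts <- toy_text %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

# Keep only words found in the lexicon, then sum occurrences per sentiment
sentiment_counts <- word_counts %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(sentiment, wt = n, sort = TRUE)

print(sentiment_counts)
```

Here "love" appears twice in the text and is tagged both "positive" and "joy", so it adds 2 to each of those sentiments.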

Based on the graphs, it is clear all three novels have very similar sentiments, as they are all slightly more negative than positive and all three have surprise as the lowest sentiment.
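Because the books differ in length, raw counts are easier to compare across books once they are converted to shares of each book's total. The sketch below is one way to do that and is an addition for illustration, not part of the original analysis. The helper name `sentiment_share` is hypothetical, and the counts are placeholder numbers, not results from the novels.

```r
# Hypothetical helper: convert sentiment counts into shares of the total
sentiment_share <- function(counts) {
  round(counts / sum(counts), 3)
}

# Placeholder counts for illustration only, not results from the novels
short_book <- c(positive = 50, negative = 60, surprise = 10)
long_book  <- c(positive = 500, negative = 600, surprise = 100)

print(sentiment_share(short_book))
print(sentiment_share(long_book))
```

The two placeholder books have very different raw counts but identical shares, which is the kind of similarity the bar charts are meant to show.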

At this point in my analysis, the first part of Hypothesis 1 is supported, as the top three novels read in high school all have similar sentiments.

Comparison:

To further test my hypothesis, I compared each of the three required books to another work by the same author to test whether the books selected for high school are representative of that author's sentiment.