By: Laurel Wind
Hypothesis 1: Many high school students do not enjoy required reading because the books all share the same sentiment, even though that sentiment is not necessarily representative of each author's other work.
Hypothesis 2: High school required reading novels have similar sentiment patterns throughout the novel, but different sentiment patterns from other novels by the same author.
For this analysis, I chose to analyze the top three most popular required reading books. According to GoodReads, those three books are "To Kill A Mockingbird" by Harper Lee, "The Great Gatsby" by F. Scott Fitzgerald, and "Romeo and Juliet" by William Shakespeare.
To test my hypothesis, I compared these three books to another popular work by each author: Harper Lee's most recent work, "Go Set A Watchman"; F. Scott Fitzgerald's "This Side of Paradise", written five years before "The Great Gatsby"; and Shakespeare's "Hamlet". According to Indy100, "Hamlet" is the fourth most popular Shakespeare play. I selected it for this analysis because the first three have similar popularity scores.
The six novels can be accessed from the following:
This analysis requires the following R packages: wordcloud, tm, ggplot2, stringr, syuzhet, and tidytext. The pipe (%>%) and the join and count functions used in the code below also require dplyr.
To begin this analysis, I created word clouds for each of the six novels to compare the writing schemes of the two books by each author.
To create the word clouds, I used the following code and repeated it for all six novels.
library(tm)
library(wordcloud)
# Read the text, one element per line (or use file.choose() to pick the file interactively)
romeo <- readLines("romeo.txt")
# Convert to UTF-8 and drop any lines that fail to convert
romeo.utf8 <- iconv(romeo, to = "UTF-8")
romeo.utf8 <- romeo.utf8[!is.na(romeo.utf8)]
# Build a corpus and clean the text
romeo.vec <- VectorSource(romeo.utf8)
romeo.corpus <- Corpus(romeo.vec)
romeo.analysis <- tm_map(romeo.corpus, removePunctuation)
# tolower is not a tm transformation, so it is wrapped in content_transformer()
romeo.analysis <- tm_map(romeo.analysis, content_transformer(tolower))
romeo.analysis <- tm_map(romeo.analysis, removeWords, stopwords("english"))
romeo.analysis <- tm_map(romeo.analysis, removeNumbers)
romeo.analysis <- tm_map(romeo.analysis, stripWhitespace)
# Draw the word cloud from the cleaned corpus
wordcloud(romeo.analysis, max.words=400, colors=c("salmon", "olivedrab3", "orchid3", "seagreen3", "orange3", "cyan3", "green3", "deepskyblue2", "slateblue1", "maroon2"))
The word cloud on the left is "To Kill A Mockingbird" and the word cloud on the right is "Go Set A Watchman". Based on the two word clouds, it is clear the two novels are about the same characters and use similar word choices and writing styles.
The word cloud on the left is "The Great Gatsby" and the word cloud on the right is "This Side of Paradise". Based on the two word clouds, it is clear the two novels are about different characters, but both use similar word choices and writing styles.
The word cloud on the left is "Romeo and Juliet" and the word cloud on the right is "Hamlet". Based on the two word clouds, it is clear the two plays are about different characters, but both use the standard Shakespearean writing style.
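The "similar word choices" comparison can also be checked numerically rather than by eye. The following is a minimal sketch, not part of the original analysis: the helpers `top_words` and `top_word_overlap` are hypothetical names introduced here, the toy sentences are made up for illustration, and the stop-word list is the one shipped with tidytext rather than the tm list used for the word clouds.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: the n most frequent non-stop-words in a text.
top_words <- function(lines, n = 100) {
  tibble(text = lines) %>%
    unnest_tokens(word, text) %>%            # one row per lowercase word
    anti_join(stop_words, by = "word") %>%   # drop common stop words
    count(word, sort = TRUE) %>%             # word frequencies, highest first
    head(n) %>%
    pull(word)
}

# Hypothetical helper: share of text A's top words that are also top words in text B.
top_word_overlap <- function(lines_a, lines_b, n = 100) {
  a <- top_words(lines_a, n)
  b <- top_words(lines_b, n)
  length(intersect(a, b)) / length(a)
}

# Toy input (made up, not taken from the novels) to show the call shape.
toy_a <- c("Scout and Atticus walked home.", "Atticus read to Scout.")
toy_b <- c("Atticus waited at home.", "Jean Louise came home.")
overlap <- top_word_overlap(toy_a, toy_b)
```

With the real texts, `top_word_overlap(mockingbird, watchman)` would be called on the two character vectors returned by `readLines()`. A value near 1 would mean the two word clouds are dominated by the same words.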
To begin this section of analysis, I compared the sentiments of the three most popular required reading novels in high school.
To analyze the sentiment of the novels, I used the following code and repeated it for each of the six novels.
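library(dplyr)
library(tidytext)
library(ggplot2)
# One row per line of the text, then one row per word
romeo_df <- tibble(romeo = romeo)
tidy_romeo <- romeo_df %>% unnest_tokens(word, romeo)
# Remove stop words and count how often each remaining word occurs
data(stop_words)
tidy_romeo <- tidy_romeo %>% anti_join(stop_words, by = "word")
tidy_romeo <- tidy_romeo %>% count(word, sort = TRUE)
# Join the NRC lexicon and total the word counts per sentiment (stored in column nn)
nrc <- get_sentiments("nrc")
tidy_romeo <- tidy_romeo %>% inner_join(nrc, by = "word") %>% count(sentiment, wt = n, name = "nn", sort = TRUE)
ggplot(data = tidy_romeo, aes(x = sentiment, y = nn)) + geom_bar(aes(fill = sentiment), stat = "identity") + theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Sentiment") + ylab("Count") + ggtitle("Sentiment for Romeo and Juliet")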
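Because the same steps are repeated for every novel, they could be wrapped in a function. The following is a minimal sketch under stated assumptions, not the code used to produce the graphs: `sentiment_counts` and `toy_lexicon` are hypothetical names introduced here, and the lexicon is passed in as an argument so that NRC (or any other table with `word` and `sentiment` columns) can be supplied. It counts matched words directly, which gives the same totals as counting words first and then summing per sentiment.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: tally lexicon sentiments for one text.
# `lines` is a character vector; `lexicon` has columns `word` and `sentiment`.
sentiment_counts <- function(lines, lexicon) {
  tibble(text = lines) %>%
    unnest_tokens(word, text) %>%            # one row per lowercase word
    anti_join(stop_words, by = "word") %>%   # drop common stop words
    inner_join(lexicon, by = "word") %>%     # keep only words in the lexicon
    count(sentiment, sort = TRUE)            # occurrences per sentiment
}

# Made-up three-word lexicon and two-line text, for illustration only.
toy_lexicon <- tibble(
  word = c("love", "death", "grief"),
  sentiment = c("joy", "sadness", "sadness")
)
toy_lines <- c("Love is here.", "Death and grief follow love and grief.")
toy_counts <- sentiment_counts(toy_lines, toy_lexicon)
```

With the real data, the call would look like `sentiment_counts(romeo, get_sentiments("nrc"))`, repeated once per novel or applied over a list of texts with `lapply()`.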
Based on the graphs, it is clear all three novels have very similar sentiments, as they are all slightly more negative than positive and all three have surprise as the lowest sentiment.
At this point in my analysis, the first part of my hypothesis is supported, as the top three novels read in high school all have similar sentiments.
To further test my hypothesis, I compared the three novels to novels written by the same author to test whether or not the selected novels for high school are representative of the sentiment of the authors.
According to the graphs, both novels written by Harper Lee have similar sentiments but vary in length. This could be the result of both novels being about the same characters and similar storylines.
According to the graphs, both novels written by F. Scott Fitzgerald have similar sentiments but vary in length. This could be the result of both novels being written around the same time, as they were written in 1920 and 1925.
Based on the graphs, both plays written by William Shakespeare have similar sentiments but also vary in length. The two plays having similar sentiments could be the result of both being classified as tragedies.
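Since the works vary in length, raw counts make the bars hard to compare directly. One option, sketched below and not part of the original analysis, is to convert each work's counts into shares of its own total. The numbers here are made up for illustration and are not the results from the graphs.

```r
library(dplyr)

# Hypothetical counts for two texts of different length (not the real results).
counts <- tibble(
  book = c("A", "A", "A", "B", "B", "B"),
  sentiment = rep(c("positive", "negative", "surprise"), 2),
  n = c(50, 60, 10, 100, 120, 20)
)

# Divide each count by its book's total so that every book sums to 1.
shares <- counts %>%
  group_by(book) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
```

In this toy example book B is twice as long as book A, yet both end up with identical shares, which is the kind of similarity the raw bar heights can hide.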
After comparing the sentiments of each of the six novels, I looked into the sentiment per line of the novel to see if required reading novels had similar sentiment patterns to other books by the same author.
To test my second hypothesis, I used the following code and repeated it for all six novels.
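# Number the lines, split them into words, drop stop words, and sum the AFINN scores per line
# Older tidytext releases name the AFINN column score; newer ones name it value
romeo_afinn <- romeo_df %>% mutate(linenumber = row_number()) %>%
  unnest_tokens(word, romeo) %>% anti_join(stop_words, by = "word") %>%
  inner_join(get_sentiments("afinn"), by = "word") %>% group_by(index = linenumber) %>%
  summarise(sentiment = sum(score)) %>% mutate(method = "afinn")
# Plot the sentiment score of each line against its position in the text
ggplot(data = romeo_afinn, aes(index, sentiment)) + geom_bar(aes(fill = sentiment), stat = "identity") +
  scale_y_continuous(limits = c(-15, 15)) + theme(legend.position = "none") +
  xlab("Line Number") + ylab("Sentiment Score") + ggtitle("Sentiment for Romeo and Juliet")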
When comparing these three required reading novels, it is clear they have similar sentiment trends as the novel progresses.
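Trends in works of different lengths are easier to compare once each is reduced to the same number of slices. The sketch below is an illustration under stated assumptions, not part of the original analysis: `binned_sentiment` is a hypothetical helper, the scores are made up, and the correlation between two binned trajectories is only one possible way to put a number on "similar trends".

```r
# Hypothetical helper: total the per-line scores within n_bins equal slices of a text.
# line_index  : line numbers that have a score (lines without scored words may be absent)
# sentiment   : the score for each of those lines
# total_lines : number of lines in the whole text
binned_sentiment <- function(line_index, sentiment, total_lines, n_bins = 20) {
  # Map each line number to a slice between 1 and n_bins
  bin <- pmin(n_bins, ceiling(line_index * n_bins / total_lines))
  sums <- tapply(sentiment, factor(bin, levels = seq_len(n_bins)), sum)
  sums[is.na(sums)] <- 0   # slices with no scored lines count as neutral
  as.numeric(sums)
}

# Made-up scores for a 10-line text and a 20-line text.
traj_a <- binned_sentiment(1:10, c(1, 2, 3, 4, 5, -1, -2, -3, -4, -5),
                           total_lines = 10, n_bins = 5)
traj_b <- binned_sentiment(c(2, 5, 9, 14, 19), c(2, 3, 1, -2, -4),
                           total_lines = 20, n_bins = 5)

# Correlation near 1 means the two texts rise and fall in the same places.
trend_similarity <- cor(traj_a, traj_b)
```

With the real data, the inputs would be the `index` and `sentiment` columns of a table such as `romeo_afinn`, together with `length(romeo)` as the total number of lines.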
Next, I compared the sentiment trends of the two works by each author.
Based on these two charts, it is clear that while they have similar sentiments overall, Go Set A Watchman has much more extreme sentiments throughout the novel, especially in terms of negative sentiment.
Based on these two charts, it is clear they also have similar sentiment trends. However, This Side of Paradise appears to have slightly more extreme sentiments towards the end of the novel.
Based on the two charts, it is clear this comparison differs from the previous two, as the more popular required reading has more extreme sentiments than the less popular one. However, this still supports my hypothesis, as the two plays differ in sentiment trends: Romeo and Juliet is significantly more positive in the beginning and significantly more negative in the end.
Hypothesis 1: Based on the graphs, the first half of my hypothesis was supported: the most popular required reading novels have similar sentiments. However, the second half of my hypothesis was not supported, since other novels written by the same authors have very similar sentiments and differ mainly in length. To further test this hypothesis, I would analyze the sentiment of less popular required readings to see if their sentiments differ from the more popular ones.
Hypothesis 2: Based on the graphs, it is clear the two works by each author vary greatly in their sentiment trends as the story progresses. The sentiment trends for Harper Lee and F. Scott Fitzgerald could lead to the conclusion that high school students do not enjoy required reading because the required books do not have extreme sentiments. William Shakespeare's plays discredit this claim, although this could be the result of Shakespeare's plays being written more than 300 years before Lee's and Fitzgerald's novels.