Seasonal Mood Changes as Seen Through Music

Ryan Walker

I'm sure we've all felt that our mood changes along with the seasons. As soon as spring hits, many of us can't wait to be outside in the sunshine, and tend to feel generally much happier than we did in the cold of winter.

Personally, even the songs I listen to depend on the season. During the fall and winter, I tend to listen to slower, more emotional songs while in the spring and summer I listen to more upbeat songs. I was curious to see if others do this as well.

Using the Spotify API and a number of CSV files downloaded from Spotify's weekly "Top 200" lists, I analyzed the audio features of the ten most popular songs from each month to see if there were any differences between songs that were popular during different seasons. For my analysis, I defined December, January, and February as winter; March, April, and May as spring; June, July, and August as summer; and September, October, and November as fall.

Hypothesis: The average valence, danceability, and energy of tracks will be much higher for songs that were popular during spring and summer than those popular in the fall and winter.

Methodology

For my analysis, I needed two sets of data. First, I needed data from Spotify's weekly Top 200 playlist to collect the most popular songs from each month, and then data from Spotify's API that would provide track audio features for each song such as danceability, energy, and tempo.

Since Spotify did not start providing public lists of their top 200 most-played songs by week until late 2016 and we are only a few months into 2019, I limited my data to songs on these lists from 2017 and 2018. I also limited my data to the United States lists rather than the global rankings. This is because I am centering my analysis on seasons, and Spotify is used in many countries in the southern hemisphere, where the seasons are the opposite of those in the U.S.

Spotify offers CSV files for daily and weekly lists, but not monthly. Because it would be extremely time-consuming to download 104 individual files (for every week of 2017 and 2018), I only downloaded the CSV files for the first week of each month. Additionally, I reduced these lists to only include the top ten songs from each chart to really emphasize the popularity of these songs at these particular times of the year.

For my second set of data, the audio features for each individual song, I used the Spotify API to download track details from the Spotify playlists "Top Tracks of 2017: USA" and "Top Tracks of 2018" since it was likely that the majority of the songs in my data frame would also appear in these playlists. For the songs that were in my data frame but not in these playlists, I used their individual URIs to download this same data, as you will see later in this report.

Data Cleaning

First, I modified every CSV file in Excel to eliminate the first row of the file, which read "Note that these figures are generated using a formula that protects against any artificial inflation of chart positions." I did this before importing the data frames to R because, if I had kept it, that line would be imported as the column names and push the actual column names (Position, Track.Name, Artist, etc.) down to the first row as values. Still within Excel, I created a new column that assigned the month to each value. For example, for my CSV file providing the top 200 songs from the week of December 1-7 in 2017, I assigned "December" to each value in the chart.
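As an alternative to editing each file in Excel, the note line can probably be skipped at import time. The sketch below is only an illustration, not the code used for this report: it writes a small stand-in file (its contents and the name `dec17_demo` are made up for the example), reads it with `read.csv(skip = 1)`, and then adds the month column in R.

```r
# Write a stand-in file that mimics the layout described above:
# one note line, then the real header row, then the data rows.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Note that these figures are generated using a formula that protects against any artificial inflation of chart positions.",
  "Position,Track Name,Artist,Streams,URL",
  "1,Song A,Artist A,1000,https://example.com/track/aaa111",
  "2,Song B,Artist B,900,https://example.com/track/bbb222"
), csv_path)

# skip = 1 drops the note line, so the second line is used as the header.
# read.csv turns "Track Name" into the syntactic column name Track.Name.
dec17_demo <- read.csv(csv_path, skip = 1, stringsAsFactors = FALSE)

# Add the month in R instead of in Excel
dec17_demo$Month <- "December"

print(names(dec17_demo))
```

With this approach the original downloads stay untouched, which makes the import easier to repeat.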

Then, it was time to import my data to R. I assigned each CSV file to a variable named for its corresponding month and year, and then used "rbind" to combine those data frames into one.

I then filtered this large data frame, "df", to only include the top ten songs from each month. After that, I created a new column that assigned a season to each value based on the month during which the song was in the top 10 of its list.

Note: For the sake of saving space in this section, I will only be including what I wrote to assign the CSVs from the first three months of 2017 to variables. The code is the same for each CSV file.

# Convert CSV files into tables
jan17 <- read.csv("Weekly CSVs/regional-us-weekly-2016-12-30--2017-01-06.csv")
feb17 <- read.csv("Weekly CSVs/regional-us-weekly-2017-01-27--2017-02-03.csv")
mar17 <- read.csv("Weekly CSVs/regional-us-weekly-2017-02-24--2017-03-03.csv")
...

# Combine all data into one data set, take top 10 from each, assign seasons to months
df <- rbind(jan17, jan18, feb17, feb18, mar17, mar18, apr17, apr18,
            may17, may18, june17, june18, july17, july18, aug17, aug18,
            sep17, sep18, oct17, oct18, nov17, nov18, dec17, dec18)

df <- df %>%
  filter(Position %in% 1:10)

df <- df %>%
  mutate(
    Season = case_when(
      Month %in% c("December", "January", "February") ~ "Winter",
      Month %in% c("March", "April", "May") ~ "Spring",
      Month %in% c("June", "July", "August") ~ "Summer",
      Month %in% c("September", "October", "November") ~ "Fall"))
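Since the import step is the same for all 24 files, it could also be written as a loop. The following is a sketch under assumptions, not the code used here: it builds two tiny stand-in files in a temporary folder, and the helper `read_week()` (a hypothetical name) derives the month from the end date in the file name rather than from a column added in Excel. It assumes the end date of each file falls in the month the chart is meant to represent.

```r
# Build two small stand-in files that follow the naming pattern shown above
demo_dir <- tempfile("weekly_csv_demo")
dir.create(demo_dir)
demo_files <- c("regional-us-weekly-2016-12-30--2017-01-06.csv",
                "regional-us-weekly-2017-01-27--2017-02-03.csv")
for (f in demo_files) {
  write.csv(data.frame(Position = 1:3,
                       Track.Name = paste("Song", 1:3),
                       Streams = c(300, 200, 100)),
            file.path(demo_dir, f), row.names = FALSE)
}

# Hypothetical helper: read one weekly file and label it with the month
# taken from the end date in its file name
read_week <- function(path) {
  week <- read.csv(path, stringsAsFactors = FALSE)
  end_date <- as.Date(sub(".*--([0-9]{4}-[0-9]{2}-[0-9]{2})\\.csv$", "\\1",
                          basename(path)))
  week$Month <- month.name[as.integer(format(end_date, "%m"))]
  week
}

# Read every CSV in the folder and stack the results into one data frame
paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
df_demo <- do.call(rbind, lapply(paths, read_week))

print(table(df_demo$Month))
```

Using `month.name` keeps the month labels in English regardless of the system locale.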

Now that I had my first data frame complete, it was time to move on to downloading the audio features for these tracks from Spotify's API. To do this, I first needed to create a Spotify Developer account in order to obtain a "Client ID" and a "Client Secret" so that I could access their Web API. For further explanation about this process, please refer to this documentation from GitHub user charlie86, who created spotifyr.

Now, I was able to access Spotify's API. Using the URIs for the two playlists I mentioned earlier, I created the variable "playlist", which would contain all of the track information for the songs in these playlists.

# Client ID, Secret, and access token
Sys.setenv(SPOTIFY_CLIENT_ID = 'xxxxxxxxxxxxxxxxxxxxx')
Sys.setenv(SPOTIFY_CLIENT_SECRET = 'xxxxxxxxxxxxxxxxxxxxx')

access_token <- get_spotify_access_token()

# Gather audio features for majority of songs in CSV files
playlist_username <- "spotify"
playlist_uris <- c("37i9dQZF1DX1HUbZS4LEyL", "37i9dQZF1DX7Axsg3uaDZb")

playlist <- get_playlist_audio_features(playlist_username, playlist_uris, authorization = access_token)

Now that I had the information for each track in these playlists, it was time for me to combine my two data frames, "df" and "playlist". The easiest way to do this was to merge them on the track-name column. Because the name of this column was capitalized differently in the two data frames, I first renamed the column in "playlist" to match its spelling in "df", then merged the two tables on that column and named the result "spotify". Rows without a match were dropped.

However, while many values were deleted, "spotify" had more values than "df" (255 vs 240). This is because some of the songs from "df" were featured in both the 2017 and 2018 playlists, and thus many duplicates were created. For example, "I Fall Apart" by Post Malone was among the top ten songs for five months (October, November, and December in 2017 and January and February in 2018). When I merged "df" with the Spotify playlist data, it created four total values per entry, so instead of "I Fall Apart" appearing five times in my list, it appeared 20 times. To get rid of these duplicates, I first selected only the columns I needed from the data frame, including track name, season, and the audio features, and then removed all duplicate values. This then reduced "spotify" to 168 values, meaning 72 values were missing.

Note: By removing duplicates in this manner, I ensured I would not affect the results of the analysis.

# Make "spotify" smaller to only include columns I want to use
spotify <- spotify %>%
  select(Position, Track.Name, Artist, Streams, Month, Season, danceability, energy,
         speechiness, valence, tempo)

# Remove duplicates
spotify <- spotify %>%
  unique()
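To illustrate why the merge produced extra rows, here is a small sketch in base R with made-up data (the tables `charts_demo` and `features_demo` and the lower-case column name `track.name` are hypothetical). A song that appears twice in the feature table is matched to each of its chart rows twice, and `unique()` collapses the repeats once only the needed columns remain.

```r
# Chart rows: "Song A" is in the top ten in two different months
charts_demo <- data.frame(
  Track.Name = c("Song A", "Song A", "Song B"),
  Month = c("October", "November", "October"),
  stringsAsFactors = FALSE)

# Feature rows: "Song A" appears in both playlists, so it is listed twice.
# The key column is spelled differently here (hypothetical name).
features_demo <- data.frame(
  track.name = c("Song A", "Song A", "Song B"),
  valence = c(0.3, 0.3, 0.7),
  stringsAsFactors = FALSE)

# Rename the key so both tables share the same column name
names(features_demo)[names(features_demo) == "track.name"] <- "Track.Name"

# Inner join: 2 chart rows x 2 feature rows for "Song A", plus 1 for "Song B"
merged_demo <- merge(charts_demo, features_demo, by = "Track.Name")

# Identical rows collapse, leaving one row per song and month
deduped_demo <- unique(merged_demo)

print(nrow(merged_demo))
print(nrow(deduped_demo))
```

Note that `unique()` only removes rows that are identical in every column, so it would not collapse two playlist entries whose audio features differ.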

As I mentioned earlier, I knew that the playlists I selected would not include all of the songs I had in my "df" table. Thus, I had to begin a very tedious process to obtain the track information for these missing songs.

First, I had to find which songs were in "df" that were not in the newly created "spotify" data frame, and create a new table with these values. I named this new data frame, which had the 72 missing values from "spotify", "missing". Then, I arranged the values by track name and selected each individual URI (the random arrangement of letters and numbers at the end of the URLs) and used one of the Spotify API's functions to collect the track audio features for each individual song. I put them in a new table in the same order as the arranged "missing" data table and named this new table (that had all of the audio features for the songs in "missing") "additional".

Since the function to gather track audio features does not keep the track name as a column, I had to create an extra column in both "missing" and "additional" by which I could merge the two together. To do this, I assigned every value in both tables a number from 1 to 72 in order of arrangement. Then, I merged the two tables together via this column, and named this new data frame "spotifyMissing".

Note: I reduced the code in my creation of the "additional" data frame to only include the first three URIs I entered since including all 72 would take up an unnecessary amount of space.

# Create a data frame with variables in df that are NOT in Spotify (lost values)
missing <- subset(df, !(Track.Name %in% spotify$Track.Name))

# Arrange in order of track name
missing <- missing %>%
  arrange(Track.Name)

# Create data frame with the track audio features for missing songs
additional <- get_track_audio_features(c("7zgqtptZvhf8GEmdsM2vp2", "75FDPwaULRdYDn4StFN2rT", "43ZyHQITOjhciSUUNPVRHc", ...))

# Make a new column with assigned numbers (to merge the two tables), then merge them
missing$order <- 1:nrow(missing)

additional$order <- 1:nrow(additional)

spotifyMissing <- inner_join(additional, missing, by = "order")
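Merging by row order works only as long as both tables stay in exactly the same order. A less fragile variant, sketched below with made-up data, joins on the track ID instead. It assumes that the chart table has a `URL` column ending in the track ID and that the audio-feature table carries a matching `id` column; both are assumptions to check against the real data, and all names ending in `_demo` are hypothetical.

```r
# Stand-in for "missing": chart rows with a URL that ends in the track ID
missing_demo <- data.frame(
  Track.Name = c("Song B", "Song A"),
  URL = c("https://open.spotify.com/track/bbb222",
          "https://open.spotify.com/track/aaa111"),
  stringsAsFactors = FALSE)

# Keep only the part of the URL after the last slash
missing_demo$id <- sub(".*/", "", missing_demo$URL)

# Stand-in for "additional": audio features keyed by track ID,
# deliberately listed in a different order than missing_demo
additional_demo <- data.frame(
  id = c("aaa111", "bbb222"),
  valence = c(0.2, 0.8),
  stringsAsFactors = FALSE)

# Join on the ID, so row order no longer matters
spotifyMissing_demo <- merge(missing_demo, additional_demo, by = "id")

print(spotifyMissing_demo[, c("Track.Name", "valence")])
```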

Then, I reduced "spotifyMissing" to only include the same columns that I limited "spotify" to, and then finally merged the two tables together to get my complete data set with all 240 values. I named the final data set "data".

# Select same columns from "spotifyMissing" that were selected from "spotify"
spotifyMissing <- spotifyMissing %>%
  select(Position, Track.Name, Artist, Streams, Month, Season, danceability, energy,
         speechiness, valence, tempo)

# Merge the two data sets together
data <- spotifyMissing %>%
  merge(spotify, all = TRUE)
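A quick sanity check after a merge like this one can confirm that no rows were lost or left without audio features. The sketch below uses small made-up tables (every name ending in `_demo` is hypothetical) and base R only; it is an illustration of the idea, not part of the original workflow.

```r
# Stand-ins for the original chart table and the two partial tables
df_demo <- data.frame(Track.Name = c("Song A", "Song B", "Song C"),
                      Month = "January", stringsAsFactors = FALSE)
spotify_demo <- data.frame(Track.Name = c("Song A", "Song B"),
                           Month = "January", valence = c(0.4, 0.6),
                           stringsAsFactors = FALSE)
spotifyMissing_demo <- data.frame(Track.Name = "Song C",
                                  Month = "January", valence = 0.5,
                                  stringsAsFactors = FALSE)

# Full outer merge on all shared columns, as in the step above
data_demo <- merge(spotifyMissing_demo, spotify_demo, all = TRUE)

# Songs present in the chart table but absent from the merged result
lost_demo <- setdiff(df_demo$Track.Name, data_demo$Track.Name)

# The merged table should have one row per chart row and no missing features
stopifnot(nrow(data_demo) == nrow(df_demo))
stopifnot(length(lost_demo) == 0)
stopifnot(!anyNA(data_demo$valence))
```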

To the right are screenshots of the seven main variables I created and used for this data cleanup process. In summary, "df" was the simple data set that included the top 10 songs from each month of 2017 and 2018 that was created using the 24 CSV files. This table was merged with "playlist" (which contained the track audio features of every song in Spotify's 2017 and 2018 playlists) to create "spotify". Then, I created "missing", which contains each of the values that were eliminated from "df" when I merged it with "playlist", and then merged that table with "additional" (the audio features for each track in "missing") to create "spotifyMissing". Then finally, I merged the "spotifyMissing" and "spotify" tables to create "data", my final data frame that I will use to run my analyses.

Analysis

• Valence, Danceability and Energy •

Now that I had all of my data sorted and cleaned, I could finally begin my analysis. Remember, I want to compare the valence, danceability, and energy of songs that were popular during different seasons. I believe that these will all have greater values for songs popular during spring and summer than those popular in the fall and winter.

Here are the definitions of these three measurements from Spotify's documentation page "Get Audio Features for a Track":

Valence is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.

The first thing I needed to do was calculate the average scores for each audio feature by season. To do this, I created a new table that contained these calculations and the corresponding seasons, which I called "audioFeatures". I used this table to create my bar graphs.

I also wanted to see the distribution of these values over the seasons using graphs called joyplots. For this, I just used my regular "data" table since I didn't need any calculated averages, but rather the individual scores.

# Create new columns that calculate each audio feature by season
audioFeatures <- data %>%
  group_by(Season) %>%
  summarise(
    avgValence = mean(valence),
    avgDance = mean(danceability),
    avgEnergy = mean(energy))

# Bar graphs for audio features by season
ggplot(audioFeatures, aes(Season, avgValence, fill = Season)) +
  geom_col(show.legend = FALSE) +
  xlab("Season") +
  ylab("Valence") +
  scale_fill_manual(values = palette) +
  ggtitle("Average Valence by Season")

ggplot(audioFeatures, aes(Season, avgDance, fill = Season)) +
  geom_col(show.legend = FALSE) +
  xlab("Season") +
  ylab("Danceability") +
  scale_fill_manual(values = palette) +
  ggtitle("Average Danceability by Season")

ggplot(audioFeatures, aes(Season, avgEnergy, fill = Season)) +
  geom_col(show.legend = FALSE) +
  xlab("Season") +
  ylab("Energy") +
  scale_fill_manual(values = palette) +
  ggtitle("Average Energy by Season")

# Joyplots for audio features by season
ggplot(data, aes(valence, Season, fill = Season)) +
  geom_joy(show.legend = FALSE) +
  theme_joy() +
  xlab("Valence") +
  scale_fill_manual(values = palette) +
  ggtitle("Joyplot of Valence by Season")

ggplot(data, aes(danceability, Season, fill = Season)) +
  geom_joy(show.legend = FALSE) +
  theme_joy() +
  xlab("Danceability") +
  scale_fill_manual(values = palette) +
  ggtitle("Joyplot of Danceability by Season")

ggplot(data, aes(energy, Season, fill = Season)) +
  geom_joy(show.legend = FALSE) +
  theme_joy() +
  xlab("Energy") +
  scale_fill_manual(values = palette) +
  ggtitle("Joyplot of Energy by Season")

The resulting graphs can be seen below. The bar graphs and joy plots for each audio feature can be seen side by side.

As shown in the first bar graph, the average valence was highest for songs that were popular during the summer months, followed by spring, fall, and then winter. The distribution chart for valence is interesting because the majority of songs for each season had a valence around 0.4, where the "mounds" peak. What seems to set spring and summer apart, though, and what boosted their averages, were their fairly high peaks around the 0.8 mark as well.

However, according to the graphs for danceability, the results were nearly the opposite. Fall and winter had the highest average scores for danceability while spring and summer were lower. The distribution chart for fall leaned heavily towards scores around 0.9 while the majority of spring and summer songs congregated more around 0.7 and 0.8. Winter seemed to have the most variety in scores for this value.

The differences in average energy by season were not as stark as the results for valence and danceability. Spring, summer, and fall all had similar averages at just above 0.6 while winter fell just below that. However, the joyplots for energy had the most diverse shapes in their distributions. Fall, which had the highest average energy, had two major peaks at around 0.5 and 0.75 while spring, the second highest average, was more concentrated around 0.6.

While valence provided results that supported my hypothesis, I was surprised to see the averages for danceability and energy. Danceability, especially, was the opposite of what I had predicted, and energy had little variance in the averages.

After seeing these charts, I was curious to see if the seasons had any effect on other aspects of music. Because I have had previous experience analyzing lyric sentiments from my analysis on Miley Cyrus's albums over time, I believed this would be an interesting extra step to take on for this project.

• Lyric Sentiments •

For my sentiment analysis, I further reduced my data to only include the top five most-streamed songs from each season since gathering the lyrics for all of the songs in my "data" table would take a great deal of time. This would provide me with 20 songs to analyze. To retrieve this data for each of the songs, I used the Genius API by GitHub user JosiahParry. As I did in my previous report, I will be using the AFINN lexicon, which rates words on a scale from -5 to +5 based on their positivity or negativity.

After I collected the lyrics from the songs and assigned them to individual variables, I sent them through a pipe operator to separate the lines into individual words, filter out "stop words," and then to count them. I also assigned the words a season based on when the song ranked among the top five most-streamed songs. For example, "God's Plan" by Drake was amongst the top five songs in both the winter and the spring, so I assigned these lyrics to both seasons. Then, I merged all of these data sets into one large table which I named "songs".

Note: For the sake of saving space in this section, I will only be including what I wrote to assign the first three songs to variables, as well as the corresponding code for filtering their data. The code is the same for each song with the exception of the season assigned to its words.

# Gather top 5 songs from each season to have 20 total songs
topSongs <- data %>%
  group_by(Season) %>%
  top_n(5, Streams) %>%
  arrange(desc(Streams))

# Gather lyrics for each song in topSongs list
genius_lyrics(artist = "Drake", song = "God's Plan", info = "title") -> godsplan
genius_lyrics(artist = "Drake", song = "Nonstop", info = "title") -> nonstop
genius_lyrics(artist = "Drake", song = "In My Feelings", info = "title") -> feelings
...

# Filter out "stop words" and assign song to season
godsplan %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  mutate(season = "Winter") -> godCountWinter

godsplan %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  mutate(season = "Spring") -> godCountSpring

nonstop %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  mutate(season = "Summer") -> nonstopCount

feelings %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  mutate(season = "Summer") -> feelingsCount
...

# Merge all filtered songs into one table
songs <- rbind(godCountWinter, godCountSpring, nonstopCount, feelingsCount,
               dontmatterCount, emotionlessCount, imupsetCount, thankuCount,
               luckyCount, monaCount, dontcryCount, goingbadCount, boujeeCount,
               psychoCount, betternowCount, rockstarCountFall,
               rockstarCountWinter, calloutCount, humbleCount)
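Because the same tokenize, filter, and count steps are repeated for every song, they could be wrapped in one function. The sketch below is a simplified stand-in written in base R rather than the tidytext pipeline used above: `count_words()` is a hypothetical helper, the two lyric lines are invented, and the three-word stop list is only a placeholder for a real stop-word lexicon.

```r
# Hypothetical helper: split lines into lower-case words, drop stop words,
# count the rest, and tag every row with a season
count_words <- function(lines, season, stop_list) {
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[nzchar(words) & !(words %in% stop_list)]
  counts <- as.data.frame(table(word = words), responseName = "n",
                          stringsAsFactors = FALSE)
  counts <- counts[order(-counts$n, counts$word), ]
  counts$season <- season
  rownames(counts) <- NULL
  counts
}

# Invented lyric lines and a placeholder stop list
lyrics_demo <- c("I love the sun", "the sun is good")
stop_demo <- c("i", "the", "is")

# A song that charted in two seasons gets one table per season
songs_demo <- do.call(rbind, lapply(c("Winter", "Spring"), function(s) {
  count_words(lyrics_demo, s, stop_demo)
}))

print(songs_demo)
```

A song that ranked in more than one season, like "God's Plan" above, is simply passed through the helper once per season.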

Once that was complete, I could join the words with the AFINN lexicon to assign them their scores. I then created a new column named "sentiment" with the word's calculated score (sentiment score multiplied by its number of occurrences by season). Then, I used this table to create bar graphs displaying the top 50 scoring words from each season. The resulting graphs can be seen below the section of code.

# Join lyrics with AFINN lexicon and create new column with calculated score
songsSentiment <- songs %>%
  group_by(season) %>%
  inner_join(get_sentiments("afinn")) %>%
  mutate(sentiment = n * score) %>%
  arrange(desc(abs(sentiment)))

# Create bar graphs with top 50 scoring words from each season
winterPlot <- songsSentiment %>%
  filter(season == "Winter") %>%
  head(50) %>%
  mutate(word = reorder(word, sentiment))

ggplot(winterPlot, aes(word, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words") +
  ylab("Sentiment") +
  ggtitle("Winter Songs Sentiment") +
  coord_flip()

summerPlot <- songsSentiment %>%
  filter(season == "Summer") %>%
  head(50) %>%
  mutate(word = reorder(word, sentiment))

ggplot(summerPlot, aes(word, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words") +
  ylab("Sentiment") +
  ggtitle("Summer Songs Sentiment") +
  coord_flip()

springPlot <- songsSentiment %>%
  filter(season == "Spring") %>%
  head(50) %>%
  mutate(word = reorder(word, sentiment))

ggplot(springPlot, aes(word, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words") +
  ylab("Sentiment") +
  ggtitle("Spring Songs Sentiment") +
  coord_flip()

fallPlot <- songsSentiment %>%
  filter(season == "Fall") %>%
  head(50) %>%
  mutate(word = reorder(word, sentiment))

ggplot(fallPlot, aes(word, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words") +
  ylab("Sentiment") +
  ggtitle("Fall Songs Sentiment") +
  coord_flip()
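The weighting used in this section, a word's count multiplied by its lexicon score, can be illustrated on a toy table. The sketch below uses base R and a three-word stand-in lexicon whose scores are chosen for illustration on the same -5 to +5 scale; the tables `word_counts_demo` and `toy_lexicon` are hypothetical and do not come from the actual lyrics.

```r
# Toy word counts for one season
word_counts_demo <- data.frame(
  word = c("love", "bad", "happy"),
  n = c(3, 2, 1),
  season = "Summer",
  stringsAsFactors = FALSE)

# Stand-in lexicon with illustrative scores
toy_lexicon <- data.frame(
  word = c("love", "bad", "happy"),
  score = c(3, -3, 3),
  stringsAsFactors = FALSE)

# Keep only words found in the lexicon and weight each score by its count
scored_demo <- merge(word_counts_demo, toy_lexicon, by = "word")
scored_demo$sentiment <- scored_demo$n * scored_demo$score

# Strongest words first, whether positive or negative
scored_demo <- scored_demo[order(-abs(scored_demo$sentiment)), ]

# Season total: 3 * 3 + 2 * (-3) + 1 * 3
total_demo <- sum(scored_demo$sentiment)

print(scored_demo)
print(total_demo)
```

Because the score is multiplied by the count, a single frequently repeated word can dominate a season's chart and total.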

As shown in the sentiment graphs, every season seems to have lyrics that are generally more negative than positive. This is likely because every song analyzed here, with the exception of two, was rap. (The two non-rap songs were "Call Out My Name" by The Weeknd and "thank u, next" by Ariana Grande.) Hip-hop and rap songs typically have more aggressive lyrics and use swear words more often, which explains the incredibly high scores of these words.

Note: Something I found very interesting was that all of the top five most-streamed songs for the summer months were by Drake, four of which were the most-streamed songs for the month of July.

Now that I had gotten a little preview of the overall sentiment of these top-ranked songs, I wanted to see the total sentiment scores of each season compared to each other. To do this, I first created two new columns. The first column, "total", calculated the total summed sentiment score of each season and assigned that number to each value by its season. The second column, "order", assigned a number 1 through 4 to the seasons so that I could arrange the seasons as I wanted in the graph. (If I had not done this, R would have automatically ordered the seasons alphabetically instead of how they are ordered throughout the year.)

Then, I selected only the columns "season", "order", and "total" and filtered out all duplicates so that I would only have the 12 values I needed. I then reordered "season" by "order" and created my graph.

# Create two columns: "total" (total summed sentiment score by season) and "order"
sentimentsTotal <- songsSentiment %>%
  mutate(total = sum(sentiment),
         order = case_when(
           season == "Spring" ~ 1,
           season == "Summer" ~ 2,
           season == "Fall" ~ 3,
           season == "Winter" ~ 4)) %>%
  select(season, order, total) %>%
  unique()

# Reorder "season" by "order"
sentimentsTotal$season <- reorder(sentimentsTotal$season, sentimentsTotal$order)

# Create graph with total sentiment scores by season
ggplot(sentimentsTotal, aes(season, total, fill = season)) +
  geom_col(show.legend = FALSE) +
  labs(x = "", y = "Score") +
  ggtitle("Total Sentiment by Season") +
  scale_fill_manual(values = palette) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
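Another way to control the order of the seasons, sketched here as an alternative rather than the approach used above, is to turn the column into a factor with explicit levels. The table `season_totals_demo` is a hypothetical stand-in filled with the totals reported in this section.

```r
# Season totals as reported in this section, in alphabetical order
season_totals_demo <- data.frame(
  season = c("Fall", "Spring", "Summer", "Winter"),
  total = c(-318, -256, -163, -271),
  stringsAsFactors = FALSE)

# Explicit levels fix the order used for sorting and for plot axes
season_totals_demo$season <- factor(
  season_totals_demo$season,
  levels = c("Spring", "Summer", "Fall", "Winter"))

# Sorting by the factor now follows the calendar instead of the alphabet
ordered_demo <- season_totals_demo[order(season_totals_demo$season), ]

print(ordered_demo)
```

This removes the need for a separate numeric "order" column, since plotting functions generally follow the factor levels.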

To the right is the resulting graph. Unsurprisingly, since we had previously seen the general sentiment of each season, they all had total sentiment scores that were negative. Summer was the most positive with a score of only -163 while fall had the most negative score of -318. This means that the lyrics for the most popular songs during the summer, while they were overall negative, were the most positive in comparison to the other seasons. Songs that were popular in the fall, however, had the most negative lyrics overall. While spring and winter had close scores, spring was slightly more positive with a score of -256 while winter had a score of -271.

Conclusion

I was incredibly surprised by my results. My hypothesis ("The average valence, danceability, and energy of tracks will be much higher for songs that were popular during spring and summer than those popular in the fall and winter") was only right with regard to valence. However, while the valence of songs that were popular in the spring and summer was higher, it wasn't high in general. The valence scale had a maximum of 1.0, but summer only reached a score of around 0.45.

For danceability and energy, my hypothesis was wrong. In fact, danceability was the opposite of what I predicted, as fall and winter had higher average scores. The differences in energy were less stark: fall had the highest score, followed by spring, summer, and then winter. It was very interesting to see the distribution of these scores in comparison to their averages.

In terms of sentiment, I was initially surprised that they were all so negative. I thought that the songs in the summer, especially, would have a generally positive sentiment since the songs at least sound more upbeat and empowering. What I failed to realize, though, is that a vast majority of the songs on these top 200 lists are hip-hop and rap, which typically contain many swear words and other negatively scoring words. While all four seasons had overall negative total sentiment scores, it was interesting to see how stark the differences were in their scores.

While the songs in spring and summer might not have the greatest danceability or energy, they generally have a happier sound overall as well as more positive lyrics.