Popular Baby Names: 1900s vs 2000s

Ryan Walker

I chose to examine the relationship between the top five most popular names for both genders during the first decade of the 1900s and the first decade of the 2000s. I wanted to find out if there was a change in which names had the highest proportion in relation to others during those years and if any would remain in the top five spots.

The population of the United States has grown dramatically over the past century. According to the United States Census Bureau, the population of the U.S. in 1900 was just over 76.8 million people. In 2000, the population had grown to well over 280 million, an increase of roughly 266%. This extraordinary growth, combined with the country’s increasing diversity throughout the century, has certainly come with the introduction of new names for the people who populate it.

Using the babynames data set, which includes all names registered by the U.S. Social Security Administration from 1880 to 2017, and the R programming language, I analyzed the most popular names for boys and girls between 1900 and 1909 and compared them with the most popular names between 2000 and 2009, to see if and how the popularity of names has changed over the past 100+ years.

Hypothesis: The most popular boys' and girls' names between 1900 and 1909 have significantly decreased in proportion over the century and are not among the most popular boys' and girls' names between 2000 and 2009.

The Boys

To begin my analysis, I first reduced the original dataset to only include records from the year 1900 to the year 2009. I assigned this modified dataset the variable name “centuryNames”.

babynames %>% filter(year >= 1900, year < 2010) -> centuryNames
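The snippets in this report rely on the pipe and on functions such as filter() and ggplot(), so the relevant packages need to be loaded first. Below is a minimal setup sketch, assuming the dplyr, ggplot2, and babynames packages are installed. The helper name keep_years() is hypothetical and is only one way to express the year filter described above.

```r
library(dplyr)    # filter(), distinct(), and the %>% pipe
library(ggplot2)  # ggplot(), geom_line(), geom_col()

# Hypothetical helper: keep rows whose year lies in [start_year, end_year].
# Assumes the data frame has a numeric `year` column, as babynames does.
keep_years <- function(data, start_year, end_year) {
  data %>% filter(year >= start_year, year <= end_year)
}

# Usage with the real data (requires the babynames package):
# library(babynames)
# centuryNames <- keep_years(babynames, 1900, 2009)
```

Written this way, the same helper can also produce the decade subsets used later, for example keep_years(centuryNames, 1900, 1909).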

Then, I wanted to find out the top five most popular boy names in the United States from 1900 to 1909. I did this by first filtering my data to only include males, and then further restricting the data to only include the years I wanted to observe. I named this variable “boys1900s”.

centuryNames %>% filter(sex == "M", year >= 1900, year < 1910) -> boys1900s

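When viewing this reduced portion of the data, I ran into a problem. No matter how I filtered the data when viewing the chart, it would still show me too much information, such as a separate row for each name in each individual year. All I wanted to see were the most popular boy names of the entire decade. To achieve this, I used the “distinct” function to keep one row per distinct value of the column I cared about, in this case name. Note that distinct() keeps names in the order in which they first appear, and the babynames rows appear to be sorted by year and then by number of births, so the resulting order mainly reflects the ranking in the first year of the decade.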

distinct(boys1900s, name) -> popBoys1900s

popBoys1900s

centuryNames %>% filter(sex == "M", year >= 2000, year < 2010) -> boys2000s

distinct(boys2000s, name) -> popBoys2000s

popBoys2000s

This gave me a simplified list of boys' names in the United States between 1900 and 1909, listed from most popular to least. I then printed the first ten rows of this table to observe the top five most popular names in that decade, which are John, William, James, George, and Charles.

I repeated this same process and assigned my refined data new variable names, but this time for the most popular boys names between the years 2000 and 2009. The top five for that decade are Jacob, Michael, Matthew, Joshua, and Christopher.
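Since distinct() only removes repeated names and does not add up any counts, another way to rank names over a whole decade is to sum the yearly counts for each name. The following is a minimal sketch of that alternative, not the method used in this report. The function name top_names() and its arguments are hypothetical, and it assumes the columns year, sex, name, and n found in the babynames data set. Its ranking may differ from the lists above.

```r
library(dplyr)

# Hypothetical helper: rank names by total births over a span of years.
top_names <- function(data, which_sex, start_year, end_year, k = 5) {
  data %>%
    filter(sex == which_sex, year >= start_year, year <= end_year) %>%
    group_by(name) %>%
    summarise(total = sum(n), .groups = "drop") %>%  # births per name over the span
    slice_max(total, n = k, with_ties = FALSE) %>%   # keep the k largest totals
    arrange(desc(total))                             # most popular first
}

# Usage (assumes centuryNames from earlier in the report):
# top_names(centuryNames, "M", 1900, 1909)
# top_names(centuryNames, "M", 2000, 2009)
```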

Now that I had the names I wanted, it was time to get to the fun part: displaying the data visually in the form of graphs. Before I could start generating graphs, I first had to create a new variable that consisted of only the names I wanted to see. To do this, I filtered my centuryNames dataset to only include the top five names from the 1900s and assigned it the name “top5_1900s”. Then, I used the ggplot function to declare what data frame I wanted to use and specify the plot “aesthetics,” or what categories of my data set I wanted to assign my x and y axes and how I wanted to distinguish the data I was measuring. I then specified that I wanted my data displayed as a line graph by using “+ geom_line()”.

centuryNames %>% filter(sex == "M", name %in% c("John", "William", "James", "George", "Charles")) -> top5_1900s

ggplot(top5_1900s, aes(x = year, y = prop, color = name)) + geom_line()

To the side, you can see the resulting graph. It shows the proportion of male births given each of these five names in every year from 1900 to 2009, a span of 110 years. It displays a significant decrease in each name’s popularity in relation to others over the years.
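Because the same filter-then-plot pattern is repeated for every group of names in this report, it can be wrapped in a small function. This is only a sketch: the name plot_name_trends() and the axis labels are assumptions, and the data frame is assumed to have the columns year, sex, name, and prop.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: line graph of yearly proportion for a set of names,
# restricted to one sex so that each name yields a single line.
plot_name_trends <- function(data, wanted, which_sex) {
  data %>%
    filter(sex == which_sex, name %in% wanted) %>%
    ggplot(aes(x = year, y = prop, color = name)) +
    geom_line() +
    labs(x = "Year", y = "Proportion of births", color = "Name")
}

# Usage (assumes centuryNames from earlier in the report):
# plot_name_trends(centuryNames, c("John", "William", "James", "George", "Charles"), "M")
```

Filtering on sex inside the helper avoids a zigzag line when a name has records for both sexes in the same year.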

I then repeated this process with the top five names from the 2000s. For this graph in particular, I created a second variable that restricted the data to only the years after 1980 to get a closer look at the relationships.

centuryNames %>% filter(sex == "M", name %in% c("Jacob", "Michael", "Matthew", "Joshua", "Christopher")) -> top5_2000s

ggplot(top5_2000s, aes(x = year, y = prop, color = name)) + geom_line()

centuryNames %>% filter(sex == "M", year > 1980, name %in% c("Jacob", "Michael", "Matthew", "Joshua", "Christopher")) -> closeLook

ggplot(closeLook, aes(x = year, y = prop, color = name)) + geom_line()

As you can see from the resulting graph to the right, there is a noticeable difference in the trend of these names. While they all had an incredible increase in proportion in the latter half of our timeline, they all started to diminish in popularity almost as fast as they had grown. This may be due to the increasing social progressiveness of the country in the 1970s and following decades, when people felt less pressure to meet the strict societal expectations that loomed over the country in the 50s and 60s.

Something interesting to take note of in this particular graph is the relative status of the names Michael and Jacob. While our initial table (popBoys2000s) told us that Jacob was the number one most popular name of the 2000s, our graph shows that Michael is an enormously more popular name than Jacob in every decade leading up to the turn of the century. In fact, Jacob is a relatively unpopular name compared to the others until it surges in popularity in the 90s. It is only in the late 1990s that Jacob finally surpasses Michael, and even then only by hundredths of a percent. This can be seen more clearly on the graph to the right, which is a reduced version of the previous graph, limited to years after 1980.
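The point at which one name overtakes another can also be read from the data rather than from the graph. Here is a minimal sketch, assuming the columns year, sex, name, and prop. The function name first_year_ahead() is hypothetical, and no result is quoted here because it depends on the data.

```r
library(dplyr)

# Hypothetical helper: first year in which name_a has a higher proportion
# than name_b among babies of the given sex. Returns NA if that never happens.
first_year_ahead <- function(data, which_sex, name_a, name_b) {
  a <- data %>%
    filter(sex == which_sex, name == name_a) %>%
    select(year, prop_a = prop)
  b <- data %>%
    filter(sex == which_sex, name == name_b) %>%
    select(year, prop_b = prop)
  # Compare the two names only in years where both are present.
  ahead <- inner_join(a, b, by = "year") %>% filter(prop_a > prop_b)
  if (nrow(ahead) == 0) NA_real_ else min(ahead$year)
}

# Usage (assumes centuryNames from earlier in the report):
# first_year_ahead(centuryNames, "M", "Jacob", "Michael")
```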

I then wanted to compare the difference in the name popularity trends of these decades to one another in order to get a better idea of how the “old” popular names stood against the “new” popular names.

NOTE: To prevent the graph from being too overcrowded with data, I further reduced the data to include only the top two names from both decades.

centuryNames %>% filter(sex == "M", name %in% c("John", "William", "Jacob", "Michael")) -> top2_00s

ggplot(top2_00s, aes(x = year, y = prop, color = name)) + geom_line()

As we can see in the graph, while John and William are decreasing in popularity, Jacob and Michael are increasing and eventually overtake both names. This was expected, since neither John nor William appeared among the top five names of the 2000s. However, all four names show a distinct decrease in proportion coming into the 21st century, most likely because people have gradually become more creative when it comes to naming their children, either by creating new names entirely or by coming up with unique spelling variations of the same name.

The Girls

As I mentioned early on in this report, I also wanted to analyze the popularity of girl names throughout the century. I repeated each step and process exactly the same as I did with the boys, but created new variable names for my filtered data and included the pertinent names in my findings.

centuryNames %>% filter(sex == "F", year >= 1900, year < 1910) -> girls1900s

distinct(girls1900s, name) -> popGirls1900s

popGirls1900s

centuryNames %>% filter(sex == "F", year >= 2000, year < 2010) -> girls2000s

distinct(girls2000s, name) -> popGirls2000s

popGirls2000s

The most popular girl names from 1900 to 1909 are Mary, Helen, Anna, Margaret, and Ruth…

…and the most popular girl names from 2000 to 2009 are Emily, Hannah, Madison, Ashley, and Sarah.

centuryNames %>% filter(sex == "F", name %in% c("Mary", "Helen", "Anna", "Margaret", "Ruth")) -> f_top5_1900s

ggplot(f_top5_1900s, aes(x = year, y = prop, color = name)) + geom_line()

centuryNames %>% filter(sex == "F", name %in% c("Emily", "Hannah", "Madison", "Ashley", "Sarah")) -> f_top5_2000s

ggplot(f_top5_2000s, aes(x = year, y = prop, color = name)) + geom_line()

Using the code above, I was able to create the following two graphs. As we can see from the graph to the right, which displays the trends of the top five most popular names for girls between 1900 and 1909 throughout the century, Mary is by far the most popular name of the five, overwhelmingly outperforming the others for almost the entire 20th century until Anna just barely surpassed it in the late 90s.

As for this graph, which portrays the trends of the top five girl names of the 2000s, we see an interesting pattern. None of the five names had come close to reaching a proportion of one percent until the late 1970s and early 80s. For some reason not immediately evident, all five of the names experience an incredible spike in popularity at almost the same time, specifically during the last 25 years of our timeline. Though I do not know the reason behind this apparent phenomenon, it is certainly a topic I believe should be further researched.
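Observations such as when a name peaked, or when it first reached a given proportion, can be checked numerically as well as visually. The sketch below is one possible way to do so. The function name summarise_peaks() and the default threshold of 0.01 (one percent) are assumptions chosen for illustration, and the columns year, sex, name, and prop are assumed.

```r
library(dplyr)

# Hypothetical helper: for each name, report the highest yearly proportion,
# the year it occurred, and the first year the proportion reached `threshold`.
summarise_peaks <- function(data, which_sex, wanted, threshold = 0.01) {
  data %>%
    filter(sex == which_sex, name %in% wanted) %>%
    group_by(name) %>%
    summarise(
      peak_prop = max(prop),
      peak_year = year[which.max(prop)],
      # NA when the name never reaches the threshold
      first_year_above = if (any(prop >= threshold)) {
        min(year[prop >= threshold])
      } else {
        NA_real_
      },
      .groups = "drop"
    )
}

# Usage (assumes centuryNames from earlier in the report):
# summarise_peaks(centuryNames, "F", c("Emily", "Hannah", "Madison", "Ashley", "Sarah"))
```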

I then combined the top two girl names from the 1900s (Mary and Helen) and the top two girl names from the 2000s (Emily and Hannah) into one graph to see how they compared to one another when measured on the same scale.

centuryNames %>% filter(sex == "F", name %in% c("Mary", "Helen", "Emily", "Hannah")) -> f_top2_00s

ggplot(f_top2_00s, aes(x = year, y = prop, color = name)) + geom_line()

Again we can see that, viewed across the whole timeline, Mary is overwhelmingly more popular than any of the others, especially the “new” popular names, which don't even reach a proportion of 2% while Mary reaches almost as high as 6% in the 1920s.

Boys & Girls

Finally, I wanted to compare the top name from each of our data frames to each other. These names are John and Michael for males, and Mary and Emily for females. I wanted to see this data displayed in not only a line graph, as has been used throughout this report, but also in bar graphs that account for the total proportion and number of babies who were born with each name throughout the century.

NOTE: I used Michael as the "top" name of the 2000s for males instead of Jacob since it was an undeniably far more popular name overall than Jacob, as well as to make the final comparison more objective and fair than strictly technical. In addition, I only accounted for these names as they applied to their genders, meaning I kept the data for Mary and Emily reduced to only females with those names and the data for John and Michael reduced to only males with those names.

To graph all of these names together, but make sure they included only the data for their respective genders, I had to create new variables for the top name of each gender from both time periods (four names in all), as well as a data frame that combines them all together.

mary <- centuryNames %>% filter(name == "Mary", sex == "F")

emily <- centuryNames %>% filter(name == "Emily", sex == "F")

john <- centuryNames %>% filter(name == "John", sex == "M")

michael <- centuryNames %>% filter(name == "Michael", sex == "M")

total <- rbind(mary, emily, john, michael)

I then used this new data frame to create a line graph that showed the progression of these names' proportions from 1900 to 2009.

ggplot(total, aes(x = year, y = prop, color = name)) + geom_line()

ggplot(total, aes(x = name, y = prop, color = name)) +
geom_col() +
ggtitle("Top 4 Babynames — Total Proportion (1900 - 2009)")

As you can see from the line graph below, the names John and Mary were incredibly popular at the beginning of the 20th century (John accounting for just over 6% of male babies born in 1900 and Mary for over 5% of female babies), and then gradually declined over the following hundred years. Meanwhile, Michael and Emily were relatively unpopular names at the beginning of the century, until the 1930s and 40s, when there was a massive increase in the proportion of male babies named Michael, and the 70s, when the proportion of female babies named Emily started to grow slowly. Something especially interesting to take note of, however, is where each of these "top" popular names fell in the last year of our observation, 2009. Each of the four made up less than 1% of the babies of its sex born that year, with Michael being the most popular at just under 1% and Mary being the least popular at around 0.2%.

I then used the same data frame to create two bar graphs, one that measured the total proportions of these names from 1900 to 2009 and one that measured the total number of these names for the same time period.

ggplot(total, aes(x = name, y = n, color = name)) +
geom_col() +
ggtitle("Top 4 Babynames — Total Number (1900 - 2009)")

In the bar graph below (left), we can see the total proportion of each name throughout our timeline. John's bar is the tallest, at nearly 4, followed by Mary's at well over 3. Michael was also extremely popular, while Emily trailed far behind each one with a total of less than 0.5. Note that geom_col() stacks one segment per year, so these heights are sums of yearly proportions across the whole timeline rather than percentages of all babies born in it; they are best read as a relative ranking. Interestingly, on the other bar graph (right), which measures the total number of babies born with those names from 1900 to 2009 (in millions), we see some different results. Emily still falls far behind the other three, with less than a million girls born with that name between 1900 and 2009, while John remains on top, with a total of nearly 5 million boys named John between those same years. We can also see that the total number of boys named Michael exceeded the total number of girls named Mary over the time period, with Michael surpassing 4 million and Mary reaching around 3.9 million.
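Because geom_col() stacks every yearly row for a name, a bar built from prop adds up yearly proportions rather than giving one share for the whole period. A single overall share can be estimated by dividing a name's total count by the total births of that sex. The sketch below is one way to do this, under the assumption that prop equals n divided by that year's total for the sex, so that n / prop recovers the yearly total. The function name overall_share() is hypothetical.

```r
library(dplyr)

# Hypothetical helper: share of all births of one sex, over every year in
# `data`, that went to each of the `wanted` names.
overall_share <- function(data, which_sex, wanted) {
  by_sex <- data %>% filter(sex == which_sex)

  # Assumption: prop = n / (total births of that sex in that year),
  # so every row of a given year yields the same estimate n / prop.
  yearly_totals <- by_sex %>%
    group_by(year) %>%
    summarise(births = median(n / prop), .groups = "drop")
  all_births <- sum(yearly_totals$births)

  by_sex %>%
    filter(name %in% wanted) %>%
    group_by(name) %>%
    summarise(total_n = sum(n), .groups = "drop") %>%
    mutate(share = total_n / all_births)
}

# Usage (assumes centuryNames from earlier in the report):
# overall_share(centuryNames, "M", c("John", "Michael"))
# overall_share(centuryNames, "F", c("Mary", "Emily"))
```

This weights each year by how many babies were born in it, which a plain sum or average of yearly proportions does not.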

Conclusion

Overall, it appears that my hypothesis was correct. None of the top five most popular boys' and girls' names between 1900 and 1909 appeared among the top five most popular boys' and girls' names between 2000 and 2009. However, while I expected the "older" names to decrease in proportion (and thus popularity) throughout the time period, I did not expect the "newer" names to decrease in proportion as well. I believe this is due to the increasing number of "new" names that parents are giving their children, including unique spellings of traditional names, such as "Ashleigh" instead of "Ashley," or "Rylee" instead of "Riley." It was incredibly interesting to see just how drastically the popularity of the names I analyzed changed over our timeline, and I believe further research could uncover why certain names boomed in popularity and why others took a dive.
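One starting point for that further research would be to count how many different names appear each year and how large a share the single most popular name holds. The sketch below shows one way to compute both, assuming the columns year, sex, name, and prop. The function name names_per_year() is hypothetical. Keep in mind that the underlying Social Security data only lists names given to at least five babies of a sex in a year, so the counts understate the true variety.

```r
library(dplyr)

# Hypothetical helper: number of distinct names and the share of the most
# popular name, for each year and sex.
names_per_year <- function(data) {
  data %>%
    group_by(year, sex) %>%
    summarise(
      distinct_names = n_distinct(name),
      top_share = max(prop),
      .groups = "drop"
    )
}

# Usage (assumes centuryNames from earlier in the report):
# names_per_year(centuryNames) %>% filter(year %in% c(1900, 2009))
```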