Living in the Bay Area for the past several years opened my eyes to the idea of women waiting longer to have children. The prevailing reasoning I heard was that millennials (and Generation X-ers) want to achieve more early in their careers and build a more solid economic foundation before starting a family. Makes sense. In fact, a 2016 New York Times article cites a sociologist as saying that the Great Recession may have been one reason women were waiting longer to have children around 2009. Other reasons may include better contraception and better medical understanding of, and care for, mothers giving birth at older ages.

In this post we’ll take a look at some data to see if women are actually waiting longer to have children, and if there are any differences in the trends based on birth location (at the State level).

Data

The main data source for this post is the Centers for Disease Control and Prevention (CDC) natality public use data for 1995-2016, which can be accessed using the WONDER Online Database. This dataset contains information on birth location (down to the county level), characteristics of the mother (e.g., age, marital status, and education level), characteristics of the birth (e.g., weight of the baby and delivery method), as well as information on maternal risk factors (e.g., diabetes and eclampsia). For this post, we are only concerned with the birth location (at the state level) as well as the mother’s age at the birth. It is worth noting that while the 2007-2016 data contain the mother’s age down to a single year for all births, the data from 1995-2006 use bins to characterize the mother’s age at birth: less than 15 years, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, and 50 years or older. Therefore, for consistency, in this analysis we will use these mother age bins for all years from 1995-2016.
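As a small illustration of the binning described above, here is a minimal sketch in base R of how single-year ages could be collapsed into the same age bins. The ages are made-up values for illustration (not CDC data), and the variable names are hypothetical.

```r
# Hypothetical single-year ages of mother (illustrative values, not CDC data)
single_year_age <- c(14, 15, 19, 20, 24, 29, 34, 35, 44, 49, 50, 53)

# Bin edges; right = FALSE makes each interval closed on the left, e.g. [15, 20)
bin_breaks <- c(-Inf, 15, 20, 25, 30, 35, 40, 45, 50, Inf)
bin_labels <- c("<15", "15-19", "20-24", "25-29", "30-34",
                "35-39", "40-44", "45-49", ">=50")

# Assign each age to its bin
age_bin <- cut(single_year_age, breaks = bin_breaks,
               labels = bin_labels, right = FALSE)

# Number of hypothetical births falling in each bin
bin_counts <- table(age_bin)
print(bin_counts)
```

Using `right = FALSE` matters here: with the default (`right = TRUE`), a 20-year-old mother would land in the 15-19 bin instead of 20-24.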

In addition to the CDC’s natality data, I also used information from the U.S. Census Bureau regarding the land area and 2016 population (projected from the 2010 census) for each U.S. state (plus Washington, D.C.). I am curious to see if a state’s population density (average number of people per square mile of land area) correlates at all with the state’s average age of mothers at birth.

Results

To read in the data, analyze the data, and create visualizations for this post I used the tidyverse library in R.

As a first step, I wanted to see how the average age of mothers at birth had changed over time for the entire U.S. To do this, I calculated a weighted-average birth age for each year: the number of births in each age bin is multiplied by the middle age of that bin (e.g., 17 for the 15-19 bin), those products are summed across bins, and the sum is divided by the year’s total births. This approach essentially assumes that births are distributed symmetrically about the middle age of each bin, which is not strictly true (and not ideal, but single-year ages are not available for the earlier years); however, for a general look at how the average age may be changing over time, this approach is fine. For this part of the analysis I ignored the age bins of less than 15 years old and 50 years or older, because very few births occur in those bins (combined, they represented less than 0.32% of total U.S. births in 1995, and their share has been declining ever since) and it is difficult to assign an average age to them.
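To make the weighted-average calculation concrete, here is a minimal base R sketch for a single year. The birth counts are invented for illustration (they are not CDC figures), and the variable names are hypothetical.

```r
# Hypothetical birth counts per age bin for one year (not CDC data)
births <- c("15-19" = 500, "20-24" = 1000, "25-29" = 1100,
            "30-34" = 900, "35-39" = 400, "40-44" = 90, "45-49" = 10)

# Middle age of each bin: lower edge + 2, e.g. 17 for the 15-19 bin
lower_edge <- as.integer(substr(names(births), 1, 2))
midpoint <- lower_edge + 2

# Weighted average: sum(midpoint * births) / sum(births)
weighted_age <- sum(midpoint * births) / sum(births)

# The built-in helper gives the same value
stopifnot(isTRUE(all.equal(weighted_age, weighted.mean(midpoint, births))))
print(weighted_age)
```

With these made-up counts the result is 108050 / 4000 = 27.0125 years. Shifting births from the younger bins to the older bins raises the value, which is the effect the time series is meant to capture.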

Time series of weighted-average mother age at birth for the U.S. for 1995-2016.

As you can see from the plot above, the national weighted-average mother age at birth has been rising since 1995. In fact, from 2009 to 2016 it rose rapidly, by about 0.5 years every three calendar years. Interesting.

But since we have the data, let’s now take a look at how the weighted-average mother age at birth has been changing since 1995 for each state. Using the same assumptions as for the national plot above, the state-level results are provided below. Note that the states are sorted in descending order by population density (persons per square mile). As you can see, there is not a single state in the U.S. for which the weighted-average mother age at birth has gone down since 1995. For many states the average age rose into the early 2000s, remained relatively steady through the mid-2000s, and began increasing again around 2009.

Time series of weighted-average mother age at birth by state for 1995-2016.  All states show an increase in the weighted-average mother age at birth over that time period.

Now, one simple way to determine whether state population density correlates with the weighted-average age at birth is to compare each state’s population density with its 2016 weighted-average birth age. I would expect to see a higher weighted-average birth age for states that have more big cities and are generally more urban (i.e., having higher population densities). The scatterplot below contains a data point for each state (except Washington, D.C., which has an extremely high population density of greater than 11,000 persons per square mile), along with a best-fit linear regression. As you can see, while the data do not fall perfectly on this regression line, there does appear to be a positive correlation between population density and weighted-average mother age at birth (for 2016).

Weighted-average mother age at birth vs. population density for each State (excluding Washington, D.C.) in 2016.
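The 2016 comparison described above can be quantified with a correlation coefficient and a simple linear fit. The following is a minimal sketch in base R using made-up values for five hypothetical states; it is not the data behind the plot, and the column names simply mirror those used in the appendix.

```r
# Hypothetical state-level values (illustrative only, not the post's data)
states <- data.frame(
  State = c("A", "B", "C", "D", "E"),
  Population_Density = c(10, 60, 150, 400, 900),  # persons per square mile
  WeightedAge = c(27.4, 27.9, 28.3, 28.6, 29.8)   # years, for 2016
)

# Pearson correlation between density and weighted-average age
r <- cor(states$Population_Density, states$WeightedAge)

# Best-fit line: WeightedAge = intercept + slope * Population_Density
fit <- lm(WeightedAge ~ Population_Density, data = states)
slope <- unname(coef(fit)["Population_Density"])

print(r)
print(slope)
```

For a simple one-predictor regression like this, the squared correlation equals the R-squared reported by `summary(fit)`, so either number describes how tightly the points follow the line.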

Another way to look at the relationship between state population density and weighted-average age at birth is to fit a linear regression line to each state’s time series from the state-level plots above, and then compare the slope of that line with the state’s population density. A higher regression slope indicates a faster increase in the state’s weighted-average mother age at birth over time. I would expect to see higher regression slopes for states that have more big cities and are generally more urban (i.e., having higher population densities). After performing this analysis and using R’s built-in correlation function, we obtain a correlation between the states’ population densities and linear regression slopes of only 0.09. The scatterplot below of the two metrics also shows the lack of correlation. However, the weaker correlation here (compared with simply looking at the 2016 data for each state) can be attributed, at least in part, to the use of simple linear regressions. For most states the average mother age at birth appears to have increased most rapidly only in recent years, which suggests that a single straight line is not an ideal fit. Given that, the 2016-only comparison above may be more telling.

Population density vs. annual linear increase in weighted-average mother age at birth for each state (excluding Washington, D.C.).
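The slope-based comparison described above can be sketched as follows: fit a separate linear regression per state, keep each slope, and correlate the slopes with population density. This base R example uses invented numbers for three hypothetical states, so the resulting correlation has no connection to the 0.09 reported in the post.

```r
# Hypothetical yearly weighted-average ages for three states (not CDC data)
yearly <- data.frame(
  State = rep(c("A", "B", "C"), each = 4),
  Year = rep(c(1995, 2002, 2009, 2016), times = 3),
  WeightedAge = c(26.0, 26.5, 26.6, 27.8,
                  27.0, 27.4, 27.5, 28.9,
                  26.5, 27.1, 27.2, 28.0)
)

# Hypothetical population densities (persons per square mile)
density <- c(A = 50, B = 800, C = 200)

# Fit WeightedAge ~ Year separately for each state and keep the Year slope
slopes <- sapply(split(yearly, yearly$State), function(d) {
  unname(coef(lm(WeightedAge ~ Year, data = d))["Year"])
})

# Correlation between density and per-state slope (years of age per calendar year)
slope_density_cor <- cor(density[names(slopes)], slopes)

print(slopes)
print(slope_density_cor)
```

Indexing `density` by `names(slopes)` keeps the two vectors aligned by state before they are passed to `cor()`.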

Conclusions

Well, based on this analysis it is plain to see that the average age of mothers at birth rose over the 1995-2016 time period at the national level as well as in every state. In fact, from 2009 to 2015 the national average age of mothers at birth increased by a full year, and the trend appears to continue through 2016. Additionally, my suspicion of higher mother age at birth in more urban areas appears to have some merit (though other factors that have not been analyzed here are obviously still at play).

Appendix: Data and R Code

A zip file of the data used in this analysis can be downloaded here. Here also is the actual citation for the CDC natality data:

United States Department of Health and Human Services (US DHHS), Centers for Disease Control and Prevention (CDC), National Center for Health Statistics (NCHS), Division of Vital Statistics, Natality public-use data 1995-2016, on CDC WONDER Online Database, February 2018. Accessed at http://wonder.cdc.gov/natality.html on Jul 21, 2018

library(tidyverse)
library(ggthemes)

# pattern is a regular expression, so anchor it to file names starting with "Natality"
files <- list.files(pattern = "^Natality")

read_tsv_filename <- function(filename){
  print(filename)
  ret <- read_tsv(filename,
                  col_types="-c-c-i") # Only read in necessary columns
  colnames(ret) <- c("State","Age of Mother","Births")
  ret$State <- as.factor(ret$State) # Set State column as factor
  ret$`Age of Mother` <- as.factor(ret$`Age of Mother`) # Set Age column as factor
  ret$Year <- as.integer(substring(filename,11,14)) # Generate Year column from file name
  ret %>% filter(Births >= 0) # Return rows for which a number of births is provided
}

# Read in data and append all to single data frame
data <- files %>% map(read_tsv_filename) %>% reduce(rbind)
# Change the names of the "Age of Mother" factors
data <- data %>% 
  mutate(Age = ifelse(
    `Age of Mother` == "Under 15 years", "<15",
    ifelse(`Age of Mother` == "15-19 years", "15-19",
           ifelse(`Age of Mother` == "20-24 years", "20-24",
                  ifelse(`Age of Mother` == "25-29 years", "25-29",
                         ifelse(`Age of Mother` == "30-34 years", "30-34",
                                ifelse(`Age of Mother` == "35-39 years", "35-39",
                                       ifelse(`Age of Mother` == "40-44 years", "40-44",
                                              ifelse(`Age of Mother` == "45-49 years", "45-49",
                                                     ifelse(`Age of Mother` == "50 years and over", ">=50","")
  ))))))))) %>% 
  select(-c(`Age of Mother`))
data$Age <- as.factor(data$Age)
data$Age <- factor(data$Age, 
                   levels = c("<15", "15-19", "20-24", "25-29",
                              "30-34", "35-39", "40-44", "45-49",
                              ">=50")) 

# Merge in July 2016 population estimate (from 2010 census) and
# land area (in square miles) to estimate population density for
# each state
population <- read.csv('state_population_july2016.csv')
landarea <- read.csv('state_land_area_sqmiles.csv')

population_density_df <- population %>%
  left_join(landarea, by = "State") %>% 
  mutate(Population_Density = Population_2016 / Land_Area,
         Density_Rank = rank(-Population_Density))

# Calculate percent of total annual births for each year
# that are attributed to mothers < 15 years old or at least
# 50 years old
data %>% 
  group_by(Year, Age) %>%
  summarise(BirthSum = sum(Births)) %>%
  left_join(data %>%
              group_by(Year) %>%
              summarise(AnnualBirths = sum(Births)),
            by = "Year") %>%
  mutate(BirthPercentage = BirthSum / AnnualBirths * 100) %>%
  filter(Age == "<15" | Age == ">=50") %>%
  group_by(Year) %>%
  summarise(SumBirthPercentage = sum(BirthPercentage)) %>% 
  ggplot(aes(x = Year, y = SumBirthPercentage)) +
  geom_line(color="black", size = 2) +
  scale_y_continuous(breaks = c(0, 0.1, 0.2, 0.3)) +
  labs(title = "Percentage of Births attributed to Mothers < 15 Years Old or >= 50 Years Old from 1995-2016 for the U.S.",
       caption = "Source: CDC National Center for Health Statistics",
       y = "Percentage of National Annual Births") +
  theme_bw()

# National weighted-average mother age over time
# Ignore < 15 years old and >= 50 years old
data %>%
  filter(Age != "<15" & Age != ">=50") %>% 
  mutate(AgeMiddle = as.integer(substring(Age,1,2)) + 2,
         AgeBirths = AgeMiddle * Births) %>% 
  group_by(Year) %>%
  summarise(WeightedAge = sum(AgeBirths)/sum(Births)) %>%
  ggplot(aes(x = Year, y = WeightedAge)) +
  geom_line(color="black", size=2) +
  scale_y_continuous(breaks = c(27, 28, 29)) +
  labs(title = "Weighted Age of Mother from 1995-2016 for the U.S.",
       subtitle = "Age is weighted based on number of births across U.S. for mothers between 15 and 49 years old (inclusive)",
       caption = "Source: CDC National Center for Health Statistics",
       y = "Weighted Age of Mother (years)") +
  theme_bw()

# Weighted-average mother age over time by state
# Ignore < 15 years old and >= 50 years old
data %>%
  filter(Age != "<15" & Age != ">=50") %>% 
  mutate(AgeMiddle = as.integer(substring(Age,1,2)) + 2,
         AgeBirths = AgeMiddle * Births) %>% 
  group_by(Year, State) %>%
  summarise(WeightedAge = sum(AgeBirths)/sum(Births)) %>%
  ggplot(aes(x = Year, y = WeightedAge)) +
  geom_line(color="black", size=2) +
  facet_wrap(~State) +
  labs(title = "Weighted Age of Mother from 1995-2016 for each State",
       subtitle = "Age is weighted based on number of births by state for mothers between 15 and 49 years old (inclusive)",
       caption = "Source: CDC National Center for Health Statistics",
       y = "Weighted Age of Mother (years)") +
  scale_x_continuous(breaks=c(1995, 2010)) +
  theme_bw()

# Same plot, states arranged by population density
# Ignore < 15 years old and >= 50 years old
data %>%
  filter(Age != "<15" & Age != ">=50") %>% 
  mutate(AgeMiddle = as.integer(substring(Age,1,2)) + 2,
         AgeBirths = AgeMiddle * Births) %>% 
  group_by(Year, State) %>%
  summarise(WeightedAge = sum(AgeBirths)/sum(Births)) %>%
  left_join(population_density_df, by="State") %>% 
  mutate(State = reorder(State, Density_Rank)) %>% 
  ggplot(aes(x = Year, y = WeightedAge)) +
  geom_line(color="black", size=2) +
  facet_wrap(~State) +
  labs(title = "Weighted Age of Mother from 1995-2016 for each State",
       subtitle = "States are ordered by descending population density and age is weighted based on number of births by state for mothers between 15 and 49 years old (inclusive)",
       caption = "Source: CDC National Center for Health Statistics and U.S. Census Bureau",
       y = "Weighted Age of Mother (years)") +
  scale_x_continuous(breaks=c(1995, 2010)) +
  theme_bw()

# Correlation scatter plot for 2016
data %>%
  filter(Age != "<15" & Age != ">=50" & State != "District of Columbia" & Year == 2016) %>% 
  mutate(AgeMiddle = as.integer(substring(Age,1,2)) + 2,
         AgeBirths = AgeMiddle * Births) %>% 
  group_by(Year, State) %>%
  summarise(WeightedAge = sum(AgeBirths)/sum(Births)) %>%
  left_join(population_density_df, by="State") %>%
  ggplot(aes(x = Population_Density, y = WeightedAge)) +
  geom_point() +
  geom_smooth(method='lm') +
  labs(title = "Weighted-Average Mother Age at Birth vs. Population Density for each State in 2016",
       subtitle = "District of Columbia, with a population density in excess of 11,000, has been omitted from this plot",
       caption = "Source: CDC National Center for Health Statistics and U.S. Census Bureau",
       x = "State Population Density (persons per square mile)",
       y = "Weighted Age of Mother at Birth (years)") +
  scale_y_continuous(limits=c(26, 31)) +
  theme_bw()
  
# Regression correlation plot
data %>%
  filter(Age != "<15" & Age != ">=50") %>% 
  mutate(AgeMiddle = as.integer(substring(Age, 1, 2)) + 2,
         AgeBirths = AgeMiddle * Births) %>% 
  group_by(Year, State) %>%
  summarise(WeightedAge = sum(AgeBirths) / sum(Births)) %>%
  group_by(State) %>% 
  summarise(Slope=lm(WeightedAge~Year)$coefficients[2]) %>% 
  left_join(population_density_df, by = "State") %>%
  ggplot(aes(x = Slope, y = Population_Density)) +
  geom_point() +
  labs(title = "Population Density vs. Increase in Weighted-Average Mother Age at Birth for Each State",
       subtitle = "District of Columbia, with a population density in excess of 11,000, has been omitted from this plot",
       caption = "Source: CDC National Center for Health Statistics and U.S. Census Bureau",
       x = "Annual Increase in Weighted-Average Mother Age at Birth (slope of linear regression)",
       y = "State Population Density") +
  # The y-axis limits drop District of Columbia (density > 11,000) from the plot
  scale_y_continuous(limits = c(0, 1500)) +
  theme_bw()