Space, Above and Beyond

Tidy Tuesday Astronaut data.

David Fox
08-01-2020

This week’s Tidy Tuesday data is about astronauts.

First I’ll load the data and examine it with the eminently useful {skimr}.

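# Packages used throughout this post
library(tidyverse); library(skimr); library(janitor)
library(gt); library(ggbeeswarm); library(plotly)

tuesdata <- tidytuesdayR::tt_load('2020-07-14')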

    Downloading file 1 of 1: `astronauts.csv`
astronauts = tuesdata$astronauts

skim(astronauts)
Data summary
Name astronauts
Number of rows 1277
Number of columns 24
_______________________
Column type frequency:
character 11
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 9 34 0 564 0
original_name 5 1 2 34 0 560 0
sex 0 1 4 6 0 2 0
nationality 0 1 3 24 0 40 0
military_civilian 0 1 8 8 0 2 0
selection 1 1 2 60 0 229 0
occupation 0 1 3 23 0 12 0
mission_title 1 1 1 27 0 361 0
ascend_shuttle 1 1 4 16 0 436 0
in_orbit 0 1 3 17 0 289 0
descend_shuttle 1 1 4 17 0 432 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 639.00 368.78 1.00 320.00 639 958.00 1277.00 ▇▇▇▇▇
number 0 1 274.23 148.19 1.00 153.00 278 390.00 565.00 ▅▆▇▆▅
nationwide_number 0 1 128.75 97.26 1.00 47.00 110 204.00 433.00 ▇▅▅▂▁
year_of_birth 0 1 1951.68 11.44 1921.00 1944.00 1952 1959.00 1983.00 ▂▃▇▅▁
year_of_selection 0 1 1985.59 12.22 1959.00 1978.00 1987 1995.00 2018.00 ▃▅▇▅▁
mission_number 0 1 1.99 1.15 1.00 1.00 2 3.00 7.00 ▇▂▁▁▁
total_number_of_missions 0 1 2.98 1.40 1.00 2.00 3 4.00 7.00 ▇▅▅▂▁
year_of_mission 0 1 1994.60 12.58 1961.00 1986.00 1995 2003.00 2019.00 ▂▃▇▇▅
hours_mission 0 1 1050.88 1714.79 0.00 190.03 261 382.00 10505.00 ▇▁▁▁▁
total_hrs_sum 0 1 2968.34 4214.72 0.61 482.00 932 4264.00 21083.52 ▇▂▁▁▁
field21 0 1 0.63 1.17 0.00 0.00 0 1.00 7.00 ▇▁▁▁▁
eva_hrs_mission 0 1 3.66 7.29 0.00 0.00 0 4.72 89.13 ▇▁▁▁▁
total_eva_hrs 0 1 10.76 16.05 0.00 0.00 0 19.52 78.80 ▇▂▁▁▁
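
A note on reading the summary: complete_rate is the share of non-missing values, so a column with only a handful of missing values can still display as 1 after rounding (original_name has 5 missing values but a complete_rate of 1). Below is a minimal base R sketch that reproduces the two statistics by hand. The vector toy_names is invented for illustration and is not the real column.

```r
# Toy stand-in for a character column; the values are invented for illustration.
toy_names <- c("Gagarin", NA, "Titov", "Glenn", NA, "Carpenter", "Nikolayev", "Popovich")

n_missing     <- sum(is.na(toy_names))    # count of NA entries
complete_rate <- mean(!is.na(toy_names))  # share of non-missing entries

n_missing
complete_rate

# Same arithmetic for original_name in the summary above: 5 missing out of 1277 rows.
round(1 - 5 / 1277, 2)
```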

Roles in spaceflight

Occupation, which seems to mean mission role, looks like an interesting variable to explore. The first order of business is to eliminate any capitalization inconsistencies, which seem to plague all data. There is a lot of discussion about the pros and cons of SQL, but one advantage of a database is the ability to define rules (constraints) that keep inconsistent data from ever getting into the data stream.

astronauts %>%
  mutate(occupation = str_to_lower(occupation)) %>%   # always seem to be capitalization issues
  tabyl(occupation) %>%
  adorn_pct_formatting() %>% 
  gt()
occupation n percent
commander 315 24.7%
flight engineer 196 15.3%
msp 498 39.0%
other (journalist) 1 0.1%
other (space tourist) 8 0.6%
pilot 197 15.4%
psp 59 4.6%
space tourist 2 0.2%
spaceflight participant 1 0.1%
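
The database-style rule mentioned above can be approximated in R with an explicit check before analysis. This is only a sketch: allowed_occupations and check_occupation() are hypothetical names introduced here, and the allowed list is simply the nine lower-cased values from the table above. Unlike a strict database constraint, it normalizes case first and then rejects anything it does not recognize.

```r
# Hypothetical whitelist: the nine lower-cased values seen in the table above.
allowed_occupations <- c(
  "commander", "flight engineer", "msp", "other (journalist)",
  "other (space tourist)", "pilot", "psp", "space tourist",
  "spaceflight participant"
)

# Hypothetical helper: stop with an error if any value falls outside the whitelist.
check_occupation <- function(x, allowed = allowed_occupations) {
  bad <- setdiff(unique(tolower(x)), allowed)
  if (length(bad) > 0) {
    stop("Unexpected occupation value(s): ", paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

check_occupation(c("Pilot", "pilot", "MSP"))         # passes after lower-casing
try(check_occupation(c("pilot", "cosmonaut chef")))  # reports the unexpected value
```

Running a check like this right after loading would surface a new or misspelled category immediately instead of letting it show up as a stray row in a table.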

I’m not sure what some of these roles are, and a little recoding is in order to deal with redundant categories. Some external research indicates that ‘msp’ probably stands for mission specialist and ‘psp’ for payload specialist. This Wikipedia page has more details. I’ll recode these and lump all the tourists together (sorry, lone space journalist). One thing to watch out for with case_when: I was using single quotes and could not figure out why things were not working until I switched to double quotes. R treats straight single and double quotes identically, so the real culprit was probably something else, such as curly quotes pasted in from another source.

astronauts_clean <- astronauts %>% 
  select(name, sex, occupation, year_of_mission, hours_mission) %>% 
  mutate(occupation = str_to_lower(occupation)) %>%
  mutate(role = case_when(
    occupation == "msp" ~ "mission specialist",
    occupation == "psp" ~ "payload specialist",
    TRUE ~ occupation),
    role_factor = fct_lump_n(role, 5))
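
Here is a small self-contained check of the recode logic, run on invented values rather than the real data. It also confirms that straight single and double quotes produce identical strings in R, so either works inside case_when(). The toy tibble and the choice of keeping 2 levels are assumptions made for illustration.

```r
library(dplyr)
library(forcats)

# Straight single and double quotes create identical strings in R.
identical('msp', "msp")

# Invented occupation values, already lower-cased.
toy <- tibble(occupation = c("msp", "msp", "msp", "psp", "psp", "pilot", "space tourist"))

toy_clean <- toy %>%
  mutate(
    role = case_when(
      occupation == 'msp' ~ "mission specialist",  # single quotes work too
      occupation == "psp" ~ "payload specialist",
      TRUE ~ occupation                            # everything else passes through unchanged
    ),
    # Keep the 2 most frequent roles; the rest collapse into "Other".
    role_factor = fct_lump_n(role, 2)
  )

count(toy_clean, role_factor)
```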

Gender balance in space flight roles.

Some of the early pioneers in spaceflight were women. I’m curious about how the current gender roles look.

astronauts_clean %>%
  tabyl(role_factor, sex) %>%
  adorn_totals("row") %>% 
  adorn_percentages("row") %>% 
  adorn_pct_formatting() %>%
  adorn_ns() %>%
  adorn_title("combined", row_name = "Role", col_name = "gender") %>% 
  gt()
Role/gender female male
commander 1.0% (3) 99.0% (312)
flight engineer 10.2% (20) 89.8% (176)
mission specialist 21.3% (106) 78.7% (392)
payload specialist 8.5% (5) 91.5% (54)
pilot 3.6% (7) 96.4% (190)
Other 16.7% (2) 83.3% (10)
Total 11.2% (143) 88.8% (1134)
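
The table above counts one row per astronaut per mission, so people who flew several times are counted several times. Below is a sketch of how a per-person tally could be obtained instead. The records are invented, and the names and counts are not from the dataset.

```r
library(dplyr)

# Invented mission records: one row per person per flight.
toy_flights <- tibble(
  name = c("A. Example", "A. Example", "A. Example", "B. Sample", "C. Test", "C. Test"),
  sex  = c("male", "male", "male", "female", "male", "male")
)

# Per-record share, analogous to the table above.
toy_flights %>% count(sex) %>% mutate(share = n / sum(n))

# Per-person share: collapse to one row per astronaut first.
toy_people <- toy_flights %>% distinct(name, sex)
toy_people %>% count(sex) %>% mutate(share = n / sum(n))
```

The two shares differ whenever one group tends to fly more missions per person, which is worth keeping in mind when reading the percentages.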

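Not great - only 11% of the astronaut-mission records belong to women (the table counts one row per astronaut per mission, not unique people). Women account for only 1% of mission commander records and fewer than 4% of pilot records. How does this look over time?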

set.seed(567)

p1 = astronauts_clean %>% 
  ggplot(aes(x= year_of_mission, y= sex, color = role_factor,
             text = paste(year_of_mission, "<br>",name,"<br>",role_factor))) +
  geom_quasirandom(alpha = .4,
                   size = 2,
                   groupOnX = FALSE) +
  labs(title = 'Gender roles in space flight',
       y = NULL, 
       x = NULL,
       color = "Role",
       caption = 'TidyTuesday 2020-07-14') +
  scale_color_viridis_d(option ="plasma") +
  theme(legend.position = "bottom")

ggplotly(p1, tooltip = "text")

Valentina Tereshkova was the first woman to travel to space - she did a solo flight in 1963 that lasted for 3 days. If you read her Wikipedia page, she is clearly a badass. There was not another woman pilot in space until 1997 - Eileen Collins. It seems like the role of pilot has disappeared since the Space Shuttle was retired, but there are still male mission commanders, while the last woman commander was in 2007.

Waxing and waning of space flight.

The previous chart shows several strong pulses in the number of missions.

missions = astronauts %>% 
  distinct(mission_title, .keep_all = TRUE) %>% 
  select(year_of_mission, mission_title, ascend_shuttle, in_orbit, hours_mission)

p2 = missions %>% group_by(year_of_mission) %>% 
  summarize(mission_count = n()) %>% 
  ggplot(aes(x= year_of_mission,
             y = mission_count,
             color = mission_count,
             size = mission_count,
             text = paste(year_of_mission, ":", mission_count))) +
    geom_point(alpha = .5) + 
    scale_color_viridis_c(option = "plasma") +
  # str_wrap() collapses embedded newlines, so set the wrap width explicitly instead
  labs(title = str_wrap("After a heyday in the 90s, the number of flights has declined to the level of the 1960s.", width = 50),
       x= NULL,
       y = 'missions') +
  theme(legend.position = "none")

ggplotly(p2, tooltip = "text")
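
The de-duplication step above relies on distinct(mission_title, .keep_all = TRUE), which keeps the first row it encounters for each mission title along with all other columns. Here is a toy illustration with invented missions, using count() as a shorthand for the group_by() plus summarize() pair.

```r
library(dplyr)

# Invented crew records: two missions in 1990, one in 1991.
toy_crew <- tibble(
  mission_title   = c("M-1", "M-1", "M-2", "M-3", "M-3", "M-3"),
  year_of_mission = c(1990, 1990, 1990, 1991, 1991, 1991),
  name            = c("a", "b", "c", "d", "e", "f")
)

toy_missions <- toy_crew %>%
  distinct(mission_title, .keep_all = TRUE)  # first row per mission, all columns kept

toy_counts <- toy_missions %>%
  count(year_of_mission, name = "mission_count")  # one row per year

toy_counts
```

One caveat with this approach: rows with a missing mission_title are treated as a single mission, and any title reused in different years would be counted once.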

Mission hours have increased dramatically with space stations.

p3 = ggplot(missions, aes(x= year_of_mission,
                          y = hours_mission,
                          size = hours_mission,
                          color = hours_mission,
                          text = paste(in_orbit, "<br>", hours_mission))) +
  geom_jitter(alpha = .5) +
  #geom_smooth(se = FALSE) +
  scale_color_viridis_c(option = "plasma") +
  labs(title = "Although the number of missions has declined, \ntime in space has increased dramatically.",
       x= "",
       y = "mission hours") + 
  theme(legend.position = 'none') +
  scale_y_sqrt()


ggplotly(p3, tooltip = "text")
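
A word on scale_y_sqrt(): it places each point at the square root of its value, which compresses the very long space station missions while still giving zero-hour records a finite position. Here is a quick base R look at the effect, using the hours_mission quantiles reported in the summary at the top of the post.

```r
# hours_mission p0, p25, p50, p75, p100 from the skim() output above.
hours <- c(0, 190.03, 261, 382, 10505)

sqrt(hours)                   # positions on a square-root axis
max(hours) / 261              # raw ratio of the longest mission to the median
sqrt(max(hours)) / sqrt(261)  # the same ratio after the transform

log10(hours)                  # a log scale would send 0 to -Inf
```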