David Fox: Space, Above and Beyond

This week’s Tidy Tuesday data is about astronauts.

First I’ll load the data and examine it with the eminently useful {skimr}.

library(tidyverse)  # dplyr, stringr, forcats, ggplot2
library(skimr)      # skim()

tuesdata <- tidytuesdayR::tt_load('2020-07-14')


    Downloading file 1 of 1: `astronauts.csv`

astronauts = tuesdata$astronauts

skim(astronauts)

Data summary
Name	astronauts
Number of rows	1277
Number of columns	24
_______________________
Column type frequency:
character	11
numeric	13
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	0	1	9	34	564
original_name	5	1	2	34	560
sex	0	1	4	6	2
nationality	0	1	3	24	40
military_civilian	0	1	8	8	2
selection	1	1	2	60	229
occupation	0	1	3	23	12
mission_title	1	1	1	27	361
ascend_shuttle	1	1	4	16	436
in_orbit	0	1	3	17	289
descend_shuttle	1	1	4	17	432

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	639.00	368.78	1.00	320.00	639	958.00	1277.00	▇▇▇▇▇
number	1	274.23	148.19	1.00	153.00	278	390.00	565.00	▅▆▇▆▅
nationwide_number	1	128.75	97.26	1.00	47.00	110	204.00	433.00	▇▅▅▂▁
year_of_birth	1	1951.68	11.44	1921.00	1944.00	1952	1959.00	1983.00	▂▃▇▅▁
year_of_selection	1	1985.59	12.22	1959.00	1978.00	1987	1995.00	2018.00	▃▅▇▅▁
mission_number	1	1.99	1.15	1.00	1.00	2	3.00	7.00	▇▂▁▁▁
total_number_of_missions	1	2.98	1.40	1.00	2.00	3	4.00	7.00	▇▅▅▂▁
year_of_mission	1	1994.60	12.58	1961.00	1986.00	1995	2003.00	2019.00	▂▃▇▇▅
hours_mission	1	1050.88	1714.79	0.00	190.03	261	382.00	10505.00	▇▁▁▁▁
total_hrs_sum	1	2968.34	4214.72	0.61	482.00	932	4264.00	21083.52	▇▂▁▁▁
field21	1	0.63	1.17	0.00	0.00	0	1.00	7.00	▇▁▁▁▁
eva_hrs_mission	1	3.66	7.29	0.00	0.00	0	4.72	89.13	▇▁▁▁▁
total_eva_hrs	1	10.76	16.05	0.00	0.00	0	19.52	78.80	▇▂▁▁▁

Roles in spaceflight

Occupation, which seems to mean mission role, looks like an interesting variable to explore. The first order of business is to eliminate any capitalization inconsistencies, which seem to plague all data. There is a lot of discussion about the pros and cons of SQL, but one advantage of a database structure is the ability to create rules that keep inconsistent data from ever getting into the data stream.

library(janitor)  # tabyl(), adorn_*()
library(gt)       # gt()

astronauts %>%
  mutate(occupation = str_to_lower(occupation)) %>%   # always seem to be capitalization issues
  tabyl(occupation) %>%
  adorn_pct_formatting() %>% 
  gt()

occupation	n	percent
commander	315	24.7%
flight engineer	196	15.3%
msp	498	39.0%
other (journalist)	1	0.1%
other (space tourist)	8	0.6%
pilot	197	15.4%
psp	59	4.6%
space tourist	2	0.2%
spaceflight participant	1	0.1%
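To see why the lower-casing step matters, here is a minimal sketch on a made-up vector (the labels are illustrative, not taken from the dataset). Values that differ only in capitalization count as separate categories until they are normalized.

```r
library(stringr)

# Hypothetical labels that differ only by capitalization
roles <- c("Pilot", "pilot", "PSP", "psp", "commander")

# Each spelling counts as its own category before normalizing
n_before <- length(unique(roles))

# After lower-casing, spellings of the same role collapse together
n_after <- length(unique(str_to_lower(roles)))
```

In the real data the same step takes occupation from the 12 unique values reported by skim() down to the 9 rows in the table above.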

I’m not sure what some of these roles are, and a little re-coding is in order to deal with redundant categories. Some external research indicates that ‘msp’ probably stands for mission specialist and ‘psp’ for payload specialist. This Wikipedia page has more details. I’ll recode these and lump all the tourists together (sorry, lone space journalist). One thing to watch out for with case_when: my first attempt with single quotes did not work, and I could not figure out why until I switched to double quotes. R treats straight single and double quotes as equivalent, so the culprit may have been something else, such as curly quotes or backticks pasted in from another source.

astronauts_clean <- astronauts %>% 
  select(name, sex, occupation, year_of_mission, hours_mission) %>% 
  mutate(occupation = str_to_lower(occupation)) %>%
  mutate(role = case_when(
    occupation == "msp" ~ "mission specialist",
    occupation == "psp" ~ "payload specialist",
    TRUE ~ occupation),
    role_factor = fct_lump_n(role, 5))
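As a quick check of the recoding logic, the sketch below runs the same case_when() and fct_lump_n() pattern on a small made-up vector (the counts are hypothetical). It deliberately writes the string literals with single quotes: straight single and double quotes produce identical strings in R, so a quoting problem inside case_when() usually points to something else, such as curly quotes.

```r
library(dplyr)
library(forcats)

# Hypothetical occupations, already lower-cased
toy <- tibble(
  occupation = c(rep("msp", 4), rep("pilot", 3), rep("psp", 2),
                 "space tourist", "other (journalist)")
)

toy_clean <- toy %>%
  mutate(
    # Single-quoted literals behave exactly like double-quoted ones
    role = case_when(
      occupation == 'msp' ~ 'mission specialist',
      occupation == 'psp' ~ 'payload specialist',
      TRUE ~ occupation
    ),
    # Keep the 3 most frequent roles and lump the rest into "Other"
    role_factor = fct_lump_n(role, 3)
  )

# One row per retained level, plus the lumped "Other" level
count(toy_clean, role_factor)
```

With the real data the post uses n = 5, which keeps the five large roles and collects the tourists, the participant and the journalist under "Other".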

Gender balance in space flight roles.

Some of the early pioneers in spaceflight were women. I’m curious about how the current gender roles look.

astronauts_clean %>%
  tabyl(role_factor, sex) %>%
  adorn_totals("row") %>% 
  adorn_percentages("row") %>% 
  adorn_pct_formatting() %>%
  adorn_ns() %>%
  adorn_title("combined", row_name = "Role", col_name = "gender") %>% 
  gt()

Role/gender	female	male
commander	1.0% (3)	99.0% (312)
flight engineer	10.2% (20)	89.8% (176)
mission specialist	21.3% (106)	78.7% (392)
payload specialist	8.5% (5)	91.5% (54)
pilot	3.6% (7)	96.4% (190)
Other	16.7% (2)	83.3% (10)
Total	11.2% (143)	88.8% (1134)

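Not great: women account for only 11% of the rows, and since the data has one row per astronaut per mission, that is 11% of crew assignments rather than 11% of individual astronauts. Only 1% of commander assignments and less than 4% of pilot assignments went to women. How does this look over time?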
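Because the data has one row per astronaut per mission (1277 rows but only 564 unique names in the skim output), shares computed on rows can differ from shares computed on people. A minimal sketch of the difference, on made-up rows (the names and values are hypothetical):

```r
library(dplyr)

# Hypothetical data: one row per astronaut per mission
flights <- tibble(
  name = c("A", "A", "A", "B", "C", "C"),
  sex  = c("male", "male", "male", "female", "male", "male")
)

# Share of rows (crew assignments) by sex
by_flight <- flights %>%
  count(sex) %>%
  mutate(share = n / sum(n))

# Share of distinct people by sex
by_person <- flights %>%
  distinct(name, sex) %>%
  count(sex) %>%
  mutate(share = n / sum(n))
```

In this toy example the one woman holds one of six crew assignments but is one of three people, so the two shares tell different stories. The same distinct() step could be applied to astronauts_clean to get a per-person figure.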

library(ggbeeswarm)  # geom_quasirandom()
library(plotly)      # ggplotly()

set.seed(567)

p1 = astronauts_clean %>%
  ggplot(aes(x= year_of_mission, y= sex, color = role_factor,
             text = paste(year_of_mission, "<br>",name,"<br>",role_factor))) +
  geom_quasirandom(alpha = .4,
                   size = 2,
                   groupOnX = FALSE) +
  labs(title = 'Gender roles in space flight',
       y = NULL, 
       x = NULL,
       color = "Role",
       caption = 'TidyTuesday 2020-07-14') +
  scale_color_viridis_d(option ="plasma") +
  theme(legend.position = "bottom")

ggplotly(p1, tooltip = "text")

Valentina Tereshkova was the first woman to travel to space: she flew solo in 1963 on a mission that lasted 3 days. If you read her Wikipedia page, she is clearly a badass. In this chart, there is not another woman pilot until Eileen Collins in 1997. It seems like the role of pilot has disappeared since the Space Shuttle was retired, but there are still male mission commanders, while the last woman commander was in 2007.

Waxing and waning of space flight.

The previous chart shows several strong pulses in the number of missions.

missions = astronauts %>%
  distinct(mission_title, .keep_all = TRUE) %>% 
  select(year_of_mission, mission_title, ascend_shuttle, in_orbit, hours_mission)
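One thing worth knowing about this step: distinct() with .keep_all = TRUE keeps the first row it encounters for each mission_title. If crew members on the same mission have different values in a column such as hours_mission, the value kept is the one from whoever is listed first. A small sketch on made-up rows (all values hypothetical):

```r
library(dplyr)

# Hypothetical crew rows: two astronauts on M1 with different hours
crew <- tibble(
  mission_title = c("M1", "M1", "M2"),
  name          = c("A", "B", "C"),
  hours_mission = c(100, 250, 40)
)

# Keeps the first row per mission_title, so M1 retains A's 100 hours
one_per_mission <- crew %>%
  distinct(mission_title, .keep_all = TRUE)
```

If the longest stay per mission were preferred, grouping by mission_title and summarizing with max(hours_mission) would be one alternative.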

p2 = missions %>% group_by(year_of_mission) %>% 
  summarize(mission_count = n()) %>% 
  ggplot(aes(x= year_of_mission,
             y = mission_count,
             color = mission_count,
             size = mission_count,
             text = paste(year_of_mission, ":", mission_count))) +
  geom_point(alpha = .5) +
  scale_color_viridis_c(option = "plasma") +
  # explicit line break; str_wrap() would collapse the "\n" into a space
  labs(title = "After a heyday in the 90s, the number of flights \nhas declined to the level of the 1960s.",
       x= NULL,
       y = 'missions') +
  theme(legend.position = "none")

ggplotly(p2, tooltip = "text")

Mission hours have increased dramatically with space stations.

p3 = ggplot(missions, aes(x= year_of_mission,
                          y = hours_mission,
                          size = hours_mission,
                          color = hours_mission,
                          text = paste(in_orbit, "<br>", hours_mission))) +
  geom_jitter(alpha = .5) +
  #geom_smooth(se = FALSE) +
  scale_color_viridis_c(option = "plasma") +
  labs(title = "Although the number of missions has declined, \ntime in space has increased dramatically.",
       x= "",
       y = "mission hours") + 
  theme(legend.position = 'none') +
  scale_y_sqrt()


ggplotly(p3, tooltip = "text")