Tidy Tuesday Astronaut data.
First I’ll load the data and examine it with the eminently useful {skimr}.
library(tidyverse) # dplyr, stringr, forcats, ggplot2 and the pipe
library(skimr) # skim()
tuesdata <- tidytuesdayR::tt_load('2020-07-14')
Downloading file 1 of 1: `astronauts.csv`
astronauts = tuesdata$astronauts
skim(astronauts)
| Name | astronauts |
| Number of rows | 1277 |
| Number of columns | 24 |
| _______________________ | |
| Column type frequency: | |
| character | 11 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1 | 9 | 34 | 0 | 564 | 0 |
| original_name | 5 | 1 | 2 | 34 | 0 | 560 | 0 |
| sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| nationality | 0 | 1 | 3 | 24 | 0 | 40 | 0 |
| military_civilian | 0 | 1 | 8 | 8 | 0 | 2 | 0 |
| selection | 1 | 1 | 2 | 60 | 0 | 229 | 0 |
| occupation | 0 | 1 | 3 | 23 | 0 | 12 | 0 |
| mission_title | 1 | 1 | 1 | 27 | 0 | 361 | 0 |
| ascend_shuttle | 1 | 1 | 4 | 16 | 0 | 436 | 0 |
| in_orbit | 0 | 1 | 3 | 17 | 0 | 289 | 0 |
| descend_shuttle | 1 | 1 | 4 | 17 | 0 | 432 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 639.00 | 368.78 | 1.00 | 320.00 | 639 | 958.00 | 1277.00 | ▇▇▇▇▇ |
| number | 0 | 1 | 274.23 | 148.19 | 1.00 | 153.00 | 278 | 390.00 | 565.00 | ▅▆▇▆▅ |
| nationwide_number | 0 | 1 | 128.75 | 97.26 | 1.00 | 47.00 | 110 | 204.00 | 433.00 | ▇▅▅▂▁ |
| year_of_birth | 0 | 1 | 1951.68 | 11.44 | 1921.00 | 1944.00 | 1952 | 1959.00 | 1983.00 | ▂▃▇▅▁ |
| year_of_selection | 0 | 1 | 1985.59 | 12.22 | 1959.00 | 1978.00 | 1987 | 1995.00 | 2018.00 | ▃▅▇▅▁ |
| mission_number | 0 | 1 | 1.99 | 1.15 | 1.00 | 1.00 | 2 | 3.00 | 7.00 | ▇▂▁▁▁ |
| total_number_of_missions | 0 | 1 | 2.98 | 1.40 | 1.00 | 2.00 | 3 | 4.00 | 7.00 | ▇▅▅▂▁ |
| year_of_mission | 0 | 1 | 1994.60 | 12.58 | 1961.00 | 1986.00 | 1995 | 2003.00 | 2019.00 | ▂▃▇▇▅ |
| hours_mission | 0 | 1 | 1050.88 | 1714.79 | 0.00 | 190.03 | 261 | 382.00 | 10505.00 | ▇▁▁▁▁ |
| total_hrs_sum | 0 | 1 | 2968.34 | 4214.72 | 0.61 | 482.00 | 932 | 4264.00 | 21083.52 | ▇▂▁▁▁ |
| field21 | 0 | 1 | 0.63 | 1.17 | 0.00 | 0.00 | 0 | 1.00 | 7.00 | ▇▁▁▁▁ |
| eva_hrs_mission | 0 | 1 | 3.66 | 7.29 | 0.00 | 0.00 | 0 | 4.72 | 89.13 | ▇▁▁▁▁ |
| total_eva_hrs | 0 | 1 | 10.76 | 16.05 | 0.00 | 0.00 | 0 | 19.52 | 78.80 | ▇▂▁▁▁ |
Occupation, which seems to mean mission role, looks like an interesting variable to explore. The first order of business is to eliminate any capitalization inconsistencies, which seem to plague all data. There is a lot of discussion about the pros and cons of SQL, but one advantage of a database structure is the ability to create rules that keep inconsistent data from ever getting into the data stream.
library(janitor) # tabyl() and the adorn_*() helpers
library(gt) # gt() table rendering
astronauts %>%
  mutate(occupation = str_to_lower(occupation)) %>% # always seem to be capitalization issues
  tabyl(occupation) %>%
  adorn_pct_formatting() %>%
  gt()
| occupation | n | percent |
|---|---|---|
| commander | 315 | 24.7% |
| flight engineer | 196 | 15.3% |
| msp | 498 | 39.0% |
| other (journalist) | 1 | 0.1% |
| other (space tourist) | 8 | 0.6% |
| pilot | 197 | 15.4% |
| psp | 59 | 4.6% |
| space tourist | 2 | 0.2% |
| spaceflight participant | 1 | 0.1% |
I’m not sure what some of these roles are, and a little recoding is in order to deal with redundant categories. Some external research indicates that ‘msp’ probably stands for mission specialist and ‘psp’ for payload specialist. This Wikipedia page has more details. I’ll recode these and lump all the tourists together (sorry, lone space journalist). One thing to watch out for with case_when: I was using single quotes and could not figure out why things were not working until I switched to double quotes. R itself treats straight single and double quotes the same, so the real culprit may have been something else, such as curly quotes pasted in from another source.
astronauts_clean <- astronauts %>%
  select(name, sex, occupation, year_of_mission, hours_mission) %>%
  mutate(occupation = str_to_lower(occupation)) %>%
  mutate(
    role = case_when(
      occupation == "msp" ~ "mission specialist",
      occupation == "psp" ~ "payload specialist",
      TRUE ~ occupation
    ),
    # keep the 5 most common roles and lump everything else into "Other"
    role_factor = fct_lump_n(role, 5)
  )
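The recode-and-lump step above can be tried on a small toy vector. This is only a sketch: the `toy` data frame and its counts are made up for illustration and are not the astronaut data. It runs the same `case_when()` recode once with double quotes and once with single quotes, then lumps rare roles with `fct_lump_n()`.

```r
library(dplyr)
library(forcats)

# Hypothetical occupation values, not taken from the real data set
toy <- tibble(
  occupation = c("msp", "msp", "msp", "psp", "psp", "pilot", "space tourist")
)

recoded <- toy %>%
  mutate(
    # recode with double-quoted strings
    role_double = case_when(
      occupation == "msp" ~ "mission specialist",
      occupation == "psp" ~ "payload specialist",
      TRUE ~ occupation
    ),
    # the same recode with single-quoted strings
    role_single = case_when(
      occupation == 'msp' ~ 'mission specialist',
      occupation == 'psp' ~ 'payload specialist',
      TRUE ~ occupation
    ),
    # keep the 2 most frequent roles, lump the rest into "Other"
    role_factor = fct_lump_n(role_double, 2)
  )

# Both quoting styles should give the same result
identical(recoded$role_double, recoded$role_single)
levels(recoded$role_factor)
```

In the real data the same idea is applied with `n = 5`, which is why the journalist and the various tourist labels all end up in a single "Other" level.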
Some of the early pioneers in spaceflight were women. I’m curious about how the current gender roles look.
astronauts_clean %>%
  tabyl(role_factor, sex) %>%
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>%
  adorn_ns() %>%
  adorn_title("combined", row_name = "Role", col_name = "gender") %>%
  gt()
| Role/gender | female | male |
|---|---|---|
| commander | 1.0% (3) | 99.0% (312) |
| flight engineer | 10.2% (20) | 89.8% (176) |
| mission specialist | 21.3% (106) | 78.7% (392) |
| payload specialist | 8.5% (5) | 91.5% (54) |
| pilot | 3.6% (7) | 96.4% (190) |
| Other | 16.7% (2) | 83.3% (10) |
| Total | 11.2% (143) | 88.8% (1134) |
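As a quick sanity check, the row percentages in the table can be reproduced by hand from the counts in parentheses. The sketch below copies three rows from the table; the vector names are just for illustration.

```r
# Counts copied from the table above
female <- c(commander = 3, pilot = 7, total = 143)
male <- c(commander = 312, pilot = 190, total = 1134)

# Row percentage = female / (female + male), rounded to one decimal place
pct_female <- round(100 * female / (female + male), 1)
pct_female
```

The denominators are the row totals (for example 3 + 312 = 315 commander records), which is what `adorn_percentages("row")` uses.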
Not great: women account for only 11% of the records in this table, which has one row per astronaut per mission. Only 1% of mission commander slots and less than 4% of pilot slots have gone to women. How does this look over time?
library(ggbeeswarm) # geom_quasirandom()
library(plotly) # ggplotly()
set.seed(567)
p1 = astronauts_clean %>%
  ggplot(aes(x = year_of_mission, y = sex, color = role_factor,
             text = paste(year_of_mission, "<br>", name, "<br>", role_factor))) +
  # groupOnX = FALSE spreads points along y; newer ggbeeswarm versions may warn that it is deprecated
  geom_quasirandom(alpha = .4,
                   size = 2,
                   groupOnX = FALSE) +
  labs(title = 'Gender roles in space flight',
       y = NULL,
       x = NULL,
       color = "Role",
       caption = 'TidyTuesday 2020-07-14') +
  scale_color_viridis_d(option = "plasma") +
  theme(legend.position = "bottom")
ggplotly(p1, tooltip = "text")
Valentina Tereshkova was the first woman to travel to space. She made a solo flight in 1963 that lasted for 3 days. If you read her Wikipedia page, she is clearly a badass. There was not another woman pilot in space until 1997: Eileen Collins. It seems like the role of pilot has disappeared since the Space Shuttle was retired, but there are still male mission commanders, while the last woman commander was in 2007.
The previous chart shows several strong pulses in the number of missions.
# One row per mission: keep the first crew record for each mission title
missions = astronauts %>%
  distinct(mission_title, .keep_all = TRUE) %>%
  select(year_of_mission, mission_title, ascend_shuttle, in_orbit, hours_mission)
# Count missions per year and plot the counts
p2 = missions %>%
  group_by(year_of_mission) %>%
  summarize(mission_count = n()) %>%
  ggplot(aes(x = year_of_mission,
             y = mission_count,
             color = mission_count,
             size = mission_count,
             text = paste(year_of_mission, ":", mission_count))) +
  geom_point(alpha = .5) +
  scale_color_viridis_c(option = "plasma") +
  labs(title = str_wrap("After a heyday in the 90s the number of flights \nhas declined to the level of the 1960s."),
       x = NULL,
       y = 'missions') +
  theme(legend.position = "none")
ggplotly(p2, tooltip = "text")
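The mission counts above rely on collapsing the crew-level rows to one row per mission title before counting. A minimal sketch of that step on made-up rows (the `crew` data frame is hypothetical) shows why the de-duplication matters: without it, each mission would be counted once per crew member.

```r
library(dplyr)

# Hypothetical crew records: one row per astronaut per mission
crew <- tibble(
  name = c("A", "B", "C", "D", "E"),
  mission_title = c("M1", "M1", "M2", "M3", "M3"),
  year_of_mission = c(1990, 1990, 1990, 1991, 1991)
)

# Keep the first row for each mission title, with all other columns
missions_toy <- crew %>%
  distinct(mission_title, .keep_all = TRUE)

# Missions per year, counted on the de-duplicated rows
missions_per_year <- missions_toy %>%
  count(year_of_mission, name = "mission_count")

missions_per_year
```

This approach assumes a mission title identifies a mission uniquely. Rows with a missing or reused title would be merged, so the real counts may be slightly off.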
p3 = ggplot(missions, aes(x = year_of_mission,
                          y = hours_mission,
                          size = hours_mission,
                          color = hours_mission,
                          text = paste(in_orbit, "<br>", hours_mission))) +
  geom_jitter(alpha = .5) +
  #geom_smooth(se = FALSE) +
  scale_color_viridis_c(option = "plasma") +
  labs(title = "Although number of missions has declined, \ntime in space has increased dramatically.",
       x = "",
       y = "mission hours") +
  theme(legend.position = 'none') +
  scale_y_sqrt() # square-root axis keeps the short early flights visible
ggplotly(p3, tooltip = "text")