The Bechdel test and the X-Mansion with tidymodels and #TidyTuesday

[This article was first published on rstats | Julia Silge, and kindly contributed to R-bloggers].



Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast focuses on using bootstrap resampling with this week’s #TidyTuesday dataset from the Claremont Run Project about issues of the comic book series Uncanny X-Men. 🦸

Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Read in the data

Our modeling goal is to use information about speech bubbles, thought bubbles, narrative statements, and character depictions from this week’s #TidyTuesday dataset to understand more about characteristics of individual comic book issues. Let’s focus on two modeling questions.

  • Does a given issue have the X-Mansion as a location?
  • Does a given issue pass the Bechdel test?

We’re going to use three of the datasets from this week.

library(tidyverse)

character_visualization <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/character_visualization.csv")
xmen_bechdel <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/xmen_bechdel.csv")
locations <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/locations.csv")

The character_visualization dataset counts up each time one of the main 25 characters speaks, thinks, is involved in narrative statements, or is depicted in total.

character_visualization

## # A tibble: 9,800 x 7
##    issue costume character                     speech thought narrative depicted
##  1    97 Costume Editor narration                   0       0         0        0
##  2    97 Costume Omnipresent narration              0       0         0        0
##  3    97 Costume Professor X = Charles Xavier…      0       0         0        0
##  4    97 Costume Wolverine = Logan                  7       0         0       10
##  5    97 Costume Cyclops = Scott Summers           24       3         0       23
##  6    97 Costume Marvel Girl/Phoenix = Jean G…      0       0         0        0
##  7    97 Costume Storm = Ororo Munroe              11       0         0        9
##  8    97 Costume Colossus = Peter (Piotr) Ras…      9       0         0       17
##  9    97 Costume Nightcrawler = Kurt Wagner        10       0         0       17
## 10    97 Costume Banshee = Sean Cassidy             0       0         0        5
## # … with 9,790 more rows

Let’s aggregate this dataset to the issue level so we can build models using per-issue differences in speaking, thinking, narrative, and total depictions.

# sum each count column over all characters within an issue
per_issue <- character_visualization %>%
  group_by(issue) %>%
  summarise(across(speech:depicted, sum)) %>%
  ungroup()

per_issue

## # A tibble: 196 x 5
##    issue speech thought narrative depicted
##  1    97    146      13        71      168
##  2    98    172       9        29      180
##  3    99    105      22        29      124
##  4   100    141      28         7      122
##  5   101    158      27        58      191
##  6   102     78      27        33      133
##  7   103     91       6        25      121
##  8   104    142      15        25      165
##  9   105     83      12        24      128
## 10   106     20       6        20       16
## # … with 186 more rows
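
To make the aggregation step concrete, here is a minimal sketch on a made-up table. The `toy` data and its values are hypothetical and only mimic the shape of character_visualization; the point is that `summarise(across(...))` applies the same summary function to a range of columns within each group.

```r
library(dplyr)

# Hypothetical toy data: two issues with two character rows each
# (values invented for illustration, not taken from the real dataset)
toy <- tibble(
  issue    = c(1, 1, 2, 2),
  speech   = c(3, 4, 0, 5),
  thought  = c(1, 0, 2, 2),
  depicted = c(6, 7, 1, 9)
)

# Sum every column from speech through depicted within each issue
toy_per_issue <- toy %>%
  group_by(issue) %>%
  summarise(across(speech:depicted, sum))

toy_per_issue
```

Each issue collapses to a single row, which is the unit of observation the models later in the post work with.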

I’m not doing a ton of EDA here but there are lots of great examples out there to explore on Twitter!

Which issues have the X-Mansion as a location?

Let’s start with our first model. The X-Mansion is the most frequently used location, but it doesn’t appear in every issue.

# one row per issue: TRUE if any of its locations is the X-Mansion
x_mansion <- locations %>%
  group_by(issue) %>%
  summarise(mansion = "X-Mansion" %in% location)

locations_joined <- per_issue %>%
  inner_join(x_mansion)

locations_joined %>%
  mutate(mansion = if_else(mansion, "X-Mansion", "No mansion")) %>%
  pivot_longer(speech:depicted, names_to = "visualization") %>%
  mutate(visualization = fct_inorder(visualization)) %>%
  ggplot(aes(mansion, value, fill = visualization)) +
  geom_dotplot(
    binaxis = "y", stackdir = "center",
    binpositions = "all",
    show.legend = FALSE
  ) +
  facet_wrap(~visualization, scales = "free_y") +
  labs(
    x = NULL, y = NULL,
    title = "Which issues contain the X-Mansion as a location?",
    subtitle = "Comparing the top 25 characters' speech, thought, narrative portrayal, and total depictions",
    caption = "Data from the Claremont Run Project"
  )
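
The per-issue flag relies on `%in%` returning a single TRUE or FALSE for each group. As a small sketch of that idea, the toy table below uses invented issue numbers and location names (assumptions for illustration only):

```r
library(dplyr)

# Hypothetical locations table: several rows per issue, one row per location
toy_locations <- tibble(
  issue    = c(1, 1, 2, 3, 3),
  location = c("X-Mansion", "Danger Room", "Space", "Savage Land", "X-Mansion")
)

# For each issue, ask whether "X-Mansion" appears among its locations
toy_mansion <- toy_locations %>%
  group_by(issue) %>%
  summarise(mansion = "X-Mansion" %in% location)

toy_mansion
```

Because the result has exactly one row per issue, joining it to a per-issue table with `inner_join()` does not duplicate rows.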

Now let’s create bootstrap resamples and fit a logistic regression model to each resample. What are the bootstrap confidence intervals on the model parameters?

library(tidymodels)
set.seed(123)
boots <- bootstraps(locations_joined, times = 1000, apparent = TRUE)

# fit a logistic regression to the analysis set of each resample
boot_models <- boots %>%
  mutate(
    model = map(
      splits,
      ~ glm(mansion ~ speech + thought + narrative + depicted,
        family = "binomial", data = analysis(.)
      )
    ),
    coef_info = map(model, tidy)
  )

boot_coefs <- boot_models %>%
  unnest(coef_info)

int_pctl(boot_models, coef_info)

## # A tibble: 5 x 6
##   term          .lower .estimate   .upper .alpha .method
## 1 (Intercept) -2.42     -1.29    -0.277     0.05 percentile
## 2 depicted     0.00193   0.0103   0.0196    0.05 percentile
## 3 narrative   -0.0106    0.00222  0.0143    0.05 percentile
## 4 speech      -0.0148   -0.00716  0.000617  0.05 percentile
## 5 thought     -0.0143   -0.00338  0.00645   0.05 percentile
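
For intuition about what a percentile interval is, here is a hand-rolled sketch in base R. It uses simulated stand-in data (not the X-Men data), a single predictor, and 500 resamples, all of which are assumptions chosen to keep the example small. It illustrates the general percentile bootstrap idea rather than the exact internals of `int_pctl()`.

```r
set.seed(123)

# Simulated stand-in data (NOT the X-Men data): one predictor, binary outcome
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1 * x))
sim <- data.frame(x = x, y = y)

# Refit the logistic regression on each bootstrap resample (rows drawn with replacement)
n_boot <- 500
slopes <- replicate(n_boot, {
  idx <- sample(n, replace = TRUE)
  coef(glm(y ~ x, family = "binomial", data = sim[idx, ]))[["x"]]
})

# Percentile interval: the 2.5% and 97.5% quantiles of the resampled estimates
ci <- quantile(slopes, probs = c(0.025, 0.975))
ci
```

The interval is read directly off the distribution of refitted coefficients, so it makes no normality assumption about the estimate.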

How are the parameters distributed?

boot_coefs %>%
  filter(term != "(Intercept)") %>%
  mutate(term = fct_inorder(term)) %>%
  ggplot(aes(estimate, fill = term)) +
  geom_vline(
    xintercept = 0, color = "gray50",
    alpha = 0.6, lty = 2, size = 1.5
  ) +
  geom_histogram(alpha = 0.8, bins = 25, show.legend = FALSE) +
  facet_wrap(~term, scales = "free") +
  labs(
    title = "Which issues contain the X-Mansion as a location?",
    subtitle = "Comparing the top 25 characters' speech, thought, narrative portrayal, and total depictions",
    caption = "Data from the Claremont Run Project"
  )

  • Issues with more depictions of the main 25 characters (i.e. large groups of X-Men) are more likely to occur in the X-Mansion.
  • Issues with more speech bubbles from these characters are less likely to occur in the X-Mansion.

Apparently issues with lots of talking are more likely to occur elsewhere!
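
The coefficients are on the log-odds scale, so their size is easier to read after exponentiating. As a rough sketch, the snippet below takes the two mean bootstrap estimates from the interval table above and converts them to odds ratios per 10 additional depictions or speech bubbles. The choice of 10 units is an arbitrary assumption for readability.

```r
# Mean bootstrap estimates copied from the X-Mansion interval table (log-odds per unit)
est <- c(depicted = 0.0103, speech = -0.00716)

# Multiplicative change in the odds of an X-Mansion issue for 10 additional units
or_per_10 <- exp(10 * est)
round(or_per_10, 2)
```

This gives roughly 1.11 for depicted and 0.93 for speech: about 11% higher odds per 10 extra depictions and about 7% lower odds per 10 extra speech bubbles, holding the other counts fixed. Note that the speech interval in the table narrowly includes zero, so the second effect is less certain.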

Now let’s do the Bechdel test

If you haven’t heard about the Bechdel test, this video (now over 10 years old!) is a nice explainer. We can use the same approach from the previous section but replace the data about issue locations with the Bechdel test data.

# convert the "yes"/"no" column to a logical outcome
bechdel_joined <- per_issue %>%
  inner_join(xmen_bechdel) %>%
  mutate(pass_bechdel = if_else(pass_bechdel == "yes", TRUE, FALSE))

bechdel_joined %>%
  mutate(pass_bechdel = if_else(pass_bechdel, "Passes Bechdel", "Fails Bechdel")) %>%
  pivot_longer(speech:depicted, names_to = "visualization") %>%
  mutate(visualization = fct_inorder(visualization)) %>%
  ggplot(aes(pass_bechdel, value, fill = visualization)) +
  geom_dotplot(
    binaxis = "y", stackdir = "center",
    binpositions = "all",
    show.legend = FALSE
  ) +
  facet_wrap(~visualization, scales = "free_y") +
  labs(
    x = NULL, y = NULL,
    title = "Which Uncanny X-Men issues pass the Bechdel test?",
    subtitle = "Comparing the top 25 characters' speech, thought, narrative portrayal, and total depictions",
    caption = "Data from the Claremont Run Project"
  )

We can again create bootstrap resamples, fit logistic regression models, and compute bootstrap confidence intervals.

set.seed(123)
boots <- bootstraps(bechdel_joined, times = 1000, apparent = TRUE)

# fit a logistic regression to the analysis set of each resample
boot_models <- boots %>%
  mutate(
    model = map(
      splits,
      ~ glm(pass_bechdel ~ speech + thought + narrative + depicted,
        family = "binomial", data = analysis(.)
      )
    ),
    coef_info = map(model, tidy)
  )

boot_coefs <- boot_models %>%
  unnest(coef_info)

int_pctl(boot_models, coef_info)

## # A tibble: 5 x 6
##   term           .lower .estimate    .upper .alpha .method
## 1 (Intercept) -1.18      -0.248    0.699      0.05 percentile
## 2 depicted    -0.0232    -0.0111  -0.000509   0.05 percentile
## 3 narrative   -0.00405    0.00966  0.0260     0.05 percentile
## 4 speech       0.00521    0.0151   0.0285     0.05 percentile
## 5 thought      0.000561   0.0155   0.0361     0.05 percentile

How are these parameters distributed?

boot_coefs %>%
  filter(term != "(Intercept)") %>%
  mutate(term = fct_inorder(term)) %>%
  ggplot(aes(estimate, fill = term)) +
  geom_vline(
    xintercept = 0, color = "gray50",
    alpha = 0.6, lty = 2, size = 1.5
  ) +
  geom_histogram(alpha = 0.8, bins = 25, show.legend = FALSE) +
  facet_wrap(~term, scales = "free") +
  labs(
    title = "Which Uncanny X-Men issues pass the Bechdel test?",
    subtitle = "Comparing the top 25 characters' speech, thought, narrative portrayal, and total depictions",
    caption = "Data from the Claremont Run Project"
  )

  • Issues with more depictions of the main 25 characters (i.e. more characters in them) are less likely to pass the Bechdel test.
  • Issues with more speech bubbles from these characters are more likely to pass the Bechdel test. (Perhaps also issues with more thought bubbles.)

I think it makes sense that issues with lots of speaking are more likely to pass the Bechdel test, which is about characters speaking to each other. Interesting that the issues with lots of character depictions are less likely to pass!
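
To see how these coefficients combine, the sketch below plugs the mean bootstrap estimates from the Bechdel interval table into the logistic formula for one made-up issue. Two assumptions to flag: the counts in `toy_issue` are hypothetical, and treating the averaged bootstrap estimates as if they were a single fitted model is only an approximation for illustration.

```r
# Mean bootstrap estimates copied from the Bechdel interval table (log-odds scale)
coefs <- c(
  `(Intercept)` = -0.248,
  depicted      = -0.0111,
  narrative     =  0.00966,
  speech        =  0.0151,
  thought       =  0.0155
)

# Hypothetical issue (invented counts); the leading 1 multiplies the intercept
toy_issue <- c(
  `(Intercept)` = 1,
  depicted      = 170,
  narrative     = 30,
  speech        = 150,
  thought       = 15
)

# Linear predictor on the log-odds scale, then the inverse logit
log_odds <- sum(coefs * toy_issue[names(coefs)])
p_pass <- plogis(log_odds)
round(p_pass, 2)
```

For this invented issue the approximate probability of passing comes out at about 0.66. Raising `speech` while holding `depicted` fixed pushes it up, which matches the direction of the bullets above.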
