I have been continuing my trial and improvement work with the tidyverse, “an opinionated collection of R packages designed for data science” (link).
Today, I have been working through a dplyr vignette (link). I have been mindful for some time that this part of my R use needed significant improvement.
The vignette is really helpful. It guided me through some fundamental procedures for working with data frames and tibbles (link) that I should have learned much earlier in my use of the tidyverse.
The vignette points out that when working with data you must:
- Figure out what you want to do.
- Describe those tasks in the form of a computer program.
- Execute the program.
The dplyr package:
- Constrains your options, which helps you think about data manipulation challenges.
- Provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
- Uses efficient backends.
I have created a GitHub repository (link) to share this example. I have attached the CSV file I used for the exercise, which contains data from the 2019 FIFA Women’s World Cup in France (link).
I enjoyed working through each of the basic verbs of data manipulation:
- filter(): select cases based on their values.
- arrange(): reorder the cases.
- select() and rename(): select variables based on their names.
- mutate() and transmute(): add new variables that are functions of existing variables.
- summarise(): condense multiple values to a single value.
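To make the verbs concrete, here is a minimal sketch of each one applied to a tiny tibble. The teams, column names and values are invented for illustration and are not taken from the World Cup CSV file, so treat this as a pattern to adapt and not as the code from my exercise.

```r
library(dplyr)

# Hypothetical data: invented teams and numbers, not the World Cup file
matches <- tibble(
  team  = c("A", "B", "C", "D"),
  goals = c(3, 0, 2, 1),
  shots = c(12, 5, 9, 10)
)

# filter(): keep only the rows where a team scored
scorers <- filter(matches, goals > 0)

# arrange(): reorder rows by goals, highest first
ranked <- arrange(matches, desc(goals))

# select(): keep columns by name; rename(): change a column's name
goals_only <- select(matches, team, goals)
renamed    <- rename(matches, attempts = shots)

# mutate(): add a column computed from existing columns
with_rate <- mutate(matches, conversion = goals / shots)

# transmute(): like mutate(), but keep only the columns listed
rate_only <- transmute(matches, team, conversion = goals / shots)

# summarise(): condense many values into a single row
totals <- summarise(matches,
                    total_goals = sum(goals),
                    mean_shots  = mean(shots))
```

Each call leaves `matches` untouched and returns a new tibble, so the results can be inspected one at a time while learning what each verb does.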
The syntax and function of all these verbs are very similar in dplyr:
- The first argument is a data frame.
- The subsequent arguments describe what to do with the data frame. You can refer to columns in the data frame directly without using $.
- The result is a new data frame.
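The three points above can be seen side by side in a short sketch. Again, the tibble is hypothetical and exists only to show the calling pattern; the base R line is included for comparison.

```r
library(dplyr)

# Hypothetical data, invented for illustration
matches <- tibble(
  team  = c("A", "B", "C"),
  goals = c(3, 0, 2)
)

# Base R: the column has to be reached through the data frame with $
base_result <- matches[matches$goals > 0, ]

# dplyr: the data frame is the first argument, and later arguments
# refer to columns by bare name, with no $
dplyr_result <- filter(matches, goals > 0)

# The result is a new data frame, so the original keeps all of its rows
original_rows <- nrow(matches)
filtered_rows <- nrow(dplyr_result)

# Because every verb takes a data frame and returns one,
# the output of one verb can be passed straight to another
ranked_scorers <- arrange(filter(matches, goals > 0), desc(goals))
```

This shared shape is what makes the verbs easy to combine: once you know how to call one of them, the others follow the same pattern.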
Photo Credit
Opening Game (FIFA)
Final (FIFA)