Data Wrangling

Released on Wednesday, June 7, 2023

Read and preview data

Our data are usually presented as a csv file and after loading a csv file into R studio, we will have a “data frame”. A data frame can be considered a special case of matrix where each column represents a measurement or variable of interest for each observation which correspond to the rows of the dataset. After loading the tidyverse suite of packages, we use the read_csv() function to load the NBA stats dataset (SPORTS) or the heart disease dataset (HEALTH) from the other day:

library(tidyverse)

# SPORTS
nba_stats <- read_csv("./files/nba_stats.csv")

# HEALTH
heart_disease <- read_csv("./files/heart_disease.csv")

By default, read_csv() reads in the dataset as a tbl (aka tibble) object instead of a data.frame object. You can read about the differences here, but it’s not that meaningful for purposes.

We can use the functions head() and tail() to view a sample of the data. Use the head() function to view the first 6 rows, then use the tail() function to view the last 3 rows:

# INSERT CODE HERE

View the dimensions of the data with dim():

# INSERT CODE HERE

Quickly view summary statistics for all variables with the summary() function:

# Uncomment the following code by deleting the # at the front:
# summary(nba_stats)
# summary(heart_disease)

View the data structure types with str():

# str(nba_stats)
# str(heart_disease)

What’s the difference between the output from the two functions?

Data manipulation with `dplyr`

An easier way to manipulate the data frame is through the dplyr package, which is in the tidyverse suite of packages. The operations we can do include: selecting specific columns, filtering for rows, re-ordering rows, adding new columns and summarizing data. The “split-apply-combine” concept can be achieved by dplyr.

Selecting columns with `select()`

The function select() can be use to select certain column with the column names.

(SPORTS)

First create a new table called nba_stats_pg that only contains the player and games columns:

# INSERT CODE HERE

To select all the columns except a specific column, use the - (subtraction) operator. For example, view the output from uncommenting the following line of code:

# head(select(nba_stats, -player))

To select a range of columns by name (that are in consecutive order), use the : (colon) operator. For example, view the output from uncommenting the following line of code:

#head(select(nba_stats, player:games))

To select all columns that start with certain character strings, use the function starts_with(). Ohter matching options are:

ends_with() = Select columns that end with a character string
contains() = Select columns that contain a character string
matches() = Select columns that match a regular expression
one_of() = Select columns names that are from a group of names

# Uncomment the following lines of code
#head(select(nba_stats, starts_with("three")))
#head(select(nba_stats, contains("throw")))

(HEALTH)

First create a new table called heart_disease_ad that only contains the Age and Drugs columns:

# INSERT CODE HERE

To select all the columns except a specific column, use the - (subtraction) operator. For example, view the output from uncommenting the following line of code:

# head(select(heart_disease, -Interventions))

To select a range of columns by name (that are in consecutive order), use the : (colon) operator. For example, view the output from uncommenting the following line of code:

#head(select(heart_disease, Drugs:Duration))

To select all columns that start with certain character strings, use the function starts_with(). Ohter matching options are:

ends_with() = Select columns that end with a character string
contains() = Select columns that contain a character string
matches() = Select columns that match a regular expression
one_of() = Select columns names that are from a group of names

# Uncomment the following lines of code
#head(select(heart_disease, starts_with("Com")))
#head(select(heart_disease, contains("er")))

Selecting rows using filter()

(SPORTS)

We can also select the rows/observations that satisfy certain criteria. Try selecting the rows with more than 500 assists:

# INSERT CODE HERE

We can also filter on mutiple criteria. Select rows with age above 30 and the team is either “HOU” or “GSW”:

# INSERT CODE HERE

(HEALTH)

We can also select the rows/observations that satisfy certain criteria. Try selecting the rows with more than 500 assists:

# INSERT CODE HERE

We can also filter on mutiple criteria. Select rows with Age above 60 and the gender is ‘Male’:

# INSERT CODE HERE

Arrange or re-order rows using `arrange()`

To arrange the data frame by a specific order we need to use the function arrange(). The default is by increasing order and a negative operator will provide the decreasing order.

(SPORTS)

First arrange the nba_stats table by personal_fouls in ascending order:

# INSERT CODE HERE

Next by descending order:

# INSERT CODE HERE

Try combining a pipeline of select(), filter(), and arrange() steps together with the %>% operator by:

Selecting the player, team, age, and games columns,
Filter to select only rows with games above 50,
Sort by age in descending order

# INSERT CODE HERE

(HEALTH)

First arrange the heart_disease table by Duration in ascending order:

# INSERT CODE HERE

Next by descending order:

# INSERT CODE HERE

Try combining a pipeline of select(), filter(), and arrange() steps together with the %>% operator by:

Selecting the Age, Cost, ERVisit, and Duration columns,
Filter to select only rows with Age above 60,
Sort by Duration in descending order

# INSERT CODE HERE

Create new columns using `mutate()`

Sometimes the data does not include the variable that we are interested in and we need to manipulate the current variables to add new variables into the data frame.

(SPORTS)

Create a new column fouls_per_game by taking the personal_fouls and dividing by games (reassign this output to the nba_stats table following the commented code chunk so this column is added to the table):

# nba_stats <- nba_stats %>%
#   mutate(INSERT CODE HERE)

(HEALTH)

Create a new column cost_per_day by taking the Cost and dividing by Duration (reassign this output to the heart_disease table following the commented code chunk so this column is added to the table):

# heart_disease <- heart_disease %>%
#   mutate(INSERT CODE HERE)

Create summaries of the data with `summarize()`

To create summary statistics for a given column in the data frame, we can use summarize() function.

(SPORTS)

Compute the mean, min, and max number of assists:

# INSERT CODE HERE

The advantage of summarize is more obvious if we combine it with the group_by(), the group operators. Since players at the different position tend to have very different statistics, first group_by() position and then compute the same summary statistics:

# INSERT CODE HERE

(HEALTH)

Compute the mean, min, and max number of Cost:

# INSERT CODE HERE

The advantage of summarize is more obvious if we combine it with the group_by(), the group operators. Try to group_by() the Gender column first and then compute the same summary statistics:

# INSERT CODE HERE

--- title: "Data Wrangling" date: "June 7, 2023" format: html: code-tools: source: true toggle: false caption: none --- ## Read and preview data Our data are usually presented as a csv file and after loading a csv file into R studio, we will have a "data frame". A data frame can be considered a special case of matrix where each column represents a measurement or variable of interest for each observation which correspond to the rows of the dataset. After loading the `tidyverse` suite of packages, we use the [`read_csv()`](https://readr.tidyverse.org/reference/read_delim.html) function to load the NBA stats dataset (SPORTS) or the heart disease dataset (HEALTH) from the other day: ```{r warning = FALSE, message = FALSE} library(tidyverse) # SPORTS nba_stats <- read_csv("./files/nba_stats.csv") # HEALTH heart_disease <- read_csv("./files/heart_disease.csv") ``` By default, `read_csv()` reads in the dataset as a `tbl` (aka `tibble`) object instead of a `data.frame` object. You can read about the differences [here](https://www.rstudio.com/blog/tibble-1-0-0/#:~:text=Tibbles%20vs%20data%20frames,to%20work%20with%20large%20data.), but it's not that meaningful for purposes. We can use the functions `head()` and `tail()` to view a sample of the data. Use the `head()` function to view the first 6 rows, then use the `tail()` function to view the last 3 rows: ```{r} # INSERT CODE HERE ``` View the dimensions of the data with `dim()`: ```{r} # INSERT CODE HERE ``` Quickly view summary statistics for all variables with the `summary()` function: ```{r} # Uncomment the following code by deleting the # at the front: # summary(nba_stats) # summary(heart_disease) ``` View the data structure types with `str()`: ```{r} # str(nba_stats) # str(heart_disease) ``` What's the difference between the output from the two functions? ## Data manipulation with `dplyr` An easier way to manipulate the data frame is through the `dplyr` package, which is in the `tidyverse` suite of packages. The operations we can do include: selecting specific columns, filtering for rows, re-ordering rows, adding new columns and summarizing data. The "split-apply-combine" concept can be achieved by `dplyr`. ### Selecting columns with `select()` The function `select()` can be use to select certain column with the column names. #### (SPORTS) First create a new table called `nba_stats_pg` that only contains the `player` and `games` columns: ```{r} # INSERT CODE HERE ``` To select all the columns except a specific column, use the `-` (subtraction) operator. For example, view the output from uncommenting the following line of code: ```{r} # head(select(nba_stats, -player)) ``` To select a range of columns by name (that are in consecutive order), use the `:` (colon) operator. For example, view the output from uncommenting the following line of code: ```{r} #head(select(nba_stats, player:games)) ``` To select all columns that start with certain character strings, use the function `starts_with()`. Ohter matching options are: 1. `ends_with()` = Select columns that end with a character string 2. `contains()` = Select columns that contain a character string 3. `matches()` = Select columns that match a regular expression 4. `one_of()` = Select columns names that are from a group of names ```{r} # Uncomment the following lines of code #head(select(nba_stats, starts_with("three"))) #head(select(nba_stats, contains("throw"))) ``` #### (HEALTH) First create a new table called `heart_disease_ad` that only contains the `Age` and `Drugs` columns: ```{r} # INSERT CODE HERE ``` To select all the columns except a specific column, use the `-` (subtraction) operator. For example, view the output from uncommenting the following line of code: ```{r} # head(select(heart_disease, -Interventions)) ``` To select a range of columns by name (that are in consecutive order), use the `:` (colon) operator. For example, view the output from uncommenting the following line of code: ```{r} #head(select(heart_disease, Drugs:Duration)) ``` To select all columns that start with certain character strings, use the function `starts_with()`. Ohter matching options are: 1. `ends_with()` = Select columns that end with a character string 2. `contains()` = Select columns that contain a character string 3. `matches()` = Select columns that match a regular expression 4. `one_of()` = Select columns names that are from a group of names ```{r} # Uncomment the following lines of code #head(select(heart_disease, starts_with("Com"))) #head(select(heart_disease, contains("er"))) ``` ### Selecting rows using filter() #### (SPORTS) We can also select the rows/observations that satisfy certain criteria. Try selecting the rows with more than 500 assists: ```{r} # INSERT CODE HERE ``` We can also filter on mutiple criteria. Select rows with `age` above 30 and the team is either "HOU" or "GSW": ```{r} # INSERT CODE HERE ``` #### (HEALTH) We can also select the rows/observations that satisfy certain criteria. Try selecting the rows with more than 500 assists: ```{r} # INSERT CODE HERE ``` We can also filter on mutiple criteria. Select rows with `Age` above 60 and the `gender` is 'Male': ```{r} # INSERT CODE HERE ``` ### Arrange or re-order rows using `arrange()` To arrange the data frame by a specific order we need to use the function `arrange()`. The default is by increasing order and a negative operator will provide the decreasing order. #### (SPORTS) First arrange the `nba_stats` table by `personal_fouls` in ascending order: ```{r} # INSERT CODE HERE ``` Next by descending order: ```{r} # INSERT CODE HERE ``` Try combining a pipeline of `select()`, `filter()`, and `arrange()` steps together with the `%>%` operator by: 1. Selecting the `player`, `team`, `age`, and `games` columns, 2. Filter to select only rows with `games` above 50, 3. Sort by `age` in descending order ```{r} # INSERT CODE HERE ``` #### (HEALTH) First arrange the `heart_disease` table by `Duration` in ascending order: ```{r} # INSERT CODE HERE ``` Next by descending order: ```{r} # INSERT CODE HERE ``` Try combining a pipeline of `select()`, `filter()`, and `arrange()` steps together with the `%>%` operator by: 1. Selecting the `Age`, `Cost`, `ERVisit`, and `Duration` columns, 2. Filter to select only rows with `Age` above 60, 3. Sort by `Duration` in descending order ```{r} # INSERT CODE HERE ``` ### Create new columns using `mutate()` Sometimes the data does not include the variable that we are interested in and we need to manipulate the current variables to add new variables into the data frame. #### (SPORTS) Create a new column `fouls_per_game` by taking the `personal_fouls` and dividing by `games` (reassign this output to the `nba_stats` table following the commented code chunk so this column is added to the table): ```{r} # nba_stats <- nba_stats %>% # mutate(INSERT CODE HERE) ``` #### (HEALTH) Create a new column `cost_per_day` by taking the `Cost` and dividing by `Duration` (reassign this output to the `heart_disease` table following the commented code chunk so this column is added to the table): ```{r} # heart_disease <- heart_disease %>% # mutate(INSERT CODE HERE) ``` ### Create summaries of the data with `summarize()` To create summary statistics for a given column in the data frame, we can use `summarize()` function. #### (SPORTS) Compute the `mean`, `min`, and `max` number of `assists`: ```{r} # INSERT CODE HERE ``` The advantage of `summarize` is more obvious if we combine it with the `group_by()`, the group operators. Since players at the different position tend to have very different statistics, first `group_by()` position and then compute the same summary statistics: ```{r} # INSERT CODE HERE ``` #### (HEALTH) Compute the `mean`, `min`, and `max` number of `Cost`: ```{r} # INSERT CODE HERE ``` The advantage of `summarize` is more obvious if we combine it with the `group_by()`, the group operators. Try to `group_by()` the `Gender` column first and then compute the same summary statistics: ```{r} # INSERT CODE HERE ```

Read and preview data

Data manipulation with dplyr

Selecting columns with select()

(SPORTS)

(HEALTH)

Selecting rows using filter()

(SPORTS)

(HEALTH)

Arrange or re-order rows using arrange()

(SPORTS)

(HEALTH)

Create new columns using mutate()

(SPORTS)

(HEALTH)

Create summaries of the data with summarize()

(SPORTS)

(HEALTH)

Data manipulation with `dplyr`

Selecting columns with `select()`

Arrange or re-order rows using `arrange()`

Create new columns using `mutate()`

Create summaries of the data with `summarize()`