EDA Mini-Project Requirements and Data

Overview

This mini-project will begin on Thursday, June 8 and conclude with a 10 minute (maximum) presentation one week later on Friday, June 16. Students will be placed into groups of two or three and randomly assigned one of six sports datasets. The goal of this project is to practice understanding the structure of a dataset, generating hypotheses, and using exploratory data analysis and data visualization to investigate those hypotheses.

Deliverables

Each team is expected to produce slides to accompany their presentation with the following information:

  • Explanation of the data structure of the dataset
  • Three hypotheses about the dataset
  • Three data visualizations exploring the hypotheses, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one visualization of a categorical variable and one of a continuous variable.
  • One clustering example
  • Conclusions reached for each hypothesis using the data visualizations

Timeline

Groups and datasets will be assigned during the lab session on Thursday 6/8. You will then have the rest of that lab session, as well as the lab sessions on Monday, Wednesday, and Thursday of the following week, to complete the mini-project.

Thursday, June 15 @ 5:00pm ET - Slides must be completed and ready for presentation. Send your slides to Meg’s email (mellingw@andrew.cmu.edu). All code and visualizations must be done in R, but the slides may be created in any program.

Data

EDA projects data overview

There are six different datasets for the EDA projects (linked here).

These datasets were curated by Ron Yurko as part of the SCORE project, and his description of each dataset can be found below.

NBA Player Statistics

The National Basketball Association (NBA) is the top men’s professional basketball league in the world. While players have predefined positions, the sport is becoming increasingly positionless - with centers attempting more three point shots and guards driving the ball inside to dunk. With this dataset, you can explore clustering NBA players based on various types of statistics and compare your cluster labels for the players to their predefined positions.

This dataset contains statistics about 812 player-team stints during the 2021-2022 NBA regular season. Players who played for a single team have one row. For players who played for \(T > 1\) teams during the season (due to trade), there are \(T+1\) rows: one row for their performance with each of the \(T\) teams and another row indicating their total performance (where tm = TOT) across the full season regardless of team. The counting stats are reported on a per 100 team possessions scale, to normalize for playing time differences.
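Because multi-team players appear several times, most player-level analyses need exactly one row per player. The following is a minimal base R sketch of one way to do that, assuming the columns `player` and `tm` described below. The toy data frame and its values are invented for illustration, and the approach assumes player names are unique, which you should check in the real data.

```r
# Toy stand-in for the NBA data (values invented for illustration only).
nba <- data.frame(
  player = c("Player A", "Player A", "Player A", "Player B", "Player C"),
  tm     = c("TOT", "BOS", "LAL", "MIA", "DEN"),
  g      = c(60, 25, 35, 70, 12),
  stringsAsFactors = FALSE
)

# Multi-team players have T + 1 rows (T team rows plus one TOT row).
rows_per_player <- table(nba$player)
multi_team <- names(rows_per_player)[rows_per_player > 1]

# Keep the TOT row for multi-team players and the only row for everyone else.
keep <- ifelse(nba$player %in% multi_team, nba$tm == "TOT", TRUE)
nba_one_row <- nba[keep, ]

nba_one_row
```

Keeping the TOT row summarizes the full season. If your hypothesis is about teams rather than players, dropping the TOT rows instead would be the more natural choice.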

The data was collected using the ballr package in R, which gathers data from basketball-reference.com.

Variable Description
player Name of player
pos Player’s designated position
age Player’s age on February 1st of the season
tm Name of team
g Number of games
gs Number of games started
mp Number of minutes played
fg Field goals per 100 team possessions
fga Field goal attempts per 100 team possessions
fgpercent Field goal percentage
x3p 3 point field goals per 100 team possessions
x3pa 3 point field goal attempts per 100 team possessions
x3ppercent 3 point field goal percentage
x2p 2 point field goals per 100 team possessions
x2pa 2 point field goal attempts per 100 team possessions
x2ppercent 2 point field goal percentage
ft Free throws per 100 team possessions
fta Free throw attempts per 100 team possessions
ftpercent Free throw percentage
orb Offensive rebounds per 100 team possessions
drb Defensive rebounds per 100 team possessions
trb Total rebounds per 100 team possessions
ast Assists per 100 team possessions
stl Steals per 100 team possessions
blk Blocks per 100 team possessions
tov Turnovers per 100 team possessions
pf Personal fouls per 100 team possessions
pts Points per 100 team possessions
ortg Offensive Rating - an estimate of points produced per 100 possessions
drtg Defensive Rating - an estimate of points allowed per 100 possessions
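As a rough illustration of the clustering idea mentioned above, the sketch below runs k-means on a few standardized statistics and cross-tabulates the clusters against `pos`. It uses simulated toy data, not the real dataset. The choice of k-means, of two clusters, and of the three features is an assumption made for the example, not a requirement of the project.

```r
set.seed(2023)

# Simulated toy players: ten "centers" and ten "point guards" (invented values).
toy <- data.frame(
  pos  = rep(c("C", "PG"), each = 10),
  x3pa = c(rnorm(10, mean = 2,  sd = 0.5), rnorm(10, mean = 9,  sd = 0.5)),
  trb  = c(rnorm(10, mean = 16, sd = 1),   rnorm(10, mean = 5,  sd = 1)),
  ast  = c(rnorm(10, mean = 3,  sd = 0.5), rnorm(10, mean = 11, sd = 0.5))
)

# Standardize so that no single statistic dominates the distance calculation.
features <- scale(toy[, c("x3pa", "trb", "ast")])

# k-means with several random starts; the cluster numbering itself is arbitrary.
fit <- kmeans(features, centers = 2, nstart = 25)
toy$cluster <- fit$cluster

# Compare cluster labels to the predefined positions.
table(toy$pos, toy$cluster)
```

In the real data the groups will overlap far more than in this toy example, which is exactly what makes the comparison with `pos` interesting.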

WNBA Shots

The Women’s National Basketball Association (WNBA) is the top professional women’s basketball league in the world. The league records every shot players take along with contextual information about the shot such as its location, a description of the shot type, as well as the outcome. With this dataset, you can predict the success of each shot attempt to compute the expected value of shot types and compare team decision making.

This dataset contains information about 41,497 shots during the 2021-2022 WNBA season.

The data was collected using the wehoop package in R.

Variable Description
game_id Unique integer ID for each WNBA game
game_play_number Integer indicating the recorded play number for the shot attempt, where 1 indicates the first play of the game
desc String detailed description of shot attempt
shot_type String description of the shot type (e.g., dunk, layup, jump shot, etc.)
made_shot Boolean denoting if the shot was made (TRUE) or not (FALSE)
shot_value Numeric value of the shot outcome (0 for shots that were not made, and a positive value for made shots)
coordinate_x Horizontal location in feet of shot attempt where the hoop would be located at 25 feet
coordinate_y Vertical location in feet of shot attempt with respect to the target hoop (the hoop should be a little in front of 0 but the coordinate system is not exact)
shooting_team String name of the team taking the shot
home_name String name of the home team
away_name String name of the away team
home_score Integer value of the home team score after the shot
away_score Integer value of the away team score after the shot
qtr Integer denoting the quarter/period in the game
quarter_seconds_remaining Numeric integer value for number of seconds remaining in quarter/period
game_seconds_remaining Numeric integer value for number of seconds remaining in game
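Since `shot_value` is 0 for misses and the point value for makes, its average within a shot type gives the observed points per attempt, an empirical stand-in for expected value. The sketch below shows one way to compute this in base R. The toy rows are invented, and only the column names come from the table above.

```r
# Toy stand-in for the WNBA shots data (values invented for illustration only).
wnba <- data.frame(
  shot_type  = c("layup", "layup", "layup",
                 "jump shot", "jump shot", "jump shot", "jump shot"),
  made_shot  = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE),
  shot_value = c(2, 2, 0, 3, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Average make rate and average points per attempt within each shot type.
shot_summary <- aggregate(cbind(made_shot, shot_value) ~ shot_type,
                          data = wnba, FUN = mean)
names(shot_summary) <- c("shot_type", "make_rate", "points_per_attempt")

# Attach the number of attempts so small samples are easy to spot.
attempt_counts <- table(wnba$shot_type)
shot_summary$attempts <-
  as.vector(attempt_counts[as.character(shot_summary$shot_type)])

shot_summary
```

Adding `shooting_team` to the right-hand side of the formula would give the same summary per team, which is one route to comparing team decision making.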

NFL Team Statistics

The National Football League (NFL) is the top professional American football league in the world. While a team’s record ultimately determines whether or not they make the playoffs, their score differential (points for - points against) is often a better indicator of a team’s ability. But what aspects of a team’s performance are related to their score differential? Is passing more important than rushing? What about offense in comparison to defense? The NFL records a variety of statistics, and the public NFL analytics community has developed advanced metrics such as expected points added (EPA) that provide deeper insight into a team’s performance. With this dataset of statistics dating back to 1999, you can explore variation between teams as well as which types of statistics are relevant predictor variables of record and score differential.

This dataset contains statistics about the regular season performance for each NFL team from 1999 to 2022. The data was collected using the nflreadr package in R.

Each row in the dataset corresponds to a single NFL team in a single regular season. There are a total of 765 team-seasons, with 56 total columns. The column names are organized below by the type of information they contain, with the first set of columns being self-explanatory:

Variable Description
season Regular season year of team’s statistics
team NFL team three letter abbreviation

There are also columns with season level outcomes:

Variable Description
points_scored Total number of points scored by the team
points_allowed Total number of points allowed by the team
wins Number of games the team won
losses Number of games the team lost
ties Number of games the team tied
score_differential points scored - points allowed

There are also several columns corresponding to offensive and defensive summaries of the team’s performance in the season separated by play type (either pass or run):

Variable Description
offense/defense_completion_percentage Passing completion percentage either for (offense) or against (defense)
offense/defense_total_yards_gained_pass/run Total number of yards gained (offense) or allowed (defense) by play type (pass or run)
offense/defense_ave_yards_gained_pass/run Average number of yards gained (offense) or allowed (defense) per play by play type (pass or run)
offense/defense_total_air_yards Total number of air yards gained (offense) or allowed (defense), where air yards correspond to perpendicular yards traveled from the line of scrimmage to location of catch for passing plays
offense/defense_ave_air_yards Average number of air yards gained (offense) or allowed (defense) per passing play
offense/defense_total_yac Total number of yards after catch gained (offense) or allowed (defense)
offense/defense_ave_yac Average number of yards after catch gained (offense) or allowed (defense) per passing play
offense/defense_n_plays_pass/run Total number of plays by the team (offense) or against (defense) by play type (pass or run)
offense/defense_n_interceptions Total number of interceptions thrown (offense) or caught (defense)
offense/defense_n_fumbles_lost_pass/run Total number of fumbles lost (offense) or forced (defense) by play type (pass or run)
offense/defense_total_epa_pass/run Total expected points added (offense) or allowed (defense) by play type (pass or run)
offense/defense_ave_epa_pass/run Average expected points added (offense) or allowed (defense) per play by play type (pass or run)
offense/defense_total_wpa_pass/run Total win probability added (offense) or allowed (defense) by play type (pass or run)
offense/defense_ave_wpa_pass/run Average win probability added (offense) or allowed (defense) per play by play type (pass or run)
offense/defense_success_rate_pass/run Proportion of plays with positive expected points added (offense) or allowed (defense) by play type (pass or run)

The EPA variables are advanced NFL statistics, conveying how much value a team is adding over the average team in a given situation. EPA is on a points scale instead of the typically used yards, because not all yards are created equal in American football (a 10 yard gain on 3rd and 15 is much less valuable than a 2 yard gain on 4th and 1). For offensive stats the higher the EPA the better, but for defensive stats the lower (more negative) the EPA the better. The WPA variables are similar except they measure play value in terms of win probability.
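The sign convention for EPA can be checked with a quick correlation against score differential. The sketch below is only an illustration on invented toy values. The column names `points_scored`, `offense_ave_epa_pass`, and `defense_ave_epa_pass` are assumed expansions of the naming patterns in the tables above, so confirm them with `names()` on the real data.

```r
# Toy stand-in for the NFL team-season data (values invented for illustration).
nfl <- data.frame(
  season               = c(2021, 2021, 2021, 2021),
  team                 = c("AAA", "BBB", "CCC", "DDD"),
  points_scored        = c(450, 380, 300, 410),
  points_allowed       = c(350, 390, 420, 370),
  offense_ave_epa_pass = c(0.20, 0.05, -0.10, 0.12),
  defense_ave_epa_pass = c(-0.05, 0.04, 0.15, 0.00)
)

# Score differential as defined above: points scored minus points allowed.
nfl$score_differential <- nfl$points_scored - nfl$points_allowed

# Higher offensive EPA should go with a better differential (positive correlation);
# lower (more negative) defensive EPA should too (negative correlation).
cor_offense <- cor(nfl$offense_ave_epa_pass, nfl$score_differential)
cor_defense <- cor(nfl$defense_ave_epa_pass, nfl$score_differential)

c(offense = cor_offense, defense = cor_defense)
```

Remembering that "good" defense means negative EPA avoids misreading a negative correlation as evidence that defense hurts a team.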

NHL Shots

The National Hockey League (NHL) is the top professional men’s hockey league in the world. The league records every shot players take along with contextual information about the shot such as its location, the player’s distance and angle to the goal when attempting the shot, as well as the outcome (blocked, missed, saved, or goal). Using this information, the hockey analytics community has developed measures of shot quality known as expected goals. With this dataset, you can create your own expected goals model to predict the shot outcome given relevant features.

This dataset contains information about 104,316 shots during the 2021-2022 NHL season.

The data was collected using the hockeyR package in R.

Variable Description
description String detailed description of event
shot_outcome String denoting the outcome of the shot, either BLOCKED_SHOT (meaning blocked by a non-goalie), GOAL, MISSED_SHOT (shot that missed the net), or SHOT (shot on net that was saved by a goalie)
period Integer value of the game period
period_seconds_remaining Numeric value of the seconds remaining in the period
game_seconds_remaining Numeric value of the seconds remaining in the game; negative for overtime periods
home_score Integer value of the home team score after the event
away_score Integer value of the away team score after the event
home_name String name of the home team
away_name String name of the away team
event_team String defining the team taking the shot
event_player_1_name String name of the primary event player
event_player_1_type String indicator for the role of event_player_1 (typically the shooter)
event_player_2_name String name of the secondary event player
event_player_2_type String indicator for the role of event_player_2 (blocker, assist, or goalie)
strength_code String indicator for game strength: EV (Even), SH (Shorthanded), or PP (Power Play)
x_fixed Numeric transformed x-coordinate of event in feet, where the home team always shoots to the right, away team to the left
y_fixed Numeric transformed y-coordinate of event in feet, where the home team always shoots to the right, away team to the left
shot_distance Numeric distance (in feet) to center of net for unblocked shot events
shot_angle Numeric angle (in degrees) to center of net for unblocked shot events
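One simple form an expected goals model can take is a logistic regression of goal versus no goal on `shot_distance` and `shot_angle`. The sketch below fits such a model with `glm()` on simulated shots. The simulation coefficients are invented, and logistic regression is only one possible modeling choice, not the method used by any particular published model.

```r
set.seed(42)
n <- 500

# Simulated shots (invented values); in the real data, first drop BLOCKED_SHOT
# rows, since distance and angle are only recorded for unblocked shots.
shots <- data.frame(
  shot_distance = runif(n, min = 5, max = 60),
  shot_angle    = runif(n, min = 0, max = 80)
)
true_prob <- plogis(-0.5 - 0.05 * shots$shot_distance - 0.01 * shots$shot_angle)
shots$shot_outcome <- ifelse(runif(n) < true_prob, "GOAL", "SHOT")

# Binary response: 1 for a goal, 0 for any other unblocked outcome.
shots <- shots[shots$shot_outcome != "BLOCKED_SHOT", ]
shots$is_goal <- as.integer(shots$shot_outcome == "GOAL")

# Logistic regression: log-odds of a goal as a linear function of the features.
xg_fit <- glm(is_goal ~ shot_distance + shot_angle,
              data = shots, family = binomial)

# Predicted goal probability for each shot, i.e. its expected goals value.
shots$xg <- predict(xg_fit, type = "response")

summary(shots$xg)
```

Summing `xg` by `event_team` or `event_player_1_name` in the real data gives expected goal totals that can be compared with actual goals scored.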

NWSL Team Statistics

The National Women’s Soccer League (NWSL) is the top professional women’s soccer league in the United States. While a team’s record ultimately determines their ranking, goal differential (goals scored - goals conceded) is often a better indicator of a team’s ability. But what aspects of a team’s performance are related to their goal differential? The NWSL records a variety of statistics describing a team’s performance, such as the percentage of time they maintain possession, percentage of shots on target, etc. With this dataset, you can explore variation between teams as well as which statistics are relevant predictor variables of goal differential.

This dataset contains statistics about the regular season performance for each NWSL team from 2016 to 2022 (excluding 2020, which was cancelled due to COVID).

The data was collected using the nwslR package in R.

Variable Description
team_name Name of NWSL team
season Regular season year of team’s statistics
games_played Number of games team played in season
goal_differential Goals scored - goals conceded
goals Number of goals scored
goals_conceded Number of goals conceded
cross_accuracy Percent of crosses that were successful
goal_conversion_pct Percent of shots scored
pass_pct Pass accuracy
pass_pct_opposition_half Pass accuracy in opposition half
possession_pct Percentage of overall ball possession the team had during the season
shot_accuracy Percentage of shots on target
tackle_success_pct Percent of successful tackles
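To relate a team statistic to goal differential, a reasonable first pass is a simple linear regression. The sketch below recomputes `goal_differential` from `goals` and `goals_conceded`, scales it per game in case seasons differ in length, and regresses it on `possession_pct`. The toy values are invented, and the per-game scaling is an assumption of this example rather than part of the dataset.

```r
# Toy stand-in for the NWSL team-season data (values invented for illustration).
nwsl <- data.frame(
  team_name      = c("Team A", "Team B", "Team C", "Team D"),
  season         = c(2021, 2021, 2022, 2022),
  games_played   = c(24, 24, 22, 22),
  goals          = c(40, 25, 33, 20),
  goals_conceded = c(22, 30, 24, 31),
  possession_pct = c(55, 47, 52, 46)
)

# Goal differential as defined above: goals scored minus goals conceded.
nwsl$goal_differential <- nwsl$goals - nwsl$goals_conceded

# Put seasons of different lengths on a comparable scale.
nwsl$goal_diff_per_game <- nwsl$goal_differential / nwsl$games_played

# Simple linear regression of per-game goal differential on possession.
gd_fit <- lm(goal_diff_per_game ~ possession_pct, data = nwsl)

coef(gd_fit)
```

With only a few dozen team-seasons in the real data, a scatterplot of the two variables is worth examining before trusting any fitted slope.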

WTA Grand Slam Matches

The Women’s Tennis Association (WTA) organizes the top women’s professional tennis tour in the world. Throughout the year, there are four major tournaments yielding the most ranking points, prize money, and fame. These are known as the Grand Slam tournaments, consisting of (in order): Australian Open, French Open (aka Roland Garros), Wimbledon, and the US Open. With this dataset of information about winners and losers in WTA Grand Slam matches from 2018 to 2022, you’ll be able to explore statistics collected during matches and information about the athletes to predict match outcomes.

This dataset contains all WTA Grand Slam matches between 2018 and 2022, courtesy of Jeff Sackmann’s famous tennis repository.

There are 2,413 rows in this dataset where each row corresponds to a single WTA Grand Slam match. Each row has 38 columns with general information about the matches, as well as columns describing the winner and loser of the matches:

Variable Description
tourney_name name of the Grand Slam Tournament (French Open is recorded as ROLAND GARROS)
surface type of court surface
tourney_date eight digits, YYYYMMDD, usually the Monday of the tournament week
winner/loser_seed seed of winning/losing player
winner/loser_name Name of the winning/losing player
winner/loser_hand R = right, L = left, U = unknown. For ambidextrous players, this is their serving hand
winner/loser_ht height in centimeters, where available
winner/loser_ioc three-character country code
winner/loser_age age, in years, as of the tourney_date
score final match score
round tournament round
minutes match length in minutes
w/l_ace winner/loser’s number of aces
w/l_df winner/loser’s number of double faults
w/l_svpt winner/loser’s number of serve points
w/l_1stIn winner/loser’s number of first serves made
w/l_1stWon winner/loser’s number of first-serve points won
w/l_2ndWon winner/loser’s number of second-serve points won
w/l_SvGms winner/loser’s number of serve games
w/l_bpSaved winner/loser’s number of break points saved
w/l_bpFaced winner/loser’s number of break points faced
winner/loser_rank winner/loser’s WTA rank, as of the tourney_date, or the most recent ranking date before the tourney_date
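The raw serve counts are easier to compare across matches once they are converted to rates, and `tourney_date` is easier to work with as a proper date. The sketch below shows both steps for the winner's columns. The toy rows are invented, the column names `w_svpt`, `w_1stIn`, `w_1stWon`, and `w_2ndWon` are the assumed expansions of the `w/l_` pattern above, and treating `w_svpt - w_1stIn` as the number of second-serve points is also an assumption.

```r
# Toy stand-in for the WTA Grand Slam data (values invented for illustration).
wta <- data.frame(
  tourney_name = c("AUSTRALIAN OPEN", "WIMBLEDON"),
  tourney_date = c(20220117, 20220627),
  w_svpt       = c(60, 80),
  w_1stIn      = c(42, 48),
  w_1stWon     = c(30, 36),
  w_2ndWon     = c(9, 16),
  stringsAsFactors = FALSE
)

# Convert the eight-digit YYYYMMDD value into a Date.
wta$tourney_date <- as.Date(as.character(wta$tourney_date), format = "%Y%m%d")

# Share of serve points where the first serve went in.
wta$w_1st_in_pct <- wta$w_1stIn / wta$w_svpt

# Share of first-serve points won, given the first serve went in.
wta$w_1st_win_pct <- wta$w_1stWon / wta$w_1stIn

# Share of second-serve points won (assumes second-serve points = svpt - 1stIn).
wta$w_2nd_win_pct <- wta$w_2ndWon / (wta$w_svpt - wta$w_1stIn)

wta[, c("tourney_date", "w_1st_in_pct", "w_1st_win_pct", "w_2nd_win_pct")]
```

Repeating the same calculations for the `l_` columns lets you compare winners and losers on the same scale.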

Note that a full glossary of the features available for match data can be found here.

References

Dror A (2023). nwslR: Compiles dataset for the National Women’s Soccer League (NWSL). R package version 0.0.0.9001.

Elmore R (2020). ballr: Access to Current and Historical Basketball Data. R package version 0.2.6.

Gilani S, Hutchinson G (2022). wehoop: Access Women’s Basketball Play by Play Data. R package version 1.5.0, https://CRAN.R-project.org/package=wehoop.

Ho T, Carl S (2022). nflreadr: Download ‘nflverse’ Data. R package version 1.3.1, https://CRAN.R-project.org/package=nflreadr.

Howell B, Gilani S (2022). fastRhockey: Functions to Access Premier Hockey Federation and National Hockey League Play by Play Data. R package version 0.4.0, https://CRAN.R-project.org/package=fastRhockey.

Morse D (2023). hockeyR: Collect and Clean Hockey Stats. R package version 1.3.1, https://github.com/danmorse314/hockeyR.

WTA data accessed from Jeff Sackmann’s tennis GitHub repository