EDA Mini-Project Requirements and Data
Overview
This mini-project will begin on Thursday, June 8 and conclude with a 10 minute (maximum) presentation one week later on Friday, June 16. Students will be placed into groups of two or three and randomly assigned one of six sports datasets. The goal of this project is to practice understanding the structure of a dataset, generating hypotheses, and using exploratory data analysis and data visualization to attempt to evaluate those hypotheses.
Deliverables
Each team is expected to produce slides to accompany their 5 minute presentation with the following information:
- Explanation of the data structure of the dataset
- Three hypotheses about the dataset
- Three data visualizations exploring the hypotheses, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one categorical and one continuous visualization.
- One clustering example
- Conclusions reached for the hypotheses using the data visualizations
Timeline
Groups and datasets will be assigned during the lab session on Thursday 6/8. You will then have the rest of that lab session, as well as the lab sessions on Monday, Wednesday, and Thursday of the following week, to complete the mini-project.
Thursday, June 15 @ 5:00pm EST - Slides must be completed and ready for presentation. Send your slides to Meg’s email (mellingw@andrew.cmu.edu). All code and visualizations must be done in R, but the slides may be created in any program.
Data
EDA projects data overview
There are six different datasets for the EDA projects (linked here):
- NBA Player Statistics
- WNBA Shots
- NFL Team Statistics
- NHL Shots
- NWSL Team Statistics
- WTA Grand Slam Matches
These datasets were curated by Ron Yurko as part of the SCORE project, and his description of each dataset can be found below.
NBA Player Statistics
The National Basketball Association (NBA) is the top men’s professional basketball league in the world. While players have predefined positions, the sport is becoming increasingly positionless - with centers attempting more three point shots and guards driving the ball inside to dunk. With this dataset, you can explore clustering NBA players based on various types of statistics and compare your cluster labels to the predefined positions.
This dataset contains statistics about 812 player-team stints during the 2021-2022 NBA regular season. For players who played for more than one team during the season (say \(T\) teams, due to trades), there are \(T+1\) rows: one row for their performance with each of the \(T\) teams and another row indicating their total performance (where tm = TOT) across the full season regardless of team. The counting stats are reported on a per 100 team possessions scale, to normalize for playing time differences.
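Because traded players appear in several rows, most analyses will want exactly one row per player. The following is a minimal sketch of one way to do that in base R, keeping the TOT row for traded players. The data frame `nba_toy` and its values are made up for illustration, and the sketch assumes that player names are unique, which should be checked in the real data.

```r
# Toy stand-in for the NBA data; names and values are hypothetical.
nba_toy <- data.frame(
  player = c("Player A", "Player A", "Player A", "Player B", "Player C"),
  tm     = c("TOT", "BOS", "LAL", "MIA", "DEN"),
  mp     = c(1500, 900, 600, 2000, 300)
)

# Players with more than one row were traded (T + 1 rows each).
rows_per_player <- table(nba_toy$player)
traded <- names(rows_per_player)[rows_per_player > 1]

# Keep the TOT row for traded players and the single row for everyone else.
is_traded <- nba_toy$player %in% traded
keep <- (is_traded & nba_toy$tm == "TOT") | !is_traded
nba_one_row <- nba_toy[keep, ]
```

Whether to keep the TOT row or the individual team rows depends on the hypothesis: team-level questions need the per-team rows instead.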
The data was collected using the ballr package in R, which gathers data from basketball-reference.com.
Variable | Description |
---|---|
player | Name of player |
pos | Player’s designated position |
age | Player’s age on February 1st of the season |
tm | Name of team |
g | Number of games |
gs | Number of games started |
mp | Number of minutes played |
fg | Field goals per 100 team possessions |
fga | Field goal attempts per 100 team possessions |
fgpercent | Field goal percentage |
x3p | 3 point field goals per 100 team possessions |
x3pa | 3 point field goal attempts per 100 team possessions |
x3ppercent | 3 point field goal percentage |
x2p | 2 point field goals per 100 team possessions |
x2pa | 2 point field goal attempts per 100 team possessions |
x2ppercent | 2 point field goal percentage |
ft | Free throws per 100 team possessions |
fta | Free throw attempts per 100 team possessions |
ftpercent | Free throw percentage |
orb | Offensive rebounds per 100 team possessions |
drb | Defensive rebounds per 100 team possessions |
trb | Total rebounds per 100 team possessions |
ast | Assists per 100 team possessions |
stl | Steals per 100 team possessions |
blk | Blocks per 100 team possessions |
tov | Turnovers per 100 team possessions |
pf | Personal fouls per 100 team possessions |
pts | Points per 100 team possessions |
ortg | Offensive Rating - an estimate of points produced per 100 possessions |
drtg | Defensive Rating - an estimate of points allowed per 100 possessions |
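As a rough illustration of the clustering idea described for this dataset, the sketch below runs k-means on a few standardized statistics and cross-tabulates the cluster labels against position. The data are simulated, the choice of `x3pa`, `trb`, and `ast` is arbitrary, and k-means with two clusters is only one of many reasonable approaches, not a prescribed method.

```r
set.seed(2023)

# Simulated stand-in: 10 "centers" and 10 "point guards" with made-up stats.
nba_toy <- data.frame(
  pos  = rep(c("C", "PG"), each = 10),
  x3pa = c(rnorm(10, 2, 0.5), rnorm(10, 9, 0.5)),
  trb  = c(rnorm(10, 15, 1),  rnorm(10, 5, 1)),
  ast  = c(rnorm(10, 3, 0.5), rnorm(10, 10, 0.5))
)

# Standardize so that no single statistic dominates the distance calculation.
stats_scaled <- scale(nba_toy[, c("x3pa", "trb", "ast")])

# nstart > 1 reruns k-means from several random starts and keeps the best fit.
km <- kmeans(stats_scaled, centers = 2, nstart = 25)

# Compare cluster labels to the predefined positions.
cluster_vs_pos <- table(cluster = km$cluster, position = nba_toy$pos)
```

With the real data, the number of clusters and the set of variables are choices you would need to justify in your slides.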
WNBA Shots
The Women’s National Basketball Association (WNBA) is the top professional women’s basketball league in the world. The league records every shot players take along with contextual information about the shot such as its location, a description of the shot type, as well as the outcome. With this dataset, you can predict the success of each shot attempt to compute the expected value of shot types and compare team decision making.
This dataset contains information about 41,497 shots during the 2021-2022 WNBA season.
The data was collected using the wehoop package in R.
Variable | Description |
---|---|
game_id | Unique integer ID for each WNBA game |
game_play_number | Integer indicating the recorded play number for the shot attempt, where 1 indicates the first play of the game |
desc | String detailed description of shot attempt |
shot_type | String description of the shot type (e.g., dunk, layup, jump shot, etc.) |
made_shot | Boolean denoting if the shot was made (TRUE) or not (FALSE) |
shot_value | Numeric value of the shot outcome (0 for shots that were not made, and a positive value for made shots) |
coordinate_x | Horizontal location in feet of shot attempt where the hoop would be located at 25 feet |
coordinate_y | Vertical location in feet of shot attempt with respect to the target hoop (the hoop should be a little in front of 0 but the coordinate system is not exact) |
shooting_team | String name of the team taking the shot |
home_name | String name of the home team |
away_name | String name of the away team |
home_score | Integer value of the home team score after the shot |
away_score | Integer value of the away team score after the shot |
qtr | Integer denoting the quarter/period in the game |
quarter_seconds_remaining | Numeric integer value for number of seconds remaining in quarter/period |
game_seconds_remaining | Numeric integer value for number of seconds remaining in game |
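Since shot_value is 0 for missed shots, the average of shot_value within a shot type is an empirical estimate of the expected points per attempt for that type. The sketch below shows this with base R on a made-up data frame (`wnba_toy` and its rows are hypothetical); it is a simple summary, not a predictive model.

```r
# Toy stand-in for the WNBA shots data; rows are invented for illustration.
wnba_toy <- data.frame(
  shot_type  = c("Layup", "Layup", "Jump Shot", "Jump Shot", "Jump Shot", "Layup"),
  made_shot  = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE),
  shot_value = c(2, 0, 3, 0, 0, 2)
)

# Average both columns within each shot type:
# mean(made_shot) is the make rate, mean(shot_value) is points per attempt.
shot_summary <- aggregate(
  cbind(made_shot, shot_value) ~ shot_type,
  data = wnba_toy,
  FUN = mean
)
names(shot_summary) <- c("shot_type", "make_rate", "points_per_attempt")
```

The same summary can be grouped by shooting_team as well to compare team decision making.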
NFL Team Statistics
The National Football League (NFL) is the top professional American football league in the world. While a team’s record ultimately determines whether or not they make the playoffs, their score differential (points for - points against) is often a better indicator of a team’s ability. But what aspects of a team’s performance are related to their point differential? Is passing more important than rushing? What about offense in comparison to defense? The NFL records a variety of statistics, and the public NFL analytics community has developed advanced metrics such as expected points added (EPA) that provide deeper insight into a team’s performance. With this dataset of statistics dating back to 1999, you can explore variation between teams as well as which types of statistics are relevant predictor variables of record and point differential.
This dataset contains statistics about the regular season performance for each NFL team from 1999 to 2022. The data was collected using the nflreadr package in R.
Each row in the dataset corresponds to a single NFL team in a single regular season. There are a total of 765 team-seasons, with 56 total columns. The column names are organized below by the type of information they contain, with the first set of columns being self-explanatory:
Variable | Description |
---|---|
season | Regular season year of team’s statistics |
team | NFL team three letter abbreviation |
There are also columns with season level outcomes:
Variable | Description |
---|---|
points_score | Total number of points scored by the team |
points_allowed | Total number of points allowed by the team |
wins | Number of games the team won |
losses | Number of games the team lost |
ties | Number of games the team tied |
score_differential | points scored - points allowed |
There are also several columns corresponding to offensive and defensive summaries of the team’s performance in the season separated by play type (either pass or run):
Variable | Description |
---|---|
offense/defense_completion_percentage | Passing completion percentage either for (offense) or against (defense) |
offense/defense_total_yards_gained_pass/run | Total number of yards gained (offense) or allowed (defense) by play type (pass or run) |
offense/defense_ave_yards_gained_pass/run | Average number of yards gained (offense) or allowed (defense) per play by play type (pass or run) |
offense/defense_total_air_yards | Total number of air yards gained (offense) or allowed (defense), where air yards correspond to perpendicular yards traveled from the line of scrimmage to location of catch for passing plays |
offense/defense_ave_air_yards | Average number of air yards gained (offense) or allowed (defense) per passing play |
offense/defense_total_yac | Total number of yards after catch gained (offense) or allowed (defense) |
offense/defense_ave_yac | Average number of yards after catch gained (offense) or allowed (defense) per passing play |
offense/defense_n_plays_pass/run | Total number of plays by the team (offense) or against (defense) by play type (pass or run) |
offense/defense_n_interceptions | Total number of interceptions thrown (offense) or caught (defense) |
offense/defense_n_fumbles_lost_pass/run | Total number of fumbles lost (offense) or forced (defense) by play type (pass or run) |
offense/defense_total_epa_pass/run | Total expected points added (offense) or allowed (defense) by play type (pass or run) |
offense/defense_ave_epa_pass/run | Average expected points added (offense) or allowed (defense) per play by play type (pass or run) |
offense/defense_total_wpa_pass/run | Total win probability added (offense) or allowed (defense) by play type (pass or run) |
offense/defense_ave_wpa_pass/run | Average win probability added (offense) or allowed (defense) per play by play type (pass or run) |
offense/defense_success_rate_pass/run | Proportion of plays with positive expected points added (offense) or allowed (defense) by play type (pass or run) |
The EPA variables are advanced NFL statistics, conveying how much value a team is adding over the average team in a given situation. EPA is on a points scale instead of the typically used yards, because not all yards are created equal in American football (a 10 yard gain on 3rd and 15 is much less valuable than a 2 yard gain on 4th and 1). For offensive stats the higher the EPA the better, but for defensive stats the lower (more negative) the EPA the better. The WPA variables are similar except they are measuring play value in terms of win probability.
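The sketch below illustrates two mechanics: expanding a shorthand entry such as offense/defense_ave_epa_pass/run into individual column names, and correlating each column with score_differential. It assumes the shorthand expands to names like `offense_ave_epa_pass`, which should be confirmed with `names()` on the real data. The data are simulated with the sign convention described above built in, so the correlations only demonstrate the workflow and are not findings.

```r
set.seed(1999)

# Expand the shorthand into four column names (assumed naming pattern).
epa_cols <- as.vector(outer(
  c("offense", "defense"), c("pass", "run"),
  function(side, play) paste0(side, "_ave_epa_", play)
))

# Simulated team-seasons: higher offensive EPA and lower defensive EPA
# are constructed to go with a better score differential.
n <- 200
nfl_toy <- as.data.frame(
  matrix(rnorm(n * 4, sd = 0.1), ncol = 4, dimnames = list(NULL, epa_cols))
)
nfl_toy$score_differential <- 500 * (
  nfl_toy$offense_ave_epa_pass + nfl_toy$offense_ave_epa_run -
    nfl_toy$defense_ave_epa_pass - nfl_toy$defense_ave_epa_run
) + rnorm(n, sd = 20)

# Correlation of each EPA column with score differential.
epa_cors <- sapply(nfl_toy[epa_cols], cor, y = nfl_toy$score_differential)
```

On the real data, comparing the magnitudes of these correlations is one simple way to start on the passing versus rushing and offense versus defense questions.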
NHL Shots
The National Hockey League (NHL) is the top professional men’s hockey league in the world. The league records every shot players take along with contextual information about the shot such as its location, the player’s distance and angle to the goal when attempting the shot, as well as the outcome (blocked, missed, or goal). Using this information, the hockey analytics community has developed measures of shot quality known as expected goals. With this dataset, you can create your own expected goals model to predict the shot outcome given relevant features.
This dataset contains information about 104,316 shots during the 2021-2022 NHL season.
The data was collected using the hockeyR package in R.
Variable | Description |
---|---|
description | String detailed description of event |
shot_outcome | String denoting the outcome of the shot, either BLOCKED_SHOT (meaning blocked by a non-goalie), GOAL, MISSED_SHOT (shot that missed the net), or SHOT (shot on net that was saved by a goalie) |
period | Integer value of the game period |
period_seconds_remaining | Numeric value of the seconds remaining in the period |
game_seconds_remaining | Numeric value of the seconds remaining in the game; negative for overtime periods |
home_score | Integer value of the home team score after the event |
away_score | Integer value of the away team score after the event |
home_name | String name of the home team |
away_name | String name of the away team |
event_team | String defining the team taking the shot |
event_player_1_name | String name of the primary event player |
event_player_1_type | String indicator for the role of event_player_1 (typically the shooter) |
event_player_2_name | String name of the secondary event player |
event_player_2_type | String indicator for the role of event_player_2 (blocker, assist, or goalie) |
strength_code | String indicator for game strength: EV (Even), SH (Shorthanded), or PP (Power Play) |
x_fixed | Numeric transformed x-coordinate of event in feet, where the home team always shoots to the right, away team to the left |
y_fixed | Numeric transformed y-coordinate of event in feet, where the home team always shoots to the right, away team to the left |
shot_distance | Numeric distance (in feet) to center of net for unblocked shot events |
shot_angle | Numeric angle (in degrees) to center of net for unblocked shot events |
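A very simple expected goals model can be sketched as a logistic regression of a goal indicator on shot_distance and shot_angle. The example below uses simulated shots (the coefficients used to generate them are invented), and the helper columns `is_goal` and `xg` are hypothetical names. With the real data you would first drop blocked shots, since distance and angle are only recorded for unblocked shot events.

```r
set.seed(1)

# Simulated unblocked shots; closer shots are made more likely to score.
n <- 500
nhl_toy <- data.frame(
  shot_distance = runif(n, 5, 60),
  shot_angle    = runif(n, 0, 80)
)
p_goal <- plogis(-0.5 - 0.06 * nhl_toy$shot_distance)
nhl_toy$shot_outcome <- ifelse(runif(n) < p_goal, "GOAL", "SHOT")

# Binary response: 1 if the shot was a goal, 0 otherwise.
nhl_toy$is_goal <- as.integer(nhl_toy$shot_outcome == "GOAL")

# Logistic regression as a minimal expected goals model.
xg_fit <- glm(is_goal ~ shot_distance + shot_angle,
              data = nhl_toy, family = binomial)

# Predicted goal probability for each shot.
nhl_toy$xg <- predict(xg_fit, type = "response")
```

Summing `xg` by event_team or by shooter then gives an expected goals total to compare against actual goals.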
NWSL Team Statistics
The National Women’s Soccer League (NWSL) is the top professional women’s soccer league in the United States. While a team’s record ultimately determines their ranking, goal differential (goals scored - goals conceded) is often a better indicator of a team’s ability. But what aspects of a team’s performance are related to their goal differential? The NWSL records a variety of statistics describing a team’s performance, such as the percentage of time they maintain possession, percentage of shots on target, etc. With this dataset, you can explore variation between teams as well as which statistics are relevant predictor variables of goal differential.
This dataset contains statistics about the regular season performance for each NWSL team from 2016 to 2022 (excluding 2020 which was cancelled due to COVID).
The data was collected using the nwslR package in R.
Variable | Description |
---|---|
team_name | Name of NWSL team |
season | Regular season year of team’s statistics |
games_played | Number of games team played in season |
goal_differential | Goals scored - goals conceded |
goals | Number of goals scored |
goals_conceded | Number of goals conceded |
cross_accuracy | Percent of crosses that were successful |
goal_conversion_pct | Percent of shots scored |
pass_pct | Pass accuracy |
pass_pct_opposition_half | Pass accuracy in opposition half |
possession_pct | Percentage of overall ball possession the team had during the season |
shot_accuracy | Percentage of shots on target |
tackle_success_pct | Percent of successful tackles |
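A quick structural check for this dataset is to confirm that goal_differential equals goals minus goals_conceded, and to put it on a per-game scale in case seasons differ in length. The sketch below uses an invented data frame (`nwsl_toy`, including its games_played values), and `goal_diff_per_game` is a hypothetical helper column.

```r
# Toy stand-in for the NWSL data; all values are made up.
nwsl_toy <- data.frame(
  team_name      = c("Team A", "Team B", "Team A", "Team B"),
  season         = c(2019, 2019, 2022, 2022),
  games_played   = c(24, 24, 22, 22),
  goals          = c(40, 20, 33, 22),
  goals_conceded = c(25, 30, 22, 33)
)

# Goal differential as defined in the table: goals scored - goals conceded.
nwsl_toy$goal_differential <- nwsl_toy$goals - nwsl_toy$goals_conceded

# Per-game version, which is comparable across seasons of different length.
nwsl_toy$goal_diff_per_game <- nwsl_toy$goal_differential / nwsl_toy$games_played
```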
WTA Grand Slam Matches
The Women’s Tennis Association (WTA) organizes the top women’s professional tennis tour in the world. Throughout the year, there are four major tournaments yielding the most ranking points, prize money, and fame. These are known as the Grand Slam tournaments, consisting of (in order): Australian Open, French Open (aka Roland Garros), Wimbledon, and the US Open. With this dataset of information about winners and losers in WTA Grand Slam matches from 2018 to 2022, you’ll be able to explore statistics collected during matches and information about the athletes to predict match outcomes.
This dataset contains all WTA Grand Slam matches between 2018 and 2022, courtesy of Jeff Sackmann’s famous tennis repository.
There are 2,413 rows in this dataset where each row corresponds to a single WTA Grand Slam match. Each row has 38 columns with general information about the matches, as well as columns describing the winner and loser of the matches:
Variable | Description |
---|---|
tourney_name | Name of the Grand Slam tournament (French Open is recorded as ROLAND GARROS) |
surface | Type of court surface |
tourney_date | Eight digits, YYYYMMDD, usually the Monday of the tournament week |
winner/loser_seed | Seed of winning/losing player |
winner/loser_name | Name of the winning/losing player |
winner/loser_hand | R = right, L = left, U = unknown. For ambidextrous players, this is their serving hand |
winner/loser_ht | Height in centimeters, where available |
winner/loser_ioc | Three-character country code |
winner/loser_age | Age, in years, as of the tourney_date |
score | Final match score |
round | Tournament round |
minutes | Match length in minutes |
w/l_ace | Winner/loser’s number of aces |
w/l_df | Winner/loser’s number of double faults |
w/l_svpt | Winner/loser’s number of serve points |
w/l_1stIn | Winner/loser’s number of first serves made |
w/l_1stWon | Winner/loser’s number of first-serve points won |
w/l_2ndWon | Winner/loser’s number of second-serve points won |
w/l_SvGms | Winner/loser’s number of serve games |
w/l_bpSaved | Winner/loser’s number of break points saved |
w/l_bpFaced | Winner/loser’s number of break points faced |
winner/loser_rank | Winner/loser’s WTA rank, as of the tourney_date, or the most recent ranking date before the tourney_date |
Note that a full glossary of the features available for match data can be found here.
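The raw serve counts are easier to compare across matches once they are converted to rates, and tourney_date needs parsing before it can be used as a date. The sketch below assumes the w/l_ shorthand expands to columns such as `w_svpt` and `l_svpt`. The data frame `wta_toy`, its values, and the derived column names (`w_1st_in_pct`, `l_1st_in_pct`, `w_1st_win_pct`) are hypothetical.

```r
# Toy stand-in for the WTA data; match statistics are invented.
wta_toy <- data.frame(
  tourney_date = c(20180115L, 20220829L),
  w_svpt = c(60, 80), w_1stIn = c(39, 48), w_1stWon = c(30, 36),
  l_svpt = c(70, 90), l_1stIn = c(42, 45), l_1stWon = c(25, 27)
)

# Parse the eight-digit YYYYMMDD value into a Date.
wta_toy$tourney_date <- as.Date(as.character(wta_toy$tourney_date),
                                format = "%Y%m%d")

# Share of serve points where the first serve went in.
wta_toy$w_1st_in_pct <- wta_toy$w_1stIn / wta_toy$w_svpt
wta_toy$l_1st_in_pct <- wta_toy$l_1stIn / wta_toy$l_svpt

# Share of first-serve points won, given the first serve went in.
wta_toy$w_1st_win_pct <- wta_toy$w_1stWon / wta_toy$w_1stIn
```

Because every row has a winner and a loser, comparing the `w_` and `l_` versions of the same rate is a natural starting point for a hypothesis about what separates winners from losers.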
References
Dror A (2023). nwslR: Compiles dataset for the National Women’s Soccer League (NWSL). R package version 0.0.0.9001.
Elmore R (2020). ballr: Access to Current and Historical Basketball Data. R package version 0.2.6.
Gilani S, Hutchinson G (2022). wehoop: Access Women’s Basketball Play by Play Data. R package version 1.5.0, https://CRAN.R-project.org/package=wehoop.
Ho T, Carl S (2022). nflreadr: Download ‘nflverse’ Data. R package version 1.3.1, https://CRAN.R-project.org/package=nflreadr.
Howell B, Gilani S (2022). fastRhockey: Functions to Access Premier Hockey Federation and National Hockey League Play by Play Data. R package version 0.4.0, https://CRAN.R-project.org/package=fastRhockey.
Morse D (2023). hockeyR: Collect and Clean Hockey Stats. R package version 1.3.1, https://github.com/danmorse314/hockeyR.
WTA data accessed from Jeff Sackmann’s tennis GitHub repository