```
library("tidyverse")
<- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/intro_r/nba_2022_player_stats.csv")
nba_data_2022
<- nba_data_2022 %>%
nba_data_2022 # Summarize player stats across multiple teams they played for:
group_by(player) %>%
summarize(age = first(age),
position = first(position),
games = sum(games, na.rm = TRUE),
minutes_played = sum(minutes_played, na.rm = TRUE),
field_goals = sum(field_goals, na.rm = TRUE),
field_goal_attempts = sum(field_goal_attempts, na.rm = TRUE),
three_pointers = sum(three_pointers, na.rm = TRUE),
three_point_attempts = sum(three_point_attempts, na.rm = TRUE),
free_throws = sum(free_throws, na.rm = TRUE),
free_throw_attempts = sum(free_throw_attempts, na.rm = TRUE)) %>%
ungroup() %>%
mutate(field_goal_percentage = field_goals / field_goal_attempts,
three_point_percentage = three_pointers / three_point_attempts,
free_throw_percentage = free_throws / free_throw_attempts,
min_per_game = minutes_played / games) %>%
# Remove rows with missing missing values
drop_na() %>%
filter(minutes_played > 250)
```

# Linear Regression

Released on Monday, June 26, 2023

## Goals

We will briefly review linear modeling, focusing on building and assessing linear models in R. We have four main goals in this lab:

use exploratory data analysis (EDA) and visualization to determine a) whether two variables have a linear relationship, and b) among a set of explanatory variables, which one(s) seem like the best candidates for predicting a given output variable.

fit and interpret simple regression models,

look at diagnostic plots to determine whether a linear model is a good fit for our data,

assess our fitted linear models.

## Data

#### (SPORTS)

Execute the following code chunk to (a) load the necessary data for this lab, (b) compute four variables we will use in this lab, (c) remove players with missing data (just to simplify things), and (d) subset out players with low minute totals (fewer than 250 minutes played in a season):

##### Which players play the most minutes / game?

In the National Basketball Association (NBA), and more generally in team sports, a coach must make decisions about how many minutes each player should play. Typically, these decisions are informed by a player’s skills, along with other factors such as fatigue, matchups, etc. Our goal is to use measurements of a few (quantifiable) player attributes to predict the minutes per game a player plays. In particular, we will focus on the following data, measured over the 2022 NBA regular season for over 400 players:

`player`

: names of each player (not useful for modeling purposes, but just for reference)`min_per_game`

: our**response variable**, measuring the minutes per game a player played during the 2022 NBA regular season.`field_goal_percentage`

: potential (continuous)**explanatory variable**, calculated as (number of made field goals) / (number of field goals attempted).`free_throw_percentage`

: potential (continuous)**explanatory variable**, calculated as (number of made free throws) / (number of free throws attempted).`three_point_percentage`

: potential (continuous)**explanatory variable**, calculated as (number of made 3 point shots) / (number of 3 point shots attempted),`age`

: potential (continuous / discrete)**explanatory variable**, player’s reported age for the 2022 season,`position`

: potential (categorical) explanatory variable, one of`SG`

(shooting guard),`PG`

(point guard),`C`

(center),`PF`

(power forward) or`SF`

(small forward).

#### (HEALTH)

Execute the following code chunk to load the heart disease dataset we worked with before:

```
library(tidyverse)
<- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/health/intro_r/heart_disease.csv") heart_disease
```

This dataset consists of 788 heart disease patients (608 women, 180 men). Your goal is to predict the `Cost`

column, which corresponds to the patient’s total cost of claims by subscriber (i.e., `Cost`

is the response variable). You have access to the following explanatory variables:

`Age`

: Age of subscriber (years)`Gender`

: Gender of subscriber`Interventions`

: Total number of interventions or procedures carried out`Drugs`

: Categorized number of drugs prescribed: 0 if none, 1 if one, 2 if more than one`ERVisit`

: Number of emergency room visits`Complications`

: Whether or not the subscriber had complications: 1 if yes, 0 if no`Comorbidities`

: Number of other diseases that the subscriber had`Duration`

: Number of days of duration of treatment condition

## Exercises

### 1. **EDA**

Spend time exploring the dataset, to visually assess which of the **explanatory** variables listed above is most associated with our response: the minutes played per game (`min_per_game`

) for (SPORTS), or `Cost`

for (HEALTH). Create scatterplots between the response and each continuous explanatory variable. **Do any of the relationship appear to be linear?** Describe the direction and strength of the association between the explanatory and response variables.

`# INSERT CODE HERE`

In your opinion, **which of the possible continuous explanatory variables displays the strongest relationship with our response variable**?

#### (SPORTS)

Create an appropriate visualization comparing the distribution of minutes per game *by position*.

`# INSERT CODE HERE`

**Do you think there is a relationship between minutes per game and position?**

#### (HEALTH)

Create an appropriate visualization comparing the distribution of cost *by gender*.

`# INSERT CODE HERE`

**Do you think there is a relationship between cost and gender?**

### 2. **Fit a simple linear model**

Now that you’ve performed some EDA, it’s time to actually fit some linear models to the data. Start the variable you think displays the strongest relationship with the response variable. **Update the following code by replacing INSERT_VARIABLE with your selected variable, and run to fit the model**:

#### (SPORTS)

`<- lm(min_per_game ~ INSERT_VARIABLE, data = nba_data_2022) init_nba_lm `

Before checking out the `summary()`

of this model, **you need to check the diagnostics** to see if it meets the necessary assumptions. To do this you can try running `plot(init_nba_lm)`

in the console (what happens?). Equivalently, another way to make the same plots but with `ggplot2`

perks is with the `ggfortify`

package by running the following code:

```
# First install the package by running the following line (uncomment it!) in the console
# install.packages("ggfortify")
library(ggfortify)
autoplot(init_nba_lm) +
theme_bw()
```

#### (HEALTH)

`<- lm(Cost ~ INSERT_VARIABLE, data = heart_disease) init_cost_lm `

Before check out the `summary()`

of this model, **you need to check the diagnostics** to see if it meets the necessary assumptions. To do this you can try running `plot(init_cost_lm)`

in the console (what happens?). Equivalently, another way to make the same plots but with `ggplot2`

perks is with the `ggfortify`

package by running the following code:

```
# First install the package by running the following line (uncomment it!) in the console
# install.packages("ggfortify")
library(ggfortify)
autoplot(init_cost_lm) +
theme_bw()
```

#### (ALL TOGETHER NOW)

The first plot is **residuals vs. fitted**: this plot should NOT display any clear patterns in the data, no obvious outliers, and be symmetric around the horizontal line at zero. The smooth line provided is just for reference to see how the residual average changes. **Do you see any obvious patterns in your plot for this model?**

The second plot is a Q-Q plot (p. 93). Without getting too much into the math behind them, **the closer the observations are to the dashed reference line, the better your model fit is.** It is bad for the observations to diverge from the dashed line in a systematic way - that means we are violating the assumption of normality discussed in lecture. **How do your points look relative to the dashed reference line?**

The third plot looks at the square root of the absolute value of the standardized residiuals. We want to check for homoskedascity of errors (equal, constant variance). **If we did have constant variance, what would we expect to see?** **What does your plot look like?**

The fourth plot is residuals vs. leverage which helps us identify **influential** points. **Leverage** quanitifies the influence the observed response for a particular observation has on its predicted value, i.e. if the leverage is small then the observed response has a small role in the value of its predicted reponse, while a large leverage indicates the observed response plays a large role in the predicted response. Its a value between 0 and 1, where the sum of all leverage values equals the number of coefficients (including the intercept). Specifically the leverage for observation \(i\) is computed as:

\[h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_i^n (x_i - \bar{x})^2}\] where \(\bar{x}\) is the average value for variable \(x\) across all observations. See page 191 for more details on leverage and the regression hat matrix. We’re looking for points in the upper right or lower right corners, where dashed lines for Cook’s distance values would indicate potential outlier points that are displaying too much influence on the model results. **Do you observed any such influential points in upper or lower right corners?**

**What is your final assessment of the diagnostics, do you believe all assumptions are met? Any potential outlier observations to remove?**

#### (HEALTH)

**SPOILER ALERT** An obvious result from looking at the residual diagnostics above is that we are clearly violating the assumption of Normality. **Why do you think we’re violating this assumption?** (HINT: Display a histogram of the `Cost`

variable.)

One way of addressing this concern is to apply a transformation to the response variable, in this case `Cost`

. A common transformation for any type of dollar amount is to use the `log()`

transformation. Run the following code chunk to create a new `log_cost`

variable that we will use for the remainder of the lab.

```
<- heart_disease %>%
heart_disease mutate(log_cost = log(Cost + 1))
```

**Why did we need to + 1 before taking the log()?** (HINT: Look at the minimum of

`Cost`

.) Now make another histogram, this time for the new `log_cost`

variable - what happened to the distribution?Now fit the same model as before using the following code chunk. **Update the following code by replacing INSERT_VARIABLE with your selected variable, and run to fit the model**:

`<- lm(log_cost ~ INSERT_VARIABLE, data = heart_disease) log_cost_lm `

### 3. **Assess the model summary**

Following the example in lecture, interpret the results from the `summary()`

function on your initial model. **Do you think there is sufficient evidence to reject the null hypothesis that the coefficient is 0? What is the interpretation of the** \(R^2\) value? Compare the square root of the raw (unadjusted) \(R^2\) of your linear model to the correlation between that explanatory variable and the response using the `cor()`

function (e.g. `cor(nba_data_2022$INSERT_VARIABLE, nba_data_2022$min_per_game)`

or `cor(heart_disease$INSERT_VARIABLE, heart_disease$log_cost)`

- but replace `INSERT_VARIABLE`

with your variable). **What do you notice?**

To assess the fit of a linear model, we can also plot the predicted values vs the actual values, to see how closely our predictions align with reality, and to decide whether our model is making any systematic errors. Execute the following code chunk to show the actual response values against our model’s predictions

#### (SPORTS)

```
%>%
nba_data_2022 mutate(init_preds = predict(init_nba_lm)) %>%
ggplot(aes(x = init_preds, y = min_per_game)) +
geom_point(alpha = 0.75) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", color = "red") +
theme_bw() +
labs(x = "Predictions", y = "Observed minutes / game")
```

#### (HEALTH)

```
%>%
heart_disease mutate(model_preds = predict(log_cost_lm)) %>%
ggplot(aes(x = model_preds, y = log_cost)) +
geom_point(alpha = 0.75) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", color = "red") +
theme_bw() +
labs(x = "Predictions", y = "Observed log(Cost + 1)")
```

### 4. **Repeat steps 2 and 3 above for each of the different continuous variables**

Which of the variables do you think is the most appropriate variable for modeling the minutes per game?

### 5. **Include multiple covariates in your regression**

Repeat steps 2 and 3 above but including more than one variable in your model. You can easily do this in the `lm()`

function by adding another variable to the formula with the `+`

operator as so (but just replace the `INSERT_VARIABLE_X`

parts):

#### (SPORTS)

```
<- lm(min_per_game ~ INSERT_VARIABLE_1 + INSERT_VARIABLE_2,
multi_nba_lm data = nba_data_2022)
```

#### (HEALTH)

```
<- lm(log_cost ~ INSERT_VARIABLE_1 + INSERT_VARIABLE_2,
multi_cost_lm data = heart_disease)
```

**Experiment with different sets of the continuous variables**. What sets of continuous variables do you think model minutes per game best? (Remember to use the **Adjusted** \(R^2\) when comparing models that have different numbers of variables).

Beware collinearity! Load the `car`

library (install it if necessary!) and use the `vif()`

function to check for possible (multi)collinearity. The `vif()`

function computes the **variance inflation factor (VIF)** where for predictor \(x_j\) for \(j \in 1,\dots, p\):

\[
VIF_j = \frac{1}{1 - R^2_j}
\] where \(R^2_j\) is the \(R^2\) from a variable with variable \(x_j\) as the response and the other \(p-1\) predictors as the explanatory variables. VIF values close to 1 indicate the variable is not correlated with other predictors, while VIF values over 5 indicate strong presence of collinearity. If present, remove a variable with VIF over 5, and redo the fit. Rinse, lather, and repeat until the `vif()`

outputs are all less than 5. The follow code chunk displays an example of using this function:

```
# First install the package by uncommenting out the following line:
# install.packages("car")
library(car)
# (SPORTS)
vif(multi_nba_lm)
# (HEALTH)
vif(multi_cost_lm)
```

### 6. Linear model with one categorical variable

Now we will start to deal with the discrete variables in our dataset, allowing for different kinds of models.

#### (SPORTS)

Run the following code to fit a model using only the `position`

variable:

`<- lm(min_per_game ~ position, data = nba_data_2022) pos_nba_lm `

Next, use the following code to first create a column called `pos_preds`

containing the predictions of the model above, to display the predictions of this model against the actual observed minutes / game, but `facet`

by the player’s position:

```
%>%
nba_data_2022 mutate(pos_preds = predict(pos_nba_lm)) %>%
ggplot(aes(x = min_per_game, y = pos_preds)) +
geom_point(alpha = 0.5) +
facet_wrap(~ position, ncol = 3) +
theme_bw() +
labs(x = "Actual minutes / game",
y = "Predicted minutes / game")
```

As shown in the figure above, categorical variables make it **so we are changing the intercept of our regression line**. To make this more clear, view the output of the summary:

`summary(pos_nba_lm)`

#### (HEALTH)

Run the following code to fit a model using only the `Gender`

variable:

`<- lm(log_cost ~ Gender, data = heart_disease) gender_cost_lm `

Next, use the following code to first create a column called `model_preds`

containing the predictions of the model above, to display the predictions of this model against the actual `log_cast`

, but `facet`

by the patient’s `Gender`

:

```
%>%
heart_disease mutate(model_preds = predict(gender_cost_lm)) %>%
ggplot(aes(x = log_cost, y = model_preds)) +
geom_point(alpha = 0.5) +
facet_wrap(~ Gender, ncol = 2) +
theme_bw() +
labs(x = "Actual log(Cost + 1)",
y = "Predicted log(Cost + 1)")
```

As shown in the figure above, categorical variables make it **so we are changing the intercept of our regression line**. To make this more clear, view the output of the summary:

`summary(gender_cost_lm)`

#### (ALL TOGETHER NOW)

**Notice how only four coefficients are provided in addition to the intercept.** This is because, by default, `R`

turns the categorical variables of \(m\) levels (e.g. there are 5 positions in the NBA dataset, and there are 2 genders in the heart disease dataset) into \(m - 1\) indicator variables (binary with values of 1 if in that level versus 0 if not that level) for different categories relative to a **baseline level**.

For the (SPORTS) group, `R`

has created an indicator for four positions: PF, PG, SF, and SG. By default, `R`

will use alphabetical order to determine the baseline category, which in this example the center position C.

In the (HEALTH) example, we end up with an indicator for `Male`

because `R`

has chosen `Female`

as the baseline category, since it comes first in alphabetical order.

The values for the coefficient estimates indicate the expected change in the response variable relative to the baseline. In other words, **the intercept term gives us the baseline’s average y**, e.g. the average minutes / game for centers, or average log(`Cost`

) for male patients. This matches what you displayed in the predictions against observed response scatterplots by position above.

**Beware the default baseline R picks for categorical variables!** We typically want to choose the baseline level to be the group

**with the most observations**.

#### (SPORTS)

In this example, each position has a similar number of observations so the results are reasonable. But in general, we can change the reference level by modifying the `factor`

levels of the categorical variables (similar to how we reorder things in `ggplot2`

). For example, after viewing `table(nba_data_2022$position)`

we see how the SG position has the most observations. We can use the following code to modify the position variable so that SG is the baseline (we use `fct_relevel()`

to update `position`

so that SG is the first factor level - and we do not need to modify the order of the remaining levels):

```
<- nba_data_2022 %>%
nba_data_2022 mutate(position = fct_relevel(position, "SG"))
```

**Refit the linear regression model using position above, how has the summary changed?**

#### (HEALTH)

In this example, `Female`

has the most number of observations so the default was appropriate. But in general, we can change the reference level by modifying the `factor`

levels of the categorical variables (similar to how we reorder things in `ggplot2`

). For example, we can use the following code to modify the `Gender`

variable so that `Male`

is the baseline (we use `fct_relevel()`

to update `Gender`

so that `Male`

is the first factor level - and we do not need to modify the order of the remaining levels):

```
<- heart_disease %>%
heart_disease mutate(Gender = fct_relevel(Gender, "Male"))
```

**Refit the linear regression model using Gender above, how has the summary changed?**

After you refit the model above, change the reference level back to `Female`

witht the following code:

```
<- heart_disease %>%
heart_disease mutate(Gender = fct_relevel(Gender, "Female"))
```

### 7. Linear model with one categorical AND one continuous variable

Pick a single continuous variable from earlier, use it to replace INSERT_VARIABLE below, then run the code to fit a model with your categorical variable included:

#### (SPORTS)

`<- lm(min_per_game ~ position + INSERT_VARIABLE, data = nba_data_2022) x_pos_nba_lm `

**Create scatterplots with your predictions on the y-axis, your INSERT_VARIABLE on the x-asis, and color by position**. What do you observe?

Given similarities between different types of positions, we can easily collapse the positions together into a smaller number of categories using `fct_collapse()`

:

```
<- nba_data_2022 %>%
nba_data_2022 mutate(position_group = fct_collapse(position,
Guard = c("SG", "PG"),
Forward = c("SF", "PF"),
Center = "C"))
```

**Refit the model with this new position_group variable**, but assign it to a different name, e.g.

`x_pos_group_nba_lm`

. What changed in the summary?#### (HEALTH)

`<- lm(log_cost ~ Gender + INSERT_VARIABLE, data = heart_disease) x_gender_cost_lm `

**Create scatterplots with your predictions on the y-axis, your INSERT_VARIABLE on the x-asis, and color by Gender**. What do you observe?

Another categorical we have access to is the `Drugs`

variable, which is currently coded as numeric. We can first use the `fct_recode()`

function to modify the `Drugs`

variable so that the integers are relabeled:

```
<- heart_disease %>%
heart_disease mutate(Drugs = fct_recode(as.factor(Drugs),
"None" = "0",
"One" = "1",
"> One" = "2"))
```

Run the following code to fit a model using only the `Drugs`

variable:

`<- lm(log_cost ~ Drugs, data = heart_disease) drugs_cost_lm `

Repeat the same from above that you considered for the `Gender`

variable, viewing the predictions facetted by `Drugs`

and assess the `summary()`

output. **Do you think an appropriate reference level was used?** (HINT: Use the `table()`

function on the `Drugs`

variable to view the overall frequency of each level and determine if the most frequent level was used as the reference.)

Given the similar values, we may decide to collapse the level of `One`

and `> One`

into a single level `>= One`

. We can easily collapse the levels together into a smaller number of categories using `fct_collapse()`

:

```
<- heart_disease %>%
heart_disease mutate(drugs_group = fct_collapse(Drugs,
None = c("None"),
`>= One` = c("One", "> One")))
```

**Refit the model with this new drugs_group variable**, but assign it to a different name, e.g.

`drugs_group_cost_lm`

. What changed in the summary?### 8. Interactions

Remember with `ggplot2`

you can directly compute and plot the results from running linear regression using `geom_smooth()`

or `stat_smooth()`

and specifying that `method = "lm"`

. Try running the following code (replace INSERT_VARIABLE!) to generate the linear regression fits with `geom_smooth`

versus your own model’s predictions (note the different `y`

mapping for the point versus smooth layers):

#### (SPORTS)

```
%>%
nba_data_2022 mutate(pos_preds = predict(x_pos_nba_lm)) %>%
ggplot(aes(x = INSERT_VARIABLE,
color = position)) +
geom_point(aes(y = pos_preds),
alpha = 0.5) +
theme_bw() +
facet_wrap(~ position, ncol = 3) +
labs(x = "INSERT YOUR LABEL HERE",
y = "Predicted minutes / game") +
geom_smooth(aes(y = min_per_game),
method = "lm")
```

#### (HEALTH)

```
%>%
heart_disease mutate(model_preds = predict(x_gender_cost_lm)) %>%
ggplot(aes(x = INSERT_VARIABLE,
color = Gender)) +
geom_point(aes(y = model_preds),
alpha = 0.5) +
theme_bw() +
facet_wrap(~ Gender, ncol = 3) +
labs(x = "INSERT YOUR LABEL HERE",
y = "Predicted log(Cost + 1)") +
geom_smooth(aes(y = log_cost),
method = "lm")
```

#### (ALL TOGETHER NOW)

**The geom_smooth() regression lines do NOT match!** This is because

`ggplot2`

is fitting **separate regressions for each value of the categorical variable**, meaning the slope for the continuous variable on the x-axis is changing for each position/gender. We can match the output of the

`geom_smooth()`

results with **interactions**. We can use interaction terms to build more complex models. Interaction terms allow for a different linear model to be fit for each category; that is, they allow for different slopes across different categories. If we believe relationships between continuous variables, and outcomes, differ across categories, we can use interaction terms to better model these relationships.

To fit a model with an interaction term between two variables, include the interaction via the `*`

operator like so:

#### (SPORTS)

```
<- lm(min_per_game ~ position + INSERT_VARIABLE +
pos_int_nba_lm * INSERT_VARIABLE,
position data = nba_data_2022)
```

#### (HEALTH)

```
<- lm(log_cost ~ Gender + INSERT_VARIABLE +
gender_int_cost_lm * INSERT_VARIABLE,
Gender data = heart_disease)
```

#### (ALL TOGETHER NOW)

**Replace the predictions in the previous plot’s mutate code with this interaction model’s predictions.** How do they compare to the results from

`geom_smooth()`

now?You can model interactions between any type of variables using the `*`

operator, feel free to experiment on your different possible continuous variables.

### 9. Polynomials

Another way to increase the explanatory power of your model is to include transformations of continuous variables. For instance you can directly create a column that is a square of a variable with `mutate()`

and then fit the regression with the original variable and its squared term:

#### (SPORTS)

```
<- nba_data_2022 %>%
nba_data_2022 mutate(fg_perc_squared = field_goal_percentage^2)
<- lm(min_per_game ~ field_goal_percentage + fg_perc_squared,
squared_fg_lm data = nba_data_2022)
summary(squared_fg_lm)
```

**What are some difficulties with interpreting this model fit?** View the predictions for this model or other covariates you squared.

The `poly()`

function allows us to build higher-order polynomial transformations of variables easily. Run the following code chunk to fit a 9th-order polynomial model (i.e. \(Y = \beta_0 + \beta_1x + \beta_2x^2 + \ldots + \beta_9x^9\)) between minutes / game and field goal percentage.

```
<- lm(min_per_game ~ poly(field_goal_percentage, 9),
poly_nine_fg_lm data = nba_data_2022)
summary(poly_nine_fg_lm)
```

**Do you think this is appropriate, how did this change your predictions compared to the previous plot or when only using the variable without any transformation?**

#### (HEALTH)

```
<- heart_disease %>%
heart_disease mutate(duration_squared = Duration^2)
<- lm(log_cost ~ Duration + duration_squared,
squared_duration_lm data = heart_disease)
summary(squared_duration_lm)
```

**What are some difficulties with interpreting this model fit?** View the predictions for this model or other covariates you squared.

The `poly()`

function allows us to build higher-order polynomial transformations of variables easily. Run the following code chunk to fit a 9th-order polynomial model (i.e. \(Y = \beta_0 + \beta_1x + \beta_2x^2 + \ldots + \beta_9x^9\)) between `log_cost`

and `Duration`

.

```
<- lm(log_cost ~ poly(Duration, 9),
poly_nine_duration_lm data = heart_disease)
summary(poly_nine_duration_lm)
```

**Do you think this is appropriate, how did this change your predictions compared to the previous plot or when only using the variable without any transformation?**

### 10. Training and testing

As we’ve seen, using transformations such as higher-order polynomials may decrease the interpretability and increase the potential for overfitting associated with our models; however, they can also dramatically improve the explanatory power.

We need a way for making sure our more complicated models have not overly fit to the noise present in our data. Another way of saying this is that a good model should generalize to a different sample than the one on which it was fit. This intuition motivates the idea of training/testing. We split our data into two parts, use one part – the training set – to fit our models, and the other part – the testing set – to evaluate our models. Any model which happens to fit to the noise present in our training data should perform poorly on our testing data.

NOTE: We split our data randomly using the `sample()`

function, which means that each time you run this code you’ll get a different result… UNLESS you set a seed. This makes the code reproducible, as the seed starts the (pseudo)random number generator. Always remember to `set.seed()`

any time your methods involve randomness. Feel free to use any numbers you like (Prof. Yurko generally uses the years that the Pittsburgh Pirates won the world series, but you do whatever you like).

#### (SPORTS)

The first thing we will need to do is split our sample. Run the following code chunk to divide our data into two halves, which we will refer to as a training set and a test set. Briefly summarize what each line in the code chunk is doing.

```
set.seed(101) # CHANGE THIS NUMBER
<- nrow(nba_data_2022)
n_players <- sample(n_players, n_players / 2, replace = FALSE)
train_i <- (1:n_players)[-train_i]
test_i <- nba_data_2022[train_i,]
nba_train <- nba_data_2022[test_i,] nba_test
```

We will now compare three candidate models for predicting minutes played using position and field goal percentage. We will fit these models on the **training data** only, ignoring the testing data for the moment. Run the below two code chunks to create two candidate models:

```
<- lm(min_per_game ~ poly(field_goal_percentage, 2) + position +
candidate_model_1 * poly(field_goal_percentage, 2),
position data = nba_train)
```

```
<- lm(min_per_game ~ poly(field_goal_percentage, 2) + position,
candidate_model_2 data = nba_train)
```

Using `summary()`

, which of these models has more explanatory power according to the training data? Which of the models is less likely to overfit?

**Fit another model to predict minutes / game using a different set of variables / polynomials**.

Now that we’ve built our candidate models, we will evaluate them on our test set, using the criterion of mean squared error (MSE). Run the following code chunk to compute, on the test set, the MSE of predictions given by the first model compared to the actual minutes played.

```
<- predict(candidate_model_1, newdata = nba_test)
model_1_preds <- mean((model_1_preds - nba_test$min_per_game)^2) model_1_mse
```

**Do this for each of your candidate models**. Compare the MSE on the test set, which model performed best (lowest test MSE)?

#### (HEALTH)

The first thing we will need to do is split our sample. Run the following code chunk to divide our data into two halves, which we will refer to as a training set and a test set. Briefly summarize what each line in the code chunk is doing.

```
set.seed(101) # CHANGE THIS NUMBER
<- nrow(heart_disease)
n_patients <- sample(n_patients, n_patients / 2, replace = FALSE)
train_i <- (1:n_patients)[-train_i]
test_i <- heart_disease[train_i,]
heart_train <- heart_disease[test_i,] heart_test
```

We will now compare three candidate models for predicting `log_cost`

using `Gender`

and `Duration`

. We will fit these models on the **training data** only, ignoring the testing data for the moment. Run the below two code chunks to create two candidate models:

```
<- lm(log_cost ~ poly(Duration, 2) + Gender +
candidate_model_1 * poly(Duration, 2),
Gender data = heart_train)
```

```
<- lm(log_cost ~ poly(Duration, 2) + Gender,
candidate_model_2 data = heart_train)
```

Using `summary()`

, which of these models has more explanatory power according to the training data? Which of the models is less likely to overfit?

**Fit another model to predict log_cost using a different set of variables / polynomials**.

Now that we’ve built our candidate models, we will evaluate them on our test set, using the criterion of mean squared error (MSE). Run the following code chunk to compute, on the test set, the MSE of predictions given by the first model compared to the actual `log_cost`

.

```
<- predict(candidate_model_1, newdata = heart_test)
model_1_preds <- mean((model_1_preds - heart_test$log_cost)^2) model_1_mse
```

**Do this for each of your candidate models**. Compare the MSE on the test set, which model performed best (lowest test MSE)?