---
title: "Getting Started with R"
date: "June 5, 2023"
format:
  html:
    code-tools:
      source: true
      toggle: false
      caption: none
---
## Install R and RStudio
To install **`R`** (latest release: 4.3.0), go to <https://www.r-project.org> and click the *download R* link in the middle of the page under "Getting Started." Choose a CRAN mirror, then download and run the installer (executable, pkg, etc.) that corresponds to your operating system.
Although you can use `R` without an integrated development environment (IDE), you will need to install **RStudio**, by far the most popular IDE for `R`, for this summer. It makes working with `R` much easier, and we will use it throughout the program. To install RStudio, go to <https://www.rstudio.com/products/rstudio/download/#download> and choose the installer for your system. If you already have RStudio installed but not the latest version, download and run the latest installer.
## Typical workflow
**Follow the TA for a walkthrough of each component in RStudio.**
### Writing R scripts
You can type commands directly into the **Console** (lower left pane), but this becomes tedious as your work grows more complex. Instead, you will spend most of your time writing code in **R Scripts**. An R Script is a plain-text file of R commands saved with the **.R** extension. R Scripts are useful because we can edit our code before sending it to the console to run. You can create a new R Script by clicking the new-file symbol in the top left of RStudio and selecting **R Script**.
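As a rough illustration, the contents of a small R Script might look like the sketch below. The file name `first_script.R` and the values are hypothetical, chosen only for this example.

```{r}
# Contents of a hypothetical script saved as first_script.R
heights_cm <- c(170, 182, 165)   # made-up example values
mean_height <- mean(heights_cm)  # average of the three heights
print(mean_height)
```

In RStudio you can send the current line or selection to the Console with Ctrl+Enter (Cmd+Return on Mac), or run a whole saved script with `source("first_script.R")`.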
### Using R Markdown
An R Markdown (`.Rmd`) file is a dynamic document that combines `R` code with Markdown text. It contains reproducible `R` code along with the narration that a reader needs to understand your work. (This file itself is generated by R Markdown.) If you are familiar with LaTeX syntax, math mode works in almost the same way:
$$
f (x) = \frac{1}{\sqrt{2\pi}} \exp \left( - \frac{x^2}{2} \right)
$$
Chunks of embedded `R` code look like this:
```{r}
#R codes here
print("Hello World")
```
Running an R Markdown file and converting it into a reader-friendly document requires the `rmarkdown` and `knitr` packages. All the lab documents will be R Markdown files, so you need to know how to knit them. We recommend knitting to an `html` file, but if you have LaTeX installed, you can also knit to PDF.
For more detailed information, the [R Markdown Cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) and the [online book](https://bookdown.org/yihui/rmarkdown/) are helpful. (In RStudio, cheatsheets are easily accessible from Help -\> Cheatsheets.)
Alternatively, you can create an **R Notebook** file, which is similar to R Markdown but lets you run code chunks separately without having to re-knit (aka re-run!) the entire file. Notebooks only render to HTML, but they can be very useful for simultaneously coding, generating results, and making a file that is presentable to others.
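Knitting can also be triggered from the Console instead of the Knit button. The sketch below assumes a file named `lab01.Rmd` (a hypothetical name) in the working directory, and it only renders when both the file and the `rmarkdown` package are available.

```{r, eval = FALSE}
rmd_file <- "lab01.Rmd"  # hypothetical file name; replace with your own .Rmd

# Render to HTML only when the file and the rmarkdown package are both present
if (file.exists(rmd_file) && requireNamespace("rmarkdown", quietly = TRUE)) {
  rmarkdown::render(rmd_file, output_format = "html_document")
} else {
  message("Nothing rendered: check the file name and that rmarkdown is installed.")
}
```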
## Install R packages
`R` performs a wide variety of tasks, such as data manipulation, modeling, and visualization. The extensive code base beyond the built-in functions is organized into **packages** created by numerous statisticians and developers. The Comprehensive R Archive Network (CRAN) manages the open-source distribution and quality control of `R` packages.
To install an `R` package, call the function `install.packages()` with the package name in quotes inside the parentheses. While this is preferred, RStudio users can also go to "Tools", then "Install Packages", and enter the package name.
```{r, eval = FALSE}
install.packages("tidyverse")
```
**Important Note**: NEVER install new packages in a code block in an .Rmd file. Always install new packages at the command line / Console. That is, the `install.packages()` function should NEVER be in your code chunks (unless they are commented out using #). The `library()` function, however, will be used throughout your code: The `library()` function loads packages only after they are installed.
If at any time you get a message that says "Do you want to install from sources the package which needs compilation?", choosing "No" tends to cause less trouble. (Note: This happens when a newer version of a package is available as source code but has not yet been compiled for your operating system. In many cases, you can just proceed without the source compilation.)
Each package only needs to be installed once. Whenever you want to use functions defined in the package, you need to *load* the package with the statement:
```{r, message = FALSE, eval = FALSE}
library(tidyverse)
```
Here is a (non-exhaustive) list of packages that we may need in the following lectures and/or labs. Please make sure you can install all of them. If a package fails to install, update `R` and RStudio first, then check the error message for any other packages that need to be installed first.
```{r, message=FALSE, eval = FALSE}
library(tidyverse)
library(devtools)
library(rmarkdown)
library(knitr)
library(ranger)
library(glmnet)
```
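One way to see which of these packages still need installing is to compare the list against what is already in your library. This is a minimal sketch using base `R`; the variable names are illustrative, and the install step stays commented out so that it is only ever run from the Console.

```{r}
# Packages listed above
needed <- c("tidyverse", "devtools", "rmarkdown", "knitr", "ranger", "glmnet")

# Names of all packages currently installed in your library
installed <- rownames(installed.packages())

# Packages from the list that are not installed yet
missing_pkgs <- setdiff(needed, installed)
missing_pkgs

# Then, in the Console only:
# install.packages(missing_pkgs)
```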
## Basic data type and operators
### Data type: Vector
The basic unit of `R` is a vector. A vector `v` is a collection of values of the same type and the type could be:
1. numeric (double/integer number): digits with optional decimal point
```{r}
v1 <- c(1, 5, 8.3, 0.02, 99999)
typeof(v1)
```
2. character: a string (or word) in double or single quotes, "..." or '...'.
```{r}
v2 <- c("apple", "banana", "3 chairs", "dimension1", ">-<")
typeof(v2)
```
3. logical: TRUE and FALSE
```{r}
v3 <- c(TRUE, FALSE, FALSE)
typeof(v3)
```
Note: A `factor` is often used to encode a character vector as integer codes, one for each unique value (level).
```{r}
player_type <- c("Batter", "Batter", "Hitter", "Batter", "Hitter")
player_type <- factor(player_type)
str(player_type)
typeof(player_type)
```
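To see the encoding more directly, the sketch below (with made-up position labels) pulls out the two pieces a factor stores: the unique levels and the integer code for each element.

```{r}
# Made-up labels for illustration
position <- factor(c("Guard", "Center", "Guard", "Forward"))

position_levels <- levels(position)     # unique labels, sorted alphabetically
position_codes  <- as.integer(position) # each element's index into the levels

position_levels
position_codes
```

Each code is the position of that element's label in `levels(position)`, which is why `typeof()` reports a factor as `"integer"`.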
### Data type: Lists
A vector can store only a single data type, so mixed values are coerced to a common type:
```{r}
typeof(c(1, TRUE, "apple"))
```
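When types are mixed, `R` converts everything to the most flexible type present, in the order logical, integer, double, character. A short sketch of each step:

```{r}
type_lgl_int <- typeof(c(TRUE, 2L))      # logical + integer  -> "integer"
type_lgl_dbl <- typeof(c(TRUE, 2.5))     # logical + double   -> "double"
type_dbl_chr <- typeof(c(2.5, "apple"))  # double + character -> "character"

c(type_lgl_int, type_lgl_dbl, type_dbl_chr)

# TRUE is stored as 1 once it is combined with numbers
c(TRUE, 2.5)
```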
A **list** is a generic vector whose elements can be vectors of different data types:
```{r}
roster <- list(
  name = c("Shamindra", "Meg", "Quang", "Nick", "YJ", "Beomjo"),
  role = c("Instructor", "Instructor", "TA", "TA", "TA", "TA"),
  is_TA = c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
)
str(roster)
```
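Elements of a list can be pulled out by name. The sketch below uses a small hypothetical list to show the difference between `$` or `[[ ]]`, which return the element itself, and `[ ]`, which returns a shorter list.

```{r}
# Hypothetical list mixing character, integer, and logical elements
course <- list(name = "SURE", weeks = 1:3, remote = FALSE)

course$name          # the element itself: a character vector
course[["weeks"]]    # same idea, using the name as a string
course["weeks"]      # single brackets keep the list wrapper

typeof(course[["weeks"]])  # "integer"
typeof(course["weeks"])    # "list"
```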
`R` uses a special kind of list, the **data frame**, whose elements (the columns) all have the same number of rows, and whose rows have unique names.
```{r}
str(iris)
typeof(iris)
```
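A data frame can also be built by hand from equal-length vectors. This is a minimal sketch with made-up names and values:

```{r}
# Hypothetical toy data; every column must have the same length
staff <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  is_TA = c(FALSE, TRUE, TRUE)
)

nrow(staff)    # number of rows
staff$name     # one column, returned as a vector
class(staff)   # "data.frame"
typeof(staff)  # "list", because a data frame is a list of columns
```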
### Operators
We can perform element-wise operations on vectors with the following operators:
1. Arithmetic: + - \* / \^ (and, for integer division, %/% is quotient, %% is remainder)
```{r}
v1 <- c(1,2,3)
v2 <- c(4,5,6)
v1 + v2
v1 * v2
v2 %% v1
```
2. Relation: \> \>= \< \<= == != (the last two are "equal to" and "not equal to")
```{r}
5 > 4
5 <= 4
33 == 22
33 != 22
```
3. Logic: ! (not), & (and), \| (or)
```{r}
(5 > 6) | (2 < 3)
(5 > 6) & (2 < 3)
!(5 > 6) & (2 < 3)
```
4. Sequence: from:to (Colon operator)
```{r}
1:5
5:1
-1:-5
-1:5
```
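Because these operators work element by element, they can be combined to pick out parts of a vector. The sketch below uses made-up scores; the variable names are illustrative only.

```{r}
scores <- c(12, 7, 25, 18, 3)   # made-up values

is_big  <- scores > 10          # relational operator, one TRUE/FALSE per element
is_even <- scores %% 2 == 0     # remainder after dividing by 2 is zero

keep <- is_big & is_even        # logical "and", element by element
keep

scores[keep]                    # keep only the elements where both hold
sum(is_big)                     # TRUE counts as 1, so this counts the big scores
```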
## Read the csv file
Most of the data provided to you are in csv format. In the code chunk below, we use the [`read_csv()`](https://readr.tidyverse.org/reference/read_delim.html) function from the `readr` package (loaded as part of the `tidyverse`) to load a dataset that is stored on the SURE website. The quoted string is the file path where the dataset is located, which in this case is a URL. Typically, however, you'll save csv files locally first and put them in an organized folder to access later.
```{r, eval = FALSE}
nba <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/intro_r/nba_2022_player_stats.csv")
head(nba)
```
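Reading a local file works the same way, with a path on your computer in place of the URL. The sketch below is self-contained: it writes a tiny made-up csv into a temporary `data` folder and reads it back. It uses base `read.csv()` rather than `read_csv()` so that it runs without extra packages; the folder, file, and column names are all hypothetical.

```{r}
# Build a path inside a temporary "data" folder (stand-in for your project folder)
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
csv_path <- file.path(data_dir, "players.csv")

# Write a small made-up table so there is something to read
toy <- data.frame(player = c("A", "B"), points = c(10, 22))
write.csv(toy, csv_path, row.names = FALSE)

# Read the local file back in and inspect its structure
players <- read.csv(csv_path)
str(players)
```

Building paths with `file.path()` keeps the code portable across operating systems, since it inserts the path separator for you.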
## Looking for help
If you have any R problem, the best first step is to use the `help()` function (or, equivalently, the `?` operator). For example, we can find out what the `help()` function does by typing `help(help)`:
```{r, eval = FALSE}
help(help)
```
Or you can type `?` followed by the function name, for example:
```{r, eval = FALSE}
?help
```
Double question marks run a more general search across the help pages:
```{r, eval = FALSE}
??help
```
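When you only remember part of a function name, base `R` offers a couple of other lookups. A brief sketch, using `read.csv` purely as an example target:

```{r}
# List objects on the search path whose names match a pattern
matches <- apropos("^read\\.csv")
matches

# Show a function's arguments and their defaults without opening the help page
args(read.csv)
```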
You should **ALWAYS** consult the `R` help pages first before searching online for a solution.
## Exercise
1. Create four vectors: `v1` and `v2` are numeric vectors, `v3` is a character vector, and `v4` is a logical vector. Make sure `v1` and `v2` have the same length. (Hint: you can check the length with the function `length()`.)
```{r}
#R code here
```
2. Perform addition, subtraction, multiplication, and division on `v1` and `v2`.
```{r}
#R code here
```
3. Create four statements that use both relational and logical operators, such that 2 of them return `TRUE` and 2 of them return `FALSE`.
```{r}
#R code here
```
4. Create 2 sequences with length 20, one in an increasing order and the other in a decreasing order.
```{r}
#R code here
```
5. (SPORTS) Below we load the `Batting` dataset containing historical MLB statistics from the `Lahman` package (you will need to install it first!). How many of the rows are from the American Association (coded as 'AA' in `lgID`)? How about the American League ('AL')? Can you neatly summarize the counts for all leagues? (HINT: `table()`)
```{r}
library(Lahman)
data(Batting)
#R code here
```
5. (HEALTH) Below we load the `gapminder` dataset containing health and income outcomes for 184 countries from 1960 to 2016 from the `dslabs` package. How many of the rows in the dataset are from the Caribbean (coded as 'Caribbean' in `region`)? How about Eastern Europe? Can you neatly summarize the counts for all regions? (HINT: `table()`)
```{r}
library(dslabs)
data(gapminder)
#R code here
```
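As a hint for the last exercise, here is how counting works on a tiny made-up vector (not the real data). The same pattern applies to a column of a data frame, such as `Batting$lgID` or `gapminder$region`.

```{r}
# Made-up league codes for illustration only
league <- c("AL", "NL", "AL", "AL", "NL")

n_al <- sum(league == "AL")     # count the entries equal to "AL"
n_al

league_counts <- table(league)  # counts for every distinct value at once
league_counts
```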
# OPTIONAL: Formatting Text within RStudio
There are a lot of ways to format text within RStudio, e.g., *italics* and **bold** (just look at the .Rmd file to see how I did this). [See here](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) for more tips/tricks on how to format things in R Markdown. As you'll see throughout this summer (and especially with your project), well-formatted .html files can be a great way to showcase data science results to the public online.
# OPTIONAL: Customizing the RStudio User Interface
Within RStudio there are several panes that contain various things (Console, Help, Environment, History, Plots, etc). Here we discuss how you can customize how these panes are displayed.
If you're using Mac, go to RStudio / Preferences / Pane Layout. If you're using Windows, go to Tools / Global Options / Pane Layout. Change the menu options to arrange the panes as you see fit. Click Apply and OK.
Now (still within the RStudio / Preferences menu, or Tools / Global Options on Windows), click Appearance and choose an appropriate font, font size, and theme. Click Apply and OK. Minimizing the bottom-left and bottom-right panes is a nice trick, which gives more vertical space to see your code and the output it's generating. (Minimize/maximize buttons are in the top-right of each pane.)
# OPTIONAL: Additional Customization Advice
- Under Preferences / Code / Display, you might consider adding the margin column and setting it to 80 characters, since most style guides suggest that you should keep lines of code at 80 characters or fewer when possible.
- You can set your background color, font, font size, etc. under Preferences / Appearance. (I use Cobalt). "Dark displays" are often easier on the eyes and can conserve energy on some devices, such as those with OLED screens. Of course, this is strictly a personal preference, and people can get unnecessarily dogmatic about it, so you should just choose something that you like.
- Under Preferences / Packages, you can opt to change your CRAN mirror to the "Global (CDN) - RStudio" option, as it is very reliable.
- Under Preferences / Git/SVN, you can configure your version control preferences. People often recommend using Git with RStudio for version control purposes. The interface is easy to use even if you're a beginner programmer or Git user. [This link](https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN) and [this link](https://jennybc.github.io/2014-05-12-ubc/ubc-r/session03_git.html) have more information on this, if you're interested.
# OPTIONAL: `R` Primers on RStudio Cloud
If you are struggling to install `R` and RStudio on your computer, and/or having difficulties with installing the `tidyverse`, then you **should make a free RStudio Cloud account at [`https://rstudio.cloud/`](https://rstudio.cloud/).** This is a free, browser-based version of `R` and RStudio that also provides access to a growing number of `R` tutorials / primers relevant to this summer.
After you create an RStudio Cloud account, click on the navigation menu by "Your Workspace". Then click on ["Primers"](https://rstudio.cloud/learn/primers) to bring up a menu of tutorials, with code primers you can choose to work through. RStudio Cloud is a great practical alternative to use **in case we are unable to resolve installation errors on your own personal computer** (an unlikely scenario). We strongly encourage you to use an installed version of `R` and RStudio throughout the summer, because RStudio Cloud has data limitations that matter for your projects.