---
title: "Worksheet 2"
output:
  pdf_document: default
  html_notebook: default
---

## Author: Enter name of the author here

## Discussants: Enter the names of the other people you discussed this problem set with

$\\$

In the following exercises we will examine some descriptive statistics and plots for categorical and quantitative variables, and then we will explore the concept of estimating a property of a population from a statistic. Functions you will use in this worksheet include: get.Lahman.batting.data(), dim(), plot(), c(), sum(), names(), barplot(), pie(), hist(), mean(), sd(), median(), min(), max(), fivenum(), boxplot(), quantile() and sample().

As always, please use [Piazza](https://piazza.com/class/iy3nflk2izi6np) if you have any questions.

```{r message=FALSE, warning=FALSE, tidy=TRUE, echo = FALSE}
library(knitr)

# makes sure the code is wrapped to fit when it creates a pdf
opts_chunk$set(tidy.opts=list(width.cutoff=60))

set.seed(1)  # set the random number generator to always give the same sequence of random numbers

source('/home/shared/baseball_stats_2017/baseball_class_functions.R')  # this will also load the Master data frame
```

$\\$

**Exercise 1:** Let us start by examining data from Miguel Cabrera, who plays first base for the Detroit Tigers. Miguel is a player who hits a lot of home runs, so he should be a fun player to look at. Start off by using the get.Lahman.batting.data() function to get a data frame that has Miguel's data, and assign this data frame to a variable called Miguel.data. Then use the dim() function to determine how many cases there are in this data frame. Report this number and state what a case corresponds to.

```{r message = FALSE, warning = FALSE, tidy = TRUE}

```

**Answers**:

$\\$

**Exercise 2:** Now plot Miguel's home runs as a function of year using the function plot(x, y, ylab='', xlab='', type = 'o', main=''). What is the most home runs Miguel hit in a season? (Hint: remember we can extract columns from a data frame using the $ symbol).
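As a minimal sketch of the hint in Exercise 2, the snippet below builds a small made-up data frame and extracts its columns with the $ symbol before passing them to plot() and max(). The name toy.data and all of its values are hypothetical and are not Miguel's statistics.

```r
# hypothetical toy data, not taken from the Lahman database
toy.data <- data.frame(yearID = c(2001, 2002, 2003, 2004),
                       HR = c(12, 30, 25, 18))

# dim() returns the number of rows (cases) and columns (variables)
dim(toy.data)

# the $ symbol extracts a single column as a vector
toy.data$HR

# plot one column against another; type = 'o' draws points joined by lines
plot(toy.data$yearID, toy.data$HR, xlab = 'Year', ylab = 'Home runs',
     type = 'o', main = 'Home runs by year (toy data)')

# largest value in the extracted column
max(toy.data$HR)
```

The same pattern should carry over to Miguel.data once it has been created in Exercise 1.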
```{r message = FALSE, warning = FALSE, tidy = TRUE}

```

**Answers**:

$\\$

**Exercise 3:** In baseball there are four different types of hits a batter can get: singles (1B), doubles (2B), triples (3B) and home runs (HR). Create a vector with four values called Miguel.hit.types that has Miguel's career total number of hits, doubles (X2B), triples (X3B) and home runs (note that R puts an X in front of 2B and 3B because variable names in R cannot start with numbers). Display this vector and report his total number of doubles. Hints: the sum() function will be useful, as will the vector creation function c().

```{r message = FALSE, warning = FALSE, tidy = TRUE}

```

**Answers**:

$\\$

**Exercise 4:** Notice that singles (1B) are not a reported statistic in Miguel's data frame. However, since hits = singles + doubles + triples + home runs (i.e., H = 1B + 2B + 3B + HR), we can rearrange this equation to calculate how many singles Miguel had. Use this equation to calculate how many singles Miguel had in total, and create a new vector Miguel.hit.types.with.singles that has singles, doubles, triples and home runs (i.e., replace hits with singles). Report how many singles Miguel had.

```{r message = FALSE, warning = FALSE, tidy = TRUE}

```

**Answers**:

$\\$

**Exercise 5:** R allows you to add names to the different entries in a vector. These names are then used by plotting functions to label figures automatically. Let's use the names() function to add names to the Miguel.hit.types.with.singles vector that describe what the different values in that vector mean, as follows:

```{r message = FALSE, warning = FALSE, tidy = TRUE}
# uncomment and modify the code below
# names(Miguel.hit.types.with.singles) <- c("1B", "2B", 'Add appropriate name here', 'Add appropriate name here')
```

$\\$

**Exercise 6:** Now create a bar plot (barplot(x, xlab ='', ylab='', main ='')) and a pie chart (pie(x)) of the different types of hits that Miguel had. Don't forget to label your axes in the bar plot.
Also report the proportions of the different types of hits that he had.

```{r message = FALSE, warning = FALSE, tidy = TRUE}

```

**Answers**:

$\\$

**Exercise 7:** Let's move on from analyzing Miguel's data and look at the heights of different major league players. The data frame *Master* contains information on all baseball players from 1871 to 2014. There is a variable in this data frame called *height* that has all the players' heights in inches (the Master data frame was loaded when you loaded the functions I wrote). Use the dim() function and report the number of cases in this data frame, and then create a histogram of the heights of all players using the hist() function. Set the optional argument n in the hist() function to 30 to make the histogram have smaller bins (i.e., hist(x, n = 30)). Describe the shape of this histogram.

```{r message = FALSE, warning = FALSE, tidy = TRUE}

```

**Answers**:

$\\$

**Exercise 8:** Now use the mean(), sd() and median() functions to get the mean, the standard deviation and the median of the baseball player heights. Note that the heights of some players in this data set are not known, so they are coded as NA, which signifies a missing value. If you try to take the mean (or standard deviation or median) of a vector that has NAs, the result will be NA. To get around this we can remove the NAs when computing these statistics using the argument na.rm = TRUE (i.e., use mean(x, na.rm = TRUE)). Report what the mean, the standard deviation and the median are.

```{r message = FALSE, warning = FALSE, tidy = TRUE}

```

**Answers**:

$\\$

**Exercise 9:** The quantile() function gives quantiles of data. For example, if we wanted to get the .2 quantile we would use quantile(x, .2). Use the quantile() function to find the .3 and .8 quantiles of player heights, and report what these are. Make sure to set the na.rm = TRUE argument here to remove NAs.
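The NA handling and the quantile() call described in Exercises 8 and 9 can be sketched on a small made-up vector. The name toy.heights and its values are hypothetical and are not taken from the Master data frame.

```r
# hypothetical heights in inches, with one missing value
toy.heights <- c(70, 72, NA, 68, 74, 71)

# without na.rm = TRUE the missing value propagates to the result
mean(toy.heights)                   # NA

# with na.rm = TRUE the NA is dropped before the calculation
mean(toy.heights, na.rm = TRUE)     # 71
median(toy.heights, na.rm = TRUE)   # 71
sd(toy.heights, na.rm = TRUE)

# several quantiles can be requested at once by passing a vector of probabilities
quantile(toy.heights, c(0.25, 0.75), na.rm = TRUE)
```

Passing a vector of probabilities returns one value per requested quantile, which avoids calling quantile() once for each.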
```{r message = FALSE, warning = FALSE, tidy = TRUE}

```

**Answers**:

$\\$

**Exercise 10:** Finally, let's examine a few concepts related to *statistical inference*. As discussed in class, *statistical inference* is the process of estimating a property of a population from a sample of data (remember that a sample is a subset of the population). Let us examine this concept by taking the mean height of 50 randomly chosen baseball players and comparing it to the mean height of all players who ever played. Use the sample() function to get the heights of 50 random players and assign them to a vector called height.of.50.random.players using the code below. Then take the mean of these 50 random players and compare it to the mean of all the baseball players. Repeat this process by generating a second set of 50 random players and again taking the mean of this second set. Does the mean of 50 randomly sampled players seem to be close to the mean of the population?

```{r message = FALSE, warning = FALSE, tidy = TRUE}
# uncomment the code below to sample the heights of 50 random baseball players
# height.of.50.random.players <- sample(Master$height, 50)
```

**Answers**:
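The idea behind Exercise 10 can be sketched with a simulated population instead of the real data. Everything here is an assumption made for illustration: the names toy.population, first.sample and second.sample are hypothetical, and the heights are drawn from a normal distribution rather than from the Master data frame.

```r
set.seed(1)

# hypothetical population: 10,000 simulated heights in inches
toy.population <- rnorm(10000, mean = 72, sd = 2.5)

# population mean (the parameter we would like to estimate)
mean(toy.population)

# two independent random samples of 50 values each
first.sample <- sample(toy.population, 50)
second.sample <- sample(toy.population, 50)

# each sample mean is a statistic that estimates the population mean
mean(first.sample)
mean(second.sample)
```

The two sample means will generally differ from each other and from the population mean, which is the sampling variability the exercise asks about. With real data that contains missing values, a sample may include NAs, so mean(x, na.rm = TRUE) may be needed.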