---
title: "Worksheet 3"
output:
  pdf_document: default
  html_notebook: default
---

## Author: Enter name of the author here

## Discussants: Enter the names of other people who you discussed this problem set with

$\\$

Derek Jeter is a baseball player who retired in 2014. Derek spent his whole career with the New York Yankees and is considered to be a great baseball player. However, since there have been so many great players on the Yankees, [some news sources have questioned whether Derek Jeter is really as good as other historically great players on the Yankees](http://www.theonion.com/articleslideshow/derek-jeter-onion-sports-pays-tribute-to-the-47th--35253). In these exercises we will examine Derek Jeter's statistics and compare them to those of other historically great Yankees to see how he measures up.

Functions you will use in this worksheet include: get.Lahman.batting.data(), dim(), max(), fivenum(), boxplot(), c(), plot() and cor().

```{r message=FALSE, warning=FALSE, tidy=TRUE, echo = FALSE}

library(knitr)

# makes sure the code is wrapped to fit when it creates a pdf
opts_chunk$set(tidy.opts=list(width.cutoff=60))

set.seed(1)  # set the random number generator to always give the same sequence of random numbers

source('/home/shared/baseball_stats_2017/baseball_class_functions.R')

```

$\\$

**Exercise 1:** Let's start off by getting Derek Jeter's batting data using the get.Lahman.batting.data() function, and let's store this data in a variable called Jeter.data. Report how many seasons Derek Jeter played and the maximum number of home runs he hit in a season.

```{r message = FALSE, warning = FALSE, tidy = TRUE}


```

**Answers**:

$\\$

**Exercise 2:** As we discussed last class, we can calculate a five-number summary to summarize a player's performance. Please use the fivenum() function to report the five-number summary of Derek's batting average (BA). Also report the range and inter-quartile range of Derek's batting average.
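
As a minimal sketch of the summary statistics asked for here, the code below applies fivenum() to a made-up vector. The values and the names toy.BA, toy.summary, toy.range and toy.IQR are hypothetical and are not Jeter's data. It takes the range as the maximum minus the minimum, and the inter-quartile range as the difference between the upper and lower hinges that fivenum() returns. Note that R's IQR() function computes quartiles slightly differently and can give a different value on some data.

```r
# Hypothetical batting averages for five seasons (not real data)
toy.BA <- c(0.250, 0.270, 0.290, 0.310, 0.330)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
toy.summary <- fivenum(toy.BA)

# Range: maximum minus minimum
toy.range <- toy.summary[5] - toy.summary[1]

# Inter-quartile range: upper hinge minus lower hinge
toy.IQR <- toy.summary[4] - toy.summary[2]

print(toy.summary)
print(toy.range)
print(toy.IQR)
```
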
Hint: remember you can get the 5th value out of a vector v using the syntax v[5].

```{r message = FALSE, warning = FALSE, tidy = TRUE}


```

**Answers**:

$\\$

**Exercise 3:** Now create a box plot of Derek's batting averages using the boxplot() function. Please label your y-axis using the ylab = "my label" argument. Are there any outliers in his batting average data?

```{r message = FALSE, warning = FALSE, tidy = TRUE}


```

**Answers**:

$\\$

**Exercise 4:** Now it's time to compare Jeter's statistics to those of other great Yankees. Four other great Yankees are: Babe Ruth, Lou Gehrig, Mickey Mantle, and Don Mattingly. Use the get.Lahman.batting.data() function to create a data frame of the batting statistics for each of these players (i.e., for Babe Ruth you should have a data frame called Ruth.data that has all his hitting statistics, etc.). Side-by-side box plots for vectors x, y, z can be made by calling boxplot(x, y, z, names = c('xname', 'yname', 'zname')). Create side-by-side box plots comparing these 5 players' home runs and batting averages. Describe how Derek's statistics compare to those of the other players.

```{r message = FALSE, warning = FALSE, tidy = TRUE}


```

**Answers**:

$\\$

**Exercise 5:** But wait: as we discussed in class, comparing raw statistics from different eras can be misleading, since the distribution of a particular statistic might be quite different in different eras. Instead, it would be useful to see how well each player did compared to their peers using z-scores. Let's look at the 7th season that each player played in, and compare each player's batting average to the batting averages of their peers during that season.

To determine what year Derek Jeter's 7th season was, we can use the syntax Jeter.data[7, ]$yearID. This works by getting the 7th row from the Jeter.data data frame and returning the value of the variable yearID.
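
The row-indexing syntax described above can be sketched on a small made-up data frame. Everything here (toy.data, its values, seventh.season and seventh.year) is hypothetical and only illustrates the indexing; the real Jeter.data has different columns and values.

```r
# Hypothetical data frame with one row per season (not real Lahman data)
toy.data <- data.frame(
  yearID = 2001:2008,
  HR = c(10, 12, 15, 9, 20, 18, 14, 11)
)

# Row 7, all columns: leaving the column position empty keeps every column
seventh.season <- toy.data[7, ]

# Pull the yearID value out of that one-row data frame
seventh.year <- toy.data[7, ]$yearID

print(seventh.season)
print(seventh.year)
```

In this toy example, toy.data$yearID[7] would return the same value, since it selects the yearID column first and then takes its 7th element.
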
Once you have determined the year of Jeter's 7th season, get the statistics for all players from that year who had more than 500 plate appearances (PA) using get.Lahman.batting.data(year = YYY, min.PA = 500), where YYY corresponds to the appropriate year. Save this data frame in a variable called PA500.Jeter7th. Also, report how many players reached the 500 plate appearance cutoff in Jeter's 7th year. Create similar data frames for the 7th years of Ruth, Gehrig, Mantle and Mattingly, and similarly report the number of players who reached the 500 plate appearance cutoff in the 7th season of each of these other Yankees players.

```{r message = FALSE, warning = FALSE, tidy = TRUE}


```

**Answers:**

$\\$

**Exercise 6:** Now compute the mean and standard deviation of the batting averages from each of these 5 years (using the mean() and sd() functions), and create z-scores of the 5 players' batting averages. Remember to use parentheses to first subtract the mean and then divide by the standard deviation. Describe how the players compare in their 7th seasons.

```{r message = FALSE, warning = FALSE, tidy = TRUE}


```

**Answers:**

$\\$

**Exercise 7:** Finally, let's look at the correlation between statistics. In class we saw that there was a positive correlation between home runs (HR) and strikeouts (SO). Now let's examine the correlation between walks (BB) and strikeouts (SO) using the data frame you created above that has the data for all players from Jeter's 7th season who had over 500 plate appearances. Create a scatter plot of walks (BB) as a function of strikeouts (SO) using the plot() function, and label the x and y axes appropriately. Also calculate the correlation between strikeouts and walks, and write a brief interpretation of what this correlation value means.

```{r message = FALSE, warning = FALSE, tidy = TRUE}


```

**Answers:**