Introducing RStudio (live demo)
- What is the console?
- What is the source?
- What is your environment?
- Using git within RStudio
SideNote: What is an .Rmd file? What about an .r file?
This is an R Markdown document (.Rmd). Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It facilitates the easy combination of both code and text. Everything you see in the grey chunks will run as R code. Everything outside a code chunk is treated as plain text.
To get the complete formatted document, with plots, code output, and text together, you must knit the .Rmd file into its final format (generally html, pdf, md, or doc). This is why you may see “duplicate” files in our GitHub repo.
For more details on using R Markdown see http://rmarkdown.rstudio.com.
By comparison, a simple R script is saved as a .r file. Everything inside a .r file is treated (and run) as code unless it is commented out with a # sign.
The Tidyverse
The tidyverse is a collection of R packages designed for data science. All packages share an underlying philosophy and common APIs.
Install the packages: install.packages("tidyverse")
Load the packages each time you open a new session:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
## ✔ tibble 1.3.4 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Access the book! R for Data Science
Other Resources:
Importing Data
In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.
Big Data Borat @BigDataBorat, February 27, 2013
Parsing
Our first task is to read this data into our R environment. To do this, we will use the read.table function. Reading in a data file is called parsing, which sounds much more sophisticated. For good reason too: parsing different data files and formats is a cornerstone of all practical data science research, and can often be the hardest step.
So what do we need to know about this file in order to read it into R?
The first thing we should notice is that there is a large comment block of documentation. This can be ignored when parsing by using the comment.char argument, which R lets us abbreviate to comment.
## Let's try:
co2 <- read.table("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt", comment='#')
head(co2)
## V1 V2 V3 V4 V5 V6 V7
## 1 1958 3 1958.208 315.71 315.71 314.62 -1
## 2 1958 4 1958.292 317.45 317.45 315.29 -1
## 3 1958 5 1958.375 317.50 317.50 314.71 -1
## 4 1958 6 1958.458 -99.99 317.10 314.85 -1
## 5 1958 7 1958.542 315.86 315.86 314.98 -1
## 6 1958 8 1958.625 314.93 314.93 315.94 -1
#or if the ftp causes problems, use local:
#co2 <- read_csv("NOAA_CO2.csv")
# parsing required may be slightly different
Almost there, but things are still a bit messy. Our columns have been given the default names V1 through V7 instead of meaningful ones. The documentation also notes that certain values are used to indicate missing data (-99.99 in the average column and -1 in the days column), which we would be better off converting to explicit missing values so we don’t get confused.
co2 <- read.table("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt",
sep = "", comment = "#",
col.names = c("year", "month", "decimal_date",
"average", "interpolated",
"trend", "days"),
na.strings = c("-1", "-99.99"))
head(co2)
## year month decimal_date average interpolated trend days
## 1 1958 3 1958.208 315.71 315.71 314.62 NA
## 2 1958 4 1958.292 317.45 317.45 315.29 NA
## 3 1958 5 1958.375 317.50 317.50 314.71 NA
## 4 1958 6 1958.458 NA 317.10 314.85 NA
## 5 1958 7 1958.542 315.86 315.86 314.98 NA
## 6 1958 8 1958.625 314.93 314.93 315.94 NA
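To see what each argument contributes without touching the network, here is a minimal sketch that parses a three-row excerpt held in a string. The data rows are copied from the output shown in this section; the two `#` lines are a made-up stand-in for NOAA's documentation block, and `raw` and `toy` are illustrative names only.

```r
# Stand-in for the NOAA file: comment lines first, then whitespace-separated values.
raw <- "# documentation line (made up for illustration)
# year month decimal_date average interpolated trend days
1958 3 1958.208 315.71 315.71 314.62 -1
1958 6 1958.458 -99.99 317.10 314.85 -1
2017 11 2017.875 405.14 405.14 407.22 26"

toy <- read.table(text = raw,
                  comment.char = "#",                    # drop the documentation lines
                  col.names = c("year", "month", "decimal_date",
                                "average", "interpolated", "trend", "days"),
                  na.strings = c("-1", "-99.99"))        # sentinel values become NA

toy
is.na(toy$average)   # only the June 1958 row is missing
is.na(toy$days)      # the two 1958 rows are missing
```

Because `na.strings` applies to every column, this approach assumes that -1 and -99.99 never occur as genuine measurements anywhere in the file.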
Importing Data with tidyverse
Alternatively, we can use readr::read_table from the tidyverse. It seems that the comment argument is not yet fully implemented in the CRAN version of readr, so we must rely on skip to avoid the documentation block:
co2 <- read_table("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt",
col_names = c("year", "month", "decimal_date",
"average", "interpolated", "trend", "days"),
col_types = c("iiddddi"),
na = c("-1", "-99.99"),
skip = 72)
co2
## # A tibble: 717 x 7
## year month decimal_date average interpolated trend days
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1958 3 1958.208 315.71 315.71 314.62 NA
## 2 1958 4 1958.292 317.45 317.45 315.29 NA
## 3 1958 5 1958.375 317.50 317.50 314.71 NA
## 4 1958 6 1958.458 NA 317.10 314.85 NA
## 5 1958 7 1958.542 315.86 315.86 314.98 NA
## 6 1958 8 1958.625 314.93 314.93 315.94 NA
## 7 1958 9 1958.708 313.20 313.20 315.91 NA
## 8 1958 10 1958.792 NA 312.66 315.61 NA
## 9 1958 11 1958.875 313.33 313.33 315.31 NA
## 10 1958 12 1958.958 314.67 314.67 315.61 NA
## # ... with 707 more rows
Success! We have read in the data. Now we’re ready to rock and roll.
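The same idea can be tried offline. The sketch below writes a small stand-in file to a temporary location and reads it back with `read_table`, using `skip` to step over the header lines. The two header lines are invented, the data rows are copied from the output above, and `path` and `toy` are illustrative names.

```r
library(readr)

# Write a stand-in file: two documentation lines, then three data rows.
path <- tempfile(fileext = ".txt")
writeLines(c("# documentation line 1 (made up)",
             "# documentation line 2 (made up)",
             "1958 3 1958.208 315.71 315.71 314.62 -1",
             "1958 6 1958.458 -99.99 317.10 314.85 -1",
             "2017 11 2017.875 405.14 405.14 407.22 26"),
           path)

toy <- read_table(path,
                  col_names = c("year", "month", "decimal_date",
                                "average", "interpolated", "trend", "days"),
                  col_types = "iiddddi",        # i = integer, d = double
                  na = c("-1", "-99.99"),
                  skip = 2)                     # exactly the two documentation lines
toy
```

Note that `skip` counts lines, so the value of 72 used for the real file is only right as long as the documentation block keeps that length.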
Viewing data
Once parsed and imported, it’s a good idea to take a look at your data, both to get a sense of its size, names, and shape and to keep an eye out for missing values or errors.
For this stage, a combination of the str, names, summary, View, head and tail functions can be helpful.
# to get the names of the columns
names(co2)
## [1] "year" "month" "decimal_date" "average"
## [5] "interpolated" "trend" "days"
# to check out the full structure of the R object
str(co2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 717 obs. of 7 variables:
## $ year : int 1958 1958 1958 1958 1958 1958 1958 1958 1958 1958 ...
## $ month : int 3 4 5 6 7 8 9 10 11 12 ...
## $ decimal_date: num 1958 1958 1958 1958 1959 ...
## $ average : num 316 317 318 NA 316 ...
## $ interpolated: num 316 317 318 317 316 ...
## $ trend : num 315 315 315 315 315 ...
## $ days : int NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 7
## .. ..$ year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ month : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ decimal_date: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ average : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ interpolated: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ trend : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ days : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
nrow(co2)
## [1] 717
ncol(co2)
## [1] 7
# to get a summary of the object
summary(co2) # here we can get a good sense of the missing values in the days column and average column.
## year month decimal_date average
## Min. :1958 Min. : 1.000 Min. :1958 Min. :313.2
## 1st Qu.:1973 1st Qu.: 4.000 1st Qu.:1973 1st Qu.:328.6
## Median :1988 Median : 7.000 Median :1988 Median :350.9
## Mean :1988 Mean : 6.506 Mean :1988 Mean :353.2
## 3rd Qu.:2002 3rd Qu.: 9.000 3rd Qu.:2003 3rd Qu.:374.4
## Max. :2017 Max. :12.000 Max. :2018 Max. :409.6
## NA's :7
## interpolated trend days
## Min. :312.7 Min. :314.6 Min. : 0.00
## 1st Qu.:328.3 1st Qu.:328.9 1st Qu.:24.00
## Median :350.2 Median :350.4 Median :26.00
## Mean :352.9 Mean :352.9 Mean :25.34
## 3rd Qu.:374.2 3rd Qu.:374.5 3rd Qu.:28.00
## Max. :409.6 Max. :407.2 Max. :31.00
## NA's :194
# for the first or last `n` lines of the data frame
head(co2) # check out the R help (?head): the default is n = 6 rows
## # A tibble: 6 x 7
## year month decimal_date average interpolated trend days
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1958 3 1958.208 315.71 315.71 314.62 NA
## 2 1958 4 1958.292 317.45 317.45 315.29 NA
## 3 1958 5 1958.375 317.50 317.50 314.71 NA
## 4 1958 6 1958.458 NA 317.10 314.85 NA
## 5 1958 7 1958.542 315.86 315.86 314.98 NA
## 6 1958 8 1958.625 314.93 314.93 315.94 NA
tail(co2, 20)
## # A tibble: 20 x 7
## year month decimal_date average interpolated trend days
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 2016 4 2016.292 407.42 407.42 404.52 25
## 2 2016 5 2016.375 407.70 407.70 404.30 29
## 3 2016 6 2016.458 406.81 406.81 404.48 26
## 4 2016 7 2016.542 404.39 404.39 403.97 28
## 5 2016 8 2016.625 402.25 402.25 404.13 23
## 6 2016 9 2016.708 401.03 401.03 404.57 24
## 7 2016 10 2016.792 401.57 401.57 404.95 29
## 8 2016 11 2016.875 403.53 403.53 405.62 27
## 9 2016 12 2016.958 404.42 404.42 405.20 29
## 10 2017 1 2017.042 406.13 406.13 405.89 26
## 11 2017 2 2017.125 406.42 406.42 405.61 26
## 12 2017 3 2017.208 407.18 407.18 405.61 23
## 13 2017 4 2017.292 409.00 409.00 406.10 25
## 14 2017 5 2017.375 409.65 409.65 406.24 27
## 15 2017 6 2017.458 408.84 408.84 406.51 26
## 16 2017 7 2017.542 407.07 407.07 406.65 28
## 17 2017 8 2017.625 405.07 405.07 406.94 29
## 18 2017 9 2017.708 403.38 403.38 406.93 26
## 19 2017 10 2017.792 403.64 403.64 407.03 27
## 20 2017 11 2017.875 405.14 405.14 407.22 26
# to see the whole table in an RStudio window, run the following line, uncommented.
# View(co2)
# also double click from the environment
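Since missing values are one of the things we are looking for, a quick per-column count can complement `summary`. This is a small sketch on a made-up three-row data frame (the name `toy` and its values are for illustration only); the same two calls would work on `co2`.

```r
# Toy data frame with missing values in two columns
toy <- data.frame(year    = c(1958L, 1958L, 2017L),
                  average = c(315.71, NA, 405.14),
                  days    = c(NA, NA, 26L))

dim(toy)              # rows and columns in one call
colSums(is.na(toy))   # number of NA values in each column
```

`is.na()` returns a table of TRUE/FALSE values, and `colSums()` counts the TRUE entries column by column.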
Subsetting data
Subsetting can be done in a variety of ways through base R and the tidyverse. Here we are going to cover the following ways:
- select(), tidyverse
- filter(), tidyverse
- bracket [] notation, base R
- dollar sign $ notation, base R
- subset function, base R
co2[,"year"]
## # A tibble: 717 x 1
## year
## <int>
## 1 1958
## 2 1958
## 3 1958
## 4 1958
## 5 1958
## 6 1958
## 7 1958
## 8 1958
## 9 1958
## 10 1958
## # ... with 707 more rows
co2[,1]
## # A tibble: 717 x 1
## year
## <int>
## 1 1958
## 2 1958
## 3 1958
## 4 1958
## 5 1958
## 6 1958
## 7 1958
## 8 1958
## 9 1958
## 10 1958
## # ... with 707 more rows
co2$year # what's the difference here?
## [1] 1958 1958 1958 1958 1958 1958 1958 1958 1958 1958 1959 1959 1959 1959
## [15] 1959 1959 1959 1959 1959 1959 1959 1959 1960 1960 1960 1960 1960 1960
## [29] 1960 1960 1960 1960 1960 1960 1961 1961 1961 1961 1961 1961 1961 1961
## [43] 1961 1961 1961 1961 1962 1962 1962 1962 1962 1962 1962 1962 1962 1962
## [57] 1962 1962 1963 1963 1963 1963 1963 1963 1963 1963 1963 1963 1963 1963
## [71] 1964 1964 1964 1964 1964 1964 1964 1964 1964 1964 1964 1964 1965 1965
## [85] 1965 1965 1965 1965 1965 1965 1965 1965 1965 1965 1966 1966 1966 1966
## [99] 1966 1966 1966 1966 1966 1966 1966 1966 1967 1967 1967 1967 1967 1967
## [113] 1967 1967 1967 1967 1967 1967 1968 1968 1968 1968 1968 1968 1968 1968
## [127] 1968 1968 1968 1968 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969
## [141] 1969 1969 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970
## [155] 1971 1971 1971 1971 1971 1971 1971 1971 1971 1971 1971 1971 1972 1972
## [169] 1972 1972 1972 1972 1972 1972 1972 1972 1972 1972 1973 1973 1973 1973
## [183] 1973 1973 1973 1973 1973 1973 1973 1973 1974 1974 1974 1974 1974 1974
## [197] 1974 1974 1974 1974 1974 1974 1975 1975 1975 1975 1975 1975 1975 1975
## [211] 1975 1975 1975 1975 1976 1976 1976 1976 1976 1976 1976 1976 1976 1976
## [225] 1976 1976 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977
## [239] 1978 1978 1978 1978 1978 1978 1978 1978 1978 1978 1978 1978 1979 1979
## [253] 1979 1979 1979 1979 1979 1979 1979 1979 1979 1979 1980 1980 1980 1980
## [267] 1980 1980 1980 1980 1980 1980 1980 1980 1981 1981 1981 1981 1981 1981
## [281] 1981 1981 1981 1981 1981 1981 1982 1982 1982 1982 1982 1982 1982 1982
## [295] 1982 1982 1982 1982 1983 1983 1983 1983 1983 1983 1983 1983 1983 1983
## [309] 1983 1983 1984 1984 1984 1984 1984 1984 1984 1984 1984 1984 1984 1984
## [323] 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 1986 1986
## [337] 1986 1986 1986 1986 1986 1986 1986 1986 1986 1986 1987 1987 1987 1987
## [351] 1987 1987 1987 1987 1987 1987 1987 1987 1988 1988 1988 1988 1988 1988
## [365] 1988 1988 1988 1988 1988 1988 1989 1989 1989 1989 1989 1989 1989 1989
## [379] 1989 1989 1989 1989 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990
## [393] 1990 1990 1991 1991 1991 1991 1991 1991 1991 1991 1991 1991 1991 1991
## [407] 1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 1993 1993
## [421] 1993 1993 1993 1993 1993 1993 1993 1993 1993 1993 1994 1994 1994 1994
## [435] 1994 1994 1994 1994 1994 1994 1994 1994 1995 1995 1995 1995 1995 1995
## [449] 1995 1995 1995 1995 1995 1995 1996 1996 1996 1996 1996 1996 1996 1996
## [463] 1996 1996 1996 1996 1997 1997 1997 1997 1997 1997 1997 1997 1997 1997
## [477] 1997 1997 1998 1998 1998 1998 1998 1998 1998 1998 1998 1998 1998 1998
## [491] 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 2000 2000
## [505] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2001 2001 2001 2001
## [519] 2001 2001 2001 2001 2001 2001 2001 2001 2002 2002 2002 2002 2002 2002
## [533] 2002 2002 2002 2002 2002 2002 2003 2003 2003 2003 2003 2003 2003 2003
## [547] 2003 2003 2003 2003 2004 2004 2004 2004 2004 2004 2004 2004 2004 2004
## [561] 2004 2004 2005 2005 2005 2005 2005 2005 2005 2005 2005 2005 2005 2005
## [575] 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 2007 2007
## [589] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2008 2008 2008 2008
## [603] 2008 2008 2008 2008 2008 2008 2008 2008 2009 2009 2009 2009 2009 2009
## [617] 2009 2009 2009 2009 2009 2009 2010 2010 2010 2010 2010 2010 2010 2010
## [631] 2010 2010 2010 2010 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011
## [645] 2011 2011 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012
## [659] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2014 2014
## [673] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2015 2015 2015 2015
## [687] 2015 2015 2015 2015 2015 2015 2015 2015 2016 2016 2016 2016 2016 2016
## [701] 2016 2016 2016 2016 2016 2016 2017 2017 2017 2017 2017 2017 2017 2017
## [715] 2017 2017 2017
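One way to answer the question in the comment above is to ask R for the class of each result. The sketch below uses a two-row toy tibble (the name `toy` and its values are illustrative, not part of the NOAA data).

```r
library(tibble)

toy <- tibble(year = c(1958L, 1959L), average = c(315.71, 315.58))

class(toy[, "year"])    # still a tibble with one column
class(toy$year)         # a plain integer vector
class(toy[["year"]])    # [[ ]] also pulls out the vector

# A base data.frame behaves differently: single-column [ ] drops to a vector
class(as.data.frame(toy)[, "year"])
```

So bracket notation on a tibble keeps the table structure, while `$` and `[[ ]]` extract the column's contents.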
co2 %>% select(year, average)
## # A tibble: 717 x 2
## year average
## <int> <dbl>
## 1 1958 315.71
## 2 1958 317.45
## 3 1958 317.50
## 4 1958 NA
## 5 1958 315.86
## 6 1958 314.93
## 7 1958 313.20
## 8 1958 NA
## 9 1958 313.33
## 10 1958 314.67
## # ... with 707 more rows
co2[, c("year", "average")]
## # A tibble: 717 x 2
## year average
## <int> <dbl>
## 1 1958 315.71
## 2 1958 317.45
## 3 1958 317.50
## 4 1958 NA
## 5 1958 315.86
## 6 1958 314.93
## 7 1958 313.20
## 8 1958 NA
## 9 1958 313.33
## 10 1958 314.67
## # ... with 707 more rows
co2[, c(1,4)]
## # A tibble: 717 x 2
## year average
## <int> <dbl>
## 1 1958 315.71
## 2 1958 317.45
## 3 1958 317.50
## 4 1958 NA
## 5 1958 315.86
## 6 1958 314.93
## 7 1958 313.20
## 8 1958 NA
## 9 1958 313.33
## 10 1958 314.67
## # ... with 707 more rows
co2 %>% select(-days) # select all columns except days
## # A tibble: 717 x 6
## year month decimal_date average interpolated trend
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1958 3 1958.208 315.71 315.71 314.62
## 2 1958 4 1958.292 317.45 317.45 315.29
## 3 1958 5 1958.375 317.50 317.50 314.71
## 4 1958 6 1958.458 NA 317.10 314.85
## 5 1958 7 1958.542 315.86 315.86 314.98
## 6 1958 8 1958.625 314.93 314.93 315.94
## 7 1958 9 1958.708 313.20 313.20 315.91
## 8 1958 10 1958.792 NA 312.66 315.61
## 9 1958 11 1958.875 313.33 313.33 315.31
## 10 1958 12 1958.958 314.67 314.67 315.61
## # ... with 707 more rows
co2[,-7]
## # A tibble: 717 x 6
## year month decimal_date average interpolated trend
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1958 3 1958.208 315.71 315.71 314.62
## 2 1958 4 1958.292 317.45 317.45 315.29
## 3 1958 5 1958.375 317.50 317.50 314.71
## 4 1958 6 1958.458 NA 317.10 314.85
## 5 1958 7 1958.542 315.86 315.86 314.98
## 6 1958 8 1958.625 314.93 314.93 315.94
## 7 1958 9 1958.708 313.20 313.20 315.91
## 8 1958 10 1958.792 NA 312.66 315.61
## 9 1958 11 1958.875 313.33 313.33 315.31
## 10 1958 12 1958.958 314.67 314.67 315.61
## # ... with 707 more rows
co2 %>% filter(year >= 1980, month == 12) # comma functions as "and" in the filter function
## # A tibble: 37 x 7
## year month decimal_date average interpolated trend days
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1980 12 1980.958 338.32 338.32 339.26 19
## 2 1981 12 1981.958 339.88 339.88 340.79 19
## 3 1982 12 1982.958 340.90 340.90 341.79 26
## 4 1983 12 1983.958 343.05 343.05 343.96 19
## 5 1984 12 1984.958 344.70 344.70 345.57 12
## 6 1985 12 1985.958 345.88 345.88 346.81 25
## 7 1986 12 1986.958 347.21 347.21 348.12 24
## 8 1987 12 1987.958 349.16 349.16 350.05 27
## 9 1988 12 1988.958 351.41 351.41 352.34 28
## 10 1989 12 1989.958 352.85 352.85 353.80 27
## # ... with 27 more rows
co2 %>% subset(year >= 1980 & month == 12) # but "and" must be written explicitly as & in the subset function
## # A tibble: 37 x 7
## year month decimal_date average interpolated trend days
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1980 12 1980.958 338.32 338.32 339.26 19
## 2 1981 12 1981.958 339.88 339.88 340.79 19
## 3 1982 12 1982.958 340.90 340.90 341.79 26
## 4 1983 12 1983.958 343.05 343.05 343.96 19
## 5 1984 12 1984.958 344.70 344.70 345.57 12
## 6 1985 12 1985.958 345.88 345.88 346.81 25
## 7 1986 12 1986.958 347.21 347.21 348.12 24
## 8 1987 12 1987.958 349.16 349.16 350.05 27
## 9 1988 12 1988.958 351.41 351.41 352.34 28
## 10 1989 12 1989.958 352.85 352.85 353.80 27
## # ... with 27 more rows
co2 %>% filter(month == 11 | month == 12) # | is equivalent to "or"
## # A tibble: 119 x 7
## year month decimal_date average interpolated trend days
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1958 11 1958.875 313.33 313.33 315.31 NA
## 2 1958 12 1958.958 314.67 314.67 315.61 NA
## 3 1959 11 1959.875 314.80 314.80 316.78 NA
## 4 1959 12 1959.958 315.58 315.58 316.52 NA
## 5 1960 11 1960.875 315.00 315.00 316.98 NA
## 6 1960 12 1960.958 316.19 316.19 317.13 NA
## 7 1961 11 1961.875 316.10 316.10 318.13 NA
## 8 1961 12 1961.958 317.01 317.01 317.94 NA
## 9 1962 11 1962.875 316.69 316.69 318.62 NA
## 10 1962 12 1962.958 317.69 317.69 318.61 NA
## # ... with 109 more rows
co2[co2$month==12,]
## # A tibble: 59 x 7
## year month decimal_date average interpolated trend days
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1958 12 1958.958 314.67 314.67 315.61 NA
## 2 1959 12 1959.958 315.58 315.58 316.52 NA
## 3 1960 12 1960.958 316.19 316.19 317.13 NA
## 4 1961 12 1961.958 317.01 317.01 317.94 NA
## 5 1962 12 1962.958 317.69 317.69 318.61 NA
## 6 1963 12 1963.958 318.31 318.31 319.25 NA
## 7 1964 12 1964.958 318.71 318.71 319.59 NA
## 8 1965 12 1965.958 319.42 319.42 320.26 NA
## 9 1966 12 1966.958 321.08 321.08 321.95 NA
## 10 1967 12 1967.958 321.96 321.96 322.86 NA
## # ... with 49 more rows
co2[co2$year>=1980 & co2$month==12,]
## # A tibble: 37 x 7
## year month decimal_date average interpolated trend days
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1980 12 1980.958 338.32 338.32 339.26 19
## 2 1981 12 1981.958 339.88 339.88 340.79 19
## 3 1982 12 1982.958 340.90 340.90 341.79 26
## 4 1983 12 1983.958 343.05 343.05 343.96 19
## 5 1984 12 1984.958 344.70 344.70 345.57 12
## 6 1985 12 1985.958 345.88 345.88 346.81 25
## 7 1986 12 1986.958 347.21 347.21 348.12 24
## 8 1987 12 1987.958 349.16 349.16 350.05 27
## 9 1988 12 1988.958 351.41 351.41 352.34 28
## 10 1989 12 1989.958 352.85 352.85 353.80 27
## # ... with 27 more rows
co2[co2$month==12 | co2$month==11,]
## # A tibble: 119 x 7
## year month decimal_date average interpolated trend days
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1958 11 1958.875 313.33 313.33 315.31 NA
## 2 1958 12 1958.958 314.67 314.67 315.61 NA
## 3 1959 11 1959.875 314.80 314.80 316.78 NA
## 4 1959 12 1959.958 315.58 315.58 316.52 NA
## 5 1960 11 1960.875 315.00 315.00 316.98 NA
## 6 1960 12 1960.958 316.19 316.19 317.13 NA
## 7 1961 11 1961.875 316.10 316.10 318.13 NA
## 8 1961 12 1961.958 317.01 317.01 317.94 NA
## 9 1962 11 1962.875 316.69 316.69 318.62 NA
## 10 1962 12 1962.958 317.69 317.69 318.61 NA
## # ... with 109 more rows
# Note: testing for equality (which produces TRUE or FALSE) requires the double equal sign `==`; a single `=` assigns a value or names an argument
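The row-filtering rules used in this section can be checked side by side on a tiny example. This is a sketch on a made-up four-row data frame (`toy` and the result names are illustrative); `%in%` is a base R operator that offers a shorter spelling of a chain of `==` tests joined by `|`.

```r
library(dplyr)

toy <- data.frame(year  = c(1979L, 1980L, 1980L, 1981L),
                  month = c(12L, 11L, 12L, 12L))

# "and": a comma inside filter() behaves like &
with_comma <- toy %>% filter(year >= 1980, month == 12)
with_amp   <- toy %>% filter(year >= 1980 & month == 12)

# "or": | between two tests, or %in% with a vector of accepted values
with_or <- toy %>% filter(month == 11 | month == 12)
with_in <- toy %>% filter(month %in% c(11, 12))

nrow(with_comma)   # 2 rows: December 1980 and December 1981
nrow(with_or)      # 4 rows: every row is in November or December
```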
SideNote: What’s that little c() do?
Try running ?c in your R console to find out. We use this function regularly to combine values into vectors or lists.
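A few quick calls show what `c()` does in practice. The variable names here are illustrative only.

```r
months <- c(11, 12)              # a numeric vector of length 2
cols   <- c("year", "average")   # a character vector

length(months)
c(months, 1)        # c() also appends to an existing vector
c(1, "a")           # mixing types coerces everything to character
```

This is why calls such as `co2[, c("year", "average")]` work: the bracket receives a single vector holding both column names.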
SideNote: The Power of the Pipe %>%
You might be wondering what that weird symbol we just used is. This is a pipe, a function from the magrittr package, which is loaded with the tidyverse. Pipes are a powerful way to perform sequential operations on an R object. Using the pipe allows us to push the output of our first operation into our next operation seamlessly, without using intermediate objects or overwriting our original object.
co2 %>%
filter(year >= 1980, month == 12) %>%
select(year, average)
## # A tibble: 37 x 2
## year average
## <int> <dbl>
## 1 1980 338.32
## 2 1981 339.88
## 3 1982 340.90
## 4 1983 343.05
## 5 1984 344.70
## 6 1985 345.88
## 7 1986 347.21
## 8 1987 349.16
## 9 1988 351.41
## 10 1989 352.85
## # ... with 27 more rows
# This is the same but much more readable and much cleaner than the following:
co2_filter <- filter(co2, year >= 1980, month == 12)
co2_subset <- select(co2_filter, year, average)
co2_subset
## # A tibble: 37 x 2
## year average
## <int> <dbl>
## 1 1980 338.32
## 2 1981 339.88
## 3 1982 340.90
## 4 1983 343.05
## 5 1984 344.70
## 6 1985 345.88
## 7 1986 347.21
## 8 1987 349.16
## 9 1988 351.41
## 10 1989 352.85
## # ... with 27 more rows
It’s also worth noting that the piped version does not create an additional object unless you ask it to. This is super useful in the early stages of exploring and visualizing your data.
More information about pipes and their alternatives can be found here
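Stripped to its essentials, `x %>% f()` is just another way of writing `f(x)`. The sketch below shows this on a plain numeric vector; `x`, `piped` and `nested` are illustrative names.

```r
library(magrittr)

x <- c(4, 9, 16)

piped  <- x %>% sqrt() %>% sum()   # read left to right: take x, then sqrt, then sum
nested <- sum(sqrt(x))             # the same computation, read inside out

piped
identical(piped, nested)
```

With two steps the nested form is still readable, but each extra step adds another layer of parentheses, which is where the pipe starts to pay off.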
Sorting data
Often data is not in the exact form we want or we need additional information from our data. When this is the case, the tidyverse library has some helpful functions that, when combined, are powerful tools for rearranging and summarizing our data.
Group By & Summarise
group_by allows us to invisibly partition our data into groups, which can be powerful when we later want to apply functions or look at statistics for each group. Take a look: you’ll notice that the only thing that changes when we group_by year in the co2 dataframe is the addition of a small line in the tibble header: “# Groups: year [60]”
co2 %>% group_by(year)
## # A tibble: 717 x 7
## # Groups: year [60]
## year month decimal_date average interpolated trend days
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1958 3 1958.208 315.71 315.71 314.62 NA
## 2 1958 4 1958.292 317.45 317.45 315.29 NA
## 3 1958 5 1958.375 317.50 317.50 314.71 NA
## 4 1958 6 1958.458 NA 317.10 314.85 NA
## 5 1958 7 1958.542 315.86 315.86 314.98 NA
## 6 1958 8 1958.625 314.93 314.93 315.94 NA
## 7 1958 9 1958.708 313.20 313.20 315.91 NA
## 8 1958 10 1958.792 NA 312.66 315.61 NA
## 9 1958 11 1958.875 313.33 313.33 315.31 NA
## 10 1958 12 1958.958 314.67 314.67 315.61 NA
## # ... with 707 more rows
Everything else appears the same! We still have 717 rows and 7 columns. All the names are the same. BUT… if we pass this new grouped dataframe into another function like summarise, check out what happens:
co2 %>% group_by(year) %>% summarise(`Number of measurements` = n(), `Average year's trend` = mean(trend))
## # A tibble: 60 x 3
## year `Number of measurements` `Average year's trend`
## <int> <int> <dbl>
## 1 1958 10 315.2830
## 2 1959 12 315.9742
## 3 1960 12 316.9075
## 4 1961 12 317.6367
## 5 1962 12 318.4500
## 6 1963 12 318.9942
## 7 1964 12 319.6217
## 8 1965 12 320.0433
## 9 1966 12 321.3833
## 10 1967 12 322.1567
## # ... with 50 more rows
The summarise function allows you to build a new table with completely new columns, based upon any operations you want to run on your original table. Without the group_by, this same summarise command would return only one row:
co2 %>% summarise(`Number of measurements` = n(), `Average trend` = mean(trend))
## # A tibble: 1 x 2
## `Number of measurements` `Average trend`
## <int> <dbl>
## 1 717 352.8727
But once we “group” the dataframe, R knows to compute our functions across the groups we specify.
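One detail worth watching when summarising a column that contains missing values, as `average` does: `mean()` returns NA unless told to drop them. A minimal sketch on made-up numbers (the data frame `toy` and its values are invented for illustration):

```r
library(dplyr)

toy <- data.frame(year    = c(1958L, 1958L, 1959L, 1959L),
                  average = c(315, NA, 316, 318))

by_year <- toy %>%
  group_by(year) %>%
  summarise(n          = n(),
            with_na    = mean(average),                 # NA if any value is missing
            without_na = mean(average, na.rm = TRUE))   # missing values dropped first
by_year
```

The `trend` column used above has no missing values, which is why `mean(trend)` worked there without `na.rm = TRUE`.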
Mutating
The mutate function is similar to summarise in that it allows you to take values from within a data table and compute something new, but in this case R will append the result as a new column to the original dataframe. For instance, perhaps we wanted to make a column combining the month and year for our dataset:
co2 %>% mutate(month_year = paste0(month,"/", year))
## # A tibble: 717 x 8
## year month decimal_date average interpolated trend days month_year
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <chr>
## 1 1958 3 1958.208 315.71 315.71 314.62 NA 3/1958
## 2 1958 4 1958.292 317.45 317.45 315.29 NA 4/1958
## 3 1958 5 1958.375 317.50 317.50 314.71 NA 5/1958
## 4 1958 6 1958.458 NA 317.10 314.85 NA 6/1958
## 5 1958 7 1958.542 315.86 315.86 314.98 NA 7/1958
## 6 1958 8 1958.625 314.93 314.93 315.94 NA 8/1958
## 7 1958 9 1958.708 313.20 313.20 315.91 NA 9/1958
## 8 1958 10 1958.792 NA 312.66 315.61 NA 10/1958
## 9 1958 11 1958.875 313.33 313.33 315.31 NA 11/1958
## 10 1958 12 1958.958 314.67 314.67 315.61 NA 12/1958
## # ... with 707 more rows
group_by also works to group things before mutate. For instance, what if we wanted a column holding the average CO2 value for each year?
co2 %>% group_by(year) %>% mutate(year_average= mean(average, na.rm=TRUE))
## # A tibble: 717 x 8
## # Groups: year [60]
## year month decimal_date average interpolated trend days year_average
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 1958 3 1958.208 315.71 315.71 314.62 NA 315.3313
## 2 1958 4 1958.292 317.45 317.45 315.29 NA 315.3313
## 3 1958 5 1958.375 317.50 317.50 314.71 NA 315.3313
## 4 1958 6 1958.458 NA 317.10 314.85 NA 315.3313
## 5 1958 7 1958.542 315.86 315.86 314.98 NA 315.3313
## 6 1958 8 1958.625 314.93 314.93 315.94 NA 315.3313
## 7 1958 9 1958.708 313.20 313.20 315.91 NA 315.3313
## 8 1958 10 1958.792 NA 312.66 315.61 NA 315.3313
## 9 1958 11 1958.875 313.33 313.33 315.31 NA 315.3313
## 10 1958 12 1958.958 314.67 314.67 315.61 NA 315.3313
## # ... with 707 more rows
Together, group_by, mutate, and summarise are some of your most powerful tools for data manipulation.
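As a small illustration of the three working together, the sketch below computes each row's deviation from its own year's mean. The data frame `toy`, its values, and the column name `anomaly` are made up for this example.

```r
library(dplyr)

toy <- data.frame(year    = c(1958L, 1958L, 1959L, 1959L),
                  average = c(314, 316, 315, 319))

out <- toy %>%
  group_by(year) %>%
  mutate(year_average = mean(average),            # one value per group, repeated on each row
         anomaly      = average - year_average) %>%
  ungroup()                                       # drop the grouping once we are done
out
```

Unlike summarise, mutate keeps all four rows; calling `ungroup()` at the end is a common habit so that later steps do not silently operate per group.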
Plotting data
Plotting Data with ggplot
Effective visualizations are an integral part of data science: poorly organized or poorly labelled figures can be as much a source of peril as of understanding. Nevertheless, the ability to generate plots quickly with minimal tinkering is an essential skill. As standards for visualizations have increased, too often visualization is seen as an end rather than a means of data analysis. See Fox & Hendler (2011) for more discussion of this.
ggplot(co2, aes(decimal_date, average)) + geom_line()
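Since poorly labelled figures were named as a risk, here is a hedged sketch of the same kind of plot with explicit labels added through `labs()`. It uses a three-row excerpt copied from the output earlier in this document; the label text and the names `toy` and `p` are our own choices for illustration.

```r
library(ggplot2)

toy <- data.frame(decimal_date = c(1958.208, 1958.292, 1958.375),
                  average      = c(315.71, 317.45, 317.50))

p <- ggplot(toy, aes(decimal_date, average)) +
  geom_line() +
  labs(x = "Year",
       y = "CO2 (ppmv)",
       title = "Monthly mean CO2 (toy excerpt)")
p
```

Storing the plot in an object such as `p` also makes it easy to pass to `ggsave()` later instead of relying on the last plot drawn.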
Plotting multiple series
We often would like to plot several data values together for comparison, for example the average, interpolated and trend co2 data. We can do this in three steps:
- subsetting the dataset to the columns desired for plotting
co2_sub <- co2 %>% select(decimal_date, average, interpolated, trend)
co2_sub %>% head()
## # A tibble: 6 x 4
## decimal_date average interpolated trend
## <dbl> <dbl> <dbl> <dbl>
## 1 1958.208 315.71 315.71 314.62
## 2 1958.292 317.45 317.45 315.29
## 3 1958.375 317.50 317.50 314.71
## 4 1958.458 NA 317.10 314.85
## 5 1958.542 315.86 315.86 314.98
## 6 1958.625 314.93 314.93 315.94
- rearranging the data into a “long” data table where the data values are stacked together in one column and there is a separate column that keeps track of whether the data came from the average, interpolated, or trend column. Notice that by using the same name, we overwrite the original co2_sub
co2_sub <- co2_sub %>% gather(series, ppmv, -decimal_date)
co2_sub %>% head()
## # A tibble: 6 x 3
## decimal_date series ppmv
## <dbl> <chr> <dbl>
## 1 1958.208 average 315.71
## 2 1958.292 average 317.45
## 3 1958.375 average 317.50
## 4 1958.458 average NA
## 5 1958.542 average 315.86
## 6 1958.625 average 314.93
- plotting
co2_sub %>% ggplot(aes(decimal_date, ppmv, col = series)) + geom_line()
Or, even better, we can take advantage of dplyr’s nifty piping abilities and accomplish all of these steps in one block of code. Beyond being more succinct, this has the added benefit of not creating a new object for the subsetted data.
co2 %>%
select(decimal_date, average, interpolated, trend) %>%
gather(series, ppmv, -decimal_date) %>%
ggplot(aes(decimal_date, ppmv, col = series)) + geom_line()
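In tidyr 1.0.0 and later, `gather` still works but `pivot_longer` is the recommended way to do the same wide-to-long reshaping. The sketch below shows it on a two-row excerpt copied from the output above; `wide` and `long` are illustrative names.

```r
library(tidyr)

wide <- data.frame(decimal_date = c(1958.208, 1958.292),
                   average      = c(315.71, 317.45),
                   interpolated = c(315.71, 317.45),
                   trend        = c(314.62, 315.29))

long <- pivot_longer(wide,
                     cols      = -decimal_date,   # stack every column except the date
                     names_to  = "series",        # old column names go here
                     values_to = "ppmv")          # the stacked values go here
long
```

The result has one row per date and series (2 dates x 3 series = 6 rows). The row order differs from `gather`, which stacks one whole column after another, but that makes no difference to the plot.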
Writing out Data or objects
Often, after doing all the work to clean up your data, you want to write out the clean file. This is simple with the write_* functions.
write_csv(co2_sub, "co2clean.csv")
We can even write out our ggplot images:
# defaults to saving your last plot; pass plot = to save a different one
ggsave("plot1.png", device = "png")
## Saving 7 x 5 in image
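A quick way to convince yourself that a file was written correctly is to read it straight back in. This sketch does a round trip through a temporary directory; the data frame `toy` is a two-row excerpt of the long table, and `path` and `back` are illustrative names.

```r
library(readr)

toy <- data.frame(decimal_date = c(1958.208, 1958.292),
                  series       = c("average", "average"),
                  ppmv         = c(315.71, 317.45))

path <- file.path(tempdir(), "co2clean.csv")    # temporary location for the example
write_csv(toy, path)

back <- read_csv(path, col_types = "dcd")       # d = double, c = character
back
all.equal(back$ppmv, toy$ppmv)
```

Giving the file a `.csv` extension costs nothing and lets other programs recognise the format.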