#'
#' [Download the source](week2.r)
#'
#' # Class miscellanea
#'
#' For those who are enrolled:
#'
#' 1. Attendance taken every class
#'     - If you're enrolled, email us and we will give you
#'       access to the attendance Google Doc.
#' 2. Final, brief summary due to Sanjay at course end
#'
#' If you need help outside of class, first go through the `swirl` tutorials
#' and then contact us (or contact us to help you get started on `swirl`).
#'
#' ```
#' install.packages("swirl")
#' library(swirl)
#' swirl() # have fun! :)
#' ```
#'
#' # Recap of what we've learned
#'
#'
#' 1. R is a giant calculator
#'     - More properly, a Turing Machine
#'     - Let's thank Alan and Ada
#'
#' ![Alan](turing.jpg)
#' ![Ada](ada.jpg)
#'
#' 2. R keeps data in vectors, matrices, and (**new to you**) data frames
#'     - Use special syntax to index, or reference parts of,
#'       each of these structures.
#' 3. R uses functions to do things
#'     - plotting
#'     - random number generation
#'     - stats
#'
#' # R is programming
#'
#' R stores data, and functions that do things with that data, in your **Environment**.
#'
#' You can use functions by typing commands directly into the console, or you can write
#' commands in a script, like this one, and run commands from there (the preferred method).
#'
#' Let's compartmentalize things a bit:
#'
#' 1. Raw data stored as text in a comma separated value file on your computer
#'     - SPSS files work too \*
#' 2. Your R script saves all of your procedures for
#'     1. Loading your raw data
#'     2. Cleaning your raw data
#'         - saving your cleaned data for faster loading
#'     3. Analyzing your cleaned data
#'     4. Saving your results to text, html, pdf, and so forth.
#' 3. The R environment holds in working memory (RAM) your
#'     - loaded data (like, so loaded man)
#'     - base functions
#'     - loaded package functions (after using, e.g., `library(AwesomePackage)`)
#'
#' Here's an example, in flowchart, of what this looks like put together:
#'
#' ![R Workflow](rworkflow.png)
#'
#' If you don't fully grok this right now, don't worry, you will. For now just
#' be thinking about how R sort of keeps separate the data on your hard drive,
#' the copy of that data you're working on, and all of the functions you're
#' applying to that data, either to fix it up, analyze it, or present your
#' analyses.
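#'
#' As a rough sketch of that separation (not part of the class data): the file
#' names, column names, and values below are made up, and everything is written
#' to a temporary folder so nothing else on your computer gets touched.
#'
# 1. Pretend this is raw data sitting on your hard drive (hypothetical values)
rawFile <- file.path(tempdir(), 'rawExample.csv')
writeLines(c('id,score', '1,4', '2,NA', '3,5'), rawFile)
# 2a. Load the raw data into the environment (a copy; the file is unchanged)
rawExample <- read.csv(rawFile)
# 2b. Clean it: here, just drop rows with a missing score
cleanExample <- rawExample[!is.na(rawExample$score), ]
# 2c. Save the cleaned copy for faster loading later
cleanFile <- file.path(tempdir(), 'cleanExample.csv')
write.csv(cleanExample, file=cleanFile, row.names=FALSE)
# 3. Analyze the cleaned data
mean(cleanExample$score)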
#'
#' # Functions: The Basics
#'
#' Functions take input (data, other functions, option flags) and often give you output.
#'
#' You can save this output, as we saw, using `<-`.
#'
#' You can write your own functions and save them like this:
#'
reverseScore <- function(aVector, minValue, maxValue){
  # Flip each score within the scale, e.g. 1 <-> 5 and 2 <-> 4 on a 1-5 scale
  reversed <- maxValue - aVector + minValue
  return(reversed)
}
# Let's test:
reverseScore(c(1,2,3,4,5), 1, 5)
#'
#' Functions are stored using variable names just like data. Every function
#' you'll be using is written in this way. If you want to look inside a function,
#' just type the function name without `()`.
#'
reverseScore
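#'
#' A small variation, as a sketch: arguments can have default values, so you
#' only type them when they differ. `reverseScoreDefault` is a made-up name for
#' this illustration, and it assumes a 1-to-5 scale unless told otherwise.
#'
reverseScoreDefault <- function(aVector, minValue=1, maxValue=5){
  reversed <- maxValue - aVector + minValue
  return(reversed)
}
# Uses the defaults (1 to 5):
reverseScoreDefault(c(1, 2, 5))
# Override one default for a 1-to-7 scale, and save the output with `<-`:
reversedSeven <- reverseScoreDefault(c(1, 4, 7), maxValue=7)
reversedSeven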
#'
#' # The structure of Data Frames
#'
#' ## Quick aside into the land of prepackaged data
#'
#' To get started, let's load the `psych` package by
#' the preeminent personality psychologist Bill Revelle at Northwestern:
#'
library(psych)
#'
#' Aside from providing a host of useful functions, we will also get access
#' to a ton of neat data.
#'
# Run this to see all available data in `psych`: data(package='psych')
#'
#' We're going to load the data about vegetable preferences...
#'
data(vegetables)
str(veg)
#'
#' ## How to `data.frame`
#'
#' Unlike vectors and matrices, data frames can hold data of all different
#' types: a column of numbers, another column of words (i.e., character strings),
#' and maybe another column of a special type called factors.
#'
#' To demonstrate this, I'm going to quickly add a column of random grouping
#' to the `veg` data frame:
#'
veg$group <- sample(c('Awesome', 'NotSoMuch'), size=dim(veg)[1], replace=TRUE)
str(veg)
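#'
#' To see the mixed column types in isolation, here is a tiny made-up data
#' frame (the names and values are invented for illustration):
#'
mixedExample <- data.frame(
  rating=c(4.5, 3.0, 5.0),            # numbers
  veggie=c('Beet', 'Corn', 'Peas'),   # character strings
  group=factor(c('a', 'b', 'a')),     # a factor
  stringsAsFactors=FALSE              # keep `veggie` as character
)
# One class per column:
sapply(mixedExample, class)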
#'
#' Just like matrices, you can reference rows and columns numerically
#'
# first row, all columns:
veg[1, ]
# first column, all rows:
veg[, 1] # Notice we just get back a vector (no 'orientation')
# the cell at row 8, column 9:
veg[8, 9]
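#'
#' If you would rather keep a single column as a data frame instead of a bare
#' vector, `drop=FALSE` does that. A quick sketch on a made-up data frame:
#'
dropExample <- data.frame(a=1:3, b=c(10, 20, 30))
dropExample[, 1]               # a plain vector
dropExample[, 1, drop=FALSE]   # still a data frame, with one column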
#'
#' Have you noticed the columns are named?
#'
names(veg)
#'
#' You can also reference columns by these names:
#'
veg[, 'Asp']
veg[1, 'group']
#'
#' Here's another way, for some reason:
#'
veg$Beet
veg$Peas[4]
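#'
#' One more relative of `$`, sketched on made-up data: `[[` does the same job
#' but also accepts a column name stored in a variable, which `$` does not.
#'
bracketExample <- data.frame(Beet=c(0.2, 0.4), Peas=c(0.6, 0.8))
whichColumn <- 'Peas'
bracketExample[[whichColumn]]      # same as bracketExample$Peas
bracketExample[[whichColumn]][2]   # second value of that column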
#'
#' You can also refer to ranges:
#'
veg[1:3, 'Beet']
veg[1:3, 1:4]
veg$Car[5:6]
#'
#' There are row names too because this is actually a matrix of the proportion
#' of times one vegetable was preferred over another -- ignore that for now.
#'
#' You can also index the ranges you don't want:
#'
# Who cares about turnips?
veg[-1, -1]
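#'
#' Negative numbers only work with positions. To drop a column by name, one
#' option (a sketch on made-up data) is a logical test against `names()`:
#'
dropByName <- data.frame(Turn=c(0.5, 0.8), Cab=c(0.2, 0.5), Beet=c(0.3, 0.4))
dropByName[, names(dropByName) != 'Turn']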
#'
#' ## Some classic R subsetting
#'
#' You'll see this stuff a lot. It's convenient but *ugly* shorthand.
#' At least it's not MATLAB.
#'
#' In addition to putting the number(s) indexing the rows and columns you want,
#' you can also index using a vector of `TRUE` or `FALSE` values:
#'
TFVector <- rep(c(TRUE, FALSE), length.out=dim(veg)[1])
TFVector
veg[TFVector, ]
#'
#' This seems dumb, but you can use it to subset your data by whether
#' or not some row has a certain value:
#'
awesomeVector <- veg$group == 'Awesome'
awesomeVector
veg[awesomeVector, ]
# Or just combine it:
veg[veg$group == 'Awesome', ]
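#'
#' Logical vectors can be combined with `&` (and) and `|` (or) before you use
#' them to index. A sketch on a made-up data frame:
#'
subsetExample <- data.frame(
  group=c('Awesome', 'NotSoMuch', 'Awesome', 'Awesome'),
  score=c(0.2, 0.9, 0.7, 0.6)
)
# Rows that are in the Awesome group AND have a score above .5:
subsetExample[subsetExample$group == 'Awesome' & subsetExample$score > .5, ]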
#'
#' There are better ways, a lot of which we'll see next week.
#'
#' # R Ain't Loopy
#'
#' `for` loops are a necessary and useful programming tool to master.
#' However, R wants you to perform your operations on whole vectors (those
#' collections of numbers) at once. That is, R works faster if you don't visit
#' each cell of a data frame using a `for` loop, but instead ask it to do
#' something to each column. Even basic arithmetic works this way:
#'
aVec <- 1:10
aVec
aVec + 2.5
aVec*100
aVec + c(-.5, .25)
#'
#' In other words, most R functions already know how to do something to each element
#' in a vector.
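#'
#' To make the contrast concrete, here is a sketch of the loop you would
#' otherwise write, next to the vectorized version (variable names are made up):
#'
loopInput <- 1:10
loopResult <- numeric(length(loopInput))
for (i in seq_along(loopInput)) {
  loopResult[i] <- loopInput[i] + 2.5
}
vectorResult <- loopInput + 2.5
identical(loopResult, vectorResult)
# When the vectors differ in length, R recycles the shorter one, so
# 1:10 + c(-.5, .25) adds -.5 to the 1st, 3rd, ... and .25 to the 2nd, 4th, ...
identical(1:10 + c(-.5, .25), 1:10 + rep(c(-.5, .25), times=5))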
#'
#' Combine that with the fact that each column in a data frame is just a vector, and that
#' R has built-in functions for doing something to each column in a data frame,
#' and you can write more R-like code that runs faster and is easier to read.
#'
#' These functions are called the `apply` functions. We'll show you recent
#' innovations on these functions next week.
#'
#' I'm first just going to show you how to take the mean of each column in `veg`,
#' except for the group column:
#'
vegMeans <- sapply(veg[, -dim(veg)[2]], mean)
vegMeans
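#'
#' Dropping the last column by position works here because `group` happens to
#' be last. A sketch of a less position-dependent approach, on made-up data:
#' pick out the numeric columns first. For plain means, base R's `colMeans`
#' gives the same answer.
#'
meansExample <- data.frame(a=c(1, 2, 3), group=c('x', 'y', 'x'), b=c(10, 20, 60))
isNumericColumn <- sapply(meansExample, is.numeric)
sapply(meansExample[, isNumericColumn], mean)
colMeans(meansExample[, isNumericColumn])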
#'
#' If you wanted to transform each cell in the data, say by log transforming it,
#' you could do this:
#'
logVeg <- sapply(veg[, -dim(veg)[2]], log)
logVeg
#'
#' This is where writing your own functions comes in handy. I want to get back both the mean
#' and SD of each column:
#'
meanAndSD <- function(aVec){
  aMean <- mean(aVec)
  aSD <- sd(aVec)
  return(c(mean=aMean, sd=aSD))
}
vegMeanAndSD <- sapply(veg[, -dim(veg)[2]], meanAndSD)
vegMeanAndSD
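#'
#' For a one-off like this you do not have to name the function first. A
#' sketch with an anonymous function on made-up data:
#'
anonExample <- data.frame(a=c(1, 2, 3), b=c(2, 4, 6))
sapply(anonExample, function(aVec) c(mean=mean(aVec), sd=sd(aVec)))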
#'
#' # Save that clean data
#'
#' Let's pretend that the log transformed data is our 'cleaned' data, and
#' we want to save it for later ('cause log transforming takes SOOO LOOOONG):
#'
write.csv(logVeg, file='logTransVegetables.csv')
# To read it back in later:
read.csv('logTransVegetables.csv')
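#'
#' One thing to watch for: by default `write.csv` also writes the row names,
#' which come back as an extra column when you read the file. A sketch of a
#' round trip that avoids this, using made-up data and a temporary file:
#'
roundTrip <- data.frame(Beet=c(0.1, 0.2), Corn=c(0.3, 0.4))
roundTripFile <- file.path(tempdir(), 'roundTrip.csv')
write.csv(roundTrip, file=roundTripFile, row.names=FALSE)
roundTripBack <- read.csv(roundTripFile)
all.equal(roundTrip, roundTripBack)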
#'
#' # Some notes on EDA
#'
#' What you want to do is look at all your data, however you can make
#' that happen.
#'
#' We could use `sapply` to get a histogram for every column of the veg data:
#'
sapply(veg[, -dim(veg)[2]], hist)
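#'
#' `sapply` does not pass the column name along, so every histogram gets the
#' same generic title. A sketch of one way around that, looping over the names
#' instead (made-up data; `invisible` just hides the printed return value):
#'
histExample <- data.frame(a=c(1, 2, 2, 3, 3, 3), b=c(5, 6, 6, 7, 8, 9))
invisible(lapply(names(histExample), function(aName){
  hist(histExample[[aName]], main=aName, xlab=aName)
}))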
#'
#' A preview of things to come: `ggplot2` is awesome, and someone made an
#' EDA package using it that does this:
#'
library(GGally) #install.packages('GGally')
data(iris)
ggpairs(iris)
#'
#' Using some of the functions we'll talk about next week, you can automate
#' a lot of the creation of custom plots and analyses (like, if you want
#' to look at ICCs for all of your variables).
#'
#' # On deck
#'
#' - Next week: `dplyr`, `tidyr` for manipulating your data ('cause sometimes you gotta).
#' - Future times: You try!, Massively awesome plotting, ????
#'