-
Notifications
You must be signed in to change notification settings - Fork 0
/
Introduction.Rmd
137 lines (74 loc) · 6.53 KB
/
Introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
---
title: "R for Humans, Ensemble Modeling, and Deep learning"
author: " "
date: "February 15th, 2018"
output:
ioslides_presentation:
widescreen: true
logo: images/Rlogo.png
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Tonights talks
A potpourri of talks, spanning a range of R topics!
**_Best practices in writing readable & maintainable R code_ - [Peter Hurford](https://github.com/iveksl2)**
**_Ensemble modeling, sports analytics & hedge funds_ - [Igor Veksler](https://github.com/iveksl2)**
**_Deep Learning with R using Keras_ - [Rajiv Shah](http://RajivShah.com)**
Special Thanks to IBM for sponsoring this meetup and making it possible!
```{r, out.width = "400px"}
path <- paste0(getwd(), "/images/IBM-Logo.jpg")
knitr::include_graphics(path)
```
# Intro to tonight's talks
## 1880: Statistician solves a Big Data problem {.columns-2 .smaller}
**1880** Since its inception in 1790, the U.S. census has expanded dramatically. It takes 8 years to process.
Herman Hollerith, **_a Statistician of the 1880 census_**, sets out to process it faster. Inspired by railway conductors, he creates punch cards to collect data. Inserted into a machine, the holes in each card allow an electrical circuit to complete, advancing a counting wheel for each topic and card.
**1890** The first census to use Hollerith's Tabulating Machine is a success, completing in just 1 year.
**1896** Hollerith forms the "Tabulating Machine Company", which merges with others to become **International Business Machines**, or **IBM**.
Sources: [Census.gov/history](https://www.census.gov/history/www/census_then_now/notable_alumni/herman_hollerith.html) and pg. 36 _The Innovators_, Walter Isaacson
```{r, out.width = "400px"}
path <- paste0(getwd(), "/images/HollerithMachine.jpg")
knitr::include_graphics(path)
```
## John Tukey, a Data Science pioneer {.columns-2 .smaller}
**1962** [_The Future of Data Analysis_](https://www.jstor.org/stable/2237638)
**_For a long time I have thought I was a statistician, interested in inferences from the particular to the general. ... I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data._**
**1964** [_The technical tools of statistics_](http://ect.bell-labs.com/sl/tukey/memo/techtools.html)
**_Today, software and hardware together provide far more powerful factories than most statisticians realize, factories that many of today's most able young people find exciting and worth learning about on their own._**
**1977**: [Exploratory Data Analysis](https://www.amazon.com/Exploratory-Data-Analysis-John-Tukey/dp/0201076160)
**_Since the aim of exploratory data analysis is to learn what seems to be, it should be no surprise that pictures play a vital role in doing it well. There is nothing better for making you think of questions you had forgotten to ask (even mentally)_**
Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems 'S'-PLUS and **`R`**. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends and patterns in data that merited further study.
[Source: Wiki Exploratory Data Analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis)
## R (and S) for Humans, always the goal! {.columns-2 .smaller}
**1975** Richard Becker and John Chambers, both statisticians working on data analysis at Bell Labs and both computer scientists, were frustrated with having to interrupt their data analysis to do the required programming.
They wanted a more interactive process, where they could just type an expression at the command line and get back an immediate response, or better yet, a bar graph or other visual.
**1977: _Analalysis that was once difficult became easy in S_.**
**1991** `R`, based entirely on `S`, is created by Ross Ihaka and Robert Gentleman, borrowing all the hard-won lessons developed by the S community.
[Source: From S to R: 35 Years of Leadership in Statistical Computing](http://www.research.att.com/articles/featured_stories/2013_09/201309_SandR.html)
```{r, out.width = "150px"}
path <- paste0(getwd(), "/images/201309_RandS_timeline.jpg")
knitr::include_graphics(path)
```
## A rich history of extending R, prior to Keras {.columns-2 .smaller}
R succeeds today because it is very easy to extend.
If the driving force behind S was interactivity, for R it was extensibility. Partly this extensibility stems from S, which was designed to interface easily with outside elements (specifically with subroutine libraries). R retains this interfacing capability so it easily incorporates native code as well as objects, analysis, and visualizations created in other programs, giving R users options outside of R. And R is extensible also because it's open source.
Objects created in other systems and then imported into R are accessed in the same way R objects are accessed. It's easy to write native code in Java, C++, or any other language, and then have R call the code directly, with almost no loss in speed or performance.
[Source: From S to R: 35 Years of Leadership in Statistical Computing](http://www.research.att.com/articles/featured_stories/2013_09/201309_SandR.html)
```{r, out.width = "150px"}
path <- paste0(getwd(), "/images/201309_RandS_timeline.jpg")
knitr::include_graphics(path)
```
## Ensemble Methods, some early history {.columns-2 .smaller}
**1984** Classification and Regression Trees, Leo Breiman and Jerome H. Friedman
**1994** Bagging Predictors, Leo Breiman
**1995** Random Decision Forests, Tin Kam Ho at **Bell Labs**
**1999** Random Forest - _Random Features_, Leo Breiman
**2000 AdaBoost** Additive Logistic Regression: A Statistical View of Boosting. Jerome Friedman, Trevor Hastie, Rob Tibshirani.
**2001 _The Two Cultures_**. Leo Breiman.
**2002** `randomForest` package. Breiman & Cutler's Fortran Code ported to R by Andy Liaw and Matthew Wiener.
[Source Revolution Analytics blog: A Thumbnail History of Ensemble Methods](http://blog.revolutionanalytics.com/2014/03/a-thumbnail-history-of-ensemble-methods.html)
```{r, out.width = "300px"}
path <- paste0(getwd(), "/images/CART-Book.jpg")
knitr::include_graphics(path)
```