-
Notifications
You must be signed in to change notification settings - Fork 3
/
dbms2.html
1026 lines (766 loc) · 22.6 KB
/
dbms2.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html>
<html>
<head>
<title>Relational Databases 2</title>
<meta charset="utf-8">
<style>
@import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
@import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
@import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
body { font-family: 'Droid Serif'; }
h1, h2, h3 {
font-family: 'Yanone Kaffeesatz';
font-weight: normal;
}
.remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; }
</style>
</head>
<body>
<textarea id="source">
class: center, middle
# Relational Databases - 2
### Duncan Temple Lang
<div style="clear: both"/>
<!-- <hr width="50%"/> -->
---
# Previously.....
+ Motivation for relational databases
+ sqlite3 command line in shell/terminal
+ DBI and RSQLite packages in R
+ dbConnect()
+ `SELECT variables FROM table`
+ dbGetQuery()
---
# dbListTables() & dbListFields()
+ Names of the tables in the database
```
dbListTables(db)
```
+ Names of the fields in a table
```
dbListFields(db, "moz_cookies")
```
+ name of table
+ List fields of all tables
```
lapply(dbListTables(db), function(tbl) dbListFields(db, tbl))
```
---
# SELECT Query - number of rows
+ R Command dbGetQuery()
```
dbGetQuery(db, "SELECT COUNT(*) FROM moz_cookies")
```
+ 1st Argument - Connection to the database
+ Created one time
```
db = dbConnect(SQLite(), "/path/to/sqlite/database/file")
```
+ 2nd Argument is SQL command as a string
+ Analogous to regular expression in `grep(rx, str)`
+ interpreted by regular expression engine.
+ not interpreted by R
---
# Single Table versus Two or more tables
+ Today we'll start with one table and build up the elements of SQL
+ Then move to 2+ tables.
---
# Subsetting
+ WHERE
```sql
SELECT * FROM moz_cookies WHERE expiry > 1632529635
```
+ equality operator `=`
```sql
SELECT * FROM moz_cookies WHERE host = 'www.nytimes.com'
```
+ Also `<`, `<=`, `>=`, etc.
+ Also !=, but can also use NOT
```sql
SELECT COUNT(*) FROM moz_cookies WHERE host != 'www.nytimes.com'
```
+ But there is a more general NOT operator
```sql
SELECT count(*) FROM moz_cookies WHERE NOT host = 'www.nytimes.com'
```
---
# Other Predicate Tests
+ `BETWEEN val1 AND val2`
```sql
SELECT * FROM moz_cookies
WHERE expiry BETWEEN 1000000 AND 2000000
```
+ ` IN (val, val, ...)`
```sql
SELECT * FROM moz_cookies
WHERE host IN ('www.nytimes.com', 'www.washingtonpost.com')
```
+ IS NULL
+ for detecting values that are NULL or empty.
+ `WHERE host LIKE '%.org'`
---
# Simple Pattern Matching - LIKE
+ `LIKE 'pattern'`
+ Different from regular expressions
+ Some DBMS provide regular expression functions.
+ built-in or via extensions
+ _ - matches any character
+ % - matches 0 or more (any) characters
+ e.g. host values that end with .org
```sql
SELECT host FROM moz_cookies
WHERE host LIKE '%.org'
```
+ e.g. host name that starts with `www.`, ends with `.com` and has three characters in between
```sql
SELECT host FROM moz_cookies
WHERE host LIKE 'www.___.com'
```
---
# NOT
+ With these operators beyond =, we need NOT as more general version than !=
```sql
WHERE NOT expiry BETWEEN 1000000 AND 2000000
WHERE NOT host IN ('www.nytimes.com', 'www.washingtonpost.com')
WHERE NOT host LIKE ('%.org')
```
---
# Combining Conditions
+ Combine logical tests/predicates
+ AND, OR
```sql
SELECT *
FROM moz_cookies
WHERE host = 'www.nytimes.com'
OR
host = 'www.washingtonpost.com'
```
---
# Naming Columns in the Result - `AS`
+ Sometimes names of columns are inconvenient
```sql
a = dbGetQuery(con, "SELECT host, COUNT(*) FROM moz_cookies GROUP BY host")
names(a)
```
+ second coulmn's name is `"COUNT(*)"`
+ Use AS keyword in query to map name to column in results
+ Only in the results, not in the original table(s) in the DBMS.
```sql
SELECT var AS name
FROM table;
```
```sql
SELECT COUNT(*) AS num
FROM table;
```
<!--
```
a = dbGetQuery(con, "SELECT host, COUNT(*) AS num FROM moz_cookies GROUP BY host")
names(a)
```
-->
---
# Operations on Columns
+ Can transform variables with functions and arithmetic
```sql
SELECT age / 10, log (weight), weight/height, substr(IPAddress, 1, 3)
FROM table;
```
+ NOTE: SQLite has very few built-in math functions
+ These work tuple by tuple and return a value for each tuple
+ Scalar functions
+ Different from aggregate functions such as COUNT(), MIN(), MAX(), AVG()
+ operate one tuple at a time but only return a single result, not per tuple.
---
# Operations on Columns (ctd.)
+ Use AS to rename these columns, as desired
+ Can also use the new variable name in the WHERE clause
```sql
SELECT DISTINCT host, length(host) AS Len
FROM moz_cookies
WHERE Len > 10
```
+ `COUNT (DISTINCT var)`
```r
dbGetQuery(con, "SELECT COUNT(name) AS Num FROM moz_cookies ")
dbGetQuery(con, "SELECT COUNT( DISTINCT host) AS Num FROM moz_cookies ")
```
---
# Subset based on value computed from table
+ Suppose we want to get the rows that have a value of expiry greater than the log of the mean of expiry
+ In R
```r
df[ df$expiry > mean(log(df$expiry)), ]
```
or
```
expiry.mean = mean(log(df$expiry))
df[ df$expiry > expiry.mean, ]
```
+ Tempting to do in SQL similarly as
```
dbGetQuery(con, "SELECT * FROM moz_cookies WHERE log(expiry) > AVG(log(expiry))")
```
+ Doesn't work - get an error message.
```
Parse error: misuse of aggregate function AVG()
SELECT * FROM moz_cookies WHERE log(expiry) > AVG(log(expiry));
error here ---^
```
---
# Nested Queries
+ Instead
```
SELECT * FROM moz_cookies
WHERE log(expiry) > (SELECT AVG(log(expiry)) from moz_cookies)
```
+ Another approach is 2 queries from R
+ Get the mean
+ Insert that value into a second query
```{r}
expiry.mean = dbGetQuery(con, "SELECT AVG(log(expiry)) FROM moz_cookies")[1, 1]
qry = sprintf("SELECT * FROM moz_cookies WHERE log(expiry) > %lf", expiry.mean)
ans = dbGetQuery(con, qry)
```
---
# Grouping
+ Want to operate separately on groups of rows group by one or more variables
+ Like tapply(), by(), aggregate() in R
+ Not just want the rows from the DBMS arranged by this grouping variable
+ Then just use ORDER BY to arrange the rows
+ `GROUP BY`
```sql
SELECT COUNT(host)
FROM moz_cookies
GROUP BY host;
```
+ Almost always also want the group label/value in the result
```
dbGetQuery(con, "SELECT host, COUNT( host) AS Num FROM moz_cookies GROUP BY host")
```
---
# `ORDER BY`
+ And often want to order the results
+ could do it in R, but also in DBMS
```sql
SELECT host, COUNT( host) AS Num
FROM moz_cookies
GROUP BY host
ORDER BY Num;
```
+ Control order
```sql
SELECT host, COUNT( host) AS Num
FROM moz_cookies
GROUP BY host
ORDER BY Num DESC;
```
---
# Subsetting the Groups
+ Can limit to a subset of the grouping values
+ Can subset/post-process the results in R.
+ In SQL, can use nested SELECT
```sql
SELECT host, COUNT(*)
FROM
(SELECT * FROM moz_cookies
WHERE
host IN ('www.nytimes.com',
'www.washingtonpost.com',
'.wikipedia.org'))
GROUP BY host
```
or
```sql
SELECT host, COUNT(*)
FROM moz_cookies
WHERE host IN ('www.nytimes.com',
'www.washingtonpost.com',
'.wikipedia.org'))
GROUP BY host
```
---
# HAVING
+ Can't always subset the original table with WHERE, but instead need to
subset the result of the GROUP BY
+ Subset the results based on a dynamic/data-based value
+ e.g. groups where the count > 10 or average count
```sql
# WRONG
SELECT host, COUNT(*) As Num
FROM moz_cookies
GROUP BY host
WHERE Num > 10
```
+ Syntax error
+ WHERE needs to apply to the rows created via the SELECT-FROM
+ Instead, use HAVING for subsetting the results of GROUP BY
```sql
SELECT host, COUNT(*) As Num
FROM moz_cookies
GROUP BY host
HAVING Num > 10
```
---
# HAVING clause
+ Often use HAVING on COUNT() which has aggregated the records in a sub-group to a single tuple
+ However, HAVING works row by row within each sub-group
+ COUNT() is a special case where only one tuple in each sub-group.
+ See havingEg.sql
---
# Test Database
+ Create database with a table
```sql
CREATE TABLE Foo (
g INTEGER,
age INTEGER
);
```
+ Populate with sample values
```sql
INSERT INTO Foo VALUES
(1, 20),
(1, 30),
(1, 10),
(2, 55),
(2, 27);
```
+ Test query
```sql
select * FROM Foo
GROUP BY g
HAVING age = MIN(age);
1|10
2|27
```
---
# Working in Chunks
+ Query that generates large amount of data
+ Don't want to have it all in R at the same time
+ Or want to process each group and do something with related other data
+ Or want to work asynchronously
+ Have database work on query while we continue do something else (in R session) as we wait
+ Check back with the database to
+ `rs = dbSendQuery(db, sql)`
+ Send query and immediately return to R
+ don't wait for the result
---
# Working with Chunks (ctd.)
+ Returned object from `dbSendQuery()` is a `result set`
+ Can query it to see if there are rows available
+ read `n` of those rows.
+ `df = dbFetch(rs, n)`
+ Repeat and consume rows in blocks
---
# VIEW
+ Ordinarily, the result set disappears after query.
+ Sometimes we want to operate on the result set with several subsequent queries on that temporary result set.
+ Use a VIEW - virtual table that wasn't in defined in the original DBMS
+ Will disappear when we disconnect from the DBMS
```sql
CREATE VIEW OrgCookies
AS SELECT * FROM moz_cookies WHERE host LIKE '%.org';
SELECT COUNT(*) FROM OrgCookies;
```
+ Can explicitly discard view with `DROP VIEW`
```sql
DROP VIEW OrgCookies
```
---
# Relational Databases
## Multiple Tables and JOIN
+ *Thanks to Nick Ulle and Clark Fitzgerald for examples.*
+ Simple example
+ 2 tables
+ Company name, stock ticker string, income and SIC code(?)
+ `fang_info`
```
ticker company_name sic_code net_income
1 AMZN AMAZON COM INC 5961 2371000000
2 GOOGLE ALPHABET INC. 7370 19478000000
3 FB FACEBOOK INC 7370 10188000000
4 NFLX NETFLIX INC 7841 186678000
```
---
+ SIC code is a Standard Industrial Classification for an industry
+ Table of SIC codes and descriptions
+ fang_sic
```
SIC Description
1 3292 Abestos Products
2 5182 Wine and Distilled Alcoholic Beverages
3 5961 Catalog and Mail-Order Houses
4 7841 Video Tape Rental
```
---
# Goal
+ For each row in fang_info, get the corresponding Description value from fang_sic
+ Different possible outcomes
+ Only the rows where there is a match on the SIC code in the two tables
```
ticker company_name sic_code net_income SIC Description
1 AMZN AMAZON COM INC 5961 2371000000 5961 Catalog and Mail-Order Houses
2 NFLX NETFLIX INC 7841 186678000 7841 Video Tape Rental
```
+ Dropped the two rows in fang_info with sic_code value 7370 since not in fang_sic
+ matched the sic_code to SIC and got Description
+ **INNER JOIN**
---
# Alternative, may want
```
ticker company_name sic_code net_income SIC Description
1 AMZN AMAZON COM INC 5961 2371000000 5961 Catalog and Mail-Order Houses
2 GOOGLE ALPHABET INC. 7370 19478000000 NA NA
3 FB FACEBOOK INC 7370 10188000000 NA NA
4 NFLX NETFLIX INC 7841 186678000 7841 Video Tape Rental
```
+ all rows from fang_info
+ Description and SIC from fang_sic when they match
+ NULL (or missing values in R) when no match
+ ** LEFT JOIN**
---
# Cartesian Product - OUTER JOIN
```
ticker company_name sic_code net_income SIC Description
1 AMZN AMAZON COM INC 5961 2371000000 3292 Abestos Products
2 AMZN AMAZON COM INC 5961 2371000000 5182 Wine and Distilled Alcoholic Beverages
3 AMZN AMAZON COM INC 5961 2371000000 5961 Catalog and Mail-Order Houses
4 AMZN AMAZON COM INC 5961 2371000000 7841 Video Tape Rental
5 GOOGLE ALPHABET INC. 7370 19478000000 3292 Abestos Products
6 GOOGLE ALPHABET INC. 7370 19478000000 5182 Wine and Distilled Alcoholic Beverages
7 GOOGLE ALPHABET INC. 7370 19478000000 5961 Catalog and Mail-Order Houses
8 GOOGLE ALPHABET INC. 7370 19478000000 7841 Video Tape Rental
9 FB FACEBOOK INC 7370 10188000000 3292 Abestos Products
10 FB FACEBOOK INC 7370 10188000000 5182 Wine and Distilled Alcoholic Beverages
11 FB FACEBOOK INC 7370 10188000000 5961 Catalog and Mail-Order Houses
12 FB FACEBOOK INC 7370 10188000000 7841 Video Tape Rental
13 NFLX NETFLIX INC 7841 186678000 3292 Abestos Products
14 NFLX NETFLIX INC 7841 186678000 5182 Wine and Distilled Alcoholic Beverages
15 NFLX NETFLIX INC 7841 186678000 5961 Catalog and Mail-Order Houses
16 NFLX NETFLIX INC 7841 186678000 7841 Video Tape Rental
```
+ Obtain with
```sql
SELECT * FROM fang_info
JOIN fang_sic
```
---
# Queries with 2 or More Tables
+ Specify the tables after the FROM
+ comma-separated list
+ Refer to the columns/variables with `tableName.columnName`
+ e.g. fang_info.sic_code
+ Example
```sql
SELECT *
FROM fang_info, fang_sic
WHERE fang_info.sic_code = fang_sic.SIC;
```
```
ticker company_name sic_code net_income SIC Description
1 AMZN AMAZON COM INC 5961 2371000000 5961 Catalog and Mail-Order Houses
2 NFLX NETFLIX INC 7841 186678000 7841 Video Tape Rental
```
---
+ Can drop tableName in tableName.columnName if unambiguous
+ i.e. name not in two or more tables
```sql
SELECT *
FROM fang_info, fang_sic
WHERE sic_code = SIC;
```
+ Can also use AS to give short name for table
```sql
SELECT *
FROM fang_info AS fi, fang_sic AS sic
WHERE fi.sic_code = sic.SIC;
```
---
# INNER JOIN
+ Want
```
ticker company_name sic_code net_income SIC Description
1 AMZN AMAZON COM INC 5961 2371000000 5961 Catalog and Mail-Order Houses
2 NFLX NETFLIX INC 7841 186678000 7841 Video Tape Rental
```
```sql
SELECT *
FROM fang_info
INNER JOIN fang_sic
ON fang_info.sic_code = fang_sic.SIC;
```
+ Same as
```sql
SELECT *
FROM fang_info, fang_sic
WHERE fang_info.sic_code = fang_sic.SIC;
```
---
# LEFT JOIN
+ Result has a row for each row in the left table
+ NULL values for columns from right table when no match in the right table
```sql
SELECT *
FROM fang_info
LEFT JOIN fang_sic
ON fang_info.sic_code = fang_sic.SIC;
```
---
# Tuple Matches Many Tuples
+ Matches don't have to match one tuple to a single tuple
+ Consider table fang_prices
```
ticker date high
1 AMZN 2018-02-07 1460.99
2 FB 2018-02-07 185.08
3 GOOGLE 2018-02-07 1086.53
4 NFLX 2018-02-07 272.45
5 AMZN 2018-02-06 1443.99
6 FB 2018-02-06 185.77
7 GOOGLE 2018-02-06 1087.38
8 NFLX 2018-02-06 266.70
9 AMZN 2018-02-05 1458.98
10 FB 2018-02-05 190.61
```
+ Join fang_info with fang_prices to merge each company with several stock prices for that company
```
SELECT *
FROM fang_info AS i
INNER JOIN fang_prices AS p
ON i.ticker = p.ticker;
```
---
# Result
```
ticker company_name sic_code net_income ticker date high
1 AMZN AMAZON COM INC 5961 2371000000 AMZN 2018-02-05 1458.98
2 AMZN AMAZON COM INC 5961 2371000000 AMZN 2018-02-06 1443.99
3 AMZN AMAZON COM INC 5961 2371000000 AMZN 2018-02-07 1460.99
4 GOOGLE ALPHABET INC. 7370 19478000000 GOOGLE 2018-02-06 1087.38
5 GOOGLE ALPHABET INC. 7370 19478000000 GOOGLE 2018-02-07 1086.53
6 FB FACEBOOK INC 7370 10188000000 FB 2018-02-05 190.61
7 FB FACEBOOK INC 7370 10188000000 FB 2018-02-06 185.77
8 FB FACEBOOK INC 7370 10188000000 FB 2018-02-07 185.08
9 NFLX NETFLIX INC 7841 186678000 NFLX 2018-02-06 266.70
10 NFLX NETFLIX INC 7841 186678000 NFLX 2018-02-07 272.45
```
+ Same result as
```sql
SELECT *
FROM fang_info AS i,
fang_prices AS p
WHERE i.ticker = p.ticker;
```
---
# Note
+
```sql
SELECT * FROM fang_info AS i
INNER JOIN fang_prices AS p
ON i.ticker = p.ticker;
```
+ Note the use of AS for the table names
+ The need to differentiate the two ticker columns
+ The result contains two columns named ticker
```
ticker company_name sic_code net_income ticker date high
```
---
# Order of Evaluation of SQL Query
+ Understanding how the different elements of a query are evaluated helps to reason and debug
+ Consider the stackexchange data
```sql
SELECT PostTypeId, value, COUNT(PostTypeId)
FROM Posts AS p, PostTypeIdMap AS m
WHERE PostTypeId = m.Id
AND value IN ('Question', 'Answer')
GROUP BY PostTypeId
```
---
# Evaluation Order
+ FROM or JOIN
+ WHERE - selects the rows to be operated on.
+ executed before both the aggregate functions and the SELECT statement.
+ GROUP BY
+ any grouping levels eliminated by the WHERE will not appear as groups.
+ HAVING evaluated after the aggregation functions
+ SELECT - creates a new table with columns specified and the rows produced by the clauses before
+ DISTINCT - after SELECT has created a new table
+ ORDER BY
+ LIMIT
---
# What about Nested Queries?
+ They are evaluated according to the same order as above.
+ Whatever step in which the nested query occurs
---
# Reasoning about JOINs
+ Two tables <!-- - A and B -->
+ Compatible columns allowing us to link/join rows
```sql
select *
FROM fang_info;
```
```
AMZN|AMAZON COM INC|5961|2371000000.0
GOOGLE|ALPHABET INC.|7370|19478000000.0
FB|FACEBOOK INC|7370|10188000000.0
NFLX|NETFLIX INC|7841|186678000.0
```
```sql
select *
FROM fang_sic;
```
```
3292|Abestos Products
5182|Wine and Distilled Alcoholic Beverages
5961|Catalog and Mail-Order Houses
7841|Video Tape Rental
```
---
# TIMES/Cartesian Product/CROSS JOIN
+ Consider all possible pairs of tuples between the two tables
```sql
SELECT *
FROM fang_info
CROSS JOIN fang_sic;
AMZN|AMAZON COM INC|5961|2371000000.0|3292|Abestos Products
AMZN|AMAZON COM INC|5961|2371000000.0|5182|Wine and Distilled Alcoholic Beverages
AMZN|AMAZON COM INC|5961|2371000000.0|5961|Catalog and Mail-Order Houses
AMZN|AMAZON COM INC|5961|2371000000.0|7841|Video Tape Rental
GOOGLE|ALPHABET INC.|7370|19478000000.0|3292|Abestos Products
GOOGLE|ALPHABET INC.|7370|19478000000.0|5182|Wine and Distilled Alcoholic Beverages
GOOGLE|ALPHABET INC.|7370|19478000000.0|5961|Catalog and Mail-Order Houses
GOOGLE|ALPHABET INC.|7370|19478000000.0|7841|Video Tape Rental
FB|FACEBOOK INC|7370|10188000000.0|3292|Abestos Products
FB|FACEBOOK INC|7370|10188000000.0|5182|Wine and Distilled Alcoholic Beverages
FB|FACEBOOK INC|7370|10188000000.0|5961|Catalog and Mail-Order Houses
FB|FACEBOOK INC|7370|10188000000.0|7841|Video Tape Rental
NFLX|NETFLIX INC|7841|186678000.0|3292|Abestos Products
NFLX|NETFLIX INC|7841|186678000.0|5182|Wine and Distilled Alcoholic Beverages
NFLX|NETFLIX INC|7841|186678000.0|5961|Catalog and Mail-Order Houses
NFLX|NETFLIX INC|7841|186678000.0|7841|Video Tape Rental
```
+ Most of these mix apples and oranges
+ rows that don't correspond to same entity/unit
+ BTW, equivalent to
```
SELECT *
FROM fang_info, fang_sic;
```
---
# WHERE
+ INNER JOIN
+ Keep only the rows where sic_code = sic
<pre>
AMZN|AMAZON COM INC|5961|2371000000.0|3292|Abestos Products
AMZN|AMAZON COM INC|5961|2371000000.0|5182|Wine and Distilled Alcoholic Beverages
<a style="color:red">AMZN|AMAZON COM INC|5961|2371000000.0|5961|Catalog and Mail-Order Houses</a>
AMZN|AMAZON COM INC|5961|2371000000.0|7841|Video Tape Rental
GOOGLE|ALPHABET INC.|7370|19478000000.0|3292|Abestos Products
GOOGLE|ALPHABET INC.|7370|19478000000.0|5182|Wine and Distilled Alcoholic Beverages
GOOGLE|ALPHABET INC.|7370|19478000000.0|5961|Catalog and Mail-Order Houses
GOOGLE|ALPHABET INC.|7370|19478000000.0|7841|Video Tape Rental
FB|FACEBOOK INC|7370|10188000000.0|3292|Abestos Products
FB|FACEBOOK INC|7370|10188000000.0|5182|Wine and Distilled Alcoholic Beverages
FB|FACEBOOK INC|7370|10188000000.0|5961|Catalog and Mail-Order Houses
FB|FACEBOOK INC|7370|10188000000.0|7841|Video Tape Rental
NFLX|NETFLIX INC|7841|186678000.0|3292|Abestos Products
NFLX|NETFLIX INC|7841|186678000.0|5182|Wine and Distilled Alcoholic Beverages
NFLX|NETFLIX INC|7841|186678000.0|5961|Catalog and Mail-Order Houses
<a style="color:red">NFLX|NETFLIX INC|7841|186678000.0|7841|Video Tape Rental</a>
</pre>
---
# LEFT JOIN
+ Same as INNER JOIN
+ THEN
+ Append the rows/tuples that had no match with the second/right table
+ rows from left/first table
+ NULL values for all the columns from second/right table.
---
# Cumulative Sum
+ How to compute the cumulative sum in SQL?
+ across rows/tuples,
+ but not a single answer
+ Let's make a simple table for exploring
```sql
CREATE TABLE A (
year INTEGER,
val INTEGER
);
INSERT INTO A VALUES
(2004, 2), (2003, 12), (2002, 3), (2005, 5);
2004|2
2003|12
2002|3
2005|5
```
---
# Cumulative Sum
+ Want to compute cumulative sum across years of val, to get
```
year a cumulativeSum
2002|3 |3
2003|12|15
2004|2 |17
2005|5 |22
```
---
# Window functions and the OVER operator
+ aggregate functions (e.g. SUM, MIN, AVG) compute over all tuples in a group or entire table.
+ typically reduce to one result
+ Window functions work on one or more rows,
+ but returns a value for each tuple.
```sql
SELECT year, val, SUM(val)
OVER (ORDER BY year DESC)
FROM A;
```
+ Order the result (not the window function) by year
```sql
SELECT year, val, SUM(val)
OVER (ORDER BY year DESC)
FROM A
ORDER BY year;
```
+ Or can "name" the window, creating it later in the query
```sql
SELECT a, SUM(a) OVER win
FROM A
WINDOW win AS (ORDER BY a DESC);
```
---
# Compute differences between values
+ Differences between years
+ e.g. 2003's value - 2002's value,
+ 2005's value - 2004's value
+ Let's add a new year
```sql
INSERT INTO A VALUES (2010, 32);
```
+ Want to get
```
2002|
2003|1
2004|1