-
Notifications
You must be signed in to change notification settings - Fork 4
/
04-data-manipulation.dyndoc
1474 lines (1172 loc) · 50.4 KB
/
04-data-manipulation.dyndoc
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Data Manipulation
Let's reload the "auto" data to discard any changes made in previous sections
and to start fresh.
```stata
<<dd_do>>
sysuse auto, clear
<</dd_do>>
```
## Restricting commands to subsets
We'll discuss operating on subsets of the data in far more detail a bit
[later](04-data-manipulation.qmd#subsetting), but first we'll discuss how to
modify the [basic command
syntax](01-the-basics-of-stata.qmd#basic-command-syntax) to run a command only
on some rows of data.
Recall the basic command syntax,
```stata
command <variable(s)>, <options>
```
By default, this will use all rows of the data it can. However, we can restrict
this.
```stata
command <variable(s)> in <number list>, <options>
command <variable(s)> if <condition>, <options>
```
Both are optional (obviously), but you can include them if desired.
Using `in`, we pass a number list which consists of a lower bound, a `/`, and an
upper bound. For example, if we wanted to summarize the first 10 rows for a
variable, we could run:
```stata
<<dd_do>>
summarize weight
summarize weight in 1/10
<</dd_do>>
```
As you can see, the second call to `summarize` thinks there are only 10 rows of
data.
The `if` requires defining a conditional statement. Consider the following
statements
$$
4 \gt 2
$$
$$
1 \gt 2
$$
Remembering back to middle school math classes that $\gt$ means "greater than",
clearly the first statement is true and the second statement is false. We can
assign values of true and false to any such conditional statements, which use
the following set of conditional operators:
| Sign | Definition | True example | False example |
|:------:|:-------------------------------|:--------------------------|:--------------------------:|
| $==$ | equality | $3 == 3$ | $3 == 2$ |
| $!=$ | not equal | $3 != 4$ | $3 != 3$ |
| $\gt$ | greater than | $4 \gt 2$ | $1 \gt 2$ |
| $\lt$ | less than | $1 \lt 2$ | $4 \lt 2$ |
| $\gt=$ | greater than or equal to | $4 \gt= 4$ | $1 \gt= 2$ |
| $\lt=$ | less than or equal to | $1 \lt= 1$ | $4 \lt= 2$ |
| \& | and (both statements are true) | $(4 \gt 2)$ \& $(3 == 3)$ | $(4 \gt 2)$ \& $(1 \gt 2)$ |
| $|$ | or (either statement is true) | $(3 == 2) | (1 \lt= 2)$ | $(4 \lt 2) | (1 \gt 2)$ |
So we could summarize a variable only when some other variables have some
values.
```stata
<<dd_do>>
summarize weight if foreign == 1
summarize weight if foreign == 1 | (mpg > 20 & headroom < 10)
<</dd_do>>
```
Note in the second example we used parentheses to evaluate a more complex
expression; we follow order of operations (remember PEMBAS?) and evaluate the
inner-most parantheses first. So first `mpg > 20 & headroom < 10` gets evaluated
and returns `TRUE` or `FALSE`; then following that, we evaluate either `foreign
== 1 | TRUE` or `foreign == 1 | FALSE` depending on what the first result was.
We saw the usage of this earlier when discussing [loading subsets of the
data](02-working-with-data-sets.qmd#loading-subsets-of-the-data).
## Generating new variables
The `generate` command can be used to create new variables which are functions
of existing variables. For example, if we look at the variable label for
`weight`, we see that it is measured in pounds.
```stata
<<dd_do>>
describe weight
<</dd_do>>
```
Let's create a second weight variable measured in tons. The syntax for
`generate` is straightforward,
```stata
generate <new varname> = <function of old variables>
```
```stata
<<dd_do>>
generate weight2 = weight/2000
<</dd_do>>
```
The `list` command can be used to output some data, let's use it here to output
the first 5 rows' `weight` and `weight2` variables:
```stata
<<dd_do>>
list weight* in 1/5
<</dd_do>>
```
(Note: I use `list` here because I need the variables outputted to create the
document. When using Stata interactively, it'd probably be nicer to use `browse`
or `edit` in the exact same fashion, e.g. `browse weights* in 1/5`. These enter
the Data Browser (`browse`) or Data Browser (Edit Mode) (`edit`) showing the
same subset of rows/columns as requested.)
If you check the arithmetic, you'll see that we've got the right answer. We
should probably add a variable label to our new `weight`
```stata
<<dd_do>>
label variable weight2 "Weight (tons)"
describe weight*
<</dd_do>>
```
In addition to direct arithmetic equations, we can use a number of functions to
perform calculations. For example, a common transformation is to take the log of
any monetary variable, in our case `price`. This is done because typical
monetary variables, such as price or salary, tend to be very right-skewed - most
people make $30k-50k, and a few people make 6 or 7 digit incomes.
```stata
<<dd_do>>
generate logprice = log(price)
label variable logprice "Log price"
list *price in 1/5
<</dd_do>>
```
In that command, `log` is the function name, and it is immediately followed by
parentheses which enclose the variable to operate on. Read the parentheses as
"of", so that `log(price)` is read as "log of price".
There are a lot of functions that can be used. We list some commonly used
mathematical functions below for your convenience:
- `+`, `-`, `*`, `/`: Standard arithmetic
- `abs( )`: returns the absolute value
- `exp( )`: returns the exponential function of $e^x$
- `log( )` or `ln( )`: returns the natural logarithm of the argument^[If you
want log with a different base, you can use the transformation that dividing
by `log(b)` is equivalent to using `b` as a base. In other words, if you need
log base 10, use `gen newvar = log(oldvar)/log(10)`.]
- `round( )`, `ceil( )`, `floor( )`: returns the rounded value (rounded to
nearest integer, rounded up, and rounded down)
- `sqrt( )`: returns the square root
You can see a full accounting of all functions you can use in this setting in
```stata
help functions
```
### Creating dummies
Dummy variables (also known as indicator variables or binary variables) are
variables which take on two values, 0 and 1^[Technically and mathematically they
can take on any two values, but your life will be easier if you stick with the
0/1 convention.]. These are typically used in a setting where the 0 represents
an absence of something (or an answer of "no") and 1 represents the presence (or
an answer of "yes"). When naming dummy variables, you should keep this in mind
to make understanding the variable easier, as well as extracting interpretations
regarding the variable in a model.
For example, "highschool" is a poor dummy variable - what does 0 highschool or 1
highschool represent? Obviously we could (and should) use [value
labels](03-data-management.qmd#labeling-values) to associate 0 and 1 with
informative labels, but it is more straightforward to use a variable name such
as "highschool_graduate" or "graduateded_highschool) - a 0 represents "no" to
the question of "graduated high school?", hence a non-high school graduate; and
a 1 represents a "yes", hence a high school graduate.
If you are collecting data, consider collecting data as dummies where
appropriate - if the question has a binary response, encode it as a dummy
instead of strings. If a question has categorical responses, consider encoding
them as a series of dummy variables instead (e.g. "Are you from MI?", "Are you
from OH?" etc). These changes will (usually) need to be made later anyways.
Now here's the trick: In Stata^[This is true of most statistical software in
fact.], [conditional
statements](04-data-manipulation.qmd#restricting-commands-to-subsets) return 1
(True) and 0 (False). So we can use them in `generate` statements to create
binary variables easily.
```stata
<<dd_do>>
generate price4k = price > 4000
list price* in 1/5
<</dd_do>>
```
Note that this is NOT the same thing as using
[`if`](04-data-manipulation.qmd#restricting-commands-to-subsets). E.g., we see
the following error:
```stata
<<dd_do>>
generate price4k2 = if price > 4000
<</dd_do>>
```
Now, `price4k` takes on values 1 and 0 depending on whether the conditional
statement was true.
For a slightly more complicated example, lets create a dummy variable
representing cheap cars. There are two possible definitions of cheap cars - cars
which have a low cost, or cars which have low maintenance costs (high mileage
and low repairs).
```stata
<<dd_do>>
generate cheap = price < 3500 | (rep78 <= 2 & mpg > 20)
list make price rep78 mpg if cheap == 1
<</dd_do>>
```
The `list` commands conditions on `cheap == 1` because again, the conditional
statement will return 1 for true and 0 for false. We see 6 cheap cars; the
Chevette and Zephyr are cheap because of their cost, whereas the other four cars
are cheap because of the maintenance costs.
### System Variables
In Stata, under the [One Data](01-the-basics-of-stata.qmd#one-data) principal,
any information in the data^[We'll see some exceptions to this in the
[programming](05-programming.qmd) section.] must be in a variable. This
includes the System Variables of `_n` and `_N`. You can imagine that every data
st you ever open has two additional columns of data, one for `_n` and one for
`_N`.
`_n` represents the row number, currently. "Currently" means if the data is
re-sorted, `_n` can change.
`_N` represents the total number of rows in the data, hence this is the same for
every row. Again, if the data changes (e.g. you
[drop](04-data-manipulation.qmd#discarding-data) some data) then `_N` may be
updated.
While you cannot access these System Variables normally (e.g. they don't appear
in the Data Browser), you can use them in generating variables or conditional
statements. For example, we've seen that `list` can use `in` to restrict the
rows it outputs, and we've seen that it can use `if` to choose conditionally. We
can combine these:
```stata
<<dd_do>>
list make in 1/2
list make if _n <= 2
<</dd_do>>
```
A more useful example is to save the initial row numbering in your data. When we
discuss [sorting](04-data-manipulation.qmd#sorting) later, it may be useful to
be able to return to the original ordering. Since `_n` changes when the data is
re-sorted, if we save the initial row numbers to a permanent variable, we can
always re-sort by it later. `_N` is slightly less useful but can be used
similarly.
```stata
<<dd_do>>
generate row = _n
generate totalobs = _N
list row totalobs in 1/5
<</dd_do>>
```
### Extensions to generate
The command `egen`^[`egen` is **not** a [short
command](01-the-basics-of-stata.qmd#short-commands) for "egenerate"; the full
command name is simply "`egen`".] offers some functionality that `generate`
lacks, for example creating the mean of several variables
```stata
egen <newvar> = rowmean(var1, var2, var3)
```
The functions which `egen` support are fairly random; you can see the full list
in the help:
```stata
help egen
```
## Replacing existing variables
[Earlier](04-data-manipulation.qmd#generating-new-variables) we created the
`weight2` variable which changed the units on weight from pounds to tons. What
if, instead of creating a new variable, we tried to just change the existing
`weight` variable.
```stata
<<dd_do>>
generate weight = weight/2000
<</dd_do>>
```
Here Stata refuses to proceed since `weight` is already defined. To overwrite
`weight`, we'll instead need to use the `replace` command.
```stata
<<dd_do>>
replace weight = weight/2000
list weight in 1/5
<</dd_do>>
```
`replace` features syntax identical to `generate`.^[`generate` has a few
features we do not discuss which `replace` does not support. Namely, `generate`
can set the [type](03-data-management.qmd#describing-the-data) manually
(instead of letting Stata choose the best type
[automatically](03-data-management.qmd#compressing-data)), and `generate` can
place the new variable as desired rather than [using
`order`](03-data-management.qmd#changing-variable-ordering). Clearly, neither
of these features are needed for `replace`.]
### Conditional variable generation
(We're going to reload the `auto` data set at this point to ensure all data is
as originally saved.)
```stata
<<dd_do>>
sysuse auto, clear
<</dd_do>>
```
One frequent task is recoding variables. This can be "binning" continuous
variables into a few categories, re-ordering an ordinal variables, or collapsing
categories in an already-categorical variable. There are also multi-variable
versions; e.g. combining multiple variables into one.
The general workflow with these cases will be to optionally use `generate` to
create the new variable, then use `replace` to conditional replace the original
or new variable.
As an example, let's generate a new variable which categorizes cars into light,
medium weight, and heavy cars. We'll define light cars as a weight below 1 ton
(2000 lbs), and heavy cars as having a weight of 2 tons (4000 lbs) or more.
Before we do this, we've learned that the weight reported for the Pont. Grand
Prix was incorrect - we don't know what the correct weight is, but we know the
presented one is wrong, so let's make it missing. We could of course do this
manually - open the data editor and delete the value of `weight` corresponding
to the Pont. Grand Prix. As we saw earlier, [manually
editing](02-working-with-data-sets.qmd#editing-data-manually) the data like
this produces a `replace` call that we can move into our [Do file for
reproducibility](01-the-basics-of-stata.qmd#do-files). However, this `replace`
call would refer to a row number, something like
```stata
replace weight = . in 49
```
What would happen if our data was shuffled prior to running this command? It
would no longer be applied to the correct row. Therefore, it will be safer to
use a [conditional
statement](04-data-manipulation.qmd#restricting-commands-to-subsets) to
identify the row corresponding to `"Pont. Grand Prix"`.
```stata
<<dd_do>>
replace weight = . if make == "Pont. Grand Prix"
list make weight if make == "Pont. Grand Prix"
<</dd_do>>
```
Now, we'll return to generating the categorical weight variable. First, we'll
generate the new variable to store this information.
```stata
<<dd_do>>
generate weight_cat = 1
tab weight_cat
<</dd_do>>
```
Without any conditional statements, every observation's `weight_cat` is set
to 1. We'll let the 1 represent the "light" category, so next we'll replace it
with 2 for cars in the "medium" category.
```stata
<<dd_do>>
replace weight_cat = 2 if weight >= 2000 & weight < 4000
tab weight_cat
<</dd_do>>
```
Note the choice of `>=` instead of `>` and `<` instead of `<=`. As above, we
stated that light cars have weight **below** 2000 lbs, so medium cars have a
value of 2000 **or more** (greater than **or equal**). On the other end, heavy
cars have a weght of 4000 lbs **or more**, so medium cars are **strictly** less
than 4000 lbs (less than).
Finish with the "heavy" cars
```stata
<<dd_do>>
replace weight_cat = 3 if weight >= 4000
tab weight_cat
<</dd_do>>
```
When using less than/greater than conditinal statements to split a variable into
groups, you always want to ensure that when the two "endpoints" are the same,
one uses strictly less/more, and the other uses "or equal". If both use "or
equal", you'll get inconsistent results for exact values. If neither use "or
equal", exact values will not be classified. (For example, if we had used
`weight < 4000` and `weight > 4000`, any car with exact weight of 4000 would not
fall into either [and its `weight_cat` would stay 1, a light car]. On the other
hand, if we had used `weight <= 4000` and `weight >= 4000`, a car with exact
weight of 4000 would be assigned to whichever of the lines was run last.)
Lastly, we'll add some nice labels.
```stata
<<dd_do>>
label define weight_cat 1 "Light" 2 "Medium" 3 "Heavy"
label values weight_cat weight_cat
tab weight_cat
<</dd_do>>
```
There's one additional complication. Stata represents missing values by `.`, and
`.` has a value of positive infinity. That means that
$$
400 \lt .
$$
is true! There is some discussion [on the Stata
FAQs](http://www.stata.com/support/faqs/data-management/logical-expressions-and-missing-values/)
that goes into the rationale behind it, but the short version is that this
slightly complicates variable generation but greatly simplifies and protects
some data management tasks.
The complication referred to can be seen in the row corresponding to the Pont.
Grand Prix
```stata
<<dd_do>>
list make weight weight_cat in 46/50
<</dd_do>>
```
Even though the Grand Prix has no weight, it is categorized as "Heavy"
```stata
<<dd_do>>
replace weight_cat = . if missing(weight)
tab weight_cat, missing
<</dd_do>>
```
The `missing()` function returns true for each row with a missing value, and
false for each row with an observed value, for the variable inside the
parantheses (in this case, `weight`).
You may occasionally see `if weight != .` or `if weight <= .` instead of the
`missing()` function. Recall that missing values are sorted to be larger than
the largest observed value, so this works just as well as `missing()`. However,
Stata allows you to define "reasons" for missing, specifically `.a`, `.b`, all
the way through `.z`. These are sorted such that `.` < `.a` < `.b` < ... < `.z`.
For this reason, `!= .` is not suggested, as while `.` will be captured as
missing, `.a`, etc will not be. Using `missing()` removes the temptation to
write `!=` instead of `<=`.
The `missing()` function can be proceeded with an exclamation point to indicate
not missing. For example
```stata
replace x = 2 if !missing(y)
```
The `missing` option to `tab` forces it to show a row for any missing values.
Without it, missing rows are suppressed.
To summarize, we used the following commands:
```stata
generate weight_cat = 1
replace weight_cat = 2 if weight >= 2000 & weight < 4000
replace weight_cat = 3 if weight >= 4000
replace weight_cat = . if missing(weight)
```
There are various other ways it could have been done, such as
```stata
generate weight_cat = 1 if weight < 2000
replace weight_cat = 2 if weight >= 2000 & weight < 4000
replace weight_cat = 3 if weight >= 4000 & !missing(weight)
```
```stata
generate weight_cat = .
replace weight_cat = 1 if weight < 2000
replace weight_cat = 2 if weight >= 2000 & weight <= 4000
replace weight_cat = 3 if weight > 4000 & !missing(weight)
```
Of course, we could also generate it in the reverse order (3 to 1) or even mixed
up (3, 1, 2). There are also alternate ways to write the various conditionals,
such as replacing `weight > 4000` with `weight >= 4001`. There are usually
multiple correct ways to specify any conditional.
## More complicated replaces
The above example for `replace` was fairly simplistic, but you can imagine the
need for a much more complicated replacing structure (perhaps based on the value
of multiple variables). If, however, you do have something this simple, the
`recode` command could be used instead.
The `recode` command syntax is fairly simple,
```stata
recode <oldvar> (<rule 1>) (<rule 2>) ...., generate(<newvar>)
```
The different rules define the recoding to take place. For example, the above
creation of `weight_cat` can be written as
```stata
recode weight (1/1999 = 1) (2000/4000 = 2) (4001/99999999 = 3) ///
(missing = .), generate(weight_cat)
```
Each rule has the form of `old value(s) = new value`, where the old values can
be any of:
- A single number, e.g. `(5 = 2)`.
- several numbers, either
- a [numlist](02-working-with-data-sets.qmd#loading-subsets-of-the-data) as
in this example (note the use of a very large non-missing value for the
upper bound)
- a space-separated list of values, e.g. `(1 5 10 = 4)`
- Mixture of the those two, e.g. `(6 10 12/25 31 = 17)`
- the phrases `missing`, `nonmissing` or `else` to capture anything not
elsewhere defined.
The new value must be a single number or a missing value (`.` or `.a`, etc).
`else` cannot be used if `missing` or `nonmissing` is defined (and vice-versa),
and all of those must be the last rules defined. E.g.,
```stata
recode x (missing = 5) (2 = 4) (else = 3) (1 = 2), generate(y)
```
will not run because "missing" and "else" are both simultaneously defined, and
because the `1 = 2` rule is last instead of `else` or `missing`.
Note that if you see older code you may see either the parantheses or the
`generate` option excluded. You should include both of these.
Finally, the rules are executed left-to-right. So if you have two rules
referring to the same values, the first one is used, and the second is not. For
example,
```stata
recode x (1/5 = 7) (2 = 4), generate(y)
```
The `2 = 4` rule will never take place because 2 is already recoded to 7 in the
`1/5 = 7` rule.
## Subsetting
Almost any Stata command which operates on variables can operate on a subset of
the data instead of the entire data, [as we saw
before](04-data-manipulation.qmd#restricting-commands-to-subsets), by using the
`if` or `in` statements in the command. This is equivalent to throwing away some
data and then performing the command. In general, you should avoid discarding
data as you never know when you will possible use it. Of course, you could use
[`preserve` and
`restore`](02-working-with-data-sets.qmd#temporarily-preserving-and-restoring-data)
to temporarily remove the data, but using the conditional subsetting is more
straightforward.
If the conditional logic we want to use involves subsets of the data, we could
use this to give us results within each group.
```stata
<<dd_do>>
summarize price
summarize price if foreign == 0
summarize price if foreign == 1
<</dd_do>>
```
Keep track of the number of observations, `Obs`, to see that the second and
third commands are in fact operating on the subsets. We see here that American
cars are cheaper on average^[Note that this is not a statistical claim, we would
have to do a two-sample t-test to make any statistical claim.].
### Repeat commands on subsets
To look at the average price for American and foreign cars, we ran two
individual commands. If we wanted to look at the summaries by `rep78`, that
would take 6 commands (values 1 through 5 and `.`)!
Instead, we can use `by` and `bysort` to perform the same operation over each
unique value in a variable. For example, we could repeat the above with:
```stata
<<dd_do>>
by foreign: summ price
<</dd_do>>
```
There is a strong assumption here that the data is already sorted by the
variables we are splitting `by` on (e.g. `foreign`). If `foreign` were not
sorted (or if you simply did not want to check/assume it was), you could instead
use
```stata
bysort foreign: summ price
```
`bysort` is identical to sorting (which we'll discuss
[later](04-data-manipulation.qmd#sorting)) first and running the `by` statement
afterwards. In general, it is recommended to always use `bysort` instead of
`by`, *unless* you believe the data is already sorted and want an error if that
assumption is violated.
Before running these commands, consider generating a [original ordering
variable](04-data-manipulation.qmd#system-variables) first.
`bysort`'s variables cannot be conditional statements, so if you wanted to for
example get summaries by low and high mileage cars, you'd need to generate a
dummy variable first.
```stata
<<dd_do>>
gen highmileage = mpg > 20
bysort highmileage: summ price
<</dd_do>>
```
`bysort` can take two or more variables, and performs its commands within each
unique combination of the variable. For example,
```stata
<<dd_do>>
bysort foreign highmileage: summ price
<</dd_do>>
```
When specifying `bysort`, you can optionally specify a variable to sort on but
not to group by. To see this in action, let's switch to the "nlswork" data set.
```stata
<<dd_do>>
webuse nlswork, clear
list idcode year age ln_wage in 1/5
list idcode year age ln_wage in 11/15
<</dd_do>>
```
The file contains information from the National Longitudinal Survey of Young
Women. Each woman enters the data at a different age. First, let's obtain a
version of the data that contains only the first row for each woman.
```stata
<<dd_do>>
bysort idcode (year): generate rownum = _n
list idcode year rownum if inlist(_n, 1, 2, 12, 13, 14)
<</dd_do>>
```
(I use `if inlist(` instead of `in` to be able to view a non-continuous set of
rows; `inlist` evaluates to "True" only if the first entry (`_n` in this case)
is equal to any of the following entires (1, 2, 12, 13, or 14).)
Since we are essentially splitting the whole dataset into separate smaller ones
based upon `idcode`, the row numbers in each of those smaller data sets starts
at 1 again. Sorting by `year` (via `(year)`) ensures we get a proper ordering.
We could then do something like this
```stata
<<dd_do>>
generate firstyear = rownum == 1
list idcode year firstyear in 1/5
<</dd_do>>
```
or even
```stata
<<dd_do>>
keep if rownum == 1
list idcode year in 1/5
<</dd_do>>
```
Loading the full data back up, let's see a more advanced example. Let's create a
variable that contains the age of the woman at entrance to the data.
```stata
<<dd_do>>
webuse nlswork, clear
bysort idcode (year): gen age_at_first_year = age[1]
list idcode year age age_at_first_year if inlist(_n, 1, 2, 12, 13, 14)
<</dd_do>>
```
The square bracket notation (`[1]`) refers to row numbers; this extracts the
first row (per `idcode`) from `age`. We can also use `_n` inside the square
brackets to refer to following or preceding rows:
```stata
<<dd_do>>
bysort idcode (year): gen wave_prev_yr = ln_wage[_n-1]
list idcode year ln_wage wave_prev_yr if inlist(_n, 1, 2, 12, 13, 14)
<</dd_do>>
```
Note that both these examples would run without complaint if we excluded by
`bysort`, but then `[1]` and `[_n-1]` would correspond to the whole data - for
`age_at_first_year`, the `1` would refer to the very first row for the whole
data. For `wave_prev_yr` it'd be more insidious - specifically, as written
above, the first row per `idcode` has no value of `wave_prev_yr`, whereas
excluding `bysort` would assign the previous `idcode`'s *final* value of
`ln_wage` asn the *first* value of `wave_prev_yr` for the next `id_code`:
```stata
<<dd_do>>
webuse nlswork, clear
gen wave_prev_yr = ln_wage[_n-1]
list idcode year ln_wage wave_prev_yr if inlist(_n, 1, 2, 12, 13, 14)
<</dd_do>>
```
A lot of data manipulation operations, when dealing with repeated measures or
otherwise clustered data, can be accomplished easily with `bysort`.
### Discarding data
If you do want to discard data, you can use `keep` or `drop` to do so. They each
can perform on variables:
```stata
keep <var1> <var2> ...
drop <var1> <var2> ...
```
or on observations:
```stata
keep if <conditional>
drop if <conditional>
```
Note that these *cannot* be combined:
```stata
<<dd_do>>
drop turn if mpg > 20
<</dd_do>>
```
`keep` removes all variables except the listed variables, or removes any row
which the conditional does not return true.
`drop` removes any listed variables, or removes any row which the conditional
returns true.
## Dealing with duplicates
If your data is not expected to have duplicates, either across all variables or
within certain variables, the `duplicates` command can make their detection and
correction easier. The most basic command is `duplicates report`, which simply
reports on the status of duplicate rows. Let's use the built-in "bplong" data.
This data contains 120 patients (`patient`) with measures of blood pressure
(`bp`) at two different time points (`when`, a Before and After), and some
descriptive variables (`sex` and `agegrp`).
```stata
<<dd_do>>
sysuse bplong, clear
duplicates report
<</dd_do>>
```
This report is not very interesting; it reports that there are 240 observations
which have 1 copy (e.g. are unique), and hence no surplus. Given that each row
should be unique (just in patient ID and before/after alone), this is not
surprising. Let's instead look at the duplicates just for `bp` and `when`:
```stata
<<dd_do>>
duplicates report bp when
<</dd_do>>
```
Here we have some duplicates. First, there are 23 observations which are fully
unique. All other observations have repeats to some extent.
The second row of the output tells of us that there are 42 observations which
have 2 copies. The language here can be a bit confusing; all it is saying is
that there are 42 rows, each of which has a single duplicate *within that same
42*. So if we have values 1, 1, 2, 2, that would be reported as 4 observations
with 2 surplus.
The number of surplus is the number of non-unique rows in that category. We
could compute it ourselves - we know that there are 42 rows with 2 copies, so
that means that half of the rows are "unique" and the other half are
"duplicates" (which is unique and which is duplicate is not clear). So 42/2 =
21, and we have 21 surplus.
Consider the row for 4 copies. There are 48 rows, each of which belongs to a set
of four duplicates. For example, 1, 1, 1, 1, 2, 2, 2, 2, has observations 8 and
copies 2. In this row, 48/4 = 12, so there are 12 unique values, meaning 36
surplus.
Other useful commands include
- `duplicates list`: Shows every set of duplicates, including its row number and
value. Obviously for something like this the output would be massive as of the
240 total rows, only 23 are not duplicated to some degree!
- `duplicates tag <vars>, gen(<newvar>)`: Adds a new variable which represents
the number of other copies for each row. For unique rows, this will be 0. For
any duplicated rows, it will essentially be "copies" from `duplicates report`
minus 1. This can be useful for examining duplicates or dropping them.
- `duplicates drop`: Be cautious with this, as it drops any row which is a
duplicate of a previous row (in other words keeps the first entry of every set
of duplicates).
## Sorting
We already saw sorting [in the context of
`bysort`](04-data-manipulation.qmd#repeat-commands-on-subsets). We can also
sort as a standalone operation. As before, consider generating a [original
ordering variable](04-data-manipulation.qmd#system-variables) first.
We'll switch back to "auto" first.
```stata
<<dd_do>>
sysuse auto, clear
gen order = _n
<</dd_do>>
```
The `gsort` function takes a list of variables to order by.
```stata
<<dd_do>>
gsort rep78 price
list rep78 price in 1/5
<</dd_do>>
```
Stata first sorts the data by `rep78`, ascending (so the lowest value is in row
1). Then within each set of rows that have a common value of `rep78`, it sorts
by `price`.
You can append "+" or "-" to each variable to change whether it is ascending or
descending. Without a prefix, the variable is sorted ascending.
```stata
<<dd_do>>
gsort +rep78 -price
list rep78 price in 1/5
<</dd_do>>
```
Recall that missing values (`.`) are [larger than any other
values](04-data-manipulation.qmd#conditional-variable-generation). When sorting
with missing values, they follow this rule as well. If you want to treat missing
values as smaller than all other values, you can pass the `mfirst` option to
`gsort`. Note this does *not* make missingness "less than" anywhere else, only
for the purposes of the current sort.
Sorting strings does work and is done alphabetically. All capital letters are
"less than" all lower case letters, and a blank string (`""`) is the "smallest".
For example, if you have the strings `DBC`, `Daa`, `""`^[I'm explcitly writing
the quotations here as otherwise it would just look missing. The quotations
aren't part of the saved string.], `EEE`, the sorted ascending order would be
`""`, `DBC`, `Daa`, `EEE`. The blank is first; the two strings starting with "D"
are before the string `EEE`, and the upper case "B" precedes the lower case "a".
As a side note, there is an additional command, `sort`, which can perform
sorting. It does not allow sorting in descending order, however it does allow
you to sort only a certain number of rows; that is, passing something like `sort
<varname> in 100/200` would sort only rows 100 through 200, leaving the
remaining rows remain *in their exact same position*.
## Working with strings and categorical variables
String variables are commonly used during data collection but are ultimately not
very useful from a statistical point of view. Typically string variables should
be represented as [categorical variables with value
labels](03-data-management.qmd#labeling-values) as we've previously discussed.
Here are some useful commands for operating on strings and categorical
variables.
### Converting between string and numeric
These two commands convert strings and numerics between each other.
```stata
destring <variable>, gen(<newvar>)
tostring <variable>, replace
```
Both commands can take the options `replace` (to replace the existing variable
with the new one) or `gen( )` (to generate a new variable). I would recommend
always using `gen` to double-check that the conversion worked as expected, then
using `drop`, `rename` and `order` to replace the existing variable.
```stata
<<dd_do>>
desc mpg
tostring mpg, gen(mpg2)
desc mpg2
list mpg* in 1/5
<</dd_do>>
```
Now that the new string is correct, we can replace the existing `mpg`.
```stata
<<dd_do>>
drop mpg
rename mpg2 mpg
order mpg, after(price)
<</dd_do>>
```
Let's go the other way around:
```stata
<<dd_do>>
desc mpg
destring mpg, gen(mpg2)
desc mpg2
list mpg* in 1/5
drop mpg
rename mpg2 mpg
order mpg, after(price)
<</dd_do>>
```
And we're back to the original set-up.^[If you are sharp-eyed, you may have
noticed that the original `mpg` was an "int" whereas the final one is a "byte".
If we had called [`compress`](03-data-management.qmd#compressing-data) on the
original data, it would have done that type conversion anyways - so we're ok!]
When using `destring` to convert a string variable (that it storing numeric data
as strings - "13", "14") to a numeric variable, if there are *any* non-numeric
entries, `destring` will fail. For example, lets replace one entry in the `make`
variable with a numeric.
```stata
<<dd_do>>
replace make = "1" in 1
destring make, gen(make2)
<</dd_do>>
```
We must pass the `force` option. With this option, any strings which have
non-numeric variables will be marked as missing.
```stata
<<dd_do>>
destring make, gen(make2) force