-
Notifications
You must be signed in to change notification settings - Fork 4
/
03-data-management.qmd
961 lines (724 loc) · 36.4 KB
/
03-data-management.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
# Data Management
Throughout this chapter, we'll be using the "auto" data set which is distributed
with Stata. It can be loaded via
```stata
. sysuse auto
(1978 automobile data)
```
You can reload it as necessary (if you modify it and want the original) by
re-running this with the `clear` option. Feel free to make frequent use of
[`preserve` and
`restore`](02-working-with-data-sets.qmd#temporarily-preserving-and-restoring-data).
Whenever you first begin working with a data set, you'll want to spend some time
getting familiar with it. You should be able to answer the following questions
(even if some of them are approximations):
- How many observations does your data have?
- How many variables are in your data set?
- What is the unit of analysis? (What does each row represent - a person? a
couple? a classroom?)
- Is there a variable which uniquely identifies each unit of analysis?
- If the data is repeated measures in some form (multiple rows per person,
or data is students across several classrooms), what variable(s) identify
the levels and groups?
- Are there any variables which are strings (non-numeric) that you plan on using
in some statistical analysis (and will need to be converted to numeric)?
- Which variables are continuous (can take on any value in some reasonable
range, such as weight) vs which are categorical (take on a set number of
values where each represents something).
You'll notice that there are no statistical questions here - we're not even
worried about those yet! These are merely logistical.
In this chapter we'll go over various tools and commands you can use to answer
these questions (and more) and overall to gain a familiarity with the logistics
of your data.
## Referring to variables
When we discussed [basic command
syntax](01-the-basics-of-stata.qmd#basic-command-syntax), we said that the
optional list of variables can include any number of variables (for some
commands). Writing out all the variables can get very tedious as the number of
variables increases. Thankfully there are two alternatives.
First, we can use the wild card `*`^[This is the reason why `*` as a comment
does not work in the middle of a line (and we use `//` instead).]. For example,
we could refer to the variables "turn" and "trunk" as `t*`, as both variables
start with "t" and are followed by anything. However, be careful, as this would
also match variables such as `turnips`, `tprice`, `t`, etc, if any such
variables existed. It can also be used in the middle or beginning, e.g.:
- `c*t` would match `cat`, `caught` and `ct`
- `*2` would match `age2`, `gender2` and `salary2012`.
Multiple `*` can also be used, e.g.:
- `*age*` would match `birthage`, `age_last` or `latest_age_given`. (It would
also match `wages`, so be careful with this!)
Alternatively, if the variables we want to include are next to each other in the
data (e.g. in the Variables pane), we can refer to a list of them. Say the
variables `x1` through `x25` are in ordered in the logical fashion. We could
refer to the whole list of them as `x1-x25`. This includes both `x1` and `x25`,
as well as any variable in between them. We will discuss the
[`order`](03-data-management.qmd#changing-variable-ordering) command later to
re-order variables.
Finally, you often don't need to give the entire name of the variable, just
enough characters for Stata to be able to uniquely identify it (similar to
[short names](01-the-basics-of-stata.qmd#short-commands)). We'll see in a
minute more about the `describe` command, but for example,
```stata
. describe headr
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
headroom float %6.1f Headroom (in.)
```
Stata will error if you don't use enough characters:
```stata
. describe t
t ambiguous abbreviation
r(111);
```
Be very careful with this approach. I only recommend it's use when exploring the
data using the Command window; when writing a Do-file, use the full variable
name to prevent errors!
## Describing the data
The first command you should run is `describe`.
```stata
. describe
Contains data from /Applications/Stata/ado/base/a/auto.dta
Observations: 74 1978 automobile data
Variables: 12 13 Apr 2022 17:45
(_dta has notes)
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
make str18 %-18s Make and model
price int %8.0gc Price
mpg int %8.0g Mileage (mpg)
rep78 int %8.0g Repair record 1978
headroom float %6.1f Headroom (in.)
trunk int %8.0g Trunk space (cu. ft.)
weight int %8.0gc Weight (lbs.)
length int %8.0g Length (in.)
turn int %8.0g Turn circle (ft.)
displacement int %8.0g Displacement (cu. in.)
gear_ratio float %6.2f Gear ratio
foreign byte %8.0g origin Car origin
-------------------------------------------------------------------------------
Sorted by: foreign
```
This displays a large amount of information, so let's break in down.
First, the header displays general data set information - the number of
observations (`obs`, the number of rows) and variables (`vars`).
Next, there is a table listing each variable in the data and some information
about them. The "storage type" can be one of `byte`, `int`, `long`, `float`,
`double`; all of which are simply numbers. We'll touch on the differences
between these when we discuss
[`compress`](03-data-management.qmd#compressing-data), but for now they all
represent numeric variables. String variables are represented as `str##` where
the `##` represent the number of characters that the string can be, e.g. `str18`
shows that `make` has a maximum of 18 letters. (This limit is irrelevant, again,
see [`compress`](03-data-management.qmd#compressing-data) for details.)
The "display format" column contains format information about each variable
which only control how the variables are displayed in data view. For the most
part you can ignore these and leave them at the default, though you may need to
work with this if you have date or time information. For further details see
```stata
help formats
```
"value label" and "variable label" are used to display more information when
running analyses using these variables. See the
[label](03-data-management.qmd#labels) section for further details.
Finally, if the data is sorted, `describe` will provide information about the
sorting.
`describe` can also be used on a per variable basis:
```stata
. describe mpg
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
mpg int %8.0g Mileage (mpg)
. describe trunk displacement
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
trunk int %8.0g Trunk space (cu. ft.)
displacement int %8.0g Displacement (cu. in.)
```
If you have a very large number of variables you may wish to suppress the table
of variables entirely:
```stata
. describe, short
Contains data from /Applications/Stata/ado/base/a/auto.dta
Observations: 74 1978 automobile data
Variables: 12 13 Apr 2022 17:45
Sorted by: foreign
```
Alternatively, the `simple` option returns only the names of the variables, in
column-dominant order (meaning you read down the columns not across the rows).
```stata
. describe, simple
make rep78 weight displacement
price headroom length gear_ratio
mpg trunk turn foreign
```
## Compressing data
As mentioned [above](03-data-management.qmd#describing the data), there are
different ways to store a number variable, such as `byte` and `long`. The
various options take more space to save - types which take less space can store
only smaller numbers whereas types that take more space can store larger
numbers. For example, a number stored as a byte can only take on values between
-127 and 100 and only integers (e.g. not 2.5) whereas a number stored as float
can store numbers up to $1.7x10^{38}$ with up to 38 decimal places. Strings
operate similarly; a string variable with 20 characters would store "abc" as 17
blank characters followed by the "abc".
Understanding the above is not that important these days as computer power and
storage has increased to the point where the majority of us will not be reaching
its limits. However, Stata does offer the `compress` command which attempts to
store variables in the smallest possible type. For example, if a variable which
is a float takes on only values 1 through 10, it is replaced by a byte (and
similarly, strings are as long as the longest value).
The `memory` command lets us see the size of our data, particularlly the first
entry of "Data" under "Used" shows that we start with 3,182 bytes or roughly
3Kb.
```stata
. memory
Memory usage
Used Allocated
----------------------------------------------------------------------
Data 3,182 67,108,864
strLs 0 0
----------------------------------------------------------------------
Data & strLs 3,182 67,108,864
----------------------------------------------------------------------
Data & strLs 3,182 67,108,864
Variable names, %fmts, ... 4,370 71,230
Overhead 1,081,344 1,082,136
Stata matrices 0 0
ado-files 22,919 22,919
Stored results 0 0
Mata matrices 1,904 1,904
Mata functions 1,632 1,632
set maxvar usage 4,636,521 4,636,521
Other 14,828 14,828
----------------------------------------------------------------------
Total 5,752,072 72,940,034
. compress
variable mpg was int now byte
variable rep78 was int now byte
variable trunk was int now byte
variable turn was int now byte
variable make was str18 now str17
(370 bytes saved)
. memory
Memory usage
Used Allocated
----------------------------------------------------------------------
Data 2,812 67,108,864
strLs 0 0
----------------------------------------------------------------------
Data & strLs 2,812 67,108,864
----------------------------------------------------------------------
Data & strLs 2,812 67,108,864
Variable names, %fmts, ... 4,370 71,230
Overhead 1,081,344 1,082,136
Stata matrices 0 0
ado-files 22,919 22,919
Stored results 0 0
Mata matrices 1,904 1,904
Mata functions 1,632 1,632
set maxvar usage 4,636,521 4,636,521
Other 14,828 14,828
----------------------------------------------------------------------
Total 5,751,702 72,940,034
```
We see here a very modest saving (370 bytes, about 12%), but sometimes you can
see much more significant gains.
(When running Stata, instead of using `memory`, you can look at the "Size" entry
in the [properties pane](01-the-basics-of-stata.qmd#the-stata-environment).)
Don't be afraid of artificially restricting yourself going forward; if one of
your values exceeds the limitations its type supports, Stata will automatically
change types. So don't hesitate to run `compress` when loading new data or after
some manipulations.
## Exercise 2
1. "census9" is accesible via
[`webuse`](02-working-with-data-sets.qmd#stata-website-data). Load it.
2. Spend a minute looking at the data. What does this data seem to represent?
What variables do we have?
([`describe`](03-data-management.qmd#describing-the-data) will come in handy
here!)
3. Are there any missing states?
4. What variables (if any) are numeric and what variables (if any) are strings?
5. [Compress](03-data-management.qmd#compressing-data) the data. How much space
is saved? Why do you think this is?
## Labels
A raw data set is very sparse on context. In addition to the data itself, it
will have at most a variable name, which in Stata cannot include spaces and is
limited to 32 characters. All other context associated with the data must either
be documented in a data dictionary or exist in the recollection of the analyst.
In an Excel file, to get around this, you might add additional content to the
sheet outside of the raw data - a note here, a subtitle there, etc. However,
Stata does not allow such arbitrary storage. In contrast, Stata allows you to
directly **label** parts of the data with context information which will be
displayed in the appropriate Results, to make Stata output much easier to read
as well as removing the need for an external data dictionary.
### Labeling variables
Variables names, as mentioned, are limited to 32 characters and do not allow
spaces (or several other special characters). This is to encourage you to choose
short, simple, and memorable variable names, since you'll likely be typing them
a lot!
We can easily add more information with a variable label. If you look at the
`describe` output, you'll notice that the auto data set already has variable
labels applied to it.
```stata
. describe
Contains data from /Applications/Stata/ado/base/a/auto.dta
Observations: 74 1978 automobile data
Variables: 12 13 Apr 2022 17:45
(_dta has notes)
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
make str17 %-17s Make and model
price int %8.0gc Price
mpg byte %8.0g Mileage (mpg)
rep78 byte %8.0g Repair record 1978
headroom float %6.1f Headroom (in.)
trunk byte %8.0g Trunk space (cu. ft.)
weight int %8.0gc Weight (lbs.)
length int %8.0g Length (in.)
turn byte %8.0g Turn circle (ft.)
displacement int %8.0g Displacement (cu. in.)
gear_ratio float %6.2f Gear ratio
foreign byte %8.0g origin Car origin
-------------------------------------------------------------------------------
Sorted by: foreign
Note: Dataset has changed since last saved.
```
We can see variable `rep78` (an utterly incomprehensible name at first glance,
as opposed to `mpg`) has the label "Repair Record 1978". You can apply your own
variable labels (or overwrite existing by using the command:
```stata
label variable <variable name> "Variable label"
```
For example:
```stata
. label variable turn "Some new label for turn"
. describe turn
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
turn byte %8.0g Some new label for turn
```
To remove a variable label, you can call `label variable <varname>` without a
new label to remove the existing one. (Equivalent to `label variable <varname>
""`, so passing an empty variable label.)
```stata
. label variable turn
. describe turn
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
turn byte %8.0g
```
### Labeling values
It is tempting to store categorical variables as strings. If you ask, for
example, for someone's state of residence, you might store the answers as "MI",
"OH", "FL", etc. However, Stata (like most statistical software) cannot handle
string variables.^[In the few situations where it can, it doesn't handle them
cleanly.] A much better way to store this data would be to assign each state a
numerical value, say MI = 1, OH = 2, FL = 3, etc, then keep a data dictionary
linking the values to the labels they represent.
Stata allows you to store this value labels information within the data set,
such that whenever the values are output, the labels are printed instead. Let's
take a look at the `foreign` variable. This variable takes on two levels, and we
can tabulate it to see how many cars are in each category.
```stata
. tabulate foreign
Car origin | Freq. Percent Cum.
------------+-----------------------------------
Domestic | 52 70.27 70.27
Foreign | 22 29.73 100.00
------------+-----------------------------------
Total | 74 100.00
```
Here it appears that `foreign` is stored as two strings, but we know from
`describe` that it is not:
```stata
. describe foreign
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
foreign byte %8.0g origin Car origin
```
Additionally, if you look at the data through Data Editor or Data Browser, you
see that instead of `foreign` being red (as a string) it is blue, as we
discussed [earlier](02-working-with-data-sets.qmd#colors-as-variable-type).
Let's look at that table ignoring the value labels:
```stata
. tabulate foreign, nolabel
Car origin | Freq. Percent Cum.
------------+-----------------------------------
0 | 52 70.27 70.27
1 | 22 29.73 100.00
------------+-----------------------------------
Total | 74 100.00
```
Now we see the values which are actually stored.
Look at the `describe` output:
```stata
. describe foreign
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
foreign byte %8.0g origin Car origin
```
You'll notice that the "value label" column has `origin` attached to `foreign`.
In Stata, the value labels information are *stored separately* from the
variables. They are two separate components - there is a variable and there is a
value label. You connect the value label to a variable to use it.
The benefit of this separated structure is that if you have several variables
which have the same coding scheme (e.g. a series of Likert scale questions,
where 1-5 represents "Strongly Disagree"-"Strongly Agree"), you can create a
single value label and rapidly apply it to all variables necessary.
Correspondingly, this process requires two commands. First we define the value
labels:
```stata
label define <label name> <value> "<label>" <value> "<label>" .....
```
For example, if we wanted to recreate the value label associated with `foreign`:
```stata
. label define foreign_label 0 "Domestic" 1 "Foreign"
```
Value labels exist in the data set, regardless of whether they are attached to
any variables, we can see all value labels:
```stata
. label list
foreign_label:
0 Domestic
1 Foreign
origin:
0 Domestic
1 Foreign
```
Here we see the original `origin` as well as our new `foreign_label`. To attach
it to the variable `foreign`:
```stata
. label values foreign foreign_label
```
If we wanted to attach a single value label to multiple variables, we could
simply provide a list of variables:
```stata
label values <var1> <var2> <var3> <label>
```
To remove the value labels from a variable, use the `label values <variable>`
command with no label name following the variable name:
```stata
. label values foreign
. describe foreign
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
foreign byte %8.0g Car origin
```
You can view all value labels in the data set:
```stata
. label list
foreign_label:
0 Domestic
1 Foreign
origin:
0 Domestic
1 Foreign
```
Note that value labels exist within a data set regardless of whether they are
attached to a variable. If there is a label value that you no longer want to
keep in the data-set, you can drop it:
```stata
. label drop foreign_label
. label list
origin:
0 Domestic
1 Foreign
```
This will *not* remove the value labels from any variables, but they will no
longer be active (i.e. if you run `describe` it will still show that the value
labels are attached, but running `tabulate` will not use them). So in order to
completely remove a value label, you'll need to both remove it from the variable
as well as the data.
**Do not forget** that modifying value labels counts as modifying the data. Make
sure you `save, replace` after making these modifications (if you want to keep
them) or they'll be lost!
## Managing variables
In Stata, managing the names and order of variables is important to make
entering commands easier due to the [shortcuts for referring to
variables](03-data-management.qmd#referring-to-variables). Recall that
variables can be referred to using wildcards (e.g. `a*` to include `age`,
`address` or `a10`, or using `var1-var10` to include all variables between
`var1` and `var10` as they are ordered). Of course, you may also want to rename
or re-order variables for other reasons such as making the data easier to look
at.
### Renaming variables
To rename variables:
```stata
rename <oldname> <newname>
```
For example:
```stata
. rename rep78 repair_record_1978
. describe, simple
make repair_~1978 weight displacement
price headroom length gear_ratio
mpg trunk turn foreign
```
The output truncated the name because it was so long.
Variable names are unique; if you wanted to swap to variable names, you'd have
to name one to a temporary name, rename the second, then rename the first again:
```stata
rename a tmp
rename b a
rename tmp b
```
You can use wildcards in the renaming too. For example, imagine you had a
collection of variables from a longitudinal data set, "visit1_age",
"visit1_weight", "visit1_height", etc. To simplify variable names, we'd prefer
to use "age1", "weight1", etc.
```stata
rename visit1_* *1
```
Finally, you can change a variable name to upper/lower/proper case easily by
passing `upper`/`lower`/`proper` as an argument and giving no new variable name.
```stata
. rename length, upper
. describe, simple
make repair_~1978 weight displacement
price headroom LENGTH gear_ratio
mpg trunk turn foreign
```
You can also do this for the whole data set with the special variable list
`_all`:
```stata
. rename _all, upper
. describe, simple
MAKE REPAIR_~1978 WEIGHT DISPLACEMENT
PRICE HEADROOM LENGTH GEAR_RATIO
MPG TRUNK TURN FOREIGN
. rename _all, lower
```
### Changing variable ordering
The `order` command takes a list of variables and moves them to the
front/end/before a certain variable/after a certain variable. The options
`first`, `last`, `before(<variable>)` and `after(<variable>)` control this.
```stata
. order foreign // The default is `first`
. describe, simple
foreign mpg trunk turn
make repair_~1978 weight displacement
price headroom length gear_ratio
. order weight, last
. describe, simple
foreign mpg trunk displacement
make repair_~1978 length gear_ratio
price headroom turn weight
. order mpg trunk, before(displacement)
. describe, simple
foreign repair_~1978 turn displacement
make headroom mpg gear_ratio
price length trunk weight
```
## Exercise 3
If you've changed to a different data set, load "census9" back up with `webuse`.
1. Going forward, we'll be using a version of "census9" with changes we're
making. [Save](02-working-with-data-sets.qmd#saving-data) a copy of the data
to somewhere convenient (such as your Desktop). Don't forget to give it a
name!
2. The `drate` variable is a bit vague - the name of the variable provides no
clue that "d" = "death", and the values (e.g. 75) are ambiguous.
1. [Rename](03-data-management.qmd#renaming-variables) `drate` to
`deathrate`.
2. The rate is actually per 10,000 individuals.
[Label](03-data-management.qmd#labeling-variables) `dearthrate` to
include this information.
3. The variable `region` has a value label associated with it ("cenreg"). It has
some inconsistent choices, namely "NE" and "N Cntrl". Fix this.
1. [Create a new value label](03-data-management.qmd#labeling-values) which
uses "Northeast" and "North Central" instead of "NE" and "N Cntrl".
2. Attach this new value label to `region`.
3. Remove the existing value label "cenreg".
4. Use `label list` and `tabulate` to confirm it worked.
4. Save the data, replacing the version you created in step 1.
## Summarizing the data
While these notes will not cover anything statistical, it can be handy from a
data management perspective to look at some summary statistics, mostly to
identify problematic variables. Among other things, we can try and detect
- Unexpectedly missing data
- Incorrectly coded data
- Errors/typos in input data
- Incorrect assumptions about variable values
There are numerous ways to look at summary statistics, from obtaining one-number
summaries to visualizations, but we will focus on two Stata commands,
`summarize` and `codebook`.
`summarize` produces basic summary statistics *for numeric variables*.
```stata
. summarize
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
foreign | 74 .2972973 .4601885 0 1
make | 0
price | 74 6165.257 2949.496 3291 15906
repair_~1978 | 69 3.405797 .9899323 1 5
headroom | 74 2.993243 .8459948 1.5 5
-------------+---------------------------------------------------------
length | 74 187.9324 22.26634 142 233
turn | 74 39.64865 4.399354 31 51
mpg | 74 21.2973 5.785503 12 41
trunk | 74 13.75676 4.277404 5 23
displacement | 74 197.2973 91.83722 79 425
-------------+---------------------------------------------------------
gear_ratio | 74 3.014865 .4562871 2.19 3.89
weight | 74 3019.459 777.1936 1760 4840
```
The table reports the total number of non-missing values (`make` appears to be
entirely missing because it is non-numeric), the mean (the average value), the
standard deviation (a measure of how spread out the data is) and the minimum and
maximum non-missing^[As we discuss
[later](04-data-manipulation.qmd#conditional-variable-generation), in Stata,
missing values (represented by `.` in the data) are considered to be higher than
any other number (so 99999 < .).] values observed.
Here's some suggestions of how to look at these values.
- Make sure the amount of missing data is expected. If the number of
observations is lower than anticipated, is it an issue with the data
collection? Or did the import into Stata cause issues? 5 cars have no
`repair_record_1978`.
- The mean should be a reasonable number, somewhere in the rough middle of the
range of possible values for the variable. If you have age recorded for a
random selection of adults and the mean age is 18, something has gone wrong.
If the mean age is -18, something has gone tragically wrong!
- The standard deviation is hard to interpret precisely, but in a very rough
sense, 2/3rds of the values should lie within 1 standard deviation of the
mean. For example, consider `mpg`. The mean is ~21 and the standard deviation
is ~6, so roughly 2/3rds of the cars have `mpg` between 15 and 27. Does this
seems reasonable? If the standard deviation is very high compared to the mean
(e.g. if `mpg`'s standard deviation was 50) or close to 0, that could indicate
an issue.
- Are the max and min reasonable? If the minimum `LENGTH` was -9, that's
problematic. Maybe -9 is the code for missing values? If the range of `LENGTH`
is 14 to 2500, maybe the units differ? They measured in feet for some cars and
inches for others?
`summarize` can also take a list of variables, e.g.
```stata
. summarize t*
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
turn | 74 39.64865 4.399354 31 51
trunk | 74 13.75676 4.277404 5 23
```
For more detailed information, we can look at the codebook. `codebook` works
similarly to `describe` and `summarize` in the sense that without any additional
arguments, it displays information on every variable; you can pass it a list of
variables to only operate on those. Because `codebook`'s output is quite long,
we only demonstrate the restricted version. Before we do that, we reload the
"auto" data because we messed with it quite a bit earlier!
First, categorical data:
```stata
. sysuse auto, clear
(1978 automobile data)
. codebook rep78
-------------------------------------------------------------------------------
rep78 Repair record 1978
-------------------------------------------------------------------------------
Type: Numeric (int)
Range: [1,5] Units: 1
Unique values: 5 Missing .: 5/74
Tabulation: Freq. Value
2 1
8 2
30 3
18 4
11 5
5 .
```
We see that `rep78` takes on five unique values, as well as having some missing
values (`.`). If the unique values is more than expected, it's something to
investigate.
Next, continuous variables:
```stata
. codebook price
-------------------------------------------------------------------------------
price Price
-------------------------------------------------------------------------------
Type: Numeric (int)
Range: [3291,15906] Units: 1
Unique values: 74 Missing .: 0/74
Mean: 6165.26
Std. dev.: 2949.5
Percentiles: 10% 25% 50% 75% 90%
3895 4195 5006.5 6342 11385
```
We still see the number of unique values reported, but here every observation
has a unique value (74 unique values, 74 rows in the data). There is no missing
data. The percentiles should be checked to see if they're reasonable, if 90% of
the cars had a price under $100, something's not right.
Finally, string variables:
```stata
. codebook make
-------------------------------------------------------------------------------
make Make and model
-------------------------------------------------------------------------------
Type: String (str18), but longest is str17
Unique values: 74 Missing "": 0/74
Examples: "Cad. Deville"
"Dodge Magnum"
"Merc. XR-7"
"Pont. Catalina"
Warning: Variable has embedded blanks.
```
We get less information here, but still useful to check that the data is as
expected. There are no empty strings nor any repeated strings. The warning about
"embedded blanks" is spaces; it's telling us that there are spaces in some of
the cars (e.g. "Dodge Magnum"). The reason it is a warning is that
"Dodge_Magnum" and "Dodge__Magnum" read the same to us, but that extra space
means Stata recognizes these as two different variables.
Two options for `codebook` which come in handy:
```stata
. codebook, compact
Variable Obs Unique Mean Min Max Label
-------------------------------------------------------------------------------
make 74 74 . . . Make and model
price 74 74 6165.257 3291 15906 Price
mpg 74 21 21.2973 12 41 Mileage (mpg)
rep78 69 5 3.405797 1 5 Repair record 1978
headroom 74 8 2.993243 1.5 5 Headroom (in.)
trunk 74 18 13.75676 5 23 Trunk space (cu. ft.)
weight 74 64 3019.459 1760 4840 Weight (lbs.)
length 74 47 187.9324 142 233 Length (in.)
turn 74 18 39.64865 31 51 Turn circle (ft.)
displacement 74 31 197.2973 79 425 Displacement (cu. in.)
gear_ratio 74 36 3.014865 2.19 3.89 Gear ratio
foreign 74 2 .2972973 0 1 Car origin
-------------------------------------------------------------------------------
```
`compact` shows a reduced size version, most useful for the "Unique" column.
(Useful if that's the only thing you're running the codebook for.)
```stata
. codebook, problems
Potential problems in dataset /Applications/Stata/ado/base/a/auto.dta
Potential problem Variables
--------------------------------------------------
str# vars that may be compressed make
string vars with embedded blanks make
--------------------------------------------------
```
This reports potential issues Stata has discovered in the data. In this data,
neither are really concerns. (We can run `compress`, but this isn't a "problem"
so much as a suggestion. We already saw above the concern about "embedded
blanks.") More serious problems that it can detect include:
- Constant columns (all entries being the same value, including all missing).
- Issues with value labels (if you've attached a value label to a variable and
subsequently deleted the value label without detaching it; or if your variable
takes on values unaccounted for in the value label).
- Issues with date variables.
## Exercise 4
Using [`summarize` and `codebook`](03-data-management.qmd#summarizing-the-data)
to explore the "census9" data (use the version you saved
[earlier](03-data-management.qmd#exercise-3)!) and answer the following
questions:
1. Are there any values which seem to be errors?
2. I'd expect each state to have their own unique value of death rate,
population and median age. Is this true? If not, why?
2. Are there any problems with the data?