03-data-management.qmd

# Data Management

Throughout this chapter, we'll be using the "auto" data set which is distributed
with Stata. It can be loaded via

```stata
. sysuse auto
(1978 automobile data)

```

You can reload it as necessary (if you modify it and want the original) by
re-running this with the `clear` option. Feel free to make frequent use of
[`preserve` and
`restore`](02-working-with-data-sets.qmd#temporarily-preserving-and-restoring-data).

Whenever you first begin working with a data set, you'll want to spend some time
getting familiar with it. You should be able to answer the following questions
(even if some of them are approximations):

- How many observations does your data have?
- How many variables are in your data set?
- What is the unit of analysis? (What does each row represent - a person? a
  couple? a classroom?)
    - Is there a variable which uniquely identifies each unit of analysis?
    - If the data is repeated measures in some form (multiple rows per person,
      or data is students across several classrooms), what variable(s) identify
      the levels and groups?
- Are there any variables which are strings (non-numeric) that you plan on using
  in some statistical analysis (and will need to be converted to numeric)?
- Which variables are continuous (can take on any value in some reasonable
  range, such as weight) vs which are categorical (take on a set number of
  values where each represents something).

You'll notice that there are no statistical questions here - we're not even
worried about those yet! These are merely logistical.

In this chapter we'll go over various tools and commands you can use to answer
these questions (and more) and overall to gain a familiarity with the logistics
of your data.

## Referring to variables

When we discussed [basic command
syntax](01-the-basics-of-stata.qmd#basic-command-syntax), we said that the
optional list of variables can include any number of variables (for some
commands). Writing out all the variables can get very tedious as the number of
variables increases. Thankfully there are two alternatives.

First, we can use the wild card `*`^[This is the reason why `*` as a comment
does not work in the middle of a line (and we use `//` instead).]. For example,
we could refer to the variables "turn" and "trunk" as `t*`, as both variables
start with "t" and are followed by anything. However, be careful, as this would
also match variables such as `turnips`, `tprice`, `t`, etc, if any such
variables existed. It can also be used in the middle or beginning, e.g.:

- `c*t` would match `cat`, `caught` and `ct`
- `*2` would match `age2`, `gender2` and `salary2012`.

Multiple `*` can also be used, e.g.:

- `*age*` would match `birthage`, `age_last` or `latest_age_given`. (It would
  also match `wages`, so be careful with this!)

Alternatively, if the variables we want to include are next to each other in the
data (e.g. in the Variables pane), we can refer to a list of them. Say the
variables `x1` through `x25` are in ordered in the logical fashion. We could
refer to the whole list of them as `x1-x25`. This includes both `x1` and `x25`,
as well as any variable in between them. We will discuss the
[`order`](03-data-management.qmd#changing-variable-ordering) command later to
re-order variables.

Finally, you often don't need to give the entire name of the variable, just
enough characters for Stata to be able to uniquely identify it (similar to
[short names](01-the-basics-of-stata.qmd#short-commands)). We'll see in a
minute more about the `describe` command, but for example,

```stata
. describe headr

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
headroom        float   %6.1f                 Headroom (in.)

```

Stata will error if you don't use enough characters:

```stata
. describe t
t ambiguous abbreviation
r(111);

```

Be very careful with this approach. I only recommend it's use when exploring the
data using the Command window; when writing a Do-file, use the full variable
name to prevent errors!

## Describing the data

The first command you should run is `describe`.

```stata
. describe

Contains data from /Applications/Stata/ado/base/a/auto.dta
 Observations:            74                  1978 automobile data
    Variables:            12                  13 Apr 2022 17:45
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
make            str18   %-18s                 Make and model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %8.0g                 Repair record 1978
headroom        float   %6.1f                 Headroom (in.)
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear ratio
foreign         byte    %8.0g      origin     Car origin
-------------------------------------------------------------------------------
Sorted by: foreign

```

This displays a large amount of information, so let's break in down.

First, the header displays general data set information - the number of
observations (`obs`, the number of rows) and variables (`vars`).

Next, there is a table listing each variable in the data and some information
about them. The "storage type" can be one of `byte`, `int`, `long`, `float`,
`double`; all of which are simply numbers. We'll touch on the differences
between these when we discuss
[`compress`](03-data-management.qmd#compressing-data), but for now they all
represent numeric variables. String variables are represented as `str##` where
the `##` represent the number of characters that the string can be, e.g. `str18`
shows that `make` has a maximum of 18 letters. (This limit is irrelevant, again,
see [`compress`](03-data-management.qmd#compressing-data) for details.)

The "display format" column contains format information about each variable
which only control how the variables are displayed in data view. For the most
part you can ignore these and leave them at the default, though you may need to
work with this if you have date or time information. For further details see

```stata
help formats
```

"value label" and "variable label" are used to display more information when
running analyses using these variables. See the
[label](03-data-management.qmd#labels) section for further details.

Finally, if the data is sorted, `describe` will provide information about the
sorting.

`describe` can also be used on a per variable basis:

```stata
. describe mpg

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
mpg             int     %8.0g                 Mileage (mpg)

. describe trunk displacement

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
trunk           int     %8.0g                 Trunk space (cu. ft.)
displacement    int     %8.0g                 Displacement (cu. in.)

```

If you have a very large number of variables you may wish to suppress the table
of variables entirely:

```stata
. describe, short

Contains data from /Applications/Stata/ado/base/a/auto.dta
 Observations:            74                  1978 automobile data
    Variables:            12                  13 Apr 2022 17:45
Sorted by: foreign

```

Alternatively, the `simple` option returns only the names of the variables, in
column-dominant order (meaning you read down the columns not across the rows).

```stata
. describe, simple
make          rep78         weight        displacement
price         headroom      length        gear_ratio
mpg           trunk         turn          foreign

```


## Compressing data

As mentioned [above](03-data-management.qmd#describing the data), there are
different ways to store a number variable, such as `byte` and `long`. The
various options take more space to save - types which take less space can store
only smaller numbers whereas types that take more space can store larger
numbers. For example, a number stored as a byte can only take on values between
-127 and 100 and only integers (e.g. not 2.5) whereas a number stored as float
can store numbers up to $1.7x10^{38}$ with up to 38 decimal places. Strings
operate similarly; a string variable with 20 characters would store "abc" as 17
blank characters followed by the "abc".

Understanding the above is not that important these days as computer power and
storage has increased to the point where the majority of us will not be reaching
its limits. However, Stata does offer the `compress` command which attempts to
store variables in the smallest possible type. For example, if a variable which
is a float takes on only values 1 through 10, it is replaced by a byte (and
similarly, strings are as long as the longest value).

The `memory` command lets us see the size of our data, particularlly the first
entry of "Data" under "Used" shows that we start with 3,182 bytes or roughly
3Kb.

```stata
. memory

Memory usage
                                         Used                Allocated
----------------------------------------------------------------------
Data                                    3,182               67,108,864
strLs                                       0                        0
----------------------------------------------------------------------
Data & strLs                            3,182               67,108,864

----------------------------------------------------------------------
Data & strLs                            3,182               67,108,864
Variable names, %fmts, ...              4,370                   71,230
Overhead                            1,081,344                1,082,136

Stata matrices                              0                        0
ado-files                              22,919                   22,919
Stored results                              0                        0

Mata matrices                           1,904                    1,904
Mata functions                          1,632                    1,632

set maxvar usage                    4,636,521                4,636,521

Other                                  14,828                   14,828
----------------------------------------------------------------------
Total                               5,752,072               72,940,034

. compress
  variable mpg was int now byte
  variable rep78 was int now byte
  variable trunk was int now byte
  variable turn was int now byte
  variable make was str18 now str17
  (370 bytes saved)

. memory

Memory usage
                                         Used                Allocated
----------------------------------------------------------------------
Data                                    2,812               67,108,864
strLs                                       0                        0
----------------------------------------------------------------------
Data & strLs                            2,812               67,108,864

----------------------------------------------------------------------
Data & strLs                            2,812               67,108,864
Variable names, %fmts, ...              4,370                   71,230
Overhead                            1,081,344                1,082,136

Stata matrices                              0                        0
ado-files                              22,919                   22,919
Stored results                              0                        0

Mata matrices                           1,904                    1,904
Mata functions                          1,632                    1,632

set maxvar usage                    4,636,521                4,636,521

Other                                  14,828                   14,828
----------------------------------------------------------------------
Total                               5,751,702               72,940,034

```

We see here a very modest saving (370 bytes, about 12%), but sometimes you can
see much more significant gains.

(When running Stata, instead of using `memory`, you can look at the "Size" entry
in the [properties pane](01-the-basics-of-stata.qmd#the-stata-environment).)

Don't be afraid of artificially restricting yourself going forward; if one of
your values exceeds the limitations its type supports, Stata will automatically
change types. So don't hesitate to run `compress` when loading new data or after
some manipulations.

## Exercise 2

1. "census9" is accesible via
   [`webuse`](02-working-with-data-sets.qmd#stata-website-data). Load it.
2. Spend a minute looking at the data. What does this data seem to represent?
   What variables do we have?
   ([`describe`](03-data-management.qmd#describing-the-data) will come in handy
   here!)
3. Are there any missing states?
4. What variables (if any) are numeric and what variables (if any) are strings?
5. [Compress](03-data-management.qmd#compressing-data) the data. How much space
   is saved? Why do you think this is?


## Labels

A raw data set is very sparse on context. In addition to the data itself, it
will have at most a variable name, which in Stata cannot include spaces and is
limited to 32 characters. All other context associated with the data must either
be documented in a data dictionary or exist in the recollection of the analyst.

In an Excel file, to get around this, you might add additional content to the
sheet outside of the raw data - a note here, a subtitle there, etc. However,
Stata does not allow such arbitrary storage. In contrast, Stata allows you to
directly **label** parts of the data with context information which will be
displayed in the appropriate Results, to make Stata output much easier to read
as well as removing the need for an external data dictionary.

### Labeling variables

Variables names, as mentioned, are limited to 32 characters and do not allow
spaces (or several other special characters). This is to encourage you to choose
short, simple, and memorable variable names, since you'll likely be typing them
a lot!

We can easily add more information with a variable label. If you look at the
`describe` output, you'll notice that the auto data set already has variable
labels applied to it.

```stata
. describe

Contains data from /Applications/Stata/ado/base/a/auto.dta
 Observations:            74                  1978 automobile data
    Variables:            12                  13 Apr 2022 17:45
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
make            str17   %-17s                 Make and model
price           int     %8.0gc                Price
mpg             byte    %8.0g                 Mileage (mpg)
rep78           byte    %8.0g                 Repair record 1978
headroom        float   %6.1f                 Headroom (in.)
trunk           byte    %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            byte    %8.0g                 Turn circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear ratio
foreign         byte    %8.0g      origin     Car origin
-------------------------------------------------------------------------------
Sorted by: foreign
     Note: Dataset has changed since last saved.

```

We can see variable `rep78` (an utterly incomprehensible name at first glance,
as opposed to `mpg`) has the label "Repair Record 1978". You can apply your own
variable labels (or overwrite existing by using the command:

```stata
label variable <variable name> "Variable label"
```

For example:

```stata
. label variable turn "Some new label for turn"

. describe turn

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
turn            byte    %8.0g                 Some new label for turn

```

To remove a variable label, you can call `label variable <varname>` without a
new label to remove the existing one. (Equivalent to `label variable <varname>
""`, so passing an empty variable label.)

```stata
. label variable turn

. describe turn

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
turn            byte    %8.0g                 

```

### Labeling values

It is tempting to store categorical variables as strings. If you ask, for
example, for someone's state of residence, you might store the answers as "MI",
"OH", "FL", etc. However, Stata (like most statistical software) cannot handle
string variables.^[In the few situations where it can, it doesn't handle them
cleanly.] A much better way to store this data would be to assign each state a
numerical value, say MI = 1, OH = 2, FL = 3, etc, then keep a data dictionary
linking the values to the labels they represent.

Stata allows you to store this value labels information within the data set,
such that whenever the values are output, the labels are printed instead. Let's
take a look at the `foreign` variable. This variable takes on two levels, and we
can tabulate it to see how many cars are in each category.

```stata
. tabulate foreign

 Car origin |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

```

Here it appears that `foreign` is stored as two strings, but we know from
`describe` that it is not:

```stata
. describe foreign

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
foreign         byte    %8.0g      origin     Car origin

```

Additionally, if you look at the data through Data Editor or Data Browser, you
see that instead of `foreign` being red (as a string) it is blue, as we
discussed [earlier](02-working-with-data-sets.qmd#colors-as-variable-type).

Let's look at that table ignoring the value labels:

```stata
. tabulate foreign, nolabel

 Car origin |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         52       70.27       70.27
          1 |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

```

Now we see the values which are actually stored.

Look at the `describe` output:

```stata
. describe foreign

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
foreign         byte    %8.0g      origin     Car origin

```

You'll notice that the "value label" column has `origin` attached to `foreign`.
In Stata, the value labels information are *stored separately* from the
variables. They are two separate components - there is a variable and there is a
value label. You connect the value label to a variable to use it.

The benefit of this separated structure is that if you have several variables
which have the same coding scheme (e.g. a series of Likert scale questions,
where 1-5 represents "Strongly Disagree"-"Strongly Agree"), you can create a
single value label and rapidly apply it to all variables necessary.

Correspondingly, this process requires two commands. First we define the value
labels:

```stata
label define <label name> <value> "<label>" <value> "<label>" .....
```

For example, if we wanted to recreate the value label associated with `foreign`:

```stata
. label define foreign_label 0 "Domestic" 1 "Foreign"

```

Value labels exist in the data set, regardless of whether they are attached to
any variables, we can see all value labels:

```stata
. label list
foreign_label:
           0 Domestic
           1 Foreign
origin:
           0 Domestic
           1 Foreign

```

Here we see the original `origin` as well as our new `foreign_label`. To attach
it to the variable `foreign`:

```stata
. label values foreign foreign_label

```

If we wanted to attach a single value label to multiple variables, we could
simply provide a list of variables:

```stata
label values <var1> <var2> <var3> <label>
```

To remove the value labels from a variable, use the `label values <variable>`
command with no label name following the variable name:

```stata
. label values foreign

. describe foreign

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
foreign         byte    %8.0g                 Car origin

```

You can view all value labels in the data set:

```stata
. label list
foreign_label:
           0 Domestic
           1 Foreign
origin:
           0 Domestic
           1 Foreign

```

Note that value labels exist within a data set regardless of whether they are
attached to a variable. If there is a label value that you no longer want to
keep in the data-set, you can drop it:

```stata
. label drop foreign_label

. label list
origin:
           0 Domestic
           1 Foreign

```

This will *not* remove the value labels from any variables, but they will no
longer be active (i.e. if you run `describe` it will still show that the value
labels are attached, but running `tabulate` will not use them). So in order to
completely remove a value label, you'll need to both remove it from the variable
as well as the data.

**Do not forget** that modifying value labels counts as modifying the data. Make
sure you `save, replace` after making these modifications (if you want to keep
them) or they'll be lost!

## Managing variables

In Stata, managing the names and order of variables is important to make
entering commands easier due to the [shortcuts for referring to
variables](03-data-management.qmd#referring-to-variables). Recall that
variables can be referred to using wildcards (e.g. `a*` to include `age`,
`address` or `a10`, or using `var1-var10` to include all variables between
`var1` and `var10` as they are ordered). Of course, you may also want to rename
or re-order variables for other reasons such as making the data easier to look
at.

### Renaming variables

To rename variables:

```stata
rename <oldname> <newname>
```

For example:

```stata
. rename rep78 repair_record_1978

. describe, simple
make          repair_~1978  weight        displacement
price         headroom      length        gear_ratio
mpg           trunk         turn          foreign

```

The output truncated the name because it was so long.

Variable names are unique; if you wanted to swap to variable names, you'd have
to name one to a temporary name, rename the second, then rename the first again:

```stata
rename a tmp
rename b a
rename tmp b
```

You can use wildcards in the renaming too. For example, imagine you had a
collection of variables from a longitudinal data set, "visit1_age",
"visit1_weight", "visit1_height", etc. To simplify variable names, we'd prefer
to use "age1", "weight1", etc.

```stata
rename visit1_* *1
```

Finally, you can change a variable name to upper/lower/proper case easily by
passing `upper`/`lower`/`proper` as an argument and giving no new variable name.

```stata
. rename length, upper

. describe, simple
make          repair_~1978  weight        displacement
price         headroom      LENGTH        gear_ratio
mpg           trunk         turn          foreign

```

You can also do this for the whole data set with the special variable list
`_all`:

```stata
. rename _all, upper

. describe, simple
MAKE          REPAIR_~1978  WEIGHT        DISPLACEMENT
PRICE         HEADROOM      LENGTH        GEAR_RATIO
MPG           TRUNK         TURN          FOREIGN

. rename _all, lower

```

###  Changing variable ordering

The `order` command takes a list of variables and moves them to the
front/end/before a certain variable/after a certain variable. The options
`first`, `last`, `before(<variable>)` and `after(<variable>)` control this.

```stata
. order foreign // The default is `first`

. describe, simple
foreign       mpg           trunk         turn
make          repair_~1978  weight        displacement
price         headroom      length        gear_ratio

. order weight, last

. describe, simple
foreign       mpg           trunk         displacement
make          repair_~1978  length        gear_ratio
price         headroom      turn          weight

. order mpg trunk, before(displacement)

. describe, simple
foreign       repair_~1978  turn          displacement
make          headroom      mpg           gear_ratio
price         length        trunk         weight

```

## Exercise 3

If you've changed to a different data set, load "census9" back up with `webuse`.

1. Going forward, we'll be using a version of "census9" with changes we're
   making. [Save](02-working-with-data-sets.qmd#saving-data) a copy of the data
   to somewhere convenient (such as your Desktop). Don't forget to give it a
   name!
2. The `drate` variable is a bit vague - the name of the variable provides no
   clue that "d" = "death", and the values (e.g. 75) are ambiguous.
    1. [Rename](03-data-management.qmd#renaming-variables) `drate` to
       `deathrate`.
    2. The rate is actually per 10,000 individuals.
       [Label](03-data-management.qmd#labeling-variables) `dearthrate` to
       include this information.
3. The variable `region` has a value label associated with it ("cenreg"). It has
   some inconsistent choices, namely "NE" and "N Cntrl". Fix this.
    1. [Create a new value label](03-data-management.qmd#labeling-values) which
       uses "Northeast" and "North Central" instead of "NE" and "N Cntrl".
    2. Attach this new value label to `region`.
    3. Remove the existing value label "cenreg".
    4. Use `label list` and `tabulate` to confirm it worked.
4. Save the data, replacing the version you created in step 1.

## Summarizing the data

While these notes will not cover anything statistical, it can be handy from a
data management perspective to look at some summary statistics, mostly to
identify problematic variables. Among other things, we can try and detect

- Unexpectedly missing data
- Incorrectly coded data
- Errors/typos in input data
- Incorrect assumptions about variable values

There are numerous ways to look at summary statistics, from obtaining one-number
summaries to visualizations, but we will focus on two Stata commands,
`summarize` and `codebook`.

`summarize` produces basic summary statistics *for numeric variables*.

```stata
. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     foreign |         74    .2972973    .4601885          0          1
        make |          0
       price |         74    6165.257    2949.496       3291      15906
repair_~1978 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5
-------------+---------------------------------------------------------
      length |         74    187.9324    22.26634        142        233
        turn |         74    39.64865    4.399354         31         51
         mpg |         74     21.2973    5.785503         12         41
       trunk |         74    13.75676    4.277404          5         23
displacement |         74    197.2973    91.83722         79        425
-------------+---------------------------------------------------------
  gear_ratio |         74    3.014865    .4562871       2.19       3.89
      weight |         74    3019.459    777.1936       1760       4840

```

The table reports the total number of non-missing values (`make` appears to be
entirely missing because it is non-numeric), the mean (the average value), the
standard deviation (a measure of how spread out the data is) and the minimum and
maximum non-missing^[As we discuss
[later](04-data-manipulation.qmd#conditional-variable-generation), in Stata,
missing values (represented by `.` in the data) are considered to be higher than
any other number (so 99999 < .).] values observed.

Here's some suggestions of how to look at these values.

- Make sure the amount of missing data is expected. If the number of
  observations is lower than anticipated, is it an issue with the data
  collection? Or did the import into Stata cause issues? 5 cars have no
  `repair_record_1978`.
- The mean should be a reasonable number, somewhere in the rough middle of the
  range of possible values for the variable. If you have age recorded for a
  random selection of adults and the mean age is 18, something has gone wrong.
  If the mean age is -18, something has gone tragically wrong!
- The standard deviation is hard to interpret precisely, but in a very rough
  sense, 2/3rds of the values should lie within 1 standard deviation of the
  mean. For example, consider `mpg`. The mean is ~21 and the standard deviation
  is ~6, so roughly 2/3rds of the cars have `mpg` between 15 and 27. Does this
  seems reasonable? If the standard deviation is very high compared to the mean
  (e.g. if `mpg`'s standard deviation was 50) or close to 0, that could indicate
  an issue.
- Are the max and min reasonable? If the minimum `LENGTH` was -9, that's
  problematic. Maybe -9 is the code for missing values? If the range of `LENGTH`
  is 14 to 2500, maybe the units differ? They measured in feet for some cars and
  inches for others?

`summarize` can also take a list of variables, e.g.

```stata
. summarize t*

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        turn |         74    39.64865    4.399354         31         51
       trunk |         74    13.75676    4.277404          5         23

```

For more detailed information, we can look at the codebook. `codebook` works
similarly to `describe` and `summarize` in the sense that without any additional
arguments, it displays information on every variable; you can pass it a list of
variables to only operate on those. Because `codebook`'s output is quite long,
we only demonstrate the restricted version. Before we do that, we reload the
"auto" data because we messed with it quite a bit earlier!

First, categorical data:

```stata
. sysuse auto, clear
(1978 automobile data)

. codebook rep78

-------------------------------------------------------------------------------
rep78                                                        Repair record 1978
-------------------------------------------------------------------------------

                  Type: Numeric (int)

                 Range: [1,5]                         Units: 1
         Unique values: 5                         Missing .: 5/74

            Tabulation: Freq.  Value
                            2  1
                            8  2
                           30  3
                           18  4
                           11  5
                            5  .

```

We see that `rep78` takes on five unique values, as well as having some missing
values (`.`). If the unique values is more than expected, it's something to
investigate.

Next, continuous variables:

```stata
. codebook price

-------------------------------------------------------------------------------
price                                                                     Price
-------------------------------------------------------------------------------

                  Type: Numeric (int)

                 Range: [3291,15906]                  Units: 1
         Unique values: 74                        Missing .: 0/74

                  Mean: 6165.26
             Std. dev.:  2949.5

           Percentiles:     10%       25%       50%       75%       90%
                           3895      4195    5006.5      6342     11385

```

We still see the number of unique values reported, but here every observation
has a unique value (74 unique values, 74 rows in the data). There is no missing
data. The percentiles should be checked to see if they're reasonable, if 90% of
the cars had a price under $100, something's not right.

Finally, string variables:

```stata
. codebook make

-------------------------------------------------------------------------------
make                                                             Make and model
-------------------------------------------------------------------------------

                  Type: String (str18), but longest is str17

         Unique values: 74                        Missing "": 0/74

              Examples: "Cad. Deville"
                        "Dodge Magnum"
                        "Merc. XR-7"
                        "Pont. Catalina"

               Warning: Variable has embedded blanks.

```

We get less information here, but still useful to check that the data is as
expected. There are no empty strings nor any repeated strings. The warning about
"embedded blanks" is spaces; it's telling us that there are spaces in some of
the cars (e.g. "Dodge Magnum"). The reason it is a warning is that
"Dodge_Magnum" and "Dodge__Magnum" read the same to us, but that extra space
means Stata recognizes these as two different variables.

Two options for `codebook` which come in handy:

```stata
. codebook, compact

Variable      Obs Unique      Mean   Min    Max  Label
-------------------------------------------------------------------------------
make           74     74         .     .      .  Make and model
price          74     74  6165.257  3291  15906  Price
mpg            74     21   21.2973    12     41  Mileage (mpg)
rep78          69      5  3.405797     1      5  Repair record 1978
headroom       74      8  2.993243   1.5      5  Headroom (in.)
trunk          74     18  13.75676     5     23  Trunk space (cu. ft.)
weight         74     64  3019.459  1760   4840  Weight (lbs.)
length         74     47  187.9324   142    233  Length (in.)
turn           74     18  39.64865    31     51  Turn circle (ft.)
displacement   74     31  197.2973    79    425  Displacement (cu. in.)
gear_ratio     74     36  3.014865  2.19   3.89  Gear ratio
foreign        74      2  .2972973     0      1  Car origin
-------------------------------------------------------------------------------

```

`compact` shows a reduced size version, most useful for the "Unique" column.
(Useful if that's the only thing you're running the codebook for.)

```stata
. codebook, problems

Potential problems in dataset /Applications/Stata/ado/base/a/auto.dta

               Potential problem   Variables
--------------------------------------------------
str# vars that may be compressed   make
string vars with embedded blanks   make
--------------------------------------------------

```

This reports potential issues Stata has discovered in the data. In this data,
neither are really concerns. (We can run `compress`, but this isn't a "problem"
so much as a suggestion. We already saw above the concern about "embedded
blanks.") More serious problems that it can detect include:

- Constant columns (all entries being the same value, including all missing).
- Issues with value labels (if you've attached a value label to a variable and
  subsequently deleted the value label without detaching it; or if your variable
  takes on values unaccounted for in the value label).
- Issues with date variables.

## Exercise 4

Using [`summarize` and `codebook`](03-data-management.qmd#summarizing-the-data)
to explore the "census9" data (use the version you saved
[earlier](03-data-management.qmd#exercise-3)!) and answer the following
questions:

1. Are there any values which seem to be errors?
2. I'd expect each state to have their own unique value of death rate,
   population and median age. Is this true? If not, why?
2. Are there any problems with the data?