-
Notifications
You must be signed in to change notification settings - Fork 0
/
krippendorff.tex
216 lines (190 loc) · 7.74 KB
/
krippendorff.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
\documentclass{article}
\usepackage{amsmath}
\title{What is Krippendorff's Alpha?}
\author{David Honour}
\begin{document}
\maketitle
Krippendorff's alpha is non-parametric a reliability measure applicable to
coding systems that yield one of a finite discreet set of valid responses.
It is often used to investigate the inter rater reliability of scoring systems.
In such an investigation you first need to get a number of samples of the thing
you want to score. Then you need a number of observers to look at each of these
samples so you can compare their agreement. The results of observers for a
given sample is called a unit.
In my research I found articles extolling the virtues of Krippendorff's
alpha\cite{hayes2007answering}, and those explaining how to compute
it\cite{krippendorff2007computing}.
However I found something of a deficit of explanations of what it actually is.
\section*{What is it?}
Krippendorff's alpha looks at the pairs of responses given, and can be seen to
be comparing the disagreement of the pairs observed within each unit to the
disagreement expected by chance based on the responses given to all units.
Define $R$ as the set of responses that an observation may yield.
The responses in a given unit form a multiset $u$.
Units with a single response cannot be compared for reliability, as there is
nothing to compare to, and should be discarded.
All the responses form a multiset whose items are these unit multisets, we
define this multiset of all responses as $U$.
A way of quantifying the disagreement between ratings is required, and in order
to support different classes of data (e.g. ordinal, nominal or interval) we
require a function describing this which we denote $\delta(c, k)$.
This represents the distance between responses in some sense:
\begin{align}
\delta(c, c) &= 0 \label{delta_0} \\
\delta(c, k) &\ge 0 \\
\delta(c, k) &= \delta(k, c)
\end{align}
\newcommand{\sumpairs}{\sum_{c \in R} \sum_{k \in R}}
Now that the problem is expressed in a mathematical form we can define the
average disagreement within a multiset:
\begin{equation}
D(m) = \sumpairs \delta(c, k) \frac{W(m, c, k)}{P(|m|, 2)}
\end{equation}
where $|m|$ is the cardinality (number of elements) of $m$, $P$ is the number
of permutations and the number of ways to make a pair containing $c$ and $k$
is:
\begin{equation}
W(m, c, k) =
\begin{cases}
c \ne k & \nu(m, c) \nu(m, k) \\
c = k & \nu(m, c) (\nu(m, c) - 1)
\end{cases}
\end{equation}
where $\nu(m, c)$ is the multiplicity (number of occurrences) of $c$ in $m$.
The average disagreement can be applied to each unit in turn and averaged
across all units (weighted by the number of responses) to yield a quantity
known as the observed disagreement:
\begin{equation}
D_o = \sum_{u \in U} \frac{|u|}{|V|}D(u)
\end{equation}
where $V$ is a combined multiset containing the responses of all units:
\begin{equation}
V = \sum_{u \in U}u
\end{equation}
the sum is performed with $\uplus$ (the multiset sum).
The average disagreement of all possible response pairs is known as the
expected disagreement and is given by:
\begin{equation}
D_e = D(V)
\end{equation}
Krippendorff's alpha is defined as:
\begin{equation}
\alpha = 1 - \frac{D_o}{D_e}
\end{equation}
\section*{An example}
3 observers looked at 3 pictures of dresses, and asked if it was blue.
Not all observers were asked about all pictures.
In this example the set of possible responses $R = \{\rm{y}, \rm{n}\}$.
We can choose a function for $\delta$ that tells us that they're different:
\begin{equation}
\delta(c, k) = \begin{cases} c = k & 0 \\ c \ne k & 1 \end{cases}
\end{equation}
Collect the results of the observations to yield a multiset of unit multisets:
\begin{equation}
\{ \{\rm{y}, \rm{n}, \rm{n}\}, \{\rm{y}, \rm{n}\}, \{\rm{n}\} \}
\end{equation}
Removing any units with a single response (as they yield no pairs and thus do
not contribute to the comparison):
\begin{equation}
U = \{ \{\rm{y}, \rm{n}, \rm{n}\}, \{\rm{y}, \rm{n}\} \}
\end{equation}
From this we can determine:
\begin{equation}
V = \{ \rm{y}, \rm{n}, \rm{n}, \rm{y}, \rm{n}\}
\end{equation}
Now that we have these we can calculate the expected disagreement:
\begin{equation}
D_e = D(\{ \rm{y}, \rm{n}, \rm{n}, \rm{y}, \rm{n} \}) \\
\end{equation}
the summations run over all possible pairs of values of $R$, which in this case
are (y, y), (y, n), (n, y) and (n, n).
Consider the term for the pair (y, y):
\newcommand{\dforpair}[2]{\delta(\rm{#1}, \rm{#2}) \frac{W(V, \rm{#1}, \rm{#2})}{P(|V|, 2)})}
\begin{equation}
\dforpair{y}{y} = 0
\end{equation}
since $\delta(\rm{y}, \rm{y}) = 0$. It can in fact be seen from equation
\ref{delta_0} that the contributions for pairs where $c = k$ are always 0 for
any valid choice of $\delta$. Thus we know that the contribution from (n, n) is
also 0.
Consider the term corresponding to the pair (y, n):
\begin{align}
\dforpair{y}{n} &= 1\frac{\nu(V, \rm{y}) \nu(V, \rm{n})}{P(5, 2)} \nonumber \\
&= 1\frac{2 \times 3}{20} \nonumber \\
&= \frac{3}{10}
\end{align}
note that we did not use the order of the pair and thus the contribution due to
(n, y) is also $\frac{3}{10}$.
Using these results we determine that in this example:
\begin{equation}
D_e = 0 + \frac{3}{10} + \frac{3}{10} + 0 = \frac{3}{5} = 0.6
\end{equation}
Calculating the observed disagreement:
\begin{align}
D_o &= \left( \frac{3}{5}D(\{ \rm{y}, \rm{n}, \rm{n} \}) + \frac{2}{5}D(\{ \rm{y}, \rm{n} \}) \right) \\
&= \left( \frac{3}{5} \left( 2\frac{1 \times 2}{6} \right) + \frac{2}{5}\left( 2\frac{1 \times 1}{2} \right) \right) \\
&= \frac{4}{5} = 0.8
\end{align}
Using these results we can calculate:
\begin{equation}
\alpha = 1 - \frac{0.8}{0.6} = - \frac{1}{3}
\end{equation}
this tells us that the agreement is worse than would be expected by chance.
\section*{Equivalence to Krippendorff's form}
The forms given by Krippendorff\cite{krippendorff2007computing} do not match
those I have presented here, however they can be seen to be equivalent.
Let's begin with:
\begin{equation}
D(m) = \sumpairs \delta(c, k) \frac{W(m, c, k)}{P(|m|, 2)}
\end{equation}
Taking the formula for the number of permutations of length $r$ that may be
made from a number of items $n$:
\begin{equation}
P(n, r) = \frac{n!}{(n - r)!}
\end{equation}
setting $r = 2$ and simplifying yields:
\begin{align}
P(n, 2) &= \frac{n!}{(n - 2)!} \\
&= \frac{n(n-1)(n-2)!}{(n-2)!} \\
&= n(n-1)
\end{align}
If we note that the terms where $c = k$ are multiplied by 0 and thus do not
contribute (equation \ref{delta_0}) we can also substitute the $c \ne k$ branch
of $W$:
\begin{equation}
D(m) = \sumpairs \delta(c, k) \frac{\nu(m, c) \nu(m, k)}{|m|(|m| - 1)}
\end{equation}
Thus the form of $D_e$ has been recovered:
\begin{equation}
D_e = \frac{1}{|V|(|V| - 1)} \sumpairs \nu(V, c) \nu(V, k) \delta(c, k)
\end{equation}
note that my notation and Krippendorff's differ such that:
\begin{align}
n &= |V| \\
n_c &= \nu(V, c) \\
n_k &= \nu(V, k) \\
{}_{\rm{metric}}\delta^2_{ck} &= \delta(c, k)
\end{align}
To recover the form of $D_o$, begin by substituting, cancelling and reordering
the sums:
\begin{align}
D_o &= \sum_{u \in U} \frac{|u|}{|V|} D(u) \nonumber \\
&= \sum_{u \in U} \frac{|u|}{|V|} \sumpairs \delta(c, k) \frac{\nu(u, c) \nu(u, k)}{|u|(|u| - 1)} \nonumber \\
&= \frac{1}{|V|} \sumpairs \delta(c, k) \sum_{u \in U} \frac{\nu(u, c) \nu(u, k)}{|u| - 1}
\end{align}
then noting that the quantity referred to as $O_{ck}$ in Krippendorff's paper is given by:
\begin{equation}
O_{ck} = \sum_{u \in U} \frac{\nu(u, c) \nu(u, k)}{|u| - 1}
\end{equation}
we are able to recover the original form:
\begin{equation}
D_o = \frac{1}{|V|} \sumpairs O_{ck}
\end{equation}
Should you wish to look further at this and Krippendorff's paper there is one
further piece of notational difference worth knowing:
\begin{equation}
m_u = |u|
\end{equation}
\bibliographystyle{plain}
\bibliography{krippendorff}
\end{document}