This repository has been archived by the owner on Nov 1, 2022. It is now read-only.
# Bird's Eye View of Motus Data #
This document has been adapted from a version written by John Brzustowski in 2017. It provides an overview
of the fundamentals of how Motus data processing works. It has been updated to incorporate more recent
developments, particularly the addition of new digital tags and base stations manufactured by Cellular
Tracking Technologies (CTT), who took over their development from Cornell University.
## What data look like ##
Here's a segment of data from a receiver (with a single antenna):
```
Receiver R
Time ->
\==========================================================================\
Tag A: / 1-----1--1----1-----1-----1 4---4-----4--4-------4--4-/ <- antenna #1
\. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . \
Tag B: / 3-----3--3--3--3--3-------3----3--3 / <- antenna #1
\. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . \
Tag C: / 2--2---2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2 / <- antenna #1
\. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . \
Tag C: / 5--------5--5--5--5--5--5-----5--------5--------5 / <- antenna #2
\==========================================================================\
```
- time increases from left (earlier) to right (later)
- horizontal *lanes*, separated by `. . .`, correspond to individual tags, labelled at left
- **hits** (detections) of a tag are plotted as single digits within its lane
- in the diagram above, runs are labelled by digits. So all hits of tag
A above are either in run `1` or in run `4`.
- the same tag can be detected simultaneously on multiple antennas, and therefore be part
of runs that overlap in time. For example, tag C is detected on antennas 1 and 2.
- In principle, the same tag should never be part of overlapping runs on the SAME antenna,
but in practice this can happen in noisy environments with Lotek tags, and it should be assumed
that these runs likely come from sources other than actual tags (false positives).
- Overlapping runs do not occur with CTT tags because of the way runs are built, but the rate of hits
within a run would still be expected to be around 1 per second or lower.
- for Lotek coded ID tags, individual hits are not sufficient to determine which tag is being detected,
because multiple tags transmit the same ID. However, the period (spacing between consecutive
transmissions) is precise, and differs among tags sharing the same ID.
- the fundamental unit of detection for Lotek coded ID tags is the **run**, or sequence of
hits of a given ID, with spacing **compatible** with the value for that tag. i.e. two hits
might differ by the tag's period, or by twice its period, or three times its period, ...,
up to a limit required by drift between clocks on the transmit and receive sides.
- for Lotek ID tags, a run can have gaps (missing hits) up to a certain size. Beyond that size,
measurement error and clock drift are large enough that we can't be sure that the next detection of
the same ID code is after a compatible time interval. So for tag A above, we're not sure the gap
between the last detection in run 1 and the first detection in run 4 is really compatible with tag A,
so run 1 is ended and a new run is begun. (We could link runs 1 and 4 post-hoc, once we saw that run 4 was being built with gaps
compatible with tag A, but at present, the run-building algorithm doesn't backtrack.)
- for CTT tags, there are 2^32 possible codes, enough for every tag to have a unique ID. Unlike Lotek
tags, runs are not limited to consecutive detections matching a precise period, in part because the period may
vary depending on the energy available from the photovoltaic cell. Runs are still assembled
from consecutive detections on a single antenna (or node, if applicable), as long as they
are not separated by more than an arbitrary interval (600 seconds).
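The two run-building rules described in the list above can be sketched as follows. This is only an illustration, not the Motus tag finder itself: the function names, the `max_periods`, `slop` and `drift_per_period` values, and the `(tag_id, antenna, ts)` tuple layout are all assumptions made for the example. Only the 600-second limit for CTT runs comes from the text.

```python
def gap_is_compatible(gap, period, max_periods=20, slop=0.004, drift_per_period=0.0005):
    """Return True if `gap` (seconds) is close to a whole number of `period`s.

    max_periods, slop and drift_per_period are illustrative values only.
    """
    k = round(gap / period)          # nearest whole number of periods
    if k < 1 or k > max_periods:     # too short, or too long to trust
        return False
    # the allowed error grows with the gap, to account for clock drift
    return abs(gap - k * period) <= slop + drift_per_period * k


def build_ctt_runs(hits, max_gap=600.0):
    """Group (tag_id, antenna, ts) hits into runs, per tag and antenna."""
    runs = []      # list of ((tag_id, antenna), [timestamps])
    latest = {}    # most recent run for each (tag_id, antenna)
    for tag_id, antenna, ts in sorted(hits, key=lambda h: h[2]):
        key = (tag_id, antenna)
        run = latest.get(key)
        if run is not None and ts - run[-1] <= max_gap:
            run.append(ts)           # close enough: same run continues
        else:
            run = [ts]               # first hit, or gap too long: new run
            latest[key] = run
            runs.append((key, run))
    return runs
```

With a 4.5 s period, a gap of 9.0 s is compatible (two periods) while a gap of 9.1 s is not. In `build_ctt_runs`, hits of the same tag on different antennas always land in different runs, so those runs may overlap in time.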
## False positives ##
False positives (apparent detections of tags actually caused by other sources) exist in all technologies
and need to be taken into account. They can happen for a number of reasons, and will sometimes
affect Lotek and CTT tags to different degrees. False positives are difficult to identify, but there are
ways to identify the conditions in which they are more likely to occur, and potential ways to mitigate them.
- Noisy environments: radio noise from interference can create bursts that look like tags. As the number of radio
pulses from other sources in the environment increases, so does the expected number of false positives.
- Bit errors: actual tag signals may be corrupted in transmission. Such errors should rarely produce valid IDs, but
this is more likely with CTT tags, mostly because there is a very large number of possible codes
(about 4 billion). They should still rarely produce the ID of a tag that was actually manufactured. In CTT tags, bit errors
have been found to lead more often to certain patterns (e.g. trailing digits that are all 7s or Fs, such as xxxFFFF,
due to only a partial signal being received), and those patterns have since been excluded from the list of valid tags.
- Aliasing: aliasing happens when signals from multiple tags combine to create the appearance of a tag that is not present
but matches another known tag. This can happen in at least 2 ways: 1) 2 tags with distinct IDs but the same period
(both Lotek and CTT); 2) 2 tags with the same ID but distinct periods (Lotek only). In the first case,
if the bursts of both tags overlap, they may interfere with each other and create the appearance of a new tag for a while.
Assuming that the 2 tags do not have exactly the same period, their bursts should eventually drift apart, and the alias
probably will not persist over a very long run. In the second case, multiple tags with the same ID but distinct
periods may also produce apparent new periods, but these would likely rarely exceed 2 consecutive hits (run
length = 2). Both types of aliasing are mostly problematic in environments where many tags are present simultaneously,
which increases the incidence of overlapping bursts (e.g. colonies).
For both Lotek and CTT tags, their manufacturers have integrated various methods to reduce the incidence of false detections
into their proprietary technologies. Data collected by Sensor Gnomes are processed in Motus by the "Tag finder", which looks at
properties of the signal to exclude likely false positives (e.g. higher deviation in the frequency of pulses within a burst). The
parameters can potentially be adjusted, but an aggressive approach aimed at reducing false positives will also result in an increase
in false negatives.
- Run length: With both tag technologies, the likelihood of false positives should decrease as run length increases.
For Lotek tags, we generally recommend ignoring runs comprising only 2 or 3 hits. Some (many?) are probably
real tag detections, but the vast majority probably are not. For CTT tags, we do not yet have a suggested minimum threshold.
In most cases, given that the tag period is short, one would expect true single-hit runs to be quite rare, so those should
probably be excluded to be safe. The likelihood of obtaining the same false ID in consecutive detections is probably very low,
except for some specific tag IDs that are more prone to error. Runs of 2 or more detections are probably safe in most instances.
- Measures of noise: A larger number of radio pulses is likely a good predictor of false positives, though other factors are also
at play, and not all radio noise is problematic in the same way. We recommend that you use the number of runs comprising only 2 hits,
which is provided by the activity table (see details here). In noisy environments, shorter runs are much more likely to be false
positives and should be excluded.
- Missed detections: False positives should more often lead to gaps in detections during a run. A run that contains several gaps
would therefore be deemed less reliable. A simple metric for this is the number of hits in a run (the run length) divided by the
duration of the run (tsEnd - tsStart). Longer runs with few or no gaps in detection are optimal.
- Overlapping runs: Overlapping runs of the same tag on the same antenna are also an indication of false positives, but this should be highly correlated
with the number and/or ratio of short runs described above.
- Spatio-temporal models: State-space models and other approaches can be used to assess whether movements make sense from a biological
point of view, and can help assign probabilities that individual detections are valid.
- If you have any suggestions about techniques you are using to separate true detections from false positives, we encourage you to share
them with us!
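As a rough illustration of the run-length and missed-detection checks in the list above, here is a minimal sketch. The function names and arguments are hypothetical. The thresholds follow the suggestions in the text (ignore Lotek runs of only 2 or 3 hits, treat CTT runs of 2 or more hits as probably safe) and should be adjusted to your own data. `detection_fraction` is an added variant, not part of the text: it scales the simple hits-per-duration metric by the tag's nominal period so that tags with different periods can be compared.

```python
def hit_rate(n_hits, ts_start, ts_end):
    """Hits per second over a run; None when the run has no duration."""
    duration = ts_end - ts_start
    if duration <= 0:
        return None
    return n_hits / duration


def detection_fraction(n_hits, ts_start, ts_end, period):
    """Fraction of expected bursts actually detected (assumed variant)."""
    expected = round((ts_end - ts_start) / period) + 1
    return n_hits / expected


def keep_run(n_hits, manufacturer):
    """Apply the minimum run lengths suggested in the text."""
    min_hits = 4 if manufacturer == "Lotek" else 2
    return n_hits >= min_hits
```

A run of 5 hits spread over 45 seconds for a tag with a 4.5 second period was detected on 5 of 11 expected bursts, so it contains more gaps than hits and would deserve a closer look.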
## False negatives ##
False negatives are usually even more difficult to detect.
- Faulty equipment or installation is always possible, of course. Please refer to the installation guidelines to make sure
you follow all the recommendations.
- DO NOT FORGET to register your tags! (details here)
- DO NOT FORGET to activate your tags before deploying them!
- Make sure that you report your tag and receiver deployment details BEFORE uploading your data. The Motus tag finder only seeks tags
that are known to be deployed.
## Complication: data are processed in batches ##
The picture above is complicated by several facts:
- receivers are often deployed to isolated areas so that we can only
obtain raw data from them occasionally
- receivers are not aware of the full set of currently-active tags
- sensorgnome receivers do not "know" the full Lotek code set; they record
pulses thought to be from Lotek coded ID tags, but are only able
to assemble them into tag hits for a small set of tags for
which they have an on-board database, built from the user's own
recordings of their tags. This limitation is due to restrictions
in the agreement between Lotek and Acadia University for our use
of their codeset.
- Lotek receivers report each detected tagID independently, and do not
assemble them into runs. This means a raw Lotek .DTA file does not
distinguish between tag 123:4.5 and tag 123:9.1 (i.e. between tags
with ID `123` and burst intervals 4.5 seconds and 9.1 seconds).
It follows that:
- raw receiver data must be processed on a server with full knowledge of:
- which tags are deployed and likely active
- the Lotek codeset(s)
- raw data should be processed in incremental batches
- processed data should be distributed to users in incremental batches,
especially if they wish to obtain results "as they arrive", rather than
in one lump after all their tags have expired.
## Batches ##
A **batch** is the result of processing one collection of raw data files from a receiver.
Batches are not inherent in the data, but instead reflect how data were processed.
Batches arise in these ways:
- a user visits an isolated receiver, downloads raw data files from
it, then uploads the files to motus.org once they are back from the
field
- a receiver connected via WiFi, ethernet, or cell modem is polled
for new data files; this typically happens every hour, with random
jitter to smooth the processing load on the motus server
- an archive of data files from a receiver is re-processed on the
motus server, because important metadata have changed (e.g. new or
changed tag deployment records), or because a significant change
has been made to processing algorithms.
Batches are artificial divisions in the data stream, so runs of hits will
often cross batch boundaries. Adding this complication to the picture above gives
this:
```
Receiver R
Time ->
\====================|=================================|=====================\
Tag A: / 1-----1--1-|---1-----1-----1 4---4|-----4--4-------4--4-/
\. . . . . . . . . . | . . . . . . . . . . . . . . . . | . . . . . . . . . . \
Tag B: / | 3-----3--3--3--3--3-------3---|-3--3 /
\. . . . . . . . . . | . . . . . . . . . . . . . . . . | . . . . . . . . . . \
Tag C: / 2--2---2|--2--2--2--2--2--2--2--2--2--2--2|--2--2--2--2--2--2 /
\====================|=================================|=====================\
| |
<---- Batch N ------>|<------- Batch N+1 ------------->|<----- Batch N+2 ---->
```
### Receiver Reboots ###
A receiver **reboots** when it is powered down and then (possibly much later) powered
back up. Reboots often correspond to a receiver:
- being redeployed
- having its software updated
- or having a change made to its attached radios,
so motus treats receiver reboots in a special way:
- a reboot always begins a new batch; i.e. batches never extend
across reboots. This simplifies determination of data ownership.
For example, all data in a boot session (time period between
consecutive reboots) are deemed to belong to the same motus
project. This reflects the fact that a receiver is (almost?)
always turned off between the time it is deployed by one project,
and the time it is redeployed by another project.
- any active tag runs are ended when a receiver reboots. Even if the
same tag is present and broadcasting, and even if the reboot takes
only a few minutes, hits of a tag before and after the reboot
will belong to separate runs. This is partly for convenience in
determining data ownership, as mentioned above. It is also
necessary because sometimes receiver clocks are not properly set by
the GPS after a reboot, and so the timestamps for that boot session
will revert to a machine default, e.g. 1 Jan 2000. Although runs
from these boot sessions could in principle be re-assembled post
hoc if the system clock can be pinned from information in field
notes, this is not done automatically at present.
- parameters to the tag-finding algorithm are set on a per-batch basis.
At some field sites, we want to allow more lenient filtering because
there is very little radio noise. At other sites, filtering should
be more strict, because there is considerable noise and high false-positive
rates for tags. motus allows projects to set parameter overrides for
individual receivers, and these overrides are applied by boot session,
because redeployments (always?) cause a reboot.
- when reprocessing data (see below) from an archive of data files,
each boot session is processed as a batch.
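The reboot rules above mention that a boot session's clock may revert to a machine default (e.g. 1 Jan 2000) when the GPS fails to set it. One way a user might spot such boot sessions in their own data is sketched below. The `(boot_num, ts)` record layout and the 2010 cutoff are assumptions made for the example, not something Motus prescribes.

```python
from datetime import datetime, timezone

# Assumed cutoff: no valid data are expected before this date.
CLOCK_VALID_AFTER = datetime(2010, 1, 1, tzinfo=timezone.utc).timestamp()


def boot_sessions_with_unset_clock(records):
    """Return boot numbers whose timestamps all fall before the cutoff.

    records: iterable of (boot_num, ts), with ts in seconds since 1 Jan 1970 UTC.
    """
    latest = {}  # latest timestamp seen in each boot session
    for boot_num, ts in records:
        latest[boot_num] = max(ts, latest.get(boot_num, ts))
    return sorted(b for b, ts in latest.items() if ts < CLOCK_VALID_AFTER)
```

Boot sessions flagged this way are candidates for having their clock pinned from field notes.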
### Incremental Distribution of Data ###
The [Motus R package](https://github.com/MotusWTS/motus/)
allows users to build a local copy of the database of all their tags'
(or receivers') hits incrementally. A user can regularly call the
`tagme()` function to obtain any new hits of their tags. Because
data are processed in batches, `tagme()` either does nothing, or downloads
one or more new batches of data into the user's local DB.
Each new batch corresponds to a set of files processed from a single
receiver. A batch record includes these items:
- receiver device ID
- how many hits of the user's tags occurred in the batch
- first and last timestamp of the raw data processed in this batch
Each new batch downloaded will include hits of one or more
of the user's tags (or someone's tags, if the batch is for a "receiver"
database).
A new batch might also include some GPS fixes, so that the user knows
where the receiver was when the tags were detected.
A new batch will include information about runs. This information comes
in three versions:
- information about a new run; i.e. one that begins in this batch
- information about a continuing run; i.e. a run that began in a previous batch,
has some hits in this batch, and is not known to have ended
- information about an ending run; i.e. a run that began in a previous batch,
might have some hits in this batch, but which is also known to end in this batch
(because a sufficiently long time has elapsed since the last detection of its tag)
Although the unique `runID` identifier for a run doesn't change when the user calls
`tagme()`, the number of hits in that run and its status (`done` or not), might
change.
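The effect of new, continuing and ending run records on a local copy of the data can be sketched as a simple update keyed by `runID`. This is a toy model, not the actual `tagme()` implementation: the dictionary layout and the `len` field are assumptions, and each record is assumed to carry the run's updated total number of hits.

```python
def apply_batch(local_runs, batch_runs):
    """Update a local {runID: {"len", "done"}} store with one batch of run records."""
    for rec in batch_runs:
        # the runID never changes; the hit count and status may
        local_runs[rec["runID"]] = {"len": rec["len"], "done": rec["done"]}
    return local_runs
```

Applying three successive batches in which run 2 is first new, then continuing, then ending leaves a single entry for run 2 whose `len` has grown and whose `done` flag is finally set.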
## Reprocessing Data ##
motus will occasionally need to reprocess raw files from receivers. There are several
reasons:
- new or modified tag deployment records. The tag detection code
relies on knowing the active life of each tag it looks for, to
control rates of false positive and false negative hits. If
the deployment record for a tag only reaches the server after it
has already processed raw files overlapping the tag's deployment,
then those files will need to be reprocessed in order to
(potentially) find the tag therein. Similarly, if a tag was
mistakenly sought during a period when it was not deployed, it will
have "used up" signals that could instead have come from other
tags, thereby causing both its own false positives, and false
negatives on other tags. (This is only true for Lotek ID tags;
CTT should be unaffected, provided deployed tags are well
dispersed in the ID codespace.)
- bug fixes or improvements in the tag finding algorithm
- corrections of mis-filed data from receivers. Sometimes,
duplication among receiver serial numbers (a rare event) is only
noticed *after* data from them has already been processed. Those
data will likely have to be reprocessed so that hits are
assigned to the correct station. Interleaved data from two
receivers having the same serial number will typically prevent
hits from at least one of them, as the tag finder ignores data
where the clock seems to have jumped backwards significantly.
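For the first reason above (a deployment record that arrives late), the boot sessions that need reprocessing are those that were processed before the record arrived and that overlap the tag's deployment. A minimal sketch, assuming hypothetical field names and plain numeric timestamps:

```python
def sessions_to_reprocess(sessions, tag_start, tag_end, record_received):
    """Return boot numbers of sessions processed without knowledge of the tag.

    sessions: list of dicts with keys boot_num, ts_start, ts_end, processed_at.
    """
    return [
        s["boot_num"]
        for s in sessions
        if s["processed_at"] < record_received                     # processed too early
        and s["ts_start"] <= tag_end and tag_start <= s["ts_end"]  # overlaps deployment
    ]
```

Sessions processed after the record was received already took the tag into account, so they are left alone.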
## The (eventual) Reprocessing Contract ##
Reprocessing can be very disruptive from the user's point of view
("What happened to my hits?"), so motus reprocessing will be:
1. optional: users should be able to obtain new data without
having to accept reprocessed versions of data they already have.
2. reversible: users should be able to "go back" to a previous
version of any reprocessed data they have accepted.
3. transparent: users will receive a record of what was reprocessed,
why, when, what was done differently, and what changed
4. all-or-nothing: for each receiver boot session for which users
have data, these data must come entirely from either the original
processing, or a subsequent single reprocessing event. The user
*must not* end up with an undefined mix of data from original and
reprocessed sources.
5. in-band: the user's copy of data will be updated to incorporate
*reprocessed* data as part of the normal process of updating to obtain
*new* data, unless they choose otherwise. We expect that most users
will want to accept reprocessed data most of the time.
Initially, motus data processing might not adhere to this contract,
but it is an eventual goal.
## Reprocessing simplified: only by boot session ##
A general reprocessing scenario would look like this:
```
Receiver R
Time ->
\=================!==|=================================|======!==============\
Tag A: / 1-----1-!1-|---1-----1-----1 4---4|-----4!-4-------4--4-/
\. . . . . . . . .!. | . . . . . . . . . . . . . . . . | . . .!. . . . . . . \
Tag B: / ! | 3-----3--3--3--3--3-------3---|-3--3 ! /
\. . . . . . . . .!. | . . . . . . . . . . . . . . . . | . . .!. . . . . . . \
Tag C: / 2--2-!-2|--2--2--2--2--2--2--2--2--2--2--2|--2--2!-2--2--2--2 /
\=================!==|=================================|======!==============\
! | | !
<---- Batch N ----!->|<------- Batch N+1 ------------->|<-----!Batch N+2 ---->
! !
!<- Reprocess this period (no, too hard!) ->!
```
if raw data records from an arbitrary stretch of time could be reprocessed.
However, this is complicated because runs like `1`, `2`, and `4` above might lose
or gain hits within the reprocessing period, but not outside of it. This
might even break an existing run into distinct new runs.
This situation is challenging (NB: not impossible; might be a TODO) to
formalize and represent in the database if we want to maintain a full
history of processing. For example, if reprocessing deletes some hits
from run `2`, how do we represent both the old and the new versions of
that run?
The complications arise due to runs crossing the reprocessing period
boundaries, so for simplicity we should *choose a reprocessing period
that **no** runs cross*. Currently, that means a boot session, as discussed
above.
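The condition "no runs cross the reprocessing period" can be checked directly. The sketch below is illustrative only: it represents each run as a hypothetical `(ts_start, ts_end)` pair and reports the runs that overlap a candidate period without lying entirely inside it.

```python
def runs_crossing(runs, period_start, period_end):
    """Return the runs that straddle a boundary of the candidate period."""
    crossing = []
    for ts_start, ts_end in runs:
        overlaps = ts_start <= period_end and ts_end >= period_start
        contained = ts_start >= period_start and ts_end <= period_end
        if overlaps and not contained:
            crossing.append((ts_start, ts_end))
    return crossing
```

For a boot session the result is empty by construction, because runs are always ended at a reboot.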
## Distributing reprocessed data ##
The previous section shows why we only reprocess data by boot session.
Given that, how do we get reprocessed data to users while fulfilling
the reprocessing contract?
Note that a reprocessed boot session will fully replace one or more
existing batches and one or more runs, because batches and runs both
nest within boot sessions.
Replacement of data by reprocessed versions should happen in-band
(item 5 of the contract above), so one approach is this:
- the `batches_for_XXX` API entries should mark new batches which result
from reprocessing, so that the client can deal with them appropriately.
This can be done by adding a field called `reprocessID` with these semantics:
- `reprocessID == 0`: data in this batch are from new raw files; the normal
situation
- `reprocessID == X > 0`: data in this batch are from reprocessing
existing raw files.
- `X` is the ID of the reprocessing event, and a new API entry `reprocessing_info (X)`
can be called to obtain details about it.
- if the user chooses to accept the reprocessed version, then existing batches, runs, hits and GPS
fixes from the same receiver and boot session are retired before
adding the new batches.
- if the user chooses to reject the reprocessed version, then `X` is added to a client-side blacklist,
and the user will not receive any data from batches whose reprocessID is on the blacklist.
- later, if a user decides to accept a reprocessing event they had
earlier declined, then the IDs of new batches for
that event can be fetched from another new API
`reprocessing_batches (X)`, and the original batches will be
deleted
- to let users more efficiently fetch the "best" version of their
dataset (i.e. accepting all reprocessing events), we should also
mark batches which are subsequently replaced by a reprocessing
event. For this, we add the field `replacedIn` with these
semantics:
- `replacedIn == 0`: data in this batch have not been replaced by any reprocessing event
- `replacedIn == X > 0`: data in this batch have been replaced by reprocessing event `X`.
Then the client can ignore any batches for which `replacedIn > 0`. We could also add
a new boolean parameter `unreplacedOnly` to the `batches_for_XXX` API entries. It defaults to `false`,
but if `true`, then only batches which have not been replaced by subsequent reprocessing events
are returned.
- users can choose a policy for how reprocessed data are handled by setting the value of
`Motus$acceptReprocessedData` in their workspace before calling `tagme()`:
- `Motus$acceptReprocessedData <- TRUE`; always accept batches of data from reprocessing events
- `Motus$acceptReprocessedData <- FALSE`; never accept batches of data from reprocessing events
- `Motus$acceptReprocessedData <- NA` (default); ask about each reprocessing event
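To make the proposed `reprocessID` / `replacedIn` semantics concrete, here is a minimal client-side sketch of which batches would be kept, given the set of reprocessing events the user has rejected (the blacklist). It is a sketch of the proposal above, not an existing API: the `batchID` field name and the function itself are assumptions.

```python
def batches_to_keep(batches, rejected):
    """Return the batchIDs a client should hold, given rejected event IDs.

    batches: list of dicts with keys batchID, reprocessID, replacedIn.
    rejected: set of reprocessing event IDs on the client-side blacklist.
    """
    keep = []
    for b in batches:
        from_rejected = b["reprocessID"] in rejected
        replaced_by_accepted = b["replacedIn"] > 0 and b["replacedIn"] not in rejected
        if not from_rejected and not replaced_by_accepted:
            keep.append(b["batchID"])
    return keep
```

With an empty blacklist (the equivalent of always accepting reprocessed data) every replaced batch is dropped in favour of its replacement. Putting an event on the blacklist keeps the original batches and drops the reprocessed ones, so the user never ends up with a mix of the two for the same boot session.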