Frequently Asked Questions
I'm getting infrequent {{{AttributeError: 'OrderedDict' object has no attribute '_OrderedDict__root'}}} errors
Sometimes (rarely) you may see an error like this show up while running LSD code:
Traceback (most recent call last):
File "/n/sw/python-2.7/lib/python2.7/multiprocessing/queues.py", line 242, in _feed
send(obj)
File "/n/sw/python-2.7/lib/python2.7/collections.py", line 91, in __reduce__
items = [[k, self[k]] for k in self]
File "/n/sw/python-2.7/lib/python2.7/collections.py", line 74, in __iter__
root = self.__root
AttributeError: 'OrderedDict' object has no attribute '_OrderedDict__root'
We currently believe this is most likely due to a bug in the Python 2.7 implementation of !OrderedDict. While the message looks scary, it appears to have no effect on the actual execution of the code.
When developing LSD code, it may be advantageous to run it single-threaded while debugging. To do that, set the DEBUG environment variable to 1, e.g.:
export DEBUG=1
See the description of [wiki:LargeSurveyDatabaseAPI#envvars environment variables used by LSD] for more details.
Set the NWORKERS environment variable to the desired number of workers. For example:
export NWORKERS=8
will force the code to use 8 workers, irrespective of the number of cores present.
By default, LSD uses as many workers as there are (logical) cores on the system.
See the description of [wiki:LargeSurveyDatabaseAPI#envvars environment variables used by LSD] for more details.
Given a mapper (i.e., the first kernel in the list passed to query.execute()) such as:
def row_counter_kernel(qresult):
    for rows in qresult:
        yield len(rows)
what are the 'rows' that qresult yields?
Logically, these are ''blocks of rows'' (i.e., not a single row!) that form the query result in the cell where the mapper is running. For a number of reasons (one being that the complete query result may not fit into memory), the query may be executed in blocks rather than all at once. As soon as each block is computed, it is yielded to the mapper as the variable 'rows'.
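The block-wise delivery can be sketched in plain Python. This is only an illustration of the semantics: fake_qresult below is a hypothetical stand-in generator, not the real LSD query result object.

```python
import numpy as np

def fake_qresult(n_rows, block_size):
    """Stand-in for an LSD query result: yields the result in blocks."""
    for start in range(0, n_rows, block_size):
        # Each block is a chunk of rows; here, just a plain numpy array.
        yield np.arange(start, min(start + block_size, n_rows))

def row_counter_kernel(qresult):
    # The mapper sees one block of rows at a time, never the full result.
    for rows in qresult:
        yield len(rows)

counts = list(row_counter_kernel(fake_qresult(10, 4)))
print(counts)       # [4, 4, 2] -- one count per block
print(sum(counts))  # 10 -- the total number of rows in the "query result"
```

Note that the kernel yields one value per block, which is why a counting mapper is usually paired with a reducer that sums the per-block counts.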
Physically, the 'rows' variable is an instance of the ColGroup class. Think of ColGroup as a functional equivalent of a [http://docs.scipy.org/doc/numpy/user/basics.rec.html numpy structured array] (a.k.a. record array), where the data is internally stored by column instead of by record. So, to obtain the first row of the returned block you would do:
row = rows[0]
Alternatively, to extract a column (as a numpy array), you'd do:
col = rows['ra']
(where I assumed we have a column named 'ra' in the query result).
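Both access patterns map directly onto a numpy structured array, so they can be tried out without LSD. A minimal sketch, with made-up 'ra' and 'dec' columns and values:

```python
import numpy as np

# A structured array standing in for a ColGroup-like block of query results.
rows = np.array([(10.5, 0.3), (11.2, -0.7), (12.0, 1.4)],
                dtype=[('ra', 'f8'), ('dec', 'f8')])

row = rows[0]      # first row of the block (a single record)
col = rows['ra']   # the 'ra' column, as a numpy array

print(row['ra'])   # 10.5
print(col)         # [10.5 11.2 12. ]
```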
Finally, to iterate through all rows, row by row, your mapper would look like (for example):
def row_counter_kernel(qresult):
    for rows in qresult:
        for row in rows:
            g, r = row['mag_g'], row['mag_r']
            gr = g-r
            yield gr
where the variables g and r above would be the values of the mag_g and mag_r columns in the given row (scalars). Note, however, that this is '''''EXTREMELY''''' inefficient (orders of magnitude slower) compared to performing vector operations column by column. A much more efficient implementation of the above kernel operates on the columns of the returned blocks:
def row_counter_kernel(qresult):
    for rows in qresult:
        g, r = rows['mag_g'], rows['mag_r']
        gr = g-r
        yield gr
Note that the {{{gr = g-r}}} snippet above is now a numpy vector operation (fast), producing the whole column of differences for the block at once.
For more discussion, please see [wiki:LargeSurveyDatabase the main LSD document].
Check what kind of keys you're giving it.
Only [http://docs.python.org/glossary.html#term-hashable hashable] objects are permitted to serve as keys. In short, these are objects that:
 1. have a {{{__hash__()}}} method that returns a value that does not change within the object's lifetime,
 1. can be tested for equality using == (i.e., have an {{{__eq__()}}} method), and
 1. if {{{x == y}}}, then {{{hash(x) == hash(y)}}}. The converse doesn't have to be true.
Built-in immutable types (str, int, long, bool, float, tuple) are hashable. '''We recommend you always use these types as the key''', because it's easy to make a mistake otherwise and accidentally use objects that satisfy 1) and 2) but not 3). An example is records derived from {{{numpy.recarray}}}:
In [266]: x1=np.array([1,1]); x2=np.array(['a','a']); r = np.core.records.fromarrays([x1,x2],names='a,b')
In [270]: r[0].__hash__()
Out[270]: 4444286
In [271]: r[1].__hash__()
Out[271]: -8070450532243484546
In [272]: r[0] == r[1]
Out[272]: True
In [274]: set([r[0], r[1]])
Out[274]: set([(1, 'a'), (1, 'a')])
In particular, note that the set in the last line appears to contain two identical elements (!), because the two equal records hashed to different values.
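A simple way to stay safe is to convert each record to a built-in tuple before using it as a key. A sketch of one way to do that, using numpy's {{{.item()}}}, which returns a structured scalar as a plain Python tuple (the array construction mirrors the example above):

```python
import numpy as np

x1 = np.array([1, 1])
x2 = np.array(['a', 'a'])
r = np.rec.fromarrays([x1, x2], names='a,b')

# Convert each record to a plain tuple -- tuples hash consistently,
# so equal records now collapse to a single key.
keys = set(rec.item() for rec in r)
print(keys)       # {(1, 'a')}
print(len(keys))  # 1
```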