Frequently Asked Questions

I'm getting infrequent {{{AttributeError: 'OrderedDict' object has no attribute '_OrderedDict__root'}}} errors

Sometimes (rarely) you may see one of these errors show up while running LSD code:

Traceback (most recent call last):
  File "/n/sw/python-2.7/lib/python2.7/multiprocessing/queues.py", line 242, in _feed
    send(obj)
  File "/n/sw/python-2.7/lib/python2.7/collections.py", line 91, in __reduce__
    items = [[k, self[k]] for k in self]
  File "/n/sw/python-2.7/lib/python2.7/collections.py", line 74, in __iter__
    root = self.__root
AttributeError: 'OrderedDict' object has no attribute '_OrderedDict__root'

We currently believe this is most likely due to a bug in the Python 2.7 implementation of !OrderedDict. While the message looks scary, it appears to have no effect on the actual execution of the code.

Tips for Debugging LSD Code

When developing LSD code, it may be advantageous to run it single-threaded while debugging. To do that, set the environment variable DEBUG to 1. E.g.:

export DEBUG=1
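
For example, to enable single-threaded execution for a single run only (and, optionally, drop into the Python debugger), you could do something like the following; the script name is hypothetical:

DEBUG=1 python my_query.py
DEBUG=1 python -m pdb my_query.py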

See the description of [wiki:LargeSurveyDatabaseAPI#envvars environment variables used by LSD] for more details.

How Do I Set the Number of Workers?

Set the NWORKERS environment variable to the desired number of workers. For example:

export NWORKERS=8

will force the code to use 8 workers, irrespective of the number of cores present.

By default, LSD uses as many workers as there are (logical) cores on the system.
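
As with DEBUG, the variable can also be set for a single run only (the script name is hypothetical):

NWORKERS=8 python my_query.py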

See the description of [wiki:LargeSurveyDatabaseAPI#envvars environment variables used by LSD] for more details.

What gets yielded into kernels?

Given a mapper (i.e., the first kernel in the list passed to query.execute()) such as:

def row_counter_kernel(qresult):
        for rows in qresult:
                yield len(rows)

what are the 'rows' that qresult yields?

Logically, these are ''blocks of rows'' (i.e., not single rows!) that form the query result in the cell where the mapper is running. That is, for a number of reasons (one being that the complete query result may not fit into memory), the query may be executed in blocks rather than all at once. As soon as each block is computed, it is yielded to the mapper as the variable 'rows'.
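
For completeness, here is a minimal sketch of how the mapper above could be run end to end; the database path, table, and column names are hypothetical, and {{{query.execute()}}} is the call mentioned above (with a single map kernel, iterating over it collects whatever the mapper yields):

import lsd

db = lsd.DB('db')                                  # hypothetical path to an LSD database
query = db.query('SELECT ra, dec FROM ps1_obj')    # hypothetical table and columns

total = 0
for block_count in query.execute([row_counter_kernel]):
        total += block_count                       # one count per block of rows

print 'Total rows matching the query:', total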

Physically, the 'rows' variable is an instance of the ColGroup class. Think of ColGroup as a functional equivalent of a [http://docs.scipy.org/doc/numpy/user/basics.rec.html numpy structured array] (a.k.a. record array), except that the data is internally stored by column instead of by record. So, to obtain the first row of the returned block you would do:

row = rows[0]

Alternatively, to extract a column (as a numpy array), you'd do:

col = rows['ra']

(assuming the query result has a column named 'ra').

Finally, to iterate through all rows, row by row, your mapper would look something like this:

def row_counter_kernel(qresult):
        for rows in qresult:
                for row in rows:
                        g, r = row['mag_g'], row['mag_r']
                        gr = g-r
                        yield gr

where the variables g and r above would be the values of the mag_g and mag_r columns in the given row (scalars). Note, however, that this is '''''EXTREMELY''''' inefficient (orders of magnitude slower) compared to performing vector operations column by column. A much more efficient implementation of the above kernel would operate on the columns of the returned blocks:

def row_counter_kernel(qresult):
        for rows in qresult:
                g, r = rows['mag_g'], rows['mag_r']
                gr = g-r
                yield gr

Note that the {{{ gr = g-r }}} snippet above is now a numpy vector operation (fast).
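
To get a feel for the difference, here is a self-contained numpy comparison (no LSD involved) of computing g-r element by element versus as a single vector operation; the timings will vary by machine, but the vectorized version is typically orders of magnitude faster:

import time
import numpy as np

g = np.random.random(1000000)
r = np.random.random(1000000)

# row-by-row: a Python-level loop over a million elements (slow)
t0 = time.time()
gr_loop = np.array([g[i] - r[i] for i in xrange(len(g))])
print 'row-by-row: %.3f s' % (time.time() - t0)

# vectorized: a single numpy operation (fast)
t0 = time.time()
gr_vec = g - r
print 'vectorized: %.3f s' % (time.time() - t0)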

For more discussion, please see [wiki:LargeSurveyDatabase the main LSD document].

MapReduce doesn't correctly regroup values by keys

Check what kind of keys you're giving it.

Only [http://docs.python.org/glossary.html#term-hashable hashable] objects are permitted to serve as keys. In short, these are objects that:

  1. have a {{{__hash__()}}} method that returns a value which does not change during the object's lifetime,
  2. can be tested for equality using {{{==}}} (i.e., have an {{{__eq__()}}} method), and
  3. guarantee that if {{{x == y}}}, then {{{hash(x) == hash(y)}}}. The converse doesn't have to be true.

Built-in immutable types (str, int, long, bool, float, tuple) are hashable. '''We recommend you always use these types as the key''', because it's easy to make a mistake otherwise and accidentally use objects that satisfy 1) and 2), but not 3). One example is records derived from {{{numpy.recarray}}}:

In [266]: x1=np.array([1,1]); x2=np.array(['a','a']); r = np.core.records.fromarrays([x1,x2],names='a,b')

In [270]: r[0].__hash__()
Out[270]: 4444286

In [271]: r[1].__hash__()
Out[271]: -8070450532243484546

In [272]: r[0] == r[1]
Out[272]: True

In [274]: set([r[0], r[1]])
Out[274]: set([(1, 'a'), (1, 'a')])

In particular, note that the set on the last line appears to contain two identical elements (!).
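
If you do need to build keys from query columns, a simple way to stay safe is to convert them to built-in Python types before yielding them from your mapper. A minimal sketch (the column names 'run' and 'mag' are hypothetical):

def safe_key_mapper(qresult):
        for rows in qresult:
                runs, mags = rows['run'], rows['mag']
                for run, mag in zip(runs, mags):
                        # int() turns the numpy scalar into a plain Python integer,
                        # which is hashable and well-behaved as a MapReduce key
                        yield int(run), mag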