Compute Graph

I. Trivia

This library is for calculations on tables.

Table is a sequence of dict-like objects where each dict is a row of table and every key in the dicts is a column in the table.

Calculation is a sequence of transformations. Computational Graph is a chain of transformations where each node is a transform operation. Definition of operations is separated from its calculations.

II. API.

All nodes are defined in keras style.

if __name__ == "__main__":

    input_node = Input()
    mapper = Map(split_text)(input_node)
    sort = Sort("word")(mapper)
    reduce = Reduce(word_counter, "word")(sort)

    graph = Graph(input_node=input_node, output_node=reduce)
    graph.run(input_file="data/text_corpus.txt",
              output_file=open("word_count.txt", "w"))

Operations in the graph work with rows in the table (not with entire table), because this approach avoids extra copying.

The list of all possible operations:

Map

Apply a mapper operation to every row in table. For example, one of possible mapper:

def tokenizer_mapper(r):
    """
    splits rows with 'text' field into set of rows with 'token' field
    (one for every occurence of every word in text)
    """
    tokens = r['text'].split()
    for token in tokens:
        yield {
               'doc_id' : r['doc_id'],
               'word' : token,
        }

Sort

Sort a table by a set of keys.

Fold

Makes convolution of a table to a single row via binary associated operation. Fold node needs a folder and initial state. Fold calls folder on (state, new row) and returns new state. The result of this operation is a finish state (after calculations on all rows in series).

Possible folder:

def sum_columns_folder(state, record):
     for column in state:
         state[column] += record[column]
     return state

Reduce

This operation is similar to Map but it is called for rows which have the same keys. These rows are united in a list and then passed to reducer. Reduce works for O(n) if rows are sorted by required keys so input sequence of dicts should be sorted.

Example of reducer:

def term_frequency_reducer(records):
      '''
      calculates term frequency for every word in doc_id
      '''
      word_count = Counter()
      for r in records:
          word_count[r['word']] += 1
          
      total  = sum(word_count.values)
      for w, count in word_count.items():
          yield {
              'doc_id' : r['doc_id'],
              'word' : w,
              'tf' : count / total
          }

This reducer takes a list of rows with the same "doc_id" and count frequncy for each word in a document.

Join

Works like SQL Join operation. The result of one graph is joined to another graph.

Join takes another graph as an argument. Topological sort is used to find optimal order of computation of different graphs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Files

README.md

Latest commit

History

README.md

File metadata and controls