mli edited this page Jan 4, 2015 · 1 revision

Data Format

The parameter server supports flexible text formats, where each example is represented by one ASCII line. For example:

LIBSVM format

label feature_id:weight feature_id:weight feature_id:weight ...
  • label: float
  • feature_id: int32
  • weight: float
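
As an illustration of the LIBSVM layout above, here is a minimal Python sketch that splits one line into its label and its feature_id:weight pairs. The function name `parse_libsvm_line` is hypothetical and is not part of the parameter server; the sketch only mirrors the field types listed above.

```python
from typing import Dict, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse one 'label feature_id:weight ...' line (hypothetical helper)."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    label = float(tokens[0])  # label: float
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        fid, sep, weight = token.partition(":")
        if not sep:
            raise ValueError(f"expected feature_id:weight, got {token!r}")
        features[int(fid)] = float(weight)  # feature_id: int, weight: float
    return label, features


label, features = parse_libsvm_line("1 3:0.5 10:2 42:-1.25")
```

Here `label` is the float in the first column and `features` maps each integer feature id to its weight.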

PS format

label ...; group_id feature[:weight] feature[:weight] ...; group_id ...; ...

Each example consists of several slots separated by semicolons. The first slot contains the label. It may hold several labels (multi-label learning) or be empty (unsupervised learning). Each of the remaining slots represents a feature group. A group starts with a nonzero int32 group id (0 is reserved for the label), followed by multiple feature-weight pairs. The meaning of these pairs depends on the data format:

  • SPARSE_BINARY: the example is a sparse binary vector. feature is a 64-bit unsigned integer ID, and there is no weight.
  • SPARSE: the example is a general sparse vector. feature is a 64-bit unsigned integer ID, and weight is a float value.
  • DENSE: the example is a dense vector. feature is the float value, and there is no weight.
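
To make the slot layout concrete, the following Python sketch parses one PS-format line under each of the three data formats. It is only an illustration of the rules above, not the parameter server's parser: the function name `parse_ps_line`, the string format names passed as arguments, and the choice to skip empty slots (for example after a trailing semicolon) are assumptions.

```python
from typing import List, Tuple

Group = Tuple[int, List[int], List[float]]  # (group_id, keys, vals)


def parse_ps_line(line: str, data_format: str) -> Tuple[List[float], List[Group]]:
    """Parse 'label ...; group_id feature[:weight] ...; ...' (hypothetical helper)."""
    slots = [slot.split() for slot in line.split(";")]
    # First slot: zero or more labels (empty for unsupervised learning).
    labels = [float(tok) for tok in slots[0]]
    groups: List[Group] = []
    for tokens in slots[1:]:
        if not tokens:
            continue  # assumption: ignore empty slots such as a trailing ';'
        group_id = int(tokens[0])
        if group_id == 0:
            raise ValueError("group id 0 is reserved for the label")
        keys: List[int] = []
        vals: List[float] = []
        for tok in tokens[1:]:
            if data_format == "SPARSE_BINARY":
                keys.append(int(tok))       # feature ID only, no weight
            elif data_format == "SPARSE":
                key, sep, weight = tok.partition(":")
                if not sep:
                    raise ValueError(f"SPARSE expects feature:weight, got {tok!r}")
                keys.append(int(key))
                vals.append(float(weight))
            elif data_format == "DENSE":
                vals.append(float(tok))     # the token itself is the value
            else:
                raise ValueError(f"unknown data format: {data_format}")
        groups.append((group_id, keys, vals))
    return labels, groups


sparse = parse_ps_line("1; 1 3:0.5 7:2; 2 100:1", "SPARSE")
binary = parse_ps_line("; 1 3 7", "SPARSE_BINARY")
dense = parse_ps_line("0 1; 1 0.5 1.5", "DENSE")
```

In this sketch, a sparse group fills both the key and value lists, a sparse binary group fills only the keys, and a dense group fills only the values.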

More formats

Internally, the parameter server uses protobuf to store the example:

message Slot {
  optional int32 id = 1;
  repeated uint64 key = 2 [packed=true];
  repeated float val = 3 [packed=true];
}

message Example {
  repeated Slot slot = 1;
}
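
The sketch below mirrors the two messages with plain Python dataclasses to show how a parsed example might be laid out. It does not use the generated protobuf classes. Placing the label in the val list of slot 0 and the features in a single group with id 1 are assumptions made for illustration, as is the helper name `to_example`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Slot:
    id: int = 0                                     # optional int32 id = 1
    key: List[int] = field(default_factory=list)    # repeated uint64 key = 2
    val: List[float] = field(default_factory=list)  # repeated float val = 3


@dataclass
class Example:
    slot: List[Slot] = field(default_factory=list)  # repeated Slot slot = 1


def to_example(label: float, features: Dict[int, float]) -> Example:
    """Build an Example from a label and a sparse feature map (hypothetical)."""
    # Assumption: the label is stored as a value in slot 0.
    label_slot = Slot(id=0, val=[label])
    # Assumption: all features go into one group with id 1.
    feature_slot = Slot(id=1, key=list(features), val=list(features.values()))
    return Example(slot=[label_slot, feature_slot])


example = to_example(1.0, {3: 0.5, 10: 2.0})
```

Keeping keys and values in two parallel lists matches the packed repeated fields of Slot, and a binary or dense group would simply leave one of the two lists empty.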

To add a new text format, one first needs to add the format name to DataFormat and then add a function to the class ExampleParser that converts a line of text into an Example.
