
Customize transform function to process data

workingloong edited this page Jan 17, 2020 · 3 revisions

This document introduces how to define the transform function passed to tf.keras.layers.Lambda, or used as the normalizer_fn of tf.feature_column.numeric_column, to transform features.

For tf.keras.layers.Lambda

tf.keras.layers.Lambda(
    lambda x: transform_fn(x)
)

For tf.feature_column.numeric_column

tf.feature_column.numeric_column(
    name,
    dtype=tf.int32,
    normalizer_fn=lambda x: transform_fn(x)
)

The transform function not only processes data during training, but also needs to be saved in the SavedModel along with the Keras model. So we implement the transform function with TensorFlow operators, which lets us export the function to a graph and save it in the SavedModel.
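To illustrate why TensorFlow operators are required, here is a minimal sketch (assuming TF 2.x; the bucket size, export path, and names are illustrative) of a transform function that tf.function can trace into a graph and export with tf.saved_model.save:

```python
import tensorflow as tf

class Transform(tf.Module):
    # transform_fn uses only TensorFlow operators, so tf.function can
    # trace it into a graph that tf.saved_model.save can export.
    @tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
    def transform_fn(self, x):
        return tf.strings.to_hash_bucket_fast(x, 10)

module = Transform()
tf.saved_model.save(module, "/tmp/transform_fn")

# The reloaded SavedModel still contains the transform.
reloaded = tf.saved_model.load("/tmp/transform_fn")
ids = reloaded.transform_fn(tf.constant(["State-gov", "Private"]))
```

A pure-Python transform (e.g. one that calls into NumPy or a dict) would not survive this export, which is why the functions below are built from TensorFlow ops.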

Transform category and numeric feature to id.

Why do we need to transform feature to id?

Although we can wrap a category column with tf.feature_column.embedding_column to transform the feature into a dense tensor, each embedding_column can only transform one column, so we must make many embedding_columns when we have many category features. In this case, we must create separate embedding weights for each embedding_column, which needs more memory.

Sometimes we want to split features into groups and make one embedding for all the values in each group. To make the embedding, we need to transform each feature value to an id and then use a tf.keras.layers.Embedding layer. During transformation, we must guarantee that the ids of different features in the same group do not conflict.

For example, there are 5 features in the training data.

age workclass education marital_status hours_per_week
39 State-gov Bachelors Never-married 40
50 Private Bachelors Divorced 45
38 Local-gov Doctorate Separated 35

Suppose we want to place "age" and "workclass" in one group, and "education", "marital_status" and "hours_per_week" in another group. For the numeric features "age" and "hours_per_week", we can bucket the values into ids. For the category features, we can transform each value to an id by looking it up in a vocabulary list. Suppose the boundaries for "age" and "hours_per_week" are:
"age": [49, 100]
"hours_per_week": [30, 40, 50]

The vocabularies are:
"workclass": ["State-gov", "Private", "Local-gov"]
"education": ["Bachelors", "Doctorate"]
"marital_status": ["Never-married", "Divorced", "Separated"]

In the first group, the id range for "age" is [0, 1, 2], because two boundaries produce three buckets, and the id range for "workclass" is [3, 4, 5]: the ids of the elements in ["State-gov", "Private", "Local-gov"] must be shifted by the offset len([0, 1, 2]) = 3 to avoid conflicting with the ids of "age". In the second group, the id range for "education" is [0, 1], the id range for "marital_status" is [2, 3, 4], and the id range for "hours_per_week" is [5, 6, 7, 8]. Now we can transform the training data into the following table.

age workclass education marital_status hours_per_week
0 3 0 2 7
1 4 0 3 7
0 5 1 4 6
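The id arithmetic for one group can be sketched in plain Python (illustrative only; the real transform must use TensorFlow ops so it can be exported, as the following sections show):

```python
import bisect

def bucket_to_id(value, boundaries, offset=0):
    # bisect_right matches left-inclusive buckets: [49, 100) -> 1, etc.
    return bisect.bisect_right(boundaries, value) + offset

def vocab_to_id(value, vocabulary, offset=0):
    # The offset shifts this feature's ids past the previous feature's range.
    return vocabulary.index(value) + offset

# Group 1: "age" takes ids [0, 1, 2], so "workclass" starts at offset 3.
age_id = bucket_to_id(39, [49, 100])                # -> 0
workclass_id = vocab_to_id(
    "State-gov", ["State-gov", "Private", "Local-gov"], offset=3
)                                                   # -> 3
```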

We can utilize tf.feature_column.numeric_column and define normalizer_fn to transform the features to ids column by column.

tf.feature_column.numeric_column(
    name,
    dtype=tf.int32,
    normalizer_fn=transform_fn
)

In the following, this document introduces how to define normalizer_fn to transform category and numeric features to ids.

Transform category feature to id using hash

import tensorflow as tf

def hash_bucket_id(x, bucket_size, offset=0):
    # Hash strings (casting other dtypes to string first) into
    # the id range [offset, offset + bucket_size).
    if x.dtype is not tf.string:
        x = tf.strings.as_string(x)
    return tf.strings.to_hash_bucket_fast(x, bucket_size) + offset

transform_fn = (
    lambda x, size=HASH_BUCKET_SIZE, offset=id_offset: (
        hash_bucket_id(x, size, offset)
    )
)
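As a quick check (assuming TF 2.x eager mode; the bucket size and offset here are illustrative stand-ins for HASH_BUCKET_SIZE and id_offset), the offset shifts one feature's hash ids past another's range:

```python
import tensorflow as tf

def hash_bucket_id(x, bucket_size, offset=0):
    if x.dtype is not tf.string:
        x = tf.strings.as_string(x)
    return tf.strings.to_hash_bucket_fast(x, bucket_size) + offset

# "workclass" hashes into [0, 16); "education" is offset into [16, 32),
# so the two features in the same group cannot produce conflicting ids.
workclass_ids = hash_bucket_id(tf.constant(["State-gov", "Private"]), 16)
education_ids = hash_bucket_id(tf.constant(["Bachelors"]), 16, offset=16)
```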

Transform category feature to id using vocabulary list

from tensorflow.python.ops import lookup_ops

def vocabulary_lookup_id(x, vocabulary_list, offset=0):
    # Map each value to its index in the vocabulary; out-of-vocabulary
    # values fall into one extra bucket after the vocabulary ids.
    table = lookup_ops.index_table_from_tensor(
        vocabulary_list=vocabulary_list, num_oov_buckets=1, default_value=-1
    )
    return table.lookup(x) + offset

transform_fn = (
    lambda x, voca_list=vocabulary_list, offset=id_offset: (
        vocabulary_lookup_id(x, voca_list, offset)
    )
)
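The same lookup can be written with the public tf.lookup API instead of the internal lookup_ops module (a sketch assuming TF 2.x; the vocabulary and offset come from the "workclass" example above):

```python
import tensorflow as tf

vocabulary = ["State-gov", "Private", "Local-gov"]
init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(vocabulary),
    values=tf.range(len(vocabulary), dtype=tf.int64),
)
# One out-of-vocabulary bucket, mirroring num_oov_buckets=1 above:
# unknown values get id len(vocabulary) = 3 before the offset is added.
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=1)

ids = table.lookup(tf.constant(["Private", "Unknown"])) + 3
```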

Bucket numeric feature to id

from tensorflow.python.ops import math_ops

def bucket_id(x, boundaries, offset=0):
    # Bucketize numeric values (parsing strings first) and shift by offset.
    if x.dtype is tf.string:
        x = tf.strings.to_number(x, out_type=tf.float32)
    else:
        x = tf.cast(x, tf.float32)
    bucket = math_ops._bucketize(x, boundaries=boundaries)
    return bucket + offset

transform_fn = (
    lambda x, boundaries=LOG_BOUNDARIES, offset=id_offset: (
        bucket_id(x, boundaries, offset)
    )
)
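Since math_ops._bucketize is a private API, the same behavior is available through the public tf.raw_ops.Bucketize op (a sketch assuming TF 2.x, using the "hours_per_week" boundaries from the example above):

```python
import tensorflow as tf

def bucket_id(x, boundaries, offset=0):
    # Buckets are left-inclusive: (-inf, 30), [30, 40), [40, 50), [50, inf).
    x = tf.cast(x, tf.float32)
    return tf.raw_ops.Bucketize(input=x, boundaries=boundaries) + offset

# The hours_per_week values 40, 45, 35 with offset 5 become 7, 7, 6.
ids = bucket_id(tf.constant([40, 45, 35]), [30.0, 40.0, 50.0], offset=5)
```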

Transform sequence feature with padding.

def pad_sequence(x, maxlen):
    # Split each comma-separated string into a RaggedTensor, parse the
    # pieces into numbers, and pad every row to maxlen.
    x = tf.strings.split(x, sep=",")
    x = tf.strings.to_number(x, tf.int64)
    return x.to_tensor(default_value=0, shape=(None, maxlen))

transform_fn = lambda x, maxlen=50: pad_sequence(x, maxlen)
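For instance (a sketch assuming TF 2.2+, where RaggedTensor.to_tensor accepts a shape argument), two comma-separated rows of different lengths padded to maxlen=5:

```python
import tensorflow as tf

def pad_sequence(x, maxlen):
    # Split, parse, and zero-pad each row to a fixed width of maxlen.
    x = tf.strings.split(x, sep=",")
    x = tf.strings.to_number(x, tf.int64)
    return x.to_tensor(default_value=0, shape=(None, maxlen))

padded = pad_sequence(tf.constant(["1,2,3", "4"]), maxlen=5)
```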