# Customize transform function to process data
This document introduces how to define a transform function for `tf.keras.layers.Lambda` or for the `normalizer_fn` argument of `tf.feature_column.numeric_column` to transform features.
For `tf.keras.layers.Lambda`:

```python
tf.keras.layers.Lambda(
    lambda x: transform_fn(x)
)
```
For `tf.feature_column.numeric_column`:

```python
tf.feature_column.numeric_column(
    name,
    dtype=tf.int32,
    normalizer_fn=lambda x: transform_fn(x)
)
```
The transform function must not only process data during training but also be saved in the SavedModel along with the Keras model. So we implement the transform function with TensorFlow operators, which lets us export the function to the graph and save it in the SavedModel.
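A minimal sketch of this idea (the bucket size and feature values here are illustrative, not from the original): build the transform from TensorFlow ops only and wrap it in a `tf.function` (or a `tf.keras.layers.Lambda`), so it is traced into the graph.

```python
import tensorflow as tf

# Illustrative transform built only from TensorFlow ops, so it can be
# exported into the SavedModel graph together with the model.
@tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
def transform_fn(x):
    # Hash each string feature value into one of 10 buckets in-graph.
    return tf.strings.to_hash_bucket_fast(x, 10)

ids = transform_fn(tf.constant(["State-gov", "Private"]))
```

Because the function contains only graph ops and no Python-side preprocessing, `tf.saved_model.save` can serialize it together with the model.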
## Why do we need to transform features to ids?
We could use `tf.feature_column.embedding_column` to wrap a category column and transform the feature into a dense tensor. However, `tf.feature_column.embedding_column` can only transform one column, so we must create an `embedding_column` for each of our many categorical features. In this case, we must create separate embedding weights for each `embedding_column`, which requires more memory.
Sometimes we want to split the features into groups and make one embedding for the values in each group. To make the embedding, we need to transform each feature value to an id and then use a `tf.keras.layers.Embedding` layer. During transformation, we must guarantee that the ids of different features in the same group do not conflict.
For example, there are 5 features in the training data.
age | workclass | education | marital_status | hours_per_week |
---|---|---|---|---|
39 | State-gov | Bachelors | Never-married | 40 |
50 | Private | Bachelors | Divorced | 45 |
38 | Local-gov | Doctorate | Separated | 35 |
Suppose we place "age" and "workclass" in one group, and "education", "marital_status", and "hours_per_week" in another group. For the numeric features "age" and "hours_per_week", we can bucketize the values to ids. For the categorical features, we can transform each value to an id by looking it up in a vocabulary list.
Suppose the boundaries for "age" and "hours_per_week" are:

"age": [49, 100]
"hours_per_week": [30, 40, 50]
The vocabularies are:

"workclass": ["State-gov", "Private", "Local-gov"]
"education": ["Bachelors", "Doctorate"]
"marital_status": ["Never-married", "Divorced", "Separated"]

In the first group, the id range for "age" is [0, 1, 2] and the id range for "workclass" is [3, 4, 5], because the ids of the elements in ["State-gov", "Private", "Local-gov"] must be shifted by the offset len([0, 1, 2]) = 3 to avoid conflicting with the ids of "age". In the second group, the id range for "education" is [0, 1], the id range for "marital_status" is [2, 3, 4], and the id range for "hours_per_week" is [5, 6, 7, 8]. Now we can transform the training data into the following table.

age | workclass | education | marital_status | hours_per_week |
---|---|---|---|---|
0 | 3 | 0 | 2 | 7 |
1 | 4 | 0 | 3 | 7 |
0 | 5 | 1 | 4 | 6 |
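The offset bookkeeping above can be sketched in plain Python, assuming each feature reports how many ids it produces (its bucket count or vocabulary size); the helper name is illustrative:

```python
def group_offsets(id_counts):
    """Given (feature, id_count) pairs in group order, return the
    starting offset of each feature so that ids never conflict."""
    offsets = {}
    start = 0
    for name, count in id_counts:
        offsets[name] = start
        start += count
    return offsets

# Group 1: "age" has 3 bucket ids, "workclass" has 3 vocabulary ids.
offsets = group_offsets([("age", 3), ("workclass", 3)])
# -> {"age": 0, "workclass": 3}
```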
We can utilize `tf.feature_column.numeric_column` and define its `normalizer_fn` to transform features to ids column by column.

```python
tf.feature_column.numeric_column(
    name,
    dtype=tf.int32,
    normalizer_fn=transform_fn
)
```
The following sections introduce how to define a `normalizer_fn` that transforms categorical and numeric features to ids.
### Hash the feature value to an id

```python
def hash_bucket_id(x, bucket_size, offset=0):
    # Convert non-string values to strings before hashing.
    if x.dtype is not tf.string:
        x = tf.strings.as_string(x)
    return tf.strings.to_hash_bucket_fast(x, bucket_size) + offset

# HASH_BUCKET_SIZE and id_offset are configured per feature.
transform_fn = (
    lambda x, size=HASH_BUCKET_SIZE, offset=id_offset: (
        hash_bucket_id(x, size, offset)
    )
)
```
### Look up the id from a vocabulary list

```python
from tensorflow.python.ops import lookup_ops

def vocabulary_lookup_id(x, vocabulary_list, offset=0):
    # Out-of-vocabulary values fall into a single extra bucket.
    table = lookup_ops.index_table_from_tensor(
        vocabulary_list=vocabulary_list, num_oov_buckets=1, default_value=-1
    )
    return table.lookup(x) + offset

# vocabulary_list and id_offset are configured per feature.
transform_fn = (
    lambda x, voca_list=vocabulary_list, offset=id_offset: (
        vocabulary_lookup_id(x, voca_list, offset)
    )
)
```
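Note that `lookup_ops` lives under `tensorflow.python.ops` and is not a public API. A rough equivalent with the public `tf.lookup` module, using the "workclass" vocabulary and offset 3 from the example above:

```python
import tensorflow as tf

vocab = ["State-gov", "Private", "Local-gov"]
init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(vocab),
    values=tf.range(len(vocab), dtype=tf.int64),
)
# One out-of-vocabulary bucket, matching num_oov_buckets=1 above.
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=1)

offset = 3  # size of the "age" id range in the same group
ids = table.lookup(tf.constant(["State-gov", "Private", "Local-gov"])) + offset
# -> [3, 4, 5]
```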
### Bucketize the numeric value to an id

```python
from tensorflow.python.ops import math_ops

def bucket_id(x, boundaries, offset=0):
    # Parse string inputs and cast everything to float before bucketizing.
    if x.dtype is tf.string:
        x = tf.strings.to_number(x, out_type=tf.float32)
    else:
        x = tf.cast(x, tf.float32)
    bucket_id = math_ops._bucketize(x, boundaries=boundaries)
    return bucket_id + offset

# LOG_BOUNDARIES and id_offset are configured per feature.
transform_fn = (
    lambda x, boundaries=LOG_BOUNDARIES, offset=id_offset: (
        bucket_id(x, boundaries, offset)
    )
)
```
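Since `math_ops._bucketize` is a private TensorFlow API, the same bucketing can be done with the public `tf.raw_ops.Bucketize` op. A small sketch with the "age" boundaries from the example above:

```python
import tensorflow as tf

# Bucketize "age" with boundaries [49, 100]:
# (-inf, 49) -> 0, [49, 100) -> 1, [100, inf) -> 2.
ages = tf.constant([39.0, 50.0, 38.0])
ids = tf.raw_ops.Bucketize(input=ages, boundaries=[49.0, 100.0])
# -> [0, 1, 0]
```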
### Pad a variable-length id sequence to a fixed length

```python
def pad_sequence(x, maxlen):
    # x holds comma-separated id strings with shape (batch, 1); after the
    # split it is a RaggedTensor of shape (batch, 1, None).
    x = tf.strings.split(x, sep=",")
    x = tf.strings.to_number(x, tf.int64)
    # x.values drops the uniform inner dimension; to_tensor pads to maxlen.
    return x.values.to_tensor(shape=(None, maxlen))

transform_fn = lambda x, maxlen=50: pad_sequence(x, maxlen)
```
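A quick check of the padding behavior (with `maxlen` shortened to 4 for illustration, and assuming inputs of shape `(batch, 1)`):

```python
import tensorflow as tf

def pad_sequence(x, maxlen):
    # Split comma-separated id strings of shape (batch, 1).
    x = tf.strings.split(x, sep=",")       # ragged, shape (batch, 1, None)
    x = tf.strings.to_number(x, tf.int64)
    # Drop the uniform inner dimension and pad each row to maxlen.
    return x.values.to_tensor(shape=(None, maxlen))

padded = pad_sequence(tf.constant([["1,2,3"], ["4,5"]]), maxlen=4)
# -> [[1, 2, 3, 0], [4, 5, 0, 0]]
```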