Data Transform Process

Normalize Table Schema: Wide Table

Transform the table schema to be wide (i.e., one table column per feature) if it is not already. We implement this using a batch processing job such as a MaxCompute job.
Q: How do we describe the table flattening behavior using SQLFlow?

  1. We provide a UDF to transform the table, and the user writes SQL with the UDF to execute the transformation. In this way, the user must write the column names in the SQL, like:
CREATE TABLE IF NOT EXISTS flatten_table LIFECYCLE 7
AS SELECT transform_udf(kv_feature) AS (age, income, region) FROM source_table;

In the SQL, "kv_feature" is the column that contains all the feature values encoded in a string such as "age:60,income:1688,region:USA".

  2. Alternatively, we can launch a pod that parses the feature names from the source table using Python (PyODPS + a UDF). Then we generate the SQL with those feature names to transform the source table into a wide table, as shown in the sketch below.
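
For option 2, a minimal sketch using PyODPS is shown below; the connection parameters, table names, and the generated SQL are placeholders, not the final implementation.

```python
# A minimal sketch (assumed credentials and table names) of option 2: read one
# sample row with PyODPS, parse the feature names from the "kv_feature" string,
# and generate the wide-table SQL from those names.
from odps import ODPS

o = ODPS(access_id="...", secret_access_key="...", project="my_project")

# Read one record and parse feature names from "age:60,income:1688,region:USA".
record = o.get_table("source_table").head(1)[0]
feature_names = [kv.split(":")[0] for kv in record["kv_feature"].split(",")]

# Generate the flattening SQL with the parsed feature names.
columns = ", ".join(feature_names)
sql = (
    "CREATE TABLE IF NOT EXISTS flatten_table LIFECYCLE 7 "
    "AS SELECT transform_udf(kv_feature) AS ({}) FROM source_table;".format(columns)
)
o.execute_sql(sql)
```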

Compute Statistics Using SQL

Calculate the statistical values required by the transform code generation ("code_gen") step below.
Q: How do we store statistics such as the vocabulary? In file storage? One option: when exporting the model, save the vocabulary values under the assets folder.
In the statistics step, we analyze each column and save the resulting statistics to a temporary table, for example:

| feature_name | statistics variable | value           |
| ------------ | ------------------- | --------------- |
| age          | mean                | 56.4            |
| age          | variance            | 45.3            |
| income       | mean                | 1700            |
| income       | variance            | 168.2           |
| region       | vocabulary          | USA,China,Japan |
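
A hedged sketch of how such a statistics table could be produced, assuming the wide table is named flatten_table and that the AVG, VARIANCE, and WM_CONCAT aggregates are available in the SQL engine:

```python
# A minimal sketch (assumed table/column names and aggregate functions) of the
# statistics step: compute the statistics with SQL and save them into a
# temporary statistics table.
from odps import ODPS

o = ODPS(access_id="...", secret_access_key="...", project="my_project")

o.execute_sql("""
CREATE TABLE IF NOT EXISTS feature_statistics LIFECYCLE 7 AS
SELECT 'age' AS feature_name, 'mean' AS statistics_variable,
       CAST(AVG(age) AS STRING) AS value FROM flatten_table
UNION ALL
SELECT 'age', 'variance', CAST(VARIANCE(age) AS STRING) FROM flatten_table
UNION ALL
SELECT 'region', 'vocabulary', WM_CONCAT(',', region)
FROM (SELECT DISTINCT region FROM flatten_table) t
""")
```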

After the analysis, we can generate the transformation code with "code_gen". Statistics of small size, such as mean and variance, can be written directly into the generated code. The vocabulary may be very large, and writing it into the code would make the code very verbose, so we save the vocabulary to a file and generate code that references the file path. The vocabulary file is then packed into the training image along with the generated code.
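
A minimal sketch (the constant names and vocabulary file path are assumed, not final) of the code that "code_gen" could emit:

```python
# Small statistics (mean, variance) are inlined as constants; the large
# vocabulary is loaded from a file whose path is written into the generated
# code and packed into the training image.
import tensorflow as tf

AGE_MEAN = 56.4
AGE_STDDEV = 45.3 ** 0.5  # standard deviation derived from the variance
REGION_VOCAB_FILE = "assets/region_vocab.txt"

def transform_age(age):
    # Standardize the numeric feature with the precomputed statistics.
    return (tf.cast(age, tf.float32) - AGE_MEAN) / AGE_STDDEV

region_column = tf.feature_column.categorical_column_with_vocabulary_file(
    key="region", vocabulary_file=REGION_VOCAB_FILE)
```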

Generate the Code of Data Transform Stage With SQLFlow

  1. We can use Keras layers + feature columns to do the data transformation (a sketch follows this list). Please see the Google Cloud sample.
  2. We can use only Keras preprocessing layers and a Lambda layer with a custom transform_fn to do the data transformation, without feature columns. The design of the Keras preprocessing layers has been submitted.
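
A minimal sketch of approach 1, assuming the feature names and statistics from the example above; the concrete layer sizes are placeholders:

```python
# Feature columns fed into a Keras DenseFeatures layer, followed by the
# neural network structure.
import tensorflow as tf

feature_columns = [
    tf.feature_column.numeric_column(
        "age", normalizer_fn=lambda x: (x - 56.4) / 6.7),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "region", ["USA", "China", "Japan"])),
]

inputs = {
    "age": tf.keras.Input(shape=(1,), name="age"),
    "region": tf.keras.Input(shape=(1,), name="region", dtype=tf.string),
}
x = tf.keras.layers.DenseFeatures(feature_columns)(inputs)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)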

Idea (Need Sample Code): dataset_fn can be auto-generated from the normalized table schema.
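
One possible sketch of such an auto-generated dataset_fn, assuming the wide table is exported as CSV files and the column names and types come from the table schema:

```python
# The parsing code is fully determined by the column names and types read
# from the normalized (wide) table schema.
import tensorflow as tf

COLUMN_NAMES = ["age", "income", "region", "label"]  # read from the table schema
COLUMN_DEFAULTS = [0.0, 0.0, "", 0]                  # derived from the column types

def dataset_fn(file_pattern, batch_size=32):
    return tf.data.experimental.make_csv_dataset(
        file_pattern,
        batch_size=batch_size,
        column_names=COLUMN_NAMES,
        column_defaults=COLUMN_DEFAULTS,
        label_name="label",
        num_epochs=1)
```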

Create a Feature Transform Library Based on TensorFlow Ops for TRANSFORMER Keywords in SQLFlow

For each TRANSFORMER keyword in SQLFlow, we should build a common transform function set using TensorFlow ops. These functions can be fed into tf.keras.layers.Lambda or the normalizer_fn of numeric_column. For example, for "HASH_BUCKET" in SQLFlow, we should build a function that hashes and bucketizes a value into an id. Because the transform function set is built upon TensorFlow ops, we can ensure consistency between training and inference.
The functions in this library can be executed and debugged in both eager mode and graph mode.
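
A minimal sketch of one such library entry for HASH_BUCKET (the function name is an assumption, not the final API):

```python
# A hash-and-bucketize function built only on TensorFlow ops, so it runs in
# both eager and graph mode and can be wrapped by tf.keras.layers.Lambda.
import tensorflow as tf

def hash_bucket(x, num_buckets):
    # Map a string value to an integer id in [0, num_buckets).
    return tf.strings.to_hash_bucket_fast(x, num_buckets)

# Eager-mode debugging: hash_bucket(tf.constant(["USA", "China"]), 16)
hash_layer = tf.keras.layers.Lambda(lambda x: hash_bucket(x, 16))
```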

Key point: express the transform functions using the COLUMN expression. How do we design the SQLFlow syntax to express these functions elegantly?

Transform Code Structure

We want to settle on a pattern for the model definition so that we can generate code according to this pattern.

Transform Layers => Feature Columns + DenseFeatures => Neural Network Structure

or

Transform Layers => Neural Network Structure

Transform Work: tf.keras.layers.Lambda
Multiple Column Transform: tf.keras.layers.Lambda
Feature Column: Categorical mapper
Embedding:

  1. Dense embedding -> tf.keras.layers.Embedding
  2. Sparse embedding -> Both embedding_column and tf.keras.layers.Embedding + a Keras combiner layer work. We can switch to a native SparseEmbedding layer if Keras provides one in the future.
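
A minimal sketch of the second pattern (Transform Layers => Neural Network Structure, without feature columns); the feature names, statistics, and layer sizes are assumptions:

```python
# Transform layers (Lambda + Embedding) feeding directly into the network.
import tensorflow as tf

age_input = tf.keras.Input(shape=(1,), name="age")
region_input = tf.keras.Input(shape=(1,), name="region", dtype=tf.string)

# Transform work / column transform with Lambda layers.
age_norm = tf.keras.layers.Lambda(lambda x: (x - 56.4) / 6.7)(age_input)
region_id = tf.keras.layers.Lambda(
    lambda x: tf.strings.to_hash_bucket_fast(x, 16))(region_input)

# Dense embedding with tf.keras.layers.Embedding on the integer ids.
region_emb = tf.keras.layers.Embedding(input_dim=16, output_dim=4)(region_id)
region_emb = tf.keras.layers.Flatten()(region_emb)

x = tf.keras.layers.Concatenate()([age_norm, region_emb])
x = tf.keras.layers.Dense(64, activation="relu")(x)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs=[age_input, region_input], outputs=output)
```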

Combine the model definition from the model zoo and the generated transform code into the complete submitter code

Model Definition Standard

  1. Decouple the input features and the feature transformation outputs from the model definition.
  2. The model structure, such as the hidden units, is not hard coded; it is driven by hyper-parameter configuration (see the sketch after this list).
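
A minimal sketch of a model definition following this standard; the function signature is an assumption, not the final model zoo interface:

```python
# The transformed features are passed in as input, and the model structure
# (hidden units) comes from hyper-parameter configuration instead of being
# hard coded.
import tensorflow as tf

def build_model(input_dim, hidden_units, output_dim=1):
    # hidden_units, e.g. [64, 32], is read from the hyper-parameter configuration.
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for units in hidden_units:
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    outputs = tf.keras.layers.Dense(output_dim, activation="sigmoid")(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)

# model = build_model(input_dim=5, hidden_units=[64, 32])
```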

Open Questions

  1. Can the Lambda layer handle SparseTensor inputs or outputs?
  2. How do we implement apply_vocab from a vocabulary file using a Keras Lambda layer?
  3. How do we combine multiple inputs into one and add an individual offset to each input at the same time using a Lambda layer?
  4. What is the code structure of the model definition using the subclassing approach?
  5. How do we combine model modules into a model?