Some questions about data normalization #411
Replies: 9 comments
-
In addition, how should I un-normalize the predicted y (label) back to real-world values after I use the function learn.get_X_preds(x[splits[2]])? Haha, I have so many questions, and my English is also poor.
-
Hi @chuzheng88, use `batch_tfms = TSNormalize(by_sample=True, by_var=True, range=(0, 1))` like in this example:

```python
X, y, splits = get_regression_data('Covid3Month', split_data=False)
print(X.min(), X.max())  # 0.0 20341.0
tfms = [None, TSRegression()]
batch_tfms = TSNormalize(by_sample=True, by_var=True, range=(0, 1))
dls = get_ts_dls(X, y, splits=splits, tfms=tfms, batch_tfms=batch_tfms)
xb, yb = dls.train.one_batch()
print(xb.min(), xb.max())  # TSTensor([0.0], device=cuda:0) TSTensor([1.0], device=cuda:0)
```

As to the y, why do you want to normalize it? I think it'd be good to try it first without any preprocessing.
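Conceptually, `TSNormalize(by_sample=True, by_var=True, range=(0, 1))` scales each (sample, variable) series independently along the time axis. The following numpy sketch illustrates that idea only; it is an assumption about the behavior, not tsai's actual implementation:

```python
import numpy as np

def minmax_per_sample_per_var(X, low=0.0, high=1.0):
    """Illustrative sketch: min-max scale each (sample, variable) series
    independently along the time axis, mapping it into [low, high].
    X has shape (n_samples, n_vars, seq_len)."""
    mn = X.min(axis=-1, keepdims=True)
    mx = X.max(axis=-1, keepdims=True)
    return (X - mn) / (mx - mn) * (high - low) + low

X = np.array([[[0.0, 5.0, 10.0]],
              [[2.0, 4.0, 6.0]]])  # 2 samples, 1 variable, 3 steps
Xn = minmax_per_sample_per_var(X)
print(Xn.min(), Xn.max())  # 0.0 1.0
```

Note that with `by_sample=True` the statistics come from each sample itself, so every batch lands exactly in the requested range regardless of the global data scale.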
-
Thank you for your answer. In my opinion, normalizing y to range (0, 1) helps when calculating the gradients and running backpropagation, because X is normalized to range (0, 1). If X is in range (0, 1) and y is in range (-100, 10000), I think the mismatch in magnitude between X and y will cause a larger error.
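If you do decide to normalize y manually, the standard min-max approach is easy to invert for real-world predictions. This is a generic sketch (plain numpy, not a tsai API), where the helper names are hypothetical:

```python
import numpy as np

def fit_minmax(y):
    """Compute the (min, max) statistics needed to scale y to [0, 1].
    These must come from the training split only and be kept for later."""
    return float(np.min(y)), float(np.max(y))

def normalize_y(y, y_min, y_max):
    """Min-max scale y into [0, 1]."""
    return (y - y_min) / (y_max - y_min)

def denormalize_y(y_scaled, y_min, y_max):
    """Invert the scaling to recover real-world values."""
    return y_scaled * (y_max - y_min) + y_min

y = np.array([-100.0, 450.0, 10000.0])
y_min, y_max = fit_minmax(y)
y_scaled = normalize_y(y, y_min, y_max)
y_back = denormalize_y(y_scaled, y_min, y_max)
```

Applying `denormalize_y` to the model's predictions answers the earlier question about mapping predicted labels back to real-world values.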
-
I still think you should try it. With `y_range`, the raw output is rescaled as `output = output * (y_range.max() - y_range.min()) + y_range.min()`. In this way, the network only needs to predict values between 0 and 1. I'd recommend you train the model on the original y both with and without `y_range` and compare the results. If they are not good, you may want to implement manual preprocessing and postprocessing.
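The rescaling above can be sketched in plain Python. Note one assumption here: a sigmoid is applied to the raw output first so that it actually lies in (0, 1) before rescaling, which is how fastai's `SigmoidRange` behaves; the formula in the comment shows only the rescaling step:

```python
import math

def sigmoid_range(x, low, high):
    """Map a raw network output x to the interval (low, high):
    sigmoid squashes x into (0, 1), then we rescale with
    output * (high - low) + low, as in the formula above."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low

# A raw output of 0.0 lands exactly in the middle of the range.
print(sigmoid_range(0.0, -100.0, 10000.0))  # 4950.0
```

Because the sigmoid is bounded, every prediction is guaranteed to fall inside the chosen range, which is the advantage over predicting unbounded values directly.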
-
Thank you very much. I will try it and compare the results between the normalized y and the true y in the real world. I will post the results if they differ significantly.
-
In addition, is tsai compatible with hyperparameter tuning tools such as Ray Tune (https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html)?
-
I haven't used Ray Tune. |
-
I have read these two articles, which were helpful for me. It suddenly occurred to me that X in my dataset consists of variable sequence lengths, such as:
-
I've modified a function that was already available to make it fit a wider need. It's called pad_sequences. You can read the documentation here. I'll move this issue to Discussions as no changes to tsai are required.
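For intuition, the padding idea can be sketched generically in numpy. This is an illustrative sketch only (right-padding with a fill value), not tsai's `pad_sequences` API; check its documentation for the real signature and options:

```python
import numpy as np

def pad_to_same_length(seqs, pad_value=0.0):
    """Right-pad a list of 1-D sequences with pad_value so they all
    share the length of the longest one, then stack them into a
    single 2-D array of shape (n_sequences, max_len)."""
    max_len = max(len(s) for s in seqs)
    out = np.full((len(seqs), max_len), pad_value, dtype=float)
    for i, s in enumerate(seqs):
        out[i, :len(s)] = s
    return out

seqs = [[1, 2, 3], [4, 5], [6]]
X = pad_to_same_length(seqs)
print(X.shape)  # (3, 3)
```

Once padded, the sequences form a regular array that can be fed to the dataloaders like any fixed-length dataset.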
-
tsai is a very good project which can save a lot of time. I want to use the project to solve a regression problem.
I have a dataset (both X and y), and I want to use X (a time series sequence) to predict y (a floating-point value). X and y are not normalized, so I want to normalize the dataset (both X and y) before training the model and then un-normalize the predicted values back to real-world values. When I used the "TSNormalize" function to preprocess my dataset, it did not work as I expected. My code is as follows:
The code in the notebook cell prints the normalized data, but the normalized values are still greater than 1 (e.g., tensor([2.7380, 2.8582, 2.6833, 2.7741], device='cuda:0') or tensor([2.6764, 2.9775, 3.0802, 2.6598], device='cuda:0')).
I want to know how I should use the TSNormalize function.