Is data distribution consistent between training and test sets? #17

Open
fruboes opened this issue May 14, 2020 · 2 comments

fruboes commented May 14, 2020

I wanted to see how classifier accuracy for Kuzushiji-MNIST behaves with respect to varying training set size. My procedure was the following (a simplified sketch follows the list):

  1. Sample a small-sized dataset from the full training dataset, making sure each class has the same number of images
  2. Estimate the classifier (KNN) accuracy with cross-validation (applied to the dataset created in 1.)
  3. Estimate the classifier (again KNN, with same parameters) accuracy using the test set (i.e. train the KNN classifier on the full dataset obtained in 1., estimate accuracy on last 10000 images from KMNIST)
  4. Compare values obtained in 2. and 3.
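
Roughly, the comparison looks like the sketch below; this is only a simplified illustration, not the attached code, and it assumes scikit-learn's KNeighborsClassifier with a placeholder k=5 and the official kmnist-*.npz files:

# Simplified sketch only -- the attached compare_xval_and_test.zip is the actual
# code.  Assumes scikit-learn and the official kmnist-*.npz files; k=5 is a
# placeholder for the real KNN parameters.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def load(f):
    return np.load(f)["arr_0"]


x_train = load("kmnist-train-imgs.npz").reshape(-1, 28 * 28)
y_train = load("kmnist-train-labels.npz")
x_test = load("kmnist-test-imgs.npz").reshape(-1, 28 * 28)
y_test = load("kmnist-test-labels.npz")

rng = np.random.default_rng(0)
train_size = 1000  # 200 / 500 / 1000 in the experiments below

# Step 1: balanced subset of the training data (equal number of images per class)
idx = np.concatenate(
    [rng.choice(np.where(y_train == c)[0], train_size // 10, replace=False)
     for c in range(10)]
)
rng.shuffle(idx)
x_sub, y_sub = x_train[idx], y_train[idx]

knn = KNeighborsClassifier(n_neighbors=5)

# Step 2: cross-validated accuracy on the subset
xval_acc = cross_val_score(knn, x_sub, y_sub, cv=5).mean()

# Step 3: train on the whole subset, evaluate on the official 10,000-image test set
test_acc = knn.fit(x_sub, y_sub).score(x_test, y_test)

print(f"xval accuracy: {xval_acc:.2f}, test-set accuracy: {test_acc:.2f}")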

To my surprise, the accuracy obtained with cross-validation was always significantly higher than the one obtained on the test set:

  • train_size: 200
    • xval accuracy: 0.64 ± 0.03
    • test-set accuracy: 0.52 ± 0.02
  • train_size: 500
    • xval accuracy: 0.75 ± 0.02
    • test-set accuracy: 0.61 ± 0.02
  • train_size: 1000
    • xval accuracy: 0.81 ± 0.01
    • test-set accuracy: 0.68 ± 0.01

After a fair amount of debugging I found that the effect completely disappears if I merge the train and test parts, shuffle the result, and then define new train and test datasets (60,000 and 10,000 images respectively, with class balance ensured). For such datasets the results are consistent between the two methods:

  • train_size: 200
    • xval accuracy: 0.63 ± 0.03
    • test-set accuracy: 0.67 ± 0.01
  • train_size: 500
    • xval accuracy: 0.74 ± 0.01
    • test-set accuracy: 0.76 ± 0.01
  • train_size: 1000
    • xval accuracy: 0.80 ± 0.01
    • test-set accuracy: 0.82 ± 0.01

The above may suggest that the original training and test parts of Kuzushiji-MNIST are somehow different. Could you have a look at this? Please find the test code producing the above results attached (compare_xval_and_test.zip); a simplified sketch of the merge-and-resplit step follows.
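
For reference, a simplified sketch of the merge-and-resplit step (again, the attached archive is the actual code; the new split keeps 6,000/1,000 images per class for train/test, which relies on KMNIST having 7,000 images per class):

# Simplified sketch of the merge-and-resplit step (the attached archive is the
# actual code).  Merge the official train and test parts, shuffle, and draw a
# new balanced 60,000/10,000 split (6,000/1,000 images per class).
import numpy as np


def load(f):
    return np.load(f)["arr_0"]


x_all = np.concatenate([load("kmnist-train-imgs.npz"), load("kmnist-test-imgs.npz")])
y_all = np.concatenate([load("kmnist-train-labels.npz"), load("kmnist-test-labels.npz")])

rng = np.random.default_rng(0)
train_idx, test_idx = [], []
for c in range(10):
    idx = rng.permutation(np.where(y_all == c)[0])  # all 7,000 images of class c
    train_idx.append(idx[:6000])
    test_idx.append(idx[6000:7000])

train_idx = rng.permutation(np.concatenate(train_idx))
test_idx = rng.permutation(np.concatenate(test_idx))

x_train_new, y_train_new = x_all[train_idx], y_all[train_idx]
x_test_new, y_test_new = x_all[test_idx], y_all[test_idx]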

fruboes (author) commented May 17, 2020

One important detail: in both scenarios (i.e. the original train/test split and the second one with merging), the training set used is shuffled prior to classifier training. This is to make sure the effect is not due to the ordering of the original training set.

joksas commented Aug 6, 2022

Thank you for producing this!

I noticed the same thing as @fruboes with KMNIST -- test accuracy is significantly lower than I would expect. However, this GitHub issue seems to be the only mention of this potential problem, so maybe I'm overlooking something.

Anyway, even using one of the benchmark files in this repository -- kuzushiji_mnist_cnn.py -- I can produce a test accuracy that is unusually low. To demonstrate this, I shuffled the training set and then reserved 50,000 images for training and 10,000 images purely for evaluation after training; the latter are used for comparison with test-set performance.

Here is the modified code:

# Based on MNIST CNN from Keras' examples: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py (MIT License)

from __future__ import print_function

import keras
import numpy as np
from keras import backend as K
from keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from keras.models import Sequential

batch_size = 128
num_classes = 10
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28


def load(f):
    return np.load(f)["arr_0"]


# Load the data
x_train = load("kmnist-train-imgs.npz")
x_test = load("kmnist-test-imgs.npz")
y_train = load("kmnist-train-labels.npz")
y_test = load("kmnist-test-labels.npz")

if K.image_data_format() == "channels_first":
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_train /= 255
x_test /= 255
print("{} train samples, {} test samples".format(len(x_train), len(x_test)))

# Convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# Shuffle training data
np.random.seed(0)
perm = np.random.permutation(len(x_train))
x_train = x_train[perm]
y_train = y_train[perm]

# Split training data
num_unused_train_images = len(x_test)
my_x_train = x_train[:-num_unused_train_images]
my_y_train = y_train[:-num_unused_train_images]
my_x_train_unused = x_train[-num_unused_train_images:]
my_y_train_unused = y_train[-num_unused_train_images:]

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation="softmax"))

model.compile(
    loss=keras.losses.categorical_crossentropy,
    optimizer=keras.optimizers.Adadelta(),
    metrics=["accuracy"],
)

model.fit(
    my_x_train,
    my_y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=2,
)

train_score = model.evaluate(my_x_train, my_y_train, verbose=0)
unused_train_score = model.evaluate(my_x_train_unused, my_y_train_unused, verbose=0)
test_score = model.evaluate(x_test, y_test, verbose=0)
print("Train loss:", train_score[0])
print("Train accuracy:", train_score[1])
print("Unused train loss:", unused_train_score[0])
print("Unused train accuracy:", unused_train_score[1])
print("Test loss:", test_score[0])
print("Test accuracy:", test_score[1])

and the results:

Train loss: 1.1492388248443604
Train accuracy: 0.7178800106048584
Unused train loss: 1.1511162519454956
Unused train accuracy: 0.7146000266075134
Test loss: 1.545776128768921
Test accuracy: 0.5529000163078308

Test accuracy is 55%, compared to 71% on the unused partition of the training set. The arXiv paper says that "data distributions of each class are consistent between the two sets", but, given the results, I find this statistically implausible (see the rough check below). Or I have made an error, in which case I would be grateful if you could point it out.
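
As a rough back-of-envelope check, treating the two accuracies as independent proportions measured on 10,000 images each, a two-proportion z-test puts the gap at roughly 24 standard errors:

# Rough two-proportion z-test on the accuracies above (each evaluated on
# 10,000 images), to quantify how unlikely the gap is under a common distribution.
from math import sqrt

p1, p2, n = 0.7146, 0.5529, 10_000        # unused-train acc., test acc., set size
p_pool = (p1 + p2) / 2                    # pooled accuracy (equal set sizes)
se = sqrt(p_pool * (1 - p_pool) * 2 / n)  # standard error of the difference
z = (p1 - p2) / se
print(f"z = {z:.1f}")                     # ~23.7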
