- An item to which a scoring function can never assign the largest score, regardless of the input.
This repository contains algorithms for detecting unargmaxable classes in low-rank softmax layers. A softmax layer is by construction low-rank if we have C > d + 1, where C is the number of classes and d is the dimensionality of the input feature vector.
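As a minimal illustration (a sketch of ours, not code from this repository), consider C = 3 classes and d = 1: the class whose weight lies between the other two can never obtain the strictly largest logit, whatever the input.

```python
import numpy as np

# Three classes, one input dimension: C > d + 1, so the layer is low-rank.
W = np.array([[-1.0], [0.0], [1.0]])  # one weight vector per class

# Class 1 lies inside the convex hull of classes 0 and 2, so it can never
# achieve the strictly largest logit: it is unargmaxable.
for x in np.linspace(-5.0, 5.0, 101):
    logits = (W @ np.array([x])).ravel()
    assert not (logits[1] > logits[0] and logits[1] > logits[2])
```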
The repository also contains code to reproduce our results, tables and figures from our paper that was accepted to ACL 2022.
python3.7 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
pip install -e .
export OMP_NUM_THREADS=1
export STOLLEN_NUM_PROCESSES=4
# Adapt below as needed
export FLASK_APP="$PWD/stollen/server"
# Adapt below if you would rather install models elsewhere
mkdir models
export TRANSFORMERS_CACHE="$PWD/models"
- `export OMP_NUM_THREADS=1` is needed as otherwise we don't benefit from multithreading (numpy hogs all threads).
- You can set `STOLLEN_NUM_PROCESSES` if you want to run the search on multiple CPUs/threads. Each thread processes a single vocabulary item in parallel. We used `export STOLLEN_NUM_PROCESSES=10` on an AMD 3900X CPU with 64 GB of RAM.
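The pattern this enables looks roughly like the following (a hedged sketch, not the package's actual code; `check_class` is a hypothetical stand-in for the per-class search):

```python
import os

# Must be set before numpy is imported by the workers.
os.environ.setdefault("OMP_NUM_THREADS", "1")

from multiprocessing import Pool

def check_class(class_idx):
    # Hypothetical stand-in: run the (un)argmaxability search for one class.
    return class_idx, True

if __name__ == "__main__":
    num_procs = int(os.environ.get("STOLLEN_NUM_PROCESSES", "4"))
    with Pool(num_procs) as pool:
        results = pool.map(check_class, range(100))  # one vocabulary item per task
```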
Install Gurobi
The linear programming algorithm depends on Gurobi, which requires a license; see the link above.
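As a rough sketch of what such an LP looks like (our own formulation for illustration, not the repository's lp_chebyshev implementation): class i is argmaxable iff some input x makes its logit strictly largest, which we can test by maximising a Chebyshev-style margin r.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def is_argmaxable(W, i, bound=100.0):
    """Class i is argmaxable iff some x gives (W[i] - W[j]) @ x > 0
    for every j != i; we maximise a shared margin r over a bounded box."""
    C, d = W.shape
    model = gp.Model()
    model.Params.OutputFlag = 0  # silence the solver log
    x = model.addVars(d, lb=-bound, ub=bound)
    r = model.addVar(lb=0.0, ub=1.0)  # Chebyshev-style margin
    for j in range(C):
        if j == i:
            continue
        a = W[i] - W[j]
        norm = float(np.linalg.norm(a))
        model.addConstr(gp.quicksum(a[k] * x[k] for k in range(d)) >= norm * r)
    model.setObjective(r, GRB.MAXIMIZE)
    model.optimize()
    return model.objVal > 1e-9  # positive margin => class i is argmaxable
```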
This script exists as a sanity check for our algorithms. We assert that we can detect which points are internal to the convex hull. To make this assertion, we compare our results to Qhull.
Are any of 20 randomly initialised class weight vectors in 2 and 3 dimensions unargmaxable?
stollen_random --num-classes 20 --dim 2
stollen_random --num-classes 20 --dim 3
stollen_random --help # For more details / options
If the dimension is 2 or 3, we also plot the resulting convex hull for visualisation purposes. The result of the algorithm is also compared to the exact Qhull result if `dim < 10`.
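Our sketch of what the Qhull check amounts to (assuming no bias term; not necessarily the repository's exact code): a class is argmaxable precisely when its weight vector is a vertex of the convex hull of all weight vectors.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(3)
W = rng.normal(size=(300, 8))  # one weight vector per class

# Classes whose weight vectors are NOT hull vertices are unargmaxable.
hull = ConvexHull(W)
unargmaxable = sorted(set(range(len(W))) - set(hull.vertices))
print(f"{len(unargmaxable)} unargmaxable classes:", unargmaxable)
```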
The approximate algorithm will have 100% recall but may have lower precision.
stollen_random --num-classes 300 --dim 8 --seed 3 --patience 50
Below we run the exact algorithm; this should always return 100% for both precision and recall unless the input range is too large.
stollen_random --num-classes 300 --dim 8 --seed 3 --patience 50 --exact-algorithm lp_chebyshev
As a sanity check, we verify that all classes are argmaxable when we normalise the weights or set the bias term as described in Appendix D of the paper.
stollen_prevention --num-classes 500 --dim 10
stollen_prevention --num-classes 500 --dim 10 --use-bias
We can also see that the script would raise an assertion error if we did not follow the normalisation step.
stollen_prevention --num-classes 500 --dim 10 --do-not-prevent
stollen_prevention --num-classes 500 --dim 10 --use-bias --do-not-prevent
Note that in high dimensions unargmaxable tokens are not expected to exist if we randomly initialise the weight vectors.
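For intuition on why the normalisation prevention works, here is a sketch of ours (the Cauchy-Schwarz argument, not the repository's code): if all weight vectors share the same norm, feeding class i its own weight vector as input makes class i win the argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(500, 10))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm rows

# With input x = W[i] we get logit_i = 1, and by Cauchy-Schwarz every other
# logit_j = W[j] @ W[i] < 1 (unless W[j] == W[i]), so class i wins the argmax.
for i in range(len(W)):
    assert np.argmax(W @ W[i]) == i
```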
The script expects the weight matrix to be in the `decoder_Wemb` attribute. It takes the transpose, since it expects the matrix in `[dim, num_classes]` format.
stollen_numpy --numpy-file path-to-numpy-model.npz
stollen_hugging --url https://huggingface.co/bert-base-cased --patience 2500 --exact-algorithm lp_chebyshev
NB: The script does not work with arbitrary models: it needs to be adapted if the softmax weights and bias are stored in an unforeseen variable.
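For example, extracting the softmax parameters from a Hugging Face checkpoint looks roughly like this (our sketch; the attribute paths differ across architectures, which is exactly why adaptation can be needed):

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
decoder = model.get_output_embeddings()  # the softmax projection layer
W = decoder.weight.detach().numpy()      # [num_classes, dim]
b = decoder.bias.detach().numpy() if decoder.bias is not None else None
print(W.shape, None if b is None else b.shape)
```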
Scripts to reproduce our experiments can be found here; see the README.md file for details.
The scripts generally write to a postgres database, but the `save-db` parameter can be toggled within the script to change that.
- wget
- gunzip
- psql
export FLASK_APP="$PWD/stollen/server"
cd db
export DB_FOLDER="$PWD/stollen_data"
export DB_PORT=5436
export DB_USER=`whoami`
export DB_NAME=stollenprob
# fun times
export DB_PASSWD="cov1d"
export PGPASSWORD=$DB_PASSWD
export DB_HOST="localhost"
export DB_SSL="prefer"
# Creates the database, tables etc.
./install.sh
# Will download the tables in CSV format from aws s3
# and populate the psql database
# (the csv files are saved in the data folder - e.g. if you want to use pandas)
./download_and_populate_db.sh
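Once populated, the database can also be queried directly using the variables above (a sketch; we assume psycopg2 here, but any Postgres client works):

```python
import os
import psycopg2  # assumed client library, not pinned by this repository

conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    port=os.environ["DB_PORT"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWD"],
    dbname=os.environ["DB_NAME"],
    sslmode=os.environ.get("DB_SSL", "prefer"),
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT 1")  # sanity check that the connection works
    print(cur.fetchone())
```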
From the `db` folder, run:
# IMPORTANT: Run stop before deleting any files
./stop.sh
rm -r stollen_data
rm -r migrations
The following scripts generally accept a file with experiment ids to plot/aggregate. For example:
cd ../paper/plots
python plot_bounded.py --ids-file datafiles/bounded.txt --title "bounded models"
paper/
├── appendix
│ ├── braid-slice-regions
│ └── check_quantiles
├── plots
│ ├── plot_bounded.py
│ ├── plot_random_iterations.py
│ ├── plot_row_iterations.py
│ ├── plot.sh
│ ├── stolen_probability.py
│ └── stolen_probability_with_convex.py
└── tables
├── plot_iterations.py
└── print_bounded_table.py
You can use the above with the experiment ids generated from your own experiments, assuming you save them to the database.
From the `paper/plots` folder, run:
./plot.sh
This assumes you have installed and populated the database mentioned above.
- Demeter et al. (2020) identified that unargmaxable classes can arise in classification layers and coined the term Stolen Probability for this more general phenomenon.
- Warren D. Smith comprehensively summarises the history of the problem.
Please cite our work as:
@inproceedings{grivas-etal-2022-low,
title = "Low-Rank Softmax Can Have Unargmaxable Classes in Theory but Rarely in Practice",
author = "Grivas, Andreas and
Bogoychev, Nikolay and
Lopez, Adam",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.465",
doi = "10.18653/v1/2022.acl-long.465",
pages = "6738--6758",
abstract = "Classifiers in natural language processing (NLP) often have a large number of output classes. For example, neural language models (LMs) and machine translation (MT) models both predict tokens from a vocabulary of thousands. The Softmax output layer of these models typically receives as input a dense feature representation, which has much lower dimensionality than the output. In theory, the result is some words may be impossible to be predicted via argmax, irrespective of input features, and empirically, there is evidence this happens in small language models (Demeter et al., 2020). In this paper we ask whether it can happen in practical large language models and translation models. To do so, we develop algorithms to detect such unargmaxable tokens in public models. We find that 13 out of 150 models do indeed have such tokens; however, they are very infrequent and unlikely to impact model quality. We release our algorithms and code to the public.",
}
As we get closer to Christmas, stollen probability increases.