uhh-pd-ml/LHCO_EPiC-FM

LHCO EPiC Flow Matching


Description

Note

A library that includes these models as well as additional loss functions, architectures and (particle physics) datasets can be found here.

This repository contains the code for the Flow Matching generative models from the paper 'Full Phase Space Resonant Anomaly Detection'. For the preparation of the data, see this repository; the second generative model and the classifier can be found here. We used the LHC Olympics R&D anomaly detection dataset.

Physics beyond the Standard Model that is resonant in one or more dimensions has been a longstanding focus of countless searches at colliders and beyond. Recently, many new strategies for resonant anomaly detection have been developed, where sideband information can be used in conjunction with modern machine learning, in order to generate synthetic datasets representing the Standard Model background. Until now, this approach was only able to accommodate a relatively small number of dimensions, limiting the breadth of the search sensitivity. Using recent innovations in point cloud generative models, we show that this strategy can also be applied to the full phase space, using all relevant particles for the anomaly detection. As a proof of principle, we show that the signal from the R&D dataset from the LHC Olympics is findable with this method, opening up the door to future studies that explore the interplay between depth and breadth in the representation of the data for anomaly detection.

For the generation of dijet events, we use a chain of generative models. A particle feature model generates the jet constituents, conditioned on jet features that are themselves generated by a jet feature model. The jet feature model generates the jet features for both jets in the event and is conditioned on the dijet mass of the jet pair. This conditioning allows us to train on the sideband regions in dijet mass and sample in the signal region, where a signal is expected.
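The chain described above can be sketched in a few lines. This is only an illustration: the signal-region window, the feature layout, and the two toy "models" are hypothetical stand-ins, not values or code from this repository.

```python
# Hypothetical signal-region window in dijet mass (GeV); not taken from this README.
SR_LOW, SR_HIGH = 3300.0, 3700.0


def in_signal_region(mjj, low=SR_LOW, high=SR_HIGH):
    """Return True if the dijet mass lies inside the signal-region window."""
    return low <= mjj <= high


def split_by_region(mjj_values):
    """Split dijet masses into sideband (used for training) and signal region (used for sampling)."""
    sideband = [m for m in mjj_values if not in_signal_region(m)]
    signal_region = [m for m in mjj_values if in_signal_region(m)]
    return sideband, signal_region


def generate_event(mjj, jet_feature_model, particle_feature_model):
    """Chain: mjj -> jet features for both jets -> constituents of each jet."""
    jet_features = jet_feature_model(mjj)  # one feature vector per jet
    constituents = [particle_feature_model(jf) for jf in jet_features]
    return {"mjj": mjj, "jets": jet_features, "constituents": constituents}


# Toy stand-ins so the sketch runs; the real models are trained networks.
def toy_jet_feature_model(mjj):
    return [[mjj / 2, 0.0], [mjj / 2, 0.0]]  # two jets, two made-up features each


def toy_particle_feature_model(jet_features, n_particles=3):
    pt_like = jet_features[0]
    return [[pt_like / n_particles, 0.0, 0.0] for _ in range(n_particles)]


mjj_values = [2500.0, 3400.0, 3650.0, 4200.0]
sideband, signal_region = split_by_region(mjj_values)

# Training would use the sideband events; generation is conditioned on signal-region masses.
events = [
    generate_event(m, toy_jet_feature_model, toy_particle_feature_model)
    for m in signal_region
]
```

Because the jet feature model only sees the dijet mass as a condition, it can be evaluated at mass values it was never trained on, which is what makes sampling in the signal region possible.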

The particle feature model is the EPiC Flow Matching model introduced here. EPiC Flow Matching is a Continuous Normalising Flow that is trained with a simulation-free approach called Flow Matching. The model uses DeepSet-based EPiC layers for the architecture, which allow for good scalability to large set sizes.
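To illustrate what "simulation-free" means, here is a minimal NumPy sketch of a Flow Matching training objective. It assumes the simplest linear probability path between noise and data; the exact path, noise schedule, and network used in this repository may differ, and the tensor sizes and the zero-output model are purely hypothetical.

```python
import numpy as np


def flow_matching_pair(x0, x1, t):
    """Linear path x_t = (1 - t) * x0 + t * x1 with target velocity x1 - x0."""
    # Reshape t so it broadcasts over all non-batch axes.
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity


def flow_matching_loss(velocity_model, x1, rng):
    """Regress the model output onto the target velocity at a random time t."""
    x0 = rng.standard_normal(x1.shape)            # noise sample
    t = rng.uniform(0.0, 1.0, size=x1.shape[0])   # one time per batch element
    x_t, target = flow_matching_pair(x0, x1, t)
    prediction = velocity_model(x_t, t)
    return float(np.mean((prediction - target) ** 2))


rng = np.random.default_rng(0)
# Hypothetical batch: 4 jets, 5 particles each, 3 features per particle.
x1 = rng.standard_normal((4, 5, 3))


def zero_model(x_t, t):
    """Placeholder velocity model that always predicts zero."""
    return np.zeros_like(x_t)


loss = flow_matching_loss(zero_model, x1, rng)
```

No ODE has to be solved during training: each step only needs a noise sample, a data sample, and a random time. The ODE is integrated only when sampling.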

The jet feature model is also trained with Flow Matching, but since it doesn't model point clouds, a simple fully connected architecture is used.
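As a rough sketch of such a model, the following NumPy code defines a small fully connected velocity network that takes the current state, the time, and the dijet mass as inputs, and samples from it with a fixed-step Euler integration. The layer sizes, the number of jet features, the Euler solver, and the mass values are all assumptions for illustration; the weights are random, so the output is not physically meaningful.

```python
import numpy as np


def init_mlp(rng, sizes):
    """Create random (weight, bias) pairs for a fully connected network."""
    return [
        (rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def mlp_velocity(params, x, t, mjj):
    """Velocity v(x, t | mjj): concatenate the inputs and apply dense layers."""
    h = np.concatenate([x, t[:, None], mjj[:, None]], axis=1)
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:  # no activation on the output layer
            h = np.tanh(h)
    return h


def sample(params, mjj, n_features, rng, n_steps=20):
    """Integrate dx/dt = v(x, t | mjj) from t=0 (noise) to t=1 with Euler steps."""
    x = rng.standard_normal((mjj.shape[0], n_features))
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = np.full(mjj.shape[0], step * dt)
        x = x + dt * mlp_velocity(params, x, t, mjj)
    return x


rng = np.random.default_rng(0)
n_features = 8  # hypothetical: a few features for each of the two jets
params = init_mlp(rng, [n_features + 2, 32, 32, n_features])
mjj = np.array([3.4, 3.5, 3.6])  # hypothetical dijet masses in TeV
jet_features = sample(params, mjj, n_features, rng)
```

Each row of `jet_features` corresponds to one conditioning mass, so sampling in the signal region amounts to passing signal-region mass values as `mjj`.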

This repository uses PyTorch Lightning and Hydra for model configurations, and supports logging with Comet and wandb. For a deeper explanation of how to use this repository, please have a look at the template directly.

How to run

Install dependencies

# clone project
git clone https://github.com/uhh-pd-ml/LHCO_EPiC-FM
cd LHCO_EPiC-FM

# [OPTIONAL] create conda environment
conda create -n myenv python=3.10
conda activate myenv

# install pytorch according to instructions
# https://pytorch.org/get-started/

# install requirements
pip install -r requirements.txt

Create .env file to set paths and API keys

PROJECT_ROOT="/folder/folder/"
DATA_DIR="/folder/folder/"
LOG_DIR="/folder/folder/"
COMET_API_TOKEN="XXXXXXXXXX"
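For reference, a file in this format is just a list of `KEY="value"` pairs. The sketch below shows one way such a file could be parsed and exported to the environment using only the standard library. It is an illustration of the file format, not the loading mechanism this repository actually uses, and `load_env_file` is a hypothetical helper.

```python
import os


def load_env_file(text):
    """Parse KEY="value" lines into a dict, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


example = """
# paths and API keys (placeholder values)
DATA_DIR="/folder/folder/"
LOG_DIR="/folder/folder/"
COMET_API_TOKEN="XXXXXXXXXX"
"""

env = load_env_file(example)

# Export without overwriting variables that are already set in the shell.
for key, value in env.items():
    os.environ.setdefault(key, value)
```

Keeping paths and API keys in `.env` rather than in the configs means the same experiment configuration can be reused on different machines.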

Before training, the data needs to be downloaded. Because the data simply consists of all jet constituents of each dijet event, it has to be clustered and prepared first. This can be done with this code.

Then the jet feature model can be trained with

python src/train.py experiment=lhco/lhco_jet_features

and the particle feature model with

python src/train.py experiment=lhco/lhco_both_jets

After training both models, events can be generated with the lhco_full_eval notebook. The classifier training and the evaluation plots were done with this code. That code also contains an EPiC classifier that can be used to quickly check whether the generated samples can fool a classifier, but it is not used in the paper.