Frechet Distance Evaluation of Generative Models for Calorimeter Shower Simulation
For collider experiments, particle interactions in calorimeters are simulated with tools such as the Monte-Carlo-based Geant4. While essential for scientific progress, producing these simulations is becoming increasingly computationally costly. A promising way to accelerate the simulation of particle interactions by several orders of magnitude is to generate the data with generative machine learning models, such as the Bounded Information Bottleneck Autoencoder (BIB-AE). So far, such models have often been evaluated with qualitative methods, i.e. by comparing relevant high-level physical observables either by eye or by calculating a similarity metric between histograms. In the computer vision literature, several quantitative methods have been proposed that summarize image quality in a single numerical score, e.g. the Inception Score and the Frechet Inception Distance (FID). In this thesis, the FID is adapted to the photon shower data set generated by the BIB-AE network, using the original Inception V3 network as well as several different regression networks. The resulting approaches to the Frechet Regression Distance (FRD) are compared and evaluated for their viability in a physical context. It is shown that high regression performance does not necessarily correlate with a low FRD score. Additionally, a low FRD score does not necessarily correlate with high generation fidelity when judged by a small set of high-level observables. Although, with a more basic regression network, the FRD score correlates reasonably well with the previously used histogram-based Fidelity score, the more accurate regression networks seem to place greater emphasis on details such as the energy deposited by the BIB-AE network in the rear layers and outer corners of the calorimeter.
This indicates that, while the BIB-AE may allow important features such as the hit energy to be mapped correctly, it also affects attributes that are not relevant for high-level calorimeter observables. The regression network nevertheless assigns importance to these attributes, which results in a high FRD score for generated showers. Since it is unknown exactly which features the regression network considers important, such a score does not currently seem to be a reliable option for evaluating the BIB-AE model.
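
For orientation, the Frechet distance underlying both FID and FRD compares two Gaussians fitted to network features: d^2 = |mu_1 - mu_2|^2 + Tr(Sigma_1 + Sigma_2 - 2 (Sigma_1 Sigma_2)^(1/2)). The following is a minimal sketch of that formula only, not the implementation used in the thesis. The function name and the array arguments are hypothetical; the feature arrays are assumed to be activations taken from Inception V3 or from a regression network, with one row per shower.

```python
import numpy as np


def frechet_distance(feats_ref, feats_gen):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Both inputs are assumed to have shape (n_samples, n_features).
    """
    mu_ref = feats_ref.mean(axis=0)
    mu_gen = feats_gen.mean(axis=0)
    # Sample covariance of the features (columns are variables).
    cov_ref = np.atleast_2d(np.cov(feats_ref, rowvar=False))
    cov_gen = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((cov_ref cov_gen)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_ref @ cov_gen, which are real and non-negative for
    # positive semi-definite inputs. Small negative or imaginary parts can
    # appear numerically, so they are discarded here.
    eigvals = np.linalg.eigvals(cov_ref @ cov_gen)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_ref - mu_gen) ** 2)
    cov_term = np.trace(cov_ref) + np.trace(cov_gen) - 2.0 * trace_sqrt
    return float(mean_term + cov_term)


# Illustrative usage with synthetic features (not real shower data).
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))
shifted = reference + np.array([1.0, 0.0, 0.0, 0.0])
score_same = frechet_distance(reference, reference)
score_shifted = frechet_distance(reference, shifted)
```

In this toy case the two feature sets share the same covariance, so the score reduces to the squared shift of the means. With real features, any attribute the network encodes contributes to the score, whether or not it matters for high-level observables.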