
Automated selection of models via a best_chi2_worse_phi2 algorithm #1962

Open
Cmurilochem opened this issue Feb 29, 2024 · 5 comments · May be fixed by #1976
Labels: escience, question (Further information is requested)

@Cmurilochem (Collaborator)

As a continuation of #1943, I managed to automate the selection of the best models via @juanrojochacon's hyperopt algorithm, in which the $1/\varphi^{2}$ data are used to decide on the best $\chi^{2}$ hyperpoint. Here I am just referring to it as the best_chi2_worst_phi2 algorithm.

To this end, I made a post-fit script which is based primarily on the validphys vp_hyperoptplot.py module, written in such a way as to make our later implementation easier. Just in case, I attach it here: analysis_hyperopt.zip.

The core of the idea is presented in the code snippet below:

import numpy as np

args = {
    "loss_target": "best_chi2_worst_phi2",  # select Juan & Roy's algorithm
    "max_phi2_points": 10,                  # select the n lowest values of 1/phi2
    "threshold": 3.0,                       # trials with losses above this value are discarded
}

# `dataframe` (the hyperopt trials) and `best_idx` (index of the lowest chi2 loss)
# are built earlier in the post-fit script
if args["loss_target"] == "best_chi2_worst_phi2":
    minimum = dataframe.loss[best_idx]
    std = np.std(dataframe.loss)
    lim_max = minimum + std
    # select rows with chi2 losses between the best point and lim_max
    selected_chi2 = dataframe[(dataframe.loss >= minimum) & (dataframe.loss <= lim_max)]
    # among the selected points, take the n lowest values of 1/phi2
    selected_phi2 = selected_chi2.loss_reciprocal_phi2.nsmallest(args["max_phi2_points"])
    # find the location of these points in the dataframe
    indices = dataframe[dataframe["loss_reciprocal_phi2"].isin(selected_phi2)].index
    best_trial = dataframe.loc[indices]

Here, I define an interval between the $\chi^{2}$ minimum and the minimum plus one standard deviation (std) of the loss, over which I then monitor the corresponding $1/\varphi^{2}$ values. Within this interval, I take the n lowest $1/\varphi^{2}$ hyperpoints and save the selected models into best_trial. In the attached zip file I take as an example the runs I discussed on Monday using 10 replicas (because they give me many more points to test the algorithm). The final plot is shown below:
[Figure: best_chi2_worst_phi2 plot]

The yellow region marks the interval between the $\chi^{2}$ minimum (grey circle) and the minimum plus one standard deviation (std) of the loss data. I also asked the script to return the 10 models within this region that show the lowest $1/\varphi^{2}$ values (cyan circles).
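For reference, here is a minimal matplotlib sketch of how a plot like this could be reproduced from the same DataFrame; it assumes the variables from the snippet above (dataframe, best_idx, minimum, lim_max, best_trial) and the columns loss and loss_reciprocal_phi2, while the styling is purely illustrative.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# all surviving trials: chi2 loss vs 1/phi2
ax.scatter(dataframe.loss, dataframe.loss_reciprocal_phi2, color="tab:blue", label="trials")

# yellow band: losses between the chi2 minimum and the minimum + 1 std
ax.axvspan(minimum, lim_max, color="gold", alpha=0.3, label=r"$[\chi^2_{min}, \chi^2_{min} + \sigma]$")

# grey circle: best chi2 point; cyan circles: selected models with the lowest 1/phi2
ax.scatter(dataframe.loss[best_idx], dataframe.loss_reciprocal_phi2[best_idx], color="grey", s=80, label=r"best $\chi^2$")
ax.scatter(best_trial.loss, best_trial.loss_reciprocal_phi2, color="cyan", edgecolor="black", label="selected models")

ax.set_xlabel(r"$\chi^2$ loss")
ax.set_ylabel(r"$1/\varphi^2$")
ax.legend()
fig.savefig("best_chi2_worst_phi2_plot.png")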

Questions

  • Is 1 std sufficient for our purposes? Note that for this analysis I selected a loss threshold of 3, so all models with higher losses were excluded from the DataFrame and from the analysis.
  • When looking at the $1/\varphi^{2}$ values, which option is more physically sound and preferable: (i) $1/\langle \varphi^{2} \rangle$ or (ii) $\langle 1/\varphi^{2} \rangle$? Note that in the analysis I use $\langle 1/\varphi^{2} \rangle$; see the sketch after this list for the difference between the two.
  • Is the idea to implement this later in validphys? I tried to run vp-hyperoptplot but it always complains about the need for pandoc (even though I have pandoc installed).
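To make the second question concrete, here is a small sketch with a hypothetical array of per-replica $\varphi^{2}$ values contrasting the two estimators; they differ in general because the mean of a reciprocal is not the reciprocal of the mean.

import numpy as np

# hypothetical phi2 values, e.g. one per replica (or fold) for a given hyperopt trial
phi2 = np.array([0.20, 0.25, 0.40, 0.15])

option_i = 1.0 / np.mean(phi2)   # (i)  1 / <phi2>
option_ii = np.mean(1.0 / phi2)  # (ii) <1/phi2>

# by Jensen's inequality, option_ii >= option_i for positive phi2
print(option_i, option_ii)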

Any comments and ideas for improvement are always welcome.

@Radonirinaunimi (Member)

Thanks @Cmurilochem for this nice analysis. Something like this would indeed help us better visualize and select the "best" models. Regarding your questions:

  • I think that ultimately this will depend both on the number of samples in the vicinity of $\langle \chi^2\rangle_{\mathrm{min}}$ and the number of models that one would like to request.
  • I would say that $1 / \langle \varphi^2 \rangle$ is the one we should look at.
  • That'd be nice! One thing that would also be useful to have from your script is the dumping of the selected hyperparameters (into yaml/json, for example) for the $N$ best models; a minimal sketch of such a dump follows this list.
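A minimal sketch of what such a dump could look like, assuming the selected trials from the snippet above (best_trial) and a hypothetical hyperparameters column holding each trial's hyperparameter dictionary:

import json

# hypothetical: each selected row carries the trial's hyperparameter dictionary
selected = {int(idx): row["hyperparameters"] for idx, row in best_trial.iterrows()}

# dump the hyperparameters of the N selected models to a json file
with open("best_models_hyperparameters.json", "w") as stream:
    json.dump(selected, stream, indent=2)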

@RoyStegeman (Member)

Thanks @Cmurilochem.

  • Since the meaning of 1 std in this case depends even on your threshold, I'd say this is maybe not the best measure: 1. because the losses are not normally distributed, it's not so clear how to interpret 1 std, and 2. it introduces two hyperparameters, namely the number of standard deviations and the threshold, while we can probably think of ways to use a single one.
    What about, e.g., taking the spread of chi2 among the replicas of the best fit to construct the yellow band? In that case the interpretation of 1 std is clear. Whether we then really want 1 std remains a good question. In principle we may want to define points with equivalent chi2 as those for which the (bootstrapped) uncertainty on the central chi2 is within 1 std of the best central chi2, but that may make things overly complicated; a sketch of such a bootstrap estimate follows this list.
  • Agreed with @Radonirinaunimi. Remember that the intuition is to maximize $\varphi^{2}$.
  • This just sounds like a problem with your environment. Is the pandoc installation inside the paths of the python environment you're using?
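As an illustration of the bootstrapped uncertainty mentioned in the first point above, here is a sketch assuming a hypothetical array chi2_replicas with the per-replica chi2 values of the best fit; the yellow band would then run from the best central chi2 up to that value plus the bootstrap error (or plus the plain replica spread).

import numpy as np

def bootstrap_central_chi2_error(chi2_replicas, n_boot=1000, seed=0):
    """Bootstrap estimate of the uncertainty on the central (replica-averaged) chi2."""
    rng = np.random.default_rng(seed)
    n = len(chi2_replicas)
    # resample the replicas with replacement and recompute the central chi2 each time
    means = [np.mean(rng.choice(chi2_replicas, size=n, replace=True)) for _ in range(n_boot)]
    return np.std(means)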

@Cmurilochem (Collaborator, Author)

Thanks @Radonirinaunimi and @RoyStegeman for the extremely useful comments. I will start the implementation based on your suggestions in a small separate PR branched from hyperopt_loss.
@RoyStegeman: yes, I installed it via pip inside my conda env. I also noted that there is the possibility of installing it via homebrew/dpkg. Maybe that is the preferred way? Is this the reason why pandoc is not included in conda-recipe or pyproject.toml?

@Cmurilochem linked a pull request on Mar 4, 2024 that will close this issue
@RoyStegeman (Member)

There is no pandoc on pip, only some Python-specific functionality I believe. If you're using conda, that's probably the best way to install it.

Pandoc is not in our dependencies because we depend on it through reportengine. If you're using conda, pandoc should thus be in your dependencies through reportengine; if you're not using conda but something else to manage your environments, you indeed need to install pandoc separately.

@Cmurilochem (Collaborator, Author)

Thanks @RoyStegeman. I think I managed to make it work. I will proceed as planned then.

@Cmurilochem linked a pull request on Mar 8, 2024 that will close this issue