
Automated selection of models via a best_chi2_worse_phi2 algorithm #1962

Open
Cmurilochem opened this issue Feb 29, 2024 · 5 comments · May be fixed by #1976
Labels: escience, question (Further information is requested)

@Cmurilochem (Collaborator)

As a continuation of #1943, I managed to automate the selection of the best models via @juanrojochacon's hyperopt algorithm, in which the $1/\varphi^{2}$ data are used to decide on the best $\chi^{2}$ hyperpoint. Here I am just referring to it as the best_chi2_worst_phi2 algorithm.

To this end, I made a post-fit script which is based primarily on the validphys vp_hyperoptplot.py module, written in such a way as to make our later implementation easier. Just in case, I attach it here: analysis_hyperopt.zip.

The core of the idea is presented in the code snippet below:

import numpy as np

args = {
    "loss_target": "best_chi2_worst_phi2",  # select Juan & Roy's algorithm
    "max_phi2_points": 10,                  # select the n lowest values of 1/phi2
    "threshold": 3.0,                       # trials with losses above this value are discarded
}

# `dataframe` (the hyperopt trials) and `best_idx` (index of the lowest chi2 loss)
# are built earlier in the post-fit script
if args["loss_target"] == "best_chi2_worst_phi2":
    minimum = dataframe.loss[best_idx]
    std = np.std(dataframe.loss)
    lim_max = minimum + std
    # select rows with chi2 losses between the best point and lim_max
    selected_chi2 = dataframe[(dataframe.loss >= minimum) & (dataframe.loss <= lim_max)]
    # among the selected points, take the n lowest values of 1/phi2
    selected_phi2 = selected_chi2.loss_reciprocal_phi2.nsmallest(args["max_phi2_points"])
    # find the location of these points in the dataframe
    indices = dataframe[dataframe["loss_reciprocal_phi2"].isin(selected_phi2)].index
    best_trial = dataframe.loc[indices]

Here, I define an interval between the $\chi^{2}$ minimum and the minimum plus one standard deviation (std) of the loss, over which I then monitor the corresponding $1/\varphi^{2}$ values. Within this interval, I take the n lowest $1/\varphi^{2}$ hyperpoints and save the selected models into best_trial. In the attached zip file I take as an example the runs I discussed on Monday using 10 replicas (because they give me many more points to test the algorithm). The final plot is shown below:
[Figure: best_chi2_worst_phi2 plot]

The yellow region marks the interval between the $\chi^{2}$ minimum (grey circle) and the minimum plus one standard deviation (std) of the loss data. I also asked the script to return the 10 models within this region that show the lowest $1/\varphi^{2}$ values (cyan circles).
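For reference, here is a minimal matplotlib sketch of how a plot like this could be reproduced from the same DataFrame; it assumes the variables from the snippet above (dataframe, best_idx, minimum, lim_max, best_trial) and the columns loss and loss_reciprocal_phi2, while the styling is purely illustrative.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# all surviving trials: chi2 loss vs 1/phi2
ax.scatter(dataframe.loss, dataframe.loss_reciprocal_phi2, color="tab:blue", label="trials")

# yellow band: losses between the chi2 minimum and the minimum + 1 std
ax.axvspan(minimum, lim_max, color="gold", alpha=0.3, label=r"$[\chi^2_{min}, \chi^2_{min} + \sigma]$")

# grey circle: best chi2 point; cyan circles: selected models with the lowest 1/phi2
ax.scatter(dataframe.loss[best_idx], dataframe.loss_reciprocal_phi2[best_idx], color="grey", s=80, label=r"best $\chi^2$")
ax.scatter(best_trial.loss, best_trial.loss_reciprocal_phi2, color="cyan", edgecolor="black", label="selected models")

ax.set_xlabel(r"$\chi^2$ loss")
ax.set_ylabel(r"$1/\varphi^2$")
ax.legend()
fig.savefig("best_chi2_worst_phi2_plot.png")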

Questions

  • Is 1 std sufficient for our purposes? Note that for this analysis I selected a loss threshold of 3, so all models with higher losses were excluded from the DataFrame and from the analysis.
  • When looking at the $1/\varphi^{2}$ values, which option is more physically sound and preferable: (i) $1/\langle \varphi^{2} \rangle$ or (ii) $\langle 1/\varphi^{2} \rangle$? Note that in the analysis I use $\langle 1/\varphi^{2} \rangle$; see the sketch after this list for the difference between the two.
  • Is the idea to implement this later in validphys? I tried to run vp-hyperoptplot but it always complains about the need for pandoc (even though I have pandoc installed).
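To make the second question concrete, here is a small sketch with a hypothetical array of per-replica $\varphi^{2}$ values contrasting the two estimators; they differ in general because the mean of a reciprocal is not the reciprocal of the mean.

import numpy as np

# hypothetical phi2 values, e.g. one per replica (or fold) for a given hyperopt trial
phi2 = np.array([0.20, 0.25, 0.40, 0.15])

option_i = 1.0 / np.mean(phi2)   # (i)  1 / <phi2>
option_ii = np.mean(1.0 / phi2)  # (ii) <1/phi2>

# by Jensen's inequality, option_ii >= option_i for positive phi2
print(option_i, option_ii)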

Any comments and ideas for improvement are always welcome.

@Radonirinaunimi (Member)

Thanks @Cmurilochem for this nice analysis. Something like this would indeed help us better visualize and select the "best" models. Regarding your questions:

  • I think that ultimately this will depend both on the number of samples in the vicinity of $\langle \chi^2\rangle_{\mathrm{min}}$ and the number of models that one would like to request.
  • I would say that $1 / \langle \varphi^2 \rangle$ is the one we should look at.
  • That'd be nice! One thing that would also be useful to have from your script is the dumping of the selected hyperparameters (into yaml/json, for example) for the $N$ best models; a minimal sketch of such a dump follows this list.
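A minimal sketch of what such a dump could look like, assuming the selected trials from the snippet above (best_trial) and a hypothetical hyperparameters column holding each trial's hyperparameter dictionary:

import json

# hypothetical: each selected row carries the trial's hyperparameter dictionary
selected = {int(idx): row["hyperparameters"] for idx, row in best_trial.iterrows()}

# dump the hyperparameters of the N selected models to a json file
with open("best_models_hyperparameters.json", "w") as stream:
    json.dump(selected, stream, indent=2)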

@RoyStegeman (Member)

Thanks @Cmurilochem.

  • Since the meaning of 1 std in this case depends even on your threshold, I'd say this is maybe not the best measure: 1. because the losses are not normally distributed, it's not so clear how to interpret 1 std, and 2. it introduces two hyperparameters, namely the number of standard deviations and the threshold, while we can probably think of ways to use a single one.
    What about, e.g., taking the spread of chi2 among the replicas of the best fit to construct the yellow band? In that case the interpretation of 1 std is clear. Whether we then really want 1 std remains a good question. In principle we may want to define points with equivalent chi2 as those for which the (bootstrapped) uncertainty on the central chi2 is within 1 std of the best central chi2, but that may make things overly complicated; a sketch of such a bootstrap estimate follows this list.
  • Agreed with @Radonirinaunimi. Remember that the intuition is to maximize $\varphi^{2}$.
  • This just sounds like a problem with your environment. Is the pandoc installation inside the paths of the python environment you're using?
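As an illustration of the bootstrapped uncertainty mentioned in the first point above, here is a sketch assuming a hypothetical array chi2_replicas with the per-replica chi2 values of the best fit; the yellow band would then run from the best central chi2 up to that value plus the bootstrap error (or plus the plain replica spread).

import numpy as np

def bootstrap_central_chi2_error(chi2_replicas, n_boot=1000, seed=0):
    """Bootstrap estimate of the uncertainty on the central (replica-averaged) chi2."""
    rng = np.random.default_rng(seed)
    n = len(chi2_replicas)
    # resample the replicas with replacement and recompute the central chi2 each time
    means = [np.mean(rng.choice(chi2_replicas, size=n, replace=True)) for _ in range(n_boot)]
    return np.std(means)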

@Cmurilochem (Collaborator, Author)

Thanks @Radonirinaunimi and @RoyStegeman for the extremely useful comments. I will start the implementation based on your suggestions in a small separate PR branched from hyperopt_loss.
@RoyStegeman: yes, I installed it via pip inside my conda env. I also noted that there is the possibility of installing it via homebrew/dpkg. Maybe that is the preferred way? Is this the reason why pandoc is not included in conda-recipe or pyproject.toml?

@Cmurilochem linked a pull request on Mar 4, 2024 that will close this issue
@RoyStegeman (Member)

There is no pandoc on pip, only some Python-specific functionality I believe. If you're using conda, that's probably the best way to install it.

Pandoc is not in our dependencies because we depend on it through reportengine. If you're using conda, pandoc should thus be in your dependencies through reportengine; if you're not using conda but something else to manage your environments, you indeed need to install pandoc separately.

@Cmurilochem (Collaborator, Author)

Thanks @RoyStegeman. I think I managed to make it work. I will proceed as planned then.

@Cmurilochem linked a pull request on Mar 8, 2024 that will close this issue