GSoC 2022 projects

ArviZ

ArviZ is a Python package for exploratory analysis of Bayesian models. ArviZ aims to integrate seamlessly with established probabilistic programming languages like PyStan, PyMC (3 and 4), Edward, emcee and Pyro, and to be easily integrated with novel or bespoke Bayesian analyses. Where the aim of the probabilistic programming languages is to make it easy to build and solve Bayesian models, the aim of the ArviZ library is to make it easy to process and analyze the results from those Bayesian models.

Getting started

New contributors should first read the ArviZ documentation and its contributing guide, and be familiar with matplotlib and with the basics of git and GitHub. Before tackling a bigger project it is a good idea to start with some beginner issues, which can be found here.

Timeline

  • Org Applications Open, 29 January 2021
  • Org Application Deadline, 19 February 2021
  • Orgs Announced, 9 March 2021
  • Student Application Period, 29 March - 13 April 2021
  • Application Review Period, 13 April - 17 May 2021
  • Student Projects Announced, 17 May 2021
  • Community Bonding, 17 May - 7 June 2021
  • Coding, 7 June - 16 August, 2021
  • Evaluations, 12 - 16 July 2021
  • Students Submit Code and Final Evaluations, 16 - 23 August 2021
  • Mentors Submit Final Evaluations, 23 - 30 August 2021
  • Results Announced, 31 August 2021

Projects

Below is a list of possible topics for your GSoC project; we are also open to other topics, so contact us on Gitter. Keep in mind that these are only ideas and that probably none of them can be completely solved in a single GSoC project. When writing your proposal, choose some specific tasks and make sure your proposal is adequate for the GSoC time commitment.

Each project also lists some specific requirements needed to successfully complete it; general requirements are listed below. Note that these requirements can be learnt while writing the proposal and during the community bonding period. You should feel confident applying to any project whose requirements interest you and that you would like to learn about; they are not all skills you are expected to have before writing your proposal. We aim for GSoC to provide a win-win scenario where you benefit from an inclusive and thriving environment in which to learn and the library benefits from your contributions. All projects require being comfortable using ArviZ and understanding the relations between its 3 main modules: plots, stats and data. However, unless specified otherwise, no specific knowledge of inference libraries or of the internals of the from_xyz converter functions is needed.

Python

Students working on Python projects should be familiar with Python, numpy and scipy and have basic xarray/InferenceData knowledge. They should also be able to write unit tests for the added functionality using pytest and be able to enforce development conventions and use black, pylint and pydocstyle for code style and linting.
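
As an illustration only (the dataset, fixture and asserted numbers below are assumptions chosen for the example, not part of any project), a unit test for new functionality might look like this minimal pytest sketch:

```python
# Minimal sketch of an ArviZ-style pytest unit test; the checked values are
# illustrative only and tied to the bundled "centered_eight" example dataset.
import arviz as az
import pytest


@pytest.fixture(scope="module")
def idata():
    # small example dataset shipped with ArviZ
    return az.load_arviz_data("centered_eight")


def test_summary_has_one_row_per_variable(idata):
    summary = az.summary(idata)
    # mu, tau and the 8 school effects -> 10 rows
    assert len(summary.index) == 10


def test_ess_is_positive(idata):
    ess = az.ess(idata)
    assert (ess["mu"].values > 0).all()
```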

Julia

Students working on Julia projects should be familiar with Julia, with PyCall (used to access Python objects from within Julia), and with DataFrames and StatsBase. They should also be able to write unit tests for the added functionality using Test.

Project priority

The highest priority projects are (in no particular order): "Add Gen converter to ArviZ.jl", "New plots", "Increase support for time-series and regressions" and "InferenceData R compatibility". What does this mean? Student selection at GSoC has 2 phases. First, Google allocates the available slots based on the initial requests of the participating open source projects. Then, if necessary, open source projects choose which subset of the applicants is accepted so as to comply with the allocated slots. We will send our preferences to Google based on proposal quality alone, and only if we receive fewer slots than requested would we take project priority into account as a tiebreaker between proposals of similar quality.

Expected benefits of working on ArviZ

Students who work on ArviZ can expect their skillset to grow in

  • Bayesian Inference libraries
  • Bayesian modeling workflow and model criticism
  • Matplotlib and/or bokeh usage
  • xarray usage
  • Numba or Dask optimization (depending on project)

ArviZ Dashboards (Python)

The main proposal is to build a dashboard with linked plots, so that inspecting multiple dimensions is easier. At first the focus should be on the ability to call templates which only consume data. Some of the possible templates are prior + prior predictive, sample diagnostics, posterior + posterior predictive, loo, and regression. The ability to dynamically add or remove plots, change plot types, and manually select and save information should also be considered. This dashboard could be built on top of Panel, although other alternatives might be explored.
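
As a rough illustration of such a data-only template, the sketch below builds a tabbed diagnostics view with Panel. None of this is existing ArviZ API; the chosen plots and layout are assumptions.

```python
# Rough sketch of a "sample diagnostics" template that only consumes data;
# this is not existing ArviZ API, it only illustrates the idea.
import arviz as az
import matplotlib.pyplot as plt
import panel as pn

pn.extension()


def sample_diagnostics_template(idata):
    """Return a tabbed Panel layout of diagnostic plots built from idata."""
    panes = []
    for plot_func, title in [
        (az.plot_trace, "Trace"),
        (az.plot_rank, "Rank"),
        (az.plot_ess, "ESS"),
    ]:
        axes = plot_func(idata, show=False)
        fig = axes.ravel()[0].figure if hasattr(axes, "ravel") else axes.figure
        panes.append((title, pn.pane.Matplotlib(fig, tight=True)))
        plt.close(fig)
    return pn.Tabs(*panes)


idata = az.load_arviz_data("centered_eight")
dashboard = sample_diagnostics_template(idata)
dashboard.servable()  # serve with: panel serve this_script.py
```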

Possible mentors

  • Ari Hartikainen

Required skills

People working on this project will need to be familiar with Panel (or alternative dashboard framework) and with ArviZ plotting and stats module.

InferenceData R compatibility (R)

Work together with posterior developers to enable data sharing between ArviZ and posterior via netCDF.
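
The Python half of the intended netCDF round trip already exists in ArviZ; the sketch below shows it, and the R side (reading the same groups into posterior draws objects) is the part this project would develop. The file name is illustrative.

```python
# Python side of the interchange: write each InferenceData group to a netCDF
# group so it can be read from R (e.g. with a netcdf package + posterior),
# then read the file back. The file name is illustrative only.
import arviz as az

idata = az.load_arviz_data("centered_eight")
idata.to_netcdf("centered_eight.nc")  # one netCDF group per InferenceData group

idata_roundtrip = az.from_netcdf("centered_eight.nc")
print(idata_roundtrip.groups())
```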

Possible mentors

  • Oriol Abril

Required skills

People working on this project should be familiar with R, the posterior package and one of the netcdf R packages. Basic Python knowledge will also come in handy.

New plots (Python)

Add new plotting capabilities to ArviZ (for example, dot plots).

Possible mentors

  • Oriol Abril
  • Osvaldo Martin

Required skills

People working on this project should be familiar with, and ideally proficient in, matplotlib and/or bokeh (some plots may be complicated, for example keeping the circles in dot plots round independently of the axis aspect ratio). They are also expected to understand the theory and data processing behind the chosen plots.
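
As a taste of the kind of low-level backend work involved, the sketch below shows one possible way (an assumption, not ArviZ code) to draw dots that stay circular regardless of the data aspect ratio; it only holds while the axis limits and figure size stay fixed.

```python
# Sketch of keeping dots visually circular by drawing Ellipse patches whose
# height compensates for the different x/y display scaling. Helper name and
# example data are illustrative only; resizing the figure breaks the trick.
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse


def add_round_dot(ax, x, y, radius_x):
    """Add a dot of x-radius `radius_x` (data units) that renders as a circle."""
    bbox = ax.get_window_extent()
    x0, x1 = ax.get_xlim()
    y0, y1 = ax.get_ylim()
    pixels_per_x = bbox.width / (x1 - x0)
    pixels_per_y = bbox.height / (y1 - y0)
    radius_y = radius_x * pixels_per_x / pixels_per_y
    ax.add_patch(Ellipse((x, y), 2 * radius_x, 2 * radius_y))


fig, ax = plt.subplots(figsize=(6, 3))
ax.set_xlim(0, 10)
ax.set_ylim(0, 4)
for i in range(5):
    add_round_dot(ax, 2 + i, 2, 0.3)
plt.show()
```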

Increase support for time-series and regressions (Python)

Time-series and regressions play a very important role in modeling, and ArviZ currently lacks plots tailored to the typical needs of these models. Some examples of useful plots can be found in #313 and #512. This project aims to implement several plots to handle time-series and regressions, and whenever possible to extend them to multidimensional regressions.
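
As an illustration of the kind of plot this project targets, the sketch below overlays a posterior predictive HDI band on observed data over a time axis using existing ArviZ and matplotlib pieces; the synthetic data and styling are assumptions.

```python
# Minimal sketch of a time-series style plot: posterior predictive HDI band
# plus observed data over time. The synthetic data are illustrative only.
import arviz as az
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
time = np.arange(100)
observed = np.sin(time / 10) + rng.normal(0, 0.2, size=100)
# fake posterior predictive draws with shape (chain, draw, time)
post_pred = np.sin(time / 10) + rng.normal(0, 0.2, size=(4, 500, 100))

fig, ax = plt.subplots(figsize=(8, 3))
az.plot_hdi(time, post_pred, hdi_prob=0.9, ax=ax, color="C1")
ax.plot(time, post_pred.mean(axis=(0, 1)), color="C1", label="posterior predictive mean")
ax.plot(time, observed, "o", ms=3, color="C0", label="observed")
ax.set_xlabel("time")
ax.legend()
plt.show()
```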

Possible mentors

  • Ari Hartikainen
  • Ravin Kumar
  • Alexandre Andorra

Required skills

People working on this project should be familiar with matplotlib and/or bokeh and with time-series/regression plots.

Native Julia plotting backends (Julia)

Add native plotting functions in Julia for InferenceData objects, taking advantage of Julia's multiple plotting backends and its ability to express complex combinations of plots with little code. Check out https://github.com/arviz-devs/ArviZPlots.jl for more details and to see the current state of experimentation.

Possible mentors

  • Seth Axen

Required skills

People working on this project should be familiar with Plots.jl and/or Makie.jl, with InferenceData, and with the targeted plots. It may involve reimplementing ArviZ algorithms such as KDE or PSIS in Julia.

Add Gen converter to ArviZ.jl (Julia)

Add a converter function to ArviZ.jl to transform inference results obtained with Gen.jl to InferenceData.

Possible mentors

  • Seth Axen

Required skills

People working on this project should be familiar with the inference library Gen and with the InferenceData schema.

Speed-up, parallelize and distribute ArviZ (Python)

ArviZ optionally uses Numba to speed up expensive calculations, achieving noticeable speed-ups in some cases. Many ArviZ use cases also require optimizing memory usage and resource handling. Dask combines dynamic task scheduling optimized for computation with "Big Data" collections, and it is compatible with both xarray and Numba. Using Dask would allow ArviZ to work seamlessly with large datasets that do not fit into memory and to optimize the distribution of computational resources. The aim of this project is to make ArviZ compatible with Dask, so that it may be used as an optional dependency, and to build on top of the Numba benchmarks to automatically measure the speed-ups provided by Dask in some examples. A possible extension would be to work on optimizing make_ufunc.
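
The sketch below illustrates the workflow the project would make seamless: the existing Numba toggle plus a dask-chunked, lazily opened posterior wrapped into InferenceData by hand. The file name, chunking and final computation are assumptions used for illustration.

```python
# Sketch of the target workflow: enable the existing Numba toggle and open a
# (hypothetical) large posterior group lazily with dask-backed xarray, then
# wrap it into InferenceData by hand. Making this seamless is the project goal.
import arviz as az
import xarray as xr
from arviz.utils import Numba

# toggle the optional Numba acceleration ArviZ already supports
Numba.enable_numba()

# lazily open the posterior group, one chunk per chain (file name is illustrative)
posterior = xr.open_dataset("big_model.nc", group="posterior", chunks={"chain": 1})
idata = az.InferenceData(posterior=posterior)

# downstream stats would operate chunk by chunk once Dask support lands
print(az.ess(idata))
```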

Possible mentors

  • Oriol Abril
  • Ravin Kumar

Required skills

People working on this project should be familiar with the computational library or libraries they aim to use to enhance ArviZ (i.e. Numba or Dask). This project may require knowledge of the supported inference libraries and of the internals of the from_xyz converters, for example if using Dask to allow out-of-memory computation. Some benchmarking experience with asv is a plus.

Add refitting algorithms to ArviZ (Python)

Some of the probabilistic programming libraries that integrate with ArviZ make it easy to refit the model on a subset of the data (useful for cross-validation, for example). Using sampling wrappers to call the inference libraries from within ArviZ would allow ArviZ to include algorithms that require refitting. Some interesting examples are approximate leave-future-out cross-validation, importance weighted moment matching and simulation based calibration. The aim of this project is to extend the sampling wrappers and to implement algorithms that require refitting the model. See the example usages of the current wrapper implementation and the corresponding API docs.
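
For orientation, the sketch below shows how a wrapper subclass plugs into the existing refitting machinery (az.SamplingWrapper and az.reloo). The PyStan-flavoured method bodies and names are assumptions; each inference library needs its own subclass.

```python
# Minimal sketch of extending az.SamplingWrapper so that refitting algorithms
# such as az.reloo can call back into the inference library. The PyStan-style
# calls are assumptions; the elided methods are what each wrapper implements.
import arviz as az


class PyStanWrapper(az.SamplingWrapper):
    def sample(self, modified_observed_data):
        # refit the model on the subsetted data and return the fit object
        return self.model.sampling(data=modified_observed_data, **self.sample_kwargs)

    def get_inference_data(self, fit):
        # convert the library-specific fit into InferenceData
        return az.from_pystan(posterior=fit, **self.idata_kwargs)

    def sel_observations(self, idx):
        # split the observed data into kept and excluded subsets
        ...

    def log_likelihood__i(self, excluded_obs, idata__i):
        # pointwise log likelihood of the excluded observation(s)
        ...


# wrapper = PyStanWrapper(model=compiled_model, idata_orig=idata, ...)
# loo_refitted = az.reloo(wrapper, loo_orig=az.loo(idata, pointwise=True))
```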

Possible mentors

  • Oriol Abril

Required skills

People working on this project should be familiar with at least 2 different inference libraries (the more the better) as well as with the internals of the conversion process. Moreover, they should understand the target algorithms, both to design the sampling wrapper API with the final goal in mind and to be able to implement the algorithms using said wrappers.
