RFC: add topk and / or argpartition #629

ogrisel · 2023-05-17T13:23:03Z

numpy provides an indirect way to compute the indices of the smallest (or largest) values of an array using: numpy.argpartition.

There is also a proposal to provide a higher level API, namely (arg)topk in numpy:

EHN: add numpy.topk numpy/numpy#19117

This PR relies on numpy.argpartition internally but it can probably later be optimized to avoid allocating a result array of the size of the input array when k is small.
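For illustration, here is a minimal sketch of how a top-k helper can be built on numpy.argpartition for a 1-D array. The function name `topk` and its return convention (values and indices, largest first) are assumptions made for this example; it is not the implementation from the NumPy PR.

```python
import numpy as np

def topk(x, k):
    """Return (values, indices) of the k largest entries of a 1-D array, largest first.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x)
    if not 0 < k <= x.size:
        raise ValueError("k must satisfy 0 < k <= x.size")
    # argpartition puts the indices of the k largest entries in the last k slots,
    # in arbitrary order. It still allocates an index array as large as x.
    idx = np.argpartition(x, x.size - k)[x.size - k:]
    # Sort only those k candidates, in descending order of value.
    idx = idx[np.argsort(x[idx])[::-1]]
    return x[idx], idx

values, indices = topk(np.array([3.0, 9.0, 1.0, 7.0, 5.0]), k=2)
```

The partition step is linear in the input size and only the k selected entries are fully sorted, but the intermediate index array has the same size as the input, which is the allocation mentioned above.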

Here is a quick review of some available implementations in related libraries:

- torch.topk (there is no torch.argpartition)
  - returns a tuple of values and indices
- jax.lax.top_k
  - returns a tuple of values and indices
  - apparently it is quite slow on GPU
- dask.array.topk
  - returns only the values; I did not find a way to get the indices :(
- cupy.argpartition
  - internally computes a full cupy.argsort, which makes it very inefficient for large arrays and small k: O(n log(n)) instead of O(n)

Motivation: (arg)topk is needed by popular baseline data-science workloads (e.g. k-nearest neighbors classification in scikit-learn) and is surprisingly non-trivial to implement efficiently. For instance, on GPUs the fastest implementations are based on some kind of partial radix sort, while CPU implementations would use more traditional partial sorting algorithms (as implemented in std::partial_sort or std::nth_element).
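To give a feel for the CPU side, here is a minimal pure-Python sketch of heap-based selection, which is similar in spirit to how std::partial_sort is commonly implemented. The function name `argkmin_heap` is hypothetical, and this is not the code used by any of the libraries mentioned above.

```python
import heapq

def argkmin_heap(values, k):
    """Indices of the k smallest values, smallest first, using a bounded max-heap.

    Hypothetical helper for illustration: O(n log k) time, O(k) extra memory.
    """
    if k <= 0:
        return []
    # heapq is a min-heap, so store (-value, -index): the root is then the
    # largest value currently kept (ties broken in favor of the larger index).
    heap = []
    for i, v in enumerate(values):
        if len(heap) < k:
            heapq.heappush(heap, (-v, -i))
        elif -v > heap[0][0]:
            # v is smaller than the worst value kept so far: replace it.
            heapq.heapreplace(heap, (-v, -i))
    # Descending order on (-value, -index) is ascending order on (value, index).
    return [-neg_i for _, neg_i in sorted(heap, reverse=True)]

nearest = argkmin_heap([5.0, 1.0, 4.0, 2.0, 3.0], k=3)
```

Only k entries are ever held in memory, which is the property that makes this family of algorithms attractive when k is much smaller than the input size.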

ogrisel · 2023-05-17T13:25:09Z

Note: since argsort is part of the standard Array API, it would be possible to implement a generic yet inefficient fallback in array-api-compat, while still dispatching to a more efficient routine for libraries that provide one. This is what cupy.argpartition does, for instance.
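A minimal sketch of such a fallback with dispatch, for 1-D arrays only. The name `argtopk` and the `xp` namespace argument are assumptions made for this example, not an existing array-api-compat API. The sketch sorts on the negated array, so it assumes a floating-point or signed integer dtype and 1 <= k <= n.

```python
import types
import numpy as np

def argtopk(x, k, xp=np):
    """Indices of the k largest entries of a 1-D array, in descending order of value.

    Hypothetical helper: `xp` is any namespace exposing `argsort` and `take`.
    """
    if hasattr(xp, "argpartition"):
        # Faster path for namespaces that provide argpartition
        # (not part of the standard).
        n = x.shape[0]
        candidates = xp.argpartition(x, n - k)[n - k:]
        order = xp.argsort(-xp.take(x, candidates))
        return xp.take(candidates, order)
    # Generic fallback: full sort, O(n log(n)) regardless of k.
    return xp.argsort(-x)[:k]

x = np.array([3.0, 9.0, 1.0, 7.0, 5.0])
fast = argtopk(x, 3)
# Simulate a namespace without argpartition to exercise the fallback.
minimal_xp = types.SimpleNamespace(argsort=np.argsort, take=np.take)
slow = argtopk(x, 3, xp=minimal_xp)
```

Both paths return the same indices when values are distinct; only the cost differs when k is small relative to the array size.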

rgommers · 2023-05-17T18:33:31Z

Thanks for the proposal @ogrisel. It's actually surprising that coverage and performance across array libraries are so spotty. I dug up the NumPy mailing list discussion, and it seemed more or less positive, just unfinished, and the name to use is a nicely-sized bikeshed.

Is this function something you already have in scikit-learn internally, or are you looking for something more efficient than the argsort or similar function you use now?

ogrisel · 2023-05-17T20:41:26Z

In scikit-learn, for k-nearest neighbors (bruteforce exact method for medium to high dimensional space), we use a routine optimized for multicore CPUs using Cython + OpenMP for pairwise distance (similar to scipy's cdist) fused with a topk reduction implemented in templated Cython. The topk reduction itself (called "argkmin" in scikit-learn) lives here:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/metrics/_pairwise_distances_reduction/_argkmin.pyx.tp

This code can only be called as a reduction fused into the multithreaded pairwise distance computation kernel. It is orchestrated via:

https://github.com/scikit-learn/scikit-learn/blob/f5ec34e0f76277ba6d0a77d3033db0af83899b64/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py#L157

For CPU, I doubt that any Array API based solution will be able to compete on both speed and memory usage.

However, we are interested in implementing Array API support for an alternative numpy code-path in order to provide GPU support, e.g. via PyTorch or CuPy. The reducer used in the numpy code-path is there:

https://github.com/scikit-learn/scikit-learn/blob/f5ec34e0f76277ba6d0a77d3033db0af83899b64/sklearn/neighbors/_base.py#L704

It's based on numpy.argpartition followed by numpy.argsort of the top k values.
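A minimal sketch of that pattern on a precomputed 2-D distance matrix. The function name `argkmin_rows` and the toy data are assumptions made for this example; this is not a copy of the scikit-learn code linked above.

```python
import numpy as np

def argkmin_rows(dist, k):
    """For each row of a 2-D distance matrix, indices of the k smallest
    entries, nearest first. Hypothetical helper; assumes 1 <= k <= dist.shape[1].
    """
    rows = np.arange(dist.shape[0])[:, None]
    # Step 1: unordered indices of the k smallest entries of each row.
    neigh = np.argpartition(dist, k - 1, axis=1)[:, :k]
    # Step 2: sort only those k candidates by distance.
    order = np.argsort(dist[rows, neigh], axis=1)
    return neigh[rows, order]

dist = np.array([[0.5, 0.1, 0.9, 0.3],
                 [0.2, 0.8, 0.4, 0.6]])
nearest = argkmin_rows(dist, k=2)
```

The full argsort is applied only to an array with k columns, so the sorting cost scales with k rather than with the number of candidate points.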

Note that to efficiently implement k-nearest neighbors in scikit-learn using the Array API, we would also need the Array API to provide scipy.spatial.distance.cdist.

I have not opened an issue to discuss cdist yet. I wanted to test the waters with topk first.

shoyer · 2023-06-01T20:41:41Z

JAX also has an approximate top-k implementation specifically tuned for TPUs: https://jax.readthedocs.io/en/latest/_autosummary/jax.lax.approx_max_k.html

ogrisel · 2023-06-15T12:40:11Z

I am not sure if we want to include non-exact methods in the spec. I have the feeling that there are many ways to compute such approximations and that they will require different and evolving parametrizations with different speed-accuracy trade-offs.

Ref: data-apis#629
Ref: numpy/numpy#19117
Ref: numpy/numpy#15128
Ref: https://mail.python.org/archives/list/[email protected]/thread/F4P5UVTAKRJJ3OORI6UOWFSUEE5CNTSC/#PELUDW5ACUBHBNK5IVGWIWTQHBM2HXUP

kgryte · 2023-12-14T17:59:35Z

A PR has now been opened which proposes adding top_k and friends to the specification: #722. Please feel free to review and comment there with your concerns and feedback.

rgommers added the API extension Adds new functions or objects to the API. label May 17, 2023

ogrisel mentioned this issue Jun 15, 2023

Array API support for k-nearest neighbors models with the brute force method scikit-learn/scikit-learn#26586

Open

kgryte mentioned this issue Jun 29, 2023

Tracking issue for the 2023 revision of the array API specification #643

Closed

17 tasks

asmeurer mentioned this issue Oct 11, 2023

Seeking alternatives for setting array values with integer indexing data-apis/array-api-compat#62

Closed

kgryte linked a pull request Dec 14, 2023 that will close this issue

Add API specifications for returning the k largest elements #722

Open

kgryte added this to the v2024 milestone Jan 25, 2024

kgryte changed the title ~~Standard API for topk and / or argpartition~~ RFC: add topk and / or argpartition Apr 4, 2024

kgryte added RFC Request for comments. Feature requests and proposed changes. Needs Discussion Needs further discussion. labels Apr 4, 2024
