Empirical CDF enhancements #590

alyst · 2020-08-09T13:55:47Z

This PR improves the performance of ECDF and adds optional interpolation.

The major problem with the current code is that, in the weighted case, the partial sums have to be recalculated for each input. This is unnecessary, since all CDF values can be precalculated in both the weighted and unweighted cases.
I suppose ECDF was written in the pre-broadcast era, so it does not overload Base.Broadcast.broadcasted() to provide enhanced support for vectors. The PR deprecates ecdf(v::AbstractVector) in favor of the standard dot notation. I've kept the customized broadcasting, but now it just caches the CDF value of the last vector element.
Optionally, ECDF can enable interpolation (ecdf(..., interpolate=true)), which is handy for continuous empirical distributions: it linearly interpolates between the CDFs of adjacent values.
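
For readers skimming the thread, here is a minimal sketch of the idea described above: cumulative weights are computed once at construction, and evaluation is then a binary search plus an optional linear interpolation. The names `SimpleECDF` and `simple_ecdf` are hypothetical and the layout is simplified; this is not the code from the PR.

```julia
# Illustrative only: `SimpleECDF` and `simple_ecdf` are hypothetical names,
# not part of StatsBase or of this PR.
struct SimpleECDF
    values::Vector{Float64}  # sorted unique data values
    cdf::Vector{Float64}     # precalculated CDF at each unique value
end

function simple_ecdf(x::AbstractVector{<:Real},
                     w::AbstractVector{<:Real} = ones(length(x)))
    length(x) == length(w) ||
        throw(ArgumentError("data and weights must have the same length"))
    vals = Float64[]
    cums = Float64[]
    acc = 0.0
    for i in sortperm(x)
        acc += w[i]
        if !isempty(vals) && vals[end] == x[i]
            cums[end] = acc          # repeated value: extend its cumulative weight
        else
            push!(vals, x[i])
            push!(cums, acc)
        end
    end
    return SimpleECDF(vals, cums ./ acc)  # normalize once, at construction
end

# Evaluation only needs a binary search over the precalculated values.
function (e::SimpleECDF)(t::Real; interpolate::Bool = false)
    i = searchsortedlast(e.values, t)
    i == 0 && return 0.0
    (!interpolate || i == length(e.values)) && return e.cdf[i]
    lo, hi = e.values[i], e.values[i + 1]
    # linear interpolation between the CDFs of the two adjacent values
    return e.cdf[i] + (e.cdf[i + 1] - e.cdf[i]) * (t - lo) / (hi - lo)
end
```

Because the object is callable on scalars, `e.(v)` works through ordinary broadcasting without any array method.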

nalimilan

Sorry for the long delay. This looks nice. I can't comment on the math but here are remarks about the code.

nalimilan · 2020-10-06T16:10:56Z

src/weights.jl

@@ -121,8 +121,8 @@ aweights(vs::RealArray) = AnalyticWeights(vec(vs))
    s = w.sum

    if corrected
-        sum_sn = sum(x -> (x / s) ^ 2, w)
-        1 / (s * (1 - sum_sn))
+        sum_w2 = sum(abs2, w)


Wasn't this here to avoid overflow?
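
A small sketch of the concern, using made-up weights that are large enough to overflow when squared. It only compares the two summation styles visible in the hunk; the rest of the new formula is not shown in the diff, so nothing here claims to reproduce it.

```julia
# Hypothetical weights chosen so that squaring them overflows Float64.
w = [1e200, 2e200]
s = sum(w)

# Old style: normalize each weight before squaring, so every term stays in [0, 1].
scaled = sum(x -> (x / s)^2, w)

# New style: square the raw weights first; 1e400 is not representable in Float64.
raw = sum(abs2, w)

println((scaled = scaled, raw = raw, ratio = raw / s^2))
```

With weights of ordinary magnitude both forms agree up to rounding, so whether the overflow matters depends on the range of weights the package is expected to support.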

nalimilan · 2021-02-08T15:54:32Z

src/deprecates.jl

+
+### Deprecate August 2020 (v0.33)
+function (ecdf::ECDF)(v::RealArray)
+    depwarn("(ecdf::ECDF)(v::RealArray) is deprecated, use `ecdf.(v)` broadcasting instead", "(ECDF::ecdf)(v::RealArray)")


More common, and possibly faster:

Suggested change

depwarn("(ecdf::ECDF)(v::RealArray) is deprecated, use `ecdf.(v)` broadcasting instead", "(ECDF::ecdf)(v::RealArray)")

depwarn("(ecdf::ECDF)(v::RealArray) is deprecated, use `ecdf.(v)` broadcasting instead", :ecdf)
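
As a sketch of the pattern being suggested, the following deprecates an array method of a callable type in favor of broadcasting and passes a `Symbol` as the second argument to `Base.depwarn`. The `Doubler` type is a hypothetical stand-in for `ECDF`.

```julia
# `Doubler` is a hypothetical callable type used only for illustration.
struct Doubler end

(d::Doubler)(x::Real) = 2x

# Deprecated array method: warn (when deprecation warnings are enabled)
# and forward to the broadcasting form.
function (d::Doubler)(v::AbstractArray{<:Real})
    Base.depwarn("(d::Doubler)(v::AbstractArray) is deprecated, " *
                 "use `d.(v)` broadcasting instead", :Doubler)
    return d.(v)
end
```

Recent Julia versions hide deprecation warnings by default; starting Julia with `--depwarn=yes` makes them visible.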

nalimilan · 2021-02-08T15:55:04Z

src/empirical.jl

@@ -1,68 +1,142 @@
 # Empirical estimation of CDF and PDF

-## Empirical CDF
+"""
+Empirical Cumulative Distribution Function (ECDF).


Suggested change

Empirical Cumulative Distribution Function (ECDF).

ECDF{T <: Real, W <: Real, I}

Empirical Cumulative Distribution Function (ECDF).

nalimilan · 2021-02-08T15:56:34Z

src/empirical.jl

+    #  - `ECDF(x[i])`
+    #  - `1/(x[i+1] - x[i])`
+    #  - `ECDF(x[i+1]) - ECDF(x[i])` (the weight of `x[i+1]`)
+    sorted_values::Vector{Tuple{T, W, W, W}}


Isn't it less efficient to store values as a vector of tuples instead of as several vectors? In terms of memory use AFAIK alignment may require padding between entries.
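
One way to check the padding question is to compare `sizeof` of the tuple type with the sum of its field sizes. The sketch below assumes two example element types; the `Mixed` case is hypothetical and only shows when padding would appear.

```julia
# Example element types; `Mixed` is a hypothetical case with a narrower field.
const Uniform = Tuple{Float64, Float64, Float64, Float64}
const Mixed   = Tuple{Float64, Int32, Float64, Float64}

# Bytes actually needed by the fields, ignoring alignment.
payload(T) = sum(sizeof, fieldtypes(T))

for T in (Uniform, Mixed)
    println(T, ": sizeof = ", sizeof(T), ", payload = ", payload(T))
end
```

When all four fields have the same size, as with four `Float64` values, the tuple needs no padding, so on a typical 64-bit platform the memory footprint should match four separate vectors. Access patterns and cache behavior are a separate question that this check does not answer.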

nalimilan · 2021-02-08T16:00:41Z

src/empirical.jl

-        r[ord[i]] = weightsum
-        i += 1
+# broadcasts ecdf() over an array
+# caches the last calculated value


What do you mean by "caches"?

nalimilan · 2021-02-08T16:10:38Z

src/empirical.jl

    any(isnan, X) && throw(ArgumentError("ecdf can not include NaN values"))
-    isempty(weights) || length(X) == length(weights) || throw(ArgumentError("data and weight vectors must be the same size," *
-        "got $(length(X)) and $(length(weights))"))
+    evenweights = isnothing(weights) || isempty(weights)


Inference works better with ===:

Suggested change

evenweights = isnothing(weights) || isempty(weights)

evenweights = weights === nothing || isempty(weights)

nalimilan · 2021-02-08T16:11:56Z

src/empirical.jl

+                            "got $(length(X)) and $(length(weights))"))
+    T = eltype(X)
+    W0 = evenweights ? Int : eltype(weights)
+    W = isnothing(weights) ? Float64 : eltype(one(W0)/sum(weights))


Store sum(weights) to avoid calling it twice?

nalimilan · 2021-02-08T16:14:01Z

src/empirical.jl

+    push_valprev!() = push!(sorted_vals, (valprev, min(wsumprev/wsum, one(W)),
+                                          inv(val - valprev), valw/wsum))
+
+    @inbounds for i in ord
+        valnew = X[i]
+        if (val != valnew) || (i == last(ord))
+            (wsumprev > 0) && push_valprev!()


Doesn't this anonymous function trigger boxing of these variables, making the loop quite slow?
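
The boxing effect can be reproduced in isolation. In this sketch (hypothetical function names), a local closure reassigns a captured variable, which makes Julia store it in a `Core.Box`; capturing a `Ref` and mutating its contents is one common workaround.

```julia
# Closure reassigns the captured variable `total`, so `total` is boxed.
function count_boxed(n)
    total = 0
    bump!() = (total += 1)
    for _ in 1:n
        bump!()
    end
    return total, bump!
end

# Workaround: capture a Ref that is never rebound; only its contents change.
function count_ref(n)
    total = Ref(0)
    bump!() = (total[] += 1)
    for _ in 1:n
        bump!()
    end
    return total[], bump!
end

# The captured variable is visible as a field of the closure object.
println(typeof(count_boxed(3)[2].total))
println(typeof(count_ref(3)[2].total))
```

Running `@code_warntype count_boxed(3)` is another way to spot the box, since the boxed variable shows up with a non-concrete type.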

nalimilan · 2021-02-08T16:15:10Z

test/empirical.jl

+    @test_skip fnecdf.weights == fnecdfalt.weights
+    @test_skip fnecdf.weights != w1  #  check that w wasn't accidently modified in place
+    @test_skip fnecdfalt.weights != w2


nalimilan · 2021-02-08T16:15:52Z

test/empirical.jl

+    show(iobuf, obj)
+    return String(take!(iobuf))
+end
+
 @testset "ECDF" begin
    x = randn(10000000)
    fnecdf = ecdf(x)


I'd change all of these to ecdf.(x), and copy this to deprecated.jl.

ParadaCarleton · 2023-08-23T00:52:50Z

Is there anything here that's still worth it, or should I close?

alyst added 8 commits August 9, 2020 15:33

varcorrection(::AnalyticWeights): optimize

4557127

depcheck(): use ===

50e6291

deprecate ecdf(v::AbstractVector)

166238a

use the custom broadcast implementation of ecdf.(v)

ECDF: more tests for degenerated cases (1 & 0-val)

0a0e9ac

ECDF: faster implementation

ca27e5a

precalculates the partial weight sums

ECDF: optional cdf interpolation

ca122ff

linearly interpolates cdf between adjacent values

add show(ECDF)

6db07bd

more ECDF docstrings & code comments

15680b6

alyst mentioned this pull request Oct 5, 2020

Enhance ranking code #589

Merged

alyst added 2 commits October 5, 2020 20:05

correct ECDF tuple description

d9b934f

add show(ECDF) tests

2570683

nalimilan reviewed Feb 8, 2021
