Switch to data-driven percentiles #2962

francois-rozet · 2023-07-21T13:03:13Z

This PR modifies the ranking system to use the actual quantiles (see #2857) of the statistics. This leads to significantly different, but generally higher, ranks for users. In particular, it is now impossible to be at C rank (minimum is C+). An alternative to lower the ranks would be to filter out inactive users before computing the quantiles.
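
For illustration only, here is a minimal sketch of how quantile tables like the ones in this PR could be computed offline, and how filtering out inactive users shifts them. The `quantiles` helper, the nearest-rank rule, and the sample numbers are all assumptions made up for this example; they are not taken from the PR's data or code.

```javascript
// Hypothetical helper: returns q + 1 cut points (minimum to maximum) of a
// sample, picking the nearest-rank element for each quantile.
function quantiles(values, q = 100) {
  const sorted = [...values].sort((a, b) => a - b);
  const out = [];
  for (let k = 0; k <= q; k++) {
    const idx = Math.round((k / q) * (sorted.length - 1));
    out.push(sorted[idx]);
  }
  return out;
}

// Made-up commit counts; half of these users are inactive (zero commits).
const commits = [0, 0, 0, 0, 3, 10, 40, 120];

// Quartiles over everyone, then over active users only.
const allUsers = quantiles(commits, 4);
const activeUsers = quantiles(commits.filter((c) => c > 0), 4);

console.log("all users:   ", allUsers);
console.log("active users:", activeUsers);
```

With the inactive users removed, the low quantiles are no longer zero, so the same statistic lands at a lower percentile and ranks drop accordingly.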

What do you think @rickstaa, @qwerty541?

vercel · 2023-07-21T13:03:16Z

@francois-rozet is attempting to deploy a commit to the github readme stats Team on Vercel.

A member of the Team first needs to authorize it.

codecov · 2023-07-21T13:04:18Z

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (b56689b) 97.62% compared to head (091e304) 97.63%.
Report is 337 commits behind head on master.

❗ Current head 091e304 differs from pull request most recent head 2cc3e54. Consider uploading reports for the commit 2cc3e54 to get more accurate results

Files	Patch %	Lines
src/calculateRank.js	99.06%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2962      +/-   ##
==========================================
+ Coverage   97.62%   97.63%   +0.01%     
==========================================
  Files          24       24              
  Lines        5182     5249      +67     
  Branches      460      463       +3     
==========================================
+ Hits         5059     5125      +66     
- Misses        122      123       +1     
  Partials        1        1

qwerty541

In general, the fact that it is now impossible to get a C rank is quite logical. If you look closely at the quantiles, you can see that at the beginning there are zeros everywhere. I would even suggest that a C+ rank would be no less rare, since a new user with empty statistics gets a percentile of 78.26086956521738 in the new system, and the threshold for getting a B rank is only 80. I think that the thresholds need to be changed, at least for the latest ranks. My suggestion is the following:

const THRESHOLDS = [1, 12.5, 25, 37.5, 45, 52.5, 60, 67.5, 75];

@francois-rozet I also want to clarify whether I understood correctly that the number of repositories now does not affect the rank in any way?

github-readme-stats/src/calculateRank.js

Line 84 in 091e304

repos: 0.0,

The rest looks good to me. Great improvement to the ranking system.

I also want you to know that since you became a collaborator, you can push your branches directly to this repository. That is better because the Vercel test deployment workflow will run, which makes testing and reviewing easier.

anuraghazra · 2023-08-17T17:15:54Z

I don't have much of an idea about any of these ranking algorithms.

But you can go ahead with whatever system you folks think is better, with two requests:

Make sure to run a perf test once (a basic test with the performance.now() API) to ensure API responses are not affected, since we now have those lookup arrays.
Document the changes properly, because external people might have no idea why their ranks changed; in that case you can refer them to the docs.
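
A basic timing check of the kind requested above might look like the following sketch. The `lookup` function and the `table` array are hypothetical stand-ins for the real lookup code, and the snippet assumes a runtime where `performance` is available as a global (recent Node.js versions or a browser).

```javascript
// Hypothetical stand-in for the code under test: a linear scan over a
// sorted lookup array, returning the first index whose value exceeds x.
function lookup(arr, x) {
  return arr.findIndex((q) => x < q);
}

// Runs fn many times and returns the mean duration of one call in milliseconds.
function timeIt(fn, iterations = 100000) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  return (performance.now() - start) / iterations;
}

// Made-up table with 101 entries, the same length as a percentile table.
const table = Array.from({ length: 101 }, (_, i) => i * 10);

const msPerCall = timeIt(() => lookup(table, 555));
console.log(`~${(msPerCall * 1000).toFixed(3)} µs per lookup`);
```

Timings from a loop like this are noisy, so it is only suitable for spotting order-of-magnitude regressions, not small differences.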

@rickstaa @francois-rozet @qwerty541

rickstaa · 2023-08-17T18:50:58Z

@anuraghazra, @francois-rozet I haven't had time to review this yet, but I will once I finish my master's thesis. Nevertheless, maybe we should not add the QUANTILES themselves to the code 🤔. Sticking with the weights and means already in the codebase is better.

github-readme-stats/src/calculateRank.js

Lines 35 to 54 in 24ac78b

    
const COMMITS_MEDIAN = all_commits ? 1000 : 250,
  COMMITS_WEIGHT = 2;
const PRS_MEDIAN = 50,
  PRS_WEIGHT = 3;
const ISSUES_MEDIAN = 25,
  ISSUES_WEIGHT = 1;
const REVIEWS_MEDIAN = 2,
  REVIEWS_WEIGHT = 1;
const STARS_MEDIAN = 50,
  STARS_WEIGHT = 4;
const FOLLOWERS_MEDIAN = 10,
  FOLLOWERS_WEIGHT = 1;
const TOTAL_WEIGHT =
  COMMITS_WEIGHT +
  PRS_WEIGHT +
  ISSUES_WEIGHT +
  REVIEWS_WEIGHT +
  STARS_WEIGHT +
  FOLLOWERS_WEIGHT;

This prevents any performance problems from occurring, and if the code contains a comment that links to a RANK_CALC.md Markdown doc file in which the calculation and quantiles are explained, I think that is enough. What do you think?
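
As a rough sketch of why the median-and-weight approach stays cheap, the constants quoted above could be combined as shown below. This is a simplification, not the repository's actual function: it assumes every statistic goes through the same exponential CDF, takes the `all_commits = false` commit median, and reports the result as a "top X%" value where lower is better. The `STATS` table and `percentile` function are names invented for this example.

```javascript
// Medians and weights copied from the constants quoted above
// (commit median for the all_commits = false case).
const STATS = {
  commits: { median: 250, weight: 2 },
  prs: { median: 50, weight: 3 },
  issues: { median: 25, weight: 1 },
  reviews: { median: 2, weight: 1 },
  stars: { median: 50, weight: 4 },
  followers: { median: 10, weight: 1 },
};

// CDF of an exponential distribution rescaled so that its median is 1.
const exponentialCdf = (x) => 1 - 2 ** -x;

// Assumption: one CDF for all statistics, combined as a weighted average.
function percentile(user) {
  let score = 0;
  let totalWeight = 0;
  for (const [name, { median, weight }] of Object.entries(STATS)) {
    score += weight * exponentialCdf((user[name] ?? 0) / median);
    totalWeight += weight;
  }
  return 100 * (1 - score / totalWeight); // assumed convention: lower is better
}

const atMedian = { commits: 250, prs: 50, issues: 25, reviews: 2, stars: 50, followers: 10 };
console.log(percentile(atMedian)); // a user exactly at every median
console.log(percentile({}));       // a user with empty statistics
```

Each call is a handful of arithmetic operations with no table lookup, which is the performance argument made here.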

The ranking was improved since many users pointed out that it took a lot of work to get a rank other than A+ 😅. There were about three issues/discussions opened by people who were wondering what happened to their rank, but after I explained it to them and showed them that ranking up is now more accessible for the first ranks, no new issues were opened. However, we can document this change and the reasoning in a RANK_CALC.md doc file 👍🏻 .

francois-rozet · 2023-08-18T15:10:55Z

To be sure, @rickstaa, you think that we should stick to the current system because it is simpler, even though it is a rough approximation of the true quantiles?

rickstaa · 2023-08-18T15:31:55Z

Hey @francois-rozet, sorry for being unclear. My suggestion was that we perform the quantile-based calculation offline and then store just the means and weights in the code. This prevents any performance issues and keeps the code clean. We can then store the calculation and quantiles somewhere and refer to them in a code comment 🤔.

francois-rozet · 2023-08-18T15:48:55Z

You cannot get the true percentiles just based on the median. Currently, we assume that the statistics follow exponential/log-normal distributions, which implicitly fixes the quantiles with respect to the median/mean. The percentiles we get are basically educated guesses.

If we want the true percentiles, we need the true quantiles.
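
To make the point about implicitly fixed quantiles concrete, here is a small sketch under the assumption of the exponential model $1 - 2^{-x/m}$ with median $m$. Inverting that CDF gives the quantile function $m \log_2\frac{1}{1-p}$, so once the median is chosen every other quantile is determined. The `impliedQuantile` name is invented for this example, and the median of 250 is the `COMMITS_MEDIAN` value quoted earlier in this thread.

```javascript
// Quantile function (inverse CDF) of the exponential model 1 - 2^(-x / median).
function impliedQuantile(p, median) {
  return median * Math.log2(1 / (1 - p));
}

const median = 250;

// Every quantile is a fixed multiple of the median, whatever the real data look like.
const rows = [0.5, 0.75, 0.9, 0.99].map((p) => ({
  p,
  value: Math.round(impliedQuantile(p, median)),
}));

console.log(rows);
```

Data-driven quantiles replace these fixed multiples with values measured from real users, which is where the two approaches can disagree.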

qwerty541 · 2023-08-26T05:47:42Z

I have opened pull request #3141, which contains a base implementation of performance tests. After merging that one and updating this branch from master, we can check how the quantiles affect performance. Personally, I do not expect much change, since there is no iteration through these quantile arrays; they are accessed by index everywhere.

francois-rozet · 2023-08-26T11:15:16Z

Actually, there are iterations through the arrays with the .findIndex() calls. But this can be improved from $O(n)$ to $O(\log n)$ by using a bisection search. I can implement that.

qwerty541 · 2023-08-27T00:27:48Z

@francois-rozet Yes, you're right, I didn't notice the .findIndex() call on line 2 at first glance.

https://github.com/anuraghazra/github-readme-stats/pull/2962/files#diff-d6a36eb9dcbdb8c6168a754251ba0c9318c7348ac6b17001cea3ba5aa84af728R2

I only noticed the call to the same function on line 136, but since the THRESHOLDS array contains only 9 elements and, unlike in the score() function, it is called only once, I didn't attach much importance to it.

https://github.com/anuraghazra/github-readme-stats/pull/2962/files#diff-d6a36eb9dcbdb8c6168a754251ba0c9318c7348ac6b17001cea3ba5aa84af728R136

It would be great if you implement that.

I hope my code below will help.

// Returns the index of the last element strictly lower than target in a
// sorted (ascending) array, or -1 if there is none. Runs in O(log n).
function findIndex(arr, target) {
  let left = 0;
  let right = arr.length - 1;
  let result = -1;

  while (left <= right) {
    const mid = Math.floor((left + right) / 2);

    if (arr[mid] < target) {
      result = mid; // Best candidate so far; keep searching the right half
      left = mid + 1;
    } else {
      right = mid - 1; // Search the left half
    }
  }

  return result;
}

const sortedArray = [
  0, 0, 0, 0, 2, 4, 8, 11, 15, 19, 23, 27, 32, 36, 41, 45, 50, 55, 60, 65, 71,
  76, 82, 87, 93, 99, 105, 111, 117, 124, 131, 137, 145, 151, 159, 166, 174,
  182, 190, 198, 207, 215, 225, 234, 244, 253, 264, 274, 285, 296, 306, 318,
  330, 342, 355, 368, 382, 396, 409, 424, 440, 457, 475, 493, 512, 531, 551,
  570, 593, 618, 643, 667, 695, 723, 752, 784, 815, 857, 893, 934, 984, 1037,
  1094, 1152, 1217, 1289, 1379, 1475, 1576, 1696, 1851, 2023, 2232, 2480,
  2835, 3242, 3885, 4868, 6614, 11801, 792319,
];
const targetValue = 2333;
const lowerIndex = findIndex(sortedArray, targetValue);

if (lowerIndex !== -1) {
  console.log(`Index of the last value lower than ${targetValue} is: ${lowerIndex}. The value is: ${sortedArray[lowerIndex]}`);
} else {
  console.log(`No value lower than ${targetValue} found.`);
}

rickstaa · 2023-10-13T09:29:34Z

@francois-rozet, @anuraghazra just a small heads-up that I just merged @qwerty541's pull request, which allows us to safeguard against performance issues.

francois-rozet · 2023-12-06T13:16:13Z

Hello @anuraghazra, @rickstaa, @qwerty541, I have (finally) replaced the linear-time search with a log-time bisection search. However, thinking back, I am now a bit skeptical about this PR. It would be great if the percentiles were purely "data-driven", but I feel like it is not worth the increase in code complexity and in code and data maintenance. Users seem to be quite happy with the current ranking system, which is easy to understand and maintain, even if it is slightly off.

I believe #2637 is a much more important issue for the project.

BhasherBEL · 2024-01-06T22:57:33Z

src/calculateRank.js

@@ -1,12 +1,114 @@
-function exponential_cdf(x) {
-  return 1 - 2 ** -x;
+function searchSorted(arr, x) {


As the size of the list is small and constant, it might be interesting to use a dictionary instead of a list and a "complex" search algorithm?

francois-rozet · 2024-01-06

I am not sure how you would use a dictionary to find the index i for which arr[i - 1] <= x < arr[i].
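
For reference, a bisection search that returns the index described in this reply can be sketched in a few lines. This is an illustrative version and not necessarily the `searchSorted` body from the PR; it assumes the array is sorted in ascending order, and the sample values are made up.

```javascript
// Returns the smallest index i such that x < arr[i], or arr.length if no
// element is greater than x. For 0 < i < arr.length this is exactly the i
// for which arr[i - 1] <= x < arr[i]. Assumes arr is sorted ascending.
function searchSorted(arr, x) {
  let lo = 0;
  let hi = arr.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1; // integer midpoint
    if (arr[mid] <= x) {
      lo = mid + 1; // answer lies strictly to the right of mid
    } else {
      hi = mid; // arr[mid] > x, so mid is still a candidate
    }
  }
  return lo;
}

// Made-up sorted values, with duplicates at the start as in the real tables.
const sample = [0, 0, 2, 4, 8];
console.log(searchSorted(sample, 3));
```

A dictionary answers exact-key lookups, whereas this query is about the interval a value falls into, which relies on the ordering of the array.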

Switch to data-driven percentiles

091e304

github-actions bot added the ranks Feature, Bug fix, improvement related to ranking system. label Jul 21, 2023

rickstaa mentioned this pull request Jul 30, 2023

Threshold for A++ Grade in Readme Stats #226

Closed

qwerty541 reviewed Aug 9, 2023

qwerty541 mentioned this pull request Aug 25, 2023

Add performance tests base #3141

Merged

rickstaa mentioned this pull request Oct 17, 2023

tests: add gist card performance test #3372

Merged

Sub-linear score search

2cc3e54

BhasherBEL reviewed Jan 6, 2024

francois-rozet closed this by deleting the head repository Aug 27, 2024

mdmuhtasimfuadfahim mentioned this pull request Oct 3, 2024

How to Get Rate A+? #3930

Open
