Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to data-driven percentiles #2962

Closed
wants to merge 2 commits into from
Closed

Switch to data-driven percentiles #2962

wants to merge 2 commits into from

Conversation

francois-rozet
Copy link
Collaborator

@francois-rozet francois-rozet commented Jul 21, 2023

This PR modifies the ranking system to use the actual quantiles (see #2857) of the statistics. This leads to significantly different, but generally higher, ranks for users. In particular, it is now impossible to be at C rank (minimum is C+). An alternative to lower the ranks would be to filter out inactive users before computing the quantiles.

What do you think @rickstaa, @qwerty541?

@vercel
Copy link

vercel bot commented Jul 21, 2023

@francois-rozet is attempting to deploy a commit to the github readme stats Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions github-actions bot added the ranks Feature, Bug fix, improvement related to ranking system. label Jul 21, 2023
@codecov
Copy link

codecov bot commented Jul 21, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (b56689b) 97.62% compared to head (091e304) 97.63%.
Report is 337 commits behind head on master.

❗ Current head 091e304 differs from pull request most recent head 2cc3e54. Consider uploading reports for the commit 2cc3e54 to get more accurate results

Files Patch % Lines
src/calculateRank.js 99.06% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2962      +/-   ##
==========================================
+ Coverage   97.62%   97.63%   +0.01%     
==========================================
  Files          24       24              
  Lines        5182     5249      +67     
  Branches      460      463       +3     
==========================================
+ Hits         5059     5125      +66     
- Misses        122      123       +1     
  Partials        1        1              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@qwerty541 qwerty541 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, the fact that it is now impossible to get a C rank is quite logical. If you look closely at the quantiles, you can see that at the beginning there are zeros everywhere. I would even suggest that a C+ rank would be no less rare, since a new user with empty statistics gets a percentile of 78.26086956521738 in the new system, and the threshold for getting a B rank is only 80. I think that the thresholds need to be changed, at least for the latest ranks. My suggestion is the following:

const THRESHOLDS = [1, 12.5, 25, 37.5, 45, 52.5, 60, 67.5, 75];

@francois-rozet I also want to clarify whether I understood correctly that the number of repositories now does not affect the rank in any way?

The rest looks good to me. Great improvement of ranking system.

I also want you to know that since you became collaborator you can push your branches directly into this repository. It's better because Vercel test deployment workflow will work and will make testing and reviewing easier.

@anuraghazra
Copy link
Owner

anuraghazra commented Aug 17, 2023

Don't have much idea about any of these ranking algorithm.

But whatever you folks think is a better system you can go ahead with but:

  • Make sure to do a perf test once (basic test with performance.now()API) to ensure API responses don't get effected since now we have those lookup arrays
  • Document the changes properly because external people might have no idea why their ranks changed and in that case you can refer them to the docs

@rickstaa @francois-rozet @qwerty541

@rickstaa
Copy link
Collaborator

rickstaa commented Aug 17, 2023

Don't have much idea about any of these ranking algorithm.

But whatever you folks think is a better system you can go ahead with but:

  • Make sure to do a perf test once (basic test with performance.now()API) to ensure API responses don't get effected since now we have those lookup arrays
  • Document the changes properly because external people might have no idea why their ranks changed and in that case you can refer them to the docs

@rickstaa @francois-rozet @qwerty541

@anuraghazra, @francois-rozet I didn't have time yet to review this, but I will review when I finish my master's thesis. Neverteless, maybe we should not add the QUANTILES themselves in the code 🤔. Sticking the weights and means already in the codebase is better.

const COMMITS_MEDIAN = all_commits ? 1000 : 250,
COMMITS_WEIGHT = 2;
const PRS_MEDIAN = 50,
PRS_WEIGHT = 3;
const ISSUES_MEDIAN = 25,
ISSUES_WEIGHT = 1;
const REVIEWS_MEDIAN = 2,
REVIEWS_WEIGHT = 1;
const STARS_MEDIAN = 50,
STARS_WEIGHT = 4;
const FOLLOWERS_MEDIAN = 10,
FOLLOWERS_WEIGHT = 1;
const TOTAL_WEIGHT =
COMMITS_WEIGHT +
PRS_WEIGHT +
ISSUES_WEIGHT +
REVIEWS_WEIGHT +
STARS_WEIGHT +
FOLLOWERS_WEIGHT;

This prevents any performance problems from occurring, and if the code contains a comment that links to a RANK_CALC.md Markdown doc file in which the calculation and quantiles are explained, I think that is enough. What do you think?

The ranking was improved since many users pointed out that it took a lot of work to get a rank other than A+ 😅. There were about three issues/discussions opened by people who were wondering what happened to their rank, but after I explained it to them and showed them that ranking up is now more accessible for the first ranks, no new issues were opened. However, we can document this change and the reasoning in a RANK_CALC.md doc file 👍🏻 .

@francois-rozet
Copy link
Collaborator Author

francois-rozet commented Aug 18, 2023

To be sure, @rickstaa, you think that we should stick to the current system because it is simpler, even though it is a rough approximation of the true quantiles?

@rickstaa
Copy link
Collaborator

To be sure, @rickstaa, you think that we should stick to the current system because it is simpler, even though it is a rough approximation of the true quantiles?

Hey @francois-rozet sorry for being unclear. My suggestion was that we can perform the calculation based on the quantiles offline and then just store the mean and weights in the code. This will prevent any performance issues and keeps the code clean. Then, We can store the calculation and quantiles somewhere and refer to them in a code comment 🤔.

@francois-rozet
Copy link
Collaborator Author

francois-rozet commented Aug 18, 2023

You cannot get the true percentiles just based on the median. Currently, we assume that the statistics follow exponential/log-normal distributions, which implicitly fixes the quantiles with respect to the median/mean. The percentiles we get are basically educated guesses.

If we want the true percentiles, we need the true quantiles.

@qwerty541
Copy link
Collaborator

I have opened pull request #3141 which contains performance tests base implementation. After merging this one and pulling this branch with master we can check how quantiles affect on performance. But personally i do not expect much changes since there are no iterations through these quantiles arrays, it accessed everywhere by indexes.

@francois-rozet
Copy link
Collaborator Author

Actually, there are iterations through the arrays with the .findIndex() calls. But this can be improved from $O(n)$ to $O(\log n)$ by using a bisection search. I can implement that.

@qwerty541
Copy link
Collaborator

@francois-rozet Yes, you're right, i didn't noticed .findIndex() call on 2 line at first look.

https://github.com/anuraghazra/github-readme-stats/pull/2962/files#diff-d6a36eb9dcbdb8c6168a754251ba0c9318c7348ac6b17001cea3ba5aa84af728R2

I only noticed a call of the same function on line 136, but since this THRESHOLDS array contains only 9 elements and unlike from score() function it was called only once, I didn't attach much importance to it.

https://github.com/anuraghazra/github-readme-stats/pull/2962/files#diff-d6a36eb9dcbdb8c6168a754251ba0c9318c7348ac6b17001cea3ba5aa84af728R136

It would be great if you implement that.

I hope my code below will help.

function findIndex(arr, target) {
  let left = 0;
  let right = arr.length - 1;
  let result = -1;

  while (left <= right) {
    const mid = Math.floor((left + right) / 2);

    if (arr[mid] < target) {
      result = mid; // Update the result and continue searching on the right half
      left = mid + 1;
    } else {
      right = mid - 1; // Search on the left half
    }
  }

  return result;
}

const sortedArray =  [
    0, 0, 0, 0, 2, 4, 8, 11, 15, 19, 23, 27, 32, 36, 41, 45, 50, 55, 60, 65, 71,
    76, 82, 87, 93, 99, 105, 111, 117, 124, 131, 137, 145, 151, 159, 166, 174,
    182, 190, 198, 207, 215, 225, 234, 244, 253, 264, 274, 285, 296, 306, 318,
    330, 342, 355, 368, 382, 396, 409, 424, 440, 457, 475, 493, 512, 531, 551,
    570, 593, 618, 643, 667, 695, 723, 752, 784, 815, 857, 893, 934, 984, 1037,
    1094, 1152, 1217, 1289, 1379, 1475, 1576, 1696, 1851, 2023, 2232, 2480,
    2835, 3242, 3885, 4868, 6614, 11801, 792319,
  ];
const targetValue = 2333;
const lowerIndex = findIndex(sortedArray, targetValue);

if (lowerIndex !== -1) {
  console.log(`Index of the first value lower than ${targetValue} is: ${lowerIndex}. The value is: ${sortedArray[lowerIndex]}`);
} else {
  console.log(`No value lower than ${targetValue} found.`);
}

@rickstaa
Copy link
Collaborator

@francois-rozet, @anuraghazra just a small heads up that I just merged @qwerty541 pull requests which allows us to safeguard against performance issues.

@francois-rozet
Copy link
Collaborator Author

Hello @anuraghazra, @rickstaa, @qwerty541, I have (finally) replaced the linear-time search by a log-time bisection search. However, thinking back, I am now a bit skeptical about this PR. It would great if the percentiles were purely "data-driven", but I feel like it is not worth the increase of code complexity (and code (and data) maintenance). Users seem to be quite happy with the current ranking system, which is easy to understand and maintain, even if it is slightly off.

I believe #2637 is a much more important issue for the project.

@@ -1,12 +1,114 @@
function exponential_cdf(x) {
return 1 - 2 ** -x;
function searchSorted(arr, x) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the size of the list is small and constant, it may be interesting to use a dictionary instead of a list and a "complex" find algorithm ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how you would use a dictionary to find the index i for which arr[i - 1] <= x < arr[i].

@francois-rozet francois-rozet closed this by deleting the head repository Aug 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ranks Feature, Bug fix, improvement related to ranking system.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants