Switch to data-driven percentiles #2962
Conversation
@francois-rozet is attempting to deploy a commit to the github readme stats Team on Vercel. A member of the Team first needs to authorize it.
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #2962      +/-   ##
==========================================
+ Coverage   97.62%   97.63%   +0.01%
==========================================
  Files          24       24
  Lines        5182     5249      +67
  Branches      460      463       +3
==========================================
+ Hits         5059     5125      +66
- Misses        122      123       +1
  Partials        1        1

View full report in Codecov by Sentry.
In general, the fact that it is now impossible to get a C rank is quite logical. If you look closely at the quantiles, you can see that there are zeros everywhere at the beginning. I would even suggest that a C+ rank would be no less rare, since a new user with empty statistics gets a percentile of 78.26086956521738 in the new system, and the threshold for getting a B rank is only 80. I think that the thresholds need to be changed, at least for the last ranks. My suggestion is the following:
const THRESHOLDS = [1, 12.5, 25, 37.5, 45, 52.5, 60, 67.5, 75];
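For context, here is a minimal sketch of how such cumulative thresholds could map a percentile to a rank letter. The LEVELS array, the fallback for percentiles above the last threshold, and the function name are assumptions for illustration, not the actual code in calculateRank.js:

// Hypothetical mapping from percentile to rank letter. LEVELS and the
// pairing with the thresholds are assumed for this sketch.
const LEVELS = ["S", "A+", "A", "A-", "B+", "B", "B-", "C+", "C"];
const SUGGESTED_THRESHOLDS = [1, 12.5, 25, 37.5, 45, 52.5, 60, 67.5, 75];

function rankFromPercentile(percentile, thresholds, levels) {
  // Return the level of the first threshold the percentile does not exceed;
  // anything above the last threshold falls through to the lowest level.
  const index = thresholds.findIndex((t) => percentile <= t);
  return index === -1 ? levels[levels.length - 1] : levels[index];
}

// With the suggested thresholds, a new user with an empty profile
// (percentile ~78.26) falls past the last threshold (75) and receives
// the lowest level instead of reaching B.
console.log(rankFromPercentile(78.26086956521738, SUGGESTED_THRESHOLDS, LEVELS));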
@francois-rozet I also want to clarify: did I understand correctly that the number of repositories no longer affects the rank in any way?
github-readme-stats/src/calculateRank.js
Line 84 in 091e304

repos: 0.0,
The rest looks good to me. Great improvement of the ranking system.
I also want you to know that, since you became a collaborator, you can push your branches directly to this repository. That is better because the Vercel test deployment workflow will run, which makes testing and reviewing easier.
I don't have much idea about any of these ranking algorithms. But whatever you folks think is a better system, you can go ahead with it, but:
@anuraghazra, @francois-rozet I didn't have time yet to review this, but I will review it when I finish my master's thesis. Nevertheless, maybe we should not add the QUANTILES themselves in the code 🤔. Sticking with the weights and means already in the codebase is better.

github-readme-stats/src/calculateRank.js
Lines 35 to 54 in 24ac78b
This prevents any performance problems from occurring, and if the code contains a comment that links to a …

The ranking was improved since many users pointed out that it took a lot of work to get a rank other than A+ 😅. There were about three issues/discussions opened by people who were wondering what happened to their rank, but after I explained it to them and showed them that ranking up is now more accessible for the first ranks, no new issues were opened. However, we can document this change and the reasoning in a …
To be sure, @rickstaa, do you think that we should stick to the current system because it is simpler, even though it is a rough approximation of the true quantiles?
Hey @francois-rozet, sorry for being unclear. My suggestion was that we can perform the calculation based on the quantiles offline and then just store the …
You cannot get the true percentiles just based on the median. Currently, we assume that the statistics follow exponential/log-normal distributions, which implicitly fixes the quantiles with respect to the median/mean. The percentiles we get are basically educated guesses. If we want the true percentiles, we need the true quantiles.
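To make the contrast concrete, here is a rough sketch (not the code in this PR; the median constant and the quantile table below are placeholders) of a distribution-based percentile versus a data-driven one:

// Model-based guess: assume the statistic follows an exponential
// distribution, whose CDF is pinned down entirely by the median.
function exponential_cdf(x) {
  return 1 - 2 ** -x;
}

const COMMITS_MEDIAN = 250; // placeholder constant, not the real value

function modelPercentile(commits) {
  // Educated guess: the percentile implied by the assumed distribution.
  return exponential_cdf(commits / COMMITS_MEDIAN) * 100;
}

// Data-driven estimate: look the value up in a table of true quantiles
// (one entry per percentile, computed offline from real user data).
const COMMITS_QUANTILES = [0, 2, 8, 23, 60, 151, 396, 1037, 11801]; // placeholder
function dataPercentile(commits, quantiles) {
  let i = 0;
  while (i < quantiles.length && quantiles[i] <= commits) i += 1;
  return (i / quantiles.length) * 100;
}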
I have opened pull request #3141, which contains a base implementation of performance tests. After merging that one and updating this branch with master, we can check how the quantiles affect performance. But personally, I do not expect much change, since there are no iterations through these quantile arrays; they are accessed everywhere by index.
Actually, there are iterations through the arrays with the …
@francois-rozet Yes, you're right, I didn't notice … I only noticed a call of the same function on line 136, but since this … It would be great if you implement that. I hope my code below will help.

function findIndex(arr, target) {
  let left = 0;
  let right = arr.length - 1;
  let result = -1;

  while (left <= right) {
    const mid = Math.floor((left + right) / 2);

    if (arr[mid] < target) {
      // arr[mid] is still below the target: remember it and keep
      // searching the right half for a larger candidate.
      result = mid;
      left = mid + 1;
    } else {
      // arr[mid] is at or above the target: search the left half.
      right = mid - 1;
    }
  }

  return result;
}
const sortedArray = [
0, 0, 0, 0, 2, 4, 8, 11, 15, 19, 23, 27, 32, 36, 41, 45, 50, 55, 60, 65, 71,
76, 82, 87, 93, 99, 105, 111, 117, 124, 131, 137, 145, 151, 159, 166, 174,
182, 190, 198, 207, 215, 225, 234, 244, 253, 264, 274, 285, 296, 306, 318,
330, 342, 355, 368, 382, 396, 409, 424, 440, 457, 475, 493, 512, 531, 551,
570, 593, 618, 643, 667, 695, 723, 752, 784, 815, 857, 893, 934, 984, 1037,
1094, 1152, 1217, 1289, 1379, 1475, 1576, 1696, 1851, 2023, 2232, 2480,
2835, 3242, 3885, 4868, 6614, 11801, 792319,
];
const targetValue = 2333;
const lowerIndex = findIndex(sortedArray, targetValue);
if (lowerIndex !== -1) {
  console.log(
    `Index of the largest value lower than ${targetValue} is: ${lowerIndex}. The value is: ${sortedArray[lowerIndex]}`,
  );
} else {
  console.log(`No value lower than ${targetValue} found.`);
}
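Building on the findIndex and sortedArray defined just above, and assuming the array holds one value per percentile (101 entries for percentiles 0 through 100), the returned index can be read directly as a lower-bound percentile estimate:

// Hypothetical usage: if sortedArray[i] is the i-th percentile value of
// the commit counts, the index of the largest quantile below a user's
// count is roughly that user's percentile.
const userCommits = 2333;
const approxPercentile = findIndex(sortedArray, userCommits);
// 2333 sits just above sortedArray[92] = 2232, so this prints ~92.
console.log(`${userCommits} commits is around the ${approxPercentile}th percentile`);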
@francois-rozet, @anuraghazra just a small heads up that I merged @qwerty541's pull request, which allows us to safeguard against performance issues.
Hello @anuraghazra, @rickstaa, @qwerty541, I have (finally) replaced the linear-time search with a log-time bisection search. However, thinking back, I am now a bit skeptical about this PR. It would be great if the percentiles were purely "data-driven", but I feel like it is not worth the increase in code complexity (and code and data maintenance). Users seem to be quite happy with the current ranking system, which is easy to understand and maintain, even if it is slightly off. I believe #2637 is a much more important issue for the project.
@@ -1,12 +1,114 @@
function exponential_cdf(x) {
  return 1 - 2 ** -x;
function searchSorted(arr, x) {
As the size of the list is small and constant, it may be interesting to use a dictionary instead of a list and a "complex" find algorithm?
I am not sure how you would use a dictionary to find the index i for which arr[i - 1] <= x < arr[i].
This PR modifies the ranking system to use the actual quantiles (see #2857) of the statistics. This leads to significantly different, but generally higher, ranks for users. In particular, it is now impossible to get a C rank (the minimum is C+). An alternative that would lower the ranks is to filter out inactive users before computing the quantiles.
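For illustration, here is a minimal sketch of how inactive users could be filtered out before computing the quantiles offline; the users shape, the activity criterion, and the helper below are assumptions, not part of this PR:

// Compute one quantile per percentile (0 through 100) from raw values,
// after dropping users whose statistics are all zero.
function computeQuantiles(values, steps = 101) {
  const sorted = [...values].sort((a, b) => a - b);
  return Array.from({ length: steps }, (_, p) => {
    const idx = Math.round((p / (steps - 1)) * (sorted.length - 1));
    return sorted[idx];
  });
}

const users = [
  { commits: 0, prs: 0, issues: 0, stars: 0 }, // inactive, would be dropped
  { commits: 120, prs: 4, issues: 2, stars: 10 },
  { commits: 800, prs: 30, issues: 12, stars: 250 },
];

// Keep only users with at least one non-zero statistic.
const active = users.filter((u) => Object.values(u).some((v) => v > 0));
const commitQuantiles = computeQuantiles(active.map((u) => u.commits));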
What do you think @rickstaa, @qwerty541?