[BREAKING CHANGE] New Top Language Detection Method #1027
Replies: 37 comments 1 reply
-
Not sure, but the current method seems right to me. Imagine having 2 HTML/JS repositories with more than 51% HTML in both. According to the newly proposed method, I wouldn't know JavaScript. So in my opinion, the current method makes more sense.
-
@saurabhdaware We would not just take the primary language of each repo; we would also calculate the top 10 languages of each repo, which is what we are already doing. So it would look like: HTML ---------- some%
-
I am not sure I understood. Wouldn't calculating the top 10 languages of each repository be the same as calculating how much code the user has in bytes? That's how GitHub calculates the percentage as well, no?
-
No, we would just "count" them. In the current method we take each "language.size", reduce (sum) them per language, and then sort the result.
-
Oh OK, so in this example
I would have
-
Repo 1 - JS x1 & HTML x1
HTML - 50%
We would just count them.
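To make the difference concrete, here is a minimal sketch in Python (not the project's actual code) that runs both methods on the two-repository HTML/JS scenario from earlier in the thread. The byte sizes and the function names `top_langs_by_size` and `top_langs_by_count` are made up for illustration.

```python
from collections import Counter

# Hypothetical data: each repo maps a language name to its size in bytes.
repos = [
    {"HTML": 5100, "JavaScript": 4900},
    {"HTML": 5100, "JavaScript": 4900},
]

def to_percentages(totals):
    """Convert raw totals into percentages, largest first."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {lang: round(100 * value / grand_total, 1) for lang, value in ranked}

def top_langs_by_size(repos):
    """Current method: sum the byte size of each language across repos."""
    totals = Counter()
    for languages in repos:
        for lang, size in languages.items():
            totals[lang] += size
    return to_percentages(totals)

def top_langs_by_count(repos):
    """Proposed method: count how many repos each language appears in."""
    totals = Counter()
    for languages in repos:
        for lang in languages:
            totals[lang] += 1
    return to_percentages(totals)

print(top_langs_by_size(repos))   # {'HTML': 51.0, 'JavaScript': 49.0}
print(top_langs_by_count(repos))  # {'HTML': 50.0, 'JavaScript': 50.0}
```

With counting, one very large repository carries the same weight as a small one, so a single big project can no longer dominate the card.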
-
Oh cool. Seems good to me then.
-
I think I like this new way better. I did one very large C# project earlier this year and now it says C# is my top language (over 52%) even though it is definitely not what I do the most of.
-
If there are people who are happier with the current way, could there be an additional parameter that switches between the different calculation modes?
-
Unfortunately we cannot do that; it would make the logic complex and we would have two different statistics. It would hamper consistency.
-
I will first publish an experimental query param to enable this, and then if people like it I will make it the default.
-
I personally wouldn't use this new one; I think the current one is better.
You could exclude the language, but in my PR (#480) I am making it so you can just exclude a repo, which is probably better.
-
@Bas950 Yeah, I can understand what you are saying, but the main reason so many people use github-readme-stats is its simplicity and ease of use. Of course we can add "exclude_repo" options and make it better, but not many people have the time or patience to go through all of their repositories, check which ones have vendor code, and exclude them one by one. That is also impractical for users who have a lot of repos. This is why I'm considering this new approach, which would mitigate these issues.
-
@anuraghazra Does the current method only look at repos owned by the user or does it include other repos the user has contributed to?
-
Please check issue #450, where I've provided a GraphQL query (I don't remember if it works or if I tested it) that shows the most starred repos. An alternative may be to get the repos with the most commits (but there is no ordering by number of commits yet). Using the default 100 repos is stupid because the order can be random, and those repos can contain code to which you didn't add a single commit and which you got from somewhere else. It's not only forks that have forked code: if the code is not on GitHub you can't fork it, so you need to copy the code into your own repo.
-
I think the suggestion (in that ticket) to use
I just ran it through the graphql explorer, it works 👍
For such repos, I would suggest using the
Not stupid. Easy. The API won't let you get more in a single request, so multiple requests would need to be made.
-
By stupid I mean taking the default 100 with sorting like this. Even 10 repos is better if the sorting is done right: most starred repos, repos with the most commits, or maybe the most recent commits. Anything but the default order, which looks random with a fixed seed.
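As a rough illustration of "fewer repos, better sorting", here is a small Python sketch that keeps only the most starred repos before any language statistics are computed. The `stars` field and the `pick_repos` helper are hypothetical, and in practice the ordering would more likely be requested from the API than applied client-side.

```python
def pick_repos(repos, limit=10):
    """Keep the `limit` most starred repos instead of relying on default order."""
    return sorted(repos, key=lambda repo: repo["stars"], reverse=True)[:limit]

# Hypothetical repo records; only the fields used here are shown.
repos = [
    {"name": "a", "stars": 3},
    {"name": "b", "stars": 120},
    {"name": "c", "stars": 15},
]

print([repo["name"] for repo in pick_repos(repos, limit=2)])  # ['b', 'c']
```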
-
Thank you for clarifying. 🙌
Yes! I completely agree with this! I was thinking about this some more and I think the main issue here is conflicting use-cases... The GH API gives language stats for Repositories, not for Users. It might be easier to add a separate card and/or split the use-case to support both sides? @anuraghazra If @jcubic and I were to set up some examples (using various queries and parameters mentioned in this thread), would you be available to play around with them? We could ask some of the other people in this issue (and the linked ones) for feedback as well. It would be a shame to let all the hard work and thought that went into this issue come to nothing...
-
@Potherca Feel free to experiment with different ways to make it more accurate. I can surely take a look at them and give some feedback.
-
Another possible way to count language stats is by using GitHub's search API: https://docs.github.com/en/rest/reference/search#search-code
-
See this API:
https://codetabs.com/count-loc/count-loc-online.html
It could be helpful for taking into consideration the total lines of code of a specific repository and its language.
-
It will not work for the case where someone's repo is not a git fork but a copy with a single commit. I have one repo like this that is written in C and Lua, and it will give me those languages even though I've never written a single line in them.
-
Is the proposed solution just to count the languages which appear most in the GitHub API response?
-
No, that sucks: #481 (comment)
-
Is it that bad though? I guess the problem here is that we are unsure what we want these numbers to represent. I wasn't expecting these numbers to represent the distribution of the total number of lines I wrote in each language. That would also be a bad estimate, as in C I write many more lines for a simple sum than in Python. That doesn't necessarily mean I do more C than Python. Again, it really depends on what one wants these estimates to measure. At least for me, doing multiple projects, it would be cool to have an estimate stating, repository-wise, what the most common language you have used is. Initially, this is actually what I thought these numbers represented. But there are scenarios where such a measure might be suboptimal, as mentioned above, if one assumes otherwise. I do not think there is an optimum here that suits all users. Perhaps it could be an option to support both designs, or even multiple? That would at least solve my issue, and thus make me happy :] But perhaps having more than one design that estimates these measures might introduce even more noise into how to interpret these values... Idk anymore
-
It'd also be nice to be able to exclude forks. My card
-
I agree with @theLMGN. I have a fork among my repos, which I haven't contributed to yet, but which apparently contains a shit ton of C#, which I have never used. This results in my "Most used languages" being roughly 80% C#, which is sort of funny considering I have contributed to roughly 40 open repos, which are Python/C++. I'm also wondering if C# and C++ are switched, or if C++ code is misinterpreted as C# in github-readme-stats. According to github-readme-stats I do not do any C++, but I have contributed to C++ projects (forks).
-
@theLMGN Couldn't you just use the exclude_repo option to exclude that one repository? Since you only did a minor PR, including this repo in the calculation is probably not necessary?
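A minimal Python sketch of what excluding repos and forks before the calculation could look like. The `exclude_repo` name mirrors the option mentioned in this thread; the `include_forks` flag, the `is_fork` field, and the `filter_repos` helper are assumptions for illustration, not the project's actual implementation.

```python
def filter_repos(repos, exclude_repo=(), include_forks=False):
    """Drop explicitly excluded repos and, unless asked otherwise, forks."""
    excluded = set(exclude_repo)
    return [
        repo
        for repo in repos
        if repo["name"] not in excluded and (include_forks or not repo["is_fork"])
    ]

# Hypothetical repo records; only the fields used here are shown.
repos = [
    {"name": "my-site", "is_fork": False},
    {"name": "big-csharp-fork", "is_fork": True},
    {"name": "vendored-copy", "is_fork": False},
]

kept = filter_repos(repos, exclude_repo=["vendored-copy"])
print([repo["name"] for repo in kept])  # ['my-site']
```

Note that a fork filter alone would not catch a repo that is a plain copy of someone else's code, which is the case raised earlier in the thread; that one still needs an explicit exclusion.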
-
Why not add an argument
-
As you all might know, there are various bugs/issues regarding the top languages calculation.
The problem
The main issue I see is that people often get confused by how the calculations are done.
Currently the top languages are calculated based on how much code, in bytes, you have in each language, and then we choose the top languages.
This method is the main reason people are confused about the calculations, because normally users perceive how much they code in languages by how many repositories they have with that particular language.
Quirks with the current calculation method
The Solution
The most straightforward solution I see is that instead of calculating how much code users have in each language, we can count how many of their repositories use each language.
Related issues
#432 #403 #270 #136 #358