[BREAKING CHANGE] New Top Language Detection Method #1027

anuraghazra · 2020-09-20T07:38:27Z

anuraghazra
Sep 20, 2020
Maintainer

As you all might know there are various bugs/issues regarding the top languages calculation.

The problem

The main issue I see is that people often get confused by how the calculations are done.

Currently the top languages are calculated based on how much code in bytes you have in a particular language and then we choose the top languages.
This method is the main reason people are confused about the calculations, because normally users perceive how much they code in languages by how many repositories they have with that particular language.
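
To make the byte-based method concrete, here is a minimal sketch. The sample data and the function name `topLanguagesBySize` are made up for illustration, and this is not the project's actual fetcher code. It shows how one large repository can outweigh several smaller ones.

```javascript
// Hypothetical sample data: each repository lists its languages
// together with a size in bytes, as GitHub reports them.
const repos = [
  { name: "site", languages: [{ name: "HTML", size: 9000 }, { name: "JavaScript", size: 1000 }] },
  { name: "cli", languages: [{ name: "JavaScript", size: 3000 }] },
  { name: "lib", languages: [{ name: "JavaScript", size: 2000 }] },
];

// Sum the bytes per language across all repositories, then sort descending.
function topLanguagesBySize(repos) {
  const totals = new Map();
  for (const repo of repos) {
    for (const lang of repo.languages) {
      totals.set(lang.name, (totals.get(lang.name) || 0) + lang.size);
    }
  }
  return [...totals.entries()]
    .map(([name, size]) => ({ name, size }))
    .sort((a, b) => b.size - a.size);
}

console.log(topLanguagesBySize(repos));
```

In this sample, HTML ranks first with 9000 bytes even though JavaScript (6000 bytes) appears in all three repositories.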

Quirks with the current calculation method

Confusion about the calculations. (Top Languages Card not working properly #136 (comment))
Repositories might have vendor code or auto-generated code, which would make the calculations wrong. ([Top Languages] Blog repositories(xxxx.github.io) should not be counted. #153)
If some language has an exaggerated byte count, it becomes the dominant language. (Top language card not showing Python #358 (comment))
Users aren't satisfied with the method.

The Solution

The most straightforward solution I see is that, instead of calculating how much code users have, we can count how many repositories they have with each language.

Related issues

#432 #403 #270 #136 #358

saurabhdaware · 2020-09-20T07:54:54Z

saurabhdaware
Sep 20, 2020

Not sure, but the current method seems right to me. Imagine having 2 HTML, JS repositories with more than 51% HTML in both.

According to the new proposed method, I wouldn't know JavaScript.

So in my opinion, the current method makes more sense.


anuraghazra · 2020-09-20T08:01:48Z

anuraghazra
Sep 20, 2020
Maintainer Author

@saurabhdaware We would not just take the primary language of the individual repo, we would also calculate top 10 langs of individual repos too.

That is what we are already doing:

github-readme-stats/src/fetchers/top-languages-fetcher.js

Line 14 in 6e73a00

languages(first: 10, orderBy: {field: SIZE, direction: DESC}) {

So it would look like:

HTML ---------- some%
JavaScript ------ some%


saurabhdaware · 2020-09-20T08:11:21Z

saurabhdaware
Sep 20, 2020

I am not sure if I understood. Wouldn't counting the top 10 languages of each repository be the same as calculating how much code the user has in bytes? That's how GitHub calculates the percentage as well, no?


anuraghazra · 2020-09-20T08:14:56Z

anuraghazra
Sep 20, 2020
Maintainer Author

No, we would just "count" them. In the current method we take each "language.size", sum the sizes per language with a reduce, and then sort the result.


saurabhdaware · 2020-09-20T08:23:16Z

saurabhdaware
Sep 20, 2020

Oh ok so in this example

Imagine having 2 HTML, JS repositories with more than 51% HTML in both.

I would have
51% HTML
49% JavaScript
right?


anuraghazra · 2020-09-20T08:26:41Z

anuraghazra
Sep 20, 2020
Maintainer Author

Repo 1 - JS x1 & HTML x1
Repo 2 - JS x1 & HTML x1

HTML - 50%
JS - 50%

We would just count them.
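
A minimal sketch of that counting idea, using the two-repo example above. The data shape and the name `topLanguagesByCount` are assumptions for illustration only, not the project's code.

```javascript
// The example from this comment: both repositories contain JS and HTML.
const repos = [
  { name: "repo-1", languages: ["JavaScript", "HTML"] },
  { name: "repo-2", languages: ["JavaScript", "HTML"] },
];

// Count one occurrence per (repository, language) pair, ignoring byte
// sizes, then turn the counts into percentages of all occurrences.
function topLanguagesByCount(repos) {
  const counts = new Map();
  for (const repo of repos) {
    for (const lang of repo.languages) {
      counts.set(lang, (counts.get(lang) || 0) + 1);
    }
  }
  const total = [...counts.values()].reduce((sum, n) => sum + n, 0);
  return [...counts.entries()]
    .map(([name, count]) => ({ name, count, percent: (count / total) * 100 }))
    .sort((a, b) => b.count - a.count);
}

console.log(topLanguagesByCount(repos));
```

Each language occurs twice out of four occurrences in total, so both come out at 50% no matter how the bytes are split inside each repository.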


saurabhdaware · 2020-09-20T09:40:39Z

saurabhdaware
Sep 20, 2020

Oh cool. Seems good to me then.


DenverCoder1 · 2020-09-21T09:50:15Z

DenverCoder1
Sep 21, 2020

I think I like this new way better. I did one very large C# project earlier this year and now it says C# is my top language (over 52%) even though it is definitely not what I do the most of.


DenverCoder1 · 2020-09-21T09:53:46Z

DenverCoder1
Sep 21, 2020

If there are people who are happier with the current way, there could possibly be an additional parameter that switches between the different calculation modes?


anuraghazra · 2020-09-21T14:04:45Z

anuraghazra
Sep 21, 2020
Maintainer Author

If there are people who are happier with the current way, there could possibly be an additional parameter that switches between the different calculation modes?

Unfortunately we cannot do that. It would make the logic complex, and we would have two different statistics, which would hamper consistency.


anuraghazra · 2020-09-21T14:05:58Z

anuraghazra
Sep 21, 2020
Maintainer Author

I will first publish an experimental query param to enable this, and then if people like it I will make it the default.


Bas950 · 2020-09-23T19:31:05Z

Bas950
Sep 23, 2020

I personally wouldn't use this new one; I think the current one is better.

I think I like this new way better. I did one very large C# project earlier this year and now it says C# is my top language (over 52%) even though it is definitely not what I do the most of.

You could exclude the language, but in my PR (#480) I am making it so you can just exclude a repo. Which is probs better.


anuraghazra · 2020-09-24T06:18:23Z

anuraghazra
Sep 24, 2020
Maintainer Author

@Bas950 Yeah, I can understand what you are saying, but the main reason why so many people use github-readme-stats is its simplicity and ease of use. Of course we can add "exclude_repo" options and make it better, but not many people have the time/patience to go through all of their repositories, check which ones have vendor code, and exclude them one by one. Not to mention that this is impractical for users who have a lot of repos.

So this is why I'm considering this new approach, which would mitigate these issues.


crazy-max · 2020-09-24T09:10:15Z

crazy-max
Sep 24, 2020

@anuraghazra Check https://github.com/github/linguist

1 reply

cicirello Jul 15, 2021

@anuraghazra definitely check out linguist. Some of the issues with language stats can be handled with no changes to github-readme-stats itself. For example, Linguist automatically attempts to exclude vendored code, documentation, and some other stuff from what is reported for a repo. If it gets it wrong, either including something that should be excluded or the opposite, you can configure linguist for that repo in a .gitattributes file. Updating the docs for github-readme-stats to explain this and link to the relevant part of the linguist docs would partially cover what users want.

mchelen-gov · 2020-09-24T18:05:43Z

mchelen-gov
Sep 24, 2020

Quirks with the current calculation method

Confusion about the calculations. (#136 (comment))

Repositories might have vendor code or auto-generated code, which would make the calculations wrong. ([Top Languages] Blog repositories(xxxx.github.io) should not be counted. #153)

If some language has an exaggerated byte count, it becomes the dominant language. (#358 (comment))

Users aren't satisfied with the method.

@anuraghazra Does the current method only look at repos owned by the user or does it include other repos the user has contributed to?


jcubic · 2021-01-06T15:01:51Z

jcubic
Jan 6, 2021

Please check issue #450, where I've provided a GraphQL query (I don't remember if it works or if I tested it) that shows the most starred repos. An alternative may be to get the repos with the most commits (but there is no ordering by number of commits yet).

Using the default 100 repos is stupid because the order can be random, and those repos can contain code to which you haven't added a single commit and which you got from somewhere else. Not only forks have forked code: if the code is not on GitHub you can't fork it, so you need to copy the code into your own repo.


Potherca · 2021-01-30T21:08:26Z

Potherca
Jan 30, 2021

Please check issue #450, where I've provided a GraphQL query

I think the suggestion (in that ticket) to use orderBy might be a good one. The next issue will, of course, be "Sort by what?", which will undoubtedly lead to "I want to sort by X, not Y, can you make it configurable?". But I think the basic premise is a good addition to resolving this rather sticky puzzle.

(I don't remember if it works or if I tested it)

I just ran it through the graphql explorer, it works 👍

those repos can contain code to which you haven't added a single commit and which you got from somewhere else. Not only forks have forked code: if the code is not on GitHub you can't fork it, so you need to copy the code into your own repo.

For such repos, I would suggest using the exclude_repo setting (or just creating a separate org and moving such repos there).
I don't think we should really expect such a level of intelligence from a project such as this. I think the KISS principle would apply here.
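
As a rough illustration of how an exclusion list keeps things simple, here is a sketch of filtering repositories by name before any language aggregation. The helper names and the comma-separated format are assumptions, not the project's actual implementation of exclude_repo.

```javascript
// Parse a comma-separated list such as "vendor-copy, old-fork".
// The comma-separated format is an assumption for this sketch.
function parseList(value) {
  return (value || "")
    .split(",")
    .map((item) => item.trim())
    .filter(Boolean);
}

// Drop every repository whose name appears in the exclusion list.
function excludeRepos(repos, excludeList) {
  const excluded = new Set(excludeList);
  return repos.filter((repo) => !excluded.has(repo.name));
}

// Hypothetical sample data.
const repos = [{ name: "my-app" }, { name: "vendor-copy" }, { name: "notes" }];
console.log(excludeRepos(repos, parseList("vendor-copy, old-fork")));
```

Names in the list that match no repository are simply ignored, so the user does not need to keep the list perfectly in sync.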

Using the default 100 repos is stupid

Not stupid. Easy. The API won't let you get more in a single request, so multiple requests would need to be made.
Making more calls means more code, more work, potentially more issues, etc.
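
For a sense of what "multiple requests" would involve, here is a sketch of a cursor-based pagination loop. `fetchPage` is a hypothetical function supplied by the caller, and the page shape (`nodes`, `pageInfo.hasNextPage`, `pageInfo.endCursor`) follows GitHub's GraphQL connection conventions. The in-memory pages stand in for real API responses.

```javascript
// Collect nodes from successive pages until the API reports no more
// pages or a safety limit is reached.
async function fetchAllRepos(fetchPage, maxPages = 5) {
  const all = [];
  let cursor = null; // null requests the first page
  for (let page = 0; page < maxPages; page++) {
    const { nodes, pageInfo } = await fetchPage(cursor);
    all.push(...nodes);
    if (!pageInfo.hasNextPage) break;
    cursor = pageInfo.endCursor;
  }
  return all;
}

// In-memory stand-in for the API: two pages keyed by cursor.
const pages = {
  first: { nodes: ["repo-a", "repo-b"], pageInfo: { hasNextPage: true, endCursor: "cursor-1" } },
  "cursor-1": { nodes: ["repo-c"], pageInfo: { hasNextPage: false, endCursor: null } },
};
const fakeFetchPage = async (cursor) => pages[cursor === null ? "first" : cursor];

fetchAllRepos(fakeFetchPage).then((names) => console.log(names));
```

Even this small loop adds a page limit, cursor handling, and extra failure modes per request, which is the trade-off described above.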


jcubic · 2021-01-30T21:23:24Z

jcubic
Jan 30, 2021

By stupid I mean default 100 by sorting like this. Even 10 repos is better if the sorting is done right, most faved repos or repos with most commits maybe most recent commits. Anything but default order which looks like random with fixed seed.
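
A sketch of what an explicitly ordered query could look like. The builder function and its defaults are hypothetical; the field names come from GitHub's GraphQL schema, where repositories can be ordered by CREATED_AT, UPDATED_AT, PUSHED_AT, NAME, or STARGAZERS (there is no ordering by commit count).

```javascript
// Order fields accepted by GitHub's RepositoryOrder input.
const ORDER_FIELDS = ["CREATED_AT", "UPDATED_AT", "PUSHED_AT", "NAME", "STARGAZERS"];

// Build a query for a user's repositories in an explicit order,
// including the top 10 languages of each repository by size.
function buildRepoQuery({ first = 10, field = "STARGAZERS", direction = "DESC" } = {}) {
  if (!ORDER_FIELDS.includes(field)) {
    throw new Error(`Unsupported order field: ${field}`);
  }
  if (direction !== "ASC" && direction !== "DESC") {
    throw new Error(`Unsupported direction: ${direction}`);
  }
  return `
    query userInfo($login: String!) {
      user(login: $login) {
        repositories(ownerAffiliations: OWNER, first: ${first}, orderBy: {field: ${field}, direction: ${direction}}) {
          nodes {
            name
            languages(first: 10, orderBy: {field: SIZE, direction: DESC}) {
              edges { size node { name } }
            }
          }
        }
      }
    }`;
}

console.log(buildRepoQuery({ first: 10, field: "PUSHED_AT" }));
```

With `field: "PUSHED_AT"` this asks for the 10 most recently pushed repositories, which is one way to approximate "most recent commits".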


Potherca · 2021-01-31T08:48:25Z

Potherca
Jan 31, 2021

Thank you for clarifying. 🙌

Even 10 repos is better if the sorting is done right

Yes! I completely agree with this!

I was thinking about this some more and I think the main issue here is conflicting use-cases... The GH API gives language stats for Repositories, not for Users. It might be easier to add a separate card and/or split the use-case to support both sides?

@anuraghazra If @jcubic and myself were to set up some examples (using various queries and parameters mentioned in this thread), would you be available to play around with them? We could ask some of the other people in this (and linked) issues for feedback as well.

It would be a shame to let all the hard work and thoughts that went into this issue come to nothing...


anuraghazra · 2021-01-31T10:12:11Z

anuraghazra
Jan 31, 2021
Maintainer Author

anuraghazra If jcubic and myself were to set up some examples (using various queries and parameters mentioned in this thread), would you be available to play around with them? We could ask some of the other people in this (and linked) issues for feedback as well.

@Potherca Feel free to experiment with different ways to make it more accurate. I can surely take a look at them and give some feedback.


anuraghazra · 2021-01-31T10:20:34Z

anuraghazra
Jan 31, 2021
Maintainer Author

Another possible way to count language stats is by using GitHub's search API: https://docs.github.com/en/rest/reference/search#search-code


ghost · 2021-01-31T15:15:45Z

ghost
Jan 31, 2021

See this API: https://codetabs.com/count-loc/count-loc-online.html It could be helpful for taking into consideration the total lines of code of a specific repository and its languages.


jcubic · 2021-01-31T15:24:19Z

jcubic
Jan 31, 2021

It will not work for the case where someone has a repo that is not a git fork but a copy with a single commit. I have one repo like this that is written in C and Lua, and it will give me those languages even though I've never written a single line in them.


mushahidq · 2021-03-25T05:02:09Z

mushahidq
Mar 25, 2021

Repo 1 - JS x1 & HTML x1
Repo 2 - JS x1 & HTML x1

HTML - 50%
JS - 50%

We would just count them.

Is the proposed solution, just to count the languages which appear most in the result of the GitHub API response?


anuraghazra · 2021-03-25T14:13:24Z

anuraghazra
Mar 25, 2021
Maintainer Author

Repo 1 - JS x1 & HTML x1
Repo 2 - JS x1 & HTML x1
HTML - 50%
JS - 50%
We would just count them.

Is the proposed solution, just to count the languages which appear most in the result of the GitHub API response?

No that sucks #481 (comment)


andreped · 2021-03-30T23:47:32Z

andreped
Mar 30, 2021

Repo 1 - JS x1 & HTML x1
Repo 2 - JS x1 & HTML x1
HTML - 50%
JS - 50%
We would just count them.

Is the proposed solution, just to count the languages which appear most in the result of the GitHub API response?

No that sucks #481 (comment)

Is it that bad though? I guess the problem here is that we are unsure what we want these numbers to represent. I wasn't expecting these numbers to represent the distribution of the total number of lines I wrote in each language. That would also be a bad estimate, as in C I write many more lines for a simple sum than in Python. That doesn't necessarily mean I do more C than Python. Again, it really depends on what one wants these estimates to measure.

At least for me, doing multiple projects, it would be cool to have an estimate stating repository-wise what the most common language you have used is. Initially, this is actually what I thought these numbers represented. But there are scenarios where such a measure might be suboptimal, as mentioned above, if one assumes otherwise.

I do not think there is an optimum here that suits all users. Perhaps it could be an option to support both designs, or even multiple? That would at least solve my issue, and thus make me happy :]

But perhaps having more than one design that estimates these measures might introduce even more noise into how to interpret these values... Idk anymore


foxt · 2021-04-13T14:42:23Z

foxt
Apr 13, 2021

It'd also be nice to be able to exclude forks.

My card

Currently shows 86% Python because I'm making a minor PR to a Python repo.
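
A minimal sketch of what excluding forks could look like on the client side. It assumes each repository object carries the `isFork` boolean that GitHub's GraphQL Repository type exposes; the sample data and the helper name are made up. The `repositories` connection also accepts an `isFork` argument, so the same filter could be applied in the query itself.

```javascript
// Hypothetical sample data: one own project and one large fork.
const repos = [
  { name: "my-app", isFork: false, languages: ["JavaScript"] },
  { name: "big-python-project", isFork: true, languages: ["Python"] },
];

// Keep only repositories that are not forks of another repository.
function withoutForks(repos) {
  return repos.filter((repo) => !repo.isFork);
}

console.log(withoutForks(repos).map((repo) => repo.name));
```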


andreped · 2021-04-13T14:58:39Z

andreped
Apr 13, 2021

I agree with @theLMGN .

I have a fork on my repo, which I haven't contributed to yet, but which apparently contains a shit ton of C#, which I have never used. This results in my "Most used languages" being roughly 80% C#, which is sort of funny considering I have contributed to roughly 40 open repos, of which are Python/C++.

I'm also wondering if C# and C++ are switched, or if C++ code is misinterpreted as C# in github-readme-stats. According to github-readme-stats I do not do any C++, but I have contributed to C++ projects (forks).


andreped · 2021-04-13T15:02:17Z

andreped
Apr 13, 2021

@theLMGN Couldn't you just use the exclude_repo option to exclude that one repository? Since you only did a minor PR, including this repo in the calculation is probably not necessary?


y9c · 2021-06-20T07:52:51Z

y9c
Jun 20, 2021

Why not add an argument &by=xxx to resolve this paradox?
xxx can be repo, line, or file.
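
A rough sketch of how such a switch might be wired up. Only two modes are shown, `repo` (one count per repository) and `size` (bytes, which is what the current method sums). The `line` and `file` modes are left out because the sample data has no line or file counts. All names here are hypothetical.

```javascript
// Weight functions keyed by the value of a hypothetical `by` parameter.
const weightBy = new Map([
  ["repo", () => 1], // every repository counts once per language
  ["size", (lang) => lang.size], // bytes of code in that language
]);

// Rank language names by the summed weight for the selected mode.
function rankLanguages(repos, by = "size") {
  const weigh = weightBy.get(by);
  if (!weigh) throw new Error(`Unsupported "by" value: ${by}`);
  const totals = new Map();
  for (const repo of repos) {
    for (const lang of repo.languages) {
      totals.set(lang.name, (totals.get(lang.name) || 0) + weigh(lang));
    }
  }
  return [...totals.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([name]) => name);
}

// Hypothetical sample: one very large C# project, two small Python ones.
const sample = [
  { languages: [{ name: "C#", size: 90000 }] },
  { languages: [{ name: "Python", size: 2000 }] },
  { languages: [{ name: "Python", size: 3000 }] },
];
console.log(rankLanguages(sample, "size"));
console.log(rankLanguages(sample, "repo"));
```

With this sample, `size` ranks C# first and `repo` ranks Python first, which is the kind of disagreement between the two methods discussed in this thread.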



Replies: 37 comments · 1 reply
