Save commit data #5

Open
mihneadb opened this issue Feb 11, 2014 · 34 comments

@mihneadb
Contributor

Use http://developer.github.com/v3/repos/commits/#list-commits-on-a-repository together with the "since" parameter to perform optimized queries.

Should use the 'CommitData' type.
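For illustration, the kind of request this boils down to, as a sketch (the values below are placeholders, not anything the river already stores):

```java
// Placeholder values, just to show the shape of the request.
String owner = "uberVU";
String repo  = "elasticsearch-river-github";
String since = "2014-02-11T00:00:00Z";   // ISO8601 time of the last commit we indexed
String url = String.format(
        "https://api.github.com/repos/%s/%s/commits?per_page=100&since=%s",
        owner, repo, since);
```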

@adaschevici

Can you explain how they should be optimized?
Do you mean only returning data for a certain period?

@mihneadb
Contributor Author

mihneadb commented Mar 4, 2014

So... right now, for issues, milestones and pull requests we query GitHub for "all the entries" - since all of these can change state (they can get closed, reopened, updated etc.).
However, commits don't change - and there are also a lot of them (for reasonably sized projects). This means that if we requested all the commits for a project all the time, we would unnecessarily burn through our rate limit.

What I suggest is that on every update we ask github for the next 5000 commits from where we left off at the last update (the API endpoint supports pagination but it will only give you at most 100 commits per page, so that means 50 requests).

It seems that the API gives you commits in descending chronological order, but we want to start with oldest first (so we can have all of them and further updates only have to add the newest commits).
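In code, the ordering idea would be roughly this (fetchCommitPages and indexCommitsPage are made-up helpers for illustration, not existing river methods):

```java
import java.util.Collections;
import java.util.List;

// Sketch: the API hands pages back newest-first, so reverse them before indexing
// to store the oldest commits first; later updates then only append newer ones.
List<String> pages = fetchCommitPages(50);   // made-up helper: 50 pages x 100 commits
Collections.reverse(pages);                  // oldest page first
for (String page : pages) {
    indexCommitsPage(page);                  // made-up helper: stores CommitData docs
}
```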

@mihneadb
Contributor Author

mihneadb commented Mar 4, 2014

Also, thinking more about this, we could just get the most recent 5000 commits every time we do an update and drop the old ones. I'm not sure really old commits provide us any value. What do you think?

@adaschevici

👍 on newer commits

@mihneadb
Contributor Author

mihneadb commented Mar 4, 2014

OK, then let's do that. :)
It shouldn't be hard to implement. getData should have a max_pages attribute (which will be 50 for this use case), and then you can probably just use it with the proper URL (api.../owner/repo/commits?per_page=100) to fetch the data.
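Roughly the loop I have in mind, as a sketch (indexPage and nextPageUrl are placeholder helpers here; the real getData will look different):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of a max_pages-capped fetch; indexPage and nextPageUrl are placeholder helpers.
void fetchCommits(String owner, String repo, int maxPages) throws IOException {
    String url = "https://api.github.com/repos/" + owner + "/" + repo
            + "/commits?per_page=100";
    int pages = 0;
    while (url != null && pages < maxPages) {           // maxPages = 50 -> up to 5000 commits
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = conn.getInputStream()) {
            indexPage(in, "CommitData");                 // placeholder: parse JSON, index docs
        }
        url = nextPageUrl(conn.getHeaderField("Link"));  // placeholder: extract rel="next" URL
        pages++;
    }
}
```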

Do you want to implement this?

@adaschevici

Sure, I'll have a hack at it.

@mihneadb
Contributor Author

mihneadb commented Mar 4, 2014

Great! Let me know if I can help.

I'm not sure that the regex for the header processing holds for the next_page URL for the commits endpoint. You'll find out :P


@adaschevici

The commits endpoint returns the results based on SHA order and branch: top=branch_name&last_sha=sha_value.
Not sure if they maintain the per_page pagination parameter across requests, because the subsequent requests come in without it.
And yes, the header parsing breaks because of the different format.
I am thinking of adding a type parameter to getData. Maybe we should join the events and the commit data inside the same method.
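For what it's worth, a small sketch of a Link header parser that just grabs the rel="next" URL, whatever query parameters the endpoint happens to use (an illustration, not the river's current regex):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull the rel="next" URL out of a GitHub Link header, regardless of
// whether the pagination parameters are page=..., last_sha=..., or anything else.
static String nextPageUrl(String linkHeader) {
    if (linkHeader == null) {
        return null;                                     // no more pages
    }
    Matcher m = Pattern.compile("<([^>]+)>;\\s*rel=\"next\"").matcher(linkHeader);
    return m.find() ? m.group(1) : null;
}
```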

@mihneadb
Contributor Author

mihneadb commented Mar 6, 2014

The pagination aspect can be checked :). We can see how many we get and decide on a limit for that.

Why do you want to join the events and the commits?

@adaschevici

They look somewhat similar; aside from the URL, the steps are basically the same.


@mihneadb
Contributor Author

mihneadb commented Mar 6, 2014

Wait a sec - are we talking about the same thing?

getData has a type parameter already and it's used for all the endpoints:
https://github.com/uberVU/elasticsearch-river-github/blob/master/src/main/java/com/ubervu/river/github/GitHubRiver.java#L248

@adaschevici

Ok. My understanding was: the function at https://github.com/uberVU/elasticsearch-river-github/blob/master/data_processor/github.py#L87 is where we get data for events, and what we want to do is add a method there for getting commits at a limited rate.

@mihneadb
Contributor Author

mihneadb commented Mar 6, 2014

Well.. this is an issue in the elasticsearch-river-github repo, and you are talking about the elasticboard repo.
This is the elasticsearch plugin that fetches the github data into elasticsearch. The elasticboard/data_processor module doesn't fetch the data, it just queries it.

The github file you are referring to just contains helpers for getting some data for prototype testing. I'll remove it since it's not used anymore.

@adaschevici

Ok. Seems I was looking in the totally wrong place.

@mihneadb
Contributor Author

mihneadb commented Mar 6, 2014

So do you want to get your hands dirty with some Java code? :D

@adaschevici

Well, I could do it if there is nobody else up for it, but I don't know if it would be a good fit, as it may take me longer than it would someone who is Java oriented.


@mihneadb
Contributor Author

mihneadb commented Mar 6, 2014

Ok, that's fine. Maybe something in the elasticboard repo catches your eye (it's Python and JS :) ).


@LosD
Contributor

LosD commented Nov 6, 2014

Any reason this only goes for commits? While other items can get updated, using the "since" parameter will get both new and updated documents, at least for issues.

@mihneadb
Contributor Author

mihneadb commented Nov 6, 2014

For issues, we get all of them all the time and we just replace old instances. It could be optimized to save some API calls, of course.

@LosD
Contributor

LosD commented Nov 6, 2014

As well as some ElasticSearch space. My index seems to grow (slowly, though, even though I update every 10 minutes), probably because of ES' versioning.

I guess this might also open up for quicker updates (e.g. every minute), as the calls would be much smaller. This is probably not important to everyone, but seeing an almost instant improvement of stats on a QA dashboard when fixing bugs, for example, is a good motivator.

I might fork to play around and see if I can't contribute :)

@mihneadb
Contributor Author

mihneadb commented Nov 6, 2014

That sounds great, please do! :)

@mihneadb
Contributor Author

@LosD please let me know if you need any help with the code or whatnot.

@LosD
Contributor

LosD commented Nov 14, 2014

Not really. I'm more or less done for issues and events; I've just been out travelling for a while.

I've settled for using the last seen event as the time to check updates from, mostly because all issue changes will fire an event, and because the Modified header in the reply is not ISO8601.
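Roughly, as a sketch (lastEventMillis is a placeholder for whatever the last stored event's created_at resolves to):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch: turn the last stored event's timestamp into the ISO8601 value that
// the issues endpoint expects for its "since" parameter.
long lastEventMillis = 1415977200000L;                   // placeholder value
SimpleDateFormat iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
iso8601.setTimeZone(TimeZone.getTimeZone("UTC"));
String since = iso8601.format(new Date(lastEventMillis));
String url = "https://api.github.com/repos/uberVU/elasticsearch-river-github"
        + "/issues?state=all&since=" + since;
```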

Oh, there may be one thing: I'm considering a unit test for it; do you know of any pitfalls when unit testing from within river code? I guess not, since we just get passed a client.

@mihneadb
Contributor Author

Sounds good, as long as we don't miss events (and we shouldn't miss any, I think).

I don't know about unit testing river code. This is my first ES plugin and the first Java code I've written in a long time, so your knowledge is most probably above mine on this one. :)

@LosD
Contributor

LosD commented Nov 14, 2014

The missing stuff part is exactly why I'd like to unit test it. But it might be hard to do properly: simulating all the weird stuff that can happen through the network may not be realistic.

Maybe the best way to test it properly would be to just start both approaches against a high traffic public repository, and then look for missed entries.

I'll figure something out :)

@mihneadb
Contributor Author

Here's an idea:

> Events are optimized for polling with the “ETag” header. If no new events have been triggered, you will see a “304 Not Modified” response, and your current rate limit will be untouched. There is also an “X-Poll-Interval” header that specifies how often (in seconds) you are allowed to poll. In times of high server load, the time may increase. Please obey the header.

(source)

So, dropping the full issues/commits/etc fetch and replacing it with very frequent polling should lead to getting all events. And it is fine, because we don't exhaust the call limit, and we do it fast enough not to miss anything. The Events endpoint gives the last 300 events. I think it's safe to assume no repo has more than 300 events per, say, 5 minutes. Or 1 minute, something like that. :)
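Something along these lines, as a sketch (the URL is just an example, and previousETag stands for wherever we'd keep the last ETag):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of ETag-based polling; previousETag is whatever we stored last time (null at first).
static String pollEvents(String previousETag) throws IOException {
    String eventsUrl = "https://api.github.com/repos/uberVU/elasticsearch-river-github/events";
    HttpURLConnection conn = (HttpURLConnection) new URL(eventsUrl).openConnection();
    if (previousETag != null) {
        conn.setRequestProperty("If-None-Match", previousETag);     // conditional request
    }
    if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
        return previousETag;          // 304: nothing new, rate limit untouched
    }
    String pollInterval = conn.getHeaderField("X-Poll-Interval");   // seconds; obey it
    // ... read and index the new events from conn.getInputStream() here ...
    return conn.getHeaderField("ETag");                             // remember for next poll
}
```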

What do you think?

@LosD
Contributor

LosD commented Nov 14, 2014

I must admit that I completely missed that the Events endpoint works differently than the Issues endpoint. I had somehow convinced myself that the since parameter also worked for events.

Anyway, sounds good! And I guess there is no reason to do the Issue part much differently than I had considered: before running, check the time of the latest Event (or Issue, it doesn't really matter; the important part is that we get anything that has happened since we last checked) fetched, and set the since parameter to that. Oh, and skip entirely if there are no new events, as that would also mean no new or modified issues.

In time, the events could be parsed so the rest of the data could be fetched intelligently, but I guess the amount of open stuff is usually so low that it wouldn't matter, at least not unless someone gets interested enough in closed stuff to do something about it.

Which reminds me, a Pull Request is also an Issue; I guess it should be possible to find out whether PRs have been closed by looking at the corresponding Issue?

I'll take a look sometime tomorrow (it's 10 in the evening here).

By the way, this has gone pretty far away from the original scope of this issue (adding commits, which I'm not even looking at :)). Maybe we should move the discussion to a new issue?

@mihneadb
Contributor Author

Yep, the events endpoint is pretty smart. But we need to query it using the ETag magic to benefit from that.

Sounds good, indeed. There would be one more thing that would be cool to have - supporting some amount of downtime (as much as possible). Basically, this means saving the last checked timestamp in the status entry of the river (which is in ES), or somewhere like that. That way, if the machine reboots for example, the river will carry on from where it stopped (within the limit of the last 300 events, of course).
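A rough sketch of what that could look like with the Java client (the index/type/id names here are illustrative, not necessarily what the river uses):

```java
import java.util.HashMap;
import java.util.Map;
import org.elasticsearch.client.Client;

// Sketch: keep the last-checked timestamp alongside the river's metadata so a
// restart can resume from there. Index/type/id names are illustrative only.
void saveLastChecked(Client client, String riverName, long timestampMillis) {
    Map<String, Object> doc = new HashMap<String, Object>();
    doc.put("timestamp", timestampMillis);
    client.prepareIndex("_river", riverName, "_lastchecked")
          .setSource(doc)
          .execute().actionGet();
}
```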

Yes, a PR is also an issue and it has the same fields and actions (closed/opened).

No rush, it's even later here :).

Sure, feel free to open a new issue! Might be worth your while to do a minor sketch/spec of what you plan to do.

Thanks!

@LosD
Contributor

LosD commented Nov 14, 2014

Actually, we already have something close to that: The newest event stored. It's just a simple max aggregate on the event created_at.
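For reference, a sketch of that lookup with the Java client (the index and type names are placeholders):

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.max.Max;

// Sketch: find the newest stored event via a max aggregation on created_at.
// The index and type names are placeholders.
static double newestEventMillis(Client client) {
    SearchResponse resp = client.prepareSearch("github")
            .setTypes("IssueEventData")
            .setSize(0)
            .addAggregation(AggregationBuilders.max("last_event").field("created_at"))
            .execute().actionGet();
    Max max = resp.getAggregations().get("last_event");
    return max.getValue();   // epoch millis of the most recent event
}
```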

@mihneadb
Contributor Author

Indeed, but not sure how long that ends up taking when you have many events. Your call!


@LosD
Contributor

LosD commented Nov 14, 2014

Hmmmm, that may be true. The ID is a simple integer, and as far as I know it is sequential. It should be very fast, so I guess it would be better to use it for the lookup.

@LosD
Contributor

LosD commented Nov 16, 2014

I actually got something working! Now I can load as often as GitHub permits, and new issues are loaded incrementally.

Still have some issues:

  1. Events are not really loaded incrementally: the update only starts if there are changes, but all events are loaded no matter what. I haven't checked yet, but I hope the events arrive latest first. In that case, I can just check whether the ID exists before storing it, and drop out as soon as one already exists (see the sketch after this list).
  2. As the index mapping does not exist yet, the code that checks for the most recent entry will throw some errors on the first load. Not really catastrophic, but it does not look good in the logs.
  3. The code is not merged with master yet. I'll get to that soon.
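Sketch of the duplicate check from item 1 (the index and type names are placeholders):

```java
import org.elasticsearch.client.Client;

// Sketch: stop processing a page of events as soon as we hit one that is
// already stored. Index/type names are placeholders.
static boolean alreadyIndexed(Client client, String eventId) {
    return client.prepareGet("github", "IssueEventData", eventId)
                 .execute().actionGet()
                 .isExists();
}
```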

I'll make a PR as soon as 1 and 3 have been taken care of. If you are curious, you can take a look at https://github.com/LosD/elasticsearch-river-github/tree/feature/incremental-update :)

By the way, the ETag feature works for all resources, which would make it feasible to check all data for changes before reloading it, but I think it would require a bit of refactoring first. I already feel like I'm stretching the original design a bit too much with my changes.

@mihneadb
Contributor Author

Nice, congrats!

  1. You mean we cannot fetch only the updated ones? Maybe we can use the "since" param for this. And yes, checking for duplicates sounds fine.
  2. Maybe we can do a try/catch for that first occurrence?
  3. There are no major changes, should be easy!

We should definitely use the ETag stuff for all resources, don't worry about the design. This way is considerably better. I didn't use that approach in the first place because I went the MVP route. Now that we know this works and it is useful, it makes sense to improve the design. :)

Thanks again for your help!

@LosD
Contributor

LosD commented Nov 17, 2014

I made a Pull Request with the issues fixed, see #15 and #16 :)

Oh, and you are very welcome. I'm just happy to help enhance something that is making our lives so much easier :)
