Occasional Stripe::APIConnectionError in production #382

Closed
kgx opened this issue Feb 17, 2016 · 23 comments

Comments

@kgx

kgx commented Feb 17, 2016

Every now and then, after upgrading to the new Stripe gem 1.3.5.1 with the updated ca bundle, a Stripe API call is failing in production with the following error:

Could not verify Stripe's SSL certificate. Please make sure that your network is not intercepting
 certificates. (Try going to https://api.stripe.com/v1 in your browser.) If this problem persists, let us 
know at support@stripe.com.
(Network error: certificate verify failed)

The API works fine 99.9% of the time, so it could also be a network error. However all other external APIs are working reliably with SSL.

Our runtime environment contains the following:

  • JRuby 9.0.5.0
  • Debian Jessie
  • stripe gem 1.3.5.1
  • rest-client gem 1.8.0
  • jruby-openssl gem 0.9.15

Please advise on how I can troubleshoot this further, as I would like to avoid random payment processing issues for users.

@brandur
Contributor

brandur commented Feb 17, 2016

Hey @kgx,

So in general, a CA bundle either has our root certificate or it doesn't, and so the bundle itself shouldn't be causing this sort of non-deterministic problem. An intermittent network error should also never manifest as a bad certificate.

I haven't seen very many cases like yours so far, but in the ones we have seen, it often turns out to be caused by a router or some other kind of appliance on a user's internal network that steps in to intercept requests under certain circumstances (i.e. an actual MITM, thus resulting in the error). Can you say pretty confidently that that's not going on in this case?

Thanks!

@kgx
Author

kgx commented Feb 17, 2016

@brandur, OK, good to know. The application is running in a Docker container on Google Cloud, so there is some packet forwarding going on to reach the outside world, but otherwise there should not be any MITM.

@nnc

nnc commented Feb 22, 2016

@brandur I'm getting this error for each request since upgrading from 1.34 to 1.36 and after looking over your commits related to this change, I'm confused why that is.

If gem CA bundle is used on my system, it should work before and after the refresh of said bundle between those two gem versions, as both 1.34 and 1.36 should have CA certs for Stripe endpoints, right?
On the other hand if my system CA bundle is used, it should also work with 1.34 and 1.36, as my system bundle was not updated.
So it seems like 1.34 was using the gem CA bundle and 1.36 is somehow using my system CA bundle, though I don't see anything in your commits that would introduce this change in behavior.

What would be the best way to check which CA bundle is actually being used with 1.34 and 1.36?

@nnc

nnc commented Feb 22, 2016

Actually, the last line of my errors is a bit different than the one posted above:

(Network error: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed)

Ruby 2.3.0 on Ubuntu 10.04

@brandur
Contributor

brandur commented Feb 22, 2016

What would be the best way to check which CA bundle is actually being used with 1.34 and 1.36?

I've been putting a logging statement right around here to see where the path is actually pointed to. From what I've been able to tell though, it seems like even the new versions indeed use the bundled CA file.

Is the problem that you're seeing intermittent as well? Or are you able to reproduce it regularly?

@nnc

nnc commented Feb 22, 2016

In my case, it's not intermittent. All requests fail with the same error on 1.36.

Logging shows both versions use bundled CA file. And both versions use rest-client 1.6.9.

I guess next step is to figure out what CA file rest-client, and more importantly, net/http are using.
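One way to see where Ruby's OpenSSL bindings look by default, when neither rest-client nor net/http is handed an explicit CA file, is to print the compiled-in locations. This is only a minimal sketch using the standard library; the helper name `default_ca_locations` is made up for illustration, and it says nothing about which path the Stripe gem itself passes down.

```ruby
require 'openssl'

# Hypothetical helper (not part of any gem): report the CA file and
# directory that OpenSSL falls back to when no bundle is configured.
# The SSL_CERT_FILE / SSL_CERT_DIR environment variables override the
# compiled-in defaults, so they are checked first.
def default_ca_locations
  {
    cert_file: ENV['SSL_CERT_FILE'] || OpenSSL::X509::DEFAULT_CERT_FILE,
    cert_dir: ENV['SSL_CERT_DIR'] || OpenSSL::X509::DEFAULT_CERT_DIR
  }
end

default_ca_locations.each do |name, path|
  puts "#{name}: #{path} (exists: #{File.exist?(path)})"
end
```

Running this under each Ruby version or container image in question should show whether the system defaults differ between environments.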

@nnc

nnc commented Feb 22, 2016

Although, if there is some kind of issue with the updated bundled CA file, this is how it would behave. Any idea how we can definitively exclude that option?

@brandur
Contributor

brandur commented Feb 22, 2016

Hi @nnc, since you're able to reproduce this regularly, would you mind helping me narrow down the problem?

I've written a small test script here that just iterates through each CA bundle and checks whether rest-client can make a successful request with it:

https://github.com/brandur/ca-test

Would you mind running it on one of the problematic computers and pasting the output? Instructions are in the repo's README.

Sorry about the hassle here, but I'm trying to narrow down whether the problem is related to changes in the library, is machine-specific, or is possibly even related to some kind of change in server configuration. Unfortunately I'm having no luck at all reproducing the problem locally :/
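For readers who cannot fetch the repository, the idea of checking one bundle at a time can be sketched roughly as below. This is not the actual ca-test script: it uses net/http from the standard library instead of rest-client, and the helper name `check_bundle` and its messages are assumptions made for illustration.

```ruby
require 'net/http'
require 'openssl'

# Hypothetical helper: attempt a TLS handshake with `host`, trusting only
# the certificates in `ca_file`, and describe the outcome as a string.
def check_bundle(label, ca_file, host: 'api.stripe.com')
  return "#{label}: bundle not found at #{ca_file}, skipping" unless File.exist?(ca_file)

  http = Net::HTTP.new(host, 443)
  http.use_ssl = true
  http.verify_mode = OpenSSL::SSL::VERIFY_PEER
  http.ca_file = ca_file   # with ca_file set, the system default store is not used
  http.open_timeout = 10
  http.start {}            # opens the connection, which performs the handshake
  "#{label}: verification succeeded"
rescue OpenSSL::SSL::SSLError => e
  "#{label}: verification FAILED (#{e.message})"
end
```

A call such as `puts check_bundle('Ubuntu system bundle', '/etc/ssl/certs/ca-certificates.crt')` needs network access; other failures (DNS, timeouts) are deliberately left to propagate so they are not mistaken for certificate problems.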

@brandur
Contributor

brandur commented Feb 22, 2016

And for the record, here's my output on an OSX box:

$ bundle exec ruby test.rb
Running request for: Empty bundle (should fail)
verification FAILED
Running request for: New cert bundle (cURL/Mozilla)
verification suceeded
Running request for: Old cert bundle (from Ubuntu)
verification suceeded
Running request for: rest-client default
verification suceeded
Running request for: Ubuntu system bundle
Bundle not found at: /etc/ssl/certs/ca-certificates.crt ... skipping

And here's my output on a Ubuntu box (12.04 LTS):

$ bundle exec ruby test.rb
Running request for: Empty bundle (should fail)
verification FAILED
Running request for: New cert bundle (cURL/Mozilla)
verification suceeded
Running request for: Old cert bundle (from Ubuntu)
verification suceeded
Running request for: rest-client default
verification suceeded
Running request for: Ubuntu system bundle
verification suceeded

@nnc

nnc commented Feb 22, 2016

Hi @brandur, thanks for digging into this!

This is on Ubuntu 10.04:

$ bundle exec ruby test.rb
Running request for: Empty bundle (should fail)
verification FAILED
Running request for: New cert bundle (cURL/Mozilla)
verification FAILED
Running request for: Old cert bundle (from Ubuntu)
verification suceeded
Running request for: rest-client default
verification suceeded
Running request for: Ubuntu system bundle
verification suceeded

I tried it also on Ubuntu 12.04, with same Ruby version, and new cert succeeds there.

Let me know if you need me to try anything else on that Ubuntu 10.04 system.

@brandur
Contributor

brandur commented Feb 22, 2016

@kgx Just for completeness, would you mind also running the script above on your Debian environment?

I suspect that we're dealing with two different problems here. @nnc, I believe that yours stems from being on quite an old version of Ubuntu (and, as a result, probably of OpenSSL as well). I don't want to sound overly prescriptive here, but 10.04 has been EOL for almost a year now, and I'd strongly encourage you to upgrade to ensure that you're getting adequate security updates. I'm still not sure what the precise problem is, but there may have been some change in the format of a CRT/PEM that's not compatible with older versions of OpenSSL. For an immediate solution, I'd recommend either pulling our old bundle into your app, or setting it to your system bundle (if it's reasonably up-to-date):

Stripe.ca_bundle_path = "/etc/ssl/certs/ca-certificates.crt"

@kgx Your issue is still a bit of a mystery, but given that the problem is intermittent and you're on quite a recent version of Debian, it might be something else.

@brandur
Contributor

brandur commented Feb 22, 2016

@nnc And BTW, thanks a lot for running that script! That was very helpful in getting this problem a little more localized.

@nnc

nnc commented Feb 23, 2016

@brandur thanks for looking into this. I'll apply one of the workarounds for now.

@kgx
Author

kgx commented Feb 25, 2016

@brandur, I just ran the script in production environment and the CA bundles work as expected.

Based on the intermittent results, I think we are dealing with either a networking issue (most likely) or a thread safety issue (least likely). We have not seen this SSL error for about 3 days now, but we have had 2 network timeouts while making a request to the Stripe API during this period.

I am going to continue to monitor for new occurrences of the SSL error. In the meantime, do you have any ideas for things we should look at?

@kgx
Author

kgx commented Mar 12, 2016

Ok so we have experienced this error about 15 times since my last post on 2/25. It mostly happens during webhook processing and batch jobs, so the end user impact has not been huge, but it is obviously concerning. I am going to open a new issue with jruby-openssl.

@nadavshatz

We've been experiencing this in production when trying to charge customers. Quite the issue

@kgx
Author

kgx commented Mar 12, 2016

It's definitely a thread safety issue. On JRuby 9.0.5.0 I can recreate it consistently on Debian and OS X with the following code:

#this does not work consistently
require 'stripe'
Stripe.api_key = 'some_api_key'
10.times.map do |i|
  Thread.new do
    puts "before multi-threaded api call #{ i }"
    Stripe::Customer.retrieve('some_customer_id')
    puts "after multi-threaded api call #{ i }"
  end
end.each(&:join)

(If the system is fast enough you may need to increase thread count to recreate)
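When chasing an intermittent failure like this, it can help to collect every exception raised across the threads rather than stopping at the first one that `join` re-raises. The following is a small sketch of such a harness; `run_concurrently` is a hypothetical name, and the block passed to it stands in for whatever API call is under test.

```ruby
# Hypothetical harness: run the given block in `thread_count` threads and
# return the exceptions that were raised (an empty array if none were).
def run_concurrently(thread_count)
  threads = thread_count.times.map do |i|
    Thread.new do
      begin
        yield i
        nil          # a nil thread value means this call succeeded
      rescue StandardError => e
        e            # hand the exception back as the thread's value
      end
    end
  end
  threads.map(&:value).compact
end
```

With the stripe gem loaded, something like `errors = run_concurrently(20) { Stripe::Customer.retrieve('some_customer_id') }` followed by `puts errors.map(&:message).tally` (Ruby 2.7+) would show how many calls failed and with which message.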

@kgx
Author

kgx commented Mar 12, 2016

OK so this thread-safety issue specifically occurs when using the new CA bundle included in the Stripe gem.

require 'stripe' # also loads rest-client

api_key = 'some_api_key'
headers = {
    :user_agent => "Stripe/v1 RubyBindings/#{Stripe::VERSION}",
    :authorization => "Bearer #{api_key}",
    :content_type => 'application/x-www-form-urlencoded'
}
request_opts = {
    :verify_ssl => OpenSSL::SSL::VERIFY_PEER,
    # comment out the next line and it works, otherwise it fails
    :ssl_ca_file => Stripe::DEFAULT_CA_BUNDLE_PATH,
    :headers => headers,
    :method => :get,
    :url => 'https://api.stripe.com/v1/customers'
}
# this does not work consistently
20.times.map do |i|
  Thread.new do
    puts "before multi-threaded api call #{ i }"
    RestClient::Request.execute(request_opts)
    puts "after multi-threaded api call #{ i }"
  end
end.each(&:join)

@kgx
Author

kgx commented Mar 12, 2016

@nadavshatz - Can you try the following monkey patch? This seems to be working for me to prevent the problem:

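module Stripe
  # Override the gem's request helper so that no explicit CA file is passed
  # to rest-client. (`private` has no effect on `def self.` methods, so it
  # is omitted here.)
  def self.execute_request(opts)
    # Hash#except needs ActiveSupport (or Ruby 3.0+); reject works everywhere.
    RestClient::Request.execute(opts.reject { |key, _| key == :ssl_ca_file })
  end
end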

But keep in mind you will be falling back to your system's default CA file, which needs to be adequately up-to-date and defeats the purpose of Stripe bundling one in the gem.

brandur pushed a commit that referenced this issue Mar 14, 2016
This attempts to give some semblance of thread safety to parallel
library requests by pre-assigning a certificate store, the
initialization of which seems to be a weak point for thread safety. Note
that while this does seem to empirically improve the situation, it
doesn't guarantee actual thread safety.

A longer term solution to the problem is probably to assign a per-thread
HTTP client for much stronger guarantees, but this is a little further
out. More discussion on that topic in #313.

Fixes #382.
@brandur
Contributor

brandur commented Mar 14, 2016

@kgx Thanks for continuing to dig into this. Nice find on the thread safety problem.

So I think that the fundamental trouble here is that calls to rest-client are not thread safe. It's hard to say what exactly the introduction of a new cert bundle changed, but it seems that something like a larger bundle managed to push a previously tenuous situation far enough over the edge to start being problematic.

Would you mind trying out the patch I wrote in #397? It uses a technique inspired by what your freedom patch above causes rest-client to do internally: it pre-initializes a CA store containing the gem's cert bundle. When I apply it to your test script, with one line added to seed its value (the ||= operator used inside of ca_store is itself not thread-safe), like so:

# seed a certificate store before starting to make any concurrent requests
Stripe.ca_store

api_key = 'some_api_key'
headers = {
    :user_agent => "Stripe/v1 RubyBindings/#{Stripe::VERSION}",
    :authorization => "Bearer #{api_key}",
    :content_type => 'application/x-www-form-urlencoded'
}

...

... I'm able to successfully avoid the peer validation problems on JRuby.
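To illustrate the general idea of pre-building a certificate store, and why an unguarded `||=` is a weak point, here is a rough standard-library sketch. It is not the code from #397: the module name `CaStoreCache` and its method signature are assumptions, and a Mutex is swapped in to guard the lazy initialization instead of relying on an up-front seeding call.

```ruby
require 'openssl'

# Hypothetical cache (not the gem's implementation): build one
# OpenSSL::X509::Store from a CA bundle and share it between threads.
module CaStoreCache
  @mutex = Mutex.new
  @store = nil

  # The first call loads `bundle_path`; later calls return the same store.
  def self.ca_store(bundle_path)
    # Without the lock, two threads could both see @store as nil and each
    # start loading the bundle at the same time.
    @mutex.synchronize do
      @store ||= begin
        store = OpenSSL::X509::Store.new
        store.add_file(bundle_path)
        store
      end
    end
  end
end
```

A store built this way can be handed to net/http through `http.cert_store = CaStoreCache.ca_store(path)`. As noted above, sharing a pre-built store reduces the window for races but is not by itself a guarantee of thread safety.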

Note that this is definitely a hack rather than an actual guarantee of thread safety, but it should get you back to more or less where you were before the new bundle was introduced. I think that the "proper" solution to the problem is to introduce a configurable HTTP client as described in #313 and then make sure to seed one per thread.

Let me know what you think.
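The per-thread client idea can be sketched with plain net/http as follows. This only illustrates the pattern and is not the design discussed in #313; the helper name `thread_local_client`, the variable key, and the assumption of a single target host are all made up for the example.

```ruby
require 'net/http'
require 'openssl'

# Hypothetical helper: return an HTTPS client owned by the calling thread,
# creating it on first use. Assumes every call targets the same host.
def thread_local_client(host, ca_file: nil)
  client = Thread.current.thread_variable_get(:hypothetical_http_client)
  return client if client

  http = Net::HTTP.new(host, 443)   # no connection is opened yet
  http.use_ssl = true
  http.verify_mode = OpenSSL::SSL::VERIFY_PEER
  http.ca_file = ca_file if ca_file
  Thread.current.thread_variable_set(:hypothetical_http_client, http)
  http
end
```

Because each thread only ever touches its own client, no locking is needed around the lookup; the trade-off is one connection (and one TLS context) per thread.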

@brandur
Contributor

brandur commented Mar 14, 2016

@nadavshatz Can we get more information on your situation please? Are you seeing total or intermittent failure? Are you also on JRuby? Is there anything else that might help us debug? Thanks!

@nadavshatz

@brandur absolutely:
We are on JRuby 9.0.4.0, we stopped using 9.0.5.0 after seeing other issues with the AWS gem (jruby/jruby#3645)

We experience this issue in an intermittent fashion. We use Puma as the server which does use threads so I'm hoping that when you find the solution for @kgx it will work for us as well.
We see it when we try to charge/add users, but only rarely: maybe once every 10-20 requests, sometimes even less often, which makes it really hard to debug.

Let me know if there is anything else I can include to help.

@brandur
Contributor

brandur commented Mar 14, 2016

@nadavshatz Thanks! It sounds like your problem is likely related to @kgx's given the very similar setup.

If it's convenient to do so, you may want to try the branch in #397 mentioned above, combined with the addition of a Stripe.ca_store call during your initialization.
