-
-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use text-based columnar storage by default for big memory and processing savings #96
Conversation
…ing savings This changes the default data for mime-types from a JSON file to a set of text files, with one line per mime type, and one file per mime type attribute. This only loads the content_type and extensions for mime-types by default, reducing default memory usage of mime-types from over 22MB to just over 2MB. Additionally, it speeds up requiring from 0.38 seconds to about 0.10 seconds. Even when loading all of the attributes for the mime types, the memory usage of mime-types stays under 11MB, though load time increases to 0.52 seconds. I think that is an acceptable trade off. If the txt files are shipped instead of the json file, there is also a disk space savings of 167KB (392KB instead of 560KB). rake convert:yaml:txt has been added for converting the yaml files to the necessary txt files. In order to get the conversion to work, I had to make convert.rb turn on lazy loading for mime types. This should be completely backwards compatible if a path is given containing the types. The only current backwards compatibility difference is for the "text/directory - DEPRECATED by RFC6350" mime type, which looks to me like an invalid mime type anyway, and probably should be removed.
In light of #95, I want to get clearance from you that you will be fine if this is relicensed to be MIT-only when mime-types 3.x comes out. |
Definitely. I license all of my own libraries MIT. |
Thanks for this. I’m going to take a couple of days to review this, benchmark it, etc., and compare what you’ve made against my future plans for the library (as I now say in the README, |
Use a single string for each encoding. This decreases the number of encoding strings stored by default from 1909 to 4, saving about 700KB of memory by default on 64-bit MRI. While here, reduce object allocations while loading data from the mime type txt files.
Sounds good. I'm going to push another commit that adds pooling for encoding strings, since that seems like the lowest hanging fruit (only 4 possible strings). But I'll stop after that so you can review. If possible and you agree this is backwards compatible, could you consider releasing 2.6 with just this change? This will significantly reduce memory consumption for most applications using mime-types directly or indirectly, and it would be great to get this out without waiting for 2.99, which I'm guessing won't be released for another 6 months. |
If it’s backwards compatible, we can do that. The latest date that 2.99 and 3.0 will be released is November 2015; if they are ready before then, I will release them before then. I’m going to have to do a dependency search on mime-types again and probably suggest another “apocalypse” post to Peter Cooper because I want people who depend on the data that is being dropped to change their gem requirements to (The truth is that the data isn’t being dropped; just some old representations of it.) |
BTW, I do have to think about this one for a 2.x release, because I had not planned on turning on lazy loading by default until 3.0 and I have to consider whether this is going to be a breaking change. I’m comfortable with the file format change (it’s purely an internal implementation detail), but I would almost prefer having a different require that people can opt into (I would keep it for 3.0, but it would be the default). Say, gem 'mime-types', '~> 2.6', require: 'mime/types/lazy' In 3.0, I want to figure out how to have a |
Looks like there is a single assertion failing on JRuby. It doesn't always fail, though, it only fails sometimes. Of the Travis builds, it looks like it failed 3 times out of 4. I can get it to fail on my machine, but not as often. This appears to be either a race condition or an order-dependent test. I'm going to try to use minitest-bisect to see if it is order-dependent test and can be fixed. If it is a race condition, I'm not sure why it would be in the code I added, as the mutex should catch any attempt to access the lazy loaded data before it is loaded. If the bisect doesn't turn up anything, I'll try to add more debugging to see what could be going wrong. I'm OK with a separate require, or using a separate environment variable to enable it. As long as there is some way to enable it. |
A couple more review comments:
|
OK, the JRuby failure definitely appears to be an order dependent bug.
As for the PATH protocol, this pull request is backwards compatible in that if PATH is set by a packager, the old behavior is used. So people that set PATH won't get the reduced memory use, but everything will still work. Because PATH doesn't appear to deal with interdependent file sets, I'm not sure it can be used the same way for the txt file as it can for the YAML and JSON fails. There's probably a way to get it to work, but I'm guessing it will require additional changes. |
This was causing the test-order dependent bug on JRuby. The clear_cache_file code was taken from the cache test.
I never did get minitest_bisect to work on JRuby, but I was able to figure out the underlying test order bug and fix it. :) |
Yeah. Parts of the You’ve done quite a bit of amazing work here with this and you don’t have to solve all the problems. |
def _each_file_line(type, lookup=true) | ||
LOAD_MUTEX.synchronize do | ||
next if @loaded_attributes.include?(type) | ||
File.open(File.join(ROOT, "mime-types-#{type}.txt"), 'r:UTF-8') do |file| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could do this a little cleaner with a Pathname and/or with each_with_index
type = "friendly"
ROOT = Pathname.new(File.expand_path('../../../data', __FILE__).freeze)
ROOT.join("mime-types-#{type}.txt").each_line($/, mode: "r:UTF-8").each_with_index do |line, index|
puts [line, index].inspect
end
It's fine as is, I just really like working with pathnames 😸
I ran my app with this PR and it looks great! There's a 39% decrease in memory useage (from 52.5 to 31.8). The mail gem goes from 19 mb to about 3 mb. The bigger memory delta is likely due to a high allocation factor, we could probably tune down by decreasing RUBY_GC_HEAP_GROWTH_FACTOR, but we would still expect to see a 10-16mb decrease. Everything looks good. My one suggestion might be the addition of a way for gems to explicitly declare and load the data columns they need at require time, that way we take advantage of CoW when using a forking webserver. Maybe something like Here's my data. Before
After
|
I’m almost ready with this. Another couple of days and I’ll have whipped it into shape with the restrictions that I had wanted. The interface will (for 2.6) default to not using the columnar store (so the gem will continue to include both formats). To get the columnar store by default, |
@halostatue That seems like a fine compromise to me. After 2.6 is released, I'll submit a pull request to mail to require mime-types 2.6 and use I assume that in future versions, if columnar is made the default, |
That is correct. The JSON capabilities will probably stick around until mime-types 4, although deprecated and replaced with a default columnar store. The only thing that may change is if we find a more efficient pure-Ruby columnar store rather than text. I’ll commit now that, until at least mime-types 5, requiring Given that I see at least a couple of years growth once I make mime-types 3, we’re probably talking around 2020 or so. |
@jeremyevans, @schneems: I’m still a couple of days away from finalizing the release—I take my documentation fairly seriously, and there are no tests regarding columnar support yet, but I wanted to see if you guys had time to look over what I’ve got on my working branch prior to rebase/squash and merge. https://github.com/mime-types/ruby-mime-types/tree/2.6-work One thing that I’ve done is added a
I have not yet timed a “full” load with the columnar format. I haven't yet profiled to see where the slowdown might be, but that’ll be something that definitely gets tackled for the 3.0 release and part of why I am happy to make this an opt-in release. I strongly suspect that there are performance boosts to be found with minimal cost in terms of memory. But that’s for a future time after 2.6. |
I looked over the 2.6-work tree and it looks good to me. I'm not sure why the benchmark is showing columnar being 2x slower than JSON. I know when I benchmarked it, it was substantially faster, and even if you were loading all attributes from the columnar store, it was only about 30% slower. But in my benchmarking, I was only loading the types once, not multiple times. I was basically doing:
and comparing the times before and after the change. |
It’s possible I regressed something with the PATH support changes, but I don’t think so. I know that the benchmark is slightly unrealistic. |
We probably want to reach out to the top users of mime/types to get them to test and switch over to the new store once this is released. I whipped up a script that gets reverse dependencies from rubygems and sorts them by download number. Here's the script and the top 50 gems: https://gist.github.com/schneems/e09c95da37a8c50047a8 No surprise, mail is number 1. |
I did a run through of all of the most recent commits compared to master. On the surface everything looks good. Let me know if you want some help running any specific benchmarks. |
I was going to reach out to Peter Cooper, as well. |
Mime-types 2.6.1+ has a columnar store that only loads the data that is needed for a drastically reduced memory footprint. cc/ @jeremyevans @halostatue See: mime-types/ruby-mime-types#96. [fixes #1873]
== 2.6.1 / 2015-05-25 * Bugs: * Make columnar store handle all supported extensions, not just the first. * Avoid circular require when using the columnar store. == 2.6 / 2015-05-25 * New Feature: * Columnar data storage for the MIME::Types registry, contributed by Jeremy Evans (@jeremyevans). Reduces default memory use substantially (the mail gem drops from 19 Mib to about 3 Mib). Resolves {#96}[mime-types/ruby-mime-types#96], {#94}[mime-types/ruby-mime-types#94], {#83}[mime-types/ruby-mime-types#83]. Partially addresses {#64}[mime-types/ruby-mime-types#64] and {#62}[mime-types/ruby-mime-types#62]. * Development: * Removed caching of deprecation messages in preparation for mime-types 3.0. Now, deprecated methods will always warn their deprecation instead of only warning once. * Added a logger for deprecation messages. * Renamed <tt>lib/mime.rb</tt> to <tt>lib/mime/deprecations.rb</tt> to not conflict with the {mime}[https://rubygems.org/gems/mime] gem on behalf of the maintainers of the {Praxis Framework}[http://praxis-framework.io/]. Provided by Josep M. Blanquer (@blanquer), {#100}[mime-types/ruby-mime-types#100]. * Added the columnar data conversion tool, also provided by Jeremy Evans. * Documentation: * Improved documentation and ensured that all deprecated methods are marked as such in the documentation. * Development: * Added more Ruby variants to Travis CI. * Silenced deprecation messages for internal tools. Noisy deprecations are noisy, but that's the point. == 2.5 / 2015-04-25 * Bugs: * David Genord (@albus522) fixed a bug in loading MIME::types cache where a container loaded from cache did not have the expected +default_proc+, {#86}[mime-types/ruby-mime-types#86]. * Richard Schneeman (@schneems) provided a patch that substantially reduces unnecessary allocations. * Documentation: * Tibor Szolár (@flexik) fixed a typo in the README, {#82}[mime-types/ruby-mime-types#82] * Fixed {#80}[mime-types/ruby-mime-types#80], clarifying the relationship of MIME::Type#content_type and MIME::Type#simplified, with Ken Ip (@kenips). * Development: * Juanito Fatas (@JuanitoFatas) enabled container mode on Travis CI, {#87}[mime-types/ruby-mime-types#87]. * Moved development to a mime-types organization under {mime-types/ruby-mime-types}[https://github.com/mime-types/ruby-mime-types].
This is based on mime-types/ruby-mime-types#96, and some kind words from Alex Burkhart. By using the columnar store in the mime types gem we should be able to save 10-20MB of memory per ruby process.
Mime-types 2.6.1+ has a columnar store that only loads the data that is needed for a drastically reduced memory footprint. cc/ @jeremyevans @halostatue See: mime-types/ruby-mime-types#96. [fixes #1873]
This changes the default data for mime-types from a JSON
file to a set of text files, with one line per mime type, and one
file per mime type attribute. This only loads the content_type and
extensions for mime-types by default, reducing default memory usage
of mime-types from over 22MB to just over 2MB. Additionally, it
speeds up requiring from 0.38 seconds to about 0.10 seconds.
Even when loading all of the attributes for the mime types, the
memory usage of mime-types stays under 11MB, though load time
increases to 0.52 seconds. I think that is an acceptable trade off.
If the txt files are shipped instead of the json file, there is also
a disk space savings of 167KB (392KB instead of 560KB).
rake convert:yaml:txt has been added for converting the yaml files
to the necessary txt files. In order to get the conversion to work,
I had to make convert.rb turn on lazy loading for mime types.
This should be completely backwards compatible if a path is given
containing the types. The only current backwards compatibility
difference is for the "text/directory - DEPRECATED by RFC6350" mime
type, which looks to me like an invalid mime type anyway, and
probably should be removed.
This has been tested on ruby 1.9 - 2.2 and jruby 1.7
I've tested this with the mail gem tests and mail also passes all tests
when using this.
This should address #94 and #83 and possibly also #64 and #62.