Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory usage/overhead? #123

Closed
ioquatix opened this issue Oct 9, 2016 · 56 comments
Closed

Memory usage/overhead? #123

ioquatix opened this issue Oct 9, 2016 · 56 comments

Comments

@ioquatix
Copy link

ioquatix commented Oct 9, 2016

I've been reviewing memory usage in my web framework. I'm sure you already know this but mime-types is a memory hog, accounting for 55% of my base web application memory usage.

> bundle exec derailed bundle:mem
TOP: 11.8633 MiB
  utopia: 11.0156 MiB
    mail: 9.0 MiB
      mime/types/columnar: 6.6094 MiB (Also required by: utopia)
        mime/types: 6.6055 MiB
          mime/types/registry: 6.3438 MiB
      mail/message: 0.6797 MiB
      mail/field: 0.4883 MiB
      mail/core_extensions/string/multibyte: 0.4766 MiB
        mail/multibyte: 0.4492 MiB (Also required by: mail)
    yaml: 0.5117 MiB (Also required by: mail/message)
      psych: 0.5117 MiB
  rake: 0.8203 MiB
    rake/rake_module: 0.4609 MiB
      rake/application: 0.4453 MiB (Also required by: rake)

To me, this seems a bit strange, as the mime-types.json file is only ~400Kbytes.

Is this because you are building a lot of in-memory data structures for the mime types?

If the data is 400Kbytes, and the problem is ruby overhead, what about compiling the data into a C extension and providing a query mechanism? If it's read-only constants in a C library the overhead would be reduced significantly and also shared between processes.

Perhaps you can give me some ideas of whether this memory usage is reasonable according to your own experiences and then perhaps we can brainstorm some ideas to reduce memory usage.

@ioquatix ioquatix changed the title Memory usage and transactional API Memory usage/overhead? Oct 9, 2016
@ioquatix
Copy link
Author

ioquatix commented Oct 9, 2016

The reason why I mentioned "transactional API" in the title was because I was thinking about how to deal with memory usage. If it turns out the current overheads can't be reduced, what about implementing an API that loads and discards the data after use, e.g.

MIME::Types.database do |database|
    database.mime_types_for(".exe")
end
# After this block is done, the database is closed and all data related to it is freed.

@halostatue
Copy link
Member

The 6Mb sounds a little high (it used to be ~17); there’s a bit of stuff I can do to reduce usage that I haven’t done yet (there’s an issue in this tracker about a value pool) mostly because I can’t figure out the best way to determine the memory use vs speed overhead of the pool (the pool is a self-referencing hash):

   pool = {}
   application = 'application'
   pool[application] = application

This is important because every single MIME type has a content type (application, image, text, etc.). I’ll look at a few other things.

I’ve been doing a bit of playing (I can use derailed bundle:mem with mime-types directly, which is useful), and a few various changes have reduced the memory use by about 1.5–2Mb. I need to play with the pieces to see which ones are actually effective vs ineffective, but…

@halostatue halostatue self-assigned this Oct 9, 2016
@halostatue halostatue added this to the Future milestone Oct 9, 2016
@halostatue halostatue added the Bug label Oct 9, 2016
@ioquatix
Copy link
Author

ioquatix commented Oct 9, 2016

I'm having a bit of a hack.. the results will be interesting. Generating a mime_types.c file with a static struct based on the mime-types-data gem. It's only taken about 10 minutes. If I can compile this.. it will be interesting to see what the size in memory is like. I could provide fast lookup by generating static hash tables too. We could use the same hash function as Ruby.. it just wouldn't be a Ruby data structure until you actually needed it to be so.

If you had to have preference as to the name of this gem, would you prefer mini-mime-types or mime-types-compiled (or something along those lines)

@ioquatix
Copy link
Author

ioquatix commented Oct 9, 2016

Just FYI, it should be possible to have a loadable .a file thats < 500Kbytes, has fast lookup, and includes all the data..

@halostatue
Copy link
Member

Um…probably mime-types-mini; this is also something that I‘ve been thinking about trying to make as a pure Ruby gem. What we’d want to do is have it override the autoloading of the MIME types registry (which can currently be done with an environment variable). What I’m really looking for in mime-types 4 is to not load the registry by default, and only load it when you need it. I think that’s a year away, though.

@ioquatix
Copy link
Author

ioquatix commented Oct 9, 2016

Okay, this is a basic example of what I've generated..

https://github.com/ioquatix/mime-types-mini/blob/master/ext/mime_types.c

It just needs a few Ruby functions to hook it up to the existing infrastructure.

Is there any way to reuse parts of MIME::Type without pulling in the data?

@ioquatix
Copy link
Author

ioquatix commented Oct 9, 2016

It compiles to about 200k so I'm pretty sure this can be a significant improvement over what you've got.. What actually makes it balloon out to 6Mbytes? Is it Ruby overhead?

@halostatue
Copy link
Member

It looks like it’s partially Ruby overhead. I’ve got memory use down to 2.5MiB by using the value pool and a couple of other enhancements, which is probably good enough for a point release soon if this is something that I want to pursue.

Derailed provides some useful information with bundle:objects. Before the value pool, there are 93,567 objects allocated (7.3 MiB) and 30,716 objects retained (2.7 MiB). After, there are 98,982 objects allocated (8.4 MiB) and 22,171 objects retained (2.6 MiB). Startup allocates more objects (the value pool does value.dup.freeze) and is ~5 times slower than without the value pool. I haven’t got any decent benchmarks on what the performance changes are for everything else.

One thing that looks troublesome is the use of Set, and where a more efficient data structure than Set may be desirable (about 1.2MiB of memory is retained in Set objects, compared to the 1.4MiB of memory retained in various mime-types objects: 755k for type, 360k for columnar data; 240k for container; 187k for the registry type). So after the value pool version is released, I’ll look at what I can do to reduce set. After using the value pool, Set is still about 1.2MiB, and the value pool is about 850k, but the types themselves are much lower use because of the shared resources.

@ioquatix
Copy link
Author

ioquatix commented Oct 9, 2016

For me, and perhaps for many others, what makes the most sense is having a static library with fast lookup and low overhead.

I've almost got it working, using perfect hashing (gperf). The problem is that it isn't easy to add things to this hash table at run time, so things like adding mime types isn't possible.

The compiled code is going to be < 500Kbytes and the run time performance for the queries I want to do should be significantly better than Ruby hash tables. It will be an interesting experiment.

As an aside it would be nice if you could make Mime::Types slightly more configurable. As it stands, it's not easy to replace Mime Types because so many libraries depend on it, but it would be nice if I could plug in Mime::Types::Mini and have it just work with all the existing code that uses the Mime::Types API.

@halostatue
Copy link
Member

The changes for the value pool are here: https://github.com/mime-types/ruby-mime-types/tree/qa-tool-value-pool. It includes other stuff I was working on for a point release, so…

@ioquatix
Copy link
Author

ioquatix commented Oct 9, 2016

@halostatue Okay, so I've got the basics working.

The compiled database I'm generating is 200k, has almost zero load time, and does quick lookups. It's just a proof of concept.

https://github.com/ioquatix/mime-types-mini

To try it out

require 'mime/types/mini'

Mime::Types::Mini::Database.content_type_for_extension("svg")
 => "image/svg+xml" 

There is a lot of flexibility here but I've just done simple K-V mapping.

I've used gperf to generate "perfect hashing" hash tables, so the hash table is customised to the data..

It's not slow.. perhaps it's not fast either, but:

Benchmark.realtime{Mime::Types::Mini::Database.content_type_for_extension("svg")}
 => 6.5820058807730675e-06

Ideas? Thoughts? Where to go from here?

@ioquatix
Copy link
Author

ioquatix commented Oct 9, 2016

BTW, I've released a really crappy 0.1.0 gem so you can install it as a proof of concept.

@ioquatix
Copy link
Author

ioquatix commented Oct 9, 2016

Okay, so did some more mucking about with this.

Firstly, I took a look at the mail gem. And a few others. I found that at least in the case of the mail gem, the use case was really limited. Given a filename, tell me the content type.

So, I think whatever I do, I'll focus on an API that works for this use case as fast and memory efficiently as possible.

I've built an API in mime-types-mini and found it's at least 4 times faster:

Ruby 2.3.0 at 2016-10-10 01:11:30 +1300
Warming up --------------------------------------
         MIME::Types    16.685k i/100ms
   Mime::Types::Mini    56.401k i/100ms
Calculating -------------------------------------
         MIME::Types    181.928k (± 5.4%) i/s -    917.675k in   5.059039s
   Mime::Types::Mini    714.074k (± 6.6%) i/s -      3.553M in   4.999965s

Comparison:
   Mime::Types::Mini:   714074.1 i/s
         MIME::Types:   181928.1 i/s - 3.93x  slower

This is with no real optimisation or anything else. It's also 20x faster to load which is pretty significant since it's going to be a large chunk of application startup time for a lot of users.

Finally, memory usage. As we already know, MIME::Types uses about 6Mbytes on average, and yo might have got that down to 2Mbytes?

mime-types-mini is using < 200Kbytes for the entire dataset.

I'm keen to push forward but I'd like to work with you to update mime-types gem.

I'm just not quite sure how to fit your existing APIs around what I see as being what people are actually doing with the gem. Here is a rough assessment:

The mail gem doesn't know how to load MIME::Types: https://github.com/mikel/mail/blob/a217776355befa3d8191c4bd3c1fad54e0e27471/lib/mail.rb#L11-L16 - the odd require and fallback should never have been necessary.

The mail gem doesn't know how to handle MIME::Types: https://github.com/mikel/mail/blob/a217776355befa3d8191c4bd3c1fad54e0e27471/lib/mail/attachments_list.rb#L97-L102 - the odd encoding issue and use of .first - i.e. I've got this extension.. let's basically just choose any related content-type.. is there something special about the first one?

The mail gem doesn't know how to match MIME::Type to a given record: https://github.com/mikel/mail/blob/a217776355befa3d8191c4bd3c1fad54e0e27471/lib/mail/attachments_list.rb#L61-L65 - again just using the first one that comes up. Is this the right one? It might be in some cases, and not in others. Finally, the only purpose of this is for guessing the encoding of the file.

Well, I think I'd like to look at how other projects are using MIME::Types before making a decision about the kind of API that's required. But, I think the current API is ambiguous given how it's being used and poorly aligning with requirements.

@halostatue Just so we are on the same page, I think you are an awesome developer and this critique is purely technical. Please don't take anything here personally.

@halostatue
Copy link
Member

So, I have one hard requirement: nothing compiled goes into mime-types proper. Period. I don't build extension-based libraries for a lot of reasons, but mostly because doing so breaks compatibility for all Rubies and platforms.

As to your notes about how mail uses mime-types:

  1. mime-types now uses a columnar store, not the JSON file. This capability was introduced in mime-types 2.6.1 and made default in 3, but mail support{s,ed} all versions of mime-types from 1 to 3. When going from 2 to 3, I didn’t want to break the lower-memory (columnar) use case, enabled by the requiring mime/types/columnar.
  2. The encoding bit is news to me, because I only use UTF-8 for everything, and have since mime-types 2—so this may be compatibility with previous versions of mime-types. Also, the use of #first here is an algorithmic mistake, IMO, but it’s something that a lot of people who use mime-types do:
pry> MIME::Types.type_for('doc')
=> [#<MIME::Type: application/msword>,
 #<MIME::Type: application/word>,
 #<MIME::Type: application/x-msword>,
 #<MIME::Type: application/x-word>,
 #<MIME::Type: text/plain>]

All mime-types actually tells people is “here are some MIME content types associated with the extensions .doc”. They are sorted alphabetically, then by various other factors (and, frankly, I see a bug with the way that the priority sort is done, because it should be application/msword, text/plain, application/word, application/x-msword, and application/x-word—all registered types should be sorted before unregistered types, so it should be sorted on registered? then alphabetical). On VMS, .doc was the standard text file extension. So…using #first here is wrong but common. Using the version that is most appropriate to your use case is right. There are some thoughts I have about having additional parameters or data to try to add some real heuristics…but there’s no good usage data (yes, logically we can guess that .doc is probably application/msword, but what data can we have that allows a computer to do so?).‡
3. This is the same thing, but in reverse; it is a little more likely to be correct to use #first here, especially with mime-types 3 and later, since support for platform MIME types have been removed (it used to be possible to have two variants of text/plain that were differentiated by system (e.g., VMS vs not). Even now, it is possible to have two variants of text/plain:

MIME::Types['text/plain'].size #=> 1
t = MIME::Types['text/plain'].first
t.extensions = %w('xtx')
MIME::Types.add(t)
MIME::Types['text/plain'].size #=> 2

That should print a warning, but for some reason it isn’t and that appears to be a minor regression. Whether it’s a good idea or not is a totally different question.

The use of #first is problematic here, and you’ll also see it with t.extensions.first, which I have simplified to t.preferred_extension…except that there is also t.preferred_extension= allowing the mime data to explicitly include the preferred extension. The data hasn’t yet been updated to provide this, but the logic so far is “prefer the extension that is listed first”.

mime-types was created in 2003 because I needed it for a script that I was writing and I ported it from Perl. It has turned out to be very useful, and I want to keep most of the useful behaviours and make it a full consumer of the mime types data. However, over the last 13 years, how people use it does not quite match what it does. They do, however, use the mutability to add their own records (see various closed bugs in this repo before I moved data out).

I split out mime-types-data because I wanted to enable people to make their own interfaces that match what they expect using data that is in a fairly regularized format (even the columnar data is in a regularized format). I have plans on making mime query APIs in a lot of different languages that either consume the columnar data or the JSON file. There’s a lot more data provided in mime-types than people are using, and I think that the data is a good thing.

@ioquatix
Copy link
Author

ioquatix commented Oct 9, 2016

mime-types-data is awesome and really useful, so thanks for taking the initiative to do that.

Perhaps what would make the most sense is a light weight C (or C++ with C interface) library which could easily be invoked either via an extension or via FFI. You could use the FFI backed library where possible or a Ruby/Python implementation if that isn't possible.

I fundamentally have an issue with mime types being mutable and the mime type database being mutable at a global level. This is touched on briefly here #45

The way I think this should work:

module Mime
    class Type
        def initialize(content_type, encoding, extensions)
            @content_type = content_type
            @encoding = encoding
            @extensions = extensions
        end

        attr :content_type
        attr :encoding
        attr :extensions
    end

    class Container
        def initialize(next = nil)
            @next = next

            @extensions = {}
            @content_types = {}
        end

        attr :extensions
        attr :content_types

        # Add a mime type to this container
        def << mime_type
            mime_type.extensions.each do |extension|
                @extensions[extension] = mime_type
            end

            @content_types[mime_type.content_type] = mime_type
        end

        def content_type_for_extension(extension)
            @extensions.fetch(extension) { @next.content_type_for_extension(extension) }
        end

        def mime_type_for_content_type(content_type)
            @content_types.fetch(extension) { @next.content_type_for_extension(extension) }
        end

        alias [] mime_type_for_content_type
    end

    class RFCBlahData
        def initialize(next = nil)
            @next = next
        end

        # Implemented in C land or wherever.
        # def content_type_for_extension(extension)
        # def mime_type_for_content_type(content_type)
    end
end

my_mimes = Mime::Container.new(Mime::RFCBlahData)
my_mimes << Mime::Type.new("foo/bar", "binary", "bar")

foo_bar_mime_type = my_mimes["foo/bar"]
text_plain_mime_type = my_mimes["text/plain"]

Ideally the library (e.g. mail) provides a way to specify my_mimes to be used for a given mail connection, or defaults to some sane global mime container.

@halostatue
Copy link
Member

There’s a lot more data than what you have specified, and given that I don’t have an automated process for updating the data (since the data is ultimately generated by humans, it requires some intervention periodically), I’m not comfortable making the registry or objects immutable as part of mime-types.

I might be able to be convinced that there can be something mutable that can be frozen after a configuration phase, but again—that would have to be a major version bump as it would be incompatible, or it would need to be a separate gem—which I would be happy to have in the mime-types organization if you want it here.

One of the things I’m stumbling toward (and have been for a while, pretty much since you submitted #45 a few years ago) is making it so that MIME::Types (the container class, not the default registry) is easier to use with a subset of the data, and maybe make it easier to provide a subset of the data as it applies to your application’s configuration without needing to load the entire 9,000 mime-type default registry. At the same time, I also have requests like #67 and (to a degree) #115 that look at adding features and details in the types and the registry. Most people use mime-types through the MIME::Types.type_for and MIME::Types.[] interfaces without even realizing that they are proxies around instance methods MIME::Types#type_for and MIME::Types#[] for the default registry.

@ioquatix
Copy link
Author

ioquatix commented Oct 9, 2016

There’s a lot more data than what you have specified

i totally agree on this point. But, keep in mind most usage (e.g. mail, which is a big one) don't use any of this additional data. It is a different use case which is not well served by this library. There are too many decisions which are not being made correctly (e.g. the usage of #first).

I'm still thinking about this, but a gem which focuses purely on content types and file extensions would be really useful.

Once you have the content type string, it should be easy to use the mime-types gem for further information if required.

@ioquatix
Copy link
Author

So, here is my thoughts.

A new gem, content-types which focuses purely on mapping content type to file extension and file extension to content type.

By default, it doesn't include any data, but only data structures.

Then, pull in additional dependencies, e.g. content-types-iana. This gem has a native pure ruby implementation but tries to use a compiled extension if possible.

Content::Types::IANA would be it's own immutable registry, ideally using the C implementation.

The fields we have in the C registry are largely irrelevant to memory consumption, but there is a case to be made for the most common use case - e.g. don't include obsolete content types (could be a separate registry Content::Types::IANA::Obsolete for example.

I've deliberately preferred content-types over media-types because I've already implemented MediaTypes according to the relevant RFCs (including correct parsing) in http-accept gem. But, I guess this could be pulled out into a separate media-types gem. If you'd be willing to support this under the mime-types org, I'd be happy to do this with your support. We could also move the http-accept gem here too.

In my mind (and generally from RFCs), a media type is a concrete "text/plain". It's the same thing as a content type which is confused because it's the name of the header in http. A media range is "text/*" - i.e. it specifies a range of valid media types. The name "mime-type" is largely replaced with "media-type" in modern terminology. It's good that any gems we make help users understand the modern terminology. Hence it might make sense to use media-types as a gem name. Thoughts?

@halostatue
Copy link
Member

Note that, for the most part, the IANA data does not include extensions. I think I’ve seen it in some of the data, and in a few RFCs, but all of the extension data that I have is pulled from other sources (including the Apache list and various PRs over the years) as well as a bit of judgement.

More people want some of the unregistered types than you think (it’s fairly easy to get an IANA type added in the */vnd.* namespace, but most people don’t bother), because it’s practical to the solutions that they want.

I’ll need to think about what you’ve written a bit more, but it sounds mostly reasonable, and content-types is a reasonable name. I’m not as sure about the separate gems for the registry; it feels unnecessary. Separate requires, sure. Separate gems?

@ioquatix
Copy link
Author

I'll think a bit more about separate gems or not. I just felt like it would be nice to keep compiled code out of the main repo, but I guess it doesn't matter too much.

@ioquatix
Copy link
Author

It would also mean, to a certain extent, that data and functionality could be updated independently.. which might not be a bad thing.

@halostatue
Copy link
Member

Let me clarify: I am not suggesting content-types contains the compiled data. I am suggesting that having multiple data gems is the problem.

@ioquatix
Copy link
Author

Okay, so you'd prefer to see content-types-data which contains, say Content::Types::Data::IANA and Content::Types::Data::Unofficial (and perhaps Content::Types::Data::All)

@ioquatix
Copy link
Author

ioquatix commented Apr 9, 2017

Yeah, and if you read earlier, you'll see this: https://github.com/ioquatix/mime-types-mini which is: 200Kbytes vs 6Mbytes of memory used, and faster too since it uses a perfect hash function.

@stereobooster
Copy link

stereobooster commented Apr 9, 2017 via email

@ioquatix
Copy link
Author

ioquatix commented Apr 9, 2017

Yes, I've mapped mesh data directly to Vulkan so it renders fast. I know my shit. Do you know what a perfect hash function is?

@ioquatix
Copy link
Author

ioquatix commented Apr 9, 2017

The perfect hash lookup is about 4x faster than a native Ruby hash table. So, I'm not sure what you are proposing, perhaps you can implement it and put your ideas on the table so we can compare it.

@ioquatix
Copy link
Author

ioquatix commented Apr 9, 2017

By the way, I'm not trying to be an asshole, I just got a bit frustrated with the direction of the conversation. I think that there is plenty here which backs up my (completely reasonable) assertions and I don't think you have a bad idea, I understand the value of not needing something to be compiled, but if you are serious about it then just implement the hell out of it, and come back here once you have some good benchmarks.

@stereobooster
Copy link

Yes, now I see what is the difference. I thought most of the performance/memory gain was from just the fact of using C instead of Ruby. So in this case using external binary file would make not much difference. But you went one step further and used perfect hash.

@stereobooster
Copy link

So again, to be clear. What I and @halostatue suggest is to use capnproto to save data in external binary file, and use your C extension to load this data (not Ruby implementation), and use general hash function to search this data. Main memory gain is from the fact that you use effective C in memory representation instead of Ruby. Hash function itself doesn't impact on memory consumption it affects on number cycles required to search data in table. Please correct me if you see flaw in what I'm saying. General hash function will perform a little bit worse than perfect hash, but I don't think this will be the problem, because problem that you address is memory no CPU cycles.

@ioquatix
Copy link
Author

ioquatix commented Apr 9, 2017

So again, to be clear. What I and @halostatue suggest is to use capnproto to save data in external binary file, and use your C extension to load this data (not Ruby implementation), and use general hash function to search this data.

You won't save that much memory this way, and their is a pretty big performance overhead loading the data into a Ruby hash.

Main memory gain is from the fact that you use effective C in memory representation instead of Ruby. Hash function itself doesn't impact on memory consumption it affects on number cycles required to search data in table.

The perfect hash function is a signifiant gain in memory because it essentially compresses the hash keys. The hash function itself doesn't affect memory consumption directly, but the representation in memory definitely does. Additionally, loading a dynamically shared library means it would be shared between processes as static immutable code/data. That's also a huge win.

General hash function will perform a little bit worse than perfect hash, but I don't think this will be the problem, because problem that you address is memory no CPU cycles.

The problem I address using my approach:

  • Memory usage significantly reduced, 200Kb vs 6+Mbytes.
  • Static immutable code/data is shared between processes so this is 200Kb over all ruby processes that load the same shared library.
  • 4x improvement in performance for hash lookup using perfect hash function.
  • Near zero startup overhead.

@stereobooster
Copy link

stereobooster commented Apr 9, 2017

You won't save that much memory this way, and their is a pretty big performance overhead loading the data into a Ruby hash.

Why Ruby hash? C implementation of hash table.

The perfect hash function is a signifiant gain in memory because it essentially compresses the hash keys.

There is all the same data stored in generated hash table, which would be stored in general hash table e.g. hash_key, original key, value. Plus general hash table will store pointer to next value in the bucket (or something similar depending on implementation).

#line 685 "Extensions.gperf"
      {"Z", "application/x-compressed"},

Main memory improvement comes from C itself, not perfect hash function.

Memory usage significantly reduced, 200Kb vs 6+Mbytes.

Yes and thats fantastic result. Thanks for doing it.

But if you would use all the same implementation + external data file the difference would be the same significant.

@ioquatix
Copy link
Author

ioquatix commented Apr 9, 2017

Why Ruby hash? C implementation of hash table.

Because Ruby hash is generic for Ruby objects. It's not that efficient for a static lookup.

There is all the same data stored in generated hash table, which would be stored in general hash table e.g. hash_key, original key, value. Plus general hash table will store pointer to next value in the bucket (or something similar depending on implementation).

Your right in the sense that all the entries are in the C code. Of course it has to be somewhere! But, your wrong if you think it's anything like a general hash table. There is no hash_key, no probing, no buckets, no next pointer or any other crap like that to slow down the lookup.

But if you would use all the same implementation + external data file the difference would be the same significant.

Nope, it wouldn't. Because there would be a large startup time, and it can't be shared easily using a DSO.

@stereobooster
Copy link

Because Ruby hash is generic for Ruby objects. It's not that efficient for a static lookup.

I mean do not use Ruby hash. Use C implementation of Hash table. Like you do now. Use the C implementation of Hash table. But instead of using ideal Hash function use general Hash function.

There is no hash_key, no probing, no buckets, no next pointer or any other crap like that to slow down the lookup.

And you think that this is the source of difference 200k vs 6mb?

It seems you have strong believe that I suggest to use Ruby code for this task. It is not like that.

perfect hash general hash Ruby implementation
simplified C hash table general C hash table Ruby hash table
200k ?k 6mb

Of course your solution would be the fastest and most optimal, I never argued with that. But suggested solution would be just a bit slower and you can swap data files easily.

@ioquatix
Copy link
Author

ioquatix commented Apr 9, 2017

I mean do not use Ruby hash. Use C implementation of Hash table. Like you do now. Use the C implementation of Hash table. But instead of using ideal Hash function use general Hash function.

Yes, and it would have a large start-up time to populate such a data structure, and it would have a large non-sharable memory allocation. Two VERY big issues people currently face when using large Ruby code bases, so big that people try to work around it using application servers that do "prefork" initialisation.

A perfect hash algorithm has got almost nothing to do with a general purpose hash table data structure. They don't even need to compute a general hash value at run-time. The hash computation is directly tied to the lookup, and it's a direct array lookup with a single key comparison if required.

And you think that this is the source of difference 200k vs 6mb?

Yes, my solution is the fastest, and most optimal, that I could come up with, and that's what I was aiming for. I think if you are interested, why don't you implement a general purpose C hash in a fork of my code, and then compare it. Startup time, lookup time, and memory usage. Keep in mind that in the above memory usage is actually:

200k per DSO context (e.g. OS). 6Mbytes per process. Whatever you are proposing with a general purpose hash table, is probably also per-process.

Of course your solution would be the fastest and most optimal, I never argued with that. But suggested solution would be just a bit slower and you can swap data files easily.

Swapping data files is not something people usually do, but I've also suggested a way in which this can be achieved more efficiently above, by using nested wrappers, which I think is actually the best solution, rather than having users modify global state.

I appreciate you taking the time to think about this issue. Let's work towards something that makes sense.

@ioquatix
Copy link
Author

ioquatix commented Apr 9, 2017

Here is what I think makes sense.

  • A data in some standard format: JSON, MsgPack, YAML, XML, CSV, TSV, whatever as long as it's standard and easily loadable. Don't care about performance much. Ideally, dataset is divided based on where it comes from, e.g. IANA, unofficial, etc. These are semi-immutable by gem version. This already exists as mime-types-data which is awesome, but I think there is quite a bit of junk in there that could be organised better?
  • A tool to generate native code for efficient, immutable content type lookups. Already done POC.
  • A pure ruby tool to do the same lookups. Already exists in several places.
  • A general purpose interface for doing this, as well as doing nested lookup as proposed. Doesn't really exist.

We would then expose MIME::Types via the same interface.

People who use MIME::Types would have a drop-in replacement which is more efficient in both space and time. Code that uses MIME::Types should expose configuration/arguments which allows for the injection of different providers. Given how hard it would be to get everyone to change from using MIME::Types, I'm not sure which strategy is best, but one of:

  • MIME::Types gem becomes a data-agnostic gem. It provides a light-weight interface as it currently does for constructing a mime type registry, and perhaps exposes it at a global level. The backend can be replaced by a user simply saying something like MIME::Types.provider = MIME::Types::Mini.new or something to that effect.
  • A new gem which at a minimum implements the abstract operations required, and then make it possible to use MIME::Types as a provider, as well as other gems.

@SamSaffron
Copy link

@ioquatix maybe it makes sense to release a non-specific gem here, example possible interface:

h = BigHash.load(filename)
h["test"]
"value"

Then this pattern and encoding structure can be reused in other places that need access to a static hash of sorts, for example, yaml translations have the exact same problem of consuming WAY too much memory these days. So many RVALUES are just sitting there being scanned every major GC for zero value.

Advantage of this kind of approach is that it can have a naive Ruby implementation and high performance MRI or JRuby implementation.

It would also be significantly simpler to integrate into this project cause it could become an explicit dependency.

@ioquatix
Copy link
Author

I did think about this actually.

It would be nice.

But it would have to be compiled on the fly. It's not impossible, it's just a lot of work.

Honestly, there is another question to ask - is mime-types simply loading too much unnecessary data in the first place?

@SamSaffron
Copy link

@ioquatix 99.99% of the time the consumers don't need the vast army of information mime-types provides... it is why I created https://github.com/discourse/mini_mime

The 99.99% case is that people include the mail gem in rails and it in turn includes mime-types just to do a couple of very specific lookups, under pretty specific conditions. The mail gem now uses mini_mime which means that future versions of rails will not depend on full mime types gem.

@ioquatix
Copy link
Author

Thanks for your effort making mini_mime. However, it appears the design is still fundamentally broken due to the dependence on global state. The mail gem should have an argument or configuration parameter which takes in a class implementing a standard API (e.g. perhaps the one you provide). It should be up to the user to provide a mime provider, but there could be a sane default, e.g. on a global provider. But it should be possible to provide a custom provider. Was this ever done?

@SamSaffron
Copy link

However, it appears the design is still fundamentally broken due to the dependence on global state.

Can you explain this? I am not following, mini_mime loads nothing into memory on boot and uses an LRU cache to cycle stuff in as needed topping up at about 400 RVALUEs in memory.

For context:

User.first in Rails is around a 400 RVALUE allocated operation.

@ioquatix
Copy link
Author

Okay, well, you encourage people to use the api: https://github.com/discourse/mini_mime/blob/356b01907d11e30a39ab16f4bace2aa9bd80d699/lib/mini_mime.rb#L4-L11

How does one add a new mime type?

How does one remove an existing mime type?

How does one provide a subset of mime types to the mail gem, and a different subset to some other gem?

@SamSaffron
Copy link

sure we could expand the interface to support this or add a general interface to the mail gem, its not a problem that was solved cause nobody had it :)

@ioquatix
Copy link
Author

As you said yourself, most people don't need 99% of what mime-types provides. So, your not really solving the problem at all, your just making the data set smaller. The problem of the DESIGN still exists, and it's affecting everyone.

@SamSaffron
Copy link

Well it depends on how you define "the problem" :)

For Discourse and all Discourse users "the problem" is that

  1. Loading mime-type gem bloats memory by multiple megabytes and pollutes RVALUE space with 10s of thousands of RVALUEs. (This is the biggest problem)
  2. Loading mime-type gem slightly slows down boot

Both of these are solved by it. Anyway totally fine to create a generic interface like we have for execjs and implement a bridge but I feel it is overkill.

Do you have a specific project you are working on where you need to customize mime types and support non IANA supported or registered mime types?

@ioquatix
Copy link
Author

Yes.

@ioquatix
Copy link
Author

This was discussed here: #45

@ioquatix
Copy link
Author

I like how you've solved the problem, don't get me wrong. But, I just thing somewhere in this big kludge of solutions, is something better and we should aim for it.

@ioquatix
Copy link
Author

@halostatue thanks for your continued effort to look after this code.

You gonna be at Ruby Kaigi?

@halostatue
Copy link
Member

I won’t. I don’t do a lot of conferences, and whole RK is one I’d attend (it’s not in the US), I’m not doing Ruby full time.

I’ve closed this issue, by the bye, since I don’t see an easy way to solve this in the constraints that I’ve placed on this library (must be pure Ruby as to not disadvantage any particular implementation of Ruby). I think that a code-generation approach based on mime-types-data is likely to work well, but should still allow people to customize the values if possible.

I’m planning on doing another round of tests on the value pool implementation, but I don’t know what the best benchmarking tools in Ruby these days are as I’ve moved mostly to Elixir for day-to-day work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants