-
-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory usage/overhead? #123
Comments
The reason why I mentioned "transactional API" in the title was because I was thinking about how to deal with memory usage. If it turns out the current overheads can't be reduced, what about implementing an API that loads and discards the data after use, e.g.
|
The 6Mb sounds a little high (it used to be ~17); there’s a bit of stuff I can do to reduce usage that I haven’t done yet (there’s an issue in this tracker about a value pool) mostly because I can’t figure out the best way to determine the memory use vs speed overhead of the pool (the pool is a self-referencing hash): pool = {}
application = 'application'
pool[application] = application This is important because every single MIME type has a content type ( I’ve been doing a bit of playing (I can use derailed bundle:mem with mime-types directly, which is useful), and a few various changes have reduced the memory use by about 1.5–2Mb. I need to play with the pieces to see which ones are actually effective vs ineffective, but… |
I'm having a bit of a hack.. the results will be interesting. Generating a If you had to have preference as to the name of this gem, would you prefer |
Just FYI, it should be possible to have a loadable .a file thats < 500Kbytes, has fast lookup, and includes all the data.. |
Um…probably |
Okay, this is a basic example of what I've generated.. https://github.com/ioquatix/mime-types-mini/blob/master/ext/mime_types.c It just needs a few Ruby functions to hook it up to the existing infrastructure. Is there any way to reuse parts of MIME::Type without pulling in the data? |
It compiles to about 200k so I'm pretty sure this can be a significant improvement over what you've got.. What actually makes it balloon out to 6Mbytes? Is it Ruby overhead? |
It looks like it’s partially Ruby overhead. I’ve got memory use down to 2.5MiB by using the value pool and a couple of other enhancements, which is probably good enough for a point release soon if this is something that I want to pursue. Derailed provides some useful information with One thing that looks troublesome is the use of Set, and where a more efficient data structure than Set may be desirable (about 1.2MiB of memory is retained in Set objects, compared to the 1.4MiB of memory retained in various mime-types objects: 755k for type, 360k for columnar data; 240k for container; 187k for the registry type). So after the value pool version is released, I’ll look at what I can do to reduce set. After using the value pool, Set is still about 1.2MiB, and the value pool is about 850k, but the types themselves are much lower use because of the shared resources. |
For me, and perhaps for many others, what makes the most sense is having a static library with fast lookup and low overhead. I've almost got it working, using perfect hashing (gperf). The problem is that it isn't easy to add things to this hash table at run time, so things like adding mime types isn't possible. The compiled code is going to be < 500Kbytes and the run time performance for the queries I want to do should be significantly better than Ruby hash tables. It will be an interesting experiment. As an aside it would be nice if you could make Mime::Types slightly more configurable. As it stands, it's not easy to replace Mime Types because so many libraries depend on it, but it would be nice if I could plug in Mime::Types::Mini and have it just work with all the existing code that uses the Mime::Types API. |
The changes for the value pool are here: https://github.com/mime-types/ruby-mime-types/tree/qa-tool-value-pool. It includes other stuff I was working on for a point release, so… |
@halostatue Okay, so I've got the basics working. The compiled database I'm generating is 200k, has almost zero load time, and does quick lookups. It's just a proof of concept. https://github.com/ioquatix/mime-types-mini To try it out
There is a lot of flexibility here but I've just done simple K-V mapping. I've used gperf to generate "perfect hashing" hash tables, so the hash table is customised to the data.. It's not slow.. perhaps it's not fast either, but:
Ideas? Thoughts? Where to go from here? |
BTW, I've released a really crappy 0.1.0 gem so you can install it as a proof of concept. |
Okay, so did some more mucking about with this. Firstly, I took a look at the mail gem. And a few others. I found that at least in the case of the mail gem, the use case was really limited. Given a filename, tell me the content type. So, I think whatever I do, I'll focus on an API that works for this use case as fast and memory efficiently as possible. I've built an API in
This is with no real optimisation or anything else. It's also 20x faster to load which is pretty significant since it's going to be a large chunk of application startup time for a lot of users. Finally, memory usage. As we already know, MIME::Types uses about 6Mbytes on average, and yo might have got that down to 2Mbytes?
I'm keen to push forward but I'd like to work with you to update I'm just not quite sure how to fit your existing APIs around what I see as being what people are actually doing with the gem. Here is a rough assessment: The mail gem doesn't know how to load MIME::Types: https://github.com/mikel/mail/blob/a217776355befa3d8191c4bd3c1fad54e0e27471/lib/mail.rb#L11-L16 - the odd require and fallback should never have been necessary. The mail gem doesn't know how to handle MIME::Types: https://github.com/mikel/mail/blob/a217776355befa3d8191c4bd3c1fad54e0e27471/lib/mail/attachments_list.rb#L97-L102 - the odd encoding issue and use of .first - i.e. I've got this extension.. let's basically just choose any related content-type.. is there something special about the first one? The mail gem doesn't know how to match MIME::Type to a given record: https://github.com/mikel/mail/blob/a217776355befa3d8191c4bd3c1fad54e0e27471/lib/mail/attachments_list.rb#L61-L65 - again just using the first one that comes up. Is this the right one? It might be in some cases, and not in others. Finally, the only purpose of this is for guessing the encoding of the file. Well, I think I'd like to look at how other projects are using MIME::Types before making a decision about the kind of API that's required. But, I think the current API is ambiguous given how it's being used and poorly aligning with requirements. @halostatue Just so we are on the same page, I think you are an awesome developer and this critique is purely technical. Please don't take anything here personally. |
So, I have one hard requirement: nothing compiled goes into mime-types proper. Period. I don't build extension-based libraries for a lot of reasons, but mostly because doing so breaks compatibility for all Rubies and platforms. As to your notes about how mail uses mime-types:
pry> MIME::Types.type_for('doc')
=> [#<MIME::Type: application/msword>,
#<MIME::Type: application/word>,
#<MIME::Type: application/x-msword>,
#<MIME::Type: application/x-word>,
#<MIME::Type: text/plain>] All mime-types actually tells people is “here are some MIME content types associated with the extensions MIME::Types['text/plain'].size #=> 1
t = MIME::Types['text/plain'].first
t.extensions = %w('xtx')
MIME::Types.add(t)
MIME::Types['text/plain'].size #=> 2 That should print a warning, but for some reason it isn’t and that appears to be a minor regression. Whether it’s a good idea or not is a totally different question. The use of mime-types was created in 2003 because I needed it for a script that I was writing and I ported it from Perl. It has turned out to be very useful, and I want to keep most of the useful behaviours and make it a full consumer of the mime types data. However, over the last 13 years, how people use it does not quite match what it does. They do, however, use the mutability to add their own records (see various closed bugs in this repo before I moved data out). I split out mime-types-data because I wanted to enable people to make their own interfaces that match what they expect using data that is in a fairly regularized format (even the columnar data is in a regularized format). I have plans on making mime query APIs in a lot of different languages that either consume the columnar data or the JSON file. There’s a lot more data provided in mime-types than people are using, and I think that the data is a good thing. |
Perhaps what would make the most sense is a light weight C (or C++ with C interface) library which could easily be invoked either via an extension or via FFI. You could use the FFI backed library where possible or a Ruby/Python implementation if that isn't possible. I fundamentally have an issue with mime types being mutable and the mime type database being mutable at a global level. This is touched on briefly here #45 The way I think this should work:
Ideally the library (e.g. mail) provides a way to specify |
There’s a lot more data than what you have specified, and given that I don’t have an automated process for updating the data (since the data is ultimately generated by humans, it requires some intervention periodically), I’m not comfortable making the registry or objects immutable as part of mime-types. I might be able to be convinced that there can be something mutable that can be frozen after a configuration phase, but again—that would have to be a major version bump as it would be incompatible, or it would need to be a separate gem—which I would be happy to have in the mime-types organization if you want it here. One of the things I’m stumbling toward (and have been for a while, pretty much since you submitted #45 a few years ago) is making it so that |
i totally agree on this point. But, keep in mind most usage (e.g. mail, which is a big one) don't use any of this additional data. It is a different use case which is not well served by this library. There are too many decisions which are not being made correctly (e.g. the usage of I'm still thinking about this, but a gem which focuses purely on content types and file extensions would be really useful. Once you have the content type string, it should be easy to use the mime-types gem for further information if required. |
So, here is my thoughts. A new gem, By default, it doesn't include any data, but only data structures. Then, pull in additional dependencies, e.g.
The fields we have in the C registry are largely irrelevant to memory consumption, but there is a case to be made for the most common use case - e.g. don't include obsolete content types (could be a separate registry I've deliberately preferred In my mind (and generally from RFCs), a media type is a concrete "text/plain". It's the same thing as a content type which is confused because it's the name of the header in http. A media range is "text/*" - i.e. it specifies a range of valid media types. The name "mime-type" is largely replaced with "media-type" in modern terminology. It's good that any gems we make help users understand the modern terminology. Hence it might make sense to use |
Note that, for the most part, the IANA data does not include extensions. I think I’ve seen it in some of the data, and in a few RFCs, but all of the extension data that I have is pulled from other sources (including the Apache list and various PRs over the years) as well as a bit of judgement. More people want some of the unregistered types than you think (it’s fairly easy to get an IANA type added in the I’ll need to think about what you’ve written a bit more, but it sounds mostly reasonable, and |
I'll think a bit more about separate gems or not. I just felt like it would be nice to keep compiled code out of the main repo, but I guess it doesn't matter too much. |
It would also mean, to a certain extent, that data and functionality could be updated independently.. which might not be a bad thing. |
Let me clarify: I am not suggesting |
Okay, so you'd prefer to see |
Yeah, and if you read earlier, you'll see this: https://github.com/ioquatix/mime-types-mini which is: 200Kbytes vs 6Mbytes of memory used, and faster too since it uses a perfect hash function. |
did you take a time to read about capnproto? It is binary format which can
be directly mapped runtime memory
|
Yes, I've mapped mesh data directly to Vulkan so it renders fast. I know my shit. Do you know what a perfect hash function is? |
The perfect hash lookup is about 4x faster than a native Ruby hash table. So, I'm not sure what you are proposing, perhaps you can implement it and put your ideas on the table so we can compare it. |
By the way, I'm not trying to be an asshole, I just got a bit frustrated with the direction of the conversation. I think that there is plenty here which backs up my (completely reasonable) assertions and I don't think you have a bad idea, I understand the value of not needing something to be compiled, but if you are serious about it then just implement the hell out of it, and come back here once you have some good benchmarks. |
Yes, now I see what is the difference. I thought most of the performance/memory gain was from just the fact of using C instead of Ruby. So in this case using external binary file would make not much difference. But you went one step further and used perfect hash. |
So again, to be clear. What I and @halostatue suggest is to use capnproto to save data in external binary file, and use your C extension to load this data (not Ruby implementation), and use general hash function to search this data. Main memory gain is from the fact that you use effective C in memory representation instead of Ruby. Hash function itself doesn't impact on memory consumption it affects on number cycles required to search data in table. Please correct me if you see flaw in what I'm saying. General hash function will perform a little bit worse than perfect hash, but I don't think this will be the problem, because problem that you address is memory no CPU cycles. |
You won't save that much memory this way, and their is a pretty big performance overhead loading the data into a Ruby hash.
The perfect hash function is a signifiant gain in memory because it essentially compresses the hash keys. The hash function itself doesn't affect memory consumption directly, but the representation in memory definitely does. Additionally, loading a dynamically shared library means it would be shared between processes as static immutable code/data. That's also a huge win.
The problem I address using my approach:
|
Why Ruby hash? C implementation of hash table.
There is all the same data stored in generated hash table, which would be stored in general hash table e.g. hash_key, original key, value. Plus general hash table will store pointer to next value in the bucket (or something similar depending on implementation). #line 685 "Extensions.gperf"
{"Z", "application/x-compressed"}, Main memory improvement comes from C itself, not perfect hash function.
Yes and thats fantastic result. Thanks for doing it. But if you would use all the same implementation + external data file the difference would be the same significant. |
Because Ruby hash is generic for Ruby objects. It's not that efficient for a static lookup.
Your right in the sense that all the entries are in the C code. Of course it has to be somewhere! But, your wrong if you think it's anything like a general hash table. There is no hash_key, no probing, no buckets, no next pointer or any other crap like that to slow down the lookup.
Nope, it wouldn't. Because there would be a large startup time, and it can't be shared easily using a DSO. |
I mean do not use Ruby hash. Use C implementation of Hash table. Like you do now. Use the C implementation of Hash table. But instead of using ideal Hash function use general Hash function.
And you think that this is the source of difference 200k vs 6mb? It seems you have strong believe that I suggest to use Ruby code for this task. It is not like that.
Of course your solution would be the fastest and most optimal, I never argued with that. But suggested solution would be just a bit slower and you can swap data files easily. |
Yes, and it would have a large start-up time to populate such a data structure, and it would have a large non-sharable memory allocation. Two VERY big issues people currently face when using large Ruby code bases, so big that people try to work around it using application servers that do "prefork" initialisation. A perfect hash algorithm has got almost nothing to do with a general purpose hash table data structure. They don't even need to compute a general hash value at run-time. The hash computation is directly tied to the lookup, and it's a direct array lookup with a single key comparison if required.
Yes, my solution is the fastest, and most optimal, that I could come up with, and that's what I was aiming for. I think if you are interested, why don't you implement a general purpose C hash in a fork of my code, and then compare it. Startup time, lookup time, and memory usage. Keep in mind that in the above memory usage is actually: 200k per DSO context (e.g. OS). 6Mbytes per process. Whatever you are proposing with a general purpose hash table, is probably also per-process.
Swapping data files is not something people usually do, but I've also suggested a way in which this can be achieved more efficiently above, by using nested wrappers, which I think is actually the best solution, rather than having users modify global state. I appreciate you taking the time to think about this issue. Let's work towards something that makes sense. |
Here is what I think makes sense.
We would then expose People who use
|
@ioquatix maybe it makes sense to release a non-specific gem here, example possible interface:
Then this pattern and encoding structure can be reused in other places that need access to a static hash of sorts, for example, yaml translations have the exact same problem of consuming WAY too much memory these days. So many RVALUES are just sitting there being scanned every major GC for zero value. Advantage of this kind of approach is that it can have a naive Ruby implementation and high performance MRI or JRuby implementation. It would also be significantly simpler to integrate into this project cause it could become an explicit dependency. |
I did think about this actually. It would be nice. But it would have to be compiled on the fly. It's not impossible, it's just a lot of work. Honestly, there is another question to ask - is mime-types simply loading too much unnecessary data in the first place? |
@ioquatix 99.99% of the time the consumers don't need the vast army of information mime-types provides... it is why I created https://github.com/discourse/mini_mime The 99.99% case is that people include the |
Thanks for your effort making |
Can you explain this? I am not following, mini_mime loads nothing into memory on boot and uses an LRU cache to cycle stuff in as needed topping up at about 400 RVALUEs in memory. For context: User.first in Rails is around a 400 RVALUE allocated operation. |
Okay, well, you encourage people to use the api: https://github.com/discourse/mini_mime/blob/356b01907d11e30a39ab16f4bace2aa9bd80d699/lib/mini_mime.rb#L4-L11 How does one add a new mime type? How does one remove an existing mime type? How does one provide a subset of mime types to the mail gem, and a different subset to some other gem? |
sure we could expand the interface to support this or add a general interface to the mail gem, its not a problem that was solved cause nobody had it :) |
As you said yourself, most people don't need 99% of what mime-types provides. So, your not really solving the problem at all, your just making the data set smaller. The problem of the DESIGN still exists, and it's affecting everyone. |
Well it depends on how you define "the problem" :) For Discourse and all Discourse users "the problem" is that
Both of these are solved by it. Anyway totally fine to create a generic interface like we have for execjs and implement a bridge but I feel it is overkill. Do you have a specific project you are working on where you need to customize mime types and support non IANA supported or registered mime types? |
Yes. |
This was discussed here: #45 |
I like how you've solved the problem, don't get me wrong. But, I just thing somewhere in this big kludge of solutions, is something better and we should aim for it. |
@halostatue thanks for your continued effort to look after this code. You gonna be at Ruby Kaigi? |
I won’t. I don’t do a lot of conferences, and whole RK is one I’d attend (it’s not in the US), I’m not doing Ruby full time. I’ve closed this issue, by the bye, since I don’t see an easy way to solve this in the constraints that I’ve placed on this library (must be pure Ruby as to not disadvantage any particular implementation of Ruby). I think that a code-generation approach based on mime-types-data is likely to work well, but should still allow people to customize the values if possible. I’m planning on doing another round of tests on the value pool implementation, but I don’t know what the best benchmarking tools in Ruby these days are as I’ve moved mostly to Elixir for day-to-day work. |
I've been reviewing memory usage in my web framework. I'm sure you already know this but mime-types is a memory hog, accounting for 55% of my base web application memory usage.
To me, this seems a bit strange, as the mime-types.json file is only ~400Kbytes.
Is this because you are building a lot of in-memory data structures for the mime types?
If the data is 400Kbytes, and the problem is ruby overhead, what about compiling the data into a C extension and providing a query mechanism? If it's read-only constants in a C library the overhead would be reduced significantly and also shared between processes.
Perhaps you can give me some ideas of whether this memory usage is reasonable according to your own experiences and then perhaps we can brainstorm some ideas to reduce memory usage.
The text was updated successfully, but these errors were encountered: