[WiP]: Loading verilator as shared libraries #233
When I attempted this (the remains are in https://github.com/freechipsproject/chisel3/tree/testers2%2Binterprocesstester), I ran into a number of problems:
There may be a way to interpose an expendable class loader to do the … I think this is ultimately possible, but we'll need some mechanism to better coordinate tests using the same simulation, and loading and unloading different simulations.
Seems worth throwing in here that @stevenmburns says that cocotb VCS is 20x faster than the testers VCS implementation. Cocotb is in Python, but there may be tricks that are relevant to us.
…neric_java_home Use System.properties for java_home; report System.load() failures
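For reference, a minimal Java sketch of what reporting a `System.load()` failure can look like. This is not the code from the commit; `NativeLoader`, `tryLoad`, and the library path are hypothetical names used only for illustration. `System.load` needs an absolute path and throws `UnsatisfiedLinkError` when the library is missing or cannot be linked, so the sketch catches that and includes the `java.home` system property in the message.

```java
import java.io.File;

final class NativeLoader {
    private NativeLoader() {}

    /**
     * Tries to load a shared library. Returns null on success, or a
     * human-readable error message on failure. (Hypothetical helper.)
     */
    static String tryLoad(String path) {
        // System.load requires an absolute path to the library file.
        String absolute = new File(path).getAbsolutePath();
        try {
            System.load(absolute);
            return null;
        } catch (UnsatisfiedLinkError e) {
            // Include java.home so a JDK mismatch is easier to spot.
            return "could not load " + absolute
                + " (java.home=" + System.getProperty("java.home") + "): "
                + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumed path for illustration; it should not exist on your machine.
        String error = tryLoad("/definitely/missing/libVsim.so");
        System.out.println(error == null ? "loaded" : error);
    }
}
```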
Thanks Jim, I didn't experience issues reloading a shared library on OS X, to my surprise. I guess dylibs work differently. My understanding is that shared libraries come and go with the classloader, but I'm not really sure how that gets translated to the OS/process level. I suppose another option is to uniqify the class name per compilation unit. I bet you could load a lot of libraries before running into problems.
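On the idea of uniquifying the class name: JNI resolves a native method through a symbol derived from the fully qualified class name and the method name, so distinct wrapper class names give distinct exported symbols. Below is a minimal Java sketch of that naming rule (short form only; overloaded methods additionally get a signature suffix, which is not handled here). `sim.GcdSimA` and `sim.GcdSimB` are hypothetical class names, not ones from this PR.

```java
final class JniNames {
    private JniNames() {}

    /** Escapes one name per the JNI native-method naming rules. */
    static String mangle(String name) {
        StringBuilder out = new StringBuilder();
        for (char c : name.toCharArray()) {
            boolean asciiAlnum = (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9');
            if (asciiAlnum) {
                out.append(c);
            } else if (c == '.' || c == '/') {
                out.append('_');          // package separator
            } else if (c == '_') {
                out.append("_1");         // literal underscore
            } else if (c == ';') {
                out.append("_2");
            } else if (c == '[') {
                out.append("_3");
            } else {
                // Any other character becomes _0 plus four lowercase hex digits.
                out.append(String.format("_0%04x", (int) c));
            }
        }
        return out.toString();
    }

    /** Short-form symbol name the JVM looks up for a native method. */
    static String symbolFor(String className, String methodName) {
        return "Java_" + mangle(className) + "_" + mangle(methodName);
    }

    public static void main(String[] args) {
        System.out.println(symbolFor("sim.GcdSimA", "step")); // Java_sim_GcdSimA_step
        System.out.println(symbolFor("sim.GcdSimB", "step")); // Java_sim_GcdSimB_step
    }
}
```

Two simulations wrapped in differently named classes therefore export different `step` symbols, at least for the JNI entry points; any other global symbols in the libraries are a separate question.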
Also re: VCS, I think you could do this same thing, but I'd be careful about the libraries it pulls in. I think it has more of a runtime than verilator.
The problems I experienced were/are on OSX, with Java 8 and 11.
Really? That's strange. I wonder why I didn't see it. I was using sbt in interactive mode running the verilator test. Here's my java version string:
Here is a git repo that will demonstrate the 20x runtime difference: https://github.com/stevenmburns/cocotb-clamper The bitonic sorter example is in there as well as others.
Try
Here are the runtime numbers. Chisel test with treadle: 4 sec
Thanks @stevenmburns. I think I'll look at that as a comparison point for this PR. @ucbjrl, that does fail for me. I got an error in reset, though; it doesn't look like it is having any problems loading a shared library. My harness uses global variables in shady ways, so my guess is that this is the result of one tester re-initializing the global state while another tester is running. I should be passing state via the calling object rather than abusing
But each shared library defines some of the same symbols. How could you load more than one at a time? Unless you ensure there are no global symbols in common between simulations (i.e., each simulation has its own namespace and defines no globals), I suspect you'll have problems trying to load more than one. And even if you could load more than one, eventually you'll start running up against memory limitations. But in any case, yes, we should reduce the amount of global state.
I can imagine that it's OK for multiple loaded libraries to have the same symbols. I think when you load one, it will assign the native methods to the current addresses and won't get overwritten by later libraries getting loaded. This seems to indicate hundreds of thousands of shared libraries are OK to dynamically load.
Use JNI to save/get a pointer to state.
I got rid of global variables. Instead, I save a pointer to state in the wrapper class. Every function first uses JNI to get the pointer. This has a substantial negative impact on performance, so it is probably worth thinking about. I think we can cache the pointer without ill effect b/c other testers will get mapped to different addresses.
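To illustrate the shape of this change, here is a rough pure-Java sketch of keeping per-simulation state behind an opaque handle held by the wrapper object, instead of in globals. It is only a stand-in: the map plays the role of native memory, and `simCreate`, `simStep`, `simCycles`, `simDestroy`, and `Sim` are hypothetical names, not the functions in this PR. In the real harness the handle would be a native pointer stored as a `long` and the four static methods would be native.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class HandleSketch {
    private HandleSketch() {}

    // Stand-in for native memory: opaque handle -> one simulation's state.
    private static final Map<Long, long[]> NATIVE_HEAP = new ConcurrentHashMap<>();
    private static final AtomicLong NEXT_HANDLE = new AtomicLong(1);

    // Hypothetical stand-ins for what would be native methods.
    static long simCreate() {
        long handle = NEXT_HANDLE.getAndIncrement();
        NATIVE_HEAP.put(handle, new long[] {0}); // cycle counter starts at 0
        return handle;
    }
    static void simStep(long handle) { NATIVE_HEAP.get(handle)[0]++; }
    static long simCycles(long handle) { return NATIVE_HEAP.get(handle)[0]; }
    static void simDestroy(long handle) { NATIVE_HEAP.remove(handle); }

    /** Wrapper object: owns its handle, so no state is shared between testers. */
    static final class Sim implements AutoCloseable {
        private final long handle = simCreate();
        void step() { simStep(handle); }
        long cycles() { return simCycles(handle); }
        @Override public void close() { simDestroy(handle); }
    }

    public static void main(String[] args) {
        try (Sim a = new Sim(); Sim b = new Sim()) {
            a.step(); a.step(); a.step();
            b.step();
            // Each tester sees only its own state.
            System.out.println(a.cycles() + " " + b.cycles()); // 3 1
        }
    }
}
```

Because every call carries its own handle, creating a second tester cannot re-initialize the first one's state, which is the failure mode suspected earlier in the thread.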
This makes an enormous difference; it's back to the original performance. Not sure why this is particularly different from using globals the way I was before, but it isn't crashing for me. I think there's a better way to cache all this, though.
This should help with the IPC overhead. Not sure how much of a pain it will be to make this portable.
The way it works is this:
Things are hacky, but less hacky than I'd have thought. Getting this to work on your system shouldn't be too crazy, it will mostly depend on getting your verilator CFLAGS right. I haven't noticed any weird crashes.
As far as performance goes, on my GCD test I'm getting ~400kHz sustained. I think that's over a 5x speed-up if our old numbers are right.