-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Moving to an org / making a golden standard for NetCDF files in Julia #57
Comments
Concerning
What I meant is that is could be maybe part of a different package, maybe |
Yeap, I do. Justifying make them easy to be seen from other's perspectives :) |
@Alexander-Barth is https://github.com/meggart/DiskArrays.jl along the lines of what you had in mind for By the way, I'm thinking to propose to effectively deprecate NetCDF.jl in favor of this package. We could do only bugfixes there, and point to this package for further development. I haven't discussed this with anyone yet, so @meggart do chime in. (Plus we'd have to ask the other contributors in a NetCDF.jl issue.) But at this point I don't see strong reasons to keep both around. I do like the name NetCDF.jl better. Perhaps you can take it if we deprecate the original haha. |
Thank you, I really applaud you for suggesting this, as I was going to do it anyway, but since I am not a contributor in NetCDF.jl, I would appear quite a mean person... :P |
Yes, One of the reason that I am reluctant to propose NCDatasets to go into a Julia organization, is that I do not want to add to the confusion for Julia users who are not quite sure which library to pick and I do not want to give the impression that there is a competition between organizations. But if there is just one non-deprecated netcdf package, then it should rather be part of an organization. The most expertise in NetCDF lies indeed in the JuliaGeo organization, so it would be a good choice as an organization. The name In any case, I will wait before releasing version 1.0. |
Nice typo in your first sentence ;) In that case, if @meggart agrees, it is probably best to first create an issue in NetCDF.jl as I mentioned, and go from there. To be honest, I have not done an extensive evaluation between the two. Are you aware of any significant things that you can do with NetCDF.jl that you cannot do with NCDatasets.jl? |
Haha, what a wired thing to have in one's mind! ;-)
I compared the docs, and I did not see any issue form this perspective. In NCDatasets.jl, we currently cannot read compound types. Can you read such data in |
No I'm quite sure NetCDF.jl doesn't support that either. I see no reference to it other than in the auto wrapped code, and have not come across compound types myself. |
I think there is a feature in NetCDF.jl that is not present in this package. The only and most important reason for me to use NetCDF.jl instead of this package is type stability when reading from objects representing an NC variable. For example lets generate an example file: import NCDatasets
import NetCDF
NetCDF.nccreate("dataset.nc", "v1", "longitude",1000, "latitude", 500, "time", 31, atts=Dict("scale_factor"=>5.0, "add_offset"=>1.0), t=UInt8)
NetCDF.open("dataset.nc","v1",mode = NetCDF.NC_WRITE) do v
v[:,:,:] = rand(UInt8, 1000,500, 31)
end; and when we read a slice from a ds = NCDatasets.Dataset("dataset.nc")
v = ds["v1"]
@code_warntype v[:,:,1]
while for NetCDF.jl: NetCDF.open("datasettest.nc","v1") do v
@code_warntype v2[:,:,1]
end
Where the call to the C function gets inlined. This is not a problem per se, since one can use function barriers to make this type stable. However, as I remember this was the initial reason for the split of the packages that I as a maintainer was hesitant to accept a PR that would apply NCDatasets.jl/src/core/indexing_in_vars.jl Lines 223 to 231 in d6f71f4
Before I switched my main focus on Zarr I had a solution to this in mind how to apply scale_factor and add_offset while still being type-stable and without making intermediate copies, but I don't know the codebase of NCDatasets well enough to easily see if they would apply here. |
Regarding the future for these packages, we might think about reaching out to JuliaIO. There are NetCDF files in the wild which are nit in the Geo domain (like chemistry) and where CF conventions don't apply. So maybe it would be an option to move NetCDF.jl to JuliaIO (in case they would accept it) as a basic and more lowlevel package and have NCDatasets as part of JuliaGeo which covers more CF-specific functionality. |
@meggart, thanks for pointing out the type-stability which is indeed a concern. I am currently working on a branch where I requite that all attributes affecting the type of the element be present early enough in order to compute the type of an element based of the type of But so far, the type stability issues are still present.
This sounds also like a good solution to me. |
I think I have now a solution to make it type stable. Essentially, all variables affecting the element-type are now part of the @inline CFtransform_offset(data,add_offset) = data + add_offset
@inline CFtransform_offset(data,add_offset::Nothing) = data If I define a variable with only a NetCDF.nccreate("dataset_only_scale.nc", "v1", "longitude",1000, "latitude", 500, "time", 31, atts=Dict("scale_factor"=>5.0), t=UInt8)
ds = NCDatasets.Dataset("dataset_only_scale.nc"); v = ds["v1"]; @code_typed v[1,1,1] Returns now:
But the important part is that for the example above, all types are now concrete: ds = NCDatasets.Dataset("dataset.nc"); v = ds["v1"]; @code_warntype v[:,:,1]
If for some reason, one wants the raw data, this is possible via, the following code (this is now type-stable too): ds = NCDatasets.Dataset("dataset.nc"); v = ds["v1"]; @code_warntype v.var[:,:,1] I noticed that in NCDatasets.jl/src/core/indexing_in_vars.jl Line 156 in 33d3244
|
woohoo, huge progress! |
Nice! Having a singleton dimension is quite useful to store the time information. If it returns only a 2D array, then there is no dimension left to store the associated timestamp. |
@Balinus, you can keep the singleton dimension if you read the full variable, like |
If its of any worth, my opinion would be that |
Just wanted to mentioned that I opened this issue at JuliaIO requesting to move NetCDF.jl there. Just in case someone wants to add something. |
Sounds good to me! We'd probably need an extra clear section in the readme on which package to use. One thing I'm a bit concerned about is maintenance of NetCDF.jl. I'm not so interested to keep doing things on NetCDF.jl if JuliaGeo chooses NCDatasets. Will you still want to do that @meggart? Because we should also be clear with the JuliaIO folks if it's current maintainers effectively abandon it. I see the point that NCDatasets has some geo specific features whereas NetCDF doesn't. The question is, are those features an issue for non geo users? |
I still can't see the reason of both existing, I hope someone writes a reason for this one day... |
I just came across this comment recently, which made me think. I propose that I mention Fabian and Martijn too in the License file (+ a link to github for a full list of contributors) for the use of the netcdf_c.jl file and the error handling code in |
Yes, regarding pure maintenance of the package I would like to continue, because there is some legacy code of mine that I don't want to update. I know that I am not as quick in replying to issues as others, but I think a waiting time of one or two weeks until an issue gets a reply should be acceptable for a package like this.
I think there is no strong enough reason to delete one of them. There is already quite some legacy code around that I think would be nice if it would continue to work. On a personal note I think that the scope of a package named NetCDF.jl should not be to implement as many features as possible but to provide basic IO-functionalities needed to read and write files and leave the rest to the user. I really like the added functionalities of NCDatasets.jl, but many of these are not really unique to NetCDF datasets. For example the Multifile Dataset is basically an implementation of a lazily concatenated array. Why not make a general Julia package for this so that it is also usable for stacking HDF5 files or similar? The same is true for CF conventions. I really appreciate that @Alexander-Barth factored out CFTime.jl and I am using it in my code for encoding and decoding time in Zarr files. To me, this is exactly the way to go. The same is true for applying add_offset and scale_factor, which can be simply done in a lazy manner using https://github.com/JuliaArrays/MappedArrays.jl if one does not want to make a copy of the data. If you look at the python world, I find myself hardly using the python-netcdf4 package anymore, because it is so easy to read any kind of data using xarray, which uses different IO packages as drivers in the background. So in my opinion to use resources efficiently we should concentrate on making rather minimalistic IO packages (and this is where I personally see a package like NetCDF.jl) and build convenience packages on top of that which can handle more than a single backend and give more user-friendliness. |
Ok good. I would not recommend deleting NetCDF.jl altogether, but even if it is in bugfix-only mode, that would be good to signal to users. Waiting a week or two for replies is of course ok in any sense. In general I agree with your favored approach of keeping focused packages and factoring out / reusing packages for non NetCDF specific features. I think the reality is however that sometimes this takes too long, and bundling things together in a package allows to quickly provide users with convenient API's that they are used to from packages like xarray. Developments in the ecosystem (and your own with for instance DiskArrays) may catch up however and should make it easier to apply your favored approach. Perhaps if we keep NetCDF.jl around, we could move the NetCDF IO core of NCDatasets to NetCDF.jl and make NCDatasets depend on that? I don't mind exactly how we do it in the end, as long as there is a clear story for users, and ideally minimal duplication of core NetCDF IO / binary wrapping code. |
If you are interested in why: it is simply that I am not familiar enough with raw HDF5 files and it is only recently that I became aware that CF conventions are used outside of NetCDF files. An essential part of the multifile code (with an aggregation of a given dimension) is implemented in a submodule CatArrays which does not import anything from Factoring-out the CF part, would unfortunately be a lot harder. |
This is very good to know. I will see what I can do. I also found https://github.com/JuliaArrays/LazyArrays.jl but for my purposes they do not fit because they are developed around being in-memory and having fast access to single elements. So your code is definitely a better start for a separate package that could be re-used with any DiskArray |
Perhaps another package for stacked arrays to look at: https://github.com/mcabbott/LazyStack.jl. |
It would be good to revive this discussion to see if we can come to an agreement within the community as to path forward. It's been 3 years and both packages are reasonably mature now and the two package ecosystem isn't all that helpful. My vote would be to move NetCDF.jl to JuliaIO, move NCDatasets.jl to JuliaGeo and then to slowly remove duplication in NCDatasets.jl by building on top of NetCDF.jl. This would make it easier for users to determine which package is most appropriate to use and contibute to. To me this seems like the most sustainable solution in the long-term. @Alexander-Barth and @meggart do you think we can find a path forward that would work for everyone? |
(continuing from an offtopic discussion in #47 (comment) )
I've spent some time arguing why I think having packages in orgs instead of owners is a good thing. Here is my summary: having more contributors is only a small part of why I am suggesting this. Its more so about improving discoverability, improving trust by the community, inviting more collaborators (for some, having your name as an owner can be decisive in not contributing), having more sets of eyes trying to converge on the best design decisions, having more maintainers that can review and merge PRs as well as answer Issues, even fairness (because if this one person is the one that will absolutely take all decisions according to their one personal agenta, its not that fair for the rest of the contributors or the user community I'd say). These are some of the benefits among others. Obviously there are numerous downsides for you personally, e.g. sharing "power". Others have also argued why orgs can be "bad" here: https://discourse.julialang.org/t/newcomer-contributor-in-juliageo-and-co-help-me-get-started/32480 but I remain in favor of orgs.
Besides, such a massively central thing like "loading nc datasets" which I would imagine is used by so many people working in these fields, is just too important to be lacking those aforementioned features. I am a huge proponent of having that "one" package that does that "one" thing as good as possible. I will be for sure spending the next two years working with .nc files, and thus I want to spend time and effort making this golden standard possible, and even channeling some of the community into this effort. In my eyes, this seems to be easier to do with an org.
Ultimately it is obviously up to your personal agenta what happens with this repo, but I just hope that you can see that my suggestions come truly from a "improve the community experience" and not "I want to take power from you". I would believe that other people also share the same view as me. I will continue contributing here weekly of course, but I am also convinced that such a central part of the ecosystem should part of an organization. Of course, you don't have to decide immediately. As I contribute more and more here, the benefits of collaboration might skew your opinion towards the org.
The first reason for opening this discussion is simply laying down the facts from my perspective. The second reason for opening this issue, is that I'd like @Alexander-Barth to please give a bit more transparency about what should belong in this repo or not. This request is motivated by the following comments:
The aforementioned comments hint a lack of transparency on what would be accepted as a PR or not. Working on a feature that when submitted as a PR will be rejected as "unfitting" can be very demotivating... Can you please give some hints on what you think should be here and what not?
My journey in netcdf files started by porting Python code to Julia. Python code used
xarray
for their .nc stuff, and unfortunately, truth be told,xarray
has very long list of features that do not exist in Julia (I was even considering of usingxarray
for all my.nc
related work actually). As I decided to go with the native approach and implement these features instead, doing it here is the only thing natural for me.But that's only if they are accepted obviously :P
The text was updated successfully, but these errors were encountered: