Remote sensing in R #56
Nice topic!
I think a major challenge is that we have petabytes rather than terabytes of RS data, and that requires very different tools, skills, and resources to manage than what R was designed for.
I personally don't think of this in terms of "ready or not" (nor in terms of "recent leaps"), but several people, including me, are actively working in this direction, and some more coordination between them would probably be good for everyone. Another question is whether the RS community is ready for R, or more generally for doing statistical inference rather than ML.
Thanks for your comment!
I think there has always been a strong connection between R and RS, especially in research (see …).
Heh, a question I have been asking myself for a while...
@edzer is right: peta. And IMHO it is not about the data that is available, but rather about knowing what is behind the data and how to utilize it, and then how to process it. I would say the tools are available, in one form or another; it is just a matter of choosing the proper one and building a stable workflow.
Yes, they are. They are scattered across many blogs, publications, or just vignettes. Some of them are very superficial and only touch on the subject; some are a bit better. I had a dream :) A publication which shows the power of open data in spatial science: what the outcome might be, where to get the data, and how to process it. If your goal is to spend the rest of your life working on such a multi-volume living edition, then I'm in. Regards,
Here is @appelmar's great presentation with examples on this topic: https://appelmar.github.io/CONAE_2022/
Dear @kadyb @edzer @appelmar @gsapijaszko and others, @kadyb asked me to share my thoughts on the matter here. Apologies in advance for a long thread. In what follows, I will concentrate on topics related to big Earth observation (EO) data analysis in R, which by extension also covers dealing with small data sets.

(a) Access to big EO data collections: apart from GEE, most EO cloud providers support the STAC protocol. In R, STAC support is well addressed by the rstac package (a sketch follows at the end of this list).

(b) Creation of EO data cubes from cloud collections: this is an issue vastly underestimated by EO developers, especially those in Python. Part of the problem is the conflation between the abstract concept of a data cube and the concrete collections of files stored by the providers.

(c) Satellite image time series analysis: despite my obvious conflict of interest, I will argue that sits addresses this well.

(d) Single image analysis: remarkably, there is currently no package in R that supports traditional 2D image processing using data cubes.

(e) Object-based image analysis: another area where R is lacking in effort. We need a package that does OBIA in connection with EO data cubes. Even simple region-growing segmentation algorithms are missing.

(f) Deep learning algorithms: another area where R lags behind Python. For image time series, our team at INPE was fortunate to be supported by Charlotte Pelletier, who helped us convert current state-of-the-art PyTorch algorithms, such as her own TempCNN, to R.

(g) Converting DL algorithms from PyTorch to R: translating from PyTorch to the R torch package is feasible, since both build on the same libtorch library.

(h) Image data formats: here, there appears to be an ongoing battle between ZARR and COG. Most data providers, such as MPC and AWS, provide their 2D imagery in COG. However, for multidimensional data used in Oceanography and Meteorology, ZARR is replacing netCDF. For R developers, the good news is that GDAL supports both ZARR and COG. The bad news is that their significant differences have relevant consequences for efficiency when processing large data sets.

(i) Parallel processing using COG: the COG data format is well-suited for parallel processing. Since images are stored in chunks, processing large data sets in virtual machines with many cores is (conceptually) simple. Each VM core receives a chunk, processes it, and writes the result to disk. Partial results are then joined to get the final product. All happens behind the scenes. By contrast, Python users who rely on xarray/DASK work with a primarily in-memory data structure (more on this below).

(j) Active and self-supervised learning: this is another area where we need serious investment. The quality of training data is the defining factor in EO product accuracy. However, there is much more current effort (as measured by published papers) in new algorithms than in better learning methods. I work directly with remote sensing experts who understand the landscape. They go to the field and assign labels to places. Then they take an image of the area close to the date when they were in the field and use the samples collected to classify the image. In the large-scale work we are doing in Brazil, it is hard for them to assign labels to multi-temporal data: bare soil areas in the dry season will be wetlands in the rainy season; there may be two or three crops per year; farmers may mix grain production with cattle raising. Further progress on big EO data analysis requires methods to support data labelling. The need for self-supervised learning is well recognised in the deep learning community. Please read what Yann LeCun (VP of Meta for AI) wrote in this blog post: https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/

(k) GEE x MPC: for me, there is no choice.
GEE is frozen. In MPC, we can make real progress, developing and testing new methods.
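As a concrete illustration of (a) and (b) above, a minimal sketch using {rstac} and {gdalcubes}; the collection, bounding box, dates, and cube extent are illustrative placeholders, not the actual INPE workflow:

```r
# Sketch of (a) and (b): search a STAC catalogue with {rstac} and build a
# regular data cube with {gdalcubes}. Collection, bbox, dates, and cube
# extent are placeholders.
library(rstac)
library(gdalcubes)

# (a) STAC search on the Microsoft Planetary Computer
items <- stac("https://planetarycomputer.microsoft.com/api/stac/v1") |>
  stac_search(
    collections = "sentinel-2-l2a",
    bbox        = c(-47.6, -15.9, -47.3, -15.6),   # xmin, ymin, xmax, ymax
    datetime    = "2022-01-01/2022-12-31"
  ) |>
  post_request() |>
  items_sign(sign_fn = sign_planetary_computer())  # sign MPC asset URLs

# (b) build an image collection and a regular monthly 100 m cube view
col <- stac_image_collection(items$features,
                             asset_names = c("B04", "B08"))
v <- cube_view(
  srs = "EPSG:32723", dx = 100, dy = 100, dt = "P1M",
  extent = list(left = 230000, right = 265000,
                bottom = 8240000, top = 8275000,
                t0 = "2022-01-01", t1 = "2022-12-31"),
  aggregation = "median", resampling = "bilinear"
)

# lazily defined NDVI cube; computation happens on plot() or write_ncdf()
ndvi <- raster_cube(col, v) |>
  select_bands(c("B04", "B08")) |>
  apply_pixel("(B08-B04)/(B08+B04)", "NDVI")
```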
To sum up, and to answer @kadyb's question: I firmly believe that progress in big EO data analysis is driven by community efforts, not by independent developers producing the next algorithm. The R ecosystem is much more community-driven than Python's. Developing sits, for example, has only been possible as a community effort.
Thanks for this thread. I have a question for @gilbertocamara: you mention that "The COG data format is well-suited for parallel processing. Since images are stored in chunks, processing large data sets in virtual machines with many cores is (conceptually) simple". Have you actually seen this implemented? I have a colleague who is very good with DASK, so it would be very interesting to do some benchmarking of these two approaches.
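To make the chunked-COG idea concrete, here is a rough sketch in plain R using {terra} and {parallel}; the COG URL is a placeholder, and this is not the actual {sits} implementation:

```r
# Rough sketch of block-wise parallel processing of a COG. Each worker
# re-opens the raster by URL; GDAL's HTTP range requests then fetch only
# the tiles covering its block.
library(terra)
library(parallel)

url <- "/vsicurl/https://example.com/some_cog.tif"  # placeholder COG
r   <- rast(url)

# split the raster into blocks of up to 512 rows
rows  <- seq(1, nrow(r), by = 512)
nrows <- pmin(512, nrow(r) - rows + 1)

process_block <- function(i) {
  r <- rast(url)  # re-open inside the worker (SpatRaster objects
                  # cannot be shared across processes)
  v <- readValues(r, row = rows[i], nrows = nrows[i], mat = TRUE)
  colMeans(v, na.rm = TRUE)  # stand-in for the real per-block computation
}

# fork-based parallelism; on Windows use parLapply() with a cluster
res <- mclapply(seq_along(rows), process_block, mc.cores = 4)
```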
Dear @derek-corcoran-barrios, we use the "sits" package for large-scale land use classification of Brazilian biomes such as the Amazon (4 million km²) and the Cerrado (2 million km²). For your information, LUCC classification in the Amazon uses 4,000 Sentinel-2 images, each with 10-meter resolution and 10,000 x 10,000 pixels. The data cube has 12 bands per image and 23 temporal instances per year. The total data is 220 TB. 150,000 samples were selected to train the deep learning classification algorithm. These data sets do not fit in main memory, even on large virtual machines. The "sits" software optimizes I/O access to large EO collections using COG files.

The comparison with Python (xarray/DASK) has to be considered with care. Xarray is primarily an in-memory data structure, and DASK would then be limited to parallelizing data in memory. For big areas such as the Brazilian biomes, the combination of xarray/DASK faces scalability challenges. We can provide a script that uses data in Microsoft Planetary Computer to run a classification of a large area in Brazil, and provide the training data, so that you can try to replicate the result using xarray/DASK and we can compare the results.
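A rough outline of what such a {sits} script looks like; the ROI, dates, and training samples below are placeholders, not the actual Brazilian-biome inputs:

```r
# Rough outline of a {sits} classification using MPC data. ROI, dates,
# and training samples are placeholders.
library(sits)

cube <- sits_cube(
  source     = "MPC",
  collection = "SENTINEL-2-L2A",
  roi        = c(lon_min = -47.6, lat_min = -15.9,
                 lon_max = -47.3, lat_max = -15.6),
  start_date = "2022-01-01",
  end_date   = "2022-12-31"
)

# regularize to fixed 16-day time steps before classification
reg_cube <- sits_regularize(cube, period = "P16D", res = 10,
                            output_dir = tempdir(), multicores = 4)

# 'samples' must be a sits samples tibble (lat/long, dates, labels)
# supplied by the user; none is bundled here
model <- sits_train(samples, ml_method = sits_tempcnn())

probs <- sits_classify(reg_cube, ml_model = model,
                       output_dir = tempdir(), multicores = 4)
map   <- sits_label_classification(probs, output_dir = tempdir())
```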
Thanks @gilbertocamara, that would be amazing. We have a hybrid Python and R team for spatial analyses in our workplace, and we are right now debating the best packages/libraries and formats. It would be awesome if you could provide such a script. Thanks!
Since the first post on this topic, the situation has improved and more training materials and examples have appeared:

- STAC (SpatioTemporal Asset Catalogs):
- STAC with GDALCUBES:
- SITS (Satellite Image Time Series):
- openEO:
- rsi (Retrieving Satellite Imagery):
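To give a flavour of {rsi}, a minimal sketch that downloads a Sentinel-2 composite for an area of interest; the location, buffer, and date range are placeholders:

```r
# Minimal {rsi} sketch: fetch a Sentinel-2 composite for a small AOI.
library(rsi)
library(sf)

aoi <- st_point(c(-74.912131, 44.080410)) |>  # arbitrary location
  st_sfc(crs = 4326) |>
  st_transform(5070) |>                       # projected CRS for buffering
  st_buffer(1000)

s2_file <- get_sentinel2_imagery(
  aoi,
  start_date = "2022-06-01",
  end_date   = "2022-06-30",
  output_filename = tempfile(fileext = ".tif")
)
terra::rast(s2_file)  # inspect the downloaded composite
```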
Very happy to see that first link here! FWIW, my understanding is that those tutorials are going to be incorporated into stacspec.org in a few months (this is attached to a redesign of the website, so no precise timeline), at which point that'll be the actual URL. Just wanted to drop crumbs for future readers in case that link stops working (and to say that, should anyone have feedback on the tutorials, I'd love to incorporate suggestions!)
For more resources on SITS and satellite image time series in general, please see my interview with Robin Cole: https://www.youtube.com/watch?v=0_wt_m6DoyI |
@kadyb, the package …
I would like to raise a topic and hear your opinions about the status of remote sensing in R. Do you also have the impression that there has been a technological leap recently, but the available teaching materials and workflows are outdated? There are so many new things compared to the past that it can be overwhelming. Here I listed some hot topics:
There is probably a lot more that I am not aware of.
I think the most popular book for remote sensing in R is the one by Aniruddha Ghosh and Robert Hijmans (https://rspatial.org/terra/rs/index.html), but it covers very basic topics and focuses on local computing, not the cloud. Online tutorials also focus on very simple topics and use the old {raster} package. Nevertheless, most of the mentioned things are covered on the r-spatial blog.

Basically, we have terabytes of remotely acquired data, but is the R ecosystem ready for that, and do we have the specialists? What are your thoughts?