MapAlgebra

Benchmarking

PNGs - Manual indexing vs. Shared Memory

This is a comparison of two approaches to performing the transformation Raster p r c PixelRGBA8 -> Image PixelRGBA8. The built-in encodePalettedPng function offers a similar transformation, Raster p r c Word8 -> Image Pixel8, and is quite fast, but it wouldn't let us use the alpha channel, so we can't consider it here.

The manual indexing approach uses the generateImage function from JuicyPixels while indexing through every element of the Raster. This is not parallelizable.

-- | Manual indexing method (no memory sharing).
indexing :: Raster p r c PixelRGBA8 -> Image PixelRGBA8
indexing (Raster a) = generateImage f w h
  where (Z :. h :. w) = R.extent a
        f c r = R.unsafeIndex a (Z :. r :. c)

The shared memory approach borrows some code from JuicyPixels-repa to construct an Image from a fully computed Array. The Array is given the F type hint (“Foreign”) so that building the Image only requires passing a pointer (the internal data of an Image is a Data.Vector.Storable.Vector).

-- | Memory sharing approach.
shared :: Raster p r c PixelRGBA8 -> Image PixelRGBA8
shared (Raster a) = Image w h $ S.unsafeFromForeignPtr0 (R.toForeignPtr arr) (h*w*z)
  where (Z :. h :. w :. z) = R.extent arr
        arr = runIdentity . R.computeP $ toRGBA a

This uses computeP as well, on the assumption that input Rasters will always be large enough to justify the parallelism. I’ve benchmarked elsewhere that computeP pays off even for 256x256 rasters.
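For comparison, the sequential baseline is presumably just the same function with computeS swapped in; a minimal sketch of that variant (my reconstruction, not code from this repo):

-- | Sequential variant of 'shared' (a sketch): identical except for computeS,
-- so there is no parallel scheduling overhead to pay for.
sharedS :: Raster p r c PixelRGBA8 -> Image PixelRGBA8
sharedS (Raster a) = Image w h $ S.unsafeFromForeignPtr0 (R.toForeignPtr arr) (h*w*z)
  where (Z :. h :. w :. z) = R.extent arr
        arr = R.computeS $ toRGBA a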

All times are in milliseconds. The first two extra ops are simple local additions, and the third is a focal addition, which seems to carry much more overhead.
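The ops themselves aren’t shown in these notes. As a rough illustration of why a focal op costs so much more than a local one, here are hypothetical sketches (assuming Raster is a newtype over a delayed Repa array, as the code above suggests; these are not the library’s actual definitions):

-- | Hypothetical local addition (sketch): each output cell reads exactly one
-- cell from each input.
localAdd :: Num a => Raster p r c a -> Raster p r c a -> Raster p r c a
localAdd (Raster a) (Raster b) = Raster $ R.zipWith (+) a b

-- | Hypothetical focal addition (sketch): each output cell sums its 3x3
-- neighbourhood, clamped at the edges (nine reads plus bounds logic per cell),
-- hence the extra overhead.
focalAdd :: Num a => Raster p r c a -> Raster p r c a
focalAdd (Raster a) = Raster $ R.traverse a id f
  where (Z :. h :. w) = R.extent a
        f look (Z :. r :. c) =
          sum [ look (Z :. r' :. c')
              | r' <- [max 0 (r-1) .. min (h-1) (r+1)]
              , c' <- [max 0 (c-1) .. min (w-1) (c+1)] ]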

Manual Indexing

| Cores | Classify |  1 op | 2 ops | 3 ops |
|-------+----------+-------+-------+-------|
|     1 |    8.450 | 9.790 | 13.79 | 269.4 |
|     2 |     9.16 | 10.47 | 14.85 | 291.7 |
|     4 |     9.58 | 11.08 | 15.21 | 309.8 |
|     8 |    11.17 | 13.12 |  17.7 | 407.7 |

Shared Memory

| Cores | Classify |  1 op | 2 ops | 3 ops |
|-------+----------+-------+-------+-------|
|     1 |    31.57 | 36.53 | 48.13 |  1064 |
|     2 |    17.12 | 19.69 | 25.29 | 539.3 |
|     4 |    14.08 | 10.41 | 22.16 | 350.5 |
|     8 |    11.12 | 12.59 | 15.56 | 219.1 |

Since the shared memory approach uses computeP, its performance improves as more cores are added. This is the environment we’d be using the library in anyway (say, 16 or 32 cores), so the shared memory approach should be used here.
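With an Image PixelRGBA8 in hand, the actual PNG bytes would then come from JuicyPixels’ generic encoder. A sketch of that final step (the name png is hypothetical, not necessarily what the library exports):

import           Codec.Picture.Png    (encodePng)
import qualified Data.ByteString.Lazy as BL

-- | End-to-end sketch (hypothetical): shared-memory conversion, then
-- JuicyPixels' generic PNG encoder.
png :: Raster p r c PixelRGBA8 -> BL.ByteString
png = encodePng . shared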

PNG encoding w/ LLVM

All benchmarks were run with +RTS -N4.

| Trial                   | generateImage (μs) | 256x256 (ms) | 1024x1024 (ms) |
|-------------------------+--------------------+--------------+----------------|
| LLVM - traverse         |                 42 |         8.25 |          143.2 |
| LLVM - unsafeTraverse   |                 44 |          9.8 |            168 |
| Native - traverse       |                110 |        10.77 |          175.6 |
| Native - unsafeTraverse |                112 |        12.81 |          202.7 |

Takeaways:

  • LLVM is good.
  • traverse is mysteriously faster, at least for my ForeignPtr approach to image conversion (a rough sketch of that conversion follows below). Is there a way to do this that would use U instead? Repa claims that U is best for numerical operations.
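toRGBA itself isn’t shown in these notes, but it is where the traverse/unsafeTraverse swap above happens. A guess at what a traverse-based version might look like (a sketch only; the real definition may differ, and R.unsafeTraverse would be a drop-in replacement for R.traverse here):

-- | Sketch of a traverse-based toRGBA (a guess, not the repo's definition):
-- expand each pixel into its four Word8 components along a new innermost axis.
toRGBA :: R.Source r PixelRGBA8 => R.Array r R.DIM2 PixelRGBA8 -> R.Array R.D R.DIM3 Word8
toRGBA a = R.traverse a expand component
  where expand (Z :. h :. w) = Z :. h :. w :. (4 :: Int)
        component look (Z :. r :. c :. n) = case (look (Z :. r :. c), n) of
          (PixelRGBA8 v _ _ _, 0) -> v
          (PixelRGBA8 _ v _ _, 1) -> v
          (PixelRGBA8 _ _ v _, 2) -> v
          (PixelRGBA8 _ _ _ v, _) -> v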