
use compute instead of geometry shaders #5

Open · jeske opened this issue May 29, 2019 · 18 comments
Labels: discussion (Further discussion is needed)

Comments

jeske (Collaborator) commented May 29, 2019

...or at least support compute shaders as an alternative...

...because (a) it'll work on Metal / Mac, and (b) the performance of geometry shaders sucks

https://forum.unity.com/threads/ios-11-metal2-has-no-geometry-shader.499676/#post-3315734

https://stackoverflow.com/questions/50557224/metal-emulate-geometry-shaders-using-compute-shaders

Looooong (Owner) commented May 29, 2019

My voxelization implementation is based on this source, which provides good insight into the subject matter.

The thing about voxelization is that it is just as simple as rendering an object to the screen, but instead of writing a color to the screen buffer, we take the depth value and write it to a volume buffer. The problem arises when the depth gradient of a particular triangle is high (ddx(depth) > 1.0 || ddy(depth) > 1.0): "cracks" form in the resulting voxel volume.

[image: cracking]

To solve this issue, we just need to project each triangle onto the plane where its projection area is largest. That means we would need 3 rendering passes to project the scene along the 3 different axes.

[image: projection]

The nice thing about the geometry shader is that we can combine all 3 rendering passes into one. Because the voxel volume is a cube and the projection axes X, Y, Z are orthogonal to each other, we only need to swizzle the X, Y, Z components of the vertex position in voxel space. This can easily be done in the geometry shader by calculating the triangle normal and selecting the corresponding projection axis. You can see it here:

// Swap coordinate axes for largest projection area
float3 SwizzleAxis(float3 position, uint axis) {
  // Method 1:
  // switch (axis) {
  // case AXIS_X:
  //   position = position.yzx;
  //   break;
  // case AXIS_Y:
  //   position = position.zxy;
  //   break;
  // }

  // Method 2: Is it faster?
  uint a = axis + 1;
  float3 p = position;
  position.x = p[(0 + a) % 3];
  position.y = p[(1 + a) % 3];
  position.z = p[(2 + a) % 3];

  return position;
}

[maxvertexcount(3)]
void geom (triangle v2g i[3], inout TriangleStream<g2f> triStream)
{
  float3 normal = normalize(abs(cross(i[1].vertex - i[0].vertex, i[2].vertex - i[0].vertex)));
  uint axis = AXIS_Z;

  // Choose the axis with the largest projection area
  if (normal.x > normal.y && normal.x > normal.z) {
    axis = AXIS_X;
  } else if (normal.y > normal.x && normal.y > normal.z) {
    axis = AXIS_Y;
  }

  [unroll]
  for (int j = 0; j < 3; j++) {
    g2f o;
    o.position = mul(VoxelToProjection, float4(SwizzleAxis(i[j].vertex, axis), 1.0));

#if defined(UNITY_REVERSED_Z)
    o.position.z = 1.0 - o.position.z;
#endif

    o.normal = i[j].normal;
    o.axis = axis;
    o.uv = i[j].uv;

    triStream.Append(o);
  }
}

About the problem on the Metal API: without geometry shaders, we can just use 3 rendering passes to voxelize the scene, which might triple the processing time. I ran the GPU profiler, and the processing time of the voxelization stage with the geometry shader is trivial compared with the other processing stages. So I don't think tripling it would matter much.

If I were to implement this with a compute shader, I suspect it would add more complexity to the project and reinvent the wheel. And I'm a simple man, I hate complexity (╯°□°)╯︵ ┻━┻

Looooong (Owner) commented May 29, 2019

Btw, about geometry shader performance: your source was written back in 2015, which is quite old. The developers of the game Factorio (which I'm a fan of) tested geometry shaders on a variety of PCs last year and found that newer GPUs execute geometry shaders better than the older generations.

Apart from the processing performance, we have to consider the amount of work that goes into organizing the data before passing it to the voxelizer. The final result is not only to detect which voxels the scene occupies, but also to gather the material properties for each voxel in order to perform voxel cone tracing and indirect lighting in the later stages.

For now, I will stick to the current implementation because it is more convenient for handling inputs, outputs and vertex transformation.

jeske (Collaborator, Author) commented May 29, 2019

It is good to know they improved the performance of geometry shaders.

AFAIK, the compute-shader method does not require three passes, and it does not change the code much. The geometry shader calculations simply move into a compute shader. It takes the same input, it produces the same output. Instead of one draw call, you get one compute call and one draw call.

This is described in the StackOverflow link I posted, here:

https://stackoverflow.com/questions/50557224/metal-emulate-geometry-shaders-using-compute-shaders

> Metal does not have geometry shaders so I [emulated] them using a compute shader. I pass in my vertex buffer into the compute shader, do what a geometry shader would normally do, and write the result to an output buffer. I also add a draw command to an indirect buffer. I use the output buffer as the vertex buffer for my vertex shader. This works fine, but I need twice as much memory for my vertices, one for the vertex buffer and one for the output buffer.

This developer complains that the compute-shader version takes double the memory, because it has an input buffer and output buffer. Perhaps there is a way around this by using the same buffer for input and output from the compute shader.
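For reference, a minimal C#-side sketch of this emulation approach might look as follows. The kernel name "VoxelizeTriangles", the VoxelVertex layout and the buffer/property names are hypothetical and not taken from this project; the assumed compute kernel would perform the same dominant-axis selection and swizzle as the geometry shader above, one thread per triangle.

using UnityEngine;
using UnityEngine.Rendering;

// Sketch only: emulate the geometry-shader stage with a compute pass followed by
// a procedural draw. The compute shader asset, its kernel "VoxelizeTriangles",
// the VoxelVertex layout and the buffer names are hypothetical.
public class ComputeVoxelizerSketch {
  struct VoxelVertex {
    public Vector3 position;
    public Vector3 normal;
    public Vector2 uv;
    public uint axis;
  }

  ComputeShader _voxelizeShader; // assumed to contain the kernel "VoxelizeTriangles"
  Material _voxelMaterial;       // vertex shader indexes _OutputVertices by SV_VertexID
  ComputeBuffer _inputVertices;  // de-indexed triangle vertices, 3 per triangle
  ComputeBuffer _outputVertices; // swizzled vertices written by the compute pass

  public void Voxelize(CommandBuffer command, int triangleCount) {
    int kernel = _voxelizeShader.FindKernel("VoxelizeTriangles");
    command.SetComputeIntParam(_voxelizeShader, "_TriangleCount", triangleCount);
    command.SetComputeBufferParam(_voxelizeShader, kernel, "_InputVertices", _inputVertices);
    command.SetComputeBufferParam(_voxelizeShader, kernel, "_OutputVertices", _outputVertices);

    // One thread per triangle: pick the dominant axis and swizzle the 3 vertices,
    // which is what the geometry shader does per primitive.
    command.DispatchCompute(_voxelizeShader, kernel, Mathf.CeilToInt(triangleCount / 64f), 1, 1);

    // Draw the rewritten vertices without binding a vertex/index buffer.
    command.SetGlobalBuffer("_OutputVertices", _outputVertices);
    command.DrawProcedural(Matrix4x4.identity, _voxelMaterial, 0, MeshTopology.Triangles, 3 * triangleCount, 1);
  }
}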

Looooong (Owner) commented May 30, 2019

Here are 2 problems:

1. Double-memory because of input and output buffers

Unity stores Mesh data as a list of vertex positions and a list of triangles. The list of triangles contains indices that refer to the list of vertex positions. This data structure helps reduce the memory footprint because the same vertex position can be referenced by multiple triangles.

Now, we need to voxelize the mesh data. We need to separate the triangles and "rotate" them to face the projection plane accordingly. This means that the vertex positions of some of the triangles change. For example, a vertex that is used by 2 triangles can be split into 2 vertices with different positions. Therefore, the output mesh data might differ from the input mesh data if any of the triangles is rotated. This is why we need separate input and output buffers to process the mesh data (see the sketch after this list).

2. Providing the compute shader with vertex data

Once we have the meshes that need to be voxelized, we need to pass their data to the compute shader. We can do something like this, which uses ComputeBuffer.SetData. But this is slow as hell, because the data has to be moved from the CPU side to the GPU side. Another method that could be used is CommandBuffer.SetComputeFloatParams, which has the same issue.

I'm not sure about this, but I think Unity has an internal mechanism that transfers not only mesh data but also UV, texture, normal and tangent data to the internal render pipeline, and it is very fast. This mechanism is used by CommandBuffer.DrawMesh, CommandBuffer.DrawRenderer, ScriptableRenderContext.DrawRenderers and a few more methods.

In conclusion, we need to find this fast mechanism to pass renderer data to the compute shader. Otherwise, issuing the draw call 3 times is probably faster than just moving the data back and forth.
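To make the expansion point above concrete, here is a tiny illustrative C# sketch (not code from this project) that flattens an indexed mesh into a per-triangle vertex list. A vertex shared by several triangles is duplicated, which is why the compute output cannot simply alias the indexed input buffer.

using UnityEngine;

static class MeshExpansionSketch {
  // Illustration only: expand an indexed mesh into a flat, per-triangle vertex list.
  // A vertex shared by two triangles becomes two independent vertices, so the
  // expanded buffer is generally larger than the indexed one.
  public static Vector3[] ExpandTriangles(Mesh mesh) {
    Vector3[] vertices = mesh.vertices; // shared vertex positions
    int[] triangles = mesh.triangles;   // 3 indices per triangle
    var expanded = new Vector3[triangles.Length];

    for (int i = 0; i < triangles.Length; i++) {
      expanded[i] = vertices[triangles[i]];
    }

    return expanded; // typically much larger than `vertices`
  }
}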


That's what I think about the problems. Moreover, in my experience, compute shaders are very good at generating meshes for procedural draw calls, not at modifying existing ones. This example uses both a compute shader and a geometry shader to render grass affected by wind and trampling, which is kinda cool and demonstrates the power and usefulness of the geometry shader.

Looooong added the "discussion" label on May 30, 2019
jeske (Collaborator, Author) commented May 30, 2019

I understand. Thanks for your response!

In the case of #2, I can't find any discussion of ComputeBuffer.SetData being incredibly slow. Perhaps there is some synchronization or other issue.

Probably better to move to the new SRP in 2019 before worrying about Mac anyhow.

It is interesting that Unity Mac OpenGL supports Geometry Shaders but not compute shaders, and Unity Mac Metal supports Compute Shaders but not Geometry Shaders.

Looooong (Owner) commented:

I think it is pretty obvious why people don't discuss it. The thing is that you have to move the data from CPU memory to GPU memory through the data bus. If you have ever studied computer architecture, you know that the CPU usually waits for I/O, because fetching data from RAM is much slower than executing an instruction, which takes only 1-2 cycles. The same applies to the GPU: an architecture designed to execute instructions quickly in parallel may end up waiting for the data transfer from RAM to GPU memory.

Is there a way to solve this issue? Yes, we just need to find a way to access mesh data that is (probably) already available on the GPU. Unity does have Mesh.GetNativeIndexBufferPtr and Mesh.GetNativeVertexBufferPtr, which point to the internal graphics API objects, but those are native pointers.

You can test ComputeBuffer.SetData for yourself. I tested it before, when I was developing the voxelizer. After the scene voxelization, I tried to get the voxel data with ComputeBuffer.GetData, and the frame rate dropped at high voxel resolutions. After that, I found out that I could use CommandBuffer.DrawProcedural to visualize the voxel data, which is already available on the GPU:

public void RenderMipmap(ScriptableRenderContext renderContext, Camera camera, VXGI vxgi) {
  var transform = Matrix4x4.TRS(vxgi.origin, Quaternion.identity, Vector3.one * vxgi.bound);

  _command.BeginSample(_command.name);

  if (vxgi.mipmapSampler == MipmapSampler.Point) {
    _command.EnableShaderKeyword("RADIANCE_POINT_SAMPLER");
  } else {
    _command.DisableShaderKeyword("RADIANCE_POINT_SAMPLER");
  }

  _command.SetGlobalFloat("Level", Mathf.Min(vxgi.level, vxgi.radiances.Length));
  _command.SetGlobalFloat("Step", vxgi.step);
  _command.DrawProcedural(transform, material, (int)Pass.Mipmap, MeshTopology.Quads, 24, 1);
  _command.EndSample(_command.name);

  renderContext.DrawSkybox(camera);
  renderContext.ExecuteCommandBuffer(_command);
  _command.Clear();
}

Pass
{
  Name "Mipmap"

  CGPROGRAM
  #pragma target 4.5
  #pragma vertex vert
  #pragma fragment frag
  #pragma multi_compile __ RADIANCE_POINT_SAMPLER
  #include "UnityCG.cginc"
  #include "Packages/com.looooong.srp.vxgi/Runtime/CGIncludes/Utilities.cginc"
  #include "Packages/com.looooong.srp.vxgi/Runtime/CGIncludes/Radiances/Sampler.cginc"

  struct v2f
  {
    float4 position : SV_POSITION;
    float3 view : POSITION1;
  };

  // 6 inner faces of a cube
  static float3 Vertices[24] = {
    float3(0.0, 0.0, 0.0),
    float3(1.0, 0.0, 0.0),
    float3(1.0, 1.0, 0.0),
    float3(0.0, 1.0, 0.0),
    float3(0.0, 0.0, 0.0),
    float3(0.0, 1.0, 0.0),
    float3(0.0, 1.0, 1.0),
    float3(0.0, 0.0, 1.0),
    float3(0.0, 0.0, 0.0),
    float3(0.0, 0.0, 1.0),
    float3(1.0, 0.0, 1.0),
    float3(1.0, 0.0, 0.0),
    float3(1.0, 1.0, 1.0),
    float3(1.0, 0.0, 1.0),
    float3(0.0, 0.0, 1.0),
    float3(0.0, 1.0, 1.0),
    float3(1.0, 1.0, 1.0),
    float3(1.0, 1.0, 0.0),
    float3(1.0, 0.0, 0.0),
    float3(1.0, 0.0, 1.0),
    float3(1.0, 1.0, 1.0),
    float3(0.0, 1.0, 1.0),
    float3(0.0, 1.0, 0.0),
    float3(1.0, 1.0, 0.0),
  };

  static float DitherPattern[4][4] = {
    0.0000, 0.5000, 0.1250, 0.6250,
    0.7500, 0.2200, 0.8750, 0.3750,
    0.1875, 0.6875, 0.0625, 0.5625,
    0.9375, 0.4375, 0.8125, 0.3125
  };

  float Level;
  float Step;

  v2f vert(uint id : SV_VertexID)
  {
    float3 v = Vertices[id];

    v2f o;
    o.position = UnityObjectToClipPos(v);
    o.view = UnityObjectToViewPos(v);
    return o;
  }

  half4 frag(v2f i) : SV_TARGET
  {
    float3 view = i.view;
    float3 unit = view * Step / view.z;
    view += unit * DitherPattern[i.position.x % 4][i.position.y % 4];
    float3 coordinate = mul(transpose(UNITY_MATRIX_IT_MV), float4(view, 1.0));
    half4 color = half4(0.0, 0.0, 0.0, 0.0);

    while ((view.z <= 2 * Step) && (TextureSDF(coordinate) > -0.000001)) {
      half4 sample = SampleRadiance(coordinate, Level);
      color = sample + color * (1 - sample.a);
      view += unit;
      coordinate = mul(transpose(UNITY_MATRIX_IT_MV), float4(view, 1.0));
    }

    return color;
  }
  ENDCG
}

Another time, I was developing the light injection mechanism using CommandBuffer.SetGlobalFloatArray to inject 64/128 light indices. The result was the same: the frame rate dropped. For now, we only support 16 different lights within the voxel space 😅

P/s: use the CPU/GPU profiler and frame debugger to see the processing cost of these operations. Try uploading arrays with lengths of 1,000/10,000/100,000/1,000,000.
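A rough way to make that measurement from script is sketched below. This only captures CPU-side timing, so the driver may still defer the actual upload; the GPU profiler and frame debugger remain the authoritative tools. The MonoBehaviour and its names are purely illustrative.

using System.Diagnostics;
using UnityEngine;

// Sketch: rough CPU-side timing of ComputeBuffer.SetData for different array sizes.
public class SetDataTimingSketch : MonoBehaviour {
  void Start() {
    foreach (int count in new[] { 1000, 10000, 100000, 1000000 }) {
      var data = new Vector4[count];
      var buffer = new ComputeBuffer(count, sizeof(float) * 4);

      var watch = Stopwatch.StartNew();
      buffer.SetData(data);
      watch.Stop();

      UnityEngine.Debug.Log("SetData(" + count + ") took " + watch.Elapsed.TotalMilliseconds + " ms");
      buffer.Release();
    }
  }
}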

Looooong (Owner) commented:

> It is interesting that Unity Mac OpenGL supports Geometry Shaders but not compute shaders, and Unity Mac Metal supports Compute Shaders but not Geometry Shaders.

This post says that it is because Apple refused to support modern OpenGL versions 🤣

jeske (Collaborator, Author) commented May 31, 2019

Yes, I understand computer architecture. I am a 45-year-old computer engineer and programmer. I understand GPU and CPU hardware much more than I understand Unity.

Of course transferring data to the GPU takes time. However, it takes the same amount of time to send the data to the GPU whether it goes into a ComputeBuffer or a VB/IB buffer. This is normally done ahead of time, when the mesh is created, not every frame.

I think I understand now that Unity is hard-coded to put Mesh data into VB/IB buffers. And even though all graphics APIs have mechanisms for Compute Shaders to see VB/IB buffers, Unity Compute Shaders have no such mechanism.

At first I thought something like Mesh.GetNativeIndexBuffer_AsComputeBuffer() and Mesh.GetNativeVertexBuffer_AsComputeBuffer() would help, but Unity also seems hard-coded to issue direct draw calls during renderContext.DrawRenderers().

Instead, each Renderer would need to be drawn through a compute shader: bind the VB/IB data to a compute shader, call the compute shader, then call CommandBuffer.DrawProcedural on the output. This could happen as a hard-coded new mode of SRC.DrawRenderers() or perhaps by creating a delegate mode for DrawRenderers().

Is it possible to write a custom version of SRC.DrawRenderers()? If so, then I think a way to hand VB/IB data to a compute shader would be sufficient.

I made a post on the SRP feedback thread.

Looks like Mac support may be easier in a Xenko port.


As for ComputeBuffer.GetData being slow, this is not a Unity issue. It is always slow when used in the same frame, because it has to wait for the compute task to finish, create a synchronization barrier to force all data to be flushed into GPU RAM, then schedule the data for DMA and wait for it to reach CPU RAM. As you found, the solution is to use CommandBuffer.DrawProcedural so that GPU drawing reads from the ComputeBuffer output that is already on the GPU.
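One related option worth noting: newer Unity versions also expose AsyncGPUReadback, which avoids that synchronous stall when CPU access really is needed, at the cost of the data arriving a few frames later. A minimal sketch, where the _voxelBuffer field is hypothetical:

using Unity.Collections;
using UnityEngine;
using UnityEngine.Rendering;

// Sketch: asynchronous readback of GPU-resident voxel data, avoiding the
// pipeline stall of ComputeBuffer.GetData. `_voxelBuffer` is illustrative only.
public class VoxelReadbackSketch : MonoBehaviour {
  ComputeBuffer _voxelBuffer;

  void RequestVoxelData() {
    AsyncGPUReadback.Request(_voxelBuffer, request => {
      if (request.hasError) {
        Debug.LogError("Voxel readback failed");
        return;
      }

      NativeArray<float> voxels = request.GetData<float>();
      // Inspect or copy `voxels` here; copy it out if it is needed after the callback.
    });
  }
}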

Looooong (Owner) commented:

So, we will put this issue on hold. Meanwhile, are you interested in reviewing my PRs when I modify the code? In the future, I want to restructure the codebase properly to make it easier for others to understand or collaborate on this project.

jeske (Collaborator, Author) commented May 31, 2019

Yes, I would be happy to!

Do you plan to make improvements next, or update for the 2019 SRP? If the latter, it may make sense to make a release/tag/branch for 2018.3 first.

Looooong (Owner) commented:

Currently, here is my plan:

  • Make use of constant buffers to reduce the cost of draw call setup.
  • Implement shadow mapping for high-priority lights for better quality.
  • Implement a high-quality reflection algorithm, like screen space reflection, that can be blended with the voxel reflection based on material smoothness.
  • Port to Unity 2019.
  • Extend LWRP/HDRP with my code. With this, we can integrate the existing Post-processing stack. Also, ProBuilder is broken on my render pipeline, but it works with the other two :(
  • Implement a signed distance field for the voxel volume. Currently, lighting visibility is calculated by marching a ray through the voxel volume, but the result doesn't look so good. With an SDF, the visibility function can be more accurate. I don't know if I can implement this anytime soon.

Looooong (Owner) commented May 31, 2019

I wish to improve the quality as much as I can before putting a version tag on it.

jeske (Collaborator, Author) commented May 31, 2019

The soft shadows they get in NVidia VXGI 2.0 / VXAL are pretty impressive.

I think frosted voxel refraction can also be quite interesting in VXGI, as in this example from Armory3d:

[image: frosted voxel refraction in Armory3d]

Looooong (Owner) commented:

Ah yes, refraction, I almost forgot about it. I implemented refraction before; it only works if light passes through a single layer of glass to the camera. Because the glass acts as a "lens" to "see" the voxel world, it doesn't work with multiple layers of glass. I want to implement subsurface scattering as well. Let's add them to the list.

About the soft shadows, I think it is pretty easy; we just need to change the visibility function from ray tracing to cone tracing.

jeske (Collaborator, Author) commented May 31, 2019

Can you go into the GitHub project settings and enable the Wiki? It is an easy place to keep some simple installation instructions and notes.

Also, GitHub has a nice "todo list" feature, where you use - [ ] or - [x] markdown for bullets and it renders them as checkboxes, so you could take your plan above and either put it in a new issue with checkboxes or put it in a wiki page with checkboxes:

  - [ ] todo 1
  - [x] done

Looooong (Owner) commented May 31, 2019

Yes, I already have the Wiki enabled. About the plan, I will set it up inside the GitHub Projects page. I will add you to the list of collaborators so you can see it.

jeske (Collaborator, Author) commented May 31, 2019

I've been reading the code, trying to understand it... does it do toroidal addressing for the voxel buffer, to reuse parts of the voxel buffer from frame to frame, as described here?
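For context, toroidal addressing wraps the voxel coordinate modulo the volume resolution, so when the volume scrolls with the camera, existing voxels keep their physical location and only the newly exposed region needs to be revoxelized. A minimal C# sketch of the addressing itself, with names that are illustrative only and not taken from this project:

using UnityEngine;

// Sketch of toroidal (wrap-around) addressing for a scrolling voxel volume, as used
// by clipmaps: a world-space voxel coordinate maps to a fixed physical texel.
static class ToroidalAddressingSketch {
  // World-space voxel coordinate -> physical texel inside a resolution^3 volume.
  public static Vector3Int ToPhysical(Vector3Int worldVoxel, int resolution) {
    return new Vector3Int(
      Mod(worldVoxel.x, resolution),
      Mod(worldVoxel.y, resolution),
      Mod(worldVoxel.z, resolution));
  }

  // Positive modulo, so negative world coordinates also wrap correctly.
  static int Mod(int value, int modulus) {
    return (value % modulus + modulus) % modulus;
  }
}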


Looooong (Owner) commented:

I have just reorganized the file structure with minimal code modifications. Hope it doesn't affect you much.

I didn't implement toroidal addressing. I had never heard of clipmaps until now. This is a very interesting resource you have here. I will spend the weekend researching it.

One question: is it applicable to anisotropic voxel cascades (as implemented in "The Tomorrow Children")? Because I'm planning to implement that.
