Here is a very brief introduction to the FSR algorithm; you can find more details in this SIGGRAPH talk.
FSR is a two-pass algorithm. The first pass, EASU (Edge Adaptive Spatial Upsampling), handles the upscaling. EASU uses a 12-tap pattern as follows:
  b c
e f g h
i j k l
  n o
These 12 taps are convolved with a stretched Lanczos(2) kernel. The stretch factor is determined from the gradients of nearby pixels. In a nutshell, the final shaped filter is an elliptical kernel whose major axis is perpendicular to the gradient of the current pixel.
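EASU's production code approximates this kernel with polynomials, but the underlying shape is easy to sketch. Below is a minimal Python sketch (an illustration, not AMD's actual code) of the plain Lanczos(2) kernel, with a stretch factor applied to the sample distance, which is how an elliptical footprint arises when the scaling is done along only one axis:

```python
import math

def lanczos2(x):
    # Lanczos(2) kernel: sinc(x) * sinc(x / 2) for |x| < 2, zero outside.
    if abs(x) >= 2.0:
        return 0.0
    if x == 0.0:
        return 1.0
    px = math.pi * x
    return 2.0 * math.sin(px) * math.sin(px / 2.0) / (px * px)

def stretched_lanczos2(d, stretch):
    # Scaling the distance before evaluation stretches the kernel footprint;
    # applying different scales along the two axes yields an elliptical filter.
    return lanczos2(d * stretch)
```

Note that the kernel goes negative for 1 < |x| < 2; these negative lobes are what cause the ringing discussed later.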
The second pass, RCAS (Robust Contrast Adaptive Sharpening), enhances the image details. RCAS uses a 5-tap sharpening operator in a cross pattern:
  w
w 1 w
  w
The negative weight ‘w’ is carefully chosen per pixel to avoid clipping (exceeding the [0, 1] range).
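To make the clipping constraint concrete, here is a simplified scalar sketch in Python (an illustration of the idea, not AMD's actual RCAS code): the output is (w·(n+s+e+w) + center) / (1 + 4w), and the negative weight is clamped just enough that the result stays inside [0, 1]. The `sharpness` parameter and the scalar formulation are assumptions of this sketch.

```python
def rcas_pixel(up, down, left, right, center, sharpness=0.2):
    # Cross-pattern sharpen with weights [w, w, 1, w, w], normalized by (1 + 4w).
    # Both clipping bounds are linear in w, so the clamp is a pair of max() ops.
    s4 = up + down + left + right
    w = -sharpness
    if s4 > 0.0:
        w = max(w, -center / s4)                  # keeps the output >= 0
    if s4 < 4.0:
        w = max(w, (1.0 - center) / (s4 - 4.0))   # keeps the output <= 1
    return (w * s4 + center) / (1.0 + 4.0 * w)
```

On a flat region the filter leaves the value unchanged; near a strong edge the clamp kicks in and prevents over/undershoot.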
We integrated AMD FSR into our engine, and it worked great on PCs: good performance and nice image quality, AMD YES!!! Then we tested it on an iPhone 12 ^{1} with great expectations, but the result frustrated us a lot. The EASU pass takes about 5.4 ms, and RCAS takes about 0.9 ms ^{2}. A total cost of 6.3 ms is absolutely unacceptable.
OK, can we optimize it? Since the cost of RCAS is reasonable to some extent, our attention focused on the EASU pass. After several attempts, we managed to reduce the EASU cost from 5.4 ms to 1.8 ms, a big win. The following describes how we did it.
At first we used the fp32 version of EASU for simplicity, but on mobile devices fp16 operations basically have twice the throughput of fp32 operations. Since our shader system doesn’t support uint16/int16, we had to tweak the original FsrEasuH code a little bit to make it compile. A simple modification but highly profitable: the fp16 version takes only 4.7 ms.
Since EASU uses a Lanczos kernel, which has negative lobes, ringing artifacts may appear along strong edges. To eliminate ringing, EASU clips the filtered result against the min-max range of the nearest 2x2 quad. This deringing process spawns a lot of arithmetic instructions. I tried removing the clipping code completely, and the cost of EASU dropped to 4.3 ms; another 0.4 ms saved! But what about the ringing? It surprised me that the artifacts are barely noticeable except in some rare cases, so I’m comfortable with that.
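For reference, the deringing step removed here amounts to a clamp like the following tiny Python sketch (the idea only, not the actual FSR code):

```python
def dering(filtered, quad):
    # Clamp the Lanczos-filtered value to the min/max of the nearest 2x2 quad;
    # this suppresses ringing overshoot at the price of extra ALU work.
    lo, hi = min(quad), max(quad)
    return min(max(filtered, lo), hi)
```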
The EASU pass has an analysis phase that determines the stretch and rotation of the Lanczos kernel. In this phase, the analysis routine FsrEasuSetH is called 4 times and the outputs are bilinearly interpolated. Instead of computing 4 times and interpolating the outputs, can we interpolate the inputs and call FsrEasuSetH once? In theory this won’t give the correct result for a non-linear function, but let’s try it. I modified the analysis process like this:
// Compute bilinear weight.
// x y
// z w
AH4 ww = AH4_(0.0);
ww.x =(AH1_(1.0)-ppp.x)*(AH1_(1.0)-ppp.y);
ww.y = ppp.x *(AH1_(1.0)-ppp.y);
ww.z =(AH1_(1.0)-ppp.x)* ppp.y ;
ww.w = ppp.x * ppp.y ;
// Direction is the '+' diff.
// A
// B C D
// E
AH1 lA = dot(ww, AH4(bL, cL, fL, gL));
AH1 lB = dot(ww, AH4(eL, fL, iL, jL));
AH1 lC = dot(ww, AH4(fL, gL, jL, kL));
AH1 lD = dot(ww, AH4(gL, hL, kL, lL));
AH1 lE = dot(ww, AH4(jL, kL, nL, oL));
// Here FsrEasuSetH is inlined as follows:
AH1 dc=lD-lC;
AH1 cb=lC-lB;
AH1 lenX=max(abs(dc),abs(cb));
lenX=ARcpH1(lenX);
AH1 dirX=lD-lB;
lenX=ASatH1(abs(dirX)*lenX);
lenX*=lenX;
// Repeat for the y axis.
AH1 ec=lE-lC;
AH1 ca=lC-lA;
AH1 lenY=max(abs(ec),abs(ca));
lenY=ARcpH1(lenY);
AH1 dirY=lE-lA;
lenY=ASatH1(abs(dirY)*lenY);
AH1 len = lenY * lenY + lenX;
AH2 dir = AH2(dirX, dirY);
It’s quite a crude simplification, but with it the EASU cost comes down to 3.8 ms. Does this degrade the upscaling quality? Not much! The edge features are still preserved very well; I can hardly tell the difference between before and after.
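Why does such a crude simplification hold up? The direction terms (dirX = lD - lB, dirY = lE - lA) are linear in the luma inputs, so for them interpolating the inputs first is mathematically exact; only the length normalization is nonlinear. A small Python check with made-up per-corner luma values:

```python
def bilerp(w, v):
    # dot(ww, v) from the shader: blend four per-corner values.
    return sum(wi * vi for wi, vi in zip(w, v))

px, py = 0.3, 0.8
ww = [(1 - px) * (1 - py), px * (1 - py), (1 - px) * py, px * py]
lB = [0.10, 0.40, 0.20, 0.90]  # per-corner values of lB (made up)
lD = [0.50, 0.30, 0.80, 0.60]  # per-corner values of lD (made up)

# Interpolate the outputs then subtract, vs. subtract then interpolate:
a = bilerp(ww, lD) - bilerp(ww, lB)
b = bilerp(ww, [d - bb for d, bb in zip(lD, lB)])
assert abs(a - b) < 1e-12
```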
For non-edge pixels, the original EASU still accumulates all 12 taps. An intuitive optimization for this case is to use a simple bilinear interpolation instead:
AH2 dir2=dir*dir;
AH1 dirR=dir2.x+dir2.y;
if (dirR<AH1_(1.0/64.0))
{
    // Early quit for non-edge pixels.
    pix.r = dot(ww, AH4(fR, gR, jR, kR));
    pix.g = dot(ww, AH4(fG, gG, jG, kG));
    pix.b = dot(ww, AH4(fB, gB, jB, kB));
    return;
}
Since most threadgroups take this early-quit branch, the cost should drop a lot. But after profiling we only see a 0.2 ms improvement. Although the ALU pressure dropped sharply, the GPU counters show that the shader is now completely texture bound.
So we have to optimize the TEX ops. Let’s focus on the early-quit code path: there, we only need 5 bilinearly interpolated lumas to decide that the pixel falls into the non-edge case. How about sampling these 5 taps directly with a bilinear sampler, and deferring the other taps until we really need them? Not only does this save texture fetches on the early-quit path, it also saves some of the ALU work that did the bilinear interpolation manually.
AF2 pp=(ip)*(con0.xy)+(con0.zw);
AF2 tc=(pp+AF2_(0.5))*con1.xy;
AH3 sA=FsrEasuSampleH(tc-AF2(0, con1.y));
AH3 sB=FsrEasuSampleH(tc-AF2(con1.x, 0));
AH3 sC=FsrEasuSampleH(tc);
AH3 sD=FsrEasuSampleH(tc+AF2(con1.x, 0));
AH3 sE=FsrEasuSampleH(tc+AF2(0, con1.y));
AH1 lA=sA.r*AH1_(0.5)+sA.g;
AH1 lB=sB.r*AH1_(0.5)+sB.g;
AH1 lC=sC.r*AH1_(0.5)+sC.g;
AH1 lD=sD.r*AH1_(0.5)+sD.g;
AH1 lE=sE.r*AH1_(0.5)+sE.g;
This change improves performance dramatically: the EASU cost goes from 3.6 ms to 1.8 ms. Now the total cost of FSR is 1.8 ms (EASU) + 0.9 ms (RCAS) = 2.7 ms. Don’t forget that without FSR we would still need a pass to copy the offscreen target to the back buffer, which takes about 0.7 ms. So the net cost of our optimized FSR is 2.7 ms - 0.7 ms = 2.0 ms. Not extremely efficient, but we are satisfied.
OK, that’s it. The full source code is listed below (Gist link); I hope you find it useful. We also made a sample based on the official FSR demo, so you can play with it and compare the quality of our optimized version against the original. If you have any questions or other ideas to optimize it further, please don’t hesitate to leave a comment.
Divergence-free noise is an ingenious technique that is extremely well suited to driving particles so that they move like real fluid. The most widely used divergence-free noise generator is Curl Noise ^{1}. I first learned of it from smash’s great article ^{2} about the making of the blunderbuss demo, in which the fluid effect impressed me deeply. Later I tried to reproduce that effect (github project), though the result is not as good as smash’s masterpiece.
If you don’t know curl noise, its core idea is based on the identity $\nabla \cdot \nabla \times \equiv 0$, i.e. the divergence of a curl is zero. So to get a divergence-free vector field, you first generate 3 random scalar fields, then calculate the gradient of each, and finally compute the curl from those gradients; the result is divergence-free by construction.
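The identity is easy to verify numerically. Here is a small Python check, with an arbitrary smooth vector field standing in for the three random noise fields, showing that the divergence of its curl vanishes:

```python
import math

def A(p):
    # An arbitrary smooth vector field (three scalar components) standing in
    # for the three random noise fields.
    x, y, z = p
    return (math.sin(y * z), math.sin(z * x), math.sin(x * y))

def curl_A(p, h=1e-4):
    # Numerical curl of A via central differences.
    def d(i, j):  # dA_i / dx_j
        q1, q2 = list(p), list(p)
        q1[j] += h
        q2[j] -= h
        return (A(q1)[i] - A(q2)[i]) / (2 * h)
    return (d(2, 1) - d(1, 2), d(0, 2) - d(2, 0), d(1, 0) - d(0, 1))

def divergence(F, p, h=1e-3):
    # Numerical divergence via central differences.
    total = 0.0
    for j in range(3):
        q1, q2 = list(p), list(p)
        q1[j] += h
        q2[j] -= h
        total += (F(q1)[j] - F(q2)[j]) / (2 * h)
    return total

# The divergence of a curl is (numerically) zero.
assert abs(divergence(curl_A, (0.7, -0.3, 1.2))) < 1e-4
```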
While mathematically simple and elegant, curl noise is not very cheap to generate. In practice we usually need 4d noise (3d position + time), which makes things worse, because the higher the dimension, the more expensive to calculate. One day when I was looking over some vector calculus materials and expecting some clues to optimize my noise generator, I occasionally found some identities that are particularly interesting:
$$ \nabla \cdot (A \times B) = B \cdot (\nabla \times A) - A \cdot (\nabla \times B) $$
$$ \nabla \times (\nabla \phi) = 0 $$
From above two equations we have:
$$ \nabla \cdot (\nabla \phi \times \nabla \psi) = 0 $$
That means the divergence of the cross product of two gradient fields is always zero. So we have another way to generate divergence-free noise: generate two random scalar fields $\phi$ and $\psi$, compute their gradients, and take the cross product.
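This identity can also be checked numerically. A small Python sketch, with two arbitrary smooth scalar fields chosen here purely for illustration:

```python
import math

def grad(f, p, h=1e-4):
    # Numerical gradient via central differences.
    g = []
    for j in range(3):
        q1, q2 = list(p), list(p)
        q1[j] += h
        q2[j] -= h
        g.append((f(q1) - f(q2)) / (2 * h))
    return g

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

phi = lambda p: math.sin(p[0] * p[1]) + p[2] * p[2]   # made-up smooth field
psi = lambda p: math.cos(p[1] * p[2]) + p[0]          # made-up smooth field

def F(p):
    # The divergence-free field: cross product of the two gradients.
    return cross(grad(phi, p), grad(psi, p))

def divergence(p, h=1e-3):
    total = 0.0
    for j in range(3):
        q1, q2 = list(p), list(p)
        q1[j] += h
        q2[j] -= h
        total += (F(q1)[j] - F(q2)[j]) / (2 * h)
    return total

# The divergence of grad(phi) x grad(psi) is (numerically) zero.
assert abs(divergence((0.4, 1.1, -0.6))) < 1e-4
```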
Compared to curl noise, which needs 3 gradients, this method is computationally cheaper because it only needs 2. At first I thought this method was new and named it Bitangent Noise, since the result is tangent to both the level surface of field $\phi$ and that of field $\psi$. But later I found it had already been invented by Ivan DeWolf in 2005 ^{3}. (I wonder why it is so much less popular than curl noise.) Since Ivan DeWolf didn’t give it a name, I’ll stick to this punny name “Bitangent Noise”.
Okay, now we have bitangent noise to replace curl noise; the remaining task is to implement it efficiently. Since we really don’t want to introduce any external dependencies (e.g. LUT textures), Ian McEwan & stegu’s pure ALU-based simplex noise implementation ^{4} seems to be a good starting point. Using simplex noise with analytic derivatives, we can implement bitangent noise straightforwardly as follows:
float3 BitangentNoise3D(float3 p)
{
    float3 dA = SimplexNoise3DGrad(p);
    float3 dB = SimplexNoise3DGrad(p + float3(31.416, -47.853, 12.679));
    return cross(dA, dB);
}

float3 BitangentNoise4D(float4 p)
{
    float3 dA = SimplexNoise4DGrad(p).xyz;
    float3 dB = SimplexNoise4DGrad(p + float4(31.416, -47.853, 12.679, 113.408)).xyz;
    return cross(dA, dB);
}
From the above code we can clearly see that the cost of bitangent noise is basically twice that of simplex noise. For the 4D case there are some tricks to optimize it further, but they all share the drawback of losing some randomness. Can we do better without sacrificing quality?
Notice that to get 2 gradients, we simply called SimplexNoiseNDGrad twice with different inputs, which prevents sharing any of the intermediate results that depend heavily on the inputs. What if we use one SimplexNoiseNDGrad routine and compute 2 gradients simultaneously? Indeed we can, by providing two different sets of gradients attached to the simplex corners. That leaves one last problem: how to pick two random gradients for each simplex corner? Fortunately we have PCGs (Permuted Congruential Generators) ^{5} ^{6}:
// Permuted congruential generator (only top 16 bits are well shuffled).
uint2 pcg3d16(uint3 p)
{
    uint3 v = p * 1664525u + 1013904223u;
    v.x += v.y * v.z; v.y += v.z * v.x; v.z += v.x * v.y;
    v.x += v.y * v.z; v.y += v.z * v.x;
    return v.xy;
}

uint2 pcg4d16(uint4 p)
{
    uint4 v = p * 1664525u + 1013904223u;
    v.x += v.y * v.w; v.y += v.z * v.x; v.z += v.x * v.y; v.w += v.y * v.z;
    v.x += v.y * v.w; v.y += v.z * v.x;
    return v.xy;
}
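For readers who want to play with the hash outside a shader, here is a direct Python port of pcg3d16 (assumed faithful to the HLSL above; the uint arithmetic wraps modulo 2^32):

```python
M32 = 0xFFFFFFFF  # uint arithmetic wraps modulo 2^32

def pcg3d16(p):
    # Python port of the HLSL pcg3d16; each assignment uses the already-updated
    # components, exactly as the sequential HLSL statements do. Only the top
    # 16 bits of each returned value are well shuffled.
    v = [(x * 1664525 + 1013904223) & M32 for x in p]
    v[0] = (v[0] + v[1] * v[2]) & M32
    v[1] = (v[1] + v[2] * v[0]) & M32
    v[2] = (v[2] + v[0] * v[1]) & M32
    v[0] = (v[0] + v[1] * v[2]) & M32
    v[1] = (v[1] + v[2] * v[0]) & M32
    return (v[0], v[1])
```

In the noise generator, the two well-shuffled 16-bit halves are what select the two independent gradients per simplex corner.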
Using a PCG we can generate 2 independent random values from the skewed coordinate of each simplex corner, then easily pick two random gradients according to these 2 hash values. By feeding those gradients into the standard simplex noise procedure, we can generate 2 noise derivatives at once with very little extra computation, which makes our bitangent noise generator only a little more expensive than simplex noise (about 30%, see the performance data below).
The following performance data was measured on an Nvidia GTX 1060 card, where each noise function is executed 1280 * 720 * 10 times.
| Noise Function | Cost | Desc |
| --- | --- | --- |
| snoise3d | 1530 μs | stegu’s 3d simplex noise |
| SimplexNoise3D | 1153 μs | optimized 3d simplex noise |
| snoise4d | 2578 μs | stegu’s 4d simplex noise |
| SimplexNoise4D | 1798 μs | optimized 4d simplex noise |
| BitangentNoise3D_ref | 2991 μs | 3d bitangent noise, reference version |
| BitangentNoise3D | 1534 μs | optimized 3d bitangent noise |
| BitangentNoise4D_ref | 4365 μs | 4d bitangent noise, reference version |
| BitangentNoise4DFast_ref | 3152 μs | 4d bitangent noise, low quality |
| BitangentNoise4D | 2413 μs | optimized 4d bitangent noise |
Bridson, R., Hourihan, J., Nordenstam, M. 2007. “Curl-Noise for Procedural Fluid Flow”. ACM Trans. Graph. 26, 3, Article 46 (July 2007). ↩︎
Matt “Smash” Swoboda, “A thoroughly modern particle system”. https://directtovideo.wordpress.com/2009/10/06/a-thoroughly-modern-particle-system/ ↩︎
Ivan DeWolf, “Divergence-Free Noise”. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.7627&rep=rep1&type=pdf ↩︎
McEwan, Ian, David Sheets, Mark Richardson, and Stefan Gustavson. “Efficient computational noise in GLSL.” Journal of Graphics Tools 16, no. 2 (2012): 85-94. ↩︎
Mark Jarzynski and Marc Olano, “Hash Functions for GPU Rendering”, Journal of Computer Graphics Techniques (JCGT), vol. 9, no. 3, 21-38, 2020 ↩︎
Epic Games, UnrealEngine/Random.ush. https://github.com/EpicGames/UnrealEngine ↩︎
In screen-space decal rendering, a normal buffer is required to reject pixels projected onto near-perpendicular surfaces. But back then I was working on a forward pipeline, so no normal buffer was output. The best choice seemed to be reconstructing it directly from the depth buffer, as long as we could avoid introducing errors, which is not easy though. Fortunately, accurate normal reconstruction is impossible in theory but possible in practice: we eventually found a way, inspired by Humus’s SDAA idea ^{2}, that is more accurate but also more expensive than Turánszki’s method ^{1}. It’s worth the cost, however, because decals are highly sensitive to reconstruction errors. The following shows decals rendered in purple with different normal reconstruction strategies.
To understand how those artifacts occur and disappear, I drew a picture illustrating the two typical types of discontinuity in a depth buffer, in which the eye and arrow denote the camera position and direction respectively, and the blue dots denote the depth samples.
In figure (1), Turánszki’s method works very well; 3 taps are enough to eliminate errors: since $|d-c|$ is less than $|b-c|$, we can say that point $c$ is more likely on segment $de$ than on segment $ab$. But this is not the case in figure (2): although $|b-c| < |d-c|$, point $c$ is apparently on segment $de$ instead of $ab$. This observation perfectly explains why the improved approach can only remove part of the artifacts in decal rendering.
So we can conclude that 3 taps (in each direction) are inadequate. Humus’s SDAA ^{2} uses 5 taps along with a second depth layer to calculate edge locations; here we can use the same 5-tap pattern to determine whether $c$ lies on $ab$ or on $de$. (Unlike SDAA, we don’t have to calculate the exact edge location, so the second-depth buffer is not needed.) The following describes the method step by step.
Note this method locates $c$ correctly in both figure (1) and figure (2). Now we can apply the algorithm twice to get the horizontal and vertical derivatives; a cross product then gives the normal accurately. Here is the pseudo shader code.
// Try reconstructing normal accurately from depth buffer.
// input DepthBuffer: stores linearized depth in range (0, 1).
// 5 taps in each direction: | z | x | * | y | w |, '*' denotes the center sample.
float3 ReconstructNormal(texture2D DepthBuffer, float2 spos : SV_Position)
{
    float2 stc = spos / ScreenSize;
    float depth = DepthBuffer.Sample(DepthBuffer_Sampler, stc).x;
    float4 H;
    H.x = DepthBuffer.Sample(DepthBuffer_Sampler, stc - float2(1 / ScreenSize.x, 0)).x;
    H.y = DepthBuffer.Sample(DepthBuffer_Sampler, stc + float2(1 / ScreenSize.x, 0)).x;
    H.z = DepthBuffer.Sample(DepthBuffer_Sampler, stc - float2(2 / ScreenSize.x, 0)).x;
    H.w = DepthBuffer.Sample(DepthBuffer_Sampler, stc + float2(2 / ScreenSize.x, 0)).x;
    // Extrapolate the center depth from each side pair and measure the error.
    float2 he = abs(H.xy * H.zw * rcp(2 * H.zw - H.xy) - depth);
    float3 hDeriv;
    // The side with the smaller error shares a surface with the center sample.
    if (he.x < he.y)
        hDeriv = Calculate horizontal derivative of world position from taps | z | x |
    else
        hDeriv = Calculate horizontal derivative of world position from taps | y | w |
    float4 V;
    V.x = DepthBuffer.Sample(DepthBuffer_Sampler, stc - float2(0, 1 / ScreenSize.y)).x;
    V.y = DepthBuffer.Sample(DepthBuffer_Sampler, stc + float2(0, 1 / ScreenSize.y)).x;
    V.z = DepthBuffer.Sample(DepthBuffer_Sampler, stc - float2(0, 2 / ScreenSize.y)).x;
    V.w = DepthBuffer.Sample(DepthBuffer_Sampler, stc + float2(0, 2 / ScreenSize.y)).x;
    float2 ve = abs(V.xy * V.zw * rcp(2 * V.zw - V.xy) - depth);
    float3 vDeriv;
    if (ve.x < ve.y)
        vDeriv = Calculate vertical derivative of world position from taps | z | x |
    else
        vDeriv = Calculate vertical derivative of world position from taps | y | w |
    return normalize(cross(hDeriv, vDeriv));
}
Feb 16 Update: he and ve in the above code are calculated this way because we need perspective-correct interpolation here, i.e., interpolating 1/depth instead of depth.
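This extrapolation is exact for planar surfaces, because across screen pixels the depth of a plane varies linearly in 1/z. A quick Python check, where `predict_center` is just the expression `H.x * H.z * rcp(2 * H.z - H.x)` from the shader written out as a hypothetical helper:

```python
def predict_center(near, far):
    # Extrapolate the center depth from two same-side taps (1 and 2 pixels
    # away), assuming 1/depth varies linearly across pixels:
    #   1/c = 2/near - 1/far  =>  c = near * far / (2 * far - near)
    return near * far / (2.0 * far - near)

# Model a planar surface: 1/z is a linear function of the pixel index i.
a, b = 0.5, 0.01
z = lambda i: 1.0 / (a + b * i)

# Predicting the center (i = 0) from the two left taps (i = -1, -2) is exact.
assert abs(predict_center(z(-1), z(-2)) - z(0)) < 1e-12
```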
Finally, I should say that even this accurate method may still fail on tiny triangles, but the failures are rarely noticeable. We’ve used this technique in decal rendering for years, and our artists have never complained about any artifact. Hope you find it useful.
János Turánszki, “Improved normal reconstruction from depth”. https://wickedengine.net/2019/09/22/improved-normal-reconstruction-from-depth. ↩︎
Emil Persson. “Second-Depth Antialiasing”. In GPU Pro 4, A K Peters, 2013, pp. 201–212. ↩︎
The first thing that came to my eye was the old article “Fast Prefiltered Lines” ^{1}. It describes a method that draws pretty high-quality anti-aliased lines at a fixed cost, since the filter is pre-calculated. It’s a great article, but I couldn’t use it directly for several reasons.
Fortunately the programmable pipeline is now far more flexible than it was in 2005; with a geometry shader we can do it much more easily. The following shows the final rendered image, and I’ll describe the details below.
First, draw lines as usual: no change is needed in the host code or the vertex shader. Then add the following geometry shader to expand 1-pixel-wide lines into 3-pixel-wide quads in screen space, storing the distance to the line center in TexCoord.x:
struct GeometryOutput
{
    float4 HPosition : SV_Position;
    noperspective float2 TexCoord : TEXCOORDN; // section 2
};

[maxvertexcount(4)]
void AALineGS(line VertexOutput IN[2], inout TriangleStream<GeometryOutput> OUT)
{
    VertexOutput P0 = IN[0];
    VertexOutput P1 = IN[1];
    if (P0.HPosition.w > P1.HPosition.w)
    {
        VertexOutput temp = P0;
        P0 = P1;
        P1 = temp;
    }
    if (P0.HPosition.w < NearPlane) // section 1
    {
        float ratio = (NearPlane - P0.HPosition.w) / (P1.HPosition.w - P0.HPosition.w);
        P0.HPosition = lerp(P0.HPosition, P1.HPosition, ratio);
    }
    float2 a = P0.HPosition.xy / P0.HPosition.w;
    float2 b = P1.HPosition.xy / P1.HPosition.w;
    float2 c = normalize(float2(a.y - b.y, b.x - a.x)) / ScreenSize * 3;
    GeometryOutput g0;
    g0.HPosition = float4(P0.HPosition.xy + c * P0.HPosition.w, P0.HPosition.zw);
    g0.TexCoord = float2( 1.5, 0.0);
    GeometryOutput g1;
    g1.HPosition = float4(P0.HPosition.xy - c * P0.HPosition.w, P0.HPosition.zw);
    g1.TexCoord = float2(-1.5, 0.0);
    GeometryOutput g2;
    g2.HPosition = float4(P1.HPosition.xy + c * P1.HPosition.w, P1.HPosition.zw);
    g2.TexCoord = float2( 1.5, 0.0);
    GeometryOutput g3;
    g3.HPosition = float4(P1.HPosition.xy - c * P1.HPosition.w, P1.HPosition.zw);
    g3.TexCoord = float2(-1.5, 0.0);
    OUT.Append(g0);
    OUT.Append(g1);
    OUT.Append(g2);
    OUT.Append(g3);
    OUT.RestartStrip();
}
The above code is pretty straightforward, except for two places I would like to explain further:

1. (section 1) A line crossing the near plane must be clipped against it first, otherwise the perspective divide would produce wrong screen-space positions for the vertex behind the camera.
2. (section 2) The noperspective modifier makes TexCoord interpolate in screen space rather than in world space, which is exactly what we want, since we are drawing lines with constant width in pixels.

Finally, the pixel shader. Since TexCoord.x already represents the distance field, it’s not difficult to fade the pixel based on it to get anti-aliased lines. However, to get better AA quality we have to choose the filter very carefully. It turns out $2^{-2.7 \cdot d \cdot d}$ fits the cone filter kernel ^{2} very well.
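As a quick sanity check of this falloff (a tiny Python evaluation of the formula, nothing more): alpha is 1 at the line center and nearly 0 at the quad’s edge (d = 1.5), so the 3-pixel-wide quad fully contains the filter’s visible support.

```python
def line_alpha(d):
    # Alpha falloff 2^(-2.7 * d^2), where d is the pixel distance to the
    # line center, matching the pixel shader below.
    return 2.0 ** (-2.7 * d * d)
```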
Now here is the full PS code, and we are done!
float4 AALinePS(in GeometryOutput IN) : SV_Target
{
    float a = exp2(-2.7 * IN.TexCoord.x * IN.TexCoord.x);
    return float4(LineColor, a);
}
Eric Chan and Frédo Durand. “Fast Prefiltered Lines”. In GPU Gems 2, Addison-Wesley, 2005, pp. 345–359. ↩︎
McNamara, Robert, Joel McCormack, and Norman P. Jouppi. 2000. “Prefiltered Antialiased Lines Using Half-Plane Distance Functions.” In Proceedings of the ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware, pp. 77–85. ↩︎