There are a million reasons why having printf support in HLSL would be super valuable. It even used to exist in older shader models. Other shader languages like the Metal Shading language have their own logging mechanisms, and shader programming gurus like MJP have rolled their own.

Disclaimer: having an officially supported way to print strings would be super useful for HLSL… but this isn’t official; it’s just me playing around on my holiday break. It is likely we could make something like this supported in Clang; however, there are some challenges to supporting it in DXC and when targeting SPIRV or Metal (more on that later).

Before I go any further, let me show you what it looks like to use this. Given the HLSL code here:

#include "dx/printf.h"

[numthreads(16,1,1)]
void main(uint3 TID : SV_GroupThreadID) {
  // Required to initialize the debug output stream.
  dx::InitializeDebugStream();
  uint idx = TID.x;
  uint value = idx + 1;
  if (value % 3 == 0 && value % 5 == 0)
    dx::printf("FizzBuzz(%u)\n", value);
  else if (value % 3 == 0)
    dx::printf("Fizz(%u)\n", value);
  else if (value % 5 == 0)
    dx::printf("Buzz(%u)\n", value);
  else
    dx::printf("Baz(%u)\n", value);
}

This produces a buffer that, when interpreted by the CPU, prints the following (note that the ordering of messages can depend on compiler optimizations and hardware behavior, but more on that later):

FizzBuzz(15)
Fizz(3)
Fizz(6)
Fizz(9)
Fizz(12)
Buzz(5)
Buzz(10)
Baz(1)
Baz(2)
Baz(4)
Baz(7)
Baz(8)
Baz(11)
Baz(13)
Baz(14)
Baz(16)

The Challenges with printf

There are two different classes of challenges that printf poses. The first is the general problem space around printf on a GPU, and the second is more about how to fit printf into HLSL. The design I’ve prototyped tries to address problems in both of these spaces to provide a powerful solution for users.

Problem Space: Printing Text on GPUs

The main problem with printing text on GPUs comes down to two issues with string data: (1) string data can be big, and (2) string formatting is wildly divergent.

Large amounts of string data coming on and off the GPU would pose a significant performance problem at multiple levels of the software stack. Embedding strings into GPU programs bloats the program sizes, increasing the data drivers need to manage and (if not stripped) results in transferring more data than necessary to the GPU (and potentially back). This would result in logging having a potentially massive impact on the runtime of a given shader, which would in turn limit its usefulness as a debugging tool.

Formatting strings on a GPU opens another massive can of worms because thread-divergent values would likely cause significant divergence in processing, and that’s before you even start to think about problems with memory allocations. I don’t know of any implementation outside the “C++ on your GPU” world that has actually tried to do string formatting on a GPU for good reason.

The solution I’ve prototyped aims to address both of these challenges by not doing any string processing on the GPU, and by removing the strings entirely from the compiled shader binary that gets handed to the driver. This has the added benefit that it requires no changes to GPU drivers, and it can work on existing production drivers.

Instead of processing the strings in the shader program, this solution depends on a new compile-time feature I’ve prototyped in the DirectXShaderCompiler and Clang, which collects all the strings in a shader into a packed de-duplicated string table, and converts strings to offsets into the table via a compile-time evaluated builtin function (__builtin_hlsl_string_to_offset).
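To make that concrete (the table layout and the offsets below are made up for illustration, so don’t read too much into the numbers), identical format strings de-duplicate down to a single table entry, and the builtin folds to a constant offset at compile time:

// Illustrative only: assuming the table packs nul-terminated strings back to
// back, "Fizz(%u)\n" occupies 10 bytes, so the next unique string starts at
// offset 10. Identical strings share one entry.
uint A = __builtin_hlsl_string_to_offset("Fizz(%u)\n"); // e.g. 0
uint B = __builtin_hlsl_string_to_offset("Buzz(%u)\n"); // e.g. 10
uint C = __builtin_hlsl_string_to_offset("Fizz(%u)\n"); // same offset as A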

With these features, printf is implemented entirely in HLSL. The actual implementation from my prototype is:

template <typename... T> void printf(string Str, T... Args) {
  // This lane's index among the wave's active lanes, and the number of
  // active lanes participating in this print.
  uint ThreadIndex = WavePrefixSum(1);
  uint ThreadCount = WaveActiveSum(1);
  // Resolved at compile time to an offset into the shader's string table.
  uint StrOffset = __builtin_hlsl_string_to_offset(Str);
  uint ArgSize = NumDwords<T...>() * 4;
  uint MessageSize = sizeof(MessagePrefix) + (ThreadCount * ArgSize);
  uint StartOffset = 0;
  // The first active lane reserves space for the whole wave and writes the
  // message prefix.
  if (WaveIsFirstLane()) {
    InterlockedAdd(OutputOffset, MessageSize, StartOffset);
    MessagePrefix Prefix = {ThreadCount, StrOffset, ArgSize};
    DebugOutput.Store<MessagePrefix>(StartOffset, Prefix);
  }
  StartOffset = WaveReadLaneFirst(StartOffset);
  // Each lane writes its packed arguments into its own slot after the prefix.
  uint ThreadOffset =
      StartOffset + sizeof(MessagePrefix) + (ThreadIndex * ArgSize);
  StoreArgs(DebugOutput, ThreadOffset, Args...);
}

This function writes into a RWByteAddressBuffer the number of threads participating in the print, the offset of the format string, and the size of the per-thread argument lists. It then writes the per-thread argument list for each thread. Each argument takes one or more 32-bit dwords. The current implementation mis-aligns 64-bit types (because I’m lazy), and there are a few other missing features and improvements that I’d want to include if we pursued a solution like this in the future.
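For reference, here is a plausible shape for the message prefix and the record layout it implies, inferred from the code above (the actual definitions live in printf.h and may differ slightly):

// One record per wave-level printf call, packed back to back in DebugOutput:
//   [ MessagePrefix ][ lane 0 args ][ lane 1 args ] ... [ lane N-1 args ]
struct MessagePrefix {
  uint ThreadCount; // Number of lanes that participated in the print.
  uint StrOffset;   // Offset of the format string in the string table.
  uint ArgSize;     // Size in bytes of one lane's packed argument list.
};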

Problem Space: HLSL isn’t ready (yet)

You may have noticed from the code above that this uses a feature of C++ that doesn’t exist in HLSL (yet): variadic templates. Argument packing for printf is an interesting problem. The way that variable argument lists work in C really won’t work on a GPU. It involves packing arguments into effectively an untyped stack-allocated array and walking the array with the format string to determine the types.

For a GPU program we want as much of the packing/unpacking to happen at compile time as possible. Templates are ideal for this since they’re fully resolved early in compilation.
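As a rough illustration of what that compile-time packing can look like (this is a simplified sketch, not the actual StoreArgs from printf.h), the argument pack can be peeled apart one parameter per template instantiation:

// Base case: nothing left to store.
void StoreArgs(RWByteAddressBuffer Buf, uint Offset) {}

// Recursive case: store the first argument, then recurse on the rest of the
// pack. Each "recursion" is a distinct template instantiation, so the whole
// chain is resolved at compile time.
template <typename First, typename... Rest>
void StoreArgs(RWByteAddressBuffer Buf, uint Offset, First Arg, Rest... Args) {
  Buf.Store<First>(Offset, Arg);
  StoreArgs(Buf, Offset + ((sizeof(First) + 3) / 4) * 4, Args...);
}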

Adding variadic template support to DXC turned out to be unexpectedly easy because of how Clang implements new C++ language features. When a C++ language feature is strictly additive and doesn’t conflict with pre-existing language rules, the Clang maintainers make it available in all earlier C++ language modes and issue warnings when it is used in older modes. This means that all the functionality for resolving variadic templates in the AST was already present in DXC. Re-enabling support was mostly a matter of turning on the parsing logic for parameter packs (...), and everything else just worked.

A second problem that I encountered bringing this to DXC was the implementation of the string type. Many users would probably be surprised to know that HLSL does have a string type; it’s just extremely limited in what you can do with it. Today it is used almost exclusively for specifying root signatures in raytracing shaders.

They are effectively static const char * variables in the raytracing subobject, and they’re disallowed in almost every other semantic context. For this prototype, I evolved string objects so they can be passed as parameters to functions. They’re still just const char * pointers (I even fixed a few bugs relating to their const-ness), and they’re passed as pointers rather than copy-in/copy-out.

This slight generalization of string handling allows printf to become a fully software-implemented feature.

Bringing it Together

The full implementation of my printf prototype for HLSL can be found in my fork of DXC. The main bits of it are the compiler changes to implement __builtin_hlsl_string_to_offset, and the new printf.h.

In addition to the changes in my fork of DXC I’ve also implemented the string format handling in my fork of our test suite.

In a future where this solution becomes generally supported, I would expect the HLSL toolchain to vend both the HLSL language functionality and a permissively licensed string format implementation that any application developer could use. There’s also nothing about the format interpretation that is intrinsic to the system, so a developer could instead plug in their own formatting logic with their own format specifiers. This could provide a lot of flexibility to users.

One big caveat: as I mentioned at the beginning, the ordering of messages is pretty dependent on compiler optimizations and the hardware you run on. You can’t depend on any given ordering, since the atomic operations used to grab space in the output buffer don’t have strict ordering guarantees and wave scheduling is largely undefined.

Despite that, this proof-of-concept implementation does work today on any retail DirectX 12 driver that supports Shader Model 6.0 or later.

Future Challenges

Bringing this into the core of the HLSL language and compiler has a few roadblocks. The main missing language features are variadic templates, and more general string types. We have had discussions about adding variadic templates as part of the plan to adopt C++11 as the base of HLSL.

The issues around the string type are a bit harder to figure out. The way strings are used in HLSL today is a bit magic and ill-defined. We’ll really need to spend some time documenting and defining the string object and the syntax and semantics of raytracing root signatures to figure out the best way to support this going forward. I believe there is a relatively straightforward path here, but we’ll need to spend some time working on it.

I fully expect that my prototype implementation will work with Clang in the near-ish future, as the missing HLSL features are all in progress, but DXC support is a bit trickier. While I’ve demonstrated that it can work in my proof-of-concept implementation, I worked around a few gnarly bugs in DXC to do so. DXC does some very strange things during codegen for function parameters, which needed to be reverted to the original Clang code. This makes my prototype work, but it causes some of our other tests to fail, and I haven’t had a chance to investigate.

There are also some limitations in the current DXC implementation around how format strings can and can’t be used, which may be more restrictive than we really want, and I’d like to at least consider alternatives. Specifically, the current implementation requires that format strings be uniform across all threads, and there might be value in allowing them to be thread-divergent. DXC also doesn’t really support creating local variables of string types, which means you can’t have more general logic to select format strings. Enabling some of these cases will require that the string-lowering IR pass I wrote gain some complexity as well.

I’ve also thought about a slight adjustment that we may want to consider. The current printf implementation depends on a groupshared uint for the counter. We could instead make the counter the first four bytes of the RWByteAddressBuffer. This has the potential benefit that the resulting resource would contain the total amount of data written to the buffer, and it saves 4 bytes of groupshared memory, but it would be slower during execution. My gut feeling is that stealing 4 bytes of groupshared memory is better, but we could make the implementation work both ways.
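For the buffer-resident variant, the reservation step could look something like this (a hypothetical sketch with a made-up helper name; the prototype uses the groupshared OutputOffset shown earlier, and the CPU would need to initialize the first dword to 4 before dispatch so records land after the counter):

// Reserve MessageSize bytes in the debug buffer, using the first dword of
// the buffer itself as the write cursor instead of a groupshared counter.
uint ReserveMessage(RWByteAddressBuffer Buf, uint MessageSize) {
  uint StartOffset;
  Buf.InterlockedAdd(0, MessageSize, StartOffset);
  return StartOffset;
}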

Another big area that I haven’t thought about much is how to bring this to Vulkan and Metal shaders. One nice thing about the DirectX object file format is that it has named sections, so the string table can be its own section that the runtime and tools see but is never passed to drivers.

SPIRV has no such separation. To support this in SPIRV we’d likely need to create an annotation instruction that contains the string table and parse the string table out of the SPIRV. Not really a big deal, but unfortunate.

For Metal there are a couple of different approaches we could take. One would be to tie this into the Metal logging system. If we wanted a unified portable system, we’d need some way to attach the string table to the Metal library, or carry the data alongside the shader. This is not the only case where we need to carry side metadata along with shaders for Metal, so we may need to think more generally about this problem in the future. If only the Metal Shader Converter were open source, then we could make the string table a user-defined attribute on the shader, but alas…