Kernel Encoder
CPU Side: Encoder
Now it's time to write the CPU side of the Metal pipeline. First, let's take a quick look at how GPU work scheduling is organized.
To get the GPU to perform work on your behalf, you need to send commands to it. There are three types of commands: render, compute and blit. A compute command is what we need to schedule our adjustments shader for execution.
The objects that you operate while creating a command for a GPU are:
- device - the software interface to a GPU. The device can create command queues, shader libraries, and pipeline states, and allocate resources (heaps, buffers and textures). Created once by calling MTLCreateSystemDefaultDevice().
- library - an object that contains compiled shaders. Created once by the device by calling makeDefaultLibrary().
- function - an object that specifies which shader function a Metal pipeline calls when the GPU executes commands that specify that pipeline. Created once by the library by calling makeFunction(name:).
- pipeline state - an object used to refer to a compiled function. Created once by the device by calling makeComputePipelineState(function:).
- command queue - an object that queues an ordered list of command buffers for a device to execute. Created once by the device by calling makeCommandQueue().
- command buffer - a lightweight container that stores encoded commands for the GPU to execute. Created by the command queue on each command dispatch by calling makeCommandBuffer().
- command encoder - a lightweight object used to encode commands into a command buffer. Created by the command buffer on each command dispatch by calling makeComputeCommandEncoder().
The hierarchy of creation of the objects is depicted below:
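In code, the creation chain can be sketched roughly like this (a minimal sketch; the kernel name "adjustments" is an assumption for illustration — use the name from your own shader file):

```swift
import Metal

// One-time setup: device → library → function → pipeline state, plus a command queue.
guard let device = MTLCreateSystemDefaultDevice(),
      let library = device.makeDefaultLibrary(),
      let commandQueue = device.makeCommandQueue()
else { fatalError("Metal is not available") }

// "adjustments" is a placeholder — it must match the kernel name in the shaders.
let function = library.makeFunction(name: "adjustments")!
let pipelineState = try device.makeComputePipelineState(function: function)

// Per-dispatch objects: command buffer → command encoder.
guard let commandBuffer = commandQueue.makeCommandBuffer(),
      let encoder = commandBuffer.makeComputeCommandEncoder()
else { fatalError("Failed to create command objects") }
```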
Now let's create an empty Swift file, Adjustments.swift.
Adjustments
We are going to create an Adjustments class which will be responsible for encoding the work for the GPU and passing all the necessary data to it: temperature and tint in our case. Following Metal's paradigm of compiling instructions once and quickly reusing them at runtime, Adjustments will store the pipeline state as a property.
final class Adjustments {
}
Create the temperature and tint properties. These values will be modified by the UI and then sent to the kernel during encoding.
var temperature: Float = .zero
var tint: Float = .zero
Create the dispatch flag and the pipeline state. These values need to be initialized once and stored to use them while encoding.
private var deviceSupportsNonuniformThreadgroups: Bool
private let pipelineState: MTLComputePipelineState
The constructor of the Adjustments class takes a Metal library as an argument. The library is used below to create the function for the pipeline state. Since one library can contain multiple functions, it's good practice to initialize the library once and then reuse it, so we create and store the library outside of the class.
init(library: MTLLibrary) throws {
}
From this point, we are going to fill the constructor following step-by-step instructions.
Constructor
Different iPhones have different hardware (including the GPU) that supports different sets of features. To initialize the deviceSupportsNonuniformThreadgroups property correctly, we need to look up the non-uniform threadgroups feature in the Metal Feature Set Tables and find the feature set that describes the hardware supporting it.
self.deviceSupportsNonuniformThreadgroups = library.device.supportsFeatureSet(.iOS_GPUFamily4_v1)
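Note that supportsFeatureSet(_:) is deprecated on newer OS versions. If you target iOS 13 or later, the equivalent check uses GPU families instead — a sketch, assuming .apple4 (the A11 GPU family, the first with non-uniform threadgroup support) is the right threshold for your use case:

```swift
// Modern replacement for supportsFeatureSet(.iOS_GPUFamily4_v1):
self.deviceSupportsNonuniformThreadgroups = library.device.supportsFamily(.apple4)
```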
Initialize a function constants object and set the deviceSupportsNonuniformThreadgroups value on it. The index is set to 0, the same as declared in the shaders.
let constantValues = MTLFunctionConstantValues()
constantValues.setConstantValue(&self.deviceSupportsNonuniformThreadgroups,
type: .bool,
index: 0)
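For reference, on the shader side this constant is expected to be declared with the matching index — a sketch, where the identifier name is an assumption:

```metal
// In the .metal file: function constant bound at index 0.
constant bool deviceSupportsNonuniformThreadgroups [[function_constant(0)]];
```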
Create a function from the library with the previously initialized function constants. The name of the function is the same as in the shaders (here we assume the kernel is named "adjustments" — substitute your own kernel's name).
let function = try library.makeFunction(name: "adjustments",
                                        constantValues: constantValues)
The final step is the pipeline state creation. At this point, the shaders are compiled into GPU instructions, and the passed function constants are used to determine whether a boundary check will be among them.
self.pipelineState = try library.device.makeComputePipelineState(function: function)
The result should look like this:
final class Adjustments {
    var temperature: Float = .zero
    var tint: Float = .zero

    private var deviceSupportsNonuniformThreadgroups: Bool
    private let pipelineState: MTLComputePipelineState

    init(library: MTLLibrary) throws {
        self.deviceSupportsNonuniformThreadgroups = library.device.supportsFeatureSet(.iOS_GPUFamily4_v1)
        let constantValues = MTLFunctionConstantValues()
        constantValues.setConstantValue(&self.deviceSupportsNonuniformThreadgroups,
                                        type: .bool,
                                        index: 0)
        let function = try library.makeFunction(name: "adjustments", // your kernel's name
                                                constantValues: constantValues)
        self.pipelineState = try library.device.makeComputePipelineState(function: function)
    }
}
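A sketch of how the class might be created at app startup — the surrounding setup code is an assumption, not part of the class itself:

```swift
import Metal

guard let device = MTLCreateSystemDefaultDevice(),
      let library = device.makeDefaultLibrary()
else { fatalError("Metal is not available on this device") }

// The library is created once and can be shared between all filter classes.
let adjustments = try Adjustments(library: library)
adjustments.temperature = 0.3
adjustments.tint = -0.1
```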
Encoding Function
Next, we're going to write the encoding of the kernel. The main thing we need to do here is use the command buffer's encoder to encode all the necessary resources and instructions for the GPU.
Below the class constructor, add the encoding function. Metal doesn't prescribe its name; here we'll call it encode.
func encode(source: MTLTexture,
            destination: MTLTexture,
            in commandBuffer: MTLCommandBuffer) {
}
Now let's fill it in. Create a command encoder. This lightweight object is used to encode everything into the command buffer.
guard let encoder = commandBuffer.makeComputeCommandEncoder()
else { return }
Set the source and destination textures at the same indices we used in the shaders.
encoder.setTexture(source,
index: 0)
encoder.setTexture(destination,
index: 1)
Set the tint and temperature values. Given that the data we send to the GPU is just two float values, which is small, we use the setBytes function recommended for such cases. If the data were large, we'd create an MTLBuffer for it and use setBuffer instead. The indices are the same as in the kernel's arguments.
encoder.setBytes(&self.temperature,
length: MemoryLayout<Float>.stride,
index: 0)
encoder.setBytes(&self.tint,
length: MemoryLayout<Float>.stride,
index: 1)
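For comparison, the MTLBuffer path for larger data might look like this — a sketch only; in real code the buffer would be created once up front rather than on every encode:

```swift
// Hypothetical example: sending an array of parameters via a buffer.
var params: [Float] = [temperature, tint]
let buffer = library.device.makeBuffer(bytes: &params,
                                       length: MemoryLayout<Float>.stride * params.count,
                                       options: .storageModeShared)
encoder.setBuffer(buffer, offset: 0, index: 0)
```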
Calculate the size of the grid and the threadgroups. The grid size should match the texture's size so that each thread works on its own pixel. As for the threadgroup size, we want it to be as large as possible to maximize parallelization. The threadgroup size calculation is based on two properties of the pipeline state: maxTotalThreadsPerThreadgroup and threadExecutionWidth. The first defines the maximum number of threads that can be in a single threadgroup, and the second is equal to the width of the SIMD group and defines the number of threads executed in parallel on the GPU.
let gridSize = MTLSize(width: source.width,
height: source.height,
depth: 1)
let threadGroupWidth = self.pipelineState.threadExecutionWidth
let threadGroupHeight = self.pipelineState.maxTotalThreadsPerThreadgroup / threadGroupWidth
let threadGroupSize = MTLSize(width: threadGroupWidth,
height: threadGroupHeight,
depth: 1)
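With hypothetical values reported by a pipeline state, the arithmetic works out like this:

```swift
// Hypothetical values (they vary by GPU):
let threadExecutionWidth = 32            // SIMD group width
let maxTotalThreadsPerThreadgroup = 1024

let width = threadExecutionWidth                     // 32
let height = maxTotalThreadsPerThreadgroup / width   // 1024 / 32 = 32
// → a 32 × 32 × 1 threadgroup of exactly 1024 threads.
```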
Set the pipeline state which contains precompiled instructions of our adjustments kernel.
encoder.setComputePipelineState(self.pipelineState)
If the device supports non-uniform threadgroups, we let Metal calculate the number of threadgroups and generate smaller ones along the edges of the grid. If the device doesn't support this feature, we calculate the number of threadgroups by hand so that they cover the whole texture.
if self.deviceSupportsNonuniformThreadgroups {
encoder.dispatchThreads(gridSize,
threadsPerThreadgroup: threadGroupSize)
} else {
let threadGroupCount = MTLSize(width: (gridSize.width + threadGroupSize.width - 1) / threadGroupSize.width,
height: (gridSize.height + threadGroupSize.height - 1) / threadGroupSize.height,
depth: 1)
encoder.dispatchThreadgroups(threadGroupCount,
threadsPerThreadgroup: threadGroupSize)
}
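The rounding-up division matters at the edges. For a hypothetical 1920 × 1080 texture with 32 × 32 threadgroups:

```swift
// Hypothetical 1920 × 1080 texture, 32 × 32 threadgroups:
let countW = (1920 + 32 - 1) / 32   // 60  (1920 divides evenly)
let countH = (1080 + 32 - 1) / 32   // 34  (rounded up from 33.75)
// 60 × 34 groups cover 1920 × 1088 threads — the extra 8 rows of threads
// are exactly what the boundary check in the kernel discards.
```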
After all encoding is done, we call endEncoding(). Without this call, the command buffer won't know that it's ready to dispatch the commands to the GPU.
encoder.endEncoding()
Here's the final encoding function:
func encode(source: MTLTexture,
            destination: MTLTexture,
            in commandBuffer: MTLCommandBuffer) {
    guard let encoder = commandBuffer.makeComputeCommandEncoder()
    else { return }
    encoder.setTexture(source,
                       index: 0)
    encoder.setTexture(destination,
                       index: 1)
    encoder.setBytes(&self.temperature,
                     length: MemoryLayout<Float>.stride,
                     index: 0)
    encoder.setBytes(&self.tint,
                     length: MemoryLayout<Float>.stride,
                     index: 1)
    let gridSize = MTLSize(width: source.width,
                           height: source.height,
                           depth: 1)
    let threadGroupWidth = self.pipelineState.threadExecutionWidth
    let threadGroupHeight = self.pipelineState.maxTotalThreadsPerThreadgroup / threadGroupWidth
    let threadGroupSize = MTLSize(width: threadGroupWidth,
                                  height: threadGroupHeight,
                                  depth: 1)
    encoder.setComputePipelineState(self.pipelineState)
    if self.deviceSupportsNonuniformThreadgroups {
        encoder.dispatchThreads(gridSize,
                                threadsPerThreadgroup: threadGroupSize)
    } else {
        let threadGroupCount = MTLSize(width: (gridSize.width + threadGroupSize.width - 1) / threadGroupSize.width,
                                       height: (gridSize.height + threadGroupSize.height - 1) / threadGroupSize.height,
                                       depth: 1)
        encoder.dispatchThreadgroups(threadGroupCount,
                                     threadsPerThreadgroup: threadGroupSize)
    }
    encoder.endEncoding()
}
Excellent! Now we have a compute kernel and the corresponding encoder for it. In the next part, we are going to write the UIImage to MTLTexture conversion to pass textures to the encoder, create a command queue, and dispatch the kernel 👍.