# Part 2: Deep Learning FPGA Acceleration with Python - 'Inference'

Published on May 12, 2023 by Dominik Kaukinen

## Overview

In this post we build out our hardware acceleration library to handle inference for an MLP neural network using the Fisher Iris dataset. We are building a library that generates Verilog for FPGA synthesis.

We will go through each operation necessary to build the model and build a PyTorch equivalent to compare our results.

You can follow the progress of the library (which is still in development) on GitHub.

The main difficulty here is that every operation has to work on quantized inputs, weights and biases.

See our last post for details on quantization.

These difficulties show up mainly in multiplication. We need to accumulate the multiplication result in an object with a higher bitwidth than the inputs, because the product of two 8-bit numbers can need up to 16 bits. We can’t store a 16-bit number in an 8-bit object, so we use a 16-bit object to hold the result.

We convert it back to an 8-bit number using truncation.
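To make the bit growth concrete, here is a minimal sketch in plain Python. It is not part of the library, and the helper name `truncate_to_8_bits` is made up for illustration. It multiplies the two largest unsigned 8-bit values, checks how many bits the product needs, and then keeps only the top 8 bits.

```python
def truncate_to_8_bits(product: int, product_bits: int = 16) -> int:
    """Keep the 8 most significant bits of a `product_bits`-wide value."""
    return product >> (product_bits - 8)


a, b = 255, 255                      # largest unsigned 8-bit values
product = a * b                      # 65025
bits_needed = product.bit_length()   # 16
truncated = truncate_to_8_bits(product)

print(product, bits_needed, truncated)  # 65025 16 254
```

The truncated value keeps the magnitude of the result at a coarser resolution, which is the trade-off we accept to stay at 8 bits.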

## Designing the Ops for Inference

We need to adapt the standard neural net operations we are familiar with to work with quantized inputs. We also need to keep in mind that we are simulating the result for FPGA-based hardware, not a regular CPU: some things that are easy to do in software are not so easy to do in hardware.

### Linear Transformation

The linear transformation layer is a fully connected layer that takes a vector from the previous layer, multiplies it by a matrix of weights and adds a vector of biases. It transforms a vector from dimension `m` to dimension `n`, where `m` is the number of inputs and `n` is the number of outputs.

We take the input vector and multiply it by the transpose of the weights matrix and add the bias vector.

In code this is pretty simple with the underlying PyRTL library:

``````
def linear(self, other: HDLTensor) -> HDLTensor:
    # Multiply by the transposed weights, then add the bias
    other @= self.weights.transpose()
    other += self.bias
    return other
``````
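As a software reference for the arithmetic, the same transformation can be sketched in NumPy. This is an illustrative stand-in, not the library's implementation: `linear_reference` and the sample values are made up here, and the inputs are widened to 64-bit integers before multiplying so the 8-bit products cannot wrap around.

```python
import numpy as np


def linear_reference(x: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Integer reference for x @ weights.T + bias using a wide accumulator."""
    # Widen first: multiplying uint8 arrays directly would overflow silently.
    acc = x.astype(np.int64) @ weights.astype(np.int64).T
    return acc + bias.astype(np.int64)


# m = 4 inputs, n = 2 outputs
x = np.array([[200, 100, 50, 25]], dtype=np.uint8)
weights = np.array([[1, 2, 3, 4],
                    [255, 255, 255, 255]], dtype=np.uint8)
bias = np.array([10, 20], dtype=np.uint8)

out = linear_reference(x, weights, bias)
print(out)  # [[  660 95645]]
```

The second output, 95645, already needs 17 bits: summing several 16-bit products can push the result past 16 bits.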

### TruncateLSB

This layer has no counterpart in a pure software environment; it only needs to exist in hardware. It truncates the least significant bits of a binary number to keep the bitwidth manageable. Essentially, each multiplication doubles the number of required bits: if we start with 8-bit inputs and do one matrix multiplication, the products are 16-bit numbers (2^8 * 2^8 == 2^16).

`TruncateLSB` shifts each value to the right by the number of bits we want to truncate, which is equivalent to an integer division by 2^`num_bits`. In code it looks like this:

``````
def truncate_lsb(self, other: HDLTensor, bits: int) -> HDLTensor:
    # Slice each value down to `bits` bits so the width stays manageable
    other.value = other.value[:bits]
    return other
``````
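The equivalence between the right shift and integer division can be checked with a small NumPy sketch. The function name `truncate_lsb_reference` and the sample accumulator values are assumptions made for this example; they are not taken from the library.

```python
import numpy as np


def truncate_lsb_reference(values: np.ndarray, bits: int) -> np.ndarray:
    """Drop the `bits` least significant bits of every element."""
    return values >> bits


acc = np.array([660, 95645, 255, 256], dtype=np.int64)
shifted = truncate_lsb_reference(acc, 8)
divided = acc // 2**8   # floor division by 2^8

print(shifted)                           # [  2 373   0   1]
print(np.array_equal(shifted, divided))  # True
```

Note that 373 still does not fit in 8 bits. An accumulator that has grown past 16 bits needs either a larger shift or saturation to get back to an 8-bit range.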

### ReLU

ReLU is a non-linear activation function that thresholds the inputs after a layer. Since we may be dealing with unsigned integers, which only have a non-negative range, we shift the zero point to the midpoint of the range. Our modified ReLU layer looks like this:

``````
def relu(self, inputs: HDLTensor, zero_point: int) -> HDLTensor:
    for i in range(inputs.rows):
        for j in range(inputs.columns):
            # Anything below the zero point is replaced by the zero point
            inputs.ternary(
                i, j,
                lambda x: x < zero_point,
                zero_point,
                inputs[i, j]
            )
    return inputs
``````

In the case of unsigned integers the `zero_point` is `2**bits // 2`. For signed integers the `zero_point` is just `0`.
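A software equivalent of this shifted ReLU might look like the sketch below. It uses NumPy rather than the library's `HDLTensor`, and `relu_reference` is a made-up name for illustration.

```python
import numpy as np


def relu_reference(inputs: np.ndarray, zero_point: int) -> np.ndarray:
    """Replace every element below `zero_point` with `zero_point`."""
    return np.where(inputs < zero_point, zero_point, inputs)


bits = 8
zero_point = 2**bits // 2   # 128 for unsigned 8-bit values

x = np.array([[0, 127, 128, 200]], dtype=np.uint8)
y = relu_reference(x, zero_point)
print(y)  # [[128 128 128 200]]
```

Values below 128 play the role of negative numbers here, so they are clamped to 128, which stands in for zero.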

### Softmax

Fortunately in inference this is a very easy layer to implement with just an `argmax` like so:

``````
def softmax(self, inputs: HDLTensor) -> HDLTensor:
    if self.mode == 'eval':
        # In inference we only need the index of the largest value
        return inputs.argmax(axis=1)
    elif self.mode == 'train':
        raise NotImplementedError("SoftMax not implemented for training")
``````

This returns the index of the maximum value in the input tensor. Because softmax preserves the ordering of its inputs, that index is also the class with the highest probability.
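The shortcut is easy to verify in floating point. The following NumPy sketch (with a hypothetical `softmax_reference` helper and made-up logits) computes a full softmax and confirms that its `argmax` matches the `argmax` of the raw inputs.

```python
import numpy as np


def softmax_reference(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)


logits = np.array([[12.0, 40.0, 7.0],
                   [3.0, 1.0, 2.0]])
probs = softmax_reference(logits)

classes_from_probs = probs.argmax(axis=1)
classes_from_logits = logits.argmax(axis=1)
print(classes_from_probs, classes_from_logits)  # [1 0] [1 0]
```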

Note that this op gets fun when it comes to training, but we will cover that in the next post. (Spoiler: we can’t do normal division in hardware or handle the large numbers produced by the `exp` function.)

## Building a Benchmark Model with PyTorch

Since the focus of these posts isn’t PyTorch, we will be quick about this. We built a simple MLP model with 3 layers.

The model code used was:

``````
import torch.nn as nn
import torch.nn.functional as F


class IrisNet(nn.Module):
    def __init__(self):
        super(IrisNet, self).__init__()
        self.fc1 = nn.Linear(4, 30)
        self.fc2 = nn.Linear(30, 30)
        self.fc3 = nn.Linear(30, 3)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, X):
        # Two hidden layers with ReLU, then the output layer and softmax
        X = F.relu(self.fc1(X))
        X = F.relu(self.fc2(X))
        X = self.fc3(X)
        X = self.softmax(X)

        return X
``````

We trained it on 80% of the Fisher Iris dataset for 1000 epochs and got the following results on the test set:

``````
# Accuracy  0.9833333333333333
# Precision 0.9844961240310077
# Recall    0.983739837398374
``````

Note that the accuracy and other statistics don’t matter much here. We will be comparing the results of the model built in PyTorch to the model built in our library, using the PyTorch model as the ‘ground truth’. I include the numbers for completeness, and because the next post deals with training the model and will refer back to them.

We saved the 20% of the test set used for validation and will use the exact same 20% (120 data points) to compare to the results of the model built in our library.

## The Model Re-Built in Our Library

At a high level we can define a model in our `PyHDLNet` library like so:

``````
dim = 30
num_bits = 8

model = Sequential([
    Linear(4, dim,
           precision_bits=num_bits,
           weights=weights1,
           bias=bias1),
    TruncateLSB(bits=num_bits),
    ReLU(zero_point=2**num_bits // 2),
    Linear(dim, dim,
           precision_bits=num_bits,
           weights=weights2,
           bias=bias2),
    TruncateLSB(bits=num_bits),
    ReLU(zero_point=2**num_bits // 2),
    Linear(dim, 3,
           precision_bits=num_bits,
           weights=weights3,
           bias=bias3),
    TruncateLSB(bits=num_bits),
    SoftMax()
])
``````

Here `weights1`, `weights2`, `weights3`, `bias1`, `bias2` and `bias3` are the weights and biases from the PyTorch model, quantized with PyTorch. We quantized them like so:

``````
import torch


def quantize_tensor(tensor, bits=8, signed=False, symmetric=True):
    '''
    Quantization that supports signed and unsigned integers
    as well as symmetric or asymmetric ranges.
    '''
    # Integer range of the target type
    max_val = 2**(bits) - 1 if not signed else 2**(bits - 1) - 1
    min_val = 0 if not signed else -2**(bits - 1)
    max_neg_val = torch.abs(torch.min(tensor))
    max_pos_val = torch.abs(torch.max(tensor))
    tensor = tensor.clone()
    max_tensor_val = torch.max(torch.abs(tensor))
    if symmetric:
        # One scale for both signs, based on the largest magnitude
        if signed:
            tensor = (tensor * (max_val / max_tensor_val))
        else:
            # Unsigned: shift so that zero lands on the midpoint
            tensor = (tensor * ((max_val // 2) / max_tensor_val)) + \
                (max_val // 2)
    else:
        # Separate scales for the positive and negative halves
        if signed:
            pos_tensor = torch.where(
                tensor > 0, tensor * (max_val / max_pos_val), 0)
            neg_tensor = torch.where(
                tensor < 0, tensor * torch.abs(min_val / max_neg_val), 0)
            tensor = pos_tensor + neg_tensor
        else:
            pos_tensor = torch.where(
                tensor > 0,
                tensor * ((max_val//2) / max_pos_val) + (max_val // 2),
                0
            )
            neg_tensor = torch.where(
                tensor < 0,
                tensor * torch.abs((max_val//2) /
                                   max_neg_val) + (max_val // 2),
                0
            )
            tensor = pos_tensor + neg_tensor

    # torch.uint16 and torch.uint32 only exist in recent PyTorch releases
    if bits == 8:
        dtype = torch.int8 if signed else torch.uint8
    elif bits == 16:
        dtype = torch.int16 if signed else torch.uint16
    elif bits == 32:
        dtype = torch.int32 if signed else torch.uint32
    else:
        raise ValueError("Unsupported bits: {0}".format(bits))

    # Round, clamp to the integer range and return plain Python lists
    tensor = torch.clamp(
        torch.round(tensor),
        min=min_val,
        max=max_val
    ).type(dtype).data.tolist()

    return tensor

``````

Note that `precision_bits` is the number of bits used to represent the weights and biases. This is the quantization we discussed in the last post, and it can be set per layer. The output of a layer can be wider than `precision_bits`, depending on the operation performed.
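To get a feel for how much information 8 bits preserve, here is a small NumPy sketch of the symmetric unsigned case. It mirrors the formula of that branch in `quantize_tensor`, but it is a separate illustration: the helper names `quantize_symmetric_unsigned` and `dequantize` and the sample weights are assumptions, not code from the library.

```python
import numpy as np


def quantize_symmetric_unsigned(x: np.ndarray, bits: int = 8):
    """Map floats to unsigned ints with zero at the midpoint of the range."""
    max_val = 2**bits - 1
    mid = max_val // 2                    # 127 for 8 bits
    scale = mid / np.max(np.abs(x))       # one scale for the whole tensor
    q = np.clip(np.round(x * scale + mid), 0, max_val).astype(np.int64)
    return q, scale, mid


def dequantize(q: np.ndarray, scale: float, mid: int) -> np.ndarray:
    """Approximate inverse of the quantization above."""
    return (q - mid) / scale


w = np.array([-0.5, -0.2, 0.0, 0.3, 0.5])
q, scale, mid = quantize_symmetric_unsigned(w)
w_hat = dequantize(q, scale, mid)
max_error = np.abs(w_hat - w).max()

print(q)  # [  0  76 127 203 254]
```

The round-trip error is bounded by half a quantization step, `0.5 / scale`, which shrinks as `bits` grows.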

## Comparing the Results

Using the same 120 data points as for the PyTorch model, we got the following metrics in our FPGA simulator:

``````
# Accuracy  0.4583333333333333
``````

The confusion matrix is:

|            | Setosa | Versicolor | Virginica |
|------------|--------|------------|-----------|
| Setosa     | 14     | 12         | 10        |
| Versicolor | 15     | 19         | 6         |
| Virginica  | 9      | 13         | 22        |
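As a sanity check, the reported accuracy can be recovered from the confusion matrix: the diagonal holds the correctly classified samples. The sketch below uses NumPy and only the numbers shown above.

```python
import numpy as np

confusion = np.array([[14, 12, 10],
                      [15, 19, 6],
                      [9, 13, 22]])

correct = int(np.trace(confusion))   # samples on the diagonal
total = int(confusion.sum())         # all samples
accuracy = correct / total
chance = 1 / confusion.shape[0]      # uniform guessing among three classes

print(correct, total, round(accuracy, 4), round(chance, 4))  # 55 120 0.4583 0.3333
```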

Overall this is a big loss in accuracy, although still better than the roughly 33% we would expect from guessing among three classes. We expected some loss in accuracy due to the quantization of the weights and biases as well as the truncation of the layer outputs. The model is still able to classify the data, but not nearly as well as the PyTorch model.

I also observed that the result was very sensitive to how I quantized the weights. I took a naive approach with symmetric unsigned int8 quantization, but a dynamic or adaptive quantization approach would likely yield better results. Unsigned 12-bit or 16-bit integers would likely be better as well.

Our goal isn’t necessarily to take a model built in PyTorch and use our library for inference (although that would be nice). Our goal is to build a model and train it with our hardware acceleration.

We would likely be able to improve the accuracy by using [quantization aware](https://leimao.github.io/blog/PyTorch-Quantization-Aware-Training/) training as opposed to post-training quantization in PyTorch, or by using per-channel quantization. We could also have compared the weight and bias distributions to see what the best quantization would be.
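The intuition behind per-channel quantization can be shown with a toy example. This is only a sketch under stated assumptions: the weights are invented, `quantization_error` is a hypothetical helper, and a signed symmetric scheme with 127 levels is used for simplicity rather than the unsigned scheme from this post.

```python
import numpy as np


def quantization_error(w: np.ndarray, scale) -> float:
    """Largest round-trip error when quantizing `w` with the given scale."""
    q = np.round(w * scale)
    return float(np.abs(q / scale - w).max())


# Two output channels with very different weight magnitudes
weights = np.array([[0.5, -0.4, 0.3],
                    [0.011, -0.02, 0.017]])
levels = 127

# One scale for the whole tensor vs. one scale per row (channel)
per_tensor_scale = levels / np.abs(weights).max()
per_channel_scale = levels / np.abs(weights).max(axis=1, keepdims=True)

err_tensor = quantization_error(weights[1], per_tensor_scale)
err_channel = quantization_error(weights[1], per_channel_scale[1])
print(err_channel < err_tensor)  # True
```

With a single scale, the small-magnitude channel only uses a handful of the available levels; a per-channel scale lets it use the full range.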

Again, I didn’t spend a lot of time on this as I feel it is the smaller problem to solve compared to speeding up training.

I plan to revisit this in the future and see if I can get better results.

## Next Steps

We will be building out the training functionality for our library in the next post. This will include the backpropagation algorithm and the training loop. We will also be adding the ability to save and load models.