ML Accelerator: A new personal project idea (let’s see how long this lasts)

Jonathan Tan
7 min read · Dec 23, 2023


My favorite picture of Iowa State I’ve taken so far.

I am currently a Junior at Iowa State University. This past semester (Fall 2023), I took a class, CPRE 487, which is about hardware (HW) acceleration for machine learning (ML) operations.

Our final project was to design our own HW ML accelerator. My group (Justin Wenzel and I) decided to accelerate one layer of our class's toy model: a convolution layer. It is slow. It takes about 4 seconds to complete on our lab's computer (Intel i7 vPro, 12 cores / 20 threads, with 48 KB L1d, 1,280 KB L2, and 25,600 KB L3 cache), and on the Zedboard (Zynq-7000), it takes about 3 minutes to run. Our goal was to shorten the time required to run that specific layer, i.e., accelerate it. It would serve not only as a proof of concept but also as a benchmark for future improvements.

Did we get it done? No. We simply didn't have enough time. I want to blame the class for being experimental (hence some disorganization in the schedule) and the sheer amount of work required to complete the project (or lab 6; the project and lab 6 are the same thing), but truthfully, if I'd started earlier, I wouldn't have run out of time. I want to note that no other group, even the ones with graduate students, managed to complete it. In fact, we were one of the groups that made the most progress.

What we needed to do for the lab was split into three parts: reading data from BRAM into the accelerator (let's call it the "conv unit" from now on), performing the computations (MAC operations), and writing the results back to BRAM for software to read. We got to the point where our conv unit can read data from BRAM. Which was, like, two weeks of work.
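To make those three stages concrete, here is a rough, software-level sketch of what one pass through the conv unit is supposed to do. Every name here is a placeholder I made up for illustration; this is not our VHDL and not the class framework's API.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of the three stages the lab asks for (names are made up).

// Stage 1: read a chunk of data (inputs or weights) out of BRAM.
std::vector<int32_t> read_from_bram(const std::vector<int32_t>& bram,
                                    std::size_t offset, std::size_t count) {
    return std::vector<int32_t>(bram.begin() + offset,
                                bram.begin() + offset + count);
}

// Stage 2: multiply-accumulate the buffered inputs against the weights.
int64_t mac_all(const std::vector<int32_t>& inputs,
                const std::vector<int32_t>& weights) {
    int64_t acc = 0;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        acc += static_cast<int64_t>(inputs[i]) * weights[i];
    return acc;
}

// Stage 3: write one finished output back to BRAM for software to pick up.
void write_to_bram(std::vector<int32_t>& bram, std::size_t offset, int32_t value) {
    bram[offset] = value;
}
```

In hardware, of course, these become state machines and BRAM port transactions rather than function calls, which is roughly why stage 1 alone took us two weeks.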

Since we didn't manage to complete the project, I want to keep working on it myself over the next semester. I really want to see how much improvement can be made by having an accelerator. The first step is to complete the project. After that, I plan to make incremental improvements, e.g. using DRAM instead of BRAM, pipelining the conv unit, better data orchestration, and so on. This would allow me to understand each component's role in improving efficiency. Once those are completed, I would like to make the unit more generic. Right now, it is focused on accelerating one layer (in a 13-layer model). Being able to accelerate convolution layers with different filter sizes would be cool. One thing I could work on is using different dataflows. Right now, we are using a weight-stationary dataflow. It would be cool to experiment with input-stationary, output-stationary, row-stationary, convolution-stationary, and so on. Another thing to experiment with is how much energy overhead can actually be reduced by different quantization levels. The current implementation uses 32-bit fixed-point values. I would like to experiment with 8-bit, 4-bit, and 2-bit quantization.
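As a taste of what the quantization experiments might look like on the software side, here is a minimal sketch of symmetric linear quantization from a float down to a small signed integer. The function names and the scaling scheme are my own assumptions for illustration, not anything from the class framework.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Minimal symmetric linear quantization sketch (illustrative only).
// bits = 8, 4, or 2; max_abs is the largest magnitude seen in the tensor.
int32_t quantize(float x, float max_abs, int bits) {
    const int32_t q_max = (1 << (bits - 1)) - 1;           // e.g. 127 for 8-bit
    const float scale = max_abs / static_cast<float>(q_max);
    const auto q = static_cast<int32_t>(std::lround(x / scale));
    return std::clamp(q, -q_max, q_max);                   // clip to the representable range
}

float dequantize(int32_t q, float max_abs, int bits) {
    const int32_t q_max = (1 << (bits - 1)) - 1;
    return static_cast<float>(q) * (max_abs / static_cast<float>(q_max));
}
```

The interesting question for the accelerator is how much smaller the multipliers and the memory traffic get at each bit width, and how much accuracy the model gives up in return.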

Interesting, right? There are endless opportunities for things I could try. The point of this blog is for me to document what I did and what the results were, share the challenges I faced and my implementation, and hopefully help anyone who wants to replicate my results. I will link my GitHub repo with the source code for the C++ toy-model framework, as well as my VHDL code and Vivado project.

Our conv unit design.

Background (some technical jargon incoming; feel free to skip ahead to “What’s next?”)

To provide some background for people who are interested in my blog but have little to no prior exposure to ML: machine learning (ML) is a branch of AI. The idea is to make a “machine” “learn” from a very large amount of data.

Stereotypical ML picture (Source: javatpoint.com)

In the stereotypical picture of ML above, you can see different “layers,” i.e. the columns of green dots. Every dot represents a number, usually a decimal number. In the world of computers, decimal numbers are typically stored as floating-point values. The way ML learns from the large amount of data is by changing the values of the dots in the layers, using some kind of formula. For simplicity, you can think of ML kinda like averaging different values to find common ground, such that an ideal output value is generated for different inputs.

I am far from an expert in this field. If you are interested in how this works mathematically, try Googling (or ChatGPT-ing) “different machine learning models”, “how does an inference work in ml”, etc. If you understand Mandarin, this professor from NTU Taiwan explained it so well I almost emailed him to thank him. Otherwise, the YouTube channel 3Blue1Brown sums it up pretty well, too.

If you are like me (you appreciate the math, but it is not for you), I can give you a summary: it is just a bunch of math that does tons of multiplications and summations.

So, what exactly am I doing? If you took my suggestion and Googled “different machine learning models”, you should see that there are a lot of different models and a lot of different layer types.

I will mainly be talking about two: convolutional layers and dense layers.

Before we start, let’s make sure we are on the same page in terms of terminology. Inputs (or input activations) are the inputs into a layer. Weights (or filters, kernels) are the values of the layers (the green dots). Outputs (or output activations) are the outputs of a layer.

In a very, very high-level sense, all machine learning is doing, as mentioned above, is multiplications and accumulations (aka MAC operations). You take the first input i1 and the first weight w1 to generate a partial (temporary) sum o1. You then take the second input i2 and the second weight w2 to generate a second partial sum o2. This goes on until you have multiplied all inputs and weights. Then, you add them all up, i.e. o1 + o2 + o3 + … = final output.

Note: the i1, i2, w1, w2, etc. mentioned here are the green dots in the image above. They are the “values” we were talking about.
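In code, the whole multiply-then-accumulate story is just a short loop. Here is a tiny sketch (mine, not the class framework's):

```cpp
// One "neuron": multiply each input by its weight and add everything up.
float mac(const float* inputs, const float* weights, int n) {
    float output = 0.0f;
    for (int i = 0; i < n; ++i)
        output += inputs[i] * weights[i];   // one MAC per iteration
    return output;
}
```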

Ok, now let's talk about what the different layers do. Dense layers do MACs in a one-by-one fashion, similar to the order we mentioned two paragraphs above. But for image recognition, people figured a convolutional approach might be better (and they were right!).

Dense layer animation. (Source: medium.com/@Suraj_Yadav)
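A dense (fully connected) layer is just that MAC loop repeated once per output: every output value is connected to every input value. A rough sketch, with the bias term left out for simplicity (my own illustration, not the class code):

```cpp
#include <vector>

// Dense layer sketch: out[j] = sum over i of in[i] * w[j][i] (no bias).
std::vector<float> dense(const std::vector<float>& in,
                         const std::vector<std::vector<float>>& w) {
    std::vector<float> out(w.size(), 0.0f);
    for (std::size_t j = 0; j < w.size(); ++j)       // one MAC loop per output
        for (std::size_t i = 0; i < in.size(); ++i)
            out[j] += in[i] * w[j][i];
    return out;
}
```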

Convolution layers, on the other hand, operate completely differently. The filters in a convolution layer are a set of multidimensional weights; they kinda do a sliding operation across the input activations.

A convolutional layer’s filter (green) “sliding”. (Source: towardsdatascience.com)

Well, the thing about this “sliding” operation is that there kinda are a lot of operations per “slide”. If you know programming, try thinking about how many for-loops you need to do what the animation is doing.
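To spoil the answer: a naive multi-channel 2D convolution is six nested for-loops (seven if you add a batch dimension). Here is a rough sketch, assuming stride 1 and no padding; the memory layout and names are my own choices for illustration, not the class framework's:

```cpp
// Naive convolution sketch: stride 1, no padding.
// in:  H x W x C      (input activations)
// w:   F x K x K x C  (F filters, each K x K x C)
// out: (H-K+1) x (W-K+1) x F
void conv2d(const float* in, const float* w, float* out,
            int H, int W, int C, int K, int F) {
    const int OH = H - K + 1, OW = W - K + 1;
    for (int f = 0; f < F; ++f)                        // each filter / output channel
        for (int oy = 0; oy < OH; ++oy)                // each output row
            for (int ox = 0; ox < OW; ++ox) {          // each output column
                float acc = 0.0f;
                for (int ky = 0; ky < K; ++ky)         // filter rows
                    for (int kx = 0; kx < K; ++kx)     // filter columns
                        for (int c = 0; c < C; ++c)    // input channels
                            acc += in[((oy + ky) * W + (ox + kx)) * C + c]
                                 * w[((f * K + ky) * K + kx) * C + c];
                out[(oy * OW + ox) * F + f] = acc;     // one output activation
            }
}
```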

I can tell you it is a lot. There are 25,920,000 MAC operations for a 64*64*3 image (the 3 is RGB) and 32 filters of size 5*5*3. And this is just a 64-by-64 image! A 1080p image is 1920-by-1080! Do you get how many operations it takes to run a convolution now?

And there are usually multiple convolution layers with different input, filter, and output sizes in a model. The runtime stacks up.

So, we want to speed things up. Those 25,920,000 MAC operations took 35 ms to run (on the Intel i7 vPro lab computer). The second layer in my class’s toy model is a convolution layer with a 60*60*32 input and a 5*5*32*32 filter, which requires 2,569,011,200 MAC ops. That layer takes 4 seconds to run!

A big chunk of those 4 seconds is spent waiting for data to move to the right places. Memory reads and writes are a big issue in computers. We have tons of computational efficiency breakthroughs (like the recent 3nm Apple chips). Those things are fast. But innovation in memory technology is far less hot. Hence, memory reads/writes are usually the biggest bottleneck in a processing system.

What’s next?

The goal of the class I took, and my continuing objective in my spare time, is to address this problem. With smart coding techniques, one can exploit spatial and temporal locality to reduce memory read/write time. One can also design a hardware component (an accelerator) to manage memory more effectively.
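As a small example of what that "smart coding" can look like on the software side, here is loop tiling (blocking) applied to a plain matrix multiply. The tile size is a placeholder and would need tuning against the actual cache sizes; this is a sketch of the locality idea, not our accelerator's algorithm.

```cpp
#include <algorithm>

// Tiled (blocked) matrix multiply sketch: C = A * B, all N x N, row-major.
// C is assumed to start zeroed. Working on T x T tiles keeps reused data
// resident in cache (temporal locality) and walks memory contiguously
// (spatial locality).
constexpr int T = 64;   // tile size: a guess, tune for the real cache

void matmul_tiled(const float* A, const float* B, float* C, int N) {
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < std::min(ii + T, N); ++i)
                    for (int k = kk; k < std::min(kk + T, N); ++k) {
                        const float a = A[i * N + k];
                        for (int j = jj; j < std::min(jj + T, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```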

I would like to do both: design software that runs well on my hardware accelerator, aka algorithm-hardware co-design.

So buckle up and enjoy the ride.



Written by Jonathan Tan

Student at Iowa State University. I am interested in hardware (computer architecture) design.