Originally posted at http://cs.ucsb.edu/~cgb/multigridCUDA.html, and thus, looks much better there.
Over the last quarter I have been investigating how to implement a multigrid algorithm with CUDA, a new language aimed at making General Purpose Computation on GPUs (GPGPU) easier. This web page documents this experience.
Introduction
It is not always possible to solve a partial differential equation (PDE) exactly. In fact, exact solutions are not known for most PDEs. With the advent of computers, however, we can obtain good approximations to these solutions. One class of methods used to solve PDEs is the Multigrid Methods. Compared to other methods for solving PDEs, they require fewer iterations and are thus generally faster. For this project, we have elected to solve a common PDE, the heat equation in three dimensions, given by the following equation:
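∂u/∂t = k∇²u = k(∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z²)

where u is the temperature at each point and k is the (constant) thermal diffusivity.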
Implementing multigrid methods is not a new problem. Therefore, we chose to implement them on a new platform: the Graphics Processing Unit (GPU). This has become possible only recently with the advent of GPGPU languages such as NVIDIA’s CUDA (Compute Unified Device Architecture) and ATI’s CTM (Close To Metal), amongst others. Since we already have a CUDA-capable NVIDIA GeForce 8800 in our lab, we used CUDA to implement this project.
Algorithms and Technologies Used
CUDA
CUDA is a C-like language that allows programmers to “easily” port their code for use on the GPU. The language is mostly a superset of C, allowing the programmer to flag which functions run on the GPU and where variables are stored on the GPU. Although it is presented as a high-level language, low-level parts of the architecture are exposed to the user. As in C, the programmer can allocate and de-allocate memory (via cudaMalloc and cudaFree), but in CUDA the programmer can also choose specifically where on the GPU to put it. The following figure is from the CUDA documentation, showing the various memory locations available to the user:
[Figure from the CUDA documentation: the memory spaces available to a CUDA program, including shared memory, constant cache, texture cache, and device memory.]
Here we can see that we can allocate memory in the graphics card’s shared memory, constant cache, texture cache, or device memory. Although the device memory is the slowest to access, it has the most space and spares us from worrying about cache coherence in the shared memory. Thus, we use the device memory. A critique of the CUDA language is given in the “Pros and Cons of CUDA” section.
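As a concrete illustration, a minimal sketch of flagging a function to run on the GPU and allocating a grid in device memory might look like the following. This is not code from the project; the names, grid size, and launch configuration are assumptions for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// The __global__ qualifier flags a function that runs on the GPU.
// This kernel just zeroes a flat array representing the grid.
__global__ void clear_grid(float *grid, int num_points)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_points)
        grid[idx] = 0.0f;                          // 0 degrees everywhere
}

int main()
{
    const int n = 64;                              // grid dimension (assumed)
    const int num_points = n * n * n;
    const size_t bytes = num_points * sizeof(float);

    float *d_grid = NULL;                          // lives in device (global) memory
    if (cudaMalloc((void **)&d_grid, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    clear_grid<<<(num_points + 255) / 256, 256>>>(d_grid, num_points);
    cudaDeviceSynchronize();                       // wait for the kernel to finish

    cudaFree(d_grid);                              // give the device memory back
    return 0;
}
```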
Multigrid Methods
Typical methods used to solve PDEs represent the simulated region as a three-dimensional grid (or matrix), which is easy to represent on a computer. Furthermore, we can write the heat equation as a linear system of equations to be solved. The general and specific representations are as follows:
Ax = b

k∇²u = uₜ
For our implementation, k is 1, A is the Laplace operator, x is u, and b is uₜ. Three simple methods exist for solving this system of equations: the Jacobi Method, the Gauss-Seidel Method, and the Red-Black Gauss-Seidel Method. Our explanation for choosing the Red-Black Gauss-Seidel Method is given later, since this section focuses on Multigrid Methods in general.
As was previously stated, regular methods use one grid and solve it using any of the aforementioned methods. Multigrid methods differ in that they solve many smaller grids and use those as approximations to the solution of the full-size grid. This is necessary because solving only one large grid does not give the fine-grained accuracy that solving the smaller grids does. Furthermore, solving many small grids once and the large grid a small number of times is computationally faster than solving the large grid many times. The methods needed to implement this are described in the design section.
Red-Black Gauss-Seidel Method
The simple methods used to solve a linear system of equations are the Jacobi Method, the Gauss-Seidel Method, and the Red-Black Gauss-Seidel Method. As shown in A Multigrid Tutorial, the Red-Black Gauss-Seidel Method converges the fastest of the three (it requires the fewest iterations). An added benefit specific to CUDA is that this method is easily parallelizable. In the standard Gauss-Seidel Method, grid points must be updated one at a time; although this works fine for serial computation, it is a poor choice for the parallel case. In Red-Black Gauss-Seidel, the odd points are all updated in parallel, and then the even points are all updated in parallel. The update for each point is simply the average of that point’s neighbors.
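A minimal sketch of what one such sweep can look like in CUDA follows. This is illustrative rather than the project’s actual kernel: the names, the 3D launch layout, and the plain six-neighbor average are assumptions.

```cuda
// One Red-Black Gauss-Seidel sweep over an n x n x n grid stored as a flat
// 1D array. 'color' selects the odd (1) or even (0) points by parity; each
// matching interior point is replaced by the average of its six neighbors.
__global__ void rbgs_sweep(float *u, int n, int color)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;

    if (i <= 0 || i >= n - 1 || j <= 0 || j >= n - 1 || k <= 0 || k >= n - 1)
        return;                                // leave boundary values fixed
    if (((i + j + k) & 1) != color)
        return;                                // wrong color for this sweep

    int idx = (k * n + j) * n + i;             // flat index into the 1D array
    u[idx] = (u[idx - 1]     + u[idx + 1] +
              u[idx - n]     + u[idx + n] +
              u[idx - n * n] + u[idx + n * n]) / 6.0f;
}

// One full iteration on the host side (hypothetical launch configuration):
//   dim3 block(8, 8, 8), grid(n / 8, n / 8, n / 8);
//   rbgs_sweep<<<grid, block>>>(d_u, n, 1);   // update the odd points
//   rbgs_sweep<<<grid, block>>>(d_u, n, 0);   // then the even points
```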
Design Process
This being my first academic project since reading Code Complete, I decided to develop a prototype first. My C++ prototype simulates a 2D grid. It was mainly used to get a feel for the methods needed for the CUDA version and, as a result, is incomplete. It ended up mapping over to CUDA well, as follows:
| C++ Method Name | CUDA Method Name | Purpose |
| --- | --- | --- |
| Constructor | Initialize / MG_CUDA_AllocateMem | Creates a grid. The C++ version needs a grid size; the CUDA version creates all the subgrids from the given size to a preset limit. Each subgrid is half the size of the previous grid. |
| Destructor | Close / MG_CUDA_Free | Frees the memory a grid has allocated. The CUDA version uses cudaFree to release all the memory it has been given via cudaMalloc. |
| setupInitialConditions | HeatToggle | The grid is initialized to 0 degrees at all points by the constructor. The C++ version of this method sets all points in the first row to be 100 degrees, and the CUDA version sets the innermost area of the cube to be 200 degrees. This is done so that we will be able to see the heat flow throughout the geometry. |
| restrictGrid | MG_CUDA_RestrictGrid / MG_CUDA_RKernel | Creates a new grid half the size of the original grid. The C++ version copies the points down from the bigger grid to the smaller one, while the CUDA version sets each point to be the weighted average of the points around it (a sketch of this kind of kernel is given after the table). |
| interpolateGrid | MG_CUDA_InterpolateGrid / MG_CUDA_IKernel | Creates a new grid twice the size of the original grid. Both versions set each point to be the weighted average of the points around it. |
| solveGrid | MG_CUDA_SolveGrid / MG_CUDA_SGKernel | Both versions solve the given grid by using the Red-Black Gauss-Seidel Method. The odd numbered points are updated to be a weighted average of the points around them, and the process is repeated for the even numbered points. The C++ version does this until 50 iterations are done, while the CUDA version does 1 iteration. |
| computeResidual | MG_CUDA_ComputeResidual / MG_CUDA_CRKernel | Takes the difference between each point in two grids and returns a new grid containing this difference. |
| applyCorrection | MG_CUDA_ApplyCorrection / MG_CUDA_ACKernel | Adds each point of one grid to the corresponding point of the given grid and returns a new grid. |
| printGrid | {none} | Used for debugging purposes only while the visualization scheme was being developed. |
| main | Simulate | Performs one iteration of our multigrid solver, which consists of restricting the grids with data from the previous iteration, solving the smaller grids, and solving the largest grid. |
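To make the grid-transfer step concrete, here is a minimal sketch of a restriction kernel in the spirit of MG_CUDA_RKernel. The names, launch layout, and the simple 2x2x2 averaging are assumptions for illustration; the project’s kernel may weight the neighbors differently.

```cuda
// Restrict a fine grid (2*nc per side) to a coarse grid (nc per side):
// each coarse point becomes the average of the 2x2x2 block of fine points
// that it covers.
__global__ void restrict_kernel(const float *fine, float *coarse, int nc)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= nc || j >= nc || k >= nc) return;

    int nf = 2 * nc;                           // the fine grid is twice as large
    int fi = 2 * i, fj = 2 * j, fk = 2 * k;    // corresponding fine-grid corner

    float sum = 0.0f;
    for (int dk = 0; dk < 2; ++dk)
        for (int dj = 0; dj < 2; ++dj)
            for (int di = 0; di < 2; ++di)
                sum += fine[((fk + dk) * nf + (fj + dj)) * nf + (fi + di)];

    coarse[(k * nc + j) * nc + i] = sum / 8.0f;
}
```

Interpolation is the mirror image of this step: each fine-grid point is built from a weighted average of the coarse-grid points around it.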
The code for the multigrid prototype can be downloaded here.
Results
For timing purposes, we consider each execution of the ‘Simulate’ method to be one step, and we measure how many steps can be performed per second. We test this for both the CUDA and C multigrid implementations on grids of size 16 x 16 x 16 through 256 x 256 x 256. We attempted to go larger (to 512), but the CPU’s memory allocation function (malloc) failed and could not allocate the 512 MB of memory we needed for each of the four arrays we manipulate on the CPU (assuming 4-byte values, a 512 x 512 x 512 grid is 512³ x 4 bytes = 512 MB per array; that this figure matches the grid dimension of 512 is coincidental). That being said, the speedup we achieve is given as follows:
For the purposes of this experiment, the % difference is also the speedup between the CUDA and C versions of the multigrid method.
[Chart: steps per second for the CUDA and C multigrid implementations at each grid size, with the % difference between them.]
What is interesting to note is that the CPU version is actually faster on the grid of dimension 64, and that the GPU is not as notably faster as we have seen in other CUDA applications. The reason for this is two-fold. First, we were unable to resolve issues getting the Red-Black Gauss-Seidel method to work on our largest grid, so that grid had to be solved on the CPU. Second, the integer modulo operator we use to separate the even and odd points for Red-Black Gauss-Seidel is very fast on the CPU and very slow on the GPU; furthermore, the grid was stored as a one-dimensional array in memory, so to recover the three-dimensional structure each point had to perform a swizzling computation, which also involved integer modulo operations (a sketch of this index computation is given below). Add in the fact that we do this roughly 16 million times per step (once per point of the 256 grid) and you can see why our speedup was not as good as it could have been. A low-resolution video of our CUDA multigrid implementation has been recorded along with a high-resolution screen capture. Both are provided here, in that order:
[Video and high-resolution screenshot of the CUDA multigrid visualization]
Unfortunately, the video recording software we used does not capture the high frame rates of our program very well and thus lags much more than the actual program does. Furthermore, our version is more brilliant in terms of color, as the screenshot on the right reveals. We use a simple 3D rotation technique, based on which axes the user is manipulating, for rotating the cube, and we provide mouse combinations that allow the user to zoom in and out of the cube and translate themselves in space around the cube. These features are all documented in the above video. As we refactored Brent Oster’s C code to produce our CUDA version, we have kept the interaction mechanisms he implemented intact. We have also added functionality to switch rendering on and off so that we can accurately measure how many steps are computed per second, as rendering negatively impacts that measurement.
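For reference, the index swizzling mentioned above looks roughly like the following. The function and variable names are illustrative, not taken from the project’s code.

```cuda
// Recover (i, j, k) coordinates from a flat 1D index into an n*n*n grid.
// Integer division and modulo are cheap on the CPU but expensive on the GPU,
// which is part of why the expected speedup did not materialize.
__device__ void unflatten(int idx, int n, int *i, int *j, int *k)
{
    *i = idx % n;            // x coordinate
    *j = (idx / n) % n;      // y coordinate
    *k = idx / (n * n);      // z coordinate
}

// Since the grid sizes here are powers of two, one possible mitigation is to
// replace the modulo and division with bit operations, e.g. for n = 256:
//   *i = idx & 255;   *j = (idx >> 8) & 255;   *k = idx >> 16;
```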
On December 14, 2007, this material was presented to our ‘GPGPU and 3D User Interfaces’ class. A PDF of the presentation can be found here. I’m currently working on putting the code up. The code used for this project consists of four files: NC_Multigrid.cpp, NC_Multigrid.h, NanoCAD2.cpp, and the CUDA source code, Multigridcu.cu. They can be found at “Projects/NanoCAD2/Modules/”, with the exception of the driver file, which is at “Projects/NanoCAD2/NanoCAD2.cpp”.
Pros and Cons of CUDA
CUDA provides a novel way for the programmer to use the graphics hardware to its fullest. It is both a high- and low-level language, but because it is a relatively new language, it is not without its downsides.
Pros:
Cons:

This wouldn’t be so bad except for the fact that this happens almost every time something goes wrong, not just with the blocks. In the last four hours of programming in CUDA, this has happened to me FIVE times. Future implementations of CUDA NEED to fix this for CUDA to become even close to viable as a good programming language. My computer shouldn’t die just because I used memory that I allocated and set to zero but forgot to copy the host’s version over to. That leads into the next issue…


Trying to recompile the code gives no errors and the code runs fine. So what’s the deal? I can’t even get the blue screen of death consistently! Sometimes Visual Studio gives me a cudaError_enum, sometimes the computer locks up, and sometimes it gives me the blue screen. Why can’t it just be consistent, and consistently the first result?
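For what it’s worth, one partial mitigation is to poll the runtime for errors after every kernel launch, so failures at least show up as error codes. This is a hedged sketch, not necessarily how the project handled it; the helper name and the kernel in the usage comment are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Report any pending CUDA error after a kernel launch. 'where' is just a
// label for the message.
static void check_cuda(const char *where)
{
    cudaError_t err = cudaGetLastError();      // errors from the launch itself
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();         // wait for the kernel, catch runtime errors
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error at %s: %s\n", where, cudaGetErrorString(err));
}

// Usage:
//   my_kernel<<<grid, block>>>(d_grid, n);
//   check_cuda("my_kernel");
```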
Conclusion
Whew! What a long review! It’s been a great quarter full of new opportunities, and thanks to all for letting me use what I’ve needed to get this done. Special thanks go out to Tobias Hollerer and Brent Oster for the use of the hardware and software, respectively, that was used or refactored in some way for this project. CUDA has the potential to go far for sure, but there are definitely some huge kinks it needs to work out before it’s ready for the general public.

I have had some 8 months of experience with these cards, and I agree with most of your comments. They are a b**** to program. But I have seen great results in some algorithms, so what can I say… they’re great!?
Hi, I need to run your code, it seems very sweet! But when I click on the link it says page not found 😦
Ah, my apologies. The links work fine on the page cited at the top, but here are the links:
Prototype: http://cs.ucsb.edu/~cgb/code/cs290i/multigridPrototype.zip
CUDA Code: http://cs.ucsb.edu/~cgb/code/cs290i/multigridInCUDA.zip
I haven’t touched the code in a while, but hopefully you find it useful!
Your prototype code does not use multigrid; it works on a single grid.
A newbie question: can I use multigrid as a multi-resolution technique, or are the grids just there to correct the fine grid?
I have worked with CUDA on Red Hat Linux. It works great and is very simple to use in both real and emulation mode! I achieved a 30X speedup for my purposes 🙂
You know shatterednirvana, I wrote about this very subject earlier today on my own blog. Your blog post has provided me with some food for thought and I feel you made some really interesting points. I wish I had read it earlier, prior to writing my own post.
Yours ;-D,
bowlegged148.
I have downloaded the CUDA code from this link http://cs.ucsb.edu/~cgb/code/cs290i/multigridInCUDA.zip but I did not find the solution file. Would you please help me?
Thanks in advance.