FPGAs for High Throughput Computing

A few years ago I attended a week-long seminar on programming FPGAs.  The training left me with a lot of questions, but I gained an appreciation for the power of this flexible little chip.

What is an FPGA?

OK, here’s the official definition from Xilinx (one of the two big manufacturers of FPGAs): “Field Programmable Gate Arrays (FPGAs) are semiconductor devices that are based around a matrix of configurable logic blocks (CLBs) connected via programmable interconnects. FPGAs can be reprogrammed to desired application or functionality requirements after manufacturing. This feature distinguishes FPGAs from Application Specific Integrated Circuits (ASICs), which are custom manufactured for specific design tasks. Although one-time programmable (OTP) FPGAs are available, the dominant types are SRAM based which can be reprogrammed as the design evolves.”

Now let me translate that a bit.  A general purpose computer has registers (super fast) that hold values, memory (a lot slower than the registers), and a logic unit that takes values from the registers, operates on them (math operations, for example), and puts the result back into a register.  The logic units are designed in hardware so that they do their operations really fast.  General purpose CPUs (like the Intel, AMD, or ARM processors) perform basic mathematical operations like addition, subtraction, multiplication, and division on integer or floating point values.  If you want to do something more complicated than that, you have to write a program that uses these basic operations to accomplish it.

An FPGA is essentially a blank slate that you can use to write your own logic unit and registers.  You can even configure part of it to act like RAM that operates at the same speed as the rest of the FPGA.  It is reprogrammable, meaning that if you have five analysts writing algorithms that need to be evaluated, each can have their own algorithm written for the FPGA and have it downloaded to the FPGA when their job is executed.  It’s pretty much like being able to write your own custom CPU.

Another nice thing about FPGAs is that you can take available packages and integrate them into your design.  If you need network connections directly to the board you can (depending on the FPGA board) add several of them and wire the pre-built transceiver units into your logic.  You can transfer data to and from the board using either the PCI interface or up to nine 100G Ethernet ports by adding the pre-built logic units that interface with these options.

Orders of magnitude

The cost of adding FPGAs to your cluster can easily be offset by the gains in performance and scale that you can achieve.  FPGAs typically run at slower clock speeds than general purpose CPUs (more on that below), so how can they deliver gains in performance?  The answer lies at the heart of high performance computing: parallelism!  Let’s say that to build your algorithm you need 3000 logic units.  A typical FPGA has between 500,000 and 4,500,000 logic units available, so you can take that 3000 logic unit algorithm and replicate it roughly 160 to 1500 times on the FPGA.  After accounting for some of the overhead of getting data in and out of the board (either through a 100G Ethernet connection or through the PCI bus), you may end up with 100 to 1000 copies of your algorithm in hardware.

A general purpose CPU may be running at a clock speed 4 times as fast as the FPGA, but the CPU has to move data in and out of cache and registers as it moves through the algorithm.  Even with the CPU’s faster clock, a typical algorithm may run as much as 3 times faster on an FPGA (including data transfer time).

But wait, there’s more!  Generally, an algorithm is written in software to take one set of data, process it until a result is achieved, store the result, then move on to the next input.  FPGAs can leverage pipelining: as one set of data progresses through the algorithm logic, you can start on another set before the first result is finished.  So depending on the size of the algorithm and accounting for the branches that it takes (you don’t want the data sets colliding), you could possibly achieve at least another order of magnitude performance increase.
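As a rough back-of-the-envelope check on those numbers, here is a minimal Python sketch.  The function names, the 10-stage pipeline, and the 1000 data sets are made up for illustration, and the pipeline model is idealized: it assumes one new input can enter every clock cycle with no stalls or branches.

```python
def copies_that_fit(total_logic_units, units_per_copy):
    """Upper bound on copies of a design that fit on the chip (ignores I/O overhead)."""
    return total_logic_units // units_per_copy


def cycles_sequential(n_items, stages):
    """Each data set must pass through every stage before the next one starts."""
    return n_items * stages


def cycles_pipelined(n_items, stages):
    """Idealized pipeline: after the first result, one more result finishes per cycle."""
    if n_items == 0:
        return 0
    return stages + (n_items - 1)


if __name__ == "__main__":
    # Replication: a 3000 logic unit algorithm on small and large FPGAs.
    print(copies_that_fit(500_000, 3000))    # 166
    print(copies_that_fit(4_500_000, 3000))  # 1500

    # Pipelining: 1000 data sets through a hypothetical 10-stage algorithm.
    print(cycles_sequential(1000, 10))       # 10000
    print(cycles_pipelined(1000, 10))        # 1009
```

In this idealized model the pipelined version approaches one result per clock cycle, so the speedup approaches the number of stages as the number of data sets grows.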

Comparing an algorithm in a CPU and an FPGA

Let’s take a simple example of adding two numbers or variables: x + y = z.  For the CPU this looks like:

Load value x from memory into register A
Load value y from memory into register B
Perform addition and store result in register A
Move contents of register A back to memory
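
The four steps above can be mimicked with a toy register machine in Python.  This is only an illustration: the instruction names (LOAD, ADD, STORE) and the dictionary-based memory are assumptions for the sketch, not any real CPU’s instruction set.

```python
def run(program, memory):
    """Execute (op, *args) instructions on a toy machine with two registers."""
    registers = {"A": 0, "B": 0}
    for op, *args in program:
        if op == "LOAD":       # LOAD reg, addr: copy memory[addr] into a register
            reg, addr = args
            registers[reg] = memory[addr]
        elif op == "ADD":      # ADD dst, src: dst = dst + src
            dst, src = args
            registers[dst] += registers[src]
        elif op == "STORE":    # STORE reg, addr: copy a register back to memory
            reg, addr = args
            memory[addr] = registers[reg]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory


# x + y = z, written as the four steps listed above.
ADD_PROGRAM = [
    ("LOAD", "A", "x"),
    ("LOAD", "B", "y"),
    ("ADD", "A", "B"),
    ("STORE", "A", "z"),
]

if __name__ == "__main__":
    print(run(ADD_PROGRAM, {"x": 2, "y": 3}))  # {'x': 2, 'y': 3, 'z': 5}
```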

That doesn’t look too bad, but remember, going to memory is a lot slower (something like 4% as fast) than doing things in the CPU itself.  We don’t always have to move values back into memory as the algorithm proceeds, though: values are kept in the registers where possible, since the result from one calculation will likely be used again as the input of another calculation.  The FPGA works a little differently, since essentially everything is like working in the CPU itself, but it’s more complicated to set up.  To create a 1-bit addition I need the logic gates of a full adder:

[Figure: FullAdder, logic gate diagram of a 1-bit full adder]

To calculate the sum of two 32-bit numbers, I need to hook together 32 of these, with the carry out of each bit feeding the carry in of the next.  This looks like a lot more work, but remember, as soon as one calculation is done with this logic it can start on another calculation.  An FPGA designer would have a lot of pieces like this already put together into little packages that can be plugged into the design, so they don’t have to do all of this by hand.
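
To make the adder concrete, here is a minimal software model in Python.  It uses the standard textbook full adder (two XORs, two ANDs, and an OR), which may not match the gate arrangement in the original figure, and it chains 32 of them in the simplest ripple-carry arrangement.  The names `full_adder` and `ripple_add` are chosen for this sketch.

```python
def full_adder(a, b, carry_in):
    """1-bit full adder built only from XOR, AND, and OR on single bits."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_add(x, y, width=32):
    """Add two unsigned integers by chaining `width` full adders.

    Returns (sum truncated to `width` bits, final carry out).
    """
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1              # bit i of x
        b = (y >> i) & 1              # bit i of y
        s, carry = full_adder(a, b, carry)
        result |= s << i              # place the sum bit at position i
    return result, carry


if __name__ == "__main__":
    print(ripple_add(5, 7))            # (12, 0)
    print(ripple_add(0xFFFFFFFF, 1))   # (0, 1): wraps around with a carry out
```

The Python loop visits the bits one after another, but on an FPGA all 32 adders exist at the same time as physical logic, and the only thing that takes time is the carry rippling from bit to bit.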

In an FPGA there are some calculations that we essentially get for free, like multiplying by a power of 2.  In a CPU we have to follow a similar path to the addition – load the registers, perform the multiply, and do something with the result.  In the FPGA it’s just a matter of rewiring the bits so that they are all shifted up by one bit per power of 2.
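
The same shift trick is easy to check in Python.  This is a small sketch with hypothetical function names; the fixed-width version models the fact that a hardware bus has a set number of wires, so bits shifted past the top are lost.

```python
def multiply_by_power_of_two(value, power):
    """Multiply a non-negative integer by 2**power using only a left shift."""
    return value << power


def multiply_fixed_width(value, power, width):
    """Same shift, but keep only the low `width` bits, as a fixed-size bus would."""
    mask = (1 << width) - 1
    return (value << power) & mask


if __name__ == "__main__":
    print(multiply_by_power_of_two(5, 3))   # 40, same as 5 * 2**3
    print(multiply_fixed_width(200, 1, 8))  # 144: 400 does not fit in 8 bits
```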

The limitations of FPGAs

So if FPGAs are so great, why isn’t everyone using them all of the time?  There are several complicating factors:

1) They are definitely more expensive than a few GPU boards.  You won’t want to just plug an FPGA board into your server; you’ll want to look at the FPGA appliances that are available.  They’ve been optimized for blazing I/O throughput and can be tailored to your application.  But that comes at a price.

2) You need to have someone who understands how to write an algorithm in hardware.  The software that comes with FPGA boards makes it pretty easy to build the desired algorithm, but putting all of the logic together does take some expertise.  Industry experts say that a really good FPGA designer can program an algorithm in logic about as fast as you could write the same algorithm in software.  The software provided with the FPGA cards is also a lot easier to use than it was a few years ago, and there are a lot of pre-built modules that you can plug in for needed I/O and calculation functions.

3) The FPGA board usually operates at a clock frequency that is slower than a general purpose CPU – as much as 5 times slower.  But for the right application, you can achieve a massive increase in throughput by pipelining the data.

4) Math-heavy operations (divides and trig functions) will be slower on an FPGA.  Modern CPUs and GPUs have highly optimized ALUs (arithmetic logic units) that would take a lot of space to replicate on an FPGA.  If you’re just going to be doing a lot of floating point calculations, a GPU may be a better option.  But if there is more of a process involved in the algorithm, the FPGA may give you some dramatic performance gains.

Industry applications

HPC installations are utilizing FPGAs to achieve speed improvements over general purpose CPUs of between 33x and 240x in the areas of seismic imaging, proteomics, financial options valuations, dense linear equations, and sparse iterative equations in out-of-the-box type applications.  However, FPGAs can be useful in other areas of the cluster as well.  Due to their highly parallel nature, they can be used in bridges and switches to connect different interface logic and subsystems.  They can also be used in conjunction with CPUs or GPUs to offload portions of a software application, utilizing the strengths of each platform as appropriate.

Feeding the monster

Once you’ve integrated FPGAs into your HPC or HTC cluster, you’ll want to get the most out of your investment.  That means making sure that you have a policy-driven scheduler like the Moab HPC Suite.  For high throughput, you’ll also want to take a look at our Nitro application to deliver all of the inputs that the FPGA can handle.
