Scalability

Table of Contents

Overview
Basic Decomposition Modes
Advanced Scalability Features


Overview

In addition to the traditional multi-pipe execution, where one graphics card drives one or more display devices, Equalizer offers a broad range of decomposition modes to accelerate the rendering of large data sets by parallelizing the rendering of a single view across multiple graphics cards, processors and computers.

To this end, multiple graphics cards, processors and memory can be combined to render one OpenGL view. The Equalizer framework distributes the rendering task across a number of rendering units (decomposition), collects the results and recombines them on the final view (recomposition). The detailed design and configuration of compounds are described in a technical specification.

All scalability features apply to graphics clusters and multi-GPU systems in the same way. Multi-GPU systems have the advantage of shared memory, which eliminates the need to transfer pixel data across a network. On the other hand, the number of GPUs per system is limited.

The achievable performance depends on a number of factors. The I/O bandwidth of the graphics cards, the cluster interconnect and the decomposition mode used define the upper limit of the achievable frame rate. The application characteristics and application-specific optimizations determine how close to this limit the performance can be pushed.


Multi-GPU Systems

Multi-GPU systems, such as dual-SLI or quad-SLI workstations or the Apple Mac Pro, are becoming more commonplace. Equalizer provides a natural framework to fully exploit the parallelism of such hardware. Equalizer applications can make optimal use of the graphics cards as well as the multiple CPU cores typically present in multi-GPU workstations. Application-transparent solutions such as SLI or Crossfire provide less scalability, since the application's rendering is still single-threaded and not optimized for the individual cards.

Example 1: HP xw9300 workstation

Performance for a dual-GPU HP workstation

The benchmark in the figure shows the performance benefit of Equalizer compared to the default application-transparent SLI mode. The first configuration measures the baseline performance using only a single GPU. The second uses the same Equalizer configuration, but SLI transparently distributes the rendering across the two GPUs. The two remaining configurations use two rendering threads, one with a database (sort-last) decomposition and one with a screen-space (sort-first) decomposition. Like SLI mode, these decompositions scale the rendering performance of a single view, but they use two processor cores for rendering to optimize the data sent to each GPU. The pixel transfer does not use the SLI hardware; it is executed through main memory, the slowest possible path. The benchmark uses the stock Equalizer polygon rendering application with a data set of 12.6 million triangles.

Example 2: Apple Mac Pro

Performance for a dual-GPU Mac Pro

The performance speedup of a dual-GPU Mac Pro is similar to that of the HP workstation. In addition to a medium-sized model, the benchmark also shows a small model, which naturally does not scale. The test machine is a quad-core Mac Pro with 12 GB memory and two ATI x1900 graphics cards.


Graphics Clusters

Graphics clusters are virtually unlimited in the number of possible processors and graphics cards. Software scalability on such clusters is a relatively new field compared to compute clusters. Equalizer is pushing the boundaries of what is possible by bringing more applications to this environment.

Example 1: Volume Rendering

Volume rendering is a prime candidate for scalable rendering. A database decomposition is easily done by bricking the volume, and the recomposition uses less pixel data, since even a database recomposition needs only RGBA data. Furthermore, the decomposition is easy to load-balance and scales all aspects of the rendering pipeline well.

Scalability of a 512³ volume data set.

Database (sort-last) volume rendering makes it possible to visualize data sets which do not fit on a single GPU, since each graphics card only needs to render a sub-volume of the whole data set. The benchmark shows this effect clearly. With up to four rendering nodes, the volume brick does not fit on the GPU. Beyond that, the brick gets small enough to be completely cached on the GPU, and performance immediately jumps by an order of magnitude. The screen-space decomposition always has to hold the full volume texture and can therefore only scale sub-linearly.

Scalability of a 256³ volume data set.

The same data set at a lower resolution always fits into GPU memory. The benchmark clearly shows the scalability limits of the graphics cluster used. The readback, transfer and compositing pipeline limits the performance to 25 fps at a resolution of 1280x1024. With a database decomposition, each node has to handle twice the amount of data, which limits the performance to 12.5 fps.
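
The relation between the two limits can be checked with a little arithmetic. The following is only a sketch: it assumes 4 bytes per pixel and treats the 25 fps figure as the sole measured input. The function names and the derived throughput value are illustrative and are not measurements from the benchmark.

```python
def pipeline_throughput(width, height, fps, bytes_per_pixel=4):
    """Bytes per second implied by moving one full frame at the given rate."""
    return width * height * bytes_per_pixel * fps


def frame_rate_ceiling(throughput, width, height, bytes_per_pixel=4, data_factor=1.0):
    """Frame rate reachable when each frame needs data_factor times the pixel data."""
    return throughput / (width * height * bytes_per_pixel * data_factor)


# 25 fps at 1280x1024 is the screen-space limit reported in the text.
throughput = pipeline_throughput(1280, 1024, 25)

# Same pipeline, same pixel volume per frame.
fps_2d = frame_rate_ceiling(throughput, 1280, 1024)

# Database decomposition: twice the data per node (assumed factor of 2).
fps_db = frame_rate_ceiling(throughput, 1280, 1024, data_factor=2.0)
```

Under these assumptions, doubling the data handled per frame halves the frame rate ceiling, which matches the 25 fps and 12.5 fps figures quoted above.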

This performance limitation is mostly a hardware bottleneck, caused primarily by the slow interconnect on the cluster. Frame rates of up to 60 Hz are possible with a properly tuned system.

Example 2: Polygonal Rendering

Scalability of a polygonal data set.

Polygonal data sets have the disadvantage that the database recomposition is twice as expensive, since both color and depth information are processed. Furthermore, load balancing is harder than for volume rendering, since the data is less uniform. The benchmark results therefore show that the hardware limits the rendering to 7.25 fps, roughly half of the volume rendering performance. Screen-space decomposition again loses performance because the whole model has to be loaded on each node. This polygonal rendering benchmark is much less fill-bound than the volume rendering benchmark.


Basic Decomposition Modes

Equalizer uses a compound tree to configure the parallelization of the rendering tasks. The compound tree allows a flexible configuration of the task decomposition. The Equalizer distribution contains numerous configuration files for the implemented feature set. The following parallel rendering algorithms can be configured:

A 2D compound
A sort-first compound

2D or sort-first decomposes the rendering in screen space, that is, each contributing rendering unit processes a tile of the final view. Equalizer simply recomposes the tiles side by side on the destination view. This mode has a low, constant I/O overhead for the image transfers and can provide good scalability when combined with view frustum culling. Depending on the application data structure, the overlap of some primitives between individual tiles limits the scalability of this mode, typically to around eight graphics cards.
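
As a rough illustration of the screen-space split, the sketch below divides a view into side-by-side tiles and converts them to pixel rectangles. It is not Equalizer's API or configuration format: the functions `split_viewport` and `to_pixels`, the vertical-strip layout and the (x, y, w, h) convention are all assumptions made for this example.

```python
from fractions import Fraction
from math import floor


def split_viewport(n_units):
    """Return n_units side-by-side tiles as fractional (x, y, w, h) viewports."""
    if n_units < 1:
        raise ValueError("need at least one rendering unit")
    width = Fraction(1, n_units)
    return [(i * width, Fraction(0), width, Fraction(1)) for i in range(n_units)]


def to_pixels(tile, view_width, view_height):
    """Convert a fractional tile to an integer pixel rectangle (x, y, w, h).

    Both edges are floored, so neighbouring tiles share an edge exactly and
    the tiles cover the view without gaps or overlap.
    """
    x, y, w, h = tile
    left, right = floor(x * view_width), floor((x + w) * view_width)
    bottom, top = floor(y * view_height), floor((y + h) * view_height)
    return (left, bottom, right - left, top - bottom)


# Three rendering units sharing one 1280x1024 destination view.
tiles = split_viewport(3)
rects = [to_pixels(t, 1280, 1024) for t in tiles]
```

Exact fractions are used so that tile edges do not drift through floating-point rounding. Recomposition in this model is just copying each rectangle to its place in the destination view.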

A DB compound
A sort-last compound

DB or sort-last decomposes the rendered database so that all rendering units process a part of the scene in parallel. For polygonal data, the depth buffer information is used to composite the individual images correctly. Volume rendering applications use ordered alpha-based blending to composite the result image. This mode provides very good scalability, but its I/O requirements increase linearly. The increasing recomposition work can be addressed by using parallel compositing, as described below.
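
The two recomposition rules can be sketched in a few lines. This is a simplified model, not Equalizer code: images are plain lists of pixels, a smaller depth value is assumed to be nearer to the viewer, and RGBA colors are assumed to be premultiplied by alpha. The function names are hypothetical.

```python
def depth_composite(images):
    """Merge images of (color, depth) pixels, keeping the nearest fragment."""
    composite = list(images[0])
    for image in images[1:]:
        for i, (color, depth) in enumerate(image):
            if depth < composite[i][1]:  # assumption: smaller depth is nearer
                composite[i] = (color, depth)
    return composite


def blend_front_to_back(layers):
    """Blend premultiplied RGBA pixels, ordered nearest first, with 'over'."""
    r = g = b = a = 0.0
    for lr, lg, lb, la in layers:
        remaining = 1.0 - a  # how much of this layer is still visible
        r += remaining * lr
        g += remaining * lg
        b += remaining * lb
        a += remaining * la
    return (r, g, b, a)


# Polygonal case: two units rendered different parts of the scene.
unit_a = [("red", 0.2), ("red", 0.9)]
unit_b = [("blue", 0.5), ("blue", 0.5)]
merged = depth_composite([unit_a, unit_b])

# Volume case: a half-transparent red brick in front of an opaque blue one.
front_first = blend_front_to_back([(0.5, 0.0, 0.0, 0.5), (0.0, 0.0, 1.0, 1.0)])
wrong_order = blend_front_to_back([(0.0, 0.0, 1.0, 1.0), (0.5, 0.0, 0.0, 0.5)])
```

Depth compositing is order-independent, while the blended result changes when the layers are swapped. This is why the volume rendering case needs an ordered blend.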

An Eye compound
A stereo compound

Eye decomposition is used during stereo rendering. The individual views for each eye are assigned to different rendering units and later copied to the appropriate stereo buffer. This mode supports a variety of stereo modes, including active (quad-buffered) stereo, anaglyphic stereo and auto-stereo displays with multiple eye passes. Due to the frame consistency between the eye views, this mode scales very well up to the number of eye passes. It is often combined with another mode on large-scale visualization clusters.

A pixel compound

Pixel compounds are similar to 2D compounds, but the frusta of the source rendering units are modified so that each unit renders an evenly distributed subset of pixels. Pixel compounds are ideal for purely fill-limited applications such as volume rendering and raytracing. The load is always distributed almost equally, so the fill rate scales nearly linearly.
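
The load-balancing property can be illustrated with a toy cost model. The sketch assumes that pixels are distributed by interleaving columns and that each column has a known fill cost; both the interleaving pattern and the cost values are assumptions for illustration, not a description of how Equalizer modifies the frusta.

```python
def interleaved_columns(unit, n_units, width):
    """Columns for one unit when pixels are distributed evenly (pixel style)."""
    return list(range(unit, width, n_units))


def contiguous_columns(unit, n_units, width):
    """Columns for one unit when the view is cut into adjacent tiles (2D style)."""
    start = unit * width // n_units
    end = (unit + 1) * width // n_units
    return list(range(start, end))


def loads(column_cost, n_units, assign):
    """Total fill cost per unit for a given column assignment function."""
    width = len(column_cost)
    return [
        sum(column_cost[c] for c in assign(u, n_units, width))
        for u in range(n_units)
    ]


# Hypothetical view: the left half is empty, the right half is expensive.
column_cost = [0] * 8 + [10] * 8
pixel_loads = loads(column_cost, 2, interleaved_columns)
tile_loads = loads(column_cost, 2, contiguous_columns)
```

With the interleaved assignment both units receive the same fill cost, whereas the contiguous tiles leave one unit idle. Note that this only balances fill work; every unit still processes all geometry.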

A DPlex compound

DPlex time-multiplexes multiple rendering resources to produce a steady stream of output images. The rendered images are simply copied to the destination view. This mode provides very good scalability but introduces latency into the rendering. Depending on the application, the additional latency is compensated by the higher frame rate. The AFR (Alternate Frame Rendering) mode of multi-GPU systems implements this feature in hardware. This feature is not yet implemented.
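
The trade-off between frame rate and latency can be seen in a small scheduling sketch. It assumes that every frame takes the same time to render, that frames are handed out round-robin, and that copying the image to the destination is free. The function name and the numbers are hypothetical.

```python
def dplex_schedule(n_frames, n_units, render_time):
    """Return (frame, unit, start, finish) tuples for round-robin rendering.

    Frames are started at evenly staggered intervals so that finished images
    arrive as a steady stream. Times are in milliseconds.
    """
    interval = render_time / n_units  # time between two output images
    schedule = []
    for frame in range(n_frames):
        unit = frame % n_units
        start = frame * interval
        schedule.append((frame, unit, start, start + render_time))
    return schedule


# Three units, each needing 90 ms per frame.
schedule = dplex_schedule(n_frames=6, n_units=3, render_time=90.0)
finish_times = [finish for _, _, _, finish in schedule]
```

In this model a new image is completed every 30 ms instead of every 90 ms, but each image is still displayed 90 ms after its rendering started, which corresponds to three output frames of latency.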


Advanced Scalability Features

Equalizer implements a number of additional features to better utilize the rendering resources with the basic decomposition modes described in the previous section. The following gives a short overview of some of the possible optimizations.

Multi-Level Compounds
A multilevel compound

The basic decomposition schemes described above can be combined freely to address different bottlenecks within the rendering process. For example, an Eye compound can be combined with a 2D compound to first decompose the rendering into the left and right eye passes, and then further decompose the rendering of each eye pass. This combines the excellent scalability of a stereo decomposition with the ability to use four graphics cards to render a single stereo view.
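
The eye-plus-2D example can be written down as a nested loop that produces one leaf task per rendering unit. This is a sketch of the idea only, not a compound tree in Equalizer's configuration format. The function name, the unit names and the (x, y, w, h) viewport convention are assumptions.

```python
def multilevel_tasks(eyes, tiles_per_eye, units):
    """Combine an eye decomposition with a 2D split of each eye pass.

    Returns one (unit, eye, viewport) tuple per leaf task, where viewport is
    a fractional (x, y, w, h) rectangle of the destination view.
    """
    if len(units) != len(eyes) * tiles_per_eye:
        raise ValueError("one rendering unit is needed per leaf task")
    tasks = []
    unit_iter = iter(units)
    for eye in eyes:  # first level: one branch per eye
        for i in range(tiles_per_eye):  # second level: tiles within the eye pass
            viewport = (i / tiles_per_eye, 0.0, 1 / tiles_per_eye, 1.0)
            tasks.append((next(unit_iter), eye, viewport))
    return tasks


# Four graphics cards rendering a single stereo view.
tasks = multilevel_tasks(["left", "right"], 2, ["gpu0", "gpu1", "gpu2", "gpu3"])
```

Each eye pass is rendered by two units, each covering half of the view, so all four cards contribute to the same stereo frame.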

Direct Send Parallel Compositing
Data flow for direct send compositing

Because the amount of pixel data increases linearly with the number of rendering units during DB/sort-last rendering, it becomes necessary to parallelize the compositing. The compound implementation of Equalizer supports various parallel sort-last compositing algorithms, including, but not limited to, direct send and binary swap. Such algorithms keep the compositing cost per unit constant by parallelizing the work across all rendering units. The example shows a three-node direct-send composition. Each rendering unit does a z-based composition of one tile and sends the composited result to the destination channel.
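
A minimal model of the direct-send data flow is sketched below. It assumes one-dimensional images of (color, depth) pixels, equal-sized tiles and a smaller-is-nearer depth convention. Network transfers are represented by list slicing, and the function name and sample data are hypothetical.

```python
def direct_send(partials):
    """Composite per-unit (color, depth) images with a direct-send pattern.

    Unit k owns tile k: it receives that tile from every unit, z-composites
    the pieces and contributes the result to the final image.
    Returns the final image and the number of fragments each owner handled.
    """
    n = len(partials)
    width = len(partials[0])
    bounds = [width * k // n for k in range(n + 1)]
    final = [None] * width
    work = []
    for owner in range(n):
        lo, hi = bounds[owner], bounds[owner + 1]
        received = [image[lo:hi] for image in partials]  # the "send" step
        tile = received[0]
        for part in received[1:]:
            for i, fragment in enumerate(part):
                if fragment[1] < tile[i][1]:  # assumption: smaller depth is nearer
                    tile[i] = fragment
        final[lo:hi] = tile  # composited tile goes to the destination
        work.append(n * (hi - lo))
    return final, work


# Three units, each with a full-width image of its part of the database.
partials = [
    [("a", 0.1), ("a", 0.8), ("a", 0.7), ("a", 0.9), ("a", 0.2), ("a", 0.6)],
    [("b", 0.5), ("b", 0.3), ("b", 0.9), ("b", 0.4), ("b", 0.7), ("b", 0.8)],
    [("c", 0.9), ("c", 0.6), ("c", 0.2), ("c", 0.5), ("c", 0.3), ("c", 0.1)],
]
final, work = direct_send(partials)
colors = [color for color, _ in final]
```

Each owner handles as many fragments as one full image contains, independent of how many units take part. This is the sense in which the per-unit compositing cost stays constant.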

Synchronous vs. asynchronous execution
Performance for asynchronous execution

Equalizer schedules the rendering resources asynchronously. Resources which are independent of others start rendering the next frame immediately after they finish a frame. By hiding the unavoidable imbalances in the task decomposition, this execution mode provides much better resource utilization than the traditional, frame-synchronous approach. The figure shows the performance of a sort-last direct-send compound on five nodes using 1 Gbit Ethernet for latencies from 0 (synchronous execution) to 6 frames. Allowing one or two frames of latency increases the performance by 15 percent in this case.
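
Why a small latency hides load imbalance can be shown with a toy timing model. The model is an assumption made for illustration: a unit may start frame f once it has finished its own previous frame and frame f - latency - 1 has been finished by all units. Compositing and transfer costs are ignored, and the render times are invented, so the resulting numbers are not comparable to the measured 15 percent.

```python
def total_time(render_times, latency):
    """Time to finish all frames for per-unit, per-frame render times.

    render_times[u][f] is the time unit u needs for frame f.
    latency 0 models frame-synchronous execution.
    """
    n_frames = len(render_times[0])
    finish = [[0] * n_frames for _ in render_times]
    done = []  # done[f]: time at which every unit has finished frame f
    for f in range(n_frames):
        gate_frame = f - latency - 1
        gate = done[gate_frame] if gate_frame >= 0 else 0
        for u, times in enumerate(render_times):
            previous = finish[u][f - 1] if f > 0 else 0
            finish[u][f] = max(previous, gate) + times[f]
        done.append(max(row[f] for row in finish))
    return done[-1]


# Two units whose load alternates between heavy and light frames.
render_times = [
    [3, 1, 3, 1],
    [1, 3, 1, 3],
]
synchronous = total_time(render_times, latency=0)
one_frame_latency = total_time(render_times, latency=1)
```

In the synchronous case every frame costs as much as its slowest unit. With one frame of latency, a unit that finishes early moves on to the next frame, so heavy and light frames overlap.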

2D Load-Balancing
Load balancing a sort-first compound

The Equalizer server can dynamically adjust the decomposition parameters based on the current workload distribution. This load balancing can happen automatically (based on time measurements) or programmatically (based on application input). This feature is not yet implemented.
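
Since the feature is described as not yet implemented, the following is purely a hypothetical sketch of what a time-based adjustment for a sort-first split could look like. It assumes vertical tiles, positive time measurements from the previous frame, and a rendering cost that is spread uniformly within each old tile. The function name `rebalance` is invented for this example.

```python
def rebalance(widths, times):
    """Compute new tile widths so that every unit gets the same estimated time.

    widths: current fractional tile widths, left to right, summing to 1.
    times:  measured render time of each tile in the last frame.
    """
    if len(widths) != len(times) or any(t <= 0 for t in times):
        raise ValueError("need one positive time measurement per tile")
    n = len(widths)
    target = sum(times) / n  # equal share of the total work
    new_widths = []
    i = 0
    left_w, left_t = widths[0], times[0]  # unassigned rest of old tile i
    for _ in range(n - 1):
        need, width = target, 0.0
        # Consume whole old tiles while they fit into this unit's share.
        while need > left_t and i < n - 1:
            width += left_w
            need -= left_t
            i += 1
            left_w, left_t = widths[i], times[i]
        # Take the fraction of the current old tile that fills the share.
        take = left_w * need / left_t
        width += take
        left_w -= take
        left_t -= need
        new_widths.append(width)
    new_widths.append(1.0 - sum(new_widths))  # last tile gets the remainder
    return new_widths


# The left half of the view took three times as long as the right half.
two_units = rebalance([0.5, 0.5], [30.0, 10.0])
three_units = rebalance([0.25, 0.25, 0.5], [5.0, 5.0, 20.0])
```

In the two-unit example the slow left tile shrinks to a third of the view. A real implementation would also need to damp the adjustment, since the cost distribution changes from frame to frame.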

Advanced Load-Balancing
Advanced load balancing

Advanced load balancing uses off-screen rendering buffers on the same graphics cards which display the final result. These buffers contribute to the rendering of other output channels, enabling optimal usage of all rendering units on a multi-GPU machine. This feature is not yet implemented.

3D models courtesy of Cyberware, Stanford University Computer Graphics Laboratory, Stereolithography Archive at Clemson University and AVS, USA.