For the multiprocessor implementation of the cache-rings algorithm, we divide the dataset along the x-dimension into ``slabs,'' which are independently handled by separate processors. Each processor stores its first RX cache-ring globally, so the data can be accessed by the processor sharing those voxel edges. All subsequent cache-ring loads and stores are done on a local cache-ring, until the final plane of voxels in the slab is reached. At this point, the processor waits until the processor sharing the boundary edges of those voxels indicates that it has finished processing storing that edge information in its global RX cache-ring. In practice, this synchronization doesn't cause any slowdown, because every processor has generally completed processing its first plane of voxels before any processor is ready to process its last plane.
The dimension of the slabs and the number of utilized processors can be selected by the user. We found that choosing slabs with less than four voxels of thickness lead to dramatic slowdowns, as increasing percentages of execution time became necessary for sharing and copying data at the boundaries.
When each processor has completed its portion of the surface, the pieces are serially evaluated by a single processor to determine the offsets of the indices in each partial surface. These offsets are then read in by the processors and, once again proceeding in parallel, each processor adjusts the indices of the triangles it has extracted. Once the indices are consistent, the points and triangles can be moved directly into a single data structure which stores the entire surface.
Performance analysis shows that there is not much delay incurred by the barrier synchronizations described above. The largest performance hit comes from the extra work needed to process boundary voxels. When the entire volume is a single slab (as is the case for single processor execution), there are a minimal number of voxels touching boundaries. With each added processor, the size of the slabs decreases, and the percentage of the voxels which share a boundary rapidly rises. Furthermore, processing voxels which touch the Y and Z boundaries of the model incur a smaller overhead than voxels which are at an X boundary, and it is the number of X boundary voxels which is increasing as we divide the volume into more, ever thinner slabs.