In implementation, the cache-ring algorithm operates very similarly to the hash table; the primary difference is in the data structures used. It turns out that even with a very fast hashing function, the hash-table method is slowed because the table entry for an edge is rarely still in fast cache when it is accessed by subsequent voxels. Most importantly, when we suffer a cache miss for an edge, we are just as likely to suffer a miss on lookups of subsequent edges in the same direction. This is not the case with the cache-ring algorithm, because on a cache miss the memory system brings in an entire cache line of data rather than just the single requested item. This standard technique is generally effective because of the principle of data locality: if a memory address is being accessed now, its neighbors are likely to be accessed in the very near future. We turn this mechanism to our advantage with cache-rings, since we essentially prefetch most of our data.
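To make the access pattern concrete, here is a minimal sketch of what we mean by a cache-ring: a fixed-length circular buffer of vertex indices that is walked strictly sequentially, so a miss on one slot pulls its neighbors into the same cache line. The struct layout and names below are our own illustration, not the original implementation.

```c
#include <stddef.h>

/* A cache-ring: contiguous circular storage of vertex indices,
 * advanced by exactly one slot per voxel.  Sequential access means
 * the cache line fetched on a miss prefetches the next lookups. */
typedef struct {
    int   *slot;  /* contiguous storage, one entry per tracked edge */
    size_t len;   /* ring length                                    */
    size_t head;  /* current position; advances one slot per voxel  */
} CacheRing;

/* Store a new vertex index at the head and return the index that was
 * written there one full cycle ago. */
int ring_exchange(CacheRing *r, int new_index)
{
    int old = r->slot[r->head];
    r->slot[r->head] = new_index;
    r->head = (r->head + 1) % r->len;
    return old;
}
```

Because the buffer is contiguous and traversal order is fixed, no hashing or searching is ever needed: the entry a voxel wants is always at the current head.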
In practice, we did not see as much speed-up as we had anticipated. This turned out to be because we were evicting our own data from the cache in trying to keep six rings resident at the same time. Realizing that all of the necessary data is stored redundantly across multiple rings, we can begin carefully merging some of them.
The first improvement comes from recognizing that ring r[i,j-1,k] is exactly the same length as ring r[i,j-1,k-1], and that the data is the same, just shifted over by one. To avoid colliding with ourselves in the cache, we combine these lists by zippering them together into a single ring. The same process is applied to rings r[i-1,j,k] and r[i-1,j,k-1]. The second improvement comes from removing r[i,j,k-1] entirely, since this ring has only one element: its current and next values are instead stored in global variables.
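The zippering step can be sketched as a simple interleave: entries from the two same-length rings alternate in a single buffer, so the pair of values a voxel consumes together sits side by side in memory. This is a schematic of our own (the actual layout may differ in detail):

```c
#include <stddef.h>

/* Zipper two equal-length rings into one: entries of ring a take the
 * even slots and entries of ring b the odd slots, so values that are
 * accessed together share a cache line instead of competing for it. */
void zipper_rings(const int *a, const int *b, int *merged, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        merged[2 * i]     = a[i];
        merged[2 * i + 1] = b[i];
    }
}
```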
This leaves us with three cache-rings: r[i,j-1,k], r[i-1,j,k], and r[i-1,j-1,k]. Our final optimization is to eliminate the last ring, r[i-1,j-1,k], by combining it with r[i-1,j,k]. This operation introduces a complication, though, because ring r[i-1,j-1,k] is nz voxels longer than ring r[i,j-1,k]. We resolve this by initially storing the r[i-1,j-1,k] data in ring r[i,j-1,k] and then copying it into r[i-1,j,k] once the first ring has cycled. Our final implementation is therefore left with two cache-rings: r[i-1,j,k] and r[i,j-1,k]. For simplicity, we will refer to these as RX and RY, respectively.
To visualize how these data structures serve the algorithm, we will follow through a short example. Suppose we are extracting an isosurface of value v, and some node n[i,j,k] (see Figure 1) has value (v+.1) while node n[i+1,j,k] has value (v-.1). Then the triangulated, linearly interpolated isosurface S extracted from the volume will contain the point p, located at the midpoint of the edge between node n[i,j,k] and node n[i+1,j,k]. This edge is shared by four voxels. Specifically:
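The interpolation in this example reduces to the standard edge-crossing formula. A one-line helper of our own (not from the paper's code) makes the arithmetic explicit:

```c
/* Parameter t of the isovalue crossing along an edge whose endpoint
 * values a and b straddle iso: t = (iso - a) / (b - a).
 * With a = v + 0.1 and b = v - 0.1 this yields t = 0.5, i.e. the
 * midpoint described in the text. */
double edge_crossing_t(double a, double b, double iso)
{
    return (iso - a) / (b - a);
}
```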
This intersection point is first encountered from voxel v[i-1,j-1,k]. Its location along the edge is computed, and the index of this new point is placed in rings RX and RY. The algorithm then continues processing cells. When it arrives at voxel v[i-1,j,k], it finds that it has an intersection on the (-y,+z) edge and blindly grabs the index of point p from cache-ring RY, knowing that the leading entry must be this intersection node's index. After all the triangles have been generated for voxel v[i-1,j,k], the intersection point is again inserted into cache-ring RX, so that it can be referenced from voxel v[i,j,k].
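The reuse pattern in this walkthrough can be sketched with a tiny FIFO standing in for a cache-ring: the voxel takes the leading entry of RY without any search, uses it for triangulation, then pushes the index into RX for the voxel one step ahead in x. This is a schematic of our own, not the paper's code:

```c
#include <stddef.h>

/* Minimal FIFO standing in for one cache-ring. */
typedef struct {
    int    slot[64];
    size_t head, tail;   /* pop at head, push at tail */
} Fifo;

void fifo_push(Fifo *f, int v) { f->slot[f->tail++ % 64] = v; }
int  fifo_pop (Fifo *f)        { return f->slot[f->head++ % 64]; }

/* One voxel's worth of the reuse pattern from the walkthrough: take
 * the leading RY entry (guaranteed to be this voxel's shared
 * intersection index), emit triangles with it, then stash the index
 * in RX so the next voxel along x can find it the same way. */
int reuse_shared_vertex(Fifo *ry, Fifo *rx)
{
    int p = fifo_pop(ry);   /* leading entry is the shared vertex   */
    fifo_push(rx, p);       /* make it available to the next voxel  */
    return p;
}
```

The key property is that no lookup ever fails or searches: the traversal order guarantees the needed index is always at the front of the appropriate ring.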
The only special cases in this implementation arise when we encounter non-interior (exterior) edges. As defined above, these are edges that lie on boundary faces of the volume. When examining such a cell, we must take care not to assume that the usual edge intersections have been precomputed and cached.