The performance gap between applications that access data haphazardly and applications designed to optimize memory access patterns continues to grow. This trend stems from the inability of memory speeds to keep pace with processor speeds: while processor speeds double every two years, memory speeds double only every eight years. To mitigate this growing differential, memory designers have developed increasingly deep, multi-tiered storage hierarchies. The top tiers (registers and L1 cache) consist of fast storage capable of delivering data to the datapath quickly enough that program execution can continue with little or no delay. The next tiers, deeper (L2) caches and main memory, are farther from the CPU, and accesses to them can be several orders of magnitude slower. When enough L1 misses accumulate, the CPU generally stalls program execution until the requested data arrives from the memory management unit (MMU). If these stalls occur frequently, as can be the case with haphazard data management, program execution can suffer drastic, even crippling, slowdowns. This condition has only been exacerbated by the advent of parallel architectures: now, not only must the data be in L1 cache for execution to continue without delay, but it must be in L1 cache on the right processor. Improving memory performance can no longer be a matter of hacking the code as an afterthought to improve access patterns; memory performance must be treated as a primary factor in algorithm design.
Recognizing this paradigm shift, we have developed a topological isosurface extraction algorithm designed to optimize cache performance. Our algorithm exploits the predictability of memory accesses during Marching Cubes surface construction, matching data structures to these access patterns in order to minimize cache misses. The data structure we have developed, termed a cache-ring, leverages hardware access mechanisms (such as spatial coherence and cache line fetching) to minimize MMU stalls. Upon analysis, we found that our algorithm actually requires more instructions than the comparable edge-hashing algorithm (most of the increase is due to bookkeeping operations, as we discuss later); however, we see moderate performance improvements, as our cache performance more than makes up for this overhead by reducing the number of MMU stalls. As the processor-memory gap widens and parallel processors become increasingly prevalent, we anticipate even greater performance improvements from such memory-conscious algorithms.