Optoelectronic Cache Memory System Architectures
Donald M. Chiarulli & Steven P. Levitan
Departments of Computer Science
and Electrical Engineering
University of Pittsburgh
Pittsburgh, PA 15260
Workshop on Data Encoding for Page Oriented Optical Memories DEPOM '96, Phoenix, AZ, March 1996
We present an investigation of the architecture of an optoelectronic cache which can integrate terabit optical memories with the electronic caches associated with high-performance uni- and multiprocessors. The use of optoelectronic cache memories will enable these terabit technologies to transparently provide low-latency secondary memory with frame sizes comparable to disk pages but with latencies approaching those of electronic secondary cache memories. This will enable the integration of optical memory into the memory hierarchy, thereby supporting terabit address spaces with effective access times comparable to the cycle times of current microprocessors. We present the architecture of an interface to an off-the-shelf desktop computer and simulation results which predict the performance of this system.
In the conventional description of a memory hierarchy, a distinction is made between secondary memory, primary memory, and each level of cache memory. This distinction was originally based on the visibility of the memory relative to a machine language instruction. In this historical context, shown in Figure 1(a), primary, or main, memory was defined by the program address space (e.g., 16 bit addresses) and secondary memory, or backing store, was associated with input and output. Cache memories, to the extent they existed, were invisible, and were first implemented as a buffer between the processor and primary memory. In modern systems, shown in Figure 1(b), caches are implemented routinely and typically exist in multiple levels, with the first level cache integrated into the processor itself. The distinction between primary and secondary memory has been significantly blurred by address segmentation and virtual memory systems. Typically, secondary memory now supports a much larger program address space, parts of which are swapped on demand into a semiconductor RAM primary memory level. In the following discussion, we dispense with the notion of distinct primary and secondary memories. As shown in Figure 1(c), we merge these levels into a single optoelectronic memory at the lowest level of the hierarchy. The processor address space is directly supported in the optical memory. All levels between this optical memory and the processor are transparent to the processor and therefore are referred to as cache levels.
In this discussion we present a realization of the optoelectronic (OE) cache level in the OE memory hierarchy. As shown in Figure 1(c), the cache is in the same position as the primary memory in a conventional hierarchy. However, unlike primary memory, it is transparent to both the processor and the operating system. This level is the interface between the optical memory backing store and the secondary cache associated with the processor. Another distinguishing feature of the OE cache is its line size, which is significantly larger than is typical for primary memory. A memory line (also commonly known as a cache line) is the amount of data transferred between levels of the hierarchy when a memory fault (or equivalently, a cache miss) occurs. Thus, the size of a line at a particular level is a trade-off between the locality supported within the memory traffic and the efficiency with which the cache is utilized. A large cache line more loosely constrains memory access locality. However, large cache lines will also bring into the cache fragments of unused memory. This effect is called internal fragmentation. In a conventional memory the cost associated with internal fragmentation can be significant, since the fault service time is typically linearly related to the line size. However, in the OE cache, the (much larger) line size is determined by the width of the optical memory word. The parallel access characteristics of an optical memory make it possible to transfer cache lines to and from the optical memory in a single access time. This is substantially faster than the equivalent transfer from a magnetic disk, which must allow for both rotational latency and serial transfers. This is a significant advantage. However, it has an effect on the organization of the cache itself, and also impacts the mechanism for address translation and, in multiprocessor systems, coherency issues.
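The cost asymmetry described above can be illustrated with a toy calculation: under a serial (disk-style) transfer, fault service time grows linearly with line size, so the unused words that internal fragmentation drags into the cache carry a real cost, while a parallel (optical) transfer pays one fixed access time regardless of line size. This is only a sketch; the per-word transfer time is a hypothetical parameter, not a figure from the paper.

```python
# Serial vs. parallel fault service time as a function of line size.
# WORD_TIME is a hypothetical serial per-word transfer time; FRAME_TIME
# follows the 1 ms optical access time quoted in the conclusions.

def serial_fault_time(line_words, word_transfer_time):
    """Service time when the line is transferred word by word (disk-like)."""
    return line_words * word_transfer_time

def parallel_fault_time(frame_access_time, line_words):
    """Service time when the whole line moves in one parallel access."""
    return frame_access_time  # independent of line_words

WORD_TIME = 10e-9   # hypothetical 10 ns per word, serial
FRAME_TIME = 1e-3   # 1 ms parallel frame access

# Quadrupling the line quadruples the serial cost but not the parallel one.
assert serial_fault_time(4096, WORD_TIME) == 4 * serial_fault_time(1024, WORD_TIME)
assert parallel_fault_time(FRAME_TIME, 4096) == parallel_fault_time(FRAME_TIME, 1024)
```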
Figure 2 shows a block diagram of a PCI-bus implementation of the OE cache. In this implementation, each of the SRAMs can be accessed either from the optical memory interface, in which an optical memory frame is written in parallel along the horizontal buses, or from the PCI bus, in which a single 32-bit word is accessed along the vertical bus. The cache controller processes each address from the PCI bus and determines whether there is a cache hit; in other words, whether the requested location is present in the cache SRAMs. If a hit occurs, the controller translates the address of the requested location from its location in the processor address space to an address within the OE cache, selected by enabling both the corresponding cache line (column) and the corresponding word offset onto the electronic bus. If the address is not held in the cache, a cache miss occurs. In this case, the cache controller accesses the optical memory in order to load the requested frame.
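The controller's hit/miss logic can be sketched as follows, assuming a direct-mapped organization (the paper does not specify the mapping) with 1 Mbit frames of 32-bit words; the frame count, line count, and all names are illustrative, not taken from the hardware design.

```python
# A minimal sketch of the OE cache controller's address handling:
# split a word address into (tag, line, offset), check the tag store,
# and on a miss load the whole frame in one parallel optical access.

FRAME_WORDS = 32768   # 1 Mbit frame = 32K 32-bit words (assumed geometry)
NUM_LINES = 64        # hypothetical number of SRAM cache lines (columns)

class OECacheController:
    def __init__(self):
        # One tag per cache line; None means the line is empty.
        self.tags = [None] * NUM_LINES

    def split(self, addr):
        """Split a word address into (tag, line index, word offset)."""
        offset = addr % FRAME_WORDS
        frame = addr // FRAME_WORDS        # optical-memory frame number
        return frame // NUM_LINES, frame % NUM_LINES, offset

    def access(self, addr):
        """Return (hit, line, offset); on a miss, make the frame resident."""
        tag, line, offset = self.split(addr)
        hit = self.tags[line] == tag
        if not hit:
            # Cache miss: the whole frame would be fetched from optical
            # memory in a single parallel access; record its tag.
            self.tags[line] = tag
        return hit, line, offset

c = OECacheController()
assert c.access(5)[0] is False   # cold miss loads the frame
assert c.access(5)[0] is True    # same frame is now resident
```

The key point the sketch captures is that the miss path moves an entire frame at once, so the line (column) select and word-offset select are the only per-access electronics on the hit path.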
The average memory latency L̄x at any level x of a memory hierarchy can be calculated as:

L̄x = (1 - px) Lx + px L̄x-1
where px is the fault probability, (1 - px) is the hit probability, and Lx is the memory access time at level x. L̄0 = L0 is the latency associated with the memory at the lowest level of the hierarchy, commonly known as the backing store. In this expression we approximate the miss penalty, at all but the lowest level, by the average latency of the next lower level. This approximation is accurate if we assume that memory banking or other prefetching techniques have been implemented between these levels. At the L0 level, specifically when disk drives are used as the backing store, it is necessary to consider the transfer time of a memory line as part of the latency. In this case, if Ts is the average seek time, Tr is the average rotational latency, and Tx is the transfer rate of a disk-based backing store, the miss latency of a memory line of size nm is:

L0 = Ts + Tr + nm / Tx
Alternatively, when an optical memory is used as the backing store and the entire cache line is transferred in parallel, only To, the access time of the optical memory, needs to be considered:

L0 = To
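The latency model above can be sketched directly in code: the recursion folds the hit time and weighted miss penalty from the backing store upward, with the two backing-store formulas as the base case. The numeric parameters below are illustrative assumptions, except the 1 ms optical access time, which is the figure quoted in the conclusions.

```python
# Average-latency recursion and the two backing-store miss latencies.

def avg_latency(levels, l0):
    """Average latency of a hierarchy.

    levels: (px, Lx) pairs ordered from the level just above the backing
    store up to the level nearest the processor; l0 is the backing-store
    latency, the base case of the recursion.
    """
    lat = l0
    for px, lx in levels:
        lat = (1 - px) * lx + px * lat   # hit time plus weighted miss penalty
    return lat

def disk_l0(ts, tr, nm, tx):
    """Disk miss latency: seek + rotational latency + serial line transfer."""
    return ts + tr + nm / tx

def optical_l0(to):
    """Optical miss latency: one parallel frame access, independent of line size."""
    return to

# Hypothetical disk: 10 ms seek, 4 ms rotation, 1 Mbit line at 100 Mbit/s.
disk = disk_l0(10e-3, 4e-3, 1e6, 100e6)
opt = optical_l0(1e-3)

# The same hypothetical cache levels above both backing stores.
levels = [(1e-4, 100e-9), (0.02, 20e-9)]
print(avg_latency(levels, disk), avg_latency(levels, opt))
```

Because the cache levels are identical in both cases, any difference in the printed averages comes entirely from the backing-store term, which is the comparison the simulations in the next section make.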
To investigate the relative fault rates of an optoelectronic memory hierarchy versus a conventional electronic/magnetic-disk-based hierarchy, we implemented models of two memory systems. Each has a three-level hierarchy. The first version models the behavior of an electronic primary memory at level one with magnetic disk as the backing store. The second version models the behavior of an optoelectronic cache at level one with optical memory as the backing store. The top two levels in both models are electronic primary and secondary cache memories with identical characteristics. Figure 3 is a plot of the average memory latency of three applications running under each of the two models.
For the applications tested, the simulations show a three to four order of magnitude improvement in the performance of the optoelectronic memory system (with 1 msec random access to 1 Mbit pages) versus that of a conventional memory hierarchy with a rotational magnetic backing store. Thus, we have demonstrated that optoelectronic cache memories can be used to effectively interface a low-latency optical backing store to an optoelectronic memory hierarchy. Although line sizes in the cache are typically larger than disk pages, average memory access latency is not adversely affected by the additional internal fragmentation introduced. We are currently investigating the relationship between various competing technologies for the optical memory and the smart pixel array used in the cache. These technology choices must be considered in the context of architectural issues such as the address translation mechanism, frame size at each level of the memory hierarchy, write policy, replacement algorithms, and coherency support mechanisms for multiprocessor implementations.