A system and method avoid “livelock” and “starvation” among two or more input/output (I/O) devices of a symmetrical multiprocessor (SMP) computer system competing for the same data. The SMP computer system includes a plurality of interconnected processors, one or more memories that are shared by the processors, and a plurality of I/O bridges to which the I/O devices are coupled. A cache coherency protocol is executed by the I/O bridges, which requires the I/O bridges to obtain “exclusive” (not shared) ownership of all data stored by the bridges. In response to a request for data currently stored by an I/O bridge, the bridge first copies at least a portion of that data to a non-coherent buffer before invalidating the data. The bridge then takes the largest amount of the data saved in its non-coherent buffer that it knows to be coherent, releases only that known coherent amount to the I/O device, and then discards all of the saved data.
1. A method for avoiding livelock among two or more input/output (I/O) devices of a symmetrical multiprocessor computer system comprising a plurality of interconnected processors, one or more shared memories coupled to the processors, and at least one I/O bridge in communicating relationship with the two or more I/O devices, the processors and the one or more shared memories, the method comprising the steps of:
providing at least one coherent buffer and at least one non-coherent buffer at the I/O bridge, the non-coherent buffer coupled to the at least one coherent buffer and to at least one of the I/O devices; receiving a request from a first I/O device coupled to the I/O bridge for information; storing the device requested information in the coherent buffer of the I/O bridge; receiving a system message at the I/O bridge requesting the information stored in the coherent buffer, the system message originating from other than the first I/O device; copying at least a portion of the stored information to the non-coherent buffer; invalidating the stored information within the coherent buffer; and supplying to the first I/O device at least some of the stored information copied into the non-coherent buffer. 2. The method of the first I/O device is coupled to the non-coherent buffer by an I/O bus having a bus cycle specifying a predetermined number of bits per I/O bus cycle, and the stored information supplied to the first I/O device from the non-coherent buffer is the predetermined number of bits of one bus cycle. 3. The method of receiving a second request at the I/O bridge from the first I/O device requesting information; determining whether the information of the second request is stored in the coherent buffer; and if the information of the second request is stored in the coherent buffer, supplying at least some of the information to the first I/O device. 4. The method of if the information of the second request is not stored in the coherent buffer, determining whether the information of the second request is stored in the non-coherent buffer; and if the information of the second request is stored in the non-coherent buffer, supplying the predetermined number of bits of one bus cycle of the information to the first I/O device. 5. 
The method of granting the I/O bridge exclusive ownership relative to the plurality of processors and the other I/O bridges of the computer system over the information stored by the I/O bridge; and following the step of invalidating, generating an acknowledgement confirming that the stored information has been invalidated by the I/O bridge. 6. The method of organizing information stored in the one or more shared memories of the computer system into respective cache lines; and providing one or more cache coherency directories, the one or more cache coherency directories configured to store an ownership status for each cache line, wherein the system message requesting information originates from one or more of the directories and the acknowledgement is sent to one or more of the directories. 7. An input/output (I/O) bridge for use in a distributed shared memory computer system comprising a plurality of interconnected processors and one or more shared memories that are coupled to the processors, the I/O bridge configured to provide intercommunication between one or more I/O devices and the plurality of processors or shared memories, the I/O bridge comprising:
at least one coherent buffer configured to store information requested by a first I/O device coupled to the I/O bridge; at least one non-coherent buffer coupled to the coherent buffer and to the one or more I/O devices; and a controller coupled to the coherent buffer and the non-coherent buffer, the controller configured to:
store at least a portion of the information stored in the coherent buffer in the non-coherent buffer in response to receiving a system message originating from other than the first I/O device requesting the information stored in the coherent buffer, invalidate the information within the coherent buffer, and supply to the first I/O device at least some of the information copied into the non-coherent buffer. 8. The I/O bridge of the first I/O device is coupled to the non-coherent buffer by an I/O bus having a bus cycle specifying a predetermined number of bits per I/O bus cycle, and the information supplied to the first I/O device from the non-coherent buffer is the predetermined number of bits of one bus cycle.
[0001] This patent application is related to the following co-pending, commonly owned U.S. Patent Applications, all of which were filed on even date with the within application for United States Patent and are each hereby incorporated by reference in their entirety: [0002] U.S. patent application Ser. No. (15311-2281) entitled ADAPTIVE DATA PREFETCH PREDICTION ALGORITHM; [0003] U.S. patent application Ser. No. (15311-282) entitled UNIQUE METHOD OF REDUCING LOSSES IN CIRCUITS USING V2 PWM CONTROL; [0004] U.S. patent application Ser. No. (15311-283) entitled IO SPEED AND LENGTH PROGRAMMABLE WITH BUS POPULATION; [0005] U.S. patent application Ser. No. (15311-284) entitled PARTITION FORMATION USING MICROPROCESSORS IN A MULTIPROCESSOR COMPUTER SYSTEM; [0006] U.S. patent application Ser. No. (15311-285) entitled SYSTEM AND METHOD FOR USING FUNCTION NUMBERS TO INCREASE THE COUNT OF OUTSTANDING SPLIT TRANSACTIONS; [0007] U.S. patent application Ser. No. (15311-287) entitled ONLINE ADD/REMOVAL OF SERVER MANAGEMENT INFRASTRUCTURE; [0008] U.S. patent application Ser. No. (15311-288) entitled AUTOMATED BACKPLANE CABLE CONNECTION IDENTIFICATION SYSTEM AND METHOD; [0009] U.S. patent application Ser. No. (15311-289) entitled AUTOMATED BACKPLANE CABLE CONNECTION IDENTIFICATION SYSTEM AND METHOD; [0010] U.S. patent application Ser. No. (15311-290) entitled CLOCK FORWARD INITIALIZATION AND RESET SIGNALING TECHNIQUE; [0011] U.S. patent application Ser. No. (15311-292) entitled PASSIVE RELEASE AVOIDANCE TECHNIQUE; [0012] U.S. patent application Ser. No. (15311-293) entitled COHERENT TRANSLATION LOOK-ASIDE BUFFER; [0013] U.S. patent application Ser. No. (15311-294) entitled DETERMINISTIC HARDWARE BEHAVIOR BETWEEN MULTIPLE ASYNCHRONOUS CLOCK DOMAINS THROUGH THE NOVEL USE OF A PLL; and [0014] U.S. patent application Ser. No. (15311-306) entitled VIRTUAL TIME OF YEAR CLOCK. [0015] 1. 
Field of the Invention [0016] This invention relates to computer architectures and, more specifically, to distributed, shared memory multiprocessor computer systems. [0017] 2. Background Information [0018] Distributed shared memory computer systems, such as symmetric multiprocessor (SMP) systems support high-performance application processing. Conventional SMP systems include a plurality of processors coupled together by a bus. One characteristic of SMP systems is that memory space is typically shared among all of the processors. That is, each processor accesses programs in the shared memory, and processors communicate with each other via that memory (e.g., through messages and status information left in shared address spaces). In some SMP systems, the processors may also be able to exchange signals directly. One or more operating systems are typically stored in the shared memory. These operating systems control the distribution of processes or threads among the various processors. The operating system kernels may execute on any processor, and may even execute in parallel. By allowing many different processors to execute different processes or threads simultaneously, the execution speed of a given application may be greatly increased. [0019] [0020] The cache memories 114 [0021] In general, cache coherency protocols cause other processors to be notified when an update (e.g., a write) is about to take place at some processor's cache. Other processors, to the extent they also have copies of this same data in their caches, may then invalidate their copies of the data. The write is typically broadcast to the processors which then update the copies of the data in their local caches. Protocols or algorithms, some of which may be relatively complex, are often used to determine which entries in a cache should be overwritten when more data than can be stored in the cache is received. [0022] I/O bridge 108 may also include one or more cache memories (not shown) of its own. 
The bridge cache is used to store data received via system bus 104 from memory 106 and/or the processor caches 114 that is intended for one or more of the I/O devices 112. That is, bridge 108 forwards the data from its cache onto one or more of the I/O busses 110. Data may also be received by an I/O device 112 and stored at the bridge cache before being driven onto system bus 104 for receipt by a processor 102 or memory 106. Generally, the data stored in the cache of I/O bridge 108 is not coherent with the system 110. In small computer systems, it is reasonable for an I/O bridge not to maintain cache coherence for read transactions because those transactions (fetching data from the cache coherent domain) are implicitly ordered and the data is consumed immediately by the device. However, in large computer systems with distributed memory, I/O devices, such as devices 112, are not guaranteed to receive coherent data. [0023] U.S. Pat. No. 5,884,100 to Normoyle et al. discloses a single central processing unit (CPU) chip in which an I/O system is disposed on (i.e., built right onto) the core or package of the CPU chip. That is, Normoyle discloses an I/O system that is part of the CPU chipset. Because the I/O system in the Normoyle patent is located in such close proximity to the CPU, and there is only one CPU, the Normoyle patent is purportedly able to keep the I/O system coherent with the CPU. [0024] In symmetrical multiprocessor computer systems, however, it would be difficult to incorporate the I/O system onto the processor chipset. For example, the Normoyle patent provides no suggestion as to how its I/O system might interface with other CPUs or with other I/O systems. Thus, a need exists for providing cache coherency in the I/O domain of a symmetrical multiprocessor system. [0025] However, imposing cache coherency on the I/O domain of a symmetrical multiprocessor computer system may create other problems that could degrade the system's performance. 
For example, some cache coherency protocols, if applied to the I/O bridge, may result in two or more I/O devices, who are competing for the same data, becoming “livelocked”. In other words, neither I/O device is able to access the data. As a result, both devices are “starved” of data and are unable to make any progress in their respective processes or application programs. Accordingly, a need exists, not just for providing cache coherency in the I/O domain, but for also ensuring continued, high-level operation of the symmetrical multiprocessor system. [0026] Briefly, the invention relates to a system and method for avoiding “livelock” and “starvation” among two or more input/output (I/O) devices competing for the same data in a symmetrical multiprocessor (SMP) computer system. The SMP computer system includes a plurality of interconnected processors having corresponding caches, one or more memories that are shared by the processors, and a plurality of I/O bridges to which the I/O devices are coupled. Each I/O bridge includes one or more upstream buffers and one or more downstream buffers. An up engine is coupled to the upstream buffer and controls the flow of information, including requests for data, from the I/O devices to the processors and shared memory. A down engine is coupled to the downstream buffer, and controls the flow of information from the processors and shared memory to the I/O devices. A cache coherency protocol is executed in the I/O bridge in order to keep the data in the downstream buffer coherent with the processor caches and shared memory. As part of the cache coherency protocol, the I/O bridge obtains “exclusive” (not shared) ownership of all data fetched from the processor caches and the shared memory, and invalidates and releases any data in the downstream buffer that is requested by a processor or by some other I/O bridge. 
[0027] To prevent two I/O devices from becoming “livelocked” in response to competing requests for the same data, each I/O bridge further includes at least one non-coherent memory device which is also coupled to and thus under the control of the down engine. Before invalidating data requested by a competing device or entity, the down engine at the I/O bridge receiving the request first copies that data to the bridge's non-coherent memory device. The down engine then takes the largest amount of the copied data that it “knows” to be coherent (despite the request for that data by a processor or other I/O bridge) and releases only that amount to the I/O device which originally requested the data from the bridge. In the illustrative embodiment, this “known” coherent amount of data corresponds to one I/O bus cycle. The remaining data that was copied into the non-coherent memory device is then discarded. In this way, the I/O device that originally requested the data is guaranteed to make at least some forward progress despite data collisions, and yet data coherency is still maintained within the I/O domain of the SMP computer system. [0028] In another embodiment of the invention, the I/O bridge includes a single, dual-property buffer configured to store both coherent and non-coherent data. Each entry of the dual-property buffer includes a tag that specifies whether the respective entry contains coherent or non-coherent data. As data is entered into a buffer entry in response to a request for exclusive ownership of that data, the I/O bridge sets the respective tag to indicate that the data is coherent. If the data is subsequently requested by a competing device or entity, the I/O bridge changes the respective tag from coherent to non-coherent. For buffer entries whose tag indicates that the data is non-coherent, the I/O bridge preferably releases to the target I/O device only that amount “known” to be coherent. 
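The copy, invalidate, release and discard sequence of the two-buffer embodiment summarized above can be illustrated with a small software model. The following Python sketch is not part of the described hardware; the names Bridge, fill, on_forward and dma_read, the 64-byte cache line and the 8-byte bus cycle are all assumptions chosen only for illustration.

```python
from dataclasses import dataclass, field

BUS_CYCLE_BYTES = 8  # assumed amount of data moved in one I/O bus cycle


@dataclass
class Bridge:
    """Toy model of an I/O bridge with a coherent and a non-coherent buffer."""
    coherent: dict = field(default_factory=dict)      # addr -> exclusively owned line
    non_coherent: dict = field(default_factory=dict)  # addr -> copy saved before invalidation

    def fill(self, addr: int, line: bytes) -> None:
        """Store a line that was obtained with exclusive ownership."""
        self.coherent[addr] = line

    def on_forward(self, addr: int) -> bool:
        """A processor or another bridge requests the line: copy it, then invalidate it."""
        line = self.coherent.pop(addr, None)  # invalidate the coherent copy
        if line is None:
            return False                      # nothing held for this address
        self.non_coherent[addr] = line        # keep a non-coherent copy
        return True                           # the bridge would now acknowledge

    def dma_read(self, addr: int):
        """Serve the I/O device that originally asked for the line."""
        if addr in self.coherent:
            return self.coherent[addr]        # still owned: whole line is coherent
        if addr in self.non_coherent:
            line = self.non_coherent.pop(addr)  # the remainder is discarded
            return line[:BUS_CYCLE_BYTES]       # release one bus cycle only
        return None                             # miss: the device has to retry
```

In this model a device whose line was taken away still receives one bus cycle of data on its next read, which is the forward progress described above; a further read of the same address misses and would trigger a fresh request for exclusive ownership.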
[0029] The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings, in which like reference numbers indicate identical or functionally similar elements: [0030] [0031] [0032] [0033] [0034] [0035] [0036] FIGS. 7A-7B are flow diagrams of the methods of the present invention; [0037] [0038] [0039] [0040] [0041] Each CPU 202 of a 2P module 300 is preferably an “EV7” processor that includes part of an “EV6” processor as its core together with “wrapper” circuitry comprising two memory controllers, an I/O interface and four network ports. In the illustrative embodiment, the EV7 address space is 44 physical address bits and supports up to 256 processors 202 and 256 I/O subsystems 500. The EV6 core preferably incorporates a traditional reduced instruction set computer (RISC) load/store architecture. In the illustrative embodiment described herein, the EV6 core is an Alpha® 21264 processor chip manufactured by Compaq Computer Corporation of Houston, Tex., with the addition of a 1.75 megabyte (MB) 7-way associative internal cache and “CBOX”, the latter providing integrated cache controller functions to the EV7 processor. However, it will be apparent to those skilled in the art that other types of processor chips may be advantageously used. The EV7 processor also includes an “RBOX” that provides integrated routing/networking control functions with respect to the compass points, and a “ZBOX” that provides integrated memory controller functions for controlling the memory subsystem. [0042] [0043] The IO7400 comprises a North circuit region 410 that interfaces to the EV7 processor 202 and a South circuit region 450 that includes a plurality of I/O ports 460 (P0-P3) that preferably interface to standard I/O buses. An EV7 port 420 of the North region 410 couples to the EV7 processor 202 via two unidirectional, clock forwarded links 430. 
In the illustrative embodiment, three of the four I/O ports 460 interface to the well-known PCI and/or PCI-X bus standards, while the fourth port interfaces to an AGP bus standard. [0044] In accordance with an aspect of the present invention, a cache coherent domain of the SMP system 200 extends into the IO7400 and, in particular, to I/O buffers or caches located within each I/O port 460 of the IO7400. Specifically, the cache coherent domain extends to a write cache (WC) 462 and a read cache (RC) 464 located within each I/O port 460. As described further herein, these caches 462, 464 function as coherent buffers. Each port 460 of the IO7400 may further include a translation look-aside buffer (TLB) 466 for translating I/O domain addresses to system addresses. [0045] [0046] Each I/O subsystem 500 also includes power supplies, fans and storage/load devices (not shown). The I/O standard module card 580 contains a Small Computer System Interface (SCSI) controller for storage/load devices and a Universal Serial Bus (USB) that enables keyboard, mouse, CD and similar input/output functions. The embedded region 550 of the I/O subsystem 500 is typically pre-configured and does not support hot-swap operations. In contrast, the hot-plug region 530 includes a plurality of slots adapted to support hot-swap. Specifically, there are two ports 532, 534 of the hot plug region 530 dedicated to I/O port one (PI of [0047] Also included within the I/O subsystem 500 and coupled adjacent to the IO7400 is a PCI backplane manager (PBM) 502. The PBM 502 is part of a platform management infrastructure. The PBM 502 is coupled to a local area network (LAN), e.g., 100 base T LAN, by way of another I/O riser board 590 within the I/O subsystem 500. 
The LAN provides an interconnect for the server management platform that includes, in addition to the PBM 502, a CPU Management Module (CMM) located on each 2P module 300 ( [0048] Virtual Channels [0049] The SMP system 200 comprises a plurality of virtual channels including a Request channel, a Response channel, an I/O channel, a Forward channel and an Error channel. Each channel may be associated with its own buffer (not shown) on the EV7 processors 202. Ordering within a CPU 202 with respect to memory is achieved through the use of memory barrier (MB) instructions, whereas ordering in the I/O subsystem 500 is done both implicitly and explicitly. In the case of memory, references are ordered at the home memory of the cache line data in a directory in flight (DIF) data structure (table) of the EV7 202. [0050] Within the I/O channel, write operations are maintained in order relative to write operations and read operations are maintained in order relative to read operations. Moreover, write operations are allowed to pass read operations and write acknowledgements are used to confirm that their corresponding write operations have reached a point of coherency in the system. Ordering within the I/O channel is important from the perspective of any two end points. For example, if a first processor (EV7 [0051] Cache Coherency in the EV7 Domain [0052] In the illustrative embodiment, a directory-based cache coherency policy is utilized in the SMP system 200. A portion of each memory data block (“cache line”) is associated with the directory and, as such, contains information about the current state of the cache line, as well as an indication of those EV7s 202 in the system 200 holding copies of the cache line. The EV7 202 allocates storage for directory information by using bits in the respective memory storage. For example, there may be 72 bytes of storage for each 64 bytes of data in a cache line, thereby leaving 8 additional bytes. 
A typical implementation allocates one byte of this excess storage for error correction code (ECC) coverage on the 8 bytes. The EV7 202 may alternatively allocate a 9-bit ECC on each 16 bytes of data. The cache states supported by the directory include: invalid; exclusive-clean (processor has exclusive ownership of the data, and the value of the data is the same as in memory); dirty (processor has exclusive ownership of the data, and the value at the processor may be different than the value in memory); and shared (processor has a read-only copy of the data, and the value of the data is the same as in memory). [0053] If a CPU 202 on a 2P module 300 requests a cache line that is resident on another 2P module 300, the CPU 202 on the latter module supplies the cache line from its memory and updates the coherency state of that line within the directory. More specifically, in order to load data into its cache, an EV7 202 may issue a read_modify_request (ReadModReq) or an invalidate_to_dirty_request (InvaltoDirtyReq) message, among others, on the Request channel to the directory identifying the requested data (e.g., the cache line). The directory typically returns a block_exclusive_count (BlkExclusiveCnt) or an invalidate_to_dirty_response_count (InvaltoDirtyRespCnt) message on the Response channel (assuming access to the data is permitted). If the requested data is exclusively owned by another processor 202, the directory will issue a read_forward (ReadForward) or a read_modify_forward (ReadModForward) message on the Forward channel to that processor 202. The processor 202 may acknowledge that it has invalidated its copy of the data with a Victim or VictimClean message on the Response channel. 
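The request and forward exchange described above can be sketched in software. The following Python fragment is a simplified illustration only, not the EV7 directory implementation: the Directory class, its read_mod_req method, the agent names used as strings and the tuple layout of the returned messages are assumptions, and the handling of dirty data and of the Victim or VictimClean acknowledgement is omitted. The sketch also assumes that the requester does not already own the line.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    EXCLUSIVE_CLEAN = "exclusive-clean"
    DIRTY = "dirty"
    SHARED = "shared"


class Directory:
    """Toy directory that tracks the state and holders of each cache line."""

    def __init__(self):
        self.entries = {}  # addr -> (State, set of agents holding the line)

    def read_mod_req(self, addr, requester):
        """Handle a request for exclusive ownership and return the messages sent."""
        state, holders = self.entries.get(addr, (State.INVALID, set()))
        messages = []
        if state in (State.EXCLUSIVE_CLEAN, State.DIRTY):
            # The line is exclusively owned: forward the request to the owner,
            # which must invalidate its copy.
            owner = next(iter(holders))
            messages.append(("ReadModForward", owner))
        else:
            # Not exclusively owned: answer directly; the count is the number
            # of agents that still have to invalidate a shared copy.
            count = len(holders - {requester})
            messages.append(("BlkExclusiveCnt", requester, count))
        # In this simplified model the requester becomes the sole owner.
        self.entries[addr] = (State.EXCLUSIVE_CLEAN, {requester})
        return messages
```

With two requesters that alternately ask for the same line, every request after the first produces a forward to the current owner, which is the pattern that can keep either requester from ever using the data it fetched.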
[0054] I/O Space Ordering [0055] The EV7 processor 202 supports the same I/O space ordering rules as the EV6 processor: load (LD)-LD ordering is maintained to the same IO7 400 or processor 202, store (ST)-ST ordering is maintained to the same IO7 or processor, LD-ST or ST-LD ordering is maintained to the same address, and LD-ST or ST-LD ordering is not maintained when the addresses are different. All of these ordering constraints are on a single processor basis to the same IO7 400 or processor 202. Multiple loads (to the same or different addresses) may be in flight without being responded to, though their in-flight order is maintained to the destination by the core/CBOX and the router. Similarly, multiple stores (to the same or different addresses) can be in flight. [0056] The EV7 processor 202 also supports peer-to-peer I/O. In order to avoid deadlock among peer IO7 “clients”, write operations are able to bypass prior read operations. This is required because read responses cannot be returned until prior write operations have completed in order to maintain PCI ordering constraints. By allowing the write operations to bypass the read operations, it is guaranteed that the write operations will eventually drain, thereby guaranteeing that the read operations will eventually drain. [0057] Cache Coherency in the I/O Domain [0058] As described above, the EV7 processors 202 of system 200 implement a cache coherency protocol to ensure the coherency of data stored in their respective caches. In accordance with the present invention, cache coherency is also extended into the I/O domain. Since each IO7 400 can be up to six meters away from its respective EV7 processor 202, if not farther, IO7s can end up relatively far away from each other. 
To implement cache coherency across such a physically separated I/O domain (unlike the Normoyle patent, where the I/O system sits essentially on top of the CPU), the IO7s 400 are, among other reasons, generally required to obtain “exclusive” ownership of all data that they obtain from the processors 202 or the memory subsystems 370, even if the IO7 400 is only going to read the data. That is, the IO7s 400 are not permitted to obtain copies of data and hold that data in a “shared” state, as the EV7 processors 202 are permitted to do. In addition, upon receiving a ReadForward or a ReadModForward message on the Forward channel specifying data “exclusively” owned by an IO7 400, the IO7 400 immediately releases that data. More specifically, the IO7 400 invalidates its copy of the data and returns either a VictimClean or a Victim message to the directory indicating that it has invalidated the data. [0059] Although these rules maintain the coherency of data obtained by the IO7s 400, there is a potential for livelock and/or starvation among I/O devices. [0060] Each IO7 400 [0061] Each IO7 400 [0062] As indicated above, the SMP system 200 uses a directory-based cache coherency policy or protocol. In other words, the SMP system 200 includes one or more directories 618. Those skilled in the art will understand that directory 618 is preferably distributed across the processor caches and/or memory subsystems 370 of system 200, and may be maintained by processes or threads running on one or more of the EV7 processors 202. The directory 618 contains information about the current state (e.g., shared, exclusive, etc.) and location (e.g., the caches of one or more EV7 processors 202 and/or memory subsystem 370) for each cache line or data block defined by the memory subsystems 370. [0063] As also indicated above, the data in the downstream buffers 602 [0064] In the illustrative embodiment, the I/O devices 610 specify data in 32-bit addresses, whereas the SMP system 200 address space is 44 bits. 
A translation mechanism is thus needed to correlate locations in the smaller PCI address space with those of the larger SMP system 200 address space. As noted, an I/O TLB 466 ( [0065] Because the I/O TLBs 466 can be relatively “far away” from the processor and memory components of the SMP system (e.g., up to six meters or more), they are typically not maintained in a coherent manner. Instead, in response to memory management software on the SMP system 200 modifying a page table in memory, the I/O TLBs 466 are flushed. [0066] Upon deriving the system address of the cache line specified in the 32-bit I/O domain address, the up engine 614 [0067] The EV7 processor 202 receives the ReadModReq message from IO7400 [0068] In response to the Retry message, I/O device 610 [0069] Suppose, however, that after IO7400 [0070] IO7400 [0071] Meanwhile, in response to the Retry message, suppose I/O device 610 [0072] According to the invention, a system and method are provided for preventing the occurrence of livelock and for allowing two or more I/O devices that are competing for the same data to still make at least some forward progress. FIGS. 7A-B are flow diagrams of the method of the present invention. 
First, a DMA read is received by an IO7, such as IO7 400 [0073] Nonetheless, on the assumption that I/O device 610 [0074] Suppose IO7 400 [0075] In accordance with the present invention, the down engine 604 [0076] Meanwhile, in response to the Retry message of block 710 ( [0077] In this case, down engine 604 [0078] Upon consuming the one bus cycle of data, the I/O device 610 [0079] If the cache line is still being exclusively held by IO7 400 [0080] If a cache line requested by an I/O device 610 is already available at the IO7's coherent buffer 602, the response to decision block 706 is Yes, and the IO7 400 provides the cache line to the I/O device 610, as indicated at block 732 ( [0081] As shown, despite the receipt of the forward at IO7 400 [0082] At least one data beat's worth of data from the cache line can be considered coherent by the IO7s 400 and thus transferred to the I/O devices 610 despite a forward hit on the cache line. For example, suppose an EV7 processor 202 (e.g., a “producer”) places “n” transactions into a memory structure, such as a circular queue, that are to be read out by an I/O device 610 (e.g., a “consumer”). The producer will typically signal that these entries have been added to the queue by updating a producer index. The producer index may specify where in the circular queue the “n” transactions start. The consumer will see that the producer index has been updated and generate a read request for the queue. The IO7 400 will fetch the cache line(s) corresponding to the circular queue. [0083] Suppose, however, that the producer then wishes to add “m” new transactions into the circular queue. The producer requests write access to the circular queue, causing the IO7 to victimize its copy of the circular queue. The circular queue at the IO7 must be victimized because the IO7 does not know if the cache line(s) that it obtained includes one or more entries to which one of the “m” transactions is to be written. 
At least the first of the “n” transactions, however, is still valid, because the producer signaled to the consumer that the “n” transactions were ready for consumption. Accordingly, the IO7 can provide at least a data beat at the starting point identified by the producer, i.e., the first of the “n” transactions. [0084] Those skilled in the art will understand that the functionality of the up and down engines 604, 614 may be combined into a single DMA controller at the IO7400. It should also be understood that the upstream buffer 612 may correspond to the previously discussed write cache (WC) 462 ( [0085] In order to support high performance I/O devices, the up engine 614 of an IO7400, in addition to requesting the cache line specified by a DMA read, may also prefetch additional data corresponding to other cache lines that it “anticipates” the requesting I/O device 610 may need in the future. More specifically, the IO7400 may include a prefetch engine (not shown) that executes an algorithm to identify additional cache lines based on the cache line requested by the I/O device 610. [0086] [0087] If the IO7400 [0088] Suppose that I/O device 610 [0089] As a performance matter, it should be understood that the number of delayed DMA reads that IO7400 [0090] In another embodiment of the present invention, the coherent downstream buffer 602 and the non-coherent buffer 616 at the IO7 are replaced with one or more dual-property buffers. [0091] Suppose an I/O device issues a DMA read specifying a particular cache line. The IO7 first checks to see if the requested cache line is already stored in the dual-property buffer 1000. If not, the IO7 returns a retry message to the I/O device and issues a request for exclusive ownership of the data from the EV7 mesh. The data may be provided to the IO7 as part of a BlkExclusiveCnt message, where the Cnt (count) specifies the number of agents or entities (e.g., processors, other IO7s, etc.) 
having a shared copy of the data (as determined by the directory). As each of these agents or entities invalidates its copy of the data (as requested by the directory), it sends an invalidate_acknowledgement (InvalAck) message to the IO7 400 [0092] Upon receiving data, the IO7 preferably stores it in the data space 1004 of a selected entry 1002 of the dual-property buffer 1000. The IO7 then sets the value of the respective tag 1006 for this entry 1002 to indicate that the data is coherent. The IO7 may wait until the Cnt reaches zero before setting the tag to the coherent value. Alternatively, the IO7 may set the tag immediately to coherent, even if the Cnt is non-zero. [0093] If a forward is received that “hits” on this entry 1002 of the dual-property buffer 1000 before the data is provided to the I/O device, the IO7 preferably changes the tag 1006 from coherent to non-coherent. The IO7 then returns or at least schedules the return of a VictimClean message to the directory. It should be understood that the IO7 may have initially responded to the forward with a ForwardMiss before probing the contents of the dual-property buffer 1000. When the retried DMA read is received from the I/O device, the IO7 searches its dual-property buffer 1000 for the specified cache line. Although the cache line is located in the dual-property buffer 1000, the IO7 notices that the tag 1006 indicates that the data is non-coherent. Accordingly, to ensure at least some forward progress, the IO7 preferably releases only that amount of the cache line that the IO7 knows to be coherent. Again, in the preferred embodiment, the amount corresponds to one “data beat” of data (e.g., one local bus cycle). After releasing the one data beat, the IO7 may victimize the cache line. Had the tag 1006 indicated that the data is coherent, the entire cache line could be released or otherwise provided to the I/O device. 
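The tag handling of the dual-property buffer embodiment can likewise be modeled in a few lines. The Python sketch below is illustrative only: the names Entry, DualPropertyBuffer, store, forward_hit and dma_read are hypothetical, the 8-byte data beat is an assumed width, and the InvalAck count is ignored (the tag is set to coherent immediately, which is one of the two options described above).

```python
from dataclasses import dataclass

DATA_BEAT_BYTES = 8  # assumed size of one data beat (one local bus cycle)


@dataclass
class Entry:
    data: bytes      # stands in for data space 1004
    coherent: bool   # stands in for tag 1006


class DualPropertyBuffer:
    """Toy model of a buffer whose entries hold coherent or non-coherent data."""

    def __init__(self):
        self.entries = {}  # cache-line address -> Entry

    def store(self, addr: int, data: bytes) -> None:
        """Data arrived with exclusive ownership: tag the entry coherent."""
        self.entries[addr] = Entry(data, coherent=True)

    def forward_hit(self, addr: int) -> str:
        """A forward arrived: flip the tag and report the reply to send."""
        entry = self.entries.get(addr)
        if entry is None:
            return "ForwardMiss"
        entry.coherent = False
        return "VictimClean"

    def dma_read(self, addr: int):
        """Serve the retried DMA read from the I/O device."""
        entry = self.entries.get(addr)
        if entry is None:
            return None                    # miss: the device is told to retry
        if entry.coherent:
            return entry.data              # the entire cache line may be released
        del self.entries[addr]             # victimize after the partial release
        return entry.data[:DATA_BEAT_BYTES]  # one data beat only
```

Compared with the two-buffer arrangement, the only state change on a forward is the tag flip, so no data is copied between buffers; the amount released to the device is the same in both models.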
[0094] It should be understood that an IO7 400 may be configured to obtain nonexclusive ownership of data in certain situations. For example, an IO7 may issue a particular message, such as a read_invalid (ReadInval), to obtain a non-coherent (e.g., a shared) copy or “snapshot” of data for one-time use by the IO7 400. This data may be stored directly in the non-coherent buffer or in the dual-property buffer with the tag set from the beginning to non-coherent. [0095] For DMA writes, a different procedure is preferably implemented in accordance with the present invention. In particular, in response to receiving a DMA write from I/O device 610 [0096] If the IO7 400 [0097] For DMA writes to less than a full cache line, the IO7 400 [0098] The foregoing description has been directed to specific embodiments of this invention. It will be apparent, however, that other variations and modifications may be made to the described embodiments, with the attainment of some or all of their advantages. For example, the IO7 could return a ForwardMiss message to the directory in response to a Forward, and then victimize the cache line after allowing the I/O device to consume at least a portion of the cache line. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
BACKGROUND OF THE INVENTION
SUMMARY OF THE INVENTION
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT