How the PCI Bus Works
(This is an edited version of an article that appeared a few years ago in PC Support Advisor. Although it provides a good general introduction to PCI bus concepts, it is now quite an old article and does not cover the latest PCI bus developments.)
The acronym PCI stands for Peripheral Component Interconnect, which aptly describes what it does. PCI was designed to satisfy the requirement for a standard interface for connecting peripherals to a PC, capable of sustaining the high data transfer rates needed by modern graphics controllers, storage media, network interface cards and other devices.
Earlier bus designs were all lacking in one respect or another. The IBM PC-AT standard ISA (Industry Standard Architecture) bus, for example, can manage a data transfer rate of 8MB/sec at best; in practice, the throughput that can be sustained is much less than that. Other factors, like the 16-bit data path and the 24-bit address bus - which restricts memory-mapped peripherals to the first 16MB of memory address space - made the ISA bus seem increasingly outmoded.
More recent designs such as IBM's MCA (Micro Channel Architecture) and the EISA (Extended ISA) bus, though having higher bandwidth (32 bits) and providing better support for bus mastering and direct memory access (DMA), were not enough of an improvement over ISA to offer a long-term solution. New, faster versions of both MCA and EISA were proposed, but were received without enthusiasm. Neither promised to be an inexpensive solution, and cost has always been an important factor in the competitive PC market.
VL-Bus
One attempt to improve bus performance inexpensively was the VL-Bus. Prior to this, some PC vendors had started providing proprietary local bus interfaces enabling graphics boards to be connected directly to the 486 processor bus. These systems had one major failing: since the interfaces were proprietary, support for them by third-party peripheral vendors was limited, so there was little chance that a user would ever be able to purchase a compatible graphics card as an upgrade.
VESA's intervention gave manufacturers a standard to work to. The standard was based on existing chip sets, which had the advantages of low cost and of enabling the technology to be brought to market quickly. The disadvantage was the rather crude implementation. The VL-Bus was a success in its day because it met a need, but it never looked like a long-term solution.
Some of the parties involved in the design of the VL-Bus standard felt that a solution based on existing bus technologies had too many design compromises to be worth considering. This group, led by Intel, split off to form the PCI Special Interest Group with the aim of producing a new bus specification from scratch.
Although PCI has been described as a local bus, in fact it is nothing of the sort. The term 'local bus' means that the address, data and control signals are directly connected - in other words, 'local' - to the processor. The VL-Bus is a true local bus, since devices are connected to the CPU via nothing more than some electrical buffering. This is one of the reasons for its simplicity, but it is also the reason for many of its limitations.
One problem is that a local bus is by definition synchronous. The bus speed is the same as the external processor clock speed. Expansion card vendors therefore have the difficulty of ensuring that their products will run at a range of speeds. The upper limit of the range cannot be defined, and is liable to increase as new processors are introduced. This is a recipe for compatibility problems. Such problems have been experienced by users of VL-Bus systems using the AMD 80MHz processors, which have a 40MHz bus clock.
The second problem with a true local bus is that the electrical load (and consequently the number of expansion slots) that can be driven by the bus decreases as the clock speed increases. This creates the situation where typically three slots can be provided at 33MHz, but only two at 40MHz and just one at 50MHz. This is particularly awkward given that most motherboards are designed to work at a range of clock speeds and usually come with three slots. Many manufacturers simply ignored the problem.
PCI Design
PCI's designers decided to avoid these difficulties altogether by decoupling the bus clock from the processor clock: the PCI bus runs synchronously to its own clock, not to the CPU's. The top speed for most PCI cards is 33MHz. The PCI 2.1 specification made provision for a doubling of the speed to 66MHz, but support for this higher speed was optional.
At 33MHz, with a 32-bit data bus, the theoretical maximum data transfer rate of the PCI bus is 132MB/sec. At 66MHz, with a 64-bit data path, the top speed would be 528MB/sec.
The PCI bus can run at lower speeds. In a system clocked at 25MHz, for example, the bus could also run at this speed. This was an important consideration at the time PCI was being developed.
Peripherals must be designed to work over the entire range of permitted speeds. In the original PCI specification the lower limit to the speed range was given as 16MHz; in PCI revision 2.0 this was reduced to 0MHz. This supports 'green' power saving modes by allowing the system to run at reduced speed for lower power consumption, or to be put into 'suspend' mode (0MHz), without any bus status information being lost.
The number of devices on a PCI bus depends on the load. In practice this means three or four slots, plus an on-board disk controller and a secondary bus. Up to 256 PCI buses can be linked together, though, to provide extra slots, so this is not really a limitation.
PCI Connector
The 32-bit PCI connector has 124 pins (62 per side). The pin-outs are arranged so that every signal pin is adjacent to a power or ground rail, which helps to reduce electromagnetic interference (EMI) by capacitive decoupling. The compact size is obtained by multiplexing the 32 address and data lines so they share the same 32 pins. This allows PCI cards to be made short enough to install in portable PCs.
The PCI specification stipulates the size of PCI add-in cards. Short cards are 6.875in. long and from 1.42in. to 4.2in. high. There is also a long card, which is 12.28in. long and 4.2in. high.
PCI makes provision for both standard 5V and low power 3.3V boards. Separate slots are needed for the two types of card: these are keyed to prevent cards being inserted in the wrong type of slot. Slots for 5V cards have a key towards the end furthest from the backplane, while 3.3V slots are keyed at a similar position at the other end. The cards themselves have a corresponding keyway in the key position. Universal cards able to operate at either 3.3V or 5V have keyways in both positions, and so may be installed in either type of slot.
The 64-bit PCI connector has a further 64 pins (32 per side) which follow on from the standard 32-bit slot in a similar manner to the IBM AT extension to the original IBM PC 8-bit slot. The extension contains mainly another 32 multiplexed address and data lines, plus extra power and ground rails. Signals present on the 32-bit part of the connector allow a 64-bit card to be detected and used (albeit with reduced performance) in a 32-bit slot.
Besides the interlacing of power, ground and signal traces, PCI uses another innovation, reflected wave switching, to reduce power consumption and the EMI problem associated with fast, high power digital electronics. Circuit traces on a PCI board are unterminated. This means that a signal travelling along a trace meets a high impedance at the end, and is consequently reflected back along the trace instead of being absorbed. By careful design, the logic gates are placed at the points where the incident and reflected waves reinforce each other. Because the voltages of the two waves add, the logic drivers need only produce a signal of half the needed voltage level, which reduces the power needed by a similar fraction.
A pair of pins on the bus connector allow the system to determine the power requirements of the installed hardware. Interpreted as two bits they permit a total of four combinations showing that the slot is either empty, or contains a board with a power consumption of up to 7.5W, 15W or 25W.
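Decoded in software terms, the presence-detect scheme might look like this. A sketch only: the pin-to-wattage mapping shown follows the PCI specification's PRSNT1#/PRSNT2# encoding as we understand it, and should be checked against the spec before being relied upon.

```c
#include <stdio.h>

/* Decode the PRSNT1#/PRSNT2# presence-detect pins of a PCI slot.
 * The pins read 0 when grounded on the board and 1 when left open.
 * Mapping per our reading of the PCI specification - verify before
 * relying on it. */
const char *slot_power(int prsnt1, int prsnt2)
{
    if (prsnt1 && prsnt2)  return "slot empty";
    if (!prsnt1 && prsnt2) return "board present, 25W maximum";
    if (prsnt1 && !prsnt2) return "board present, 15W maximum";
    return "board present, 7.5W maximum";
}

int main(void)
{
    printf("%s\n", slot_power(1, 1)); /* prints "slot empty" */
    return 0;
}
```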
Host to PCI bridge
The PCI designers used a bridge to connect the PCI bus to the processor bus. This is the reason why PCI is not a local bus. The advantage is that the bus design can be independent of that of the processor. To interface a PCI bus to a new type of processor requires only a new bridge chip. The benefits of this are that systems using non-Intel processors can also use the PCI bus and are able to take advantage of peripheral add-ins designed for the PC market.
On the PCI bus devices are described as initiators or targets. Initiators are devices that can initiate a bus transaction, such as the host to PCI bridge and intelligent I/O boards called bus masters. Some devices may only be targets: they can speak only when they are spoken to.
Another key feature of PCI is that all data transfers are burst transfers. This means that data is sent in chunks of one, two, four or eight bytes (according to the highest common capability of the devices and the width of the data bus), one chunk per bus cycle. This is the fastest method of transferring data, and contrasts with the non-burst modes used in older PC bus designs, where data is transferred using a sequence of alternating address and data cycles.
The PCI bus design places no limits on the length of burst transfers. This is a big improvement over the VL-Bus, which, owing to the design of the 486 processor, was limited to a maximum burst of four cycles. Most real-world data transfers are of blocks longer than 4 x 32 bits, so the overhead associated with each burst transfer will affect the overall transfer rate. Typically the VL-Bus could manage only 40-50MB/sec. In fact, early PCI systems were only capable of similar performance due to implementation restrictions. Later PCI systems were able to deliver sustained transfer rates of over 100MB/sec.
A PCI implementation may employ a variety of techniques to improve performance. For example, bridges may include a posted-write buffer which allows a bus master to post memory writes to the bridge at burst speed and not merely the speed of the target device. To ensure data consistency, the bridge will not permit a read to take place until all posted writes have been flushed to their destination addresses.
The bus may also combine separate memory writes of 8- or 16-bit values into single 32-bit memory transactions to optimise bus and memory performance. The PCI specification states that data must be written to the target in the original order, before it was combined. It also recommends that this feature, if present, should be capable of being disabled in case it causes problems. I/O writes are not combined in this fashion.
The integrity of data on the PCI bus is checked using a single parity bit which protects the 32 address/data lines and four Command/Byte Enable signals. A further parity bit protects the additional lines of the 64-bit extension where present.
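The scheme is even parity: the PAR signal is driven so that the 32 address/data lines, the four Command/Byte Enable lines and PAR itself contain an even number of ones between them. A small illustration (the function name is invented):

```c
#include <stdint.h>

/* Compute the PAR bit for one PCI bus phase. PCI uses even parity:
 * PAR is driven so that AD[31:0], C/BE[3:0]# and PAR together carry
 * an even number of ones. PAR is therefore the XOR of the 36
 * protected bits. */
static int pci_par(uint32_t ad, uint8_t cbe)
{
    uint64_t bits = ((uint64_t)(cbe & 0x0Fu) << 32) | ad;
    int parity = 0;

    while (bits) {
        parity ^= (int)(bits & 1u);  /* fold each bit into the result */
        bits >>= 1;
    }
    return parity;  /* 1 when AD and C/BE contain an odd number of ones */
}
```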
Bus transactions
Let's look at what happens during a PCI data transfer or bus transaction. First, the initiating device has to get permission to have control of the bus. This is determined during the process of bus arbitration. A function called the arbiter, which is part of the PCI chip set, decides which device is allowed to initiate a transaction next. The arbiter uses an algorithm designed to avoid deadlocks and prevent one or more devices from monopolising the bus to the exclusion of others.
Having gained control of the bus, an initiator then places the target address and a code representing the transfer type on the bus. Other PCI devices determine, by decoding the address and the command type information, whether they are the intended target for the transfer. The target device claims the transaction by asserting a device select signal.
Once the target has sent its acknowledgement, the bus transaction enters the data phase. During this phase the data is transferred. The transfer can be terminated either by the initiator, when the transfer is completed or when its permission to use the bus is withdrawn by the arbiter, or by the target if it is unable to accept any more data for the time being. If the latter, the transfer must be restarted as a separate transaction. One of the rules of PCI protocol is that a target must terminate a transaction and release the bus if it is unable to process any more data, so a slow target device cannot hog the bus and prevent others from using it.
Note that although all PCI data transfers are burst transfers, a device does not have to be able to accept long bursts of data. A target device can terminate the data phase after one cycle if it wants to. Such behaviour would be perfectly acceptable in a non-performance-critical device. Even high performance devices may have to terminate a burst, since their data buffers will be of finite size and if they cannot process the data as quickly as it is sent these buffers will eventually fill up.
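To make the sequence concrete, here is a toy model of a burst transfer. It is purely illustrative and not cycle-accurate; the signal names (FRAME#, IRDY#, TRDY#) are those used in the PCI specification.

```c
#include <stdio.h>
#include <stdbool.h>

/* Toy model of a four-word PCI burst: one address phase, then data
 * phases that complete only on clocks where both IRDY# and TRDY#
 * are asserted. The target here inserts a wait state every other
 * clock, as a slow device might. */
int main(void)
{
    int words_left = 4;
    int clock = 0;

    printf("clk %d: address phase (FRAME# asserted, command on C/BE#)\n",
           clock++);

    while (words_left > 0) {
        bool irdy = true;              /* initiator always ready here  */
        bool trdy = (clock % 2 == 0);  /* target ready on even clocks  */

        if (irdy && trdy) {
            words_left--;
            printf("clk %d: data transferred (%d to go)\n",
                   clock, words_left);
        } else {
            printf("clk %d: wait state (TRDY# deasserted)\n", clock);
        }
        clock++;
    }
    /* FRAME# is deasserted during the final data phase to signal
     * that the burst is ending; the bus is then free. */
    printf("clk %d: bus idle\n", clock);
    return 0;
}
```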
Non-PCI devices
This description of a PCI bus transaction assumes that both initiating and target devices are PCI devices. However, even today most PCs require an ISA or other expansion bus in order to be able to install legacy peripherals. This is achieved using a PCI to expansion bus bridge. In this configuration, the PCI bus is the primary bus, and the legacy bus is the secondary bus.
If an initiator begins a transaction for a device that is on a secondary expansion bus, no PCI device will acknowledge that it is the target. One of two things could happen next. The PCI to expansion bus bridge could claim the transaction on behalf of its own peripherals; however, this would require that the bridge be programmed with the addresses of all the devices on the other side of it. This is a possibility in the case of MCA, EISA and plug-and-play ISA boards. However, ordinary ISA boards are not plug-and-play so a PCI to ISA bridge can have no knowledge of what memory and I/O addresses are on the ISA bus.
The method normally used to handle transactions destined for an ISA expansion bus is to use a process of subtractive decoding, or "if nobody else wants it, it must be for me." The expansion bus bridge claims the transaction if it is for a memory address in the first 16MB of address space or an I/O port address in the first 64KB, and no PCI device has claimed the transaction within a set delay period. The delay period depends on the speed of the PCI device address decoders, which can take from one to three clock cycles to respond with an acknowledgement.
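In outline, the bridge's claim decision can be written as a simple test. The sketch below is illustrative only; the 16MB and 64KB limits, and the rule that a positive decode by a PCI device always takes precedence, are as described above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Subtractive decoding as performed by a PCI-to-ISA bridge: claim a
 * transaction only if it lies within the ranges ISA can address AND
 * no PCI device has positively decoded it within the delay period. */
bool isa_bridge_claims(uint32_t addr, bool is_io_cycle,
                       bool claimed_by_pci_device)
{
    if (claimed_by_pci_device)
        return false;              /* a positive decode always wins   */

    if (is_io_cycle)
        return addr < 0x10000;     /* first 64KB of I/O space         */

    return addr < 0x1000000;       /* first 16MB of memory space      */
}
```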
The speed, and hence the length of the delay, is determined during the power-on configuration process. The presence of even one slow device will require a delay of four bus clock cycles for every ISA bus transfer. This will degrade the performance of peripherals on the ISA expansion bus.
Bus arbitration
All PCI devices capable of initiating a data transfer are bus masters. This means that they can take control of the bus to perform a data transfer without requiring the assistance of the CPU. Reducing the need for the CPU to become involved in transferring large volumes of data has performance benefits.
Because there is usually more than one bus master in a PCI system, a method of arbitration is needed to resolve conflicts when two or more devices want to transfer data at the same time. This isn't as easy as it might sound. An arbiter has to handle all the possible situations that may occur between a group of communicating devices, as well as ensuring that bus access is granted fairly.
The main objective of arbitration is to ensure that all devices are given access to the bus when they need it. Too long a delay could harm performance or cause other problems. Every PCI bus master contains a configuration register which specifies its maximum latency: the time within which the device should be allowed to transfer its data.
To reduce bus latency, PCI uses hidden arbitration. This means that arbitration can take place whilst another bus transaction is going on, so that the next device can begin transferring data the instant the bus is free.
When the arbiter grants a device access to the bus, the device's GNT# signal is asserted. The device then starts monitoring the state of other bus signals (FRAME# and IRDY#) to determine when the bus is free. Once the bus is free, and assuming the GNT# signal is still asserted, the device can begin its transaction.
The arbiter uses the maximum latency register to determine priority levels. If a bus master requests the bus after access has already been granted to a device with a higher maximum latency, then as long as the bus is still busy and the first device's transaction has not yet started, the arbiter can pre-empt the first device and award the GNT# signal to the one that needs it more urgently.
Whilst a bus transaction is in progress another mechanism ensures that bus masters cannot hog the bus and prevent other devices from getting access when they need it. Each bus master has another configuration register called the latency timer. This is set to the minimum number of cycles for which the device will be guaranteed access to the bus.
The latency timer register is decremented with every bus cycle. When the arbiter wants to allow another device to access the bus, it removes the GNT# signal from the active device. If the timer has already reached zero when this occurs, the device has had its guaranteed minimum period of bus access, and must complete the current data cycle and immediately relinquish control of the bus. If the timer value is still positive, the device may continue with its transfer, but only until the value reaches zero, when it must release the bus for the next device.
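Expressed as code, the rule a bus master applies on each clock cycle might look like this (an illustrative sketch; the helper and type names are invented for the example):

```c
#include <stdbool.h>

/* What a bus master does on each clock of a burst, given its
 * latency timer and the state of its GNT# line. */
typedef enum { CONTINUE_BURST, FINISH_AND_RELEASE } master_action;

master_action on_bus_clock(unsigned *latency_timer, bool gnt_asserted)
{
    if (*latency_timer > 0)
        (*latency_timer)--;        /* timer counts down each bus cycle */

    if (!gnt_asserted && *latency_timer == 0)
        return FINISH_AND_RELEASE; /* guaranteed tenure is used up     */

    return CONTINUE_BURST;         /* still granted, or timer not yet
                                      expired: carry on transferring   */
}
```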
Command types
Most often, we think of bus transactions in terms of blocks of data being transferred from one location to another. In fact, there are a number of different types of information that can be transferred across a bus. On the PCI bus, four signal lines called Command/Byte Enable are used to indicate the transaction type. Of the 16 possible values, 12 are currently defined. (During the data phase, these lines are used to show which of the bytes on the 32-bit bus contain valid data, hence the 'Byte Enable.')
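For reference, the command encodings can be written out as an enumeration. The values below are as given in the PCI 2.x specification (readers should check the current specification for the authoritative table); the four reserved codes account for the 12-of-16 figure.

```c
/* PCI bus command codes, driven on the C/BE[3:0]# lines during the
 * address phase. Codes 4, 5, 8 and 9 are reserved, which gives the
 * 12 defined commands out of 16 possible values. */
enum pci_command {
    PCI_CMD_INTERRUPT_ACK    = 0x0,
    PCI_CMD_SPECIAL_CYCLE    = 0x1,
    PCI_CMD_IO_READ          = 0x2,
    PCI_CMD_IO_WRITE         = 0x3,
    /* 0x4, 0x5 reserved */
    PCI_CMD_MEM_READ         = 0x6,
    PCI_CMD_MEM_WRITE        = 0x7,
    /* 0x8, 0x9 reserved */
    PCI_CMD_CONFIG_READ      = 0xA,
    PCI_CMD_CONFIG_WRITE     = 0xB,
    PCI_CMD_MEM_READ_MULT    = 0xC,  /* Memory Read Multiple          */
    PCI_CMD_DUAL_ADDR_CYCLE  = 0xD,  /* 64-bit addressing, see below  */
    PCI_CMD_MEM_READ_LINE    = 0xE,
    PCI_CMD_MEM_WRITE_INVAL  = 0xF   /* Memory Write and Invalidate   */
};
```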
The I/O Read, I/O Write, Memory Read and Memory Write transactions should need little explanation. However, for memory transfers there are also commands called Memory Read Line, Memory Read Multiple and Memory Write and Invalidate. These commands convey additional information about how the data to be transferred relates to that held in cache memory, and so allow the cache controller to operate more efficiently.
The PCI bus has the capability to access memory targets in address space beyond 4GB, even when using a 32-bit PCI slot. The Dual-Address Cycle command indicates that a 64-bit address is being placed on the bus in two 32-bit halves.
There is a Special Cycle command which is used to broadcast messages to devices on the PCI bus. In this command, the address has no validity but the first 16 bits of the data contain a message type and the remaining 16 bits can contain message-specific data. This command is mainly used to inform devices that the system is about to shut down.
The Interrupt Acknowledge command is used by the host to PCI bridge to obtain further information about an interrupt request from an interrupting PCI device. In a PC compatible system, a device requesting an interrupt does so by raising one of the interrupt request lines IRQ0 to IRQ15. The IRQ is converted by the programmable interrupt controller to a single signal to the processor, INTR. The processor responds to this signal by requesting the controller to supply an interrupt vector address: the address in memory of the software routine for handling the interrupt. In an ISA or VL-Bus system there is a direct connection between the CPU and the interrupt controller.
On a PCI system, the processor's interrupt vector request is passed to the host to PCI bridge. The bridge responds by obtaining control of the bus and initiating an interrupt acknowledge transaction. The PCI target containing the interrupt controller claims the transaction, and sends a signal emulating the interrupt vector request to the interrupt controller chip. The interrupt vector address is then placed by the controller on to the data bus. From there it is read by the host to PCI bridge, which then terminates the transaction and passes the vector to the processor.
Interrupt handling
The concept of 16 discrete IRQ lines, each uniquely assigned to a device, is peculiar to the ISA bus and its derivatives. The CPU sees only a single interrupt signal, obtains an interrupt vector address and then processes the interrupt routine at that address. The use of 16 lines was the method chosen by the designers of the original IBM PC to tell the interrupt controller which address to supply.
Each PCI slot has four interrupt lines connected to it, designated INTA# to INTD#. The first (or only) interrupt-using function on a PCI board must be connected to INTA#. The other three lines allow up to four functions to be combined on one board using INTA# - INTD# in that order.
The PCI interrupt lines and the output from the ISA interrupt controller are combined in a programmable interrupt router, which generates the single interrupt signal for the CPU. How they are combined is not defined by the PCI specification. PCI interrupts are level-triggered and therefore shareable, so some of them may be connected together.
The IBM PC architecture expects particular devices to use particular IRQs (e.g. the primary disk controller must use IRQ14). Furthermore, because ISA interrupt lines cannot be shared, PC interrupt routines expect that when they are called, they are servicing their specific device and no other.
This means that in a PC, the INTx# lines in each PCI slot - or those that are being used - must each be mapped to a separate IRQ which the operating system or driver software will expect the device in that slot to use. This is usually done using the BIOS Setup utility. Some early PCI systems which did not have this facility required an ISA 'paddle-board' to be used with add-ins like caching disk controllers to ensure they were connected to the appropriate IRQ.
Integrated (on-board) devices are hard-configured to use the appropriate interrupts. Were it not for the fact that specific devices must use specific IRQs, PCI configuration would be completely automatic as the interrupt level could be assigned by the system at start-up.
Configuration Space
PCI was designed as a plug-and-play, self-configuring system. In support of this it defines an area of addressable ROM and RAM called configuration space, which can be interrogated to obtain information about a device and written to in order to configure it.
Each PCI device has a block of 256 bytes of configuration space: 16 32-bit doublewords of header information plus 48 doublewords of device-specific configuration registers. The header contains a vendor ID and a device-type code; flags showing whether the device generates interrupts, whether it is 66MHz-capable and other low-level performance-related information; the base address locations of I/O ports, RAM and expansion ROM; the maximum latency register (mentioned earlier); and other similar general information. ROMs can contain code for different processor architectures, and a configuration register shows which ones are supported.
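The 64-byte header can be pictured as a C structure. The following is a sketch of the standard 'type 0' header layout as described in the PCI 2.x specification, reconstructed here for illustration rather than quoted verbatim:

```c
#include <stdint.h>

/* PCI configuration space header, type 0 (ordinary devices).
 * Sketch of the layout; byte offsets are given in the comments. */
struct pci_config_header {
    uint16_t vendor_id;          /* 0x00: assigned by the PCI SIG     */
    uint16_t device_id;          /* 0x02: assigned by the vendor      */
    uint16_t command;            /* 0x04: enable I/O, memory, master  */
    uint16_t status;             /* 0x06: 66MHz-capable flag etc.     */
    uint8_t  revision_id;        /* 0x08 */
    uint8_t  class_code[3];      /* 0x09: what kind of device this is */
    uint8_t  cache_line_size;    /* 0x0C */
    uint8_t  latency_timer;      /* 0x0D: see bus arbitration above   */
    uint8_t  header_type;        /* 0x0E */
    uint8_t  bist;               /* 0x0F: built-in self test          */
    uint32_t base_address[6];    /* 0x10: I/O port, RAM and ROM bases */
    uint32_t cardbus_cis_ptr;    /* 0x28 */
    uint16_t subsys_vendor_id;   /* 0x2C */
    uint16_t subsys_id;          /* 0x2E */
    uint32_t expansion_rom_base; /* 0x30 */
    uint32_t reserved[2];        /* 0x34 */
    uint8_t  interrupt_line;     /* 0x3C: which IRQ has been assigned */
    uint8_t  interrupt_pin;      /* 0x3D: INTA#..INTD#                */
    uint8_t  min_gnt;            /* 0x3E: desired burst period        */
    uint8_t  max_lat;            /* 0x3F: maximum latency register    */
};
```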
Configuration space is completely separate from memory and I/O space, and can only be accessed using the PCI bus Configuration Read and Write commands. Intel x86 processors cannot access configuration space directly, so the PCI specification defines two methods by which this can be achieved. The preferred method, used by current implementations, is to write the target address to the 32-bit I/O port at 0CF8h, and then read or write the doubleword through I/O port 0CFCh. A second method, used by early PCI chip sets but now discouraged by the PCI specification, involves using I/O ports at 0CF8h and 0CFAh to map the configuration spaces of up to 16 PCI devices into the I/O range C000h to CFFFh, from where the data may be read or written.
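In code, the preferred mechanism looks something like the following. A minimal sketch, assuming an x86 Linux environment with glibc's <sys/io.h> port I/O functions (which require I/O privileges); the CONFIG_ADDRESS bit layout is that defined by the PCI specification.

```c
#include <stdint.h>
#include <sys/io.h>   /* outl/inl; needs iopl(3) privileges (Linux/x86) */

/* Read one doubleword from a device's configuration space using the
 * preferred mechanism. The address register at 0CF8h takes an enable
 * bit plus bus, device, function and register numbers; the selected
 * doubleword is then read or written through port 0CFCh. */
uint32_t pci_config_read(uint8_t bus, uint8_t dev, uint8_t func,
                         uint8_t reg)
{
    uint32_t address = (1u << 31)             /* enable bit            */
                     | ((uint32_t)bus  << 16)
                     | ((uint32_t)dev  << 11)
                     | ((uint32_t)func << 8)
                     | (reg & 0xFCu);         /* doubleword-aligned reg */

    outl(address, 0xCF8);   /* select the target doubleword */
    return inl(0xCFC);      /* and read it back             */
}
```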
The above information is given for interest only. The correct way for software such as device drivers or diagnostic programs to access a device's configuration space is to use the functions provided in the PCI BIOS. The PCI BIOS function code is 0B1h. If a program calls interrupt 1Ah with the AX register set to 0B101h, the carry bit will be clear on return if the PCI BIOS is present, and the 32-bit EDX register will contain the ASCII signature 'PCI ' (the value 20494350h). Register BX will contain the major and minor PCI BIOS revision numbers. Register AL will be odd (bit 0 set) if the system supports the preferred configuration space addressing mechanism.
Using other subfunctions of BIOS function 0B1h programs can search for a device and obtain its location, find devices of a particular class, read and write to configuration space, generate a PCI bus special cycle, discover how PCI interrupts have been assigned to IRQ lines, and set a PCI device's interrupt to a particular IRQ. Normally, of course, these functions would only be carried out by system software.
Conclusion
The PCI bus is an expansion bus designed to meet the requirements of PC users now and for the foreseeable future. With its high speed, 64-bit data bandwidth and wholehearted support for bus mastering and burst-mode data transfers, its maximum throughput is unlikely to become a bottleneck for some time. And its processor independence will be a valuable asset as our PCs move further away from the limitations of the '80s Intel x86 architecture.
From a support point of view, PCI's plug-and-play ambitions are welcome. We are never likely to see completely automatic configuration, nor get away from such restrictions as 16 interrupt request lines, while we continue to demand PC compatibility. But PCI, which inherits none of these limitations, is a step in the right direction.