
The COM symbol is utilized to achieve Symbol Lock under the following circumstances:
  • During Link training when the Link is first established, TS1 and TS2 Ordered-Sets are transmitted (and each set begins with a COM symbol).
  • During Link retraining initiated due to a problem on the Link, TS1 and TS2 Ordered-Sets are transmitted (and each set begins with a COM symbol).
  • FTS Ordered-Sets are sent by a transmitter to inform the receiver to regain Bit Lock and Symbol Lock and change the state of the Link from L0s to L0.

Receiver Clock Compensation Logic

Background

Consider a transmitter at one end of a Link and the receiver at the opposite end. The transmit clock must be accurate to 2.5 GHz +/- 300 ppm (parts per million). Once the Link is trained, the receive clock (Rx Clock) in the receiver runs at the same frequency as the transmit clock (Tx Clock) at the other end of the Link (because the receive clock is derived from the bit stream that was transmitted at the remote end's transmit clock frequency). If the transmitter's Tx Clock at one end of the Link operates at +300 ppm and the Local Clock (shown in Figure 11-21 on page 439, not the Rx Clock) at the receiver at the other end operates at -300 ppm, the result is a worst-case 600 ppm difference between the two clocks.
In this scenario, the transmitter at one end of the Link is operating at 2.5 GHz + 300 ppm, while the receiver's local clock is operating at 2.5 GHz - 300 ppm. The Tx Clock of the transmitter and the Local Clock of the receiver can therefore drift apart by one clock every 1666 clocks.
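The 1-in-1666 figure follows directly from the 600 ppm worst case. A minimal check of the arithmetic (the numbers are those quoted above, nothing new):

ppm_tx = +300e-6                # transmit clock at +300 ppm
ppm_local = -300e-6             # receiver's Local Clock at -300 ppm
offset = ppm_tx - ppm_local     # worst-case 600 ppm mismatch

clocks_per_slip = 1 / offset    # clocks until the two clocks drift apart by one whole clock
print(int(clocks_per_slip))     # -> 1666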

The Elastic Buffer's Role in the Receiver

It is a common design practice to clock most of the receive path logic using the Physical Layer's local clock. To compensate for the frequency difference between the Rx Clock (which is derived from the remote port's transmit frequency) and the Local Clock (which is derived from the local port's transmit frequency), an elastic buffer (see Figure 11-21 on page 439) is incorporated in the very early stages of the receive path.
Symbols arrive at the differential receiver as a bit stream and are presented to the Deserializer. The receive PLL recovers the clock (Rx Clock) embedded in the bit stream and the Deserializer converts the incoming bit stream into a series of 10-bit symbols. The symbols are clocked into the input side of the Elastic Buffer using the Rx Clock recovered from the incoming bit stream and are clocked out of the buffer using the receiver's local clock. As previously cited, these two clocks can be as much as 600ppm out of sync with each other.
The Elastic Buffer compensates for the difference between the two clocks by either deleting a SKP symbol from or inserting a SKP symbol into the symbols contained in the Elastic Buffer:
  • If the transmit clock frequency is greater than the receive clock frequency by up to 600 ppm, a SKP symbol is deleted from the buffer.
  • If the transmit clock frequency is less than the receive clock frequency by up to 600 ppm, a SKP symbol is added to the buffer.
The transmitter on the other end periodically transmits a special symbol sequence called the SKIP Ordered-Set (see Figure 11-19 on page 437 and "Inserting Clock Compensation Zones" on page 436) from which the "don't care" SKP symbol can be deleted or to which a "don't care" SKP symbol can be added. The SKIP Ordered-Set consists of four control symbols (a COM and three SKPs; the Skips are the "don't care" characters, hence the name "Skip"). Deleting or adding a SKP symbol to the SKIP Ordered-Set in the Elastic Buffer prevents a buffer overflow or underflow condition, respectively.
Loss of symbol(s) caused by Elastic Buffer overflow or underflow triggers a Receiver Error indication to the Data Link Layer and results in the automatic initiation of Link error recovery.
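Conceptually, the Elastic Buffer is a small FIFO written with the Rx Clock and read with the Local Clock, and its fill level is nudged back toward the midpoint whenever a SKIP Ordered-Set passes through. The sketch below is purely illustrative; the buffer depth, thresholds, and insertion point are invented for the example and are not specification values:

from collections import deque

BUF_DEPTH = 16                  # illustrative FIFO depth, not a spec value
elastic_buf = deque()           # written on the Rx Clock, read on the Local Clock

def on_skip_ordered_set_in_buffer():
    # Called while a SKIP Ordered-Set (COM plus three SKPs) is sitting in the buffer.
    if len(elastic_buf) > (3 * BUF_DEPTH) // 4:
        elastic_buf.remove('SKP')                            # Tx clock faster: delete one SKP
    elif len(elastic_buf) < BUF_DEPTH // 4:
        elastic_buf.insert(elastic_buf.index('SKP'), 'SKP')  # Tx clock slower: insert one SKP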
The transmitter schedules a SKIP Ordered-Set transmission once every 1180 to 1538 symbol times. However, if the transmitter starts a maximum sized TLP transmission right at the 1538 symbol time boundary when a SKIP Ordered-Set is scheduled to be transmitted, the SKIP Ordered-Set transmission is deferred. Receivers must be tolerant to receive and process SKIP Ordered-Sets that have a maximum separation dependent on the maximum packet payload size a device supports. The formula for the maximum number of Symbols ( n ) between SKIP Ordered-Sets is:
n = 1538 + (maximum packet payload size + 28)
28 is the number of symbols associated with the header (16 bytes), the optional ECRC (4 bytes), the LCRC (4 bytes), the sequence number (2 bytes) and framing symbols STP and END (2 bytes).
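For illustration, applying this formula with a 4096-byte maximum payload (the largest payload size PCI Express permits) gives:

def max_symbols_between_skips(max_payload_bytes):
    # n = 1538 + (maximum packet payload size + 28); the 28 symbols are itemized above.
    return 1538 + (max_payload_bytes + 28)

print(max_symbols_between_skips(4096))    # -> 5662 symbol times between SKIP Ordered-Sets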

Lane-to-Lane De-Skew

Not a Problem on a Single-Lane Link

The problem of Lane-to-Lane skew is obviously only an issue on multi-Lane Links.

Flight Time Varies from Lane-to-Lane

Symbols are transmitted simultaneously on all Lanes using the same transmit clock, but they cannot be expected to arrive at the receiver at the same time (i.e., without Lane-to-Lane skew). A multi-Lane Link may have many sources of Lane-to-Lane skew. These sources include but are not limited to:
  • Chip differential drivers and receivers.
  • Printed wiring board impedance variations.
  • Lane wire length mismatches.
  • Delays injected by the serialization and de-serialization logic.
When the byte-striped serial bit streams associated with a packet arrive on all Lanes at the receiver, it must remove this Lane-to-Lane skew in order to receive and process the data correctly. This process is referred to as Link deskew. Receivers use TS1 or TS2 Ordered-Sets during Link training or FTS Ordered-Sets during L0s exit to perform Link de-skew functions.

If Lane Data Is Not Aligned, Byte Unstriping Wouldn't Work

Havoc would ensue if the symbols transmitted on each Lane simultaneously were to arrive at each Lane receiver at different times and were then de-serialized and fed to the Byte Unstriping Logic. Gibberish would be fed to the Link Layer as packet data.

TS1/TS2 or FTS Ordered-Sets Used to De-Skew Link

The unique structure and length of the TS1/TS2 and FTS sets, and the fact that they are transmitted simultaneously on all Lanes, are used by the receiver's De-Skew logic to determine the amount of misalignment between Lanes. The specification doesn't define the method used to achieve multi-Lane alignment. As an example, the receiver logic could compensate for the misalignment by tuning an automatic delay circuit in each Lane's receiver (see Figure 11-21 on page 439 and Figure 11-22 on page 445).
The receiver must be capable of de-skewing up to 20ns of Lane-to-Lane skew as defined by the LRX-SKEW parameter shown in Table 12-2 on page 480.
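The specification leaves the alignment method to the designer. As one illustrative possibility (not the specification's mechanism), a receiver could buffer each Lane, note where the COM of the simultaneously transmitted TS1/TS2 or FTS set landed on each Lane, and delay the early Lanes until the COMs line up:

def deskew(lanes):
    # lanes: list of per-Lane symbol lists, each containing the 'COM' that starts a
    # simultaneously transmitted TS1/TS2 or FTS Ordered-Set. Returns the streams
    # re-aligned so that COM sits at the same position on every Lane.
    offsets = [lane.index('COM') for lane in lanes]    # where the COM arrived on each Lane
    latest = max(offsets)                              # the most-delayed Lane sets the pace
    # Pad (delay) each earlier Lane; real hardware would tune a per-Lane delay element
    # rather than insert placeholder symbols.
    return [['PAD'] * (latest - off) + lane for lane, off in zip(lanes, offsets)]

# Two-Lane example: Lane 1's symbols arrive one symbol time later than Lane 0's.
print(deskew([['COM', 'D0'], ['X', 'COM', 'D1']]))
# -> [['PAD', 'COM', 'D0'], ['X', 'COM', 'D1']]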

De-Skew During Link Training, Retraining and L0s Exit

TS1 and TS2 Ordered-Sets are only transmitted during initial Link training or during Link retraining (i.e., recovery). FTS Ordered-Sets are transmitted during L0s exit. De-skew is therefore only performed by the receiver at those times and is not done on a periodic basis.

Lane-to-Lane De-Skew Capability of Receiver

The Lane-to-Lane de-skew parameter LRX-SKEW shown in Table 12-2 on page 480 requires that the receiver be capable of de-skewing Lane delays of up to 20 ns. The transmitter is allowed to introduce up to 1.3 ns of Lane-to-Lane skew at its output pads, as defined by the LTX-SKEW parameter (see Table 12-1 on page 477).
Figure 11-22: Receiver's Link De-Skew Logic

8b/10b Decoder

General

Refer to Figure 11-23 on page 447. Each receiver Lane incorporates an 8b/10b Decoder which is fed from the Elastic Buffer. The 8b/10b Decoder uses two lookup tables (the D and K tables) to decode the 10-bit symbol stream into 8-bit Data (D) or Control (K) characters plus the D/K# signal. The state of the D/K# signal indicates whether the received symbol is:
  • A Data (D) character if a match for the received symbol is discovered in the D table. D/K# is driven High.
  • A Control (K) character if a match for the received symbol is discovered in the K table. D/K# is driven Low.

Disparity Calculator

The decoder determines the initial disparity value from the disparity of the first symbol received. Once the disparity has been initialized, the decoder expects the calculated disparity to toggle between + and - with each subsequent symbol received, unless the received symbol has neutral disparity, in which case the disparity keeps its current value.

Code Violation and Disparity Error Detection

General. The error detection logic of the 8b/10b Decoder detects errors in the received symbol stream. It should be noted that it doesn't catch all possible transmission errors. The specification requires that these errors be detected and reported as a Receiver Error indication to the Data Link Layer. The two types of errors detected are:
  • Code violation errors (i.e., a 10-bit symbol could not be decoded into a valid 8-bit Data or Control character).
  • Disparity errors.
There is no automatic hardware error correction for these errors at the Physical Layer.
Code Violations. The following conditions represent code violations (a minimal check is sketched after the list):
  • Any 6-bit sub-block containing more than four 1s or four 0s is in error.
  • Any 4-bit sub-block containing more than three 1s or three 0s is in error.
  • Any 10-bit symbol containing more than six 1s or six 0s is in error.
  • Any 10-bit symbol containing more than five consecutive 1s or five consecutive 0 s is in error.
  • Any 10-bit symbol that doesn't decode into an 8-bit character is in error.
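Below is a minimal sketch of these checks applied to a single 10-bit symbol, assuming the symbol is given as a string of '0'/'1' characters with the 6-bit sub-block first; the valid-code lookup of the last rule is represented by a hypothetical valid_codes set rather than the full D and K tables:

def has_code_violation(sym, valid_codes):
    # sym: 10-bit symbol, e.g. '1010101001' (6-bit sub-block then 4-bit sub-block).
    # valid_codes: set of all legal 10-bit codes from the D and K tables (not listed here).
    six, four = sym[:6], sym[6:]
    ones6, ones4, ones10 = six.count('1'), four.count('1'), sym.count('1')
    if ones6 > 4 or (6 - ones6) > 4:          # 6-bit sub-block with more than four 1s or 0s
        return True
    if ones4 > 3 or (4 - ones4) > 3:          # 4-bit sub-block with more than three 1s or 0s
        return True
    if ones10 > 6 or (10 - ones10) > 6:       # symbol with more than six 1s or 0s
        return True
    if '111111' in sym or '000000' in sym:    # more than five consecutive identical bits
        return True
    return sym not in valid_codes             # symbol not found in the D or K table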

Disparity Errors

A character whose 10-bit code has other-than-neutral disparity is encoded using the one of its two 10-bit codes whose polarity is opposite to that of the Current Running Disparity (CRD).
If the next symbol does not have neutral disparity and its disparity is the same as the CRD, a disparity error is detected.
  • Some disparity errors may not be detectable until the subsequent symbol is processed (see Figure 11-24 on page 448).
  • If two bits in a symbol flip in error, the error may not be detected (and the symbol may decode into a valid 8-bit character). The error goes undetected at the Physical Layer.
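A simplified, symbol-level sketch of the running-disparity check (real 8b/10b hardware tracks disparity per sub-block, but the symbol-level view is enough to show the delayed detection illustrated in Figure 11-24):

def check_disparity(symbols, crd=-1):
    # symbols: 10-bit symbols as '0'/'1' strings; crd: starting running disparity (+1 or -1).
    errors = []
    for i, sym in enumerate(symbols):
        ones = sym.count('1')
        if ones == 5:                         # neutral symbol: running disparity unchanged
            continue
        disparity = +1 if ones > 5 else -1
        if disparity == crd:                  # same polarity as the CRD: disparity error
            errors.append(i)
        crd = disparity                       # the symbol's imbalance sets the new CRD
    return errors

# The bit streams of Figure 11-24: the error in the first symbol is caught at the third.
print(check_disparity(['1010101011', '0101010101', '1110101010']))   # -> [2]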
Figure 11-23: 8b/10b Decoder per Lane


Figure 11-24: Example of Delayed Disparity Error Detection
The figure steps through three transmitted characters, showing the running disparity (CRD) before and after each one:

Transmitted character stream:  (-)  D21.1        (-)  D10.2        (-)  D23.5        (+)
Transmitted bit stream:        (-)  101010 1001  (-)  010101 0101  (-)  111010 1010  (+)
Bit stream after error:        (-)  101010 1011  (+)  010101 0101  (+)  111010 1010  (+)
Decoded character stream:      (-)  D21.0        (+)  D10.2        (+)  Invalid      (+)

The error occurs in the first symbol (D21.1 is received as D21.0, flipping the running disparity), but it is not detected until the third symbol, whose received code is invalid for the current running disparity.

De-Scrambler

The De-Scrambler is fed by the 8b/10b Decoder block. The De-Scrambler only de-scrambles Data (D) characters associated with a TLP or DLLP (D/K# is high). It does not de-scramble Control (K) characters or Ordered-Sets. K characters and Ordered-Sets sourced from the 8b/10b decoder are valid as is.

Some De-Scrambler implementation Rules:

  • On a multi-Lane Link, De-Scramblers associated with each Lane must operate in concert, maintaining the same simultaneous value in each LFSR.
  • De-scrambling is applied to 'D' characters associated with TLPs and DLLPs, including the Logical Idle (00h) sequence. 'D' characters within TS1 and TS2 Ordered-Sets are not de-scrambled.
  • K characters and Ordered-Set characters are not de-scrambled. These characters bypass the de-scrambler logic.
  • Compliance Pattern related characters are not de-scrambled.
  • When a COM character enters the De-Scrambler, it initializes the LFSR. The initial value of the 16-bit LFSR is FFFFh.
  • With one exception, the LFSR serially advances eight times for every character (D or K) received. The LFSR does NOT advance on SKP characters associated with received SKIP Ordered-Sets, because there may be a difference between the number of SKP characters transmitted and the number of SKP characters exiting the Elastic Buffer (as discussed in "Receiver Clock Compensation Logic" on page 442). A small sketch of this LFSR behavior follows the list.
  • By default, the De-Scrambler is always enabled. The specification does allow the De-Scrambler to be disabled for test and debug purposes. However the specification does not provide a standard software method or configuration register-related method for disabling the De-Scrambler.
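The following is a minimal sketch of the LFSR behavior these rules describe. It assumes the 16-bit scrambling polynomial G(X) = X^16 + X^5 + X^4 + X^3 + 1 used by PCI Express (the polynomial itself is not stated in the rules above), realizes it in a Galois-style form, and glosses over the exact bit ordering of the scramble byte and over whether the re-seeding COM itself advances the LFSR:

def advance_lfsr(lfsr):
    # One serial shift of the 16-bit LFSR for G(X) = X^16 + X^5 + X^4 + X^3 + 1 (Galois form).
    carry = (lfsr >> 15) & 1
    lfsr = (lfsr << 1) & 0xFFFF
    if carry:
        lfsr ^= 0x0039                         # X^5 + X^4 + X^3 + 1
    return lfsr

def descramble(characters, kinds):
    # characters: 8-bit values; kinds: parallel list of 'D', 'K', or 'SKP' tags.
    lfsr = 0xFFFF
    out = []
    for ch, kind in zip(characters, kinds):
        if kind == 'SKP':
            out.append(ch)                     # no de-scrambling, and the LFSR is not advanced
            continue
        if kind == 'K':
            out.append(ch)                     # K characters pass through as-is
            if ch == 0xBC:                     # COM (K28.5) re-initializes the LFSR to FFFFh
                lfsr = 0xFFFF
        else:
            out.append(ch ^ (lfsr >> 8))       # D character XORed with the LFSR output byte
        for _ in range(8):                     # eight serial advances per D or K character
            lfsr = advance_lfsr(lfsr)
    return out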

Disabling De-Scrambling

If the receiver De-Scrambler receives at least two TS1/TS2 Ordered-Sets with the disable scrambling bit set from the remote device on all of its configured Lanes, it disables its De-Scrambler.

Byte Un-Striping

Figure 11-25 on page 449 illustrates an example of eight decoded 8-bit character streams from the eight De-Scramblers of a x8 Link being un-striped into a single byte stream which is fed to the Filter logic (see the next section).
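A sketch of the un-striping step for that example: bytes that were striped round-robin across the Lanes at the transmitter are simply re-interleaved back into one stream.

def unstripe(lanes):
    # lanes: list of per-Lane byte lists (Lane 0 first), as produced by the De-Scramblers.
    # Re-interleaves them round-robin into the single byte stream handed to the Filter logic.
    stream = []
    for group in zip(*lanes):                 # one byte from each Lane per symbol time
        stream.extend(group)
    return stream

# x8 example: byte N was transmitted on Lane N % 8, so un-striping restores 0, 1, 2, ...
lanes = [[n for n in range(16) if n % 8 == lane] for lane in range(8)]
print(unstripe(lanes))                        # -> [0, 1, 2, ..., 15]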
Figure 11-25: Example of x8 Byte Un-Striping

Filter and Packet Alignment Check

The serial byte stream supplied by the byte un-striping logic contains TLPs, DLLPs, Logical Idle sequences, Control characters such as STP, SDP, END, EDB, and PAD, as well as the various types of Ordered-Sets. Of these, the Logical Idle sequences, the control characters, and the Ordered-Sets are detected and eliminated. What remains are TLPs and DLLPs, which are sent to the Rx Buffer along with indications of the start and end of each TLP and DLLP.

Receive Buffer (Rx Buffer)

The Rx Buffer holds received TLPs and DLLPs after the start and end characters have been eliminated. The received packets are ready to send to the Data Link Layer.
The interface between the Physical Layer and Data Link Layer is not specified, so the designer is free to choose the width of the data bus that connects them. As an example, assume the interface clock is 250 MHz. The width of the data bus connecting the Data Link Layer to the Physical Layer can then be the number of Lanes supported by the device multiplied by eight bits.

Physical Layer Error Handling

When the Physical Layer logic detects an error, it sends a Receiver Error indication to the Data Link Layer. The specification lists a few of these errors, but it is far from being an exhaustive error list. It is up to the designer to determine what Physical Layer errors to detect and report.
Some of these errors include:
  • 8b/10b Decoder-related disparity errors (described in "Disparity Errors" on page 447). This check is required.
  • 8b/10b Decoder-related code violation errors (described in "Code Violations" on page 446). This check is required.
  • Elastic Buffer overflow or underflow caused by loss of symbol(s) (described in "The Elastic Buffer's Role in the Receiver" on page 442).
  • The packet received is not consistent with the packet format rules described in "General Packet Format Rules" on page 411. This condition is optionally checked.
  • Loss of Symbol Lock (see "Symbol Boundary Sensing (Symbol Lock)" on page 441).
  • Loss of Lane-to-Lane de-skew (see "Lane-to-Lane De-Skew" on page 444).

Response of Data Link Layer to 'Receiver Error' Indication

If the Physical Layer indicates a Receiver Error to the Data Link Layer, the Data Link Layer discards the TLP currently being received and frees any storage allocated for the TLP. The Data Link Layer schedules a NAK DLLP for transmission back to the transmitter of the TLP. Doing so automatically causes the transmitter device to replay TLPs from the Replay Buffer, resulting in possible auto-correction of the error. The Data Link Layer may also direct the Physical Layer to initiate Link re-training (i.e., link recovery).
Detected Link errors may also result in the Physical Layer initiating the Link retraining (recovery) process.
In addition, the device that detects a Receiver Error sets the Receiver Error Status bit in the Correctable Error Status register (see Figure 24-20 on page 936) of the PCI Express Extended Advanced Error Capabilities register set. If enabled to do so, the device sends an ERR_COR (correctable error) message to the Root Complex (see "Advanced Uncorrectable Error Handling" on page 386 for details on error logging and reporting).

12 Electrical Physical Layer

The Previous Chapter

The previous chapter described:
  • The logical Physical Layer core logic and how an outbound packet is processed before clocking the packet out differentially.
  • How an inbound packet arriving from the Link is processed and sent to the Data Link Layer.
  • Sub-block functions of the Physical Layer such as Byte Striping and un-striping logic, Scrambler and De-Scrambler, 8b/10b Encoder and decoder, Elastic Buffers and more.

This Chapter

This chapter describes the Physical Layer's electrical interface to the link. It describes the analog characteristics of the differential drivers and receivers that connect a PCI Express device to the Link. Timing and driver/receiver parameters are documented here.

The Next Chapter

The next chapter describes the three types of reset, namely: cold reset, warm reset and hot reset. It also describes the usage of a side-band reset signal called PERST#. The effect of reset on devices and the system is described.

Electrical Physical Layer Overview

The electrical sub-block associated with each Lane (see Figure 12-1 on page 454) provides the physical interface to the Link. This sub-block contains differential drivers (transmitters) and differential receivers. The transmitter serializes outbound symbols on each Lane and converts the bit stream to electrical signals that have an embedded clock. The receiver detects electrical signaling on each Lane, generates a serial bit stream that it de-serializes into symbols, and supplies the symbol stream to the logical Physical Layer along with the clock recovered from the inbound serial bit stream.
In the future, this sub-block could be redesigned to support a cable interface or an optical (i.e., fiber) interface.
In addition, the electrical Physical Layer contains a Phase Lock Loop (PLL) that drives the Serializer in the transmitter and a receiver PLL that is sync'd to the transitions in the incoming serial symbol stream.
Figure 12-1: Electrical Sub-Block of the Physical Layer
When the Link is in the L0 full-on state, the differential drivers drive the differential voltage associated with a logical 1 or logical 0 while driving the correct DC common mode voltage. The receivers sense differential voltages that indicate a logical 1 or 0 and, in addition, can sense the electrical idle state of the Link. An eye diagram clearly illustrates the electrical characteristics of a driver and receiver and addresses signaling voltage levels, skew, and jitter issues.
The electrical Physical Layer is responsible for placing the differential drivers, differential receivers, and the Link in the correct state when the Link is placed in a low power state such as L0s, L1, or L2. While in the L2 low power state, a device can signal a wake-up event upstream via a Beacon signaling mechanism.
The differential drivers support signal de-emphasis (or pre-emphasis; see "De-Emphasis (or Pre-Emphasis)" on page 466) to help reduce the bit error rate (BER), especially on a lossy Link.
The drivers and receivers are short-circuit tolerant, making them ideally suited for hot insertion and removal events. The Link connecting two devices is AC coupled. A capacitor at the transmitter side of the Link DC de-couples it from the receiver. As a result, two devices at opposite ends of a Link can have their own ground and power planes. See Figure 12-1 on page 454 for the capacitor (CTX) placement on the Link.

High Speed Electrical Signaling

Refer to Figure 12-2. High-speed LVDS (Low-Voltage Differential Signaling) electrical signaling is used in driver and receiver implementations. Drivers and receivers from different manufacturers must be inter-operable and may be designed to be hot-pluggable. A standard FR4 board can be used to route the Link wires. The following sections describe the electrical characteristics of the driver, receiver, and the Link represented in the Figure.
Figure 12-2: Differential Transmitter/Receiver

Clock Requirements

General

The transmitter clocks data out at 2.5 Gbits/s. The clock used to do so must be accurate to within +/- 300 ppm of the center frequency. It is allowed to skew by a maximum of 1 clock every 1666 clocks. The two devices at opposite ends of a Link could have their transmit clocks offset from one another by as much as 600 ppm.
A device may derive its clock from an external clock source. The system board supplies a 100MHz clock that is made available to devices on the system board as well as to add-in cards via the connector. With the aid of PLLs, a device may generate its required clocks from this 100MHz clock.

Spread Spectrum Clocking (SSC)

Spread spectrum clocking is a technique used to modulate the clock frequency slowly so as to reduce EMI radiated noise at the center frequency of the clock. With SSC, the radiated energy does not produce a noise spike at 2.5GHz because the radiated energy is spread over a small frequency range around 2.5GHz .
SSC is not required by the specification. However, if supported, the following rules apply:
  • The clock can be modulated by +0% to -0.5% from the nominal frequency of 2.5 GHz.
  • The modulation rate must be between 30KHz and 33KHz .
  • The +/- 300 ppm requirement for clock frequency accuracy still holds. Further, the maximum 600 ppm frequency variation between the two devices at opposite ends of a Link also remains true. This almost certainly imposes a requirement that the two devices at opposite ends of the Link be driven from the same clock source when the clock is modulated with SSC.

Impedance and Termination

The characteristic impedance of the Link is 100Ohms differential (nominal), while single-ended DC common mode impedance is 50Ohms . This impedance is matched to the transmitter and receiver impedances.

Transmitter Impedance Requirements

Transmitters must meet the ZTX-DIFF-DC parameter (see Table 12-1 on page 477) anytime differential signals are transmitted during the full-on L0 power state.
When a differential signal is not driven (e.g., in the lower power states), the transmitter may keep its output impedance at the minimum ZTX-DC (see Table 12-1 on page 477) of 40 Ohms, but may also place the driver in a high impedance state. Placing a driver in the high impedance state may be helpful in the L0s or L1 low power states to help reduce power drain.

Receiver Impedance Requirements

The receiver is required to meet the ZRX-DIFF-DC  (see Table 12-2 on page 480) parameter of 100Ohms anytime differential signals are transmitted during the full-on L0 power state, as well as in all other lower power states wherein adequate power is provided to the device. A receiver is excluded from this impedance requirement when the device is powered down (e.g., in the L2 and L3 power states and during Fundamental Reset).
When a receiver is powered down to the L2 or L3 state, or during Fundamental Reset, its receiver terminations go to the high impedance state and must meet the ZRX-HIGH-IMP-DC parameter of 200 kOhms minimum (see Table 12-2 on page 480).

DC Common Mode Voltages

Transmitter DC Common Mode Voltage

Once driven after power-on and during the Detect state of Link training, the transmitter DC common mode voltage VTX-DC-CM  (see Table 12-1 on page 477) must remain at the same voltage. The common mode voltage is turned off only when the transmitter is placed in the L2 or L3 low power state, during which main power to the device is removed. A designer can choose any common mode voltage in the range of 0V to 3.6V .

Receiver DC Common Mode Voltage

The receiver is DC de-coupled from the transmitter by a capacitor. This allows the receiver to have its own DC common mode voltage. This voltage is specified at 0 V. The specification is unclear about the meaning of this 0 V receiver DC common mode voltage requirement and does not require the common mode voltage to be 0 V at the input to the receiver differential amplifier. Rather, a simple bias voltage network allows the receiver to operate at its optimal common mode voltage. See Figure 12-3 on page 458.


Figure 12-3: Receiver DC Common Mode Voltage Requirement

ESD and Short Circuit Requirements

All signals and power pins must withstand (without damage) a 2000 V ElectroStatic Discharge (ESD) using the human body model and 500 V using the charged device model. For more details on this topic, see the JEDEC JESD22-A114-A specification.
The ESD requirement not only protects against electro-static damage, but also facilitates support of surprise hot insertion and removal events. Transmitters and receivers are also required to be short-circuit tolerant. They must be able to withstand a sustained short-circuit current (on D+ or D- to ground) of ITX-SHORT (see Table 12-1 on page 477), on the order of 90 mA (the maximum current a transmitter is required to provide).

Receiver Detection

General

The Detect block in the transmitter shown in Figure 12-2 on page 455 is required to detect the presence or absence of a receiver at the other end of the Link after coming out of reset or power-on. The Detect state of the Link Training state machine is responsible for making this determination.
Detection is accomplished when the transmitter changes the DC common mode voltage from one value to another. By design, the transmitter detect logic has knowledge of the rate at which the lines charge with or without a receiver.

With a Receiver Attached

With a receiver attached at the other end of the Link, the charge time (RC time constant) is relatively long due to the large coupling capacitor (CTX). See the lower half of Figure 12-4 on page 460.
Charge time constant ≈ ZTX × (CTX + Cinterconnect + Cpad) => large value

Without a Receiver Attached

Without a receiver attached at the other end of the Link, the charge time (RC time constant) is relatively short because the large coupling capacitor (CTX) does NOT come into play. See the upper half of Figure 12-4 on page 460.
Charge time constant ≈ ZTX × (Cinterconnect + Cpad) => small value

Procedure To Detect Presence or Absence of Receiver

  1. After reset or power-up, the transmitter drives a stable voltage on the D+ and D- terminals. This can be VDD (3.6 V), Ground, or any common mode voltage in between VDD and Ground.
  2. The transmitter changes the common mode voltage:
  • If the initial common mode voltage is VDD ,then it drives the voltage towards Ground.
  • If the initial common mode voltage is Ground, then it drives the voltage towards VDD .
  • If the initial common mode voltage is between VDD and Ground, the transmitter drives the voltage toward either rail, away from the initial common mode voltage.
  3. The transmitter detects the presence of a receiver by determining the charge time:
  • A Receiver is present if the charge time is long.
  • A Receiver is absent if the charge time is short.
Figure 12-4: Receiver Detection Mechanism
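A rough numeric illustration of why the two cases are easy to tell apart. The coupling capacitor value lies within the CTX range given later in Table 12-1; the driver impedance and the parasitic capacitance are placeholder values chosen only to show the difference in order of magnitude:

Z_TX = 50             # ohms, single-ended driver impedance (placeholder value)
C_TX = 100e-9         # farads, AC coupling capacitor (within the 75 nF to 200 nF range)
C_PARASITIC = 10e-12  # farads, interconnect plus pad capacitance (placeholder value)

tau_no_receiver   = Z_TX * C_PARASITIC            # only the small parasitics charge
tau_with_receiver = Z_TX * (C_TX + C_PARASITIC)   # the large coupling capacitor dominates

print(f"no receiver:   {tau_no_receiver * 1e9:.2f} ns")    # ~0.5 ns
print(f"with receiver: {tau_with_receiver * 1e6:.2f} us")  # ~5 us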

Differential Drivers and Receivers

Differential signaling (as opposed to the single-ended signaling employed in PCI and PCI-X) is ideal for high frequency signaling.

Advantages of Differential Signaling

Some of the advantages of differential signaling (versus single-ended signaling) are:
  • Can achieve higher frequency transmission rate because the signal swing is smaller.
  • Less EMI noise emitted due to noise cancellation of D+ signal emission with D- signal emission.
  • Noise immunity, because noise that couples into one signal also couples into the other and therefore cancels when the receiver takes the difference between the two.
  • Can signal three signal states: logical 1, logical 0 and electrical Idle.
  • Smaller signal swing means less power consumption on the Link.

Differential Voltages

The differential driver uses NRZ encoding to drive the serial bit stream. The differential driver output consists of two signals, D+ and D-. A logical 1 is signaled by driving the D+ signal high and the D- signal low, creating a positive voltage difference between the D+ and D- signals. A logical 0 is signaled by driving the D+ signal low and the D- signal high, creating a negative voltage difference between the D+ and D- signals.
The differential peak-to-peak voltage driven by the transmitter VTX-DIFFp-p  (see Table 12-1 on page 477) is between 800mV (minimum) and 1200mV (max).
  • Logical 1 is signaled with a positive differential voltage.
  • Logical 0 is signaled with a negative differential voltage.
During the Link electrical Idle state, the transmitter drives a differential peak voltage VTX-IDLE-DIFFp  (see Table 12-1 on page 477) of between 0mV and 20mV . In this state, the transmitter may be in the low- or high-impedance state.
The receiver is able to sense a logical 1, a logical 0, as well as the electrical idle state of the Link by detecting the voltage on the Link via a differential receiver amplifier. Due to signal loss along the Link at high frequency, the receiver must be designed to sense an attenuated version of the differential signal driven by the transmitter. The receiver sensitivity is defined by the differential peak-to-peak voltage VRX-DIFFp-p (see Table 12-2 on page 480) of between 175 mV and 1200 mV.

Differential Voltage Notation

General. A differential signal voltage is defined by taking the difference between the voltages on the two conductors, D+ and D-. The voltage with respect to ground on each conductor is VD+ and VD-, respectively. The differential voltage is VDIFF = VD+ - VD-. The Common Mode voltage, VCM, is defined as the mean voltage of D+ and D-: VCM = (VD+ + VD-)/2.
In defining differential voltages, the specification uses two parameters: 1) The differential peak-to-peak voltage, and 2) the Differential peak voltage. These voltages are defined by the following equations and are illustrated in Figure 12-5 on page 463.
  • Differential Peak Voltage: VDIFFp = max|VD+ - VD-|. Assumes a symmetric signal swing.
  • Differential Peak-to-Peak Voltage: VDIFFp-p = 2 × max|VD+ - VD-|. Assumes a symmetric signal swing.
  • Peak Common Mode Voltage: VCMp = max|VD+ + VD-| / 2.
Differential Peak Voltage. The differential peak voltage is easily represented in a diagram as the differential voltage associated with signaling a logical 1 or logical 0 .
Differential Peak-to-Peak Voltage. The differential peak-to-peak voltage is not easily represented in a diagram. One can think of the differential peak-to-peak voltage as the sum total of the differential voltage for signaling a logical 1 and for signaling a logical 0 . One can think of this voltage as the total voltage swing a receiver experiences between receiving a logical 1 and receiving a logical 0 .
Common Mode Voltage. The common mode voltage is the center voltage with respect to ground at which the D+ and D- signals cross over one another, assuming the two signals are symmetric. When a differential driver does not drive a differential voltage, it drives a common mode voltage, with both the D+ and D- signals at the same voltage (the signals do not swing).
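A short numeric restatement of the definitions above, using an assumed symmetric swing in which D+ and D- each move between 0.2 V and 0.6 V (illustrative values, not specification limits):

vd_plus, vd_minus = 0.6, 0.2            # D+ high, D- low while signaling a logical 1

v_diff    = vd_plus - vd_minus          # VDIFF = VD+ - VD-          -> +0.4 V (a logical 1)
v_cm      = (vd_plus + vd_minus) / 2    # VCM = (VD+ + VD-) / 2      -> 0.4 V
v_diff_p  = abs(v_diff)                 # VDIFFp (peak)              -> 0.4 V
v_diff_pp = 2 * v_diff_p                # VDIFFp-p (peak-to-peak)    -> 0.8 V
print(f"VDIFFp = {v_diff_p:.1f} V, VDIFFp-p = {v_diff_pp:.1f} V, VCM = {v_cm:.1f} V")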
Figure 12-5: Pictorial Representation of Differential Peak-to-Peak and Differential Peak Voltages


Electrical Idle

The electrical idle state of the Link is the state wherein the transmitter D+ and D- voltages are held at a steady, constant voltage (the common mode voltage). This state is used in power savings states such as L0s and L1, as well as the Link Inactive or Link Disable states.
Figure 12-6: Electrical Idle Ordered-Set

Transmitter Responsibility

A transmitter that wishes to place a Link in the electrical Idle state must first transmit the electrical Idle Ordered-Set shown in Figure 12-6. After doing so, the transmitter must go to the electrical Idle state within the TTX-IDLE-SET-TO-IDLE time (see Table 12-1 on page 477), which is less than 20 UI (1 UI = 400 ps, so 20 UI = 8 ns). The differential peak voltage driven by the transmitter in the electrical Idle state is VTX-IDLE-DIFFp (see Table 12-1 on page 477), which is less than 20 mV peak.
The transmitter can then remain in the low impedance state or go to the high impedance state. Once in the electrical Idle state, the transmitter must remain in this state for a minimum of TTX-IDLE-MIN  (see Table 12-1 on page 477) which is 50 UI (20ns).
To exit electrical Idle and return the Link to the full-on L0 state when transmission resumes, the transmitter must do so within TTX-IDLE-TO-DIFF-DATA (see Table 12-1 on page 477), which is less than 20 UI (8 ns). The transmitter sends FTS Ordered-Sets or TS1/TS2 Ordered-Sets to transition the Link from the L0s or L1 state, respectively, back to the L0 full-on state.

Receiver Responsibility

A receiver determines that the Link is going to enter electrical Idle state when it sees two out of the three IDLs of the Ordered-Set. The receiver de-gates the error reporting logic to prevent reporting errors due to unreliable activity on the Link and also immediately arms its electrical Idle Exit detector.
A receiver is able to detect an exit from the electrical Idle state when it detects a differential peak-to-peak voltage on the Link greater than the VRX-IDLE-DET-DIFFp-p threshold (see Table 12-2 on page 480) of 65 mV. In the electrical Idle state, the receiver PLL will, over time, lose clock synchronization because the receiver input sits at a steady-state voltage. To exit the electrical Idle state, a transmitter sends FTS or TS1/TS2 Ordered-Sets that the receiver uses to re-acquire Bit Lock and Symbol Lock and to re-sync the receiver PLL with the transmitter.

Power Consumed When Link Is in Electrical Idle State

In the electrical Idle state, the Link consumes less power because no Link voltage transitions occur and the transmitter can de-gate its output stage. The Link is in either the L0s, L1, or Disabled state while it remains in the electrical Idle state. The recommended power consumed in L0s is less than 20 mW per Lane, while it is less than 5 mW per Lane in the L1 state. The recommended power consumed per Lane in L0 is on the order of 80 mW.

Electrical Idle Exit

A receiver detects electrical Idle exit when it receives a valid differential voltage within the VRX-DIFFp-p range of 175 mV to 1200 mV. A transmitter typically sends TS1 Ordered-Sets to signal electrical Idle exit to a receiver.

Transmission Line Loss on Link

The transmitter drives a minimum differential peak-to-peak voltage VTX-DIFFp-p of 800 mV. The receiver sensitivity is designed for a minimum differential peak-to-peak voltage (VRX-DIFFp-p) of 175 mV. This translates to a 13.2 dB loss budget that a Link is designed for. Although a board designer can determine the attenuation loss budget of a Link plotted against various frequencies, the transmitter and receiver eye diagram measurements are the ultimate determinant of the loss budget for a Link. Eye diagrams are described in "LVDS Eye Diagram" on page 470. A transmitter that drives up to the maximum allowed differential peak-to-peak voltage of 1200 mV can compensate for a lossy Link with worst-case attenuation characteristics.
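The 13.2 dB figure comes directly from the two peak-to-peak limits quoted above:

import math

v_tx_min = 800.0      # mV, minimum transmitted VTX-DIFFp-p
v_rx_min = 175.0      # mV, minimum receiver sensitivity VRX-DIFFp-p

loss_budget_db = 20 * math.log10(v_tx_min / v_rx_min)
print(f"{loss_budget_db:.1f} dB")                      # -> 13.2 dB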

AC Coupling

PCI Express requires that AC coupling capacitors be placed in close proximity to the transmitter on each Lane's differential signal pair. The AC coupling capacitor, CTX (see Table 12-1 on page 477), has a value between 75 nF and 200 nF. The capacitors can be integrated onto the system board, or integrated into the device itself. An add-in card with a PCI Express device on it must either place the capacitors on the card in close proximity to the transmitter, or integrate the capacitors into the PCI Express silicon.
The AC coupling capacitors eliminate DC common mode voltage sharing between two devices at opposite ends of the Link. This simplifies the device design by allowing each device to operate with its own transmitter DC common voltage. Each device can operate with its own power and ground plane, independent of the remote device at the opposite end of the Link.

De-Emphasis (or Pre-Emphasis)

PCI Express employs the concept of de-emphasis to help reduce the effect of the inter-symbol interference that may occur, especially on more lossy Link transmission lines. Supporting this mandatory feature reduces the Bit Error Rate (BER).

What is De-Emphasis?

A transmitted differential signal is de-emphasized when multiple bits of the same polarity are transmitted back-to-back, as shown in Figure 12-7 on page 467. The figure shows a transmission of '1000010000'. Some rules related to signal de-emphasis are:
  • An individual bit (one that has the opposite polarity of the preceding bit) is not de-emphasized. It is transmitted at the peak-to-peak differential voltage specified by VTX-DIFFp-p (see Table 12-1 on page 477).
  • The first bit of a series of same polarity bits is also not de-emphasized.
  • Only subsequent bits of the same polarity after the first bit (of the same polarity) are de-emphasized.
  • The de-emphasized voltage is nominally 3.5 dB (anywhere in the 3 dB to 4 dB range is acceptable) below the un-de-emphasized voltage VTX-DIFFp-p-MIN (see Table 12-1 on page 477). The de-emphasized voltage works out to roughly 300 mV differential peak-to-peak less than the 800 mV minimum: 566 mV (3 dB) >= VTX-DEEMPH-DIFFp-p-MIN >= 505 mV (4 dB) (see Table 12-1 on page 477); see the short calculation after this list.
  • The Beacon signal is de-emphasized according to a slightly different rule. See "Beacon Signaling" on page 469.
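The voltage range quoted above follows from applying 3 dB to 4 dB of de-emphasis to the 800 mV minimum swing:

v_full = 800.0                                  # mV, VTX-DIFFp-p minimum for un-de-emphasized bits

for db in (3.0, 3.5, 4.0):
    v_deemph = v_full * 10 ** (-db / 20)
    print(f"{db} dB de-emphasis -> {v_deemph:.0f} mV")
# -> 566 mV (3 dB), 535 mV (3.5 dB nominal), 505 mV (4 dB)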
Figure 12-7: Transmission with De-emphasis

What is the Problem Addressed By De-emphasis?

As bit transmission frequencies increase, the bit time or Unit Interval (UI) decreases. At the 2.5 Gbit/s transmission rate, the Unit Interval is very small (400 ps), and the capacitive effects of the Link transmission line become more apparent. The line capacitances (Cpad + Cinterconnect + CTX) store charge. When a signal has been held at a constant differential voltage (as in transmission of successive bits of the same polarity), the line capacitances charge up. The line does not easily change voltage when the signal polarity has to flip immediately to the opposite value. This results in what is referred to as inter-symbol interference.
Consider the example in Figure 12-8 on page 468 wherein a transmitter sends the bit pattern '111101111'. The string of the first four logical 1s charges the line capacitors. When the transmitter follows this string with a logical 0 , the capacitors cannot discharge fast enough and then charge to the opposite polarity, so that the receiver will register the logical 0 . The result is inter-symbol interference at the receiver. A receiver eye diagram would show the 'lonely' logical 0 with a narrower eye.


Figure 12-8: Problem of Inter-Symbol Interference

Solution

Rather than thinking of each subsequent same-polarity bit as being de-emphasized by 3.5 dB (de-emphasis is the term the PCI Express specification prefers), one can equivalently think of the first bit of a string of same-polarity bits as being pre-emphasized by 3.5 dB.
Figure 12-9: Solution is Pre-emphasis
Consider the solution in Figure 12-9. By pre-emphasizing the 'lonely' logical 0 bit, the transmitter is given sufficient additional drive strength to overcome the capacitive effect of the previous string of logical 1s.
PCI Express device receivers are designed to detect differential signals that are attenuated by the Link transmission line by as much as 11 dB to 13.2 dB from the transmitted value. The de-emphasis requirement for the transmitted signal is designed to accommodate systems with Link transmission lines that have this worst-case loss budget. Of course, for lower-loss systems, there is more voltage margin at a receiver that receives a de-emphasized signal.

Beacon Signaling

General

A PCI Express device that is in the L2 low power state can generate a wake up event to inform the system that it wishes to move to the full-on L0 state. The Beacon signaling mechanism is one of two methods a device may employ to accomplish this. The other method (see "WAKE#" on page 696) is via the assertion of the WAKE# signal (if it is supported by the device).
While a device is in the L2 power state, its main power source and clock are turned off (as described on page 484). However, an auxiliary power source (Vaux) keeps a limited portion of the device powered, including the wake-up signaling logic.
When in the L2 low power state, a downstream device signals a Beacon wake up signal upstream to start the L2 exit sequence. If a switch or bridge receives the Beacon signal on its downstream port, it must forward the wake up event to its upstream port. This can be done by either forwarding the Beacon signal to the upstream port or by using WAKE# assertion to the power management logic. See "WAKE# (AUX Power)" on page 643.
When a device's Link power state is L2, even though the main power to the device is turned off, a limited portion of the device is powered by Vaux. The powered portion of the device allows it to signal the wake-up event via the Beacon. An upstream device such as a switch, bridge, or Root Complex that is also in the L2 power state is able to sense the Beacon because the receiver's Beacon detection logic is also powered by Vaux.

Properties of the Beacon Signal

  • It is a relatively low frequency, DC balanced differential signal consisting of periodic arbitrary data wherein the pulse width of the signal is at least 2ns but no greater than 16μs . A low frequency differential sine wave may suffice.
  • The maximum time between pulses can be no larger than 16μs.
  • The transmitted Beacon signal must meet the electrical voltage specifications documented in Table 12-1 on page 477.
  • The signal must be DC balanced within a maximum time of 32μs .
  • Beacon signaling, like normal differential signaling, must be done with the transmitter in the low impedance mode (50 Ohm single-ended, 100 Ohms differential impedance).
  • When signaled, the Beacon signal must be transmitted on Lane 0, but does not have to be transmitted on other Lanes.
  • With one exception, the transmitted Beacon signal must be de-emphasized according to the rules defined in the previous section. For Beacon pulses greater than 500 ns, the Beacon signal voltage must be de-emphasized by 6 dB from the VTX-DIFFp-p specification. The Beacon signal voltage may be de-emphasized by up to 3.5 dB for Beacon pulses smaller than 500 ns.

LVDS Eye Diagram

Jitter, Noise, and Signal Attenuation

As the bit stream travels from the transmitter on one end of a link to the receiver on the other end, it is subject to the following disruptive influences:
  • Deterministic (i.e., predictable) jitter induced by the Link transmission line.
  • Data-dependent jitter induced by the dynamic data patterns on the Link.
  • Noise induced into the signal pair.
  • Signal attenuation due to the impedance effects of the transmission line.

The Eye Test

Refer to Figure 12-10 on page 472. In order to ensure that the differential receiver receives an in-specification signal, an eye test is performed. The following description of the eye diagram was provided by James Edwards from an article he authored for OE Magazine. The author of this book has added some additional comments [in brackets].
"The most common time domain measurement for a transmission system is the eye diagram. The eye diagram is a plot of data points repetitively sampled from a pseudo-random bit sequence and displayed by an oscilloscope. The time window of observation is two data periods wide. For a [PCI Express link running at 2.5 Gbits/s], the period is 400 ps, and the time window is set to 800 ps. The oscilloscope sweep is triggered by every data clock pulse. An eye diagram allows the user to observe system performance on a single plot.
To observe every possible data combination, the oscilloscope must operate like a multiple-exposure camera. The digital oscilloscope's display persistence is set to infinite. With each clock trigger, a new waveform is measured and overlaid upon all previous measured waveforms. To enhance the interpretation of the composite image, digital oscilloscopes can assign different colors to convey information on the number of occurrences of the waveforms that occupy the same pixel on the display, a process known as color-grading. Modern digital sampling oscilloscopes include the ability to make a large number of automated measurements to fully characterize the various eye parameters."
The oscilloscope is set for infinite-persistence and a pattern generator is set up to generate a pseudo-random data pattern.

Optimal Eye

The most ideal reading would paint an eye pattern such as that shown in the center of Figure 12-10 on page 472 (labelled "Optimal Eye Opening"). It should be noted, however, that as long as the pattern painted resides totally within the region noted as "Normal," the transmitter and Link are within tolerance. Note that in these eye diagrams, the differential voltage parameters and values shown are peak differential voltages as opposed to peak-to-peak voltages documented in the specification. This is done because peak differential voltages can be represented in an eye diagram whereas peak-to-peak differential voltages cannot be represented in an eye diagram. See Figure 12-13 on page 475 for an example oscilloscope screen capture of an optimal eye.

Jitter Widens or Narrows the Eye Sideways

Refer to Figure 12-11 on page 473. Jitter will cause a signal edge to occur either before or after its position in the "Optimal Eye Opening," resulting in an eye opening wider or narrower horizontally than the optimal width. Once again, as long as the amount of jitter doesn't cause the pattern to extend beyond the normal zone, it is still within tolerance. The jitter specification JT (see Table 12-1 on page 477) is a maximum of 0.3 UI. See Figure 12-14 on page 476 for an example oscilloscope screen capture of an eye diagram showing how out-of-spec jitter causes horizontal widening or narrowing of the eye.

Noise and Signal Attenuation Heighten the Eye

Refer to Figure 12-12 on page 474. Noise or signal attenuation will cause the signal's voltage level to overshoot or undershoot the "Optimal Eye Opening" zone. As long as the amount of undershoot or overshoot doesn't cause the window height to dip below or extend above the normal zone, it is still within tolerance. See Figure 12-14 on page 476 for an example oscilloscope screen capture of an eye diagram showing how significant noise or signal attenuation causes vertical widening or narrowing of the eye.
Figure 12-10: LVDS (Low-Voltage Differential Signal) Transmitter Eye Diagram
Figure 12-11: Transmitter Eye Diagram Jitter Indication


Figure 12-12: Transmitter Eye Diagram Noise/Attenuation Indication


Figure 12-13: Screen Capture of a Normal Eye (With no De-emphasis Shown)


Figure 12-14: Screen Capture of a Bad Eye Showing Effect of Jitter, Noise and Signal Attenuation (With no De-emphasis Shown)

Transmitter Driver Characteristics

General

Table 12-1 on this page lists the transmitter driver characteristics.
Table 12-1: Output Driver Characteristics
Item                    | Max.   | Min.   | Units | Notes
UI                      | 400.12 | 399.88 | ps    | Unit Interval = the bit time; 400 ps nominal.
TTX-EYE                 |        | 0.7    | UI    | Minimum eye width, from which the maximum jitter can be derived: JT = 1 - TTX-EYE.
JT                      | 0.3    |        | UI    | Maximum jitter spec shown in Figure 12-11 on page 473.
TTX-RISE, TTX-FALL      |        | 0.125  | UI    | Rise and fall time of the differential signal, measured at the 20%/80% voltage points.
VTX-DIFFp-p             | 1200   | 800    | mV    | Peak-to-peak differential voltage.
VTX-DIFFp               | 600    | 400    | mV    | Half of VTX-DIFFp-p.
VTX-DC-CM               | 3.6    | 0      | V     | DC common mode voltage.
VTX-DEEMPH-DIFFp-p-MIN  | 566    | 505    | mV    | Range of minimum differential peak-to-peak voltages for de-emphasized bits; a 3 dB to 4 dB de-emphasis from the un-de-emphasized VTX-DIFFp-p-MIN of 800 mV.
ITX-SHORT               | 90     |        | mA    | Total current the transmitter can provide when shorted to ground.
VTX-IDLE-DIFFp          | 20     | 0      | mV    | Peak differential voltage during the electrical Idle state of the Link.
TTX-IDLE-MIN            |        | 50     | UI    | Minimum time a transmitter must remain in electrical Idle.
TTX-IDLE-SET-TO-IDLE    | 20     |        | UI    | Time allowed for the transmitter to meet the electrical Idle specification after sending the electrical Idle Ordered-Set.
TTX-IDLE-TO-DIFF-DATA   | 20     |        | UI    | Maximum time allowed for the transmitter to meet the differential transmission specification after electrical Idle exit.
ZTX-DIFF-DC             | 120    | 80     | Ohms  | Transmitter DC differential mode (low) impedance; 100 Ohms typical.
ZTX-DC                  |        | 40     | Ohms  | Required minimum D+ and D- line impedance during all power states.
CTX                     | 200    | 75     | nF    | AC coupling capacitor on each Lane, placed in close proximity to the transmitter.
LTX-SKEW                | 1.3    |        | ns    | Maximum Lane-to-Lane skew at the transmitter between any two Lanes.

Transmit Driver Compliance Test and Measurement Load

The AC timing and voltage parameters shown in Table 12-1 on page 477 are measured within 0.2 inches of the package pins, driving the test load shown in Figure 12-15.
A 50 Ohm probe, or a resistor to ground, attached to the transmit signal pair causes the device to enter the Compliance state of the Link Training and Status State Machine (LTSSM) (see "Polling.Compliance Substate" on page 517). During this state, the device outputs the compliance pattern, which can be used for interoperability testing, EMI noise testing, Lane-to-Lane interference testing, Bit Error Rate determination, transmit and receive eye measurements, etc.
Figure 12-15: Compliance Test/Measurement Load

Input Receiver Characteristics

Table 12-2 on this page lists the input receiver characteristics. The receiver Eye Diagram in Figure 12-16 on page 481 illustrates some of the parameters listed in Table 12-2.
Table 12-2: Input Receiver Characteristics
Item                 | Max.   | Min.   | Units  | Notes
UI                   | 400.12 | 399.88 | ps     | Unit Interval = the bit time; 400 ps nominal.
TRX-EYE              |        | 0.4    | UI     | Minimum eye width, from which the maximum jitter is derived: JT = 1 - TRX-EYE.
JT                   | 0.3    |        | UI     | Maximum jitter spec.
VRX-DIFFp-p          | 1200   | 175    | mV     | Peak-to-peak differential voltage sensitivity of the receiver.
VRX-DIFFp            | 600    | 88     | mV     | Half of VRX-DIFFp-p.
VRX-IDLE-DET-DIFFp-p | 175    | 65     | mV     | Electrical Idle detect threshold voltage; any voltage less than 65 mV peak-to-peak implies that the Link is in electrical Idle.
ZRX-DIFF-DC          | 120    | 80     | Ohms   | Receiver DC differential mode impedance; 100 Ohms nominal.
ZRX-DC               | 60     | 40     | Ohms   | Required D+ and D- line impedance during all power states.
ZRX-HIGH-IMP-DC      |        | 200    | kOhms  | Required minimum D+ and D- line impedance when the receiver terminations do not have power (e.g., in the L2 low power state or during Fundamental Reset).
LRX-SKEW             | 20     |        | ns     | Lane-to-Lane skew that a receiver must be able to compensate for.


Figure 12-16: Receiver Eye Diagram

Electrical Physical Layer State in Power States

Figure 12-17 on page 482 through Figure 12-21 on page 486 illustrate the electrical state of the Physical Layer while the link is in various power management states.

Figure 12-18: L0s Low Power Link State

Figure 12-19: L1 Low Power Link State
Figure 12-20: L2 Low Power Link State
Figure 12-21: L3 Link Off State

13 System Reset

The Previous Chapter

The previous chapter describes the Electrical Physical Layer. It describes the analog characteristics of the differential drivers and receivers that connect a PCI Express device to the Link. Timing and driver/receiver parameters are documented in that chapter.

This Chapter

This chapter describes the three types of system reset: cold reset, warm reset, and hot reset. The chapter also describes the usage of a side-band reset signal called PERST#, and the usage of the TS1 Ordered-Set to generate an in-band Hot Reset. The effect of reset on a device and on the system is described.

The Next Chapter

The next chapter describes the function of the Link Training and Status State Machine (LTSSM) of the Physical Layer. The chapter describes the initialization process of the Link from Power-On or Reset, until the full-on L0 state, where traffic on the Link can begin. In addition, the chapter describes the lower power management states L0s, L1, L2, L3 and briefly describes entry and exit procedure to/from these states.

Two Categories of System Reset

The PCI Express specification describes two reset generation mechanisms. The first mechanism is a system generated reset referred to as Fundamental Reset. The second mechanism is an In-band Reset (communicated downstream via the Link from one device to another) referred to as the Hot Reset.

Fundamental Reset

Fundamental Reset causes a device's state machines, hardware logic, port states and configuration registers (except sticky registers of a device that can draw valid Vaux  ) to initialize to their default conditions.
There are two types of Fundamental Reset:
  • Cold Reset. This is a reset generated as a result of application of main power to the system.
  • Warm Reset. Triggered by hardware without the removal and re-application of main power. A Warm Reset could be triggered due to toggling of the system 'POWERGOOD' signal with the system power stable. The mechanism for generating a Warm Reset is not defined by specification. It is up to the system designer to optionally provide a mechanism to generate a Warm Reset.
When Fundamental Reset is asserted:
  • The receiver terminations are required to meet the ZRX-HIGH-IMP-DC  parameter of 200 kOhms minimum (see Table 12-2 on page 480).
  • The transmitter terminations are required to meet the minimum output impedance ZTX-DC (see Table 12-1 on page 477) of 40 Ohms, but the driver may also be placed in a high impedance state.
  • The transmitter holds a constant DC common mode voltage between 0V and 3.6V .
After Fundamental Reset Exit:
  • The receiver must re-enable its receiver terminations ZRX-DIFF-DC  (see Table 12-2 on page 480) of 100 Ohms within 5 ms of Fundamental Reset exit. The receiver is now ready to detect electrical signaling on the Link.
  • After Fundamental Reset exit, the Link Training state machine enters the 'Detect' state and the transmitter is ready to detect the presence of a receiver at the other end of the Link.
  • The transmitter holds a constant DC common mode voltage between 0V and 3.6V .

Methods of Signaling Fundamental Reset

Fundamental Reset may be signaled via an auxiliary side-band signal called PERST# (PCI Express Reset, asserted low). When PERST# is not provided to an add-in card or component, Fundamental Reset is generated autonomously by the component or add-in card.
Below is a description of the two mechanisms of Fundamental Reset generation.
PERST# Type Fundamental Reset Generation. A central resource device, e.g., a chipset, in the PCI Express system provides this source of reset. For example, the IO Controller Hub (ICH) chip in Figure 13-1 on page 490 may generate PERST#. The system power supply (not shown in the figure) generates a 'POWERGOOD' signal once main power is turned on and stable. The ICH reset logic in turn uses this signal, asserting PERST# while POWERGOOD (asserted High) is deasserted. If power is cycled, POWERGOOD toggles and causes PERST# to assert and deassert. This is a Cold Reset. If the system provides a method of toggling POWERGOOD without cycling power (e.g., via a button on the chassis), then PERST# also asserts and deasserts. This is a Warm Reset.
The PERST# signal feeds all PCI Express devices on the motherboard including the connectors and graphics controller. Devices may choose to use PERST# but are not required to use it as the source of reset.
The PERST# signal also feeds the PCI Express-to-PCI-X bridge shown in the figure. The bridge forwards this reset to the PCI-X bus as PCI-X bus RST#. ICH also generates PRST# for the PCI bus.
Autonomous Method of Fundamental Reset Generation. A device can be designed to generate its own Fundamental Reset upon detection of the application (or re-application) of main power. The specification does not describe the mechanism for doing so. The self-reset generation mechanism can be built into the device or may be designed as external logic, for example, on an add-in card that detects Power-On and generates a local reset to the device.
The device must also generate an autonomous Fundamental Reset if it detects that its power has gone outside the specified limits.
A device should support the autonomous method of triggering a Fundamental Reset, given that the specification is not clear about the requirement for system PERST# support.
Figure 13-1: PERST# Generation

In-Band Reset or Hot Reset

Hot Reset is propagated in-band via the transmission of TS1 Ordered-Sets (shown in Figure 13-2) with bit 0 of Symbol 5 in the TS1 Ordered-Set asserted. The TS1 Ordered-Set is transmitted on all Lanes with the correct Link # and Lane # symbols. These TS1 Ordered-Sets are transmitted continuously for 2 ms. Both the transmitter and the receiver of Hot Reset end up in the Detect state (see "Hot Reset State" on page 544). Hot Reset, in general, is a software-generated reset.
Figure 13-2: TS1 Ordered-Set Showing the Hot Reset Bit
Hot Reset is propagated downstream. Hot Reset is not propagated upstream. This means that only the Root Complex and Switches are able to generate Hot Reset. Endpoints do not generate Hot Reset. A switch that receives a Hot Reset TS1 Ordered-Set on its upstream port must pass it to all its downstream ports. In addition, the switch resets itself. All devices downstream of a switch that receive the Hot Reset TS1 Ordered-Set will reset themselves.

Response to Receiving a Hot Reset Command

When a device receives a Hot Reset command:
  • It goes to the 'Detect' Link state (via the Recovery and Hot Reset states) of the Link Training state machine and starts the Link training process, followed by initialization of VC0.
  • Its state machines, hardware logic, port states and configuration registers (except sticky registers) initialize to their default conditions.

Switches Generate Hot Reset on Their Downstream Ports

The following list indicates when a switch generates a Hot Reset on ALL of its downstream ports:
  • Switch receives a Hot Reset on its upstream port
  • The Data Link Layer of the switch upstream port reports a DL_Down state. This state occurs when the upstream port has been disconnected or when the upstream port has lost connection with an upstream device due to an error that is not recoverable by the Physical Layer and Data Link Layer.
  • Software sets the 'Secondary Bus Reset' bit of the Bridge Control configuration register associated with the upstream port.

Bridges Forward Hot Reset to the Secondary Bus

If a bridge such as a PCI Express-to-PCI(-X) bridge detects a Hot Reset on its upstream port, it must assert the PRST# signal on its secondary PCI(-X) bus.

How Does Software Tell a Device (e.g. Switch or Root Complex) to Generate Hot Reset?

Software tells a root complex or switch to generate a Hot Reset on a specific port by writing a 1 followed by 0 to the 'Secondary Bus Reset' bit in the Bridge Control register of that associated port's configuration header. See Figure 13-3 on page 493 for the location of this bit. Consider the example shown in Figure 13-4 on page 494. Software writes a 1 to the 'Secondary Bus Reset' register of Switch A's downstream left side port. Switch A generates a Hot Reset on that port by forwarding TS1 Ordered-Sets with the Hot Reset bit set. Switch A does not generate a Hot Reset on its right side port. Switch B receives this Hot Reset on its upstream port and forwards it on all downstream ports to the two endpoints.
If software writes to the 'Secondary Bus Reset' bit of the switch's upstream port, then the switch generates a Hot Reset on ALL its downstream ports. Consider the example shown in Figure 13-5 on page 495. Software writes a 1 to the 'Secondary Bus Reset' register of Switch C's upstream port. Switch C generates a Hot Reset on ALL downstream ports by forwarding TS1 Ordered-Sets with the Hot Reset bit set on both ports. The PCI Express-to-PCI bridge receives this Hot Reset and forwards it on to the PCI bus by asserting PRST#.
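In register terms, the write sequence is a simple read-modify-write of one bit. The following C sketch illustrates the idea; the configuration access helpers and the delay routine are assumed platform-provided, and the offsets follow the standard Type 1 (bridge) header layout, in which the Bridge Control register sits at offset 0x3E and Secondary Bus Reset is bit 6.

#include <stdint.h>

/* Assumed platform-provided configuration access and delay helpers. */
extern uint16_t pci_cfg_read16(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off);
extern void     pci_cfg_write16(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off,
                                uint16_t val);
extern void     delay_ms(unsigned int ms);

#define BRIDGE_CONTROL        0x3E          /* Type 1 header offset           */
#define SECONDARY_BUS_RESET   (1u << 6)     /* 'Secondary Bus Reset' bit      */

/* Pulse the Secondary Bus Reset bit on a Root Complex or Switch port.
 * Setting the bit makes the port transmit Hot Reset TS1 Ordered-Sets
 * downstream; clearing it lets the downstream Link retrain. */
void generate_hot_reset(uint8_t bus, uint8_t dev, uint8_t fn)
{
    uint16_t bc = pci_cfg_read16(bus, dev, fn, BRIDGE_CONTROL);

    pci_cfg_write16(bus, dev, fn, BRIDGE_CONTROL,
                    (uint16_t)(bc | SECONDARY_BUS_RESET));

    /* Hold time is not specified here; the text cites ~2ms of Hot Reset
     * TS1 transmission, so a short delay is used purely for illustration. */
    delay_ms(2);

    pci_cfg_write16(bus, dev, fn, BRIDGE_CONTROL,
                    (uint16_t)(bc & ~SECONDARY_BUS_RESET));
}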
A device is in the L0 state when the 'Secondary Bus Reset' bit is set. The device (upstream device) then goes through the Recovery state of the LTSSM (see "Recovery State" on page 532) before it generates the TS1 Ordered-Sets with the Hot Reset bit set and then enters the Hot Reset state (see "Hot Reset State" on page 544). The Hot Reset TS1 Ordered-Sets are generated continuously for 2ms and then the device exits to the Detect state where it is ready to start the Link training and initialization process.
The receiver (downstream device) of the Hot Reset TS1 Ordered-Sets enters the Hot Reset state through the Recovery state. It exits to the Detect state if it receives at least two Hot Reset TS1 Ordered-Sets. Both upstream and downstream devices are initialized and end up in the Detect state, from which they are ready to begin Link training and initialization. If the downstream device is a switch or bridge, it passes on the Hot Reset to its downstream ports or bus.
Figure 13-3: Secondary Bus Reset Register to Generate Hot Reset
Figure 13-4: Switch Generates Hot Reset on One Downstream Port
Figure 13-5: Switch Generates Hot Reset on All Downstream Ports

Reset Exit

After exiting the reset state, Link training and initialization must begin within 80ms. Devices may exit the reset state at different times, since reset signaling is asynchronous. This means that two devices on opposite ends of a Link that are reset may not start the Link training process at the same time.
After Link Training and Initialization, each device proceeds through Flow Control initialization for VC0, making it possible for TLPs and DLLPs to be transferred across the Link.
To allow components that have been reset to perform internal initialization, system software must wait at least 100ms from the end of a reset (cold/warm/hot) before issuing Configuration Requests to PCI Express devices. To be software visible, devices must be ready to receive Configuration Requests 100ms after the end of Reset. The specification does not address how software measures this 100ms wait time. It could be as simple as software running a timing loop at the end of which the first Configuration Request is initiated.
If software initiates a configuration request to a device after the 100ms wait time from the end of Reset and the device is not done with its internal initialization, it must at least return a completion TLP with a completion status of "Configuration Request Retry Status" (CRS). The completion is returned to the root complex that initiated the configuration request on behalf of the CPU. The completion TLP terminates the configuration request. The root complex may either re-issue the configuration request itself or have the CPU retry the request.
The Root Complex and/or system software must allow 1.0 second (+50%/-0%) after a reset before it may determine that a device which fails to return a successful Completion status for a valid Configuration Request is broken. This delay is analogous to the Trhfa parameter specified for PCI/PCI-X, and is intended to allow sufficient time for devices to complete self initialization.
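As a rough illustration of these timing rules, the sketch below shows what the software side might look like. The helpers cfg_read_vendor_id() and delay_ms() are hypothetical; the 100ms initial wait and the roughly 1.0 second give-up point come from the text above.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers: a millisecond delay, and a Vendor ID configuration
 * read that also reports whether the completion came back with CRS status. */
extern void delay_ms(unsigned int ms);
extern bool cfg_read_vendor_id(uint8_t bus, uint8_t dev, uint8_t fn,
                               uint16_t *vendor_id, bool *was_crs);

/* Returns true if the device became configurable, false if it appears broken. */
bool wait_for_device_after_reset(uint8_t bus, uint8_t dev, uint8_t fn)
{
    uint16_t vid;
    bool crs;
    unsigned int elapsed_ms = 100;

    delay_ms(100);                        /* minimum wait after the end of reset */

    for (;;) {
        /* 0xFFFF typically means no device responded at all. */
        if (cfg_read_vendor_id(bus, dev, fn, &vid, &crs) && !crs && vid != 0xFFFF)
            return true;                  /* successful completion: device ready */

        if (elapsed_ms >= 1000)           /* ~1.0 s (+50%/-0%) budget exhausted  */
            return false;                 /* treat the device as broken          */

        delay_ms(10);                     /* retry the Configuration Request     */
        elapsed_ms += 10;
    }
}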

Link Wakeup from L2 Low Power State

When a device's Link is in the L2 low power state, its main power is turned off though Vaux is still applied. A device returns to the full-on L0 power state by one of two methods: either the device signals a wakeup event, or Power Management software triggers the wakeup procedure (see "Waking Non-Communicating Links" on page 642).

Device Signals Wakeup

The powered-down device (device(s) are in D3Cold with Vaux valid) whose Link is in the L2 state is able to signal a wakeup event either by signaling a Beacon (see "Beacon Signaling" on page 469) upstream towards the root complex or by asserting WAKE#. The wakeup event ultimately results in the power controller re-applying power and clock to the device (or group of devices) that signaled the wakeup. The power controller also causes (either autonomously or under software control) a PERST# Reset to the device or group of devices whose power and clock have been re-applied. If the device does not support PERST#, it must autonomously generate its own Fundamental Reset when it senses main power re-applied. Upon exit from Fundamental Reset, the device proceeds with Link training and initialization. The device, which is now in the D0uninitialized state, can then send a PM_PME message TLP upstream to the root complex to inform Power Management software of the wakeup event.

Power Management Software Generates Wakeup Event

Power Management software can wake up a device or group of devices (device(s) are in D3Cold) whose Link is in the L2 state. Power Management software causes the power controller to re-apply power and clock to the device or group of devices. The power controller also causes (either autonomously or under software control) a PERST# Reset. If the device does not support PERST#, it must autonomously generate its own Fundamental Reset when it senses main power re-applied. Upon exit from Fundamental Reset, the device proceeds with Link training and initialization. The device is now in the D0uninitialized state. It will have to be configured to bring it to the D0 state.

14 Link Initialization & Training

The Previous Chapter

The previous chapter described three types of system reset generation capabilities: cold reset, warm reset and hot reset. The chapter also described the usage of the side-band reset signal PERST#. The effect of reset on a device and system was described.

This Chapter

This chapter describes the function of the Link Training and Status State Machine (LTSSM) of the Physical Layer. The chapter describes the initialization process of the Link from Power-On or Reset until the full-on L0 state, where traffic on the Link can begin. In addition, the chapter describes the lower power management states L0s, L1, L2, and L3, and briefly describes the entry and exit procedures to/from these states.

The Next Chapter

The next chapter describes the mechanical form factor for the PCI Express connector and add-in card. Different slot form factors are defined to support ×1, ×4, x8 and x16 Lane widths. In addition, the next chapter describes the Mini PCI Express form factor, which targets the mobile market; the Server IO Module (SIOM) form factor, which targets the workstation and server market; and the NEW-CARD form factor, which targets both mobile and desktop markets.

Link Initialization and Training Overview

General

Link initialization and training is a Physical Layer control process that configures and initializes a device's Physical Layer, port, and associated Link so that normal packet traffic can proceed on the Link. This process is automatically initiated after reset without any software involvement. A sub-set of the Link training and initialization process, referred to as Link re-training, is initiated automatically as a result of a wakeup event from a low power mode, or due to an error condition that renders the Link inoperable. The Link Training and Status State Machine (LTSSM) is the Physical Layer sub-block responsible for the Link training and initialization process (see Figure 14-1).
A receiver may optionally check for violations of the Link training and initialization protocol. If such an error occurs, it may be reported as a 'Link Training Error' to the error reporting logic (see "Link Errors" on page 379).
The following are configured during the Link training and initialization process:
  • Link Width is established and set. Two devices with a different number of port Lanes may be connected. For example, one device with a x2 port may be connected to a device with a ×4 port. During Link training and initialization, the Physical Layer of both devices determines and sets the Link width to the minimum Lane width of the two (i.e., x2). Other Link negotiated behaviors include Lane reversal, splitting of ports into multiple Links, and the configuration of a cross-Link.
  • Lane Reversal on a multi-Lane device's port (if reversal is required). The Lanes on a device's port are numbered by design. When wiring up a Link to connect two devices, a board designer should match up the lane numbers of each device's port so that Lane 0 of one device's port connects to Lane 0 of the remote device’s port,Lane n to Lane n of the remote device’s port,and so on.


Figure 14-1: Link Training and Status State Machine Location
Due to the way the Lanes are organized on the pins of the device's package, it may not be possible to match up the Lanes of the two devices without crisscrossing the wires (see Figure 14-2 on page 502). Crisscrossed wires will introduce interference into the Link. If, however, one or both of the devices support Lane Reversal, the designer could wire the Lanes in parallel fashion. During the Link training and initialization process, one device reverses the Lane numbering so the Lane numbers of the two ports match up (Figure 14-2 on page 502). Unfortunately, the specification does not require devices to support the Lane Reversal feature. Hence, the designer must verify that at least one of the two devices connected via a Link supports this feature before wiring the Lanes of the two ports in reverse order. If the device supports this feature, the Lane Reversal process may permit a multi-Lane Link to be split into multiple Links that connect to multiple devices. More on this feature later.
Figure 14-2: Example Showing Lane Reversal
  • Polarity Inversion may be necessary. The D+ and D- differential pair terminals for two devices may not be connected correctly, or may be intentionally reversed so that the signals do not crisscross when wiring the Link. If Lanes are wired with D+ and D- of one device connected to D- and D+ of the remote device, respectively, the Polarity Inversion feature reverses the D+ and D- signal polarities at the receiver differential terminal. Figure 14-3 illustrates the benefit of this feature on a ×1 Link. Support for Polarity Inversion is mandatory.
Figure 14-3: Example Showing Polarity Inversion
  • Link Data Rate. Link initialization and training is completed at the default 2.5Gbit/s Generation 1 data rate. In the future, Generation 2 PCI Express will support higher data rates of 5Gbit/s and 10Gbit/s. During training, each node advertises its highest data rate capability. The Link is then initialized with the highest common frequency that both neighbors can support (a small sketch following this list illustrates the width and rate outcomes).
  • Bit Lock. Before Link training begins, the receiver PLL is not yet sync'd with the remote transmitter's transmit clock, and the receiver is unable to differentiate between one received bit and another. During Link training, the receiver PLL is sync'd to the transmit clock and the receiver is then able to shift in the received serial bit stream. See "Achieving Bit Lock" on page 440.
  • Symbol Lock. Before training, the receiver has no way of discerning the boundary between two 10-bit symbols. During training, when TS1 and TS2 Ordered-Sets are exchanged, the receiver is able to locate the COM symbol (using its unique encoding) and uses it to initialize the deserializer. See "Symbol Boundary Sensing (Symbol Lock)" on page 441.
  • Lane-to-Lane De-skew. Due to Link wire length variations and the different driver/receiver characteristics on a multi-Lane Link, the parallel bit streams that represent a packet are transmitted simultaneously but do not arrive at the receivers on each Lane at the same time. The receiver circuit must compensate for this skew by adding or removing delays on each Lane so that the receiver can receive and align the serial bit streams of the packet (see "Lane-to-Lane De-Skew" on page 444). This de-skew feature, combined with the Polarity Inversion and Lane Reversal features, greatly simplifies the designer's task of wiring up the high speed Link.
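The outcome of the width and data rate negotiations described above can be summarized in a few lines of code. The sketch below is illustrative only; the real negotiation is carried out by the Physical Layer via the TS1/TS2 exchange described later in this chapter, and Link splitting and Lane Reversal are ignored here.

#include <stdint.h>

/* Negotiated width is the smaller of the two ports' widths (e.g., x2 vs. x4 -> x2). */
static unsigned int negotiated_width(unsigned int width_a, unsigned int width_b)
{
    return (width_a < width_b) ? width_a : width_b;
}

/* Each port advertises supported rates as a bit field in the TS1/TS2 Data Rate
 * Identifier (bit 1 = 2.5 Gb/s; higher bits reserved for future rates).
 * The Link runs at the highest rate advertised by BOTH ports. */
static uint8_t negotiated_rate_id(uint8_t rate_a, uint8_t rate_b)
{
    uint8_t common = rate_a & rate_b;

    for (int bit = 7; bit >= 1; bit--) {
        if (common & (1u << bit))
            return (uint8_t)(1u << bit);   /* 0x02 means Generation 1, 2.5 Gb/s */
    }
    return 0;
}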

Ordered-Sets Used During Link Training and Initialization

Physical Layer Packets (PLPs), referred to as Ordered-Sets, are exchanged between neighboring devices during the Link training and initialization process. These packets were briefly described in the section on "Ordered-Sets" on page 433. The five Ordered-Sets are:
  • Training Sequence 1 and 2 (TS1 and TS2),
  • Electrical Idle,
  • Fast Training Sequence (FTS), and
  • Skip (SKIP) Ordered-Sets.
Their character structure is summarized in Figure 14-4 on page 504.
Figure 14-4: Five Ordered-Sets Used in the Link Training and Initialization Process
(Figure detail - TS1/TS2 Symbol 5, Training Control: Bit 0: 0 = De-assert Hot Reset, 1 = Assert Hot Reset. Bit 1: 0 = De-assert Disable Link, 1 = Assert Disable Link. Bit 2: 0 = De-assert Loopback, 1 = Assert Loopback. Bit 3: 0 = De-assert Disable Scrambling, 1 = Assert Disable Scrambling. Bits 4:7: Reserved.)

TS1 and TS2 Ordered-Sets

The TS1 and TS2 Ordered-Sets each consist of 16 symbols. Structurally, there is not much difference between a TS1 and a TS2 Ordered-Set, other than the TS Identifier (symbols 6-15), which contains D10.2 for TS1 and D5.2 for TS2. They are exchanged during the Polling, Configuration and Recovery states of the LTSSM described in "Link Training and Status State Machine (LTSSM)" on page 508. The TS1 and TS2 symbols consist of:
  • Symbol 0 (COM): The K28.5 character identifies the start of an Ordered-Set. The receiver uses this character to achieve Bit Lock and Symbol Lock as described in "Achieving Bit Lock" on page 440 and "Symbol Boundary Sensing (Symbol Lock)" on page 441. By locating this character on a synchronized transmission of TS1 or TS2 Ordered-Sets on a multi-Lane Link, the receiver can de-skew the Lanes.
  • Symbol 1 (Link #): In the early stages of Link training, when the TS1 and TS2 Ordered-Sets are exchanged, this field contains the PAD symbol (transmitted as a null symbol). During the configuration state of the LTSSM, this field contains an assumed Link Number. TS1 and TS2 Ordered-Sets driven from different ports of a switch contain different Link Numbers.
  • Symbol 2 (Lane #): In the early stages of Link training when the TS1 and TS2 Ordered-Sets are exchanged, this field contains the PAD symbol (transmitted as a null symbol). During the configuration state of the LTSSM, this field contains an assumed Lane Number for each Lane of a Link. The TS1 and TS2 Ordered-Sets driven on each Lane of a given link contain different numbers.
  • Symbol 3 (N_FTS): Contains the number of Fast Training Sequences. The exchange of FTS Ordered-Sets is used to achieve Bit Lock and Symbol Lock when exiting from the L0s to the L0 power state. During Link training at Link initialization, when TS1 or TS2 Ordered-Sets are exchanged, the receiver sends the remote transmitter the N_FTS field to indicate how many FTS Ordered-Sets it must receive to reliably obtain Bit and Symbol Lock. Armed with this information, the transmitter sends at least that many FTS Ordered-Sets during exit from the L0s state. A typical value is between two and four. For example, N_FTS = 4 translates to 4 FTS Ordered-Sets × 4 symbols each = 16 symbols × 4ns/symbol = 64ns, the period of time it takes the receiver's PLL to achieve Bit and Symbol Lock during exit from the L0s state (this arithmetic is also captured in the sketch following Table 14-1). When the Extended Synch bit is set in the transmitter device, 4096 FTS Ordered-Sets must be sent in order to provide external Link monitoring tools with enough time to achieve Bit and Symbol Lock synchronization. During the FTS Ordered-Set exchange, if the N_FTS period of time expires and the receiver has not yet obtained Bit Lock, Symbol Lock, and Lane-to-Lane de-skew on all Lanes of the configured Link, the receiver must transition to the Recovery state of the LTSSM.
  • Symbol 4 (Rate ID): Each device informs its neighbor what data transfer rate it supports. A value of D2.0 indicates a 2.5Gbits/s transfer rate, while other values are currently reserved.
  • Symbol 5 (Training Control): A device that sends TS1 and TS2 Ordered-Sets uses this symbol to communicate additional information such as:
  • Bit 0, when set, indicates Hot Reset.
  • Bit 1, when set, indicates Disable Link.
  • Bit 2, when set, indicates Enable Loopback.
  • Bit 3, when set, indicates Disable Scrambling.
  • The remaining bits are reserved.
Only one bit can be set in this field per Ordered-Set.
  • Symbol 6-15 (Training Sequence ID): Driven with D10.2 for TS1 Ordered-Sets and D5.2 for TS2 Ordered-Sets.
See Table 14-1 for a summary of this information.
Table 14-1: Summary of TS1 and TS2 Ordered-Set Contents
Symbol Number | Allowed Value | Encoded Character Value | Description
0 | Comma | K28.5 | This is the COM (Comma) symbol.
1 | 0-255 | D0.0 - D31.7, K23.7 (PAD) | Link Number. Uses the PAD symbol when there is no Link Number to communicate.
2 | 0-31 | D0.0 - D31.0, K23.7 (PAD) | Lane Number. Uses the PAD symbol when there is no Lane Number to communicate.
3 | 0-255 | D0.0 - D31.7 | N_FTS. The number of FTS Ordered-Sets required by the receiver to obtain Bit and Symbol Lock during exit from the L0s state.
4 | 2 | D2.0 | Data Rate Identifier. Bit 0 = Reserved. Bit 1 = 1, Generation 1 (2.5Gbits/s). Bits 7:2 = Reserved.
5 | Bit 0 = 0,1; Bit 1 = 0,1; Bit 2 = 0,1; Bit 3 = 0,1; Bits 4:7 = 0 | D0.0, D1.0, D2.0, D4.0, D8.0 | Training Control. 0 = De-assert, 1 = Assert. Bit 0 - Hot Reset. Bit 1 - Disable Link. Bit 2 - Enable Loopback. Bit 3 - Disable Scrambling. Bits 4:7 - Reserved, set to 0.
6-15 | - | D10.2 for TS1 ID, D5.2 for TS2 ID | TS1/TS2 Ordered-Set Identifier.
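As a reading aid, the 16-symbol TS1/TS2 layout and the N_FTS timing arithmetic from the Symbol 3 description can be captured in a small sketch. The field and function names are ours, not the specification's; on the wire these are 10-bit 8b/10b symbols, not bytes.

#include <stdint.h>

/* Informal view of a TS1/TS2 Ordered-Set (16 ten-bit symbols). */
struct training_sequence {
    uint8_t com;          /* Symbol 0:     COM (K28.5)                     */
    uint8_t link_number;  /* Symbol 1:     Link # or PAD (K23.7)           */
    uint8_t lane_number;  /* Symbol 2:     Lane # or PAD (K23.7)           */
    uint8_t n_fts;        /* Symbol 3:     N_FTS                           */
    uint8_t rate_id;      /* Symbol 4:     Data Rate Identifier (D2.0)     */
    uint8_t train_ctl;    /* Symbol 5:     Training Control bits           */
    uint8_t ts_id[10];    /* Symbols 6-15: D10.2 (TS1) or D5.2 (TS2)       */
};

/* L0s exit time implied by an advertised N_FTS, using the numbers from the
 * text: 4 symbols per FTS Ordered-Set, 4 ns per symbol at 2.5 Gb/s.
 * Example: n_fts = 4 -> 4 x 4 symbols x 4 ns = 64 ns. */
static unsigned int l0s_exit_time_ns(unsigned int n_fts, int extended_synch)
{
    if (extended_synch)
        n_fts = 4096;             /* Extended Synch bit forces 4096 FTS sets */
    return n_fts * 4u * 4u;
}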

Electrical Idle Ordered-Set

The Electrical Idle Ordered-Set consists of four symbols, starting with the COM symbol and followed by three IDL symbols. This Ordered-Set is transmitted to a receiver prior to the transmitter placing its transmit half of the Link in the Electrical Idle state. The receiver detects this Ordered-Set, de-gates its error detection logic and prepares for the Link to go to the Electrical Idle state. Shortly after transmitting the Electrical Idle Ordered-Set, the transmitter drives a differential voltage of less than 20mV peak. For more details on Electrical Idle Ordered-Set usage and the Electrical Idle Link state, see "Electrical Idle" on page 464 .

FTS Ordered-Set

The FTS Ordered-Set consists of four symbols, starting with the COM symbol and followed by three FTS symbols. A transmitter that wishes to transition the state of its Link from the L0s low power state (Electrical Idle) to the L0 state sends a defined number of FTS Ordered-Sets to the receiver. The minimum number of FTS Ordered-Sets the transmitter must send is communicated to it by the receiver during Link training and initialization. See "TS1 and TS2 Ordered-Sets" on page 505, Symbol 3.


SKIP Ordered-Set

The Skip Ordered-Set consists of four symbols, starting with the COM symbol and followed by three SKP symbols. It is transmitted at regular intervals from the transmitter to the receiver and is used for Clock Tolerance Compensation as described in "Inserting Clock Compensation Zones" on page 436 and "Receiver Clock Compensation Logic" on page 442.

Link Training and Status State Machine (LTSSM)

General

Figure 14-5 on page 510 illustrates the top-level states of the Link Training and Status State Machine (LTSSM). Each state consists of substates that, taken together, comprise that state. The first LTSSM state entered after exiting Fundamental Reset (Cold or Warm Reset) or Hot Reset is the Detect state.
The LTSSM consists of 11 top-level states: Detect, Polling, Configuration, Recovery, L0, L0s, L1, L2, Hot Reset, Loopback, and Disable. These states can be grouped into five categories:
  1. The Link Training states,
  2. A Link Re-Training state,
  3. The Power Management states,
  4. The Active Power Management states, and
  5. Other states.
Upon any type of Reset exit, the flow of the LTSSM is through the following Link Training states:
Detect => Polling => Configuration => L0


If a Link error occurs while in the L0 state and it renders the Link inoperable, the LTSSM transitions to the Link Re-Training state (the Recovery state). In this state, the Link is retrained and returned to normal operation (i.e., the L0 state). A Link that is placed in a low power state (such as L1 or L2), returns to the L0 state via the Recovery state.
Without any high-level software involvement, if there are no packets transmitted on the Link (i.e., if the Logical Idle Sequence is transmitted) and a time-out occurs, the device may place its Link into a low power state such as L0s or L1. These are the Active Power Management states.
Power management software may place a device into one of the lower device power states such as D1, D2, D3Hot or D3Cold. Doing so causes the Link to transition from L0 to one of the lower Power Management states such as L1 or L2.
While in the Configuration or the Recovery state, a Link can be directed to enter the Disable state or the Loopback state. While in the Recovery state, a device that receives the Electrical Idle Ordered-Set transitions through the Hot Reset state before going to the Detect state. The three states (Disable, Loopback, and Hot Reset) are part of the Other states group.
Figure 14-5: Link Training and Status State Machine (LTSSM)

Overview of LTSSM States

Below is a brief description of the 11 LTSSM states:
  • Detect: This is the initial state after reset. While the spec states that the LTSSM may also enter the Detect state as directed by the Data Link Layer, it does not indicate under what circumstances this would occur. In this state, a device detects the presence or absence of a device connected at the far end of the Link. The Detect state may also be entered from a number of other LTSSM states as described later in this chapter.
  • Polling: The following conditions are established during the Polling state:
  - Bit Lock.
  - Symbol Lock.
  - Lane Polarity.
  - Lane Data Rate.
  - Compliance testing also occurs in this state.
During compliance testing, the transmitter outputs a specified compliance pattern. This is intended to be used with test equipment to verify that all of the voltage, noise emission and timing specifications are within tolerance. During the Polling state, a device transmits TS1 and TS2 Ordered-Sets and responds to received TS1 and TS2 Ordered-Sets. Higher bit rate support is advertised via the exchange of TS1 and TS2 Ordered-Sets with the Rate ID field = the highest supported rate.
  • Configuration: The following conditions are established during the Configuration state:
  - Link width.
  - Link Number.
  - Lane reversal.
  - Polarity inversion (if necessary).
  - Lane-to-Lane de-skew is performed.
Both transmitter and receiver are communicating at the negotiated data rate (as of 6/16/03, the Generation 1 data rate of 2.5Gb/s). During this state, scrambling can be disabled, the Disable and Loopback states can be entered, and the number of FTS Ordered-Sets required to transition from the L0s state to the L0 state is established.
  • L0: This is the normal, fully active state of a Link during which TLPs, DLLPs and PLPs can be transmitted and received.
  • Recovery: This state is entered from the L0 state due to an error that renders the Link inoperable. Recovery is also entered from the L1 state when the Link needs re-training before it transitions to the L0 state. In Recovery, Bit Lock and Symbol Lock are re-established in a manner similar to that used in
the Polling state. However, the time to transition through this state is much shorter than having to go through the Polling state and then transitioning to the L0 state. Lane-to-Lane de-skew is performed. The number of FTS Ordered-Sets required to transition from the L0s state to the L0 state is reestablished.
  • L0s: This is a low power, Active Power Management state. It takes a very short time (on the order of 50 ns) to transition from the L0s state back to the L0 state (because the LTSSM does not have to go through the Recovery state). This state is entered after a transmitter sends and the remote receiver receives Electrical Idle Ordered-Sets while in the L0 state. Exit from the L0s state to the L0 state involves sending and receiving FTS Ordered-Sets. When transitioning from L0s back to L0, Lane-to-Lane de-skew must be performed, and Bit and Symbol Lock must be re-established.
  • L1: This is an even lower power state than L0s. L1 exit latency (via Recovery) is longer compared to L0s exit latency (see "Electrical Physical Layer State in Power States" on page 481). Entry into L1 can occur in one of two ways:
  • The first is automatic and does not involve higher-level software. A device with no scheduled TLPs or DLLPs to transmit can automatically place its Link in the L1 state after first being in the L0 state (while the device remains in the D0 power state).
  • The second is as a result of commands received from the power management software placing a device into a lower power device state (D1, D2, or D3Hot). The device automatically places its Link in the L1 state.
  • L2: This is the lowest power state. Most of the transmitter and receiver logic is powered down (with the exception of the receiver termination, which must be powered for the receiver to be in a low impedance state). Main power and the clock are not guaranteed, though Vaux power is available. When Beacon support is required by the associated system or form factor specification, an upstream port that supports this wakeup capability must be able to send the Beacon signal and a downstream port must be able to detect the Beacon signal (see "Beacon Signaling" on page 469). Beacon signaling or the WAKE# signal is used by a device in the D3Cold state to trigger a system wakeup event (i.e., a request for main power supply re-activation). Another power state defined by the specification is the L3 state, but this state does not relate to the LTSSM states. The L3 Link state is the full-off state where Vaux power is not available. A device in L3 cannot trigger a wakeup event unless power is re-applied to the device through some other mechanism.
  • Loopback: This state is used as a test and fault isolation state. Only the entry and exit of this state are specified. The details of what occurs in this state are unspecified. Testing can occur on a per-Lane basis or on the entire configured Link. The Loopback Master device sends TS1 Ordered-Sets to the Loopback Slave with the Loopback bit set in the TS1 Training Control field. The Loopback Slave enters Loopback when it receives two consecutive TS1 Ordered-Sets with the Loopback bit set. How the Loopback Master enters the Loopback state is device specific. Once in the Loopback state, the Master can send any pattern of symbols, as long as the 8b/10b encoding rules are followed.
  • Disable: This state allows a configured Link to be disabled (e.g., due to a surprise removal of the remote device). In this state, the transmitter driver is in the electrical high impedance state and the receiver is enabled and in the low impedance state. Software commands a device to enter the Disable state by setting the Disable bit in the Link Control register. The device then transmits 16 TS1 Ordered-Sets with the Disable Link bit set in the TS1 Training Control field. A connected receiver is Disabled when it receives TS1 Ordered-Sets with the Disable Link bit set.
  • Hot Reset: This state is entered when directed to do so by a device's higher layer, or when a device receives two, consecutive TS1 Ordered-Sets with the Hot Reset bit set in the TS1 Training Control field (see "In-Band Reset or Hot Reset" on page 491).

Detailed Description of LTSSM States

The subsections that follow provide a description of each of the LTSSM states. Most of the 11 LTSSM states are divided into two or more substates. Substate diagrams are used in the discussions that follow to illustrate the substates.

Detect State

This state is the initial state at power-on time after a Fundamental Reset or after a Hot Reset command generated by the Software Layer. Entry into this state must occur within 80ms of Reset as described in "Reset Exit" on page 496. The Detect state can also be entered from the Disable, Loopback or L2 states. The Detect state is also entered if the Configuration, Recovery or Polling states do not complete successfully. Figure 14-6 shows the Detect substate machine.

Detect.Quiet SubState

Entry-
From Fundamental Reset or Hot Reset. Also from L2, Loopback, Disable, Polling, Configuration and Recovery states.


During Detect.Quiet-
  • The transmitter is in the Electrical Idle state. The Electrical Idle Ordered-Set does not have to be transmitted before placing the Link in the Electrical Idle state.
  • The transmitter drives a DC common mode voltage (it does not have to meet the 0-3.6V specification).
  • The 2.5Gbit/s (Generation 1) transfer rate is initialized (but this is not necessarily the rate that will be advertised via the TS1 and TS2 Ordered-Sets).
  • The Data Link Layer is sent LinkUp = 0.
Exit to Detect.Active-
After a 12ms timeout or when the Link exits the Electrical Idle state.

Detect.Active SubState

Entry from Detect.Quiet-
This state is entered after 12ms or when the Link exits the Electrical Idle state.
During Detect.Active-
The transmitter device detects whether receivers are connected on all Lanes of the Link. The transmitter starts at a stable DC common mode voltage on all Lanes. This voltage can be VDD, GND, or some other stable voltage in between. The transmitter then drives a DC common mode voltage other than the one currently presented. A receiver is detected based on the rate at which the D+ and D- lines charge to the new voltage. At design time, the device is designed with knowledge of the charge time needed to change the voltage (based on the assumed line impedance and transmitter impedance without receiver termination). With a receiver attached at the other end, the charge time will be longer than if no receiver is connected. For more details on the receiver detection process, see "Receiver Detection" on page 459.
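Conceptually, the detection decision reduces to a time comparison. The sketch below is purely illustrative; the measurement helper and the threshold value are hypothetical and would in practice be fixed at design time from the assumed line and termination impedances.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: drives the new common mode voltage on one Lane and
 * returns how long (in ns) the D+/D- pair took to settle at that voltage. */
extern uint32_t measure_charge_time_ns(unsigned int lane);

/* Design-time threshold: charge time expected with NO receiver termination
 * attached (hypothetical value). A longer measured time means the extra
 * load of a receiver termination is present at the far end. */
#define NO_RECEIVER_CHARGE_TIME_NS  100u

static bool receiver_present(unsigned int lane)
{
    return measure_charge_time_ns(lane) > NO_RECEIVER_CHARGE_TIME_NS;
}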
Exit to Detect.Quiet-
Occurs if a receiver is not detected. The loop from Detect.Quiet to Detect.Active is repeated every 12ms, as long as no receiver is attached. The next state is Polling if a receiver is detected on all unconfigured Lanes.
Exit to Polling—
If the device detects a receiver attached. The device must now drive a DC common mode voltage within the 0-3.6V VTX-CM-DC specification.
Special Case-
If not all Lanes of a device are connected to a receiver. For example, a x4 device is connected to a ×2 device. In that case, the device detects that some Lanes (two Lanes) are connected to a receiver, while others are not. Those Lanes connected to a receiver belong to one LTSSM. There are two choices at this point:


  • Those Lanes not connected to a receiver belong to another LTSSM (if they can operate as a separate Link; see "Designing Devices with Links that can be Merged" on page 522). The other LTSSM continues to repeat the receiver detection sequence described above.
  • Those Lanes that are not connected to a receiver and cannot become part of another Link and LTSSM must transition the unconnected Lanes to the Electrical Idle state.
Figure 14-6: Detect State Machine

Polling State

Introduction

This state is the first time in the Link training and initialization process that PLPs (such as TS1 and TS2 Ordered-Sets) are exchanged between the two connected devices. Figure 14-7 shows the substates of the Polling state machine.

Polling.Active SubState

Entry from Detect-
Transmitters drive a DC common mode voltage within the spec limits on all Lanes on which a receiver was detected.
Entry from Polling.Compliance-
If Electrical Idle exit is detected at the receiver on ALL Lanes that detected a receiver during Detect, the Transmitter exits Polling.Compliance by transmitting 1024 TS1 Ordered-Sets.
Entry from Polling.Speed-
While in Polling.Speed, the Transmitter enters the Electrical Idle state for a minimum of TTX-IDLE-MIN and no longer than 2ms. An Electrical Idle Ordered-Set is sent prior to entering the Electrical Idle state. The DC common mode voltage does not have to be within specification. The data rate is changed on all Lanes to the highest common data rate supported on both sides of the Link, as indicated by the training sequence.
During Polling.Active—
  • Bit/Symbol Lock are obtained as described in the next bullet (see "Symbol Boundary Sensing (Symbol Lock)" on page 441 and "Achieving Bit Lock" on page 440 for further details).
  • The transmitters of the two connected devices transmit a minimum of 1024 consecutive TS1 Ordered-Sets on all connected Lanes. The two devices come out of the Detect state at different times; hence, the TS1 Ordered-Set exchanges of the two devices are not synchronized with one another. The PAD symbol is used in the Lane and Link Number fields of the TS1 Ordered-Sets. 1024 TS1 Ordered-Sets amount to 64 µs of time in which to achieve Bit and Symbol Lock.
Exit to Polling.Configuration-
The next state will be Polling.Configuration if one of the following conditions is true (a small predicate sketch follows this list):
  • If a device receives eight consecutive TS1 or TS2 Ordered-Sets (or their complement due to polarity inversion) with Lane and Link set to the PAD symbol on ALL Lanes and at least 1024 TS1 Ordered Sets are transmitted, or
  • After a 24ms timeout, if:
  • A device receives eight consecutive TS1 or TS2 Ordered-Sets (or their complement) with the Lane and Link numbers set to PAD symbol on ANY Lanes that detected a receiver during Detect, AND
  • at least 1024 TS1 Ordered-Sets were transmitted, AND
  • all Lanes that detected a receiver detected an exit from Electrical Idle at least once since entering Polling.Active (this prevents one or more bad transmitters or receivers from holding up Link configuration).
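Expressed as a predicate, the exit decision looks roughly like the sketch below. The status flags are hypothetical inputs an LTSSM implementation would track; they simply restate the two conditions listed above.

#include <stdbool.h>

/* Hypothetical per-LTSSM status inputs. */
struct polling_status {
    bool pad_ts_on_all_lanes;     /* 8 consecutive TS1/TS2 with PAD Link/Lane # on ALL Lanes       */
    bool pad_ts_on_any_lane;      /* ...on ANY Lane that detected a receiver during Detect          */
    bool sent_1024_ts1;           /* at least 1024 TS1 Ordered-Sets transmitted                     */
    bool idle_exit_on_all_lanes;  /* every detected Lane saw an Electrical Idle exit at least once  */
    bool timeout_24ms;            /* 24 ms timer expired                                            */
};

static bool exit_to_polling_configuration(const struct polling_status *s)
{
    if (s->pad_ts_on_all_lanes && s->sent_1024_ts1)
        return true;

    return s->timeout_24ms &&
           s->pad_ts_on_any_lane &&
           s->sent_1024_ts1 &&
           s->idle_exit_on_all_lanes;
}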
Exit to Polling.Compliance-
If at least one Lane that detected a receiver during Detect has never detected an exit from Electrical Idle since entering Polling.Active (a passive test load such as a resistor on at least one Lane forces all Lanes into Polling.Compliance).
Exit to Detect-
If no TS1 or TS2 Ordered-Sets are received with the Link and Lane number fields set to the PAD symbol on any Lane. Also, the highest advertised speed must be lowered to generation 1 (if not already advertised as such).

Polling.Configuration SubState

Entry from Polling.Active-
Polling.Configuration is entered if either of the following two conditions is true:
  • If a device receives eight consecutive TS1 or TS2 Ordered-Sets (or their complement due to polarity inversion) with the Lane and Link numbers set to the PAD symbol on ALL Lanes and at least 1024 TS1 Ordered Sets are transmitted, or
  • After a 24ms timeout, if a device receives eight consecutive TS1 or TS2 Ordered-Sets (or their complement) with the Lane and Link numbers set to the PAD symbol on ANY Lanes that detected a receiver while in the Detect state, and at least 1024 TS1 Ordered-Sets were transmitted, AND all Lanes that detected a receiver detected an exit from Electrical Idle at least once since entering Polling.Active (this prevents one or more bad transmitters or receivers from holding up Link configuration).
During Polling.Configuration-
  • If a receiver sees the complement of the TS1/TS2 Ordered-Sets, it must invert the polarity of its differential input pair terminals. Basically, if D21.5 rather than D10.2 is received in the TS1 Ordered-Set, or if D26.5 rather than D5.2 is received in the TS2 Ordered-Set, then the receiver (not the transmitter) must invert its signal polarity. Polarity Inversion is a mandatory feature (see "Link Initialization and Training Overview" on page 500 for an example of Polarity Inversion) and must be implemented on all Lanes independently.
  • The Transmitter sends more than eight TS2 Ordered-Sets.
Exit to Configuration-
Assumes that no speed >2.5Gbits/s is identified in the Data Rate Identifier field of the TS2 Ordered-Set. After receiving eight consecutive TS2 Ordered-Sets and transmitting 16 TS2 Ordered-Sets after receiving one TS2 Ordered-Set, exit to Configuration.


Exit to Polling.Speed-
The next state is Polling.Speed after eight consecutive TS2 Ordered Sets, with Link and Lane numbers set to the PAD symbol (K23.7), are received on any Lanes that detected a Receiver during Detect, 16 TS2 Ordered Sets are transmitted after receiving one TS2 ordered set, and at least one of those same Lanes is transmitting and receiving a Data Rate Identifier greater than 2.5Gb/s .
Exit to Detect-
If neither of the two exit conditions is met, exit to Detect after a 48ms timeout.

Polling.Compliance SubState

Entry from Polling.Active-
The next substate is Polling.Compliance if at least one Lane that detected a receiver during Detect has never detected an exit from the Electrical Idle state on its receiver since entering Polling.Active (a passive test load, such as a resistor, on at least one Lane forces all Lanes into Polling.Compliance).
During Polling.Compliance-
A test probe (of 50 Ohm impedance) or a 50 Ohm impedance to ground hooked to the transmit pair on any Lane causes the device to enter Polling.Compliance (see "Transmit Driver Compliance Test and Measurement Load" on page 479). In this state, the device (acting as a pattern generator) is required to generate the compliance pattern on the Link. The compliance pattern selected produces the worst-case interference between neighboring Lanes and results in the worst-case EMI. Test equipment hooked to the Link is used to test for EMI noise, cross-talk, Bit Error Rate (BER), etc.
  • The Transmitter outputs the compliance pattern on all Lanes that detected a receiver during Detect. The pattern consists of the 8b/10b symbols K28.5, D21.5, K28.5, and D10.2. Current running disparity (CRD) must be set to negative before sending the first symbol.
  • No Skip Ordered-Sets are transmitted during Polling.Compliance.
Polling.Compliance Exit-
The compliance state is exited when an Electrical Idle exit is detected on all the Lanes that detected a receiver during Detect. The transmitter exits Polling.Compliance by transmitting 1024 TS1 Ordered-Sets.

Polling.Speed SubState

Entry from Polling.Configuration-
Polling.Speed is entered if:
  • Eight consecutive TS2 Ordered-Sets are received, and
  • 16 TS2 Ordered-Sets are transmitted after receiving one TS2, and
  • at least one of the Lanes is transmitting and receiving with a Data Rate Identifier in the TS2 Ordered-Set that is higher than 2.5Gb/s .
During Polling.Speed-
  • In this state, the transmitter enters the Electrical Idle state for at least 50 UI (20ns), but no longer than 2ms. An Electrical Idle Ordered-Set is sent prior to entering the Electrical Idle state, and the DC common mode voltage (VTX-CM-DC) does not have to be within the specified tolerance.
  • During this state, the data rate is changed on all Lanes to the highest common data rate supported by both ends of the Link.
Exit to Polling.Active-
This is the default.
Figure 14-7: Polling State Machine

Configuration State

General

The main function of this state is the assignment of Link Numbers and Lane Numbers to each Link that is connected to a different device. The Link is also de-skewed in this state.


An upstream device sends TS1 Ordered-Sets on all downstream Lanes. This starts the Link numbering and Lane numbering process. If the width determination and Lane numbering is completed successfully, then TS2 Ordered-Sets are transmitted to the neighboring device to confirm the Link Width, Link Number and Lane Number for each Link connected to a different device.
While in the Configuration state, the Link Training bit in the Link Status register is set by hardware (see Figure 14-22 on page 552). This bit is set on Root Complex ports and on downstream Switch ports. It is not set in Endpoints or in Switch upstream ports.
Figure 14-8: Configuration State Machine

Configuration.RcvrCfg SubState

Entry from Polling or Recovery-
This state is entered after the normal completion of the Polling state (as described in "Polling.Configuration SubState" on page 517). It is also entered if the Recovery state fails to complete successfully (as described in "Recovery State" on page 532).
During Configuration.RcvrCfg-
  • The Link Number of each Link connected to a unique device is negotiated.
  • The Lanes of each unique Link are numbered starting with Lane 0 . If necessary, the Lane Numbers are reversed.
  • Those Lanes that are not part of a new Link are disabled and enter the Electrical Idle state. Disabled Lanes are re-enabled if the device enters the Detect state again.
  • Each device advertises its N_FTS value in the TS1/TS2 Ordered-Sets it sends to the remote device.
  • A receiver uses the COM symbol in the received TS1 and TS2 Ordered-Sets to de-skew the Lanes of the Link (see "Lane-to-Lane De-Skew" on page 444).
Rather than go through a tedious process to explain the Configuration.RcvrCfg function, three examples are presented in "Examples That Demonstrate Configuration.RcvrCfg Function" on page 524. That section describes the Link Numbering and Lane Numbering procedure.
Exit to Configuration.Idle-
When the Link Numbering and Lane Numbering process has completed successfully.
Exit to Detect-
The next state is Detect if, after a 2ms timeout, no Link or Lanes could be configured, or if all Lanes receive two consecutive TS1 Ordered-Sets with the Link and Lane Number fields set to the PAD symbol.
Exit to Disable or Loopback-
If directed to enter the Disable or Loopback state by higher layers:
  • Software can inform a Loopback Master connected to the Link to enter the Loopback state in an implementation specific manner. The Loopback Master device continuously sends TS1 Ordered-Sets to the Loopback Slave with the Loopback bit set in the TS1 Training Control field until the Loopback Slave returns TS1 Ordered-Sets with the Loopback bit set. The Loopback Slave enters Loopback when it receives two consecutive TS1 Ordered-Sets with the Loopback bit set.
  • Similarly, software can command a device to enter the Disable state by setting the Disable bit in the Link Control register (see Figure 14-23 on page 553); a register-level sketch follows this list. This device (a downstream port) then transmits 16 TS1 Ordered-Sets with the Disable Link bit set in the TS1 Training Control field. A connected receiver (on an upstream port) is disabled when it receives TS1 Ordered-Sets with the Disable Link bit set.
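For the Disable case, the software action again amounts to setting a single register bit. A minimal sketch follows, assuming the port's PCI Express Capability structure has already been located at offset pcie_cap in configuration space; within that structure the Link Control register sits at offset 0x10, and Link Disable is bit 4. The configuration access helpers are assumed platform routines.

#include <stdint.h>

extern uint16_t pci_cfg_read16(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off);
extern void     pci_cfg_write16(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off,
                                uint16_t val);

#define LINK_CONTROL_OFFSET  0x10        /* within the PCI Express Capability */
#define LINK_DISABLE         (1u << 4)   /* Link Disable bit                  */

/* Set the Link Disable bit on a downstream port; the port then transmits
 * TS1 Ordered-Sets with the Disable Link training control bit set. */
void disable_link(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t pcie_cap)
{
    uint16_t lc = pci_cfg_read16(bus, dev, fn, pcie_cap + LINK_CONTROL_OFFSET);

    pci_cfg_write16(bus, dev, fn, pcie_cap + LINK_CONTROL_OFFSET,
                    (uint16_t)(lc | LINK_DISABLE));
}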

Configuration.Idle SubState

Entry from Configuration.RcvrCfg-
When the Link Numbering and Lane Numbering process has completed successfully.
During Configuration.Idle-
The Link is fully configured. Bit Lock and Symbol Lock have been achieved.
The Link data rate has been selected. The Link and Lane Numbers have been assigned.
  • The Transmitter sends Logical Idle sequences (see "Logical Idle Sequence" on page 436) on all configured Lanes. At least 16 Logical Idle sequences are sent.
  • The Receiver waits for the receipt of the Logical Idle data.
  • The receiver's Data Link Layer LinkUp status bit is set to 1.
Exit to L0—
Occurs when eight Logical Idle symbols are received on all configured Lanes and 16 Logical Idles are sent after receiving 1 Logical Idle. L0 is the full-on power state during which normal packet transmission and reception can occur. The differential transmitters and receivers are enabled in the low impedance state.
Exit to Detect-
Occurs when, after a 2ms timeout, no Logical Idle symbols have been exchanged.

Designing Devices with Links that can be Merged

General. A designer decides how many Lanes to implement on a given Link based on performance requirements for that Link. The specification requires that a device that implements a multi-Lane Link must be designed to operate as a one-x1 Link also. This allows such a multi-Lane Link device to operate when and if it connects to a ×1 Link device (Link performance is lower, however).
An optional feature allows two or more downstream Links (associated with different ports) of a switch to be combined to form a wider Link that is connected to one device. Figure 14-9 on page 524 shows a Switch with one upstream port and four downstream ports. The Switch supports eight
upstream Lanes and eight downstream Lanes. On the downstream side, the Switch supports four ports. It is therefore four-x2 capable. By combining two ports, it is also two-x4 capable. As required by the specification, each port must be ×1 capable.
Four-x2 Configuration. The Switch is capable of supporting up to four downstream ports, with each port a ×2 port (four-x2 capable on the downstream side) that connects to four devices (left side of Figure 14-9 on page 524). The switch internally consists of one upstream logical bridge and four downstream logical bridges.
During Link Training, while in the Configuration.RcvrCfg substate, the LTSSMs of the switch's downstream ports establish that the switch is connected to four devices with x2 Links each. Essentially, the switch consists of four downstream ports, four LTSSMs, four Physical Layers, four Data Link Layers and four Transaction Layers.
Two-x4 Configuration. This switch design also allows its downstream Lanes to be combined into two downstream ×4 ports (right side of Figure 14-9). In other words, the eight downstream Lanes may be wired to two independent ×4 devices. In this case, the switch consists of one upstream logical bridge and two downstream logical bridges.
During Link Training, while in the Configuration.RcvrCfg substate, the LTSSMs of the switch's downstream ports establish that the switch is connected to two downstream devices with ×4 Links each. Essentially, the switch in this configuration has two downstream ports, and the four switch LTSSMs are merged into two LTSSMs. The switch has two Physical Layers, two Data Link Layers and two Transaction Layers on the downstream side.
The switch is capable of four-x2 Links on the downstream side (left) or two-x4 Links (right), depending on how the designer chooses to wire up the downstream switch Lanes.
During the Configuration.RcvrCfg state, the LTSSM discovers how the downstream Lanes are wired. Each Link that connects to a unique device is numbered uniquely and each Lane of a Link is also numbered. Designing a switch with this capability is no trivial task, so the feature that permits the combining or splitting of Links to form a wider or narrower Link is optional.
It is a requirement that each multi-Lane port be able to operate as a ×1 port when connected to a ×1 device.


Examples That Demonstrate Configuration.RcvrCfg Function

The Link numbering and Lane numbering process is initiated by an upstream device during the Configuration.RcvrCfg substate. A Root Complex or a Switch downstream port would initiate the Configuration.RcvrCfg process. Endpoints and upstream ports are downstream devices and do not initiate this process.
TS1 and TS2 Ordered-Sets are transmitted and received during this substate. Upon exit from the Configuration.RcvrCfg substate, each Link has been initialized with a Link number (this indirectly establishes the number of ports a device supports). Each Lane has also been initialized with a Lane number (this indirectly establishes the Link width).
Three examples are covered in the next three sections.

RcvrCfg Example 1

Consider Figure 14-10 on page 527. Device A is one-x4 capable, one-x2 capable and one-x1 capable (one-x1 support is required by the spec). Device B is likewise one-x4 capable, one-x2 capable and one-x1 capable. The device pins associated with each Lane are physically numbered 0, 1, 2 and 3 (shown in Figure 14-10), though the assigned Logical Lane Numbers may be changed while in the Configuration.RcvrCfg substate (in this example, the Logical Lane Numbers remain the same as the physical Lane Numbers).

Link Number Negotiation.

  1. Mechanism: Upstream Device A transmits TS1 Ordered-Sets with the Link Number for each group of connected Lanes set to a device-specific initial value. As an example, a switch with four downstream ports may initially set the Link Numbers to 0, 1, 2, and 3. The Lane Number field is initially set to the PAD symbol (K23.7).
Actions Taken: This implies that Device A sends four TS1 Ordered-Sets on the four Lanes. The four TS1 Ordered-Sets each contain a Link Number of n, and the Lane Number fields are set to the PAD symbol. Even though Device A is also capable of one-x1 and one-x2 operation, Device A starts by assuming the capability that maximizes the use of all connected Lanes.
  2. Mechanism: Downstream Device B returns TS1 Ordered-Sets on all connected Lanes that received TS1 Ordered-Sets, with a common Link Number assigned to the Lanes it can support as one Link. The Lane Numbers are initially set to the PAD symbol (K23.7).
Actions Taken: Device B returns a TS1 Ordered-Set on all four Lanes. The TS1 Ordered-Set on each Lane contains Link Number n. The Lane Number field contains the PAD symbol. Device A sees the TS1 Ordered-Sets with a Link Number of n on each Lane. Device A establishes that its four Lanes are connected to one downstream device and that the Link is numbered n. Device A has received confirmation from Device B that its Link can be numbered n, where n is a number between 0 and 255. The Link is configured as a one-x4 Link.
The Link Number of n is a logical Link Number that is not stored in a defined configuration register. This number is hard-wired by design, and not related to the Port Number field of the Link Capability Register.
Also, the Negotiated Link Width field in the Link Status register of both the upstream and downstream devices is updated with "000100", indicating a x4 Link (see Figure 14-22 on page 552).

Lane Number Negotiation.

  3. Mechanism: Upstream Device A sends TS1 Ordered-Sets on all connected Lanes with the configured Link Number and unique Lane Numbers starting with 0 for each Lane. PAD symbols are no longer sent in the Lane Number field.
Actions Taken: Device A sends four TS1 Ordered-Sets with a Link Number of n and Lane Numbers of 0, 1, 2 and 3, respectively, on each connected Lane.
  4. Mechanism: Downstream Device B returns TS1 Ordered-Sets on all connected Lanes with the same Link Number as contained in the received TS1 Ordered-Sets and the same Lane Numbers for each Lane as indicated in the received TS1 Ordered-Sets.
If the downstream device's Lanes are hooked up in the reverse manner and it does not support the Lane Reversal feature, it returns the TS1 Ordered-Sets with the Lane Number field indicating the manner in which it wants the Lanes to be numbered. Hopefully, the upstream device supports Lane Reversal and accepts the reverse order in which the downstream device wants the Lanes numbered.
Actions Taken: Device B returns four TS1 Ordered-Sets with a Link Number of n and Lane Numbers of 0, 1, 2 and 3, respectively, on each connected Lane.

Confirmation of Link Number and Lane Number Negotiated.

  5. Mechanism (Steps 5 and 6): Devices A and B confirm the Link Number and Lane Numbers negotiated by exchanging TS2 Ordered-Sets.
Actions Taken: Devices A and B exchange TS2 Ordered-Sets with the Link Number set to n and the Lane Numbers set to 0, 1, 2 and 3, respectively, for each of the four Lanes. In this example, the Logical Lane Numbers of both devices remain the same as the physical Lane Numbers.


Figure 14-10: Example 1 Link Numbering and Lane Numbering

RcvrCfg Example 2

Consider Figure 14-11 on page 529. This is an example in which upstream Device A is capable of one-x4, or two-x2, or two-x1. The narrowest Link capability that uses ALL Lanes is two-x2. The two ports of Device A are each x2 capable, and the physical Pin Numbers (Lane numbers) of each port are 0 and 1. Devices B and C each have one port that is ×2 capable, and the physical Pin Numbers (Lane numbers) of each port are 0 and 1.
Using a strapping option on Device A (or by default), it starts the Configuration.RcvrCfg substate by reporting its two-x2 capability when transmitting TS1 Ordered-Sets to the downstream devices.

Link Number Negotiation:

  1. Mechanism: Upstream Device A transmits TS1 Ordered-Sets with an assumed Link Number value for each group of Lanes capable of acting as a unique Link. For now, the Lane Number is set to the PAD symbol (K23.7).
Actions Taken: Device A sends TS1 Ordered-Sets on all four Lanes. The TS1 Ordered-Sets on the four Lanes contain Link Numbers n, n, n+1, and n+1, respectively. The Lane Number field contains the PAD symbol.
  2. Mechanism: Downstream Devices B and C return TS1 Ordered-Sets containing the Link Number for the Lanes each can support as one Link. The Lane Number is initially set to the PAD symbol (K23.7).
Actions Taken: Devices B and C return TS1 Ordered-Sets on each Lane, containing a Link Number of n for Device B and a Link Number of n+1 for Device C. The Lane Number field is initially set to the PAD symbol. Device A receives TS1 Ordered-Sets on two of the Lanes with a Link Number of n and TS1 Ordered-Sets on the other two Lanes with a Link Number of n+1.
The Link Numbers n and n+1 are logical Link Numbers that are not stored in a defined configuration register. These numbers are not related to the Port Number field of the Link Capability register of upstream Device A's ports or of Device B's or Device C's ports.
Also, the Negotiated Link Width field in the Link Status register of both upstream ports and downstream ports are updated with "000010," indicating a x2 Link (see Figure 14-22 on page 552).

Lane Number Negotiation.

  3. Mechanism: Device A realizes that its Lanes are divided into two Links and sends TS1 Ordered-Sets on all connected Lanes with Link Number n on two Lanes and Link Number n+1 on the other two Lanes. PAD symbols are no longer sent in the Lane Number field.
Actions Taken: Device A sends a TS1 Ordered-Set on two Lanes with a Link Number of n in both of the TS1s, a Lane Number of 0 in one of the TS1s, and a Lane Number of 1 in the other TS1. Device A also sends a TS1 on each of the other two Lanes with a Link Number of n+1 in both of the TS1s, a Lane Number of 0 in one TS1, and a Lane Number of 1 in the other TS1.
  4. Mechanism: Downstream Devices B and C return TS1s on all connected Lanes with the same Link Number as contained in the received TS1 Ordered-Sets and the same Lane Numbers for each Lane as in the received TS1 Ordered-Sets.
If the downstream devices' Lanes are hooked up in the reverse manner and they do not support the Lane Reversal feature, they return the TS1 Ordered-Sets with the Lane Number fields reversed. Hopefully, the upstream device supports Lane Reversal and accepts the reverse order in which the downstream devices want the Lanes numbered.

Actions Taken: Device B returns TS1s on each Lane with Link Number of n and a Lane Number of 0 in one TS1 Ordered-Set and a Lane Number of 1 in the other. Device C returns a TS1 on each Lane with a Link Number of n+1 and a Lane Number of 0 in one TS1 Ordered-Set and a Lane Number of 1 in the other.

Confirmation of Link Number and Lane Number Negotiated.

  Steps 5 and 6. Mechanism: Device A and Devices B/C confirm the Link Numbers and Lane Numbers negotiated by exchanging TS2 Ordered-Sets.
Actions Taken: Devices A and B exchange a TS2 Ordered-Set on each Lane with the Link Number set to n and the Lane Numbers set to 0 and 1, respectively, for each of the two Lanes on the first Link. Devices A and C exchange a TS2 Ordered-Set on each Lane with the Link Number set to n+1 and the Lane Numbers set to 0 and 1, respectively, for each of the two Lanes of the second Link. The Lanes of the two Links in this example are logically numbered 0 and 1, matching the physical Pin Number (Lane number) of each Lane.
Figure 14-11: Example 2 Link Numbering and Lane Numbering

RcvrCfg Example 3

Consider Figure 14-12 on page 532. This is an example in which upstream device A is capable of one-x4, two-x2, or two-x1 operation (the same as Device A in the previous example). The narrowest Link capability that uses ALL Lanes is two-x2. The Lanes of each of Device A's two ports are physically numbered 0 and 1.
Via a strapping option on Device A (or by default), it reports its two-x2 capability when transmitting TS1 Ordered-Sets to the downstream device. In this example, Device A initially assumes that it has two-x2 downstream Links. Also, both of Device A's Links are connected to downstream Device B. Device B is x4 capable. Its pins (or Lanes) are physically numbered 3, 2, 1, 0, respectively. In this example, assume that Device B does not support Lane Reversal, but Device A does support Lane Reversal.

Link Number Negotiation.

  1. Mechanism: Upstream Device A transmits TS1 Ordered-Sets with an assumed Link Number field for each group of Lanes capable of being unique Links that use up all the Lanes. For now, the Lane Number is set to the PAD symbol (K23.7).
Actions Taken: Device A sends four TS1 Ordered-Sets on the four Lanes. Each TS1 Ordered-Set contains the respective Link Number (n, n, n+1, n+1). The Lane Number field is initially set to the PAD symbol.
  2. Mechanism: Downstream Device B returns TS1 Ordered-Sets with an assigned common Link Number for the Lanes it can support as one Link. The Lane Number is initially set to the PAD symbol (K23.7).
Actions Taken: Device B returns TS1 Ordered-Sets on each Lane, each containing a Link Number of n. The Lane Number field is initially set to the PAD symbol. Device A sees TS1 Ordered-Sets on four Lanes with a Link Number of n, telling Device A that its four Lanes are connected to one downstream device and that the Link should be numbered n.
The Link Number of n is a logical Link Number that is not stored in a defined configuration register. This number is not related to the Port Number field of the Link Capability Register.
Also, the Negotiated Link Width field in the Link Status register of both the upstream port and the downstream port is updated with "000100", indicating a x4 Link (see Figure 14-22 on page 552).

Lane Number Negotiation.

  3. Mechanism: Device A realizes that its four Lanes are combined into one Link (one-x4) and sends TS1 Ordered-Sets on all connected Lanes with one assumed Link Number, n, and a unique Lane Number assigned to each Lane of the Link. PAD symbols are no longer sent in the Lane Number field.
Actions Taken: Device A sends TS1 Ordered-Sets on four Lanes with a Link Number of n and Lane Numbers of 0, 1, 2, and 3, respectively, numbered left to right.
  4. Mechanism: Downstream Device B returns a TS1 on all connected Lanes with the same Link Number n as contained in the received TS1 Ordered-Sets. Assume that the Lanes are hooked up in the reverse manner as shown in Figure 14-12 on page 532 and that Device B does not support the Lane Reversal feature.
If the downstream device's Lanes are reversed and it does not support Lane Reversal, it returns the TS1 Ordered-Sets with the Lane Number fields reversed. Hopefully, the upstream device supports Lane Reversal and accepts the reverse ordering of the Lanes.
Actions Taken: Device B returns a TS1 Ordered-Set on each Lane with a Link Number of n and Lane Numbers of 3, 2, 1 and 0, respectively, numbered from left to right.

Confirmation of Link Number and Lane Number Negotiated.

  Steps 5 and 6. Mechanism: Devices A and B confirm the Link Number and the Lane Numbers negotiated by exchanging TS2 Ordered-Sets. Actions Taken: Devices A and B exchange a TS2 Ordered-Set on each Lane with the Link Number set to n and the Lane Numbers set to 3, 2, 1 and 0 (Lanes reversed), respectively, for each of the four Lanes.
Device A's physical Pin Numbers (Lane numbers) for the four Lanes from left to right are 0, 1 and 0, 1 (the same numbers repeated, because Device A is two-x2 port capable). However, Device A ends up with logical Lane Numbers of 3, 2, 1, 0, from left to right. Device B's physical Pin Numbers are 3, 2, 1 and 0, from left to right. Its logical Lane Numbers remain the same as its physical Lane Numbers: 3, 2, 1 and 0.
Consider what would happen if Device A did not support Lane Reversal (Lane Reversal is an optional feature). In Step 4, Device B returns four TS1 Ordered-Sets with Lane Numbers of 3, 2, 1 and 0. Device A would not be able to reverse the physical Lane numbers of 0, 1, 2 and 3 that it proposed in Step 3. The Link training process freezes at this point. This is a Link training error and is reported by the upstream device (Device A) via the Link Training Error bit in the Link Status register (see Figure 14-22 on page 552). A system designer must not hook up the Lanes of two devices in a reversed manner if neither device supports Lane Reversal.
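Example 3 turns on whether the upstream port can accept a Lane numbering that is the exact reverse of the one it proposed. The sketch below shows one way that decision could be expressed in C; the function name, parameters, and error handling are invented for illustration, and real hardware performs this check per Lane inside the LTSSM rather than over an array.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Compare the Lane Numbers proposed in Step 3 with those returned in Step 4.
 * Returns true if training can proceed, and reports whether the link-wide
 * Lane Reversal case applies.                                              */
static bool resolve_lane_numbering(const uint8_t *proposed,
                                   const uint8_t *returned,
                                   size_t width,
                                   bool lane_reversal_supported,
                                   bool *reversed_out)
{
    bool straight = true, reversed = true;

    for (size_t i = 0; i < width; i++) {
        if (returned[i] != proposed[i])
            straight = false;
        if (returned[i] != proposed[width - 1 - i])
            reversed = false;
    }

    if (straight) {               /* Lanes numbered as proposed            */
        *reversed_out = false;
        return true;
    }
    if (reversed && lane_reversal_supported) {
        *reversed_out = true;     /* adopt the reversed numbering          */
        return true;
    }
    return false;                 /* training error: report via the Link
                                     Training Error bit in Link Status     */
}
```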
Figure 14-12: Example 3 Link Numbering and Lane Numbering

Recovery State

The Recovery state is also referred to as the Re-Training state. It is not entered during Link training (which occurs when a device comes out of reset). The Recovery state is entered when a receiver needs to regain Bit and Symbol Lock, or if an error occurs while in L0 that renders the Link inoperable. Rather than going through the Polling and Configuration states (which have longer latencies associated with them), the Recovery state has a much shorter latency (the PLLs are already operational and may only need to be sync'd). The number of
FTS Ordered-Sets (N_FTS) required for L0s exit is re-established in Recovery and the Link is de-skewed. The Link Number, Lane Numbers and bit transfer rate (2.5Gbits/s) remain unchanged. If any of these three variables has changed since the Link was last in the Configuration state, the LTSSM transitions from the Recovery state to the Configuration state.

Reasons that a Device Enters the Recovery State

  • Exit from L1 (requires that the receiver be re-trained).
  • Exit from L0s when the receiver is unable to achieve Bit/Symbol Lock due to the reception of an insufficient number of FTS Ordered-Sets.
  • In case of an error that renders the Link unreliable, software sets the Retrain Link bit in the Link Control Register (see Figure 14-23 on page 553).
  • An error condition that occurs in the L0 state that renders the Link unreliable may automatically cause the Data Link Layer or Physical Layer logic to initiate a re-train cycle.
  • Reception of TS1 or TS2 Ordered-Sets on any configured Lane from a remote transmitter signals the receiver to retrain the link.
  • A receiver detects that the Link has transitioned to the Electrical Idle state on all configured Lanes without first receiving the Electrical Idle Ordered-Sets from the transmitter.

Initiating the Recovery Process

Both devices on a Link go through Recovery together. One of the two devices initiates the Recovery process, transmitting TS1 Ordered-Sets to its neighbor. The neighbor goes through Recovery and returns the favor by returning TS1 Ordered-Sets that the initiator's receiver uses to go through Recovery. In transmitting and receiving TS1 Ordered-Sets, both the receiver and the transmitter of the Ordered-Sets regain Bit/Symbol Lock, and then return to the L0 state.
Refer to Figure 14-13 on page 537 for the detailed steps involved in completing the Recovery process described below.

Recovery.RcvrLock SubState

Entry from L0-
A device enters Recovery for the reasons cited in "Reasons that a Device
Enters the Recovery State" on page 533.
Entry from L1—
The Receiver detects Electrical Idle exit, or the device is directed to enter Recovery by higher-level software. Electrical Idle exit means that the receiver detects a valid differential voltage and starts seeing TS1 Ordered-Sets.
Entry from L0s-
The Receiver enters into Recovery when it detects an N_FTS timeout (i.e., if the receiver is unable to re-obtain Bit/Symbol Lock after receiving N_FTS
FTS Ordered-Sets, or if it receives an insufficient number of FTS Ordered-Sets, then instead of going to L0, it goes to Recovery).
During Recovery.RcvrLock-
  • The transmitter sends TS1 Ordered-Sets on all configured Lanes (with the same Link and Lane Numbers as set during the Configuration state). The specification is unclear about how many TS1 Ordered-Sets the transmitter should send, but the author ventures to guess that it should send TS1 Ordered-Sets until this substate is exited.
  • If the Extended Sync bit is set by software in the Link Control register (see Figure 14-23 on page 553), the transmitter must send a minimum of 1024 TS1 Ordered-Sets to allow an external monitoring device (i.e., a tool), if connected, to sync (obtain Bit/Symbol Lock).
  • The receiver uses the received TS1 Ordered-Sets to obtain Bit/Symbol Lock.
  • A device advertises its N_FTS value via the TS1 Ordered-Sets it sends to the remote device. This number can change from what it was during the Configuration state.
  • A receiver uses the COM symbol in the received TS1 and TS2 Ordered-Sets to de-skew the Lanes of the Link (see "Lane-to-Lane De-Skew" on page 444).
Exit to Recovery.RcvrCfg-
A receiver moves to Recovery.RcvrCfg if eight consecutive TS1 or TS2 Ordered-Sets are received without Link and Lane Number changes.
Exit to Configuration —
After 24ms, if the receiver detects at least one TS1 (but not eight consecutive TS1 Ordered-Sets) on ANY configured Lane and the Link Number and Lane Number are the same as the numbers transmitted in the TS1 Ordered-Sets, then it exits to Configuration.
Exit to Detect—
After a 24ms timeout, if the receiver does not detect TS1 or TS2 Ordered-Sets, or it detects TS1 or TS2 Ordered-Sets with a Link or Lane Number different from the numbers in the transmitted TS1 or TS2 Ordered-Sets, then it exits to Detect.

Recovery.RcvrCfg SubState

Entry from Recovery.RcvrLock-
A receiver moves to Recovery.RcvrCfg if eight consecutive TS1 or TS2 Ordered-Sets are received without Link and Lane Number changes.
During Recovery.RcvrCfg—
  • The Transmitter sends TS2s on all configured Lanes (with the same Link and Lane Numbers configured earlier). Again, the specification is unclear about how many TS2 Ordered-Sets the transmitter should send, but the author ventures to guess that it should send TS2 Ordered-Sets until this substate is exited.
  • If the N_FTS value changes, the device must note the new value.
  • If the Link was not de-skewed in the Recovery.RcvrLock substate, a receiver uses the COM symbol in the received TS1 and TS2 Ordered-Sets to de-skew the Lanes of the Link (see "Lane-to-Lane De-Skew" on page 444).
Exit to Recovery.Idle-
If eight consecutive TS2 Ordered-Sets are received with no Link/Lane Number changes and 16 TS2 Ordered-Sets are sent after receiving one TS1 or TS2 Ordered-Set, then exit to Recovery.Idle.
Exit to Configuration—
If eight consecutive TS1 Ordered-Sets are received on ANY Lane with Link or Lane Numbers that do not match what is being transmitted, then exit to Configuration state.
Exit to Detect-
Exit to Detect after a 48ms timeout if the state machine has not exited to the Recovery.Idle or Configuration state.

Recovery.Idle SubState

Entry from Recovery.RcvrCfg-
Enter from Recovery.RcvrCfg if eight consecutive TS2 Ordered-Sets are received with no Link/Lane Number changes and 16 TS2 Ordered-Sets are sent after receiving one TS1 or TS2 Ordered-Set.
During Recovery.Idle-
  • The Transmitter sends Logical Idle symbols on all configured Lanes unless exiting to Disable, Hot Reset, Configuration, or Loopback.
  • The Receiver waits for the receipt of Logical Idle symbols on all Lanes.
Exit to Disable, Loopback or Hot Reset-
If directed by higher layers to enter the Disable, Loopback or Hot Reset state. The device transmits TS1 or TS2 (TS2 not valid for Hot Reset case) with the Disable, Loopback or Hot Reset bits set.
If a device receives two consecutive TS1s or TS2s (TS2 not valid for Hot Reset case) with the Disable, Loopback or Hot Reset bit set, it exits to the Disable, Loopback or Hot Reset state respectively.
Software can inform a Loopback Master connected to the Link to enter the Loopback state using an implementation-specific mechanism. The Loopback Master device continuously sends TS1 Ordered-Sets to the Loopback Slave with the Loopback bit set in the TS1 Training Control field until the Loopback Slave returns TS1 Ordered-Sets with the Loopback bit set. The Loopback Slave enters Loopback when it receives two consecutive TS1s with the Loopback bit set.
Similarly, software can command a device to enter the Disable state by setting the Disable bit in the Link Control register (see Figure 14-23 on page 553). This device (a downstream port) then transmits 16 TS1 Ordered-Sets with the Disable Link bit set in the TS1 Training Control field. A connected receiver (on the upstream port) is disabled when it receives a TS1 with the Disable Link bit set.
Similarly, software can command a device to enter the Hot Reset state by setting the Secondary Bus Reset bit in the Bridge Control register (see "In-Band Reset or Hot Reset" on page 491). This device (a downstream port) then transmits TS1 Ordered-Sets continuously for 2ms with the Hot Reset bit set in the TS1 Training Control field. A receiver (in the upstream port) detects the Hot Reset when it receives at least two TS1 Ordered-Sets with the Hot Reset bit set.
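The Disable, Loopback, and Hot Reset entry rules above all hinge on the same pattern: receipt of two consecutive training sets with a particular Training Control bit set. The fragment below is a hedged sketch of that detection in C; the bit masks and names are illustrative stand-ins rather than the actual bit positions defined in the symbol-level TS1/TS2 format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative Training Control bit masks (not the real bit positions). */
#define TC_HOT_RESET   0x01u
#define TC_DISABLE     0x02u
#define TC_LOOPBACK    0x04u

/* Call once per received TS1/TS2. Returns true once two consecutive
 * training sets have carried the requested Training Control bit, per the
 * "two consecutive TS1s/TS2s" rule quoted above.                         */
static bool saw_two_consecutive(uint8_t training_control, uint8_t bit,
                                unsigned *consecutive)
{
    if (training_control & bit)
        (*consecutive)++;
    else
        *consecutive = 0;           /* the run must be unbroken */
    return *consecutive >= 2;
}
```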
Exit to Configuration-
Exits to the Configuration state if directed by a higher layer to re-configure the link, or if two consecutive TS1s are received with Lane numbers set to the PAD symbol.
Exit to L0-
If eight Logical Idle symbols are received on all configured Lanes.
Exit to Detect-
Exit to Detect after a 2ms timeout if the LTSSM does not exit to any of the other states above.
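The exit rules for the three Recovery substates can be summarized as a small state machine. The sketch below is a minimal, software-style rendering of the conditions quoted above (eight consecutive matching TS1/TS2s, the 24ms/48ms/2ms timeouts, the 16-TS2 transmit requirement, and the Logical Idle exit); the enum names and the event structure are invented for illustration, and the exits to Disable, Hot Reset, Loopback, and the PAD-Lane-Number case are deliberately omitted to keep the sketch short.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { RECOVERY_RCVRLOCK, RECOVERY_RCVRCFG, RECOVERY_IDLE,
               LTSSM_L0, LTSSM_CONFIGURATION, LTSSM_DETECT } ltssm_state_t;

typedef struct {
    uint32_t elapsed_ms;            /* time spent in the current substate    */
    uint32_t consec_match;          /* consecutive TS1/TS2, numbers matching */
    uint32_t consec_mismatch;       /* consecutive TS1, numbers NOT matching */
    bool     any_matching_ts1;      /* at least one matching TS1 seen        */
    bool     sent_16_ts2_after_rx;  /* 16 TS2s sent after receiving one TS   */
    bool     eight_logical_idle;    /* 8 Logical Idle symbols on all Lanes   */
} recovery_events_t;

static ltssm_state_t recovery_step(ltssm_state_t s, const recovery_events_t *e)
{
    switch (s) {
    case RECOVERY_RCVRLOCK:
        if (e->consec_match >= 8) return RECOVERY_RCVRCFG;
        if (e->elapsed_ms >= 24)
            return e->any_matching_ts1 ? LTSSM_CONFIGURATION : LTSSM_DETECT;
        return s;
    case RECOVERY_RCVRCFG:
        if (e->consec_match >= 8 && e->sent_16_ts2_after_rx)
            return RECOVERY_IDLE;
        if (e->consec_mismatch >= 8) return LTSSM_CONFIGURATION;
        if (e->elapsed_ms >= 48) return LTSSM_DETECT;
        return s;
    case RECOVERY_IDLE:
        if (e->eight_logical_idle) return LTSSM_L0;
        if (e->elapsed_ms >= 2)    return LTSSM_DETECT;
        return s;
    default:
        return s;                   /* terminal states in this sketch */
    }
}
```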

Figure 14-13: Recovery State Machine

L0 State

Enter from Configuration-
This state is entered from Configuration.Idle substate if eight Logical Idle symbols are received on all configured Lanes and 16 Logical Idles are sent after receiving one Logical Idle.
Enter from Recovery-
This state is entered from the Recovery.Idle substate if eight Logical Idle symbols are received on all configured Lanes.
Enter from L0s-
This state is entered from L0s if a device receives the appropriate number of FTS Ordered-Sets and re-obtains Bit and Symbol Lock.
During L0-
  • This is the fully-operational Link state during which TLP, DLLP and PLP transmission and reception can occur.
  • The differential transmitters and receivers are enabled in the low impedance state.
  • LinkUp =1
Exit to Recovery-
A device enters Recovery for any of the reasons cited in "Reasons that a Device Enters the Recovery State" on page 533.

Exit to L0s—
The Transmitter enters L0s when directed to do so by its higher layers. A
Receiver enters L0s when it receives an Electrical Idle Ordered-Set and the
Link transitions to the Electrical Idle state.
Exit to L1-
See "L1 State" on page 541 for a detailed description.
Exit to L2-
See "L2 State" on page 543 for a detailed description.

L0s State

This is a lower power state that has the shortest exit latency to L0. Devices manage entry and exit from this state automatically without any higher level software involvement.

L0s Transmitter State Machine

Figure 14-14 on page 539 shows the transmitter state machine associated with L0s state entry and exit.
Tx_L0s.Entry SubState.
Entry from L0-
The L0s state machine is entered when the device is directed to do so by an upper layer. This may occur via a timeout mechanism triggered due to periods of inactivity (no TLP, DLLP or PLP transmission activity) on the Link.
During Tx_L0s.Entry-
  • The Transmitter sends an Electrical Idle Ordered-Set and the Link enters the Electrical Idle state.
  • The Transmitter drives a DC common mode voltage between 0 and 3.6V.
Exit to Tx_L0s.Idle-
Exit to Tx_L0s.Idle after 50UI (20ns) while the transmitter drives a stable DC common mode voltage.
Tx_L0s.Idle SubState.
Entry from Tx_L0s.Entry-
Enter Tx_L0s.Idle after 50 UI (20 ns) while the transmitter drives a stable DC common mode voltage.
During Tx_L0s.Idle—
  • The Link is in the Electrical Idle state.
  • The transmitter's output impedance could be low or high.
Exit to Tx_L0s.FTS—
Exit to Tx_L0s.FTS if directed to do so by a higher layer. For example, when it is time for a device to resume packet transmission, it will exit this state.

Tx_L0s.FTS SubState.
Entry from Tx_L0s.Idle-
Enter Tx_L0s.FTS if directed to do so by a higher layer.
During Tx_L0s.FTS—
  • To exit the Electrical Idle substate, the transmitter sends the number of FTS Ordered-Sets specified by N_FTS. The N_FTS number is defined during Link Training (Configuration and Recovery states) during which each device advertises the number of FTS sets it requires to achieve lock.
  • If the Extended Synch bit is set (see Figure 14-23 on page 553), the transmitter sends 4096 FTS Ordered-Sets instead of N_FTS number of FTS Ordered-Sets.
  • Follow this by one Skip Ordered-Set. No SKIP Ordered-Sets are transmitted during the transmission of FTS Ordered-Sets.
Exit to L0—
Exit to L0 state after the Skip Ordered-Set transmission.
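The transmitter-side exit sequence lends itself to a short arithmetic sketch: send either N_FTS or 4096 FTS Ordered-Sets, depending on the Extended Synch bit, then close with a single Skip Ordered-Set. The code below only illustrates that rule; the function and parameter names are invented, and real hardware emits symbols on the Lanes rather than making function calls.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical symbol-level helpers assumed to exist elsewhere. */
void transmit_fts_ordered_set(void);
void transmit_skip_ordered_set(void);

/* Tx_L0s.FTS behavior: the number of FTS sets depends on the Extended
 * Synch bit in the Link Control register.                              */
void tx_l0s_fts_exit(uint16_t n_fts, bool extended_synch)
{
    uint32_t count = extended_synch ? 4096u : n_fts;

    for (uint32_t i = 0; i < count; i++)
        transmit_fts_ordered_set();   /* no SKPs are interleaved here      */

    transmit_skip_ordered_set();      /* single Skip Ordered-Set, then L0  */
}
```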
Figure 14-14: L0s Transmitter State Machine

L0s Receiver State Machine

Figure 14-15 on page 541 shows the receiver state machine associated with L0s state entry and exit.
Rx_L0s.Entry SubState.
Entry from L0-
This lower power state is entered if a receiver receives an Electrical Idle Ordered-Set.
During Rx_L0s.Entry—
  • Wait in this state for a minimum of 50 UI (20ns).
  • The receiver's input impedance remains low.
Exit to Rx_L0s.Idle—
Exit to Rx_L0s.Idle after 50 UI (20ns).

Rx_L0s.Idle SubState.

Entry from Rx_L0s.Entry—
Enter Rx_L0s.Idle after 50 UI (20ns).
During Rx_L0s.Idle-
Wait until the receiver detects an Electrical Idle exit (i.e., a valid differential voltage is seen on the receivers).
Exit to Rx_L0s.FTS—
The next state is Rx_L0s.FTS if the receiver detects Electrical Idle exit on any configured Lane.

Rx_L0s.FTS SubState.

Entry from Rx_L0s.Idle-
Enter this state from Rx_L0s.Idle if the receiver detects Electrical Idle exit on any configured Lane.
During Rx_L0s.FTS—
  • Receiver obtains Bit/Symbol Lock if a sufficient number of FTS Ordered-Sets are received.
  • The receiver must be able to receive packets after this state.
Exit to L0—
Exit to L0 state after Skip Ordered-Set reception and a sufficient number of FTS Ordered-Sets are received (as advertised during the Configuration or
Recovery states via the N_FTS field of the TS1/TS2 Ordered-Set).
Exit to Recovery-
Recovery state is entered if an N_FTS timeout occurs (i.e., if the receiver receives an insufficient number of FTS Ordered-Sets to re-obtain Bit/Symbol Lock).

Figure 14-15: L0s Receiver State Machine

L1 State

This is a lower power state than L0s and has a longer exit latency than the L0s exit latency. Devices can manage entry and exit from this state automatically without any higher level software involvement. In addition, Power management software may direct a device to place its upstream Link into L1 (both directions of the Link go to L1) when the device is placed in a lower power device state such as D1, D2, or D3.
Figure 14-16 on page 542 shows the L1 entry and Exit state machine. This state machine is described in the subsections that follow.

L1.Entry SubState

Entry from L0—
The L1 state machine is entered when a device's higher layer directs the device to do so.
During L1.Entry—
  • The Transmitter sends an Electrical Idle Ordered-Set and the Link enters the Electrical Idle state.

  • The Transmitter drives a DC common mode voltage between 0 and 3.6V.
Exit to L1.Idle-
Exit to L1.Idle after 50UI (20ns),while the transmitter drives a stable DC common mode voltage.

L1.Idle SubState

Entry from L1.Entry—
Enter L1.Idle after 50UI (20ns),while the transmitter drives a stable DC common mode voltage.
During L1.Idle-
  • The Link is in the Electrical Idle state.
  • The transmitter's output impedance could be low or high, while the receiver's remains in the low impedance state.
  • Remain in this state until the receiver detects Electrical Idle exit (a valid differential voltage associated with the reception of a TS1 Ordered-Set used to signal L1 exit).
Exit to Recovery-
Exit to Recovery after the receiver detects the Electrical Idle exit condition, or if the device is directed to do so.
Figure 14-16: L1 State Machine

L2 State

This is an even lower power state than L1 and has a longer exit latency than L1. Power Management software directs a device to place its upstream Link into L2 (both directions of the Link go to L2) when the device is placed in a lower power device state such as D3cold.
Figure 14-17 on page 544 shows the L2 entry and Exit state machine. This state machine is described next.

L2.Idle SubState

Entry from L0—
This state is entered when directed to do so by higher layers and an Electrical
Idle Ordered-Set is exchanged between neighbors across a Link.
During L2.Idle-
  • The Receiver remains in the low impedance state.
  • The Transmitter must remain in the Electrical Idle state for a minimum of 50 UI (20ns).
  • The Receiver starts looking for the Electrical Idle exit condition.
  • The DC common mode voltage doesn't have to be in spec and may be turned off.
Exit to L2.TransmitWake-
When an upstream port is directed to send the Beacon signal due to a wakeup event. Also, when a Beacon is received on at least Lane 0 of a switch downstream port.
Exit to Detect-
When a Beacon is received on at least Lane 0 of a Root Complex downstream port or if a Root Port is directed by a higher layer to go to the Detect state. Also, exit to Detect if an upstream Lane detects the Electrical Idle exit condition.

L2.TransmitWake SubState

Entry from L2.Idle-
Enter L2.TransmitWake when an upstream port is directed to send the Beacon signal due to a wakeup event. Also, enter L2.TransmitWake when a Beacon signal is received on at least Lane 0 of a switch downstream port.
During L2.TransmitWake-
Transmit the Beacon signal on at least Lane 0 of the upstream port in the direction of the Root Complex.
Exit to Detect-
Go to Detect if an upstream port detects the Electrical Idle exit condition.

Figure 14-17: L2 State Machine

Hot Reset State

Hot Reset is an in-band signaled reset triggered by software as explained in "In-Band Reset or Hot Reset" on page 491. The state machine in Figure 14-18 on page 545 describes entry to and exit from the Hot Reset state.
Entry from Recovery-
Links that are directed to do so by higher layers enter Hot Reset through the Recovery state.
During Hot Reset-
  • On all Lanes, the transmitter (on a downstream port) continuously transmits TS1s with the Hot Reset bit set and containing the configured Link and Lane Numbers. The Hot Reset initiator also resets itself.
  • A receiver detects Hot Reset when it detects at least two TS1s with the Hot Reset bit set. It enters the Hot Reset state through recovery.
  • LinkUp=0 .
Exit to Detect-
Exit to Detect after a 2ms timeout.

Disable State

A Disabled Link is a Link that is off and does not have to have the DC common mode voltage driven. If, for example, software wishes to turn off a faulty Link, it can do so by setting the Link Disable bit (see Figure 14-23 on page 553) in the Link Control register of a device. That device transmits TS1s with the Link Disable bit asserted. The state machine in Figure 14-19 on page 546 describes entry to and exit from the Disable state.

Entry from Configuration or Recovery-

All Lanes transmit 16 TS1 Ordered-Sets with the Link Disable bit asserted and then transition to Electrical Idle after transmitting the Electrical Idle Ordered-Set. If no Electrical Idle Ordered-Set is transmitted, then the receiver transitions to the Detect state after 2ms . The DC common mode voltage does not have to be within spec while in Detect.
During Disable-
  • Remain in Disable state until the Disable exit condition is detected.
  • The DC common mode voltage does not have to be within spec.
  • LinkUp=0 .

Exit to Detect-
Exit to Detect after a 2ms timeout during which no Electrical Idle Ordered-Set is received upon Disable entry, or when an Electrical Idle exit is sensed, or as directed by higher layers.
Figure 14-19: Disable State Machine

Loopback State

The Loopback feature is a test and debug feature and is not used in normal operation. A Loopback master device (such as a tester) when connected to a device's Link (the device under test is the Loopback slave when in the Loopback state) can place the Link and Loopback slave into the Loopback state by transmitting TS1 Ordered-Sets with the Loopback bit asserted. The Loopback master can serve as the BIST (Built In Self Test) engine.
Once in this state, the Loopback master sends valid 8b/10b encoded symbols to the Loopback slave. The Loopback slave turns around and feeds back the symbol stream. The Loopback slave continues to perform clock tolerance compensation, so the master must ensure that it inserts Skip Ordered-Sets at the correct intervals. To perform clock tolerance compensation, the Loopback slave may have to add or delete SKP symbols to the Skip Ordered-Set that it feeds back with the symbol stream to the Loopback master. If SKP symbols are added by the Loopback slave, they have to be of the same disparity as the received SKP symbols.
The Loopback state is exited when the Loopback master transmits the Electrical Idle Ordered-Set and the receiver detects that the Link has transitioned to the Electrical Idle state.
See Figure 14-20 on page 549 for a description of Loopback entry and exit procedure.

Loopback.Entry SubState

Entry—
As directed by higher layers, a Loopback master can transmit TS1 Ordered-Sets with the Loopback bit set.
During Loopback.Entry-
  • The Loopback Master continuously transmits TS1 Ordered-Sets with the Loopback bit set.
  • The Loopback Slave returns the identical TS1 Ordered-Sets.
  • LinkUp=0 .
Exit to Loopback.Active-
When the master receives TS1 Ordered-Sets identical to those it transmitted, indicating that the slave has entered the Loopback.Active substate, the master transitions to Loopback.Active.
Exit to Loopback.Exit—
If the master does not receive identical TS1 Ordered-Sets, or does not receive TS1 Ordered-Sets for 100ms ,it transitions to the Loopback.Exit state.

Loopback.Active SubState

Entry from Loopback.Entry-
If the master receives TS1s identical to those it transmitted, the slave has entered the Loopback.Active substate.
During Loopback.Active-
  • The Loopback master transmits valid 8b/10b symbols with valid disparity.
  • The Loopback Slave returns the identical 8b/10b symbols with valid disparity while periodically performing clock tolerance compensation.
Exit to Loopback.Exit-
The Loopback master transmits at least 1ms of Electrical Idle Ordered-Sets. A receiver detects Loopback exit when it receives the Electrical Idle Ordered-Set, or senses the Electrical Idle state of the Link.

Loopback.Exit SubState

Entry from Loopback.Active-
The Loopback master transmits at least 1ms of Electrical Idle Ordered-Sets and then enters the Electrical Idle state. A receiver detects Loopback exit when it receives the Electrical Idle Ordered-Set, or senses the Electrical Idle state of the Link.
During Loopback.Exit-
  • The Loopback master transmits Electrical Idle Ordered-Sets for at least 2ms .
  • The Loopback Slave must enter Electrical Idle on all Lanes for 2ms. Before entering Loopback.Exit, the slave must echo back all symbols it received from the master.
The device then exits to the Detect state.

Figure 14-20: Loopback State Machine

LTSSM Related Configuration Registers

Only those bits associated with the Link Training and Initialization state are described here.

Link Capability Register

The Link Capability Register is pictured in Figure 14-21 on page 550 and each bit field is described in the subsections that follow.

Maximum Link Speed[3:0]

This field must currently be hard-wired to 0001b, indicating that its supported speed is the Generation 1 Link speed of 2.5Gbits/s. All other encodings are reserved.

Maximum Link Width[9:4]

This field indicates the maximum width of the PCI Express Link. The values that are defined are:
  • 000000b: Reserved.
  • 000001b: x1.
  • 000010b: x2.
  • 000100b: x4.
  • 001000b: x8.
  • 001100b: x12.
  • 010000b: x16.
  • 100000b: x32.
  • All others are reserved.
This field is either hard-wired or is automatically updated by hardware after passing through the Detect state of the LTSSM. It cannot be cleared or written by software.
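Because the same width encodings reappear in the Negotiated Link Width field of the Link Status register (described below), a small lookup is enough to translate the field value into a Lane count. The helper below is a hedged sketch; the function name is invented, and the encodings are simply the ones listed above.

```c
#include <stdint.h>

/* Translate the 6-bit Link Width encoding (Link Capability bits [9:4],
 * and the same encoding in the Link Status register) into a Lane count.
 * Returns 0 for reserved encodings.                                     */
static unsigned link_width_lanes(uint8_t encoding)
{
    switch (encoding) {
    case 0x01: return 1;
    case 0x02: return 2;
    case 0x04: return 4;
    case 0x08: return 8;
    case 0x0C: return 12;
    case 0x10: return 16;
    case 0x20: return 32;
    default:   return 0;   /* reserved */
    }
}
```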

Link Status Register

The Link Status Register is pictured in Figure 14-22 on page 552 and each bit field is described in the subsections that follow.

Link Speed[3:0]:

This field is read-only and indicates the negotiated Link speed of the PCI Express Link. It is updated during the Polling state of the LTSSM. Currently, the only defined encoding is 0001b ,indicating a Link speed of 2.5Gbits/s .

Negotiated Link Width[9:4]

This field indicates the result of Link width negotiation. There are seven defined widths; all other encodings are reserved. The defined encodings are:
  • 000001b: x1.
  • 000010b: x2.
  • 000100b: x4.
  • 001000b: x8.
  • 001100b: x12.
  • 010000b: x16.
  • 100000b: x32.

Training Error[10]

This bit is set by hardware when a Link Training error has occurred. It is cleared by hardware upon successful training of the Link, when the Link has entered the L0 (active) state. This bit is only supported in upstream devices such as Root Complex or Switch downstream ports.

Link Training[11]

This bit is set by the hardware while Link Training is in progress and is cleared when Link Training completes. The LTSSM is either in the Configuration or Recovery state when this bit is set.

Link Control Register

The Link Control Register is pictured in Figure 14-23 on page 553 and each bit field is described in the subsections that follow.

Link Disable

When set to one, the link is disabled. It is not applicable to and is reserved for Endpoint devices and for an upstream port on a Switch. When this bit is written, any read immediately reflects the value written, regardless of the state of the Link. Writing this bit causes the device to transmit 16 TS1 Ordered-Sets with the Disable Link bit asserted.

Retrain Link

This bit allows software to initiate Link re-training. This could be used in error recovery. The bit is not applicable to and is reserved for Endpoint devices and the upstream ports of a Switch. When set to one, this directs the LTSSM to the Recovery state before the completion of the Configuration write request is returned.

Extended Synch

This bit is used to force the transmission of 4096 FTS (Fast Training Sequence) Ordered-Sets in L0s followed by a single Skip Ordered-Set prior to entering L0. It also forces the transmission of 1024 TS1 Ordered-Sets in L1 prior to entering the Recovery state. This extended sync gives an external tool monitoring the Link, if one is connected, time to achieve Bit and Symbol Lock before the Link enters the L0 or Recovery state and resumes normal communication.
Figure 14-23: Link Control Register
Part Four: Power-Related Topics

15 Power Budgeting

The Previous Chapter

The previous chapter described the function of the Link Training and Status State Machine (LTSSM) of the Physical Layer. It also described the initialization process of the Link from Power-On or Reset, until the full-on L0 state, where traffic on the Link can begin. In addition, the chapter described the lower power management states L0s, L1, L2, and L3, and briefly discussed the entry and exit procedures for these states.

This Chapter

This chapter describes the mechanisms that software can use to determine whether the system can support an add-in card based on the amount of power and cooling capacity it requires.

The Next Chapter

The next chapter provides a detailed description of PCI Express power management, which is compatible with revision 1.1 of the PCI Bus PM Interface Specification and the Advanced Configuration and Power Interface, revision 2.0 (ACPI). In addition, PCI Express defines extensions that are orthogonal to the PCI-PM specification. These extensions focus primarily on Link Power and PM event management. The chapter also provides an overall context for the discussion of power management by describing the OnNow Initiative, ACPI, and the involvement of the Windows OS.

Introduction to Power Budgeting

The primary goal of the PCI Express power budgeting capability is to allocate power for PCI Express hot plug devices, which can be added to the system during runtime. This capability ensures that the system can allocate the proper amount of power and cooling for these devices.
The specification states that "power budgeting capability is optional for PCI Express devices implemented in a form factor which does not require hot plug, or that are integrated on the system board." None of the form factor specifications released at the time of this writing require hot plug support, and consequently none require the power budgeting capability. However, form factor specifications under development will require hot plug support and may also require the power budgeting capability.
System power budgeting is always required to support all system board devices and add-in cards. The new power budgeting capability provides mechanisms for managing the budgeting process. Each form factor specification defines the minimum and maximum power for a given expansion slot. For example, the Electromechanical specification limits the amount of power an expansion card can consume prior to and during configuration, but after a card is configured and enabled, it can consume the maximum amount of power specified for the slot (see Chapter 18, entitled "Add-in Cards and Connectors," on page 685). Consequently, in the absence of the power budgeting capability registers, the system designer is responsible for guaranteeing that power has been budgeted correctly and that sufficient cooling is available to support any compliant card installed into the connector.
The specification defines the configuration registers that are designed to support the power budgeting process, but does not define the power budgeting methods and processes. The next section describes the hardware and software elements that would be involved in power budgeting, including the specified configuration registers.

The Power Budgeting Elements

Figure 15-2 illustrates the concept of Power Budgeting for hot plug cards. The role of each element involved in the power budgeting, allocation, and reporting process is listed and described below:
  • System Firmware Power Management (used during boot time)
  • Power Budget Manager (used during run time)
  • Expansion Ports (ports to which card slots are attached)
  • Add-in Devices (Power Budget Capable)
System Firmware — System firmware, having knowledge of the system design, is responsible for reporting system power information. The specification recommends the following power information be reported to the PCI Express power budget manager, which allocates and verifies power consumption and dissipation during runtime:
  • Total system power available.
  • Power allocated to system devices by firmware
  • Number and type of slots in the system.
Firmware may also allocate power to PCI Express devices that support the power budgeting capability configuration register set (e.g., a hot-plug device used during boot time). The Power Budgeting Capability register (see Figure 15-1) contains a System Allocated bit that is intended to be set by firmware to notify the power budget manager that power for this device has been included in the system power allocation. Note that the power manager must read and save power information for hot-plug devices that are allocated by the system, in case they are removed during runtime.
Figure 15-1: System Allocated Bit
The Power Manager - The power manager initializes when the OS installs, at which time it receives power-budget information from system firmware. The specification does not define the method for communicating this information.
The power budget manager is responsible for allocating power for all PCI Express devices. This allocation includes:
  • PCI Express devices that have not already been allocated by the system (includes embedded devices that support power budgeting).
  • Hot-plugged devices installed at boot time.
  • New devices added during runtime.
Expansion Ports — Figure 15-2 on page 561 illustrates a hot plug port that must have the Slot Power Limit and Slot Power Scale fields within the Slot Capabilities register implemented. The firmware or power budget manager must load these fields with a value that represents the maximum amount of power supported by this port. When software writes to these fields the port delivers the Set_Slot_Power_Limit message to the device. These fields are also written when software configures a card that has been added during a hot plug installation.
The PCI Express specification requires that:
  • Any downstream port of a Switch or a Root Complex that has a slot attached (i.e., the Slot Implemented bit within its PCI Express Capabilities register is set) must implement the Slot Capabilities register.
  • Software must initialize the Slot Power Limit Value and Scale fields of the Slot Capabilities register of the Switch or Root Complex Downstream Port that is connected to an add-in slot.
  • The Upstream Port of an Endpoint, Switch, or a PCI Express-PCI Bridge must implement the Device Capabilities register.
  • When a card is installed in a slot, and software updates the power limit and scale values in the Downstream port of the Switch or Root Complex, that port will automatically transmit the Set_Slot_Power_Limit message to the Upstream Port of the Endpoint, Switch, or a PCI Express-PCI Bridge on the installed card.
  • The recipient of the Message must use the value in the Message data payload to limit usage of the power for the entire card/module, unless the card/module will never exceed the lowest value specified in the corresponding electromechanical specification.
Add-in Devices-Expansion cards that support the power budgeting capability must include the:
  • Slot Power Limit Value and Slot Power Limit Scale fields within the Device Capabilities register.
  • Power Budgeting Capability register set for reporting power-related information.
These devices must not consume more power than the lowest power specified by the form factor specification. Once power budgeting software allocates additional power via the Set_Slot_Power_Limit message, the device can consume the power specified, but not until it has been configured and enabled.
Device Driver-The device's software driver is responsible for verifying that sufficient power is available for proper device operation prior to enabling it. If the power is lower than that required by the device, the device driver is responsible for reporting this to a higher software authority.
Figure 15-2: Elements Involved in Power Budget

Slot Power Limit Control

Software is responsible for determining the maximum amount of power that an expansion device is allowed to consume. This power allocation is based on the power partitioning within the system, thermal capabilities, etc. Knowledge of the system's power and thermal limits comes from system firmware. The firmware or power manager (which receives power information from firmware) is responsible for reporting the power limits to each expansion port.

Expansion Port Delivers Slot Power Limit

Software writes to the Slot Power Limit Value and Slot Power Limit Scale fields of the Slot Capability register to specify the maximum power that can be consumed by the device. Software is required to specify a power value that reflects one of the maximum values defined by the specification. For example, the electromechanical specification defines maximum power listed in Table 15-1.
Table 15-1: Maximum Power Consumption for System Board Expansion Slots
                      x1 Link                x4/x8 Link    x16 Link
Standard Height       10W (max), 25W (max)   25W (max)     25W (max), 40W (max)
Low Profile Card      10W (max)              10W (max)     25W (max)
When these registers are written by power budget software, the expansion port sends a Set_Slot_Power_Limit message to the expansion device. This procedure is illustrated in Figure 15-3 on page 563.
Figure 15-3: Slot Power Limit Sequence
  1. When Hot Plug software is notified of a card insertion request, Power and Clock are restored to the slot.
  2. Hot Plug software calls configuration and power budgeting software to configure and allocate power to the device.
  3. Power budget software may interrogate the card to determine its power requirements and characteristics.
  4. Power is then allocated based on the device's requirements and the system's capabilities.
  5. Power software writes to the Slot Power Limit Scale and Slot Power Limit Value fields within the expansion port.
  6. Writes to these fields command the port to send the Set_Slot_Power_Limit message to convey the contents of the Slot Power fields.
  7. The device on the installed card receives the message and updates its Captured Slot Power Limit Value and Scale fields.
  8. These values limit the power that the expansion device can consume once it is enabled by its device driver.
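The Slot Power Limit Value and Scale fields encode the limit as a value multiplied by a power-of-ten scale. The sketch below shows that arithmetic under the assumption that a Scale of 00b means a multiplier of 1.0, 01b means 0.1, 10b means 0.01, and 11b means 0.001; treat the constants, the worked example, and the function name as illustrative rather than normative.

```c
/* Compute the slot power limit, in watts, from the Slot Power Limit
 * Value and Slot Power Limit Scale fields of the Slot Capabilities
 * register (the same encoding is captured by the expansion device).  */
static double slot_power_limit_watts(unsigned value, unsigned scale)
{
    static const double multiplier[4] = { 1.0, 0.1, 0.01, 0.001 };
    return (double)value * multiplier[scale & 0x3];
}

/* Example: value = 250 with scale = 01b (0.1x) would encode 25.0 W. */
```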

Expansion Device Limits Power Consumption

The device driver reads the values from the Slot Power Limit and Scale fields to verify that the power available is sufficient to operate the device. Several conditions may exist:
  • The power available is the power required to operate the device at full capability. In this case, the driver enables the device by writing to the configuration Command register, permitting the device to consume up to the amount of power specified in the Power Limit fields.
  • The power available is sufficient to operate the device but not at full capability. In this case, the driver is required to configure the device such that it consumes no more power than specified in the Power Limit fields.
  • The power available is less than the power required to operate the device. In this case, the driver must not enable the card and must report the inadequate power condition to the upper software layers, which ideally would inform the end user of the power-related problem.
  • The power available exceeds the maximum power specified by the form factor specification. This condition should not occur. However, if it does, the device is not permitted to consume power beyond the maximum permitted by the form factor.
  • The power available is less than the lowest value specified by the form factor specification. This is a violation of the specification, which states that the expansion port "must not transmit a Set_Slot_Power_Limit Message which indicates a limit that is lower than the lowest value specified in the electromechanical specification for the slot's form factor." See Table 15-1 on page 562.
Some devices implemented on expansion devices may consume less power than the lowest limit specified for the form factor. Such devices are permitted to discard the information delivered in the Set_Slot_Power_Limit Messages. When the Slot Power Limit Value and Scale fields are read, these devices return zeros.
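The driver-side checks in the list above reduce to a comparison of a few numbers: the power granted via the Set_Slot_Power_Limit message, the power the device needs, and the form factor maximum. The fragment below is a hypothetical sketch of that decision; the enum values and function name are not from the specification.

```c
typedef enum {
    ENABLE_FULL,        /* enable at full capability                       */
    ENABLE_REDUCED,     /* enable, but constrain consumption to the limit  */
    DO_NOT_ENABLE       /* insufficient power: report to upper layers      */
} enable_decision_t;

/* All values in watts. form_factor_max is the ceiling from Table 15-1. */
static enable_decision_t driver_power_check(double granted,
                                             double required_full,
                                             double required_minimum,
                                             double form_factor_max)
{
    if (granted > form_factor_max)
        granted = form_factor_max;      /* never exceed the form factor max */
    if (granted >= required_full)
        return ENABLE_FULL;
    if (granted >= required_minimum)
        return ENABLE_REDUCED;
    return DO_NOT_ENABLE;
}
```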

The Power Budget Capabilities Register Set

These registers permit power budgeting software to allocate power more effectively based on information provided by the device through its power budget Data Select and Data registers. This feature is similar to the data select and data fields within the power management capability registers. However, the power budget registers provide more detailed information that is useful to software when determining the effect that expansion cards added during runtime have on the system power budget and cooling requirements. Through this capability, a device can report the power it consumes:
  • from each power rail
  • in various power management states
  • in different operating conditions
These registers are not required for devices implemented on the system board or on expansion devices that do not support hot plug. Figure 15-4 on page 565 illustrates the power budget capabilities register set and shows the data select and data field that provide the method for accessing the power budget information.
The power budget information is maintained within a table that consists of one or more 32-bit entries. Each table entry contains power budget information for the different operating modes supported by the device. Each table entry is selected via the data select field, and the selected entry is then read from the data field. The index values start at zero and are implemented in sequential order. When a selected index returns all zeros in the data field, the end of the power budget table has been located. Figure 15-5 on page 566 illustrates the format and types of information available from the data field.
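Reading the power budget table is a simple indexed walk: write an index to the Data Select register, read the Data register, and stop at the first all-zero entry. The loop below sketches that procedure; the configuration-space accessor functions are assumed to exist and are named here only for illustration.

```c
#include <stdint.h>

/* Hypothetical configuration-space accessors assumed to exist elsewhere. */
void     cfg_write_data_select(uint8_t index);
uint32_t cfg_read_power_budget_data(void);

/* Walk the power budget table until the terminating all-zero entry. */
static unsigned read_power_budget_table(uint32_t *entries, unsigned max_entries)
{
    unsigned count = 0;

    for (unsigned index = 0; count < max_entries; index++) {
        cfg_write_data_select((uint8_t)index);
        uint32_t entry = cfg_read_power_budget_data();
        if (entry == 0)                 /* all zeros: end of table           */
            break;
        entries[count++] = entry;       /* power info for one operating mode */
    }
    return count;
}
```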
Figure 15-4: Power Budget Capability Registers

Figure 15-5: Power Budget Data Field Format and Definition

16 Power Management

The Previous Chapter

The previous chapter described the mechanisms that software can use to determine whether the system can support an add-in card based on the amount of power and cooling capacity it requires.

This Chapter

This chapter provides a detailed description of PCI Express power management, which is compatible with revision 1.1 of the PCI Bus PM Interface Specification and the Advanced Configuration and Power Interface, revision 2.0 (ACPI). In addition, PCI Express defines extensions that are orthogonal to the PCI-PM specification. These extensions focus primarily on Link Power and PM event management. This chapter also provides an overall context for the discussion of power management by describing the OnNow Initiative, ACPI, and the involvement of the Windows OS.

The Next Chapter

PCI Express includes native support for hot plug implementations. The next chapter discusses hot plug and hot removal of PCI Express devices. The specification defines a standard usage model for all device and platform form factors that support hot plug capability. The usage model defines, as an example, how push buttons and indicators (LED's) behave, if implemented on the chassis, add-in card or module. The definitions assigned to the indicators and push buttons, described in this chapter, apply to all models of hot plug implementations.

Introduction

PCI Express power management (PM) defines two major areas of support:
  • PCI-Compatible Power Management. PCI Express power management is based upon hardware and software compatible with the PCI Bus Power Management Interface Specification, Revision 1.1 (also referred to as PCI-PM) and the Advanced Configuration and Power Interface Specification, Revision 2.0 (commonly known as ACPI). This support requires that all PCI Express functions include the PCI Power Management Capability registers, which permits transitions between function PM states.
  • Native PCI Express Extensions. These extensions define autonomous hardware-based Link Power Management, mechanisms for waking the system, a Message transaction to report Power Management Events (PME), and low power to active state latency reporting and calculation.
This chapter is segmented into five major sections:
  1. The first section is intended as a primer for the discussion of power management, reviewing the role of system software in controlling power management features. This section restricts the discussion to power-management software from the Windows Operating System perspective.
  2. The second section, "Function Power Management" on page 585, discusses the PCI-PM mechanisms required by PCI Express for placing functions into their low power states. This section also documents the PCI-PM capability registers used in PCI Express. Note that some of the register definitions are modified or not used by PCI Express functions.
  3. Next, "Link Active State Power Management" on page 608 describes the autonomous Link power management that occurs when a device is in its active state (D0). Active State Power Management (ASPM) is a hardware-based link power conservation mechanism. Software enables ASPM and reads latency values to determine the level of ASPM appropriate, but does not initiate transitions into ASPM.
  4. The fourth section, "Software Initiated Link Power Management" on page 629, discusses the Link power management that is triggered by PCI-PM software when it changes the power state of a device. PCI Express devices are required to automatically conserve Link power when software places a device into a low power state, including D3cold (caused by the reference clock and main power being completely removed from a device).
  5. Finally, "Link Wake Protocol and PME Generation" on page 638 covers Power Management Events (PME) and wakeup signaling. Devices may request that software return them to the active state so they can handle an event that has occurred. This is done by sending PME messages. When power has been removed from a device, auxiliary power is required to monitor events and to signal Wakeup for reactivating the link. Once a device has been re-powered and the link has been re-trained, the PME message can be sent.

Primer on Configuration Software

The PCI Bus PM Interface Specification describes how to implement the PCI PM registers that are required in PCI Express. These registers permit the OS to manage the power environment of both PCI and PCI Express functions.
Rather than immediately diving into a detailed nuts-and-bolts description of the PCI Bus PM Interface Specification, it's a good idea to begin by describing where it fits within the overall context of the OS and the system. Otherwise, this would just be a disconnected discussion of registers, bits, signals, etc. with no frame of reference.

Basics of PCI PM

The most popular OSs currently in use on PC-compatible machines are Windows 98/NT/2000/XP. This section provides an overview of how the OS interacts with other major software and hardware elements to manage the power usage of individual devices and the system as a whole. Table 16-1 on page 569 introduces the major elements involved in this process and provides a very basic description of how they relate to each other. It should be noted that neither the PCI Power Management spec nor the ACPI spec (Advanced Configuration and Power Interface) dictates the policies that the OS uses to manage power. They do, however, define the registers (and some data structures) that are used to control the power usage of PCI and PCI Express functions.
Table 16-1: Major Software/Hardware Elements Involved In PC PM
ElementResponsibility
OSDirects the o verall system power management. To accomplish this goal, the OS issues requests to the ACPI Driver, WDM (Windows Driver Model) device drivers, and to the PCI Express Bus Driver. Applicatio programs that are power conservation-aware interact with the OS to accomplish device power management.

PCI Express System Architecture

Table 16-1: Major Software/Hardware Elements Involved In PC PM (Continued)
ElementResponsibility
ACPI DriverManages configuration, power management, and thermal control of devices embedded on the system board that do not adhere to any industry standard interface specification. Examples could be chipset-specific registers, system board-specific registers that control power planes, etc. The PM registers within PCI Express function (embedded or otherwise) are defined by the PCI PM spec and are there fore not managed by the ACPI driver, but rather by the PCI Express Bus Driver (see entry in this table).
WDM Device DriverThe WDM driver is a Class driver that can work with any device that falls within the Class of devices that it was written to control. The fact that it's not written for a specific device from a specific vendor means that it doesn't have register and bit-level knowledge of the device's interface. When it needs to issue a command to or check the status of the device, it issues a request to the Miniport driver supplied by the vendor of the specific device The WDM also doesn’t understand device characteristics that are pecu- liar to a specific bus implementation of that device type. As an example, the WDM doesn't understand a PCI Express device's configuration reg ister set. It depends on the PCI Express Bus Driver to communicat with PCI Express configuration registers. When it receives requests from the OS to control the power state of its PCI Express device, it passes the request to the PCI Express Bus Driver When a request to power down its device is received from the OS, the WDM saves the contents of its associated PCI Express function’s device-specific registers (in other words, it performs a context save) and then passes the request to the PCI Express Bus Driver to change the power state of the device. Conversely, when a request to re-power the device is received from th OS, the WDM passes the request to the PCI Express Bus Driver t change the power state of the device. After the PCI Express Bus Driver has re-powered the device, the WDM then restores the context to the PCI Express function’s device-specific registers.
Miniport DriverSupplied by the vendor of a device, it receives requests from the WDM Class driver and converts them into the proper series of accesses to the device's register set.
Table 16-1: Major Software/Hardware Elements Involved In PC PM (Continued)
ElementResponsibility
PCI Express Bus DriverThis driver is generic to all PCI Express-compliant devices. It manages their power states and configuration registers, but does not have knowledge of a PCI Express function's device-specific register set (that knowledge is possessed by the Miniport Driver that the WDM driver uses to communicate with the device's register set). It receives requests from the device's WDM to change the state of the device's power man- agement logic: When a request is received to power down the device, the PCI Express Bus Driver is responsible for saving the context of the function’s PCI Express configuration Header registers and any New Capability regis- ters that the device implements. Using the device's PCI Express config- uration Command register, it then disables the ability of the device to act as a Requester or to respond as the target of transactions. Finally, i writes to the PCI Express function’s PM registers to change its state. Conversely, when the device must be re-powered, the PCI Express Bu Driver writes to the PCI Express function's PM registers to change its state. It then restores the function's PCI Express configuration Header registers to their original state.
PCI Express PM registers within each PCI Express function's PCI Express configuration space | The location, format, and usage of these registers is defined by the PCI PM spec. The PCI Express Bus Driver understands this spec and is therefore the entity responsible for accessing a function's PM registers when requested to do so by the function's device driver (i.e., its WDM).
System Board power plane and bus clock control logic | The implementation and control of this logic is typically system board design-specific and is therefore controlled by the ACPI Driver (under the OS's direction).

OnNow Design Initiative Scheme Defines Overall PM

A whitepaper on Microsoft's website clearly defines the goals of the OnNow Design Initiative and the problems it addresses. The author has taken the liberty of reproducing the text verbatim from the Goals section of that paper.

Goals

The OnNow Design Initiative represents the overall guiding spirit behind the sought-after PC design. The following are the major goals as stated in an OnNow document:
  • The PC is ready for use immediately when the user presses the On button.
  • The PC is perceived to be off when not in use but is still capable of responding to wake-up events. Wake-up events might be triggered by a device receiving input such as a phone ringing, or by software that has requested the PC to wake up at some predetermined time.
  • Software adjusts its behavior when the PC's power state changes. The operating system and applications work together intelligently to operate the PC to deliver effective power management in accordance with the user's current needs and expectations. For example, applications will not inadvertently keep the PC busy when it is not necessary, and instead will proactively participate in shutting down the PC to conserve energy and reduce noise.
  • All devices participate in the device power management scheme, whether originally installed in the PC or added later by the user. Any new device can have its power state changed as system use dictates.

System PM States

Table 16-2 on page 572 defines the possible states of the overall system with reference to power consumption. The "Working", "Sleep", and "Soft Off" states are defined in the OnNow Design Initiative documents.
Table 16-2: System PM States as Defined by the OnNow Design Initiative
Power State | Description
Working | The system is completely usable and the OS is performing power management on a device-by-device basis. As an example, the modem may be powered down during periods when it isn't being used.
Sleeping | The system appears to be off and power consumption has been reduced. The sleep levels a system may implement are system design-specific. The amount of time it takes to return to the "Working" state is inversely proportional to the selected level of power conservation. Here are some examples: - The system may keep power applied to main memory, thereby preserving the OS and application programs in memory. The processor's register set contents may also be preserved. In this case, program execution can be resumed very quickly. - The system may copy the complete contents of main memory and the processor's register set contents to disk, and then remove power from the processor and main memory. In this case, the restart time will be longer because both must be restored before program execution can resume.
Soft Off | The system appears to be off and power consumption has been greatly reduced. It requires a full reboot to return to the "Working" state (because the contents of memory have been lost).
No Power | This state isn't listed in the OnNow Design Initiative documents. The system has been disconnected from its power source.

Device PM States

The OnNow Design Initiative also defines the PM states at the device level. They are listed and defined in Table 16-3 on page 573. Table 16-4 on page 574 presents the same information in a more concise form.
Table 16-3: OnNow Definition of Device-Level PM States
State | Description
D0 | Device support: Mandatory. State in which device is on and running. It is receiving full power from the system and is delivering full functionality to the user. This is the initial state entered after a device completes reset.

D1 | Device support: Optional. Class-specific low-power state (refer to "Device Class-Specific PM Specifications" on page 576) in which device context (see "Definition of Device Context" on page 574) may or may not be lost.
D2 | Device support: Optional. Class-specific low-power state (see "Device Class-Specific PM Specifications" on page 576) in which device context (see "Definition of Device Context" on page 574) may or may not be lost. Attains greater power savings than D1. A device in the D2 state can cause the device to lose some context.
D3 | Device support: Mandatory. State in which device is off. Device context is lost. Power can be removed from the device.
Table 16-4: Concise Description of OnNow Device PM States
Device Power State | Power Consumption | Time to Return to D0 State
D0 | Highest | NA
D1 | < D0 | Faster than D2
D2 | < D1 | Faster than D3
D3 | For all intents and purposes none, although there might be some negligible consumption | Slowest

Definition of Device Context

General. During normal operation, the operational state of a device is constantly changing. Software external to the device (e.g., its device driver, the PCI Express Bus Driver, etc.) writes values into some of its registers, reads its status, etc. In addition, the device may contain a processor that executes device-specific code to control the device's interaction with the system as well as with an external element such as a network. The state of the device at a given instant in time is defined by (but not limited to) the following:
  • The contents of the device's PCI Express configuration registers.
  • The state of the device's IO registers that its device driver interacts with.
  • If the device contains a processor, its current program pointer as well as the contents of some of the processor's other registers.
This is referred to as the current device context. Some or all of this information might be lost if the device's PM state is changed to a more aggressive power conservation level:
  • If the device is placed in the D1 or D2 state, it may or may not lose some of this context information.
  • If the device is placed in the D3 state, it will lose its context information.
Assume that a device is placed in a more aggressive power conservation state that causes it to lose some or all of its context information. If the device's context information is not restored when the device is placed back in the D0 state (i.e., fully-operational), it will no longer function correctly.
PM Event (PME) Context. Assume that the OS sets up a modem to wake up the system if the phone rings (in other words, on a Ring Detect) and that the system is then commanded to power down by the OS (e.g., in response to the user depressing the power switch). Remember that "power down" is a relative term within the context of power management. The chipset has power applied and monitors the PME# signal. To support this feature, the modem must implement:
  • A PME (Power Management Event) Message capability.
  • A PME enable/disable control bit.
  • A PME status bit that indicates whether or not the device has sent a PME message.
  • One or more device-specific control bits that are used to selectively enable/ disable the various device-specific events (such as Ring Detect) that can cause the device to send a PME message.
  • Corresponding device-specific status bits that indicate why the device issued a PME message.
It should be obvious that the modem could not wake the system (by sending a PME message) if the logic described in the bullet list also lost power when the device is commanded to enter the D3 (off) state. It wouldn't "remember" that it was supposed to do so or why, would not be enabled to do so, etc. In other words, for the Ring Detect to successfully wake the system, the device's PME context information must not be lost when the device is placed in the D3 state.

Device Class-Specific PM Specifications

Default Device Class Specification. As mentioned earlier in this chapter, the OnNow Design Initiative provides a basic definition of the four possible power states (D0 through D3). It also defines the minimum PM states that all device types must implement. The document that provides this definition is the Default Device Class Power Management spec. This document mandates that all devices, irrespective of device category, must implement the PM states defined in Table 16-5 on page 576.
Table 16-5: Default Device Class PM States
State | Description
D0 | Device is on and running. It is receiving full power from the system and is delivering full functionality to the user.
D1 | This state is not defined and not used.
D2 | This state is not defined and not used.
D3 | Device is off and not running. Device context is assumed lost, and there is no need for any of it to be preserved in hardware. This state should consume the minimum power possible. Its only requirement is to recognize the bus-specific command to re-enter D0. Power can be removed from the device while in D3. If power is removed, the device will receive a bus-specific hardware reset upon reapplication of power, and should initialize itself as in a normal power on.
Device Class-Specific PM Specifications. Above and beyond the power states mandated by the Default Device Class Specification, certain categories (i.e., Classes) of devices may require:
  • the implementation of the intermediate power states (D1 and/or D2)
  • that devices within a class exhibit certain common characteristics when in a particular power state.
The rules associated with a particular device class are found in a set of documents referred to as Device Class Power Management Specifications. Currently, Device Class Power Management Specifications exist for the following device classes:
  • Audio
  • Communications
  • Display
  • Input
  • Network
  • PC Card
  • Storage
They are available on Microsoft's Hardware Developers' web site.

Power Management Policy Owner

General. A device's PM policy owner is defined as the software module that makes decisions regarding the PM state of a device.
In Windows OS Environment. In a Windows environment, the policy owner is the class-specific driver (i.e., the WDM) associated with devices of that class.

PCI Express Power Management vs. ACPI

PCI Express Bus Driver Accesses PCI Express Configuration and PM Registers

As indicated in Table 16-1 on page 569 and Figure 16-1 on page 578, the PCI Express Bus Driver understands the location, format and usage of the PM registers defined in the PCI Power Management spec. It therefore is the software entity that is called whenever the OS needs to change the power state of a PCI Express device (or to determine its current power state and capabilities), or to access its configuration registers. Likewise,
  • The IEEE 1394 Bus Driver understands the location, format and usage of the PM registers defined in the 1394 Power Management spec.
  • The USB Bus Driver understands the location, format and usage of the PM registers defined in the USB Power Management spec.
Note that a discussion of the 1394 and USB Bus drivers is outside the scope of this book.

ACPI Driver Controls Non-Standard Embedded Devices

There are devices embedded on the system board whose register sets do not adhere to any particular industry standard spec. At boot time, the BIOS reports these devices to the OS via a set of tables (the ACPI tables; also referred to as the namespace; ACPI stands for Advanced Configuration and Power Interface).

When the OS needs to communicate with any of these devices, it calls the ACPI Driver. The ACPI Driver executes a handler (referred to as a Control Method) associated with the device. The handler is found in the ACPI tables that were passed to the OS by the BIOS at boot time. The handler is written by the system board designer in a special interpretive language referred to as ACPI Source Language, or ASL. The format of ASL is defined in the ACPI spec. The ASL source is then compiled into ACPI Machine Language, or AML. Note that AML is not a processor-specific machine language. It is a tokenized (i.e., compressed) version of the ASL source code. The ACPI Driver incorporates an AML token interpreter that enables it to "execute" a Control Method.
A discussion of ACPI is outside the scope of this book. It is only mentioned because the OS uses a combination of ACPI and Bus Driver services (such as the PCI Express (PCI-XP) Bus Driver) to manage the system's power and configuration.
Figure 16-1: Relationship of OS, Device Drivers, Bus Driver, PCI Express Registers, and ACPI

Some Example Scenarios

Figure 16-2 on page 581, Figure 16-3 on page 583, and Figure 16-4 on page 584 illustrate some example PM scenarios. It should be noted that these illustrations are meant to be introductory in nature and do not cover all possible power state changes. The examples focus on turning a PCI Express function Off (from a power perspective), or turning it On. This implies two possible states for a device (D0 and D3). While it's possible a function only has two states, a function may additionally implement other optional, intermediate power states (D1 and/or D2). The possible power states are discussed later in this chapter.
The following are some of the terms used in the illustrations:
  • IO Request Packet, or IRP. The OS communicates a request to a Windows device driver by issuing an IRP to it. There are different categories of IRPs; for example, a Power IRP is used to request a change in the PM state of a device or to get its current PM state.
  • Windows Driver Model, or WDM. A device driver written for the Windows environment that controls a device or a group of similar devices (e.g., network adapters).
  • General Purpose Event, or GPE. ACPI-related events. The chipset implements a GPE register which is used to selectively enable or disable recognition of various GPEs. When recognition of a specific GPE is enabled (such as a PM event) and that event occurs, the chipset generates an SCI (System Control Interrupt) to the processor. This invokes the GPE handler within the ACPI Driver which then reads the GPE Status registers in the chipset to determine which GPE caused the interrupt.
  • System Control Interrupt, or SCI. A system interrupt used by hardware to notify the OS of ACPI events. The SCI is an active low, shareable, level-sensitive interrupt.
  • Control Method. A Control Method is a definition of how the OS can perform a simple hardware task. For example, the OS invokes a Control Method to read the temperature of a thermal zone. See the definition of ASL. An ACPI-compatible system must provide a minimal set of common Control Methods in the ACPI tables. The OS provides a set of well-defined Control Methods that ACPI table developers can reference in their Control Methods. OEMs can support different revisions of chipsets with one BIOS by either including Control Methods in the BIOS that test configurations and respond as needed or by including a different set of Control Methods for each chipset revision.
  • ACPI Source Language, or ASL. Control Methods are written in a language called ASL which is then compiled into AML (ACPI Machine Language). AML is comprised of a highly-compressed series of tokens that represent the ASL code. The AML code is interpreted and executed by an AML interpreter incorporated within the ACPI Driver.
Scenario-OS Wishes To Power Down PCI Express Devices. Figure 16-2 on page 581 illustrates the basic series of actions required when the OS wishes to power down all PCI Express devices and associated links in the fabric (i.e., remove the reference clock and Vcc) to conserve maximum power. Before doing this, it must first ensure that all functions within all PCI Express devices have been powered down.
  1. If all of the PCI functions within all PCI Express devices are already powered down, skip to step 11.
  2. The OS issues a Power IRP to the device driver (WDM) to transition all device functions to the lowest power state.
  3. The WDM saves the current content of the function's device-specific registers.
  4. The WDM disables the device's ability to generate interrupt requests by clearing its interrupt enable bit in its function-specific register set.
  5. The WDM passes the Power IRP to the PCI Express Bus Driver.
  6. The Bus Driver saves the current content of the function's configuration Header registers and any New Capability register sets that it may implement, along with extended configuration registers.
  7. The PCI Express Bus Driver disables the function's ability to act as a Requester and Completer by clearing the appropriate bits in its configuration Command register.
  8. The PCI Express Bus Driver writes to the function's PCI PM registers to set the lowest power state (off). A sketch of steps 6 through 8 follows Figure 16-2 below.
  9. The PCI Express Bus Driver passes an IRP completion notice to the WDM.
  10. The WDM passes the IRP completion notice to the OS. Steps 2 through 10 are repeated until all PCI functions within all devices have been placed in the powered down state.
  11. The OS issues a Power IRP to the ACPI driver requesting that it turn off the reference clock and Vcc.
  12. The ACPI driver runs the appropriate AML Control Method to turn off the clock and power.
  13. The ACPI driver passes the IRP completion notice to the OS.
Figure 16-2: Example of OS Powering Down All Functions On PCI Express Links and then the Links Themselves
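For illustration only, here is a minimal C sketch of the bus-driver portion of the sequence above (steps 6 through 8). The cfg_read*/cfg_write* accessors, the saved_hdr buffer, and the pm_cap argument (the configuration offset of the function's PM capability) are assumptions for the example; they are not APIs defined by the specification or by Windows.

```c
#include <stdint.h>

/* Hypothetical platform-provided configuration-space accessors. */
extern uint16_t cfg_read16(int bus, int dev, int fn, int off);
extern uint32_t cfg_read32(int bus, int dev, int fn, int off);
extern void     cfg_write16(int bus, int dev, int fn, int off, uint16_t val);

#define CMD_REG       0x04          /* configuration Command register          */
#define CMD_IO_EN     (1u << 0)     /* IO Space enable                         */
#define CMD_MEM_EN    (1u << 1)     /* Memory Space enable                     */
#define CMD_BUSMASTER (1u << 2)     /* Bus Master (Requester) enable           */

#define PMCSR_OFFSET  0x04          /* PMCSR is 4 bytes into the PM capability */
#define PMCSR_D3HOT   0x3u          /* PowerState field = 11b                  */

/* Steps 6-8: save Header context, quiesce the function, then set D3hot. */
void bus_driver_power_down(int bus, int dev, int fn, int pm_cap,
                           uint32_t saved_hdr[16])
{
    /* Step 6: save the 16 dwords of the configuration Header. */
    for (int i = 0; i < 16; i++)
        saved_hdr[i] = cfg_read32(bus, dev, fn, i * 4);

    /* Step 7: clear the Command register enables so the function can no
     * longer act as a Requester or respond as a target. */
    uint16_t cmd = cfg_read16(bus, dev, fn, CMD_REG);
    cmd &= (uint16_t)~(CMD_IO_EN | CMD_MEM_EN | CMD_BUSMASTER);
    cfg_write16(bus, dev, fn, CMD_REG, cmd);

    /* Step 8: write the PowerState field of the PMCSR to select D3hot. */
    uint16_t pmcsr = cfg_read16(bus, dev, fn, pm_cap + PMCSR_OFFSET);
    pmcsr = (uint16_t)((pmcsr & ~PMCSR_D3HOT) | PMCSR_D3HOT);
    cfg_write16(bus, dev, fn, pm_cap + PMCSR_OFFSET, pmcsr);
}
```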
Scenario-Restore All Functions To Powered Up State. Figure 16-3 on page 583 illustrates the basic series of actions required when the OS wishes to power up a PCI Express function that was placed in the powered down state earlier.
  1. It's possible that the OS had removed power to all PCI Express devices and turned off the PCI Express reference clock as in the previous example. To restore functions back to their operating condition, the OS issues a Power (i.e., Power Management) IRP to the ACPI Driver requesting that the links be turned back on. In response, the ACPI Driver would execute the AML code necessary to turn on the PCI Express reference clock generator and re-apply power to the devices. It should be obvious that PCI-XP devices closest to the Host Bridge/Root Complex must be powered up first. When the ACPI Driver has completed this operation, it issues an IRP completion notice back to the OS. If the reference clock and power had not been turned off earlier, this step can be skipped.
  2. The OS issues a Power IRP to the PCI Express device's WDM requesting that the device be restored to the full power state. The WDM passes the IRP to the PCI Express Bus Driver.
  3. The PCI Express Bus Driver writes to the device's PCI Express PM registers to power up the device.
  4. The PCI Express Bus Driver restores the contents of the device's PCI Express configuration Header registers and any New Capability register sets that the device implements. This automatically restores the device's PCI Express configuration Command register enable bits to their original states. A sketch of steps 3 and 4 follows Figure 16-3 below.
  5. The PCI Express Bus Driver passes an IRP completion notice back to the WDM.
  6. The WDM restores the content of the device's device-specific IO or memory-mapped IO registers. This causes the device's interrupt enable bit to be restored, re-enabling the device's ability to generate interrupt requests. The device is now ready to resume normal operation.
  7. The WDM returns an IRP completion notice to the OS.
Figure 16-3: Example of OS Restoring a PCI Express Function To Full Power
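A matching sketch of the bus-driver portion of the restore sequence (steps 3 and 4), using the same assumed helpers, could look like the following. The 10 ms delay corresponds to the minimum D3hot-to-D0 recovery time listed in Table 16-12 on page 596; msleep() stands in for whatever delay primitive the environment actually provides.

```c
#include <stdint.h>

extern uint16_t cfg_read16(int bus, int dev, int fn, int off);
extern void     cfg_write16(int bus, int dev, int fn, int off, uint16_t val);
extern void     cfg_write32(int bus, int dev, int fn, int off, uint32_t val);
extern void     msleep(unsigned int ms);   /* assumed delay routine */

#define PMCSR_OFFSET   0x04
#define PMCSR_PWR_MASK 0x3u    /* PowerState field, bits 1:0 */
#define PMCSR_D0       0x0u

/* Steps 3-4: return the function to D0, wait out the recovery time,
 * then restore the Header context captured at power-down. */
void bus_driver_power_up(int bus, int dev, int fn, int pm_cap,
                         const uint32_t saved_hdr[16])
{
    /* Step 3: write PowerState = D0 (00b) in the PMCSR. */
    uint16_t pmcsr = cfg_read16(bus, dev, fn, pm_cap + PMCSR_OFFSET);
    pmcsr = (uint16_t)((pmcsr & ~PMCSR_PWR_MASK) | PMCSR_D0);
    cfg_write16(bus, dev, fn, pm_cap + PMCSR_OFFSET, pmcsr);

    /* Allow at least 10 ms before the first access after leaving D3hot
     * (see Table 16-12). */
    msleep(10);

    /* Step 4: restore the configuration Header, which also restores the
     * Command register enable bits to their pre-power-down values. */
    for (int i = 0; i < 16; i++)
        cfg_write32(bus, dev, fn, i * 4, saved_hdr[i]);
}
```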
Scenario-Setup a Function-Specific System WakeUp Event. Figure 16-4 on page 584 illustrates the OS preparing a PCI Express device so that it will wake up the system (send a PME message) when a particular device-specific event occurs.
  1. The OS issues a Power IRP to the device driver (WDM) to enable the device to wake up the system on a specified event.
  2. The WDM writes to device-specific registers within the device to enable the event that will cause the system to wake up.
  3. The WDM passes the IRP to the PCI Express Bus driver.
  4. The PCI Express Bus Driver writes to the function's PM registers to enable its PME# logic (a sketch follows Figure 16-4 below).
  5. The PCI Express Bus Driver returns the IRP completion notice to the WDM.
  6. The WDM returns the IRP completion notice to the OS.
  7. The OS issues a Power IRP to the ACPI driver requesting that the PCI Express Power Management Event (PME) monitoring logic be enabled to generate an ACPI interrupt (referred to as an SCI, or System Control Interrupt).
  8. The ACPI driver enables the chipset's GPE logic to generate an SCI when PME# is detected asserted.
  9. The ACPI driver returns the IRP completion notice to the OS.
Figure 16-4: OS Prepares a Function To Cause System WakeUp On Device-Specific Event
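The Bus Driver's part of step 4 essentially amounts to setting the PME_En bit (bit 8) of the function's PMCSR, as described later in Table 16-14. A hedged sketch, again using the assumed configuration accessors:

```c
#include <stdint.h>

extern uint16_t cfg_read16(int bus, int dev, int fn, int off);
extern void     cfg_write16(int bus, int dev, int fn, int off, uint16_t val);

#define PMCSR_OFFSET   0x04
#define PMCSR_PME_STAT (1u << 15)   /* PME_Status: write a one to clear    */
#define PMCSR_PME_EN   (1u << 8)    /* PME_En: 1 = PME generation enabled  */

/* Step 4: arm the function's PME logic. Any stale PME_Status is cleared
 * in the same write so an old event is not mistaken for a new wakeup. */
void bus_driver_enable_pme(int bus, int dev, int fn, int pm_cap)
{
    uint16_t pmcsr = cfg_read16(bus, dev, fn, pm_cap + PMCSR_OFFSET);
    pmcsr |= PMCSR_PME_STAT;    /* writing 1 clears the sticky status bit */
    pmcsr |= PMCSR_PME_EN;      /* enable PME message generation          */
    cfg_write16(bus, dev, fn, pm_cap + PMCSR_OFFSET, pmcsr);
}
```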

Function Power Management

PCI Express devices are required to support power management. Consequently, several registers and related bit fields must be implemented as discussed below.

The PM Capability Register Set

The PCI-PM specification defines the PM Capability register set that is located in PCI-compatible configuration space above the configuration header. This register set is one of potentially many Capability register sets that are linked together via pointers. The Capability ID of the PM register set is 01h. To determine the location of the PM registers, software can perform the following checks. The registers described below must be implemented by PCI Express devices:
  1. Software checks bit 4 (Capabilities List bit) of the function's Configuration Status register. A one indicates that the Capabilities Pointer register is implemented in the first byte of dword 13d of the function's configuration Header space.
  2. The programmer then reads the dword-aligned pointer from the Capabilities Pointer register and uses it to read the indicated dword from the function's configuration space. This is the first dword of the first New Capability register set.
  3. Refer to Figure 16-5 on page 586. If the first (i.e., least-significant) byte of the dword read contains Capability ID 01h, this identifies it as the PM register set used to control the function's power state. If the ID is something other than 01h, then this is the register set for a New Capability other than PM (e.g., PCI Express Capability registers). The byte immediately following the Capability ID byte is the Pointer to Next Capability field that specifies the start location (within the function's configuration space) of the register set for the next New Capability (if there are any additional New Capabilities). 00h indicates there isn't any, while a non-zero value is a valid pointer. As software traverses the linked-list of the function's New Capabilities, its PM register set will be located (a sketch of this traversal follows Figure 16-5 below). A detailed description of the PM registers can be found in "Detailed Description of PCI-PM Registers" on page 596.
Figure 16-5: PCI Power Management Capability Register Set
1st Dword: Power Management Capabilities (PMC) [31:16] | Pointer to Next Capability [15:8] | Capability ID 01h [7:0]
2nd Dword: Bridge Support Extensions (PMCSR_BSE) [23:16] | Control/Status Register (PMCSR) [15:0]
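A minimal C sketch of this discovery procedure is shown below. The cfg_read* helpers are assumed platform configuration accessors; the offsets used (the Status register at 06h with its Capabilities List bit, the Capabilities Pointer at 34h, and the ID/next-pointer bytes at the start of each capability) follow the layout described in the steps above.

```c
#include <stdint.h>

extern uint8_t  cfg_read8(int bus, int dev, int fn, int off);
extern uint16_t cfg_read16(int bus, int dev, int fn, int off);

#define STATUS_REG      0x06
#define STATUS_CAP_LIST (1u << 4)    /* Capabilities List bit             */
#define CAP_PTR_REG     0x34         /* first byte of dword 13d           */
#define CAP_ID_PM       0x01         /* PM register set Capability ID     */

/* Walk the New Capabilities linked list; return the configuration offset
 * of the PM register set, or 0 if no PM capability is found. */
int find_pm_capability(int bus, int dev, int fn)
{
    /* Step 1: is a Capabilities List implemented at all? */
    if (!(cfg_read16(bus, dev, fn, STATUS_REG) & STATUS_CAP_LIST))
        return 0;

    /* Step 2: read the dword-aligned pointer to the first capability. */
    uint8_t ptr = cfg_read8(bus, dev, fn, CAP_PTR_REG) & 0xFC;

    /* Step 3: follow the Pointer-to-Next-Capability chain until the
     * Capability ID byte reads 01h, or the chain terminates (00h). */
    while (ptr != 0) {
        uint8_t id = cfg_read8(bus, dev, fn, ptr);
        if (id == CAP_ID_PM)
            return ptr;
        ptr = cfg_read8(bus, dev, fn, ptr + 1) & 0xFC;
    }
    return 0;   /* no PM register set located */
}
```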

Device PM States

Each PCI Express function must support the full-on (D0) PM state and the full-off (D3) PM state. The D1 and D2 PM states are optional, as are the PM registers. The sections that follow provide a description of the possible PM states that may be supported by a PCI Express function.

D0 State—Full On

Mandatory. In this state, no power conservation is in effect and the device is fully-functional. All PCI Express functions must support the D0 state. There are two substates of the D0 PM state: D0 Uninitialized and D0 Active. No software-based power conservation is in effect in either of these two states. However, PCI Express defines Active State Power Management (ASPM) that is handled autonomously under hardware control to reduce link power consumption when the device is in this state. Table 16-6 on page 587 summarizes the PM policies while in the D0 state.
D0 Uninitialized. A function enters the D0 Uninitialized state in one of two ways:
  • As a result of the Fundamental Reset being detected, or
  • When commanded to transition from the D3hot  to the D0 PM state by software.
In either case, the function may exhibit characteristics that it has after detecting Fundamental Reset. In other words, its registers may be returned to their default state (before the function was configured and enabled by software) though this is not required. The function exhibits the following characteristics:
  • It only responds to PCI Express configuration transactions.
  • Its Command register enable bits are all returned to their default states.
  • It cannot initiate transactions.
  • It cannot act as the target of memory or IO transactions.
D0 Active. Once the function has been configured and enabled by software, it is in the D0 Active PM state and is fully functional.
Table 16-6: D0 Power Management Policies
Link PM State | Function PM State | Registers and/or State that must be valid | Power | Actions permitted to Function | Actions permitted by Function
L0 | D0 uninitialized | PME context** | <10W | PCI Express config transactions | None
L0, L0s (required)*, L1 (optional)* | D0 active | all | full | Any PCI Express transaction | Any transaction, interrupt, or PME**
L2/L3 | D0 active | N/A***
* Active State Power Management
** If PME supported in this state.
*** This combination of Bus/Function PM states not allowed.

D1 State—Light Sleep

Optional. This is a light sleep power conservation state. The function cannot:
  • initiate TLPs (except PME Message TLP, if enabled)
  • act as the target of transactions other than PCI Express configuration transactions. The function's PM registers are implemented in its configuration space and software must be able to access these registers while the device is in the D1 state.
Other characteristics of the D1 state are:
  • Link automatically enters the L1 power conservation state when PM software places the function into the D1 state.
  • The function may reactivate the link and send a PME message to notify PM software that the function has experienced an event that requires it be returned to full power (assuming that it supports the generation of PM events while in the D1 state and has been enabled to do so).
  • The function may or may not lose its context in this state. If it does and the device supports PME, it must maintain its PME context (see "PM Event (PME) Context" on page 575) while in this state.
  • The function must be returned to the D0 Active PM state in order to be fully-functional.

Table 16-7 lists the PM policies while in the D1 state.
Table 16-7: D1 Power Management Policies
Link PM State | Function PM State | Registers and/or State that must be valid | Power | Actions permitted to Function | Actions permitted by Function
L1 | D1 | PME context** and any class-specific context | lower than D0 | PCI Express config transactions | PME Messages**
L2-L3 | D1 | NA*
* This combination of Bus/Function PM states not allowed.
** If PME supported in this state.

D2 State-Deep Sleep

Optional. This power state provides more power conservation than the D1 PM state and less than the D3hot  PM state. The function cannot:
  • initiate TLPs (except PME Message TLP).
  • act as the target of transactions other than PCI Express configuration transactions. The function's PM registers are implemented in its configuration space and software must be able to access these registers while the device is in the D2 state.
Other characteristics of the D2 state are:
  • The function transitions its link to the L1 state when PM software transitions the function to the D2 state.
  • The function may send a PME message to notify PM software that it needs to be returned to the active state to handle an event that has occurred (assuming that it supports the generation of PM events while in the D2 state and has been enabled to do so).
  • The function may or may not lose its context in this state. If the function loses context and the device supports PME messages, it must maintain its PME context (see "PM Event (PME) Context" on page 575) while in this state.
  • The function must be returned to the D0 Active PM state in order to be fully-functional.
Table 16-8 on page 590 illustrates the PM policies while in the D2 state.
Table 16-8: D2 Power Management Policies
Link PM State | Function PM State | Registers and/or State that must be valid | Power | Actions permitted to Function | Actions permitted by Function
L1 | D2 | PME context* and any class-specific context | lower than D1 | PCI Express config transactions | PME Messages*
L2/L3 | D2 | N/A**
* If PME supported in this state.
** This combination of Bus/Function PM states not allowed.

D3—Full Off

Mandatory. All functions must support the D3 PM state. This is the PM state in which power conservation is maximized. There are two ways that a function can be placed into the D3 PM state:
  • Removal of power (Vcc) from the device. This is referred to as the D3cold  PM state. The function could transition into the D3cold  state for one of two reasons: if the link it resides on is placed in the L2 or L3 state; or the system is unplugged.
  • Power is still applied to the function and software commands the function to enter the D3 state. This is referred to as the D3hot  PM state.
The following two sections describe the D3hot  and D3cold  PM states.
D3Hot State. Mandatory. As mentioned in the previous section, a function is placed into the D3hot PM state under program control (by writing the appropriate value into the PowerState field of its PMCSR register).

The function cannot:

  • initiate TLPs (except the PME Message TLP, if enabled, and the PME_TO_ACK Message TLP).
  • act as the target of transactions other than PCI Express configuration transactions and PME_Turn_Off Message TLP. The function's PM registers are implemented in its configuration space and software must be able to access these registers while the device is in the D3hot  state.
Other characteristics of the D3hot  state are:
  • The function transitions its link to the L1 state when PM software transitions the function to the D3hot state.
  • The function may send a PME message to notify PM software of its need to be returned to the full active state (assuming that it supports the generation of PM events while in the D3hot  state and has been enabled to do so).
  • The function almost certainly loses its context in this state. If it does and the device supports the generation of PME messages while in the D3hot  state,it must maintain its PME context (see "PM Event (PME) Context" on page 575) while in this state.
  • The function must be returned to the D0 Active PM state in order to be fully-functional.
The function exits the D3hot  state under two circumstances:
  • If Vcc is subsequently removed from the device,it transitions from D3hot  to the D3cold  PM state.
  • Software can write to the PowerState field of the function's PMCSR register to change its PM state to D0 Uninitialized.
When programmed to exit D3hot and return to the D0 PM state, the function returns to the D0 Uninitialized PM state (but Fundamental Reset is not required to be asserted). Table 16-9 on page 592 lists the PM policies while in the D3hot state.
Table 16-9: D3hot  Power Management Policies
Bus PM State | Function PM State | Registers and/or State that must be valid | Power | Actions permitted to Function | Actions permitted by Function
L0 | D3hot | NA*
L1 | D3hot | PME context** | lower than D2 | PCI Express config transactions; PME_Turn_Off broadcast Message | PME message**; PME_TO_ACK message***; PM_Enter_L23 DLLP*** (these can occur only after the link returns to L0)
L2/L3 Ready | D3hot | L2/L3 Ready is entered following the PME_Turn_Off handshake sequence, which prepares a device for power removal***
L2/L3 | D3hot | NA*
* This combination of Bus/Function PM states not allowed.
** If PME supported in this state.
*** See "L2/L3 Ready Handshake Sequence" on page 634 for details regarding the sequence.
D3Cold  State. Mandatory. Every PCI Express function enters the D3Cold  PM state upon removal of power (Vcc) from the function. When power is restored, a Fundamental Reset must also be asserted or the device must generate an internal reset. The function then transitions from the D3Cold  state to the D0 Uninitialized state. A function capable of generating a PME must maintain its PME context while in this state and when transitioning to the D0 state. Since power has been removed from the function, the function must utilize some auxiliary power source to maintain the PME context while in D3Cold  state. When the device goes to D0 Uninitialized state, if capable and enabled to do so, it generates a PME message to inform the system of the wake up event. For more information on the auxiliary power source, refer to "Auxiliary Power" on page 645.
Table 16-10 on page 593 illustrates the PM policies while in the D3Cold state.
Table 16-10: D3 cold Power Management Policies
Bus PM State | Function PM State | Registers and/or State that must be valid | Power | Actions permitted to Function | Actions permitted by Function
L0 | D3cold | | | |
L1 | D3cold | | | |
L2 | D3cold | PME context* | AUX Power | Bus reset only | Signal Beacon** or WAKE#**
L3 | D3cold | None | None | |
* If PME supported in this state.
** The method used to signal a wake to restore clock and power depends on form factor.

Function PM State Transitions

Figure 16-6 on page 594 illustrates the permissible PM state transitions for a PCI Express function. Table 16-11 on page 594 provides a description of each transition.
Table 16-12 on page 596 illustrates the delays involved in transitioning from one state to another from both a hardware and a software perspective.
Figure 16-6: PCI Express Function Power Management State Transitions
Table 16-11: Description of Function State Transitions
From State | To State | Description
D0 Uninitialized | D0 Active | Occurs under program control when the function has been completely configured and enabled by its driver.
D0 Active | D1 | Occurs when software writes to the PowerState field in the function's PMCSR register and sets the state to D1.
D0 Active | D2 | Occurs when software writes to the PowerState field in the function's PMCSR register and sets the state to D2.
D0 Active | D3hot | Occurs when software writes to the PowerState field in the function's PMCSR register and sets the state to D3hot.
D1 | D0 Active | Occurs when software writes to the PowerState field in the function's PMCSR register and sets the state to D0.
D1 | D2 | Occurs when software writes to the PowerState field in the function's PMCSR register and sets the state to D2.
D1 | D3hot | Occurs when software writes to the PowerState field in the function's PMCSR register and sets the state to D3hot.
D2 | D0 Active | Occurs when software writes to the PowerState field in the function's PMCSR register and sets the state to D0.
D2 | D3hot | Occurs when software writes to the PowerState field in the function's PMCSR register and sets the state to D3hot.
D3hot | D3cold | Occurs when the Power Control logic removes power from the function.
D3hot | D0 Uninitialized | Occurs when software writes to the PowerState field in the function's PMCSR register and sets the state to D0.
D3cold | D0 Uninitialized | A wake event causes power (Vcc) to be restored and Fundamental Reset also becomes active. This causes the function to return to the D0 Uninitialized state. If wake is not supported, Fundamental Reset causes the transition to the D0 Uninitialized state.
Table 16-12: Function State Transition Delays
Initial State | Next State | Minimum software-guaranteed delays
D0 | D1 | 0
D0 or D1 | D2 | 200μs from new state setting to first access to function (including config accesses).
D0, D1, or D2 | D3hot | 10ms from new state setting to first access to function (including config accesses).
D1 | D0 | 0
D2 | D0 | 200μs from new state setting to first access to function (including config accesses).
D3hot | D0 | 10ms from new state setting to first access to function (including config accesses).
D3cold | D0 |

Detailed Description of PCI-PM Registers

The PCI Bus PM Interface spec defines the PM registers (see Figure 16-7 on page 596) that are implemented in both PCI and PCI Express functions. These registers provide software with information regarding the function's PM capabilities and permit software to control the PM properties of the function. Since the PM registers are implemented in the PCI Express function's configuration space, software uses PCI configuration accesses to read and write the PM registers. The sections that follow provide a detailed description of these registers.
Figure 16-7: PCI Function's PM Registers
1st Dword: Power Management Capabilities (PMC) [31:16] | Pointer to Next Capability [15:8] | Capability ID 01h [7:0]
2nd Dword: Data Register [31:24] | Bridge Support Extensions (PMCSR_BSE) [23:16] | Control/Status Register (PMCSR) [15:0]

PM Capabilities (PMC) Register

Mandatory for a function that implements PM. This 16-bit read-only register is interrogated by software to determine the PM capabilities of the function. Figure 16-8 on page 597 illustrates the register and Table 16-13 on page 597 describes each bit field.
Figure 16-8: Power Management Capabilities (PMC) Register - Read Only
Table 16-13: The PMC Register Bit Assignments
Bit(s) | Description
15:11 | PME_Support field. Indicates the PM states within which the function is capable of sending a PME message (Power Management Event). A 0 in a bit indicates that PME notification is not supported in the respective PM state.
Bit | Corresponds to PM State
11 | D0
12 | D1
13 | D2
14 | D3hot
15 | D3cold (function requires aux power for PME logic and Wake signaling via beacon or WAKE# pin). Systems that support wake from D3cold must also support aux power. Similarly, components that support wake must use aux power to signal the wakeup. Bits 31, 30, and 27 must be set to 1b for virtual PCI-PCI Bridges implemented within Root and Switch Ports. This is required for ports that forward PME Messages.
10 | D2_Support bit. 1 = Function supports the D2 PM state.

9 | D1_Support bit. 1 = Function supports the D1 PM state.
8:6 | Aux_Current field. For a function that supports generation of the PME message from the D3cold state, this field reports the current demand made upon the 3.3Vaux power source (see "Auxiliary Power" on page 645) by the function's logic that retains the PME context information. This information is used by software to determine how many functions can simultaneously be enabled for PME generation (based on the total amount of current each draws from the system 3.3Vaux power source and the power sourcing capability of the power source). If the function does not support PME notification from within the D3cold PM state, this field is not implemented and always returns zero when read. Alternatively, a new feature defined by PCI Express permits devices that do not support PMEs to report the amount of Aux current they draw when enabled by the Aux Power PM Enable bit within the Device Control register. If the function implements the Data register (see "Data Register" on page 603), this field returns zero.
Bits 8 7 6 | Max Current Required
111 | 375mA
110 | 320mA
101 | 270mA
100 | 220mA
011 | 160mA
010 | 100mA
001 | 55mA
000 | 0mA
5 | Device-Specific Initialization (DSI) bit. A one in this bit indicates that immediately after entry into the D0 Uninitialized state, the function requires additional configuration above and beyond setup of its PCI configuration Header registers before the Class driver can use the function. Microsoft OSs do not use this bit. Rather, the determination and initialization is made by the Class driver.
4 | Reserved.
3 | PME Clock bit. Does not apply to PCI Express. Must be hardwired to 0.
2:0 | Version field. This field indicates the version of the PCI Bus PM Interface spec that the function complies with.
Bits 2 1 0 | Complies with Spec Version
001 | 1.0
010 | 1.1 (required by PCI Express)
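Pulling the PMC fields together, the following is a small, purely illustrative C sketch of how software might decode a PMC value read from configuration space; the structure layout and helper name are assumptions, but the bit positions and the Aux_Current encoding follow Table 16-13.

```c
#include <stdint.h>
#include <stdbool.h>

struct pmc_info {
    unsigned int version;        /* bits 2:0   - PCI Bus PM spec revision        */
    bool         dsi;            /* bit 5      - Device-Specific Initialization  */
    unsigned int aux_current_ma; /* bits 8:6   - 3.3Vaux demand, decoded to mA   */
    bool         d1_support;     /* bit 9                                        */
    bool         d2_support;     /* bit 10                                       */
    unsigned int pme_support;    /* bits 15:11 - D0/D1/D2/D3hot/D3cold PME mask  */
};

/* Decode a 16-bit PMC register value into its fields. */
struct pmc_info pmc_decode(uint16_t pmc)
{
    static const unsigned int aux_ma[8] = { 0, 55, 100, 160, 220, 270, 320, 375 };
    struct pmc_info info;

    info.version        = pmc & 0x7u;
    info.dsi            = (pmc >> 5) & 1u;
    info.aux_current_ma = aux_ma[(pmc >> 6) & 0x7u];
    info.d1_support     = (pmc >> 9) & 1u;
    info.d2_support     = (pmc >> 10) & 1u;
    info.pme_support    = (pmc >> 11) & 0x1Fu;
    return info;
}
```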

PM Control/Status (PMCSR) Register

Mandatory for all PCI Express Devices. This register is used for the following purposes:
  • If the function implements PME capability, this register contains a PME Status bit that reflects whether or not a previously-enabled PME has occurred.
  • If the function implements PME capability, this register contains a PME Enable bit that permits software to enable or disable the function's ability to send a PME message or assert the WAKE# signal.
  • If the optional Data register is implemented (see "Data Register" on page 603), this register contains two fields that:
  • permit software to select the information that can be read through the Data register;
  • and provide the scaling factor that the Data register value must be multiplied by.
  • The register's PowerState field can be used by software to determine the current PM state of the function and to place the function into a new PM state.

Figure 16-9 on page 600 and Table 16-14 on page 600 provide a description of the PMCSR bit fields. Note that PME is the abbreviation for Power Management Event.
Figure 16-9: Power Management Control/Status (PMCSR) Register - R/W
Table 16-14: PM Control/Status Register (PMCSR) Bit Assignments
Bit(s) | Value at Reset | Read/Write | Description
31:24 | all zeros | Read-only | See "Data Register" on page 603.
23 | zero | Read-only | Not used in PCI Express.
22 | zero | Read-only | Not used in PCI Express.
21:16 | all zeros | Read-only | Reserved.
15 | see description | Read/Write (write a one to clear) | PME_Status bit. Optional. Only implemented if the function supports PME notification; otherwise this bit is always zero. If the function supports PME, this bit reflects whether the function has experienced a PME (even if the PME_En bit in this register has disabled the function's ability to send a PME message). If set to one, the function has experienced a PME. Software clears this bit by writing a one to it. After reset, this bit is zero if the function doesn't support PME from D3cold. If the function supports PME from D3cold: this bit is indeterminate at initial OS boot time; otherwise, it reflects whether the function has experienced a PME. If the function supports PME from D3cold, the state of this bit must persist (is sticky) while the function remains in the D3cold state and during the transition from D3cold to the D0 Uninitialized state. This implies that the PME logic must use an aux power source to power this logic during these conditions (see "Auxiliary Power" on page 645).
14:13 | Device-specific | Read-only | Data_Scale field. Optional. If the function does not implement the Data register (see "Data Register" on page 603), this field is hardwired to return zeros. If the Data register is implemented, the Data_Scale field is mandatory and must be implemented as a read-only field. The value read from this field represents the scaling factor that the value read from the Data register must be multiplied by. The value and interpretation of the Data_Scale field depends on the data item selected to be viewed through the Data register by the Data_Select field (see description in the next row of this table).

12:9 | 0000b | Read-only or Read/Write (see description) | Data_Select field. Optional. If the function does not implement the Data register (see "Data Register" on page 603), this field is hardwired to return zeros. If the Data register is implemented, the Data_Select field is mandatory and is implemented as a read/write field. The value placed in this field selects the data value to be viewed through the Data register. That value must then be multiplied by the value read from the Data_Scale field (see previous row in this table).
8 | see description | Read/Write | PME_En bit. Optional. 1 = enable the function's ability to send PME messages when an event occurs. 0 = disable. If the function does not support the generation of PMEs from any power state, this bit is hardwired to always return zero when read. After reset, this bit is zero if the function doesn't support PME from D3cold. If the function supports PME from D3cold: this bit is indeterminate at initial OS boot time; otherwise, it enables or disables the function's ability to send a PME message when a PME event occurs. If the function supports PME from D3cold, the state of this bit must persist while the function remains in the D3cold state and during the transition from D3cold to the D0 Uninitialized state. This implies that the PME logic must use an aux power source to power this logic during these conditions.
7:2 | all zeros | | Reserved.
1:0 | 00b | Read/Write | PowerState field. Mandatory. Software uses this field to determine the current PM state of the function (by reading this field) or to place it into a new PM state (by writing to this field). If software selects a PM state that isn't supported by the function, the write must complete normally, but the write data is discarded and no state change occurs.
Bits 1 0 | PM State
00 | D0
01 | D1
10 | D2
11 | D3hot
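Because a write that selects an unsupported state completes normally but is simply discarded, power-management code typically reads the PowerState field back to confirm that a transition actually took effect. A hedged sketch using the assumed configuration accessors from the earlier examples:

```c
#include <stdint.h>
#include <stdbool.h>

extern uint16_t cfg_read16(int bus, int dev, int fn, int off);
extern void     cfg_write16(int bus, int dev, int fn, int off, uint16_t val);

#define PMCSR_OFFSET   0x04
#define PMCSR_PWR_MASK 0x3u         /* PowerState field, bits 1:0           */
#define PMCSR_PME_STAT (1u << 15)   /* RW1C - avoid clearing it by accident */

/* Request a new PM state (00b=D0, 01b=D1, 10b=D2, 11b=D3hot) and read the
 * field back; an unsupported request leaves the old value in place. */
bool set_power_state(int bus, int dev, int fn, int pm_cap, unsigned int state)
{
    uint16_t pmcsr = cfg_read16(bus, dev, fn, pm_cap + PMCSR_OFFSET);
    pmcsr &= (uint16_t)~PMCSR_PME_STAT;     /* don't clear PME_Status        */
    pmcsr  = (uint16_t)((pmcsr & ~PMCSR_PWR_MASK) | (state & PMCSR_PWR_MASK));
    cfg_write16(bus, dev, fn, pm_cap + PMCSR_OFFSET, pmcsr);

    pmcsr = cfg_read16(bus, dev, fn, pm_cap + PMCSR_OFFSET);
    return (pmcsr & PMCSR_PWR_MASK) == (state & PMCSR_PWR_MASK);
}
```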

Data Register

Optional, read-only. Refer to Figure 16-10 on page 605. The Data register is an optional, 8-bit, read-only register. If implemented, the Data register provides the programmer with the following information:
  • Power consumed in the selected PM state. This information is useful in power budgeting.
  • Power dissipated in the selected PM state. This information is useful in managing the thermal environment.
  • Other, device-specific information regarding the function's operational characteristics. Currently, the spec only defines power consumption and power dissipation information to be reported through this register.
If the Data register is implemented,
  • the Data_Select and Data_Scale fields of the PMCSR registers must also be implemented
  • the Aux_Current field of the PMC register must not be implemented.
Determining Presence of the Data Register. Perform the following procedure to determine the presence of the Data register:
  1. Write a value of 0000b into the Data_Select field of the PMCSR register.
  2. Read from either the Data register or the Data_Scale field of the PMCSR register. A non-zero value indicates that the Data register as well as the Data_Scale and Data_Select fields of the PMCSR registers are implemented. If a value of zero is read, go to step 3.
  3. If the current value of the Data_Select field is a value other than 1111b, go to step 4. If the current value of the Data_Select field is 1111b, all possible Data register values have been scanned and returned zero, indicating that neither the Data register nor the Data_Scale and Data_Select fields of the PMCSR registers are implemented.
  4. Increment the content of the Data_Select field and go to step 2.
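A compact C sketch of this presence-detection loop, under the same assumed configuration-access helpers, might look like this. The offsets reflect the PM capability layout already described: the Data register at byte 7 of the capability, Data_Select in PMCSR bits 12:9, and Data_Scale in PMCSR bits 14:13.

```c
#include <stdint.h>
#include <stdbool.h>

extern uint8_t  cfg_read8(int bus, int dev, int fn, int off);
extern uint16_t cfg_read16(int bus, int dev, int fn, int off);
extern void     cfg_write16(int bus, int dev, int fn, int off, uint16_t val);

#define PMCSR_OFF         0x04
#define DATA_REG_OFF      0x07
#define DATA_SELECT_SHIFT 9
#define DATA_SELECT_MASK  (0xFu << DATA_SELECT_SHIFT)
#define DATA_SCALE_MASK   (0x3u << 13)

/* Returns true if the optional Data register (together with Data_Select
 * and Data_Scale) appears to be implemented. */
bool data_register_present(int bus, int dev, int fn, int pm_cap)
{
    for (unsigned int sel = 0; sel <= 0xF; sel++) {
        /* Steps 1 and 4: program the next Data_Select value. */
        uint16_t pmcsr = cfg_read16(bus, dev, fn, pm_cap + PMCSR_OFF);
        pmcsr = (uint16_t)((pmcsr & ~DATA_SELECT_MASK) |
                           (sel << DATA_SELECT_SHIFT));
        cfg_write16(bus, dev, fn, pm_cap + PMCSR_OFF, pmcsr);

        /* Step 2: any non-zero Data or Data_Scale value proves presence. */
        if (cfg_read8(bus, dev, fn, pm_cap + DATA_REG_OFF) != 0 ||
            (cfg_read16(bus, dev, fn, pm_cap + PMCSR_OFF) & DATA_SCALE_MASK))
            return true;
    }
    /* Step 3: all sixteen selections returned zero - not implemented. */
    return false;
}
```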
Operation of the Data Register. The information returned is typically a static copy of the function's worst-case power consumption and power dissipation characteristics (obtained from the device's data sheet) in the various PM states. To use the Data register, the programmer uses the following sequence:
  1. Write a value into the Data_Select field (see Table 16-15 on page 605) of the PMCSR register to select the data item to be viewed through the Data register.
  2. Read the data value from the Data register.
  3. Multiply the value by the scaling factor read from the Data_Scale field of the PMCSR register (see "PM Control/Status (PMCSR) Register" on page 599).
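Continuing the previous sketch, selecting one data item and applying the Data_Scale factor from Table 16-15 might look as follows (the accessors remain hypothetical; the returned value is in watts for the data items the spec currently defines):

```c
#include <stdint.h>

extern uint8_t  cfg_read8(int bus, int dev, int fn, int off);
extern uint16_t cfg_read16(int bus, int dev, int fn, int off);
extern void     cfg_write16(int bus, int dev, int fn, int off, uint16_t val);

#define PMCSR_OFF         0x04
#define DATA_REG_OFF      0x07
#define DATA_SELECT_SHIFT 9
#define DATA_SELECT_MASK  (0xFu << DATA_SELECT_SHIFT)
#define DATA_SCALE_SHIFT  13

/* Step 1: select the item (e.g., 00h = power consumed in D0); Step 2: read
 * the Data register; Step 3: scale it per Table 16-15. */
double read_pm_data_watts(int bus, int dev, int fn, int pm_cap, unsigned int sel)
{
    static const double scale[4] = { 0.0 /* unknown */, 0.1, 0.01, 0.001 };

    uint16_t pmcsr = cfg_read16(bus, dev, fn, pm_cap + PMCSR_OFF);
    pmcsr = (uint16_t)((pmcsr & ~DATA_SELECT_MASK) | (sel << DATA_SELECT_SHIFT));
    cfg_write16(bus, dev, fn, pm_cap + PMCSR_OFF, pmcsr);

    uint8_t      data = cfg_read8(bus, dev, fn, pm_cap + DATA_REG_OFF);
    unsigned int idx  = (cfg_read16(bus, dev, fn, pm_cap + PMCSR_OFF)
                         >> DATA_SCALE_SHIFT) & 0x3u;
    return data * scale[idx];
}
```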
Multi-Function Devices. In a multi-function PCI Express device, each function must supply its own power-oriented information and the power information related to their common logic must be reported through function zero's Data register (see Data Select Value =8 in Table 16-15 on page 605).
Virtual PCI-to-PCI Bridge Power Data. The specification does not overtly state a requirement for PCI-to-PCI bridge functions that are part of a port within the Root Complex or Switch regarding data field use. However, to maintain PCI-PM compatibility, bridges must report the power information they consume. In the same fashion, software could read the virtual PPB Data registers at each port of a switch to determine the power consumed by the switch in each power state. Based on PCI-PM, each PCI Express function would be responsible for reporting its own power-related data.
Figure 16-10: PM Registers
1st Dword: Power Management Capabilities [31:16] | Pointer to Next Capability [15:8] | Capability ID [7:0]
2nd Dword: Bridge Support Extensions (PMCSR_BSE) [23:16] | Control/Status Register (PMCSR) [15:0]
Table 16-15: Data Register Interpretation
Data Select Value | Data Reported in Data Register | Interpretation of Data Scale Field in PMCSR | Units/Accuracy
00h | Power consumed in D0 state | 00b = unknown; 01b = multiply by 0.1; 10b = multiply by 0.01; 11b = multiply by 0.001 | Watts
01h | Power consumed in D1 state
02h | Power consumed in D2 state
03h | Power consumed in D3 state
04h | Power dissipated in D0 state
05h | Power dissipated in D1 state
06h | Power dissipated in D2 state
07h | Power dissipated in D3 state
08h | In a multi-function PCI device, function 0 indicates the power consumed by the logic that is common to all of the functions residing within this package.
09h-0Fh (the spec actually shows these as decimal values 9-15; the author has chosen to represent them in hex) | Reserved for future use of function 0 in a multi-function device | Reserved | TBD
08h-0Fh (the spec actually shows these as decimal values 8-15; the author has chosen to represent them in hex) | Reserved (single-function devices, and functions other than function 0 within a multi-function device)

Introduction to Link Power Management

PCI-PM compatible software places devices into one of four states as described in previous sections. PCI Express defines link power management that relates to each of the four device states. Table 16-16 on page 607 lists the Device states (D-States) and the associated Link states (L-states) permitted by the specification. Each relationship is described below:
D0 - When a device is in the D0 state it is fully powered and fully functional, and the link is typically active (e.g., in the L0 state). PCI Express devices are required to support Active State Power Management (ASPM) that permits link power conservation even when the device is in the D0 state. Two low-power states are defined:
  • L0 standby, or L0s (required)
  • L1 ASPM (optional)
Both of these states are managed autonomously by hardware and are completely invisible to software. A critical element associated with ASPM is returning to the L0 state with very short latencies. Additional configuration registers permit software to calculate the worst case latencies to determine if ASPM will violate latency requirements of the transactions.
D1 & D2 - When software places a device into either the D1 or D2 state, the link is required to transition to the L1 state. The downstream component signals the port in the upstream device (root or switch) to which it attaches to enter the L1 state. During L1 the reference clock and power remain active.
D3hot - When software places a device into the D3 state, the device signals a transition to L1 just as is done in the D1 and D2 states. However, because the device is in the D3hot state, software may choose to remove the reference clock and power from the device (a condition referred to as D3cold). Prior to removing the clock and power, software initiates a handshake process that places the device into the L2/L3 Ready state (i.e., power and clock still on, but ready for power to be removed).
D3cold  — This state indicates that the clock and power have been removed. However, auxiliary (AUX) Power may remain available after the main power rails are powered down. In this case, the link state is referred to as L2. When main power is removed and no AUX power is available it is referred to as L3.
Table 16-16: Relationship Between Device and Link Power States
Downstream Component D-State | Permissible Upstream Component D-State | Permissible Interconnect State
D0 | D0 | L0, L0s, & L1 (optional)
D1 | D0-D1 | L1
D2 | D0-D2 | L1
D3hot | D0-D3hot | L1, L2/L3 Ready
D3cold | D0-D3cold | L2 (AUX Pwr), L3
Table 16-17 on page 608 provides additional information regarding the Link power states.
Table 16-17: Link Power State Characteristics
State | Description | PM SW Directed | Active State Link PM | Reference Clocks | Main Power | PLL | Vaux
L0 | Fully Active | Yes (D0) | | On | On | On | On/Off
L0s | Standby | No | Yes (D0) | On | On | On | On/Off
L1 | Low Power Standby | Yes* (D1-D3hot) | Yes (optional) (D0) | On | On | On/Off | On/Off
L2/L3 Ready | Staging for power removal | Yes (PME_Turn_Off handshake seq.) | No | On | On | On/Off | On/Off
L2 | Low Power Sleep | Yes** | No | Off | Off | Off | On
L3 | Off (Zero Power) | N/A | N/A | Off | Off | Off | Off
* The L1 state is entered due to PM software placing a device into the D1, D2, or D3 states, or optionally L1 is entered autonomously under hardware control when Active State Power Management is supported for L1.
** The specification describes the L2 state as being software directed. The other L-states in the table are listed as software directed because software initiates the transition into these states. For example, when software initiates a device power state change to D1, D2, or D3, devices must respond by entering the L1 state. Software also causes the transition to the L2/L3 Ready state by initiating a PME_Turn_Off message. Finally, software also initiates the removal of power from a device after the device has transitioned to the L2/L3 Ready state. This results in a transition to either the L2 or L3 pseudo-states (so called because power is removed from the devices and actual link state transitions do not apply). Because Vaux power is available in L2, a wakeup event can be signaled causing software to be notified.

Link Active State Power Management

PCI Express includes a feature that requires link power conservation even though the device has not been placed in a low-power state by software. Consequently, this feature is called "Active State" power management and functions only when the device is in the D0 state. Transitions into and out of Active State Power Management (ASPM) are handled solely by hardware.
Two low power states are defined for ASPM:
  1. L0 standby (L0s) - this state is required by all PCI Express devices and applies to a single direction on the link. The latency to return to the L0 state is specified to be very short.
  2. L1 ASPM - this state is optional and can be entered to achieve a greater degree of power conservation than L0s. This state also results in both directions of the link being placed into the L1 state.
Figure 16-11 illustrates the link state transitions and highlights the transitions between L0, L0s, and L1. Note that transitions between L0s and L1 require the link to be returned to the L0 state.
Figure 16-11: ASPM Link State Transitions
The Link Capability register specifies a device's support for Active State Power Management. Figure 16-12 on page 610 illustrates the ASPM Support field within this register. Notice that only two combinations are supported via this register:

  • L0s only and
  • L0s and L1.

Figure 16-12: ASPM Support
Software can enable and disable ASPM via the Active State PM Control field of the Link Control Register, as illustrated in Figure 16-13 on page 611. The possible settings are listed in Table 16-18 on page 610. The following discussion of ASPM presumes that the related features are enabled.
Note: The specification recommends that ASPM be disabled for all components in the path associated with Isochronous transactions if the additional latencies associated with ASPM exceed the limits of the isochronous transactions.
Table 16-18: Active State Power Management Control Field Definition
  Setting | Description
  00b     | L0s and L1 ASPM disabled
  01b     | L0s enabled and L1 disabled
  10b     | L1 enabled and L0s disabled
  11b     | L0s and L1 enabled
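As a rough illustration of how software might program the settings in Table 16-18, the C sketch below updates the low two bits of a Link Control register image. The assumption that the ASPM Control field occupies bits [1:0] is stated here as an assumption for the sketch, not taken from the text above; a real driver would also perform the actual configuration-space read and write.

    #include <stdint.h>
    #include <stdio.h>

    /* ASPM Control field of the Link Control register, assumed here to
       occupy bits [1:0] (verify against the register figure). */
    #define ASPM_CTRL_MASK 0x0003u

    /* Returns the Link Control value with the ASPM Control field set per
       Table 16-18: 00b disabled, 01b L0s only, 10b L1 only, 11b L0s and L1. */
    static uint16_t with_aspm_control(uint16_t link_control, uint16_t setting)
    {
        return (uint16_t)((link_control & ~ASPM_CTRL_MASK) | (setting & ASPM_CTRL_MASK));
    }

    int main(void)
    {
        uint16_t lnkctl = 0x0040;                 /* arbitrary starting value      */
        lnkctl = with_aspm_control(lnkctl, 0x3);  /* enable both L0s and L1 (11b)  */
        printf("Link Control = 0x%04x\n", lnkctl);
        return 0;
    }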

L0s State

L0s is a link power state that can be entered by any port and applied to a single direction of the link. For example, a large volume of traffic in conventional PC-based systems results from PCI and PCI Express devices sending data to main system memory. This means that the upstream lanes will carry heavy traffic, while the downstream lanes will carry occasional Ack DLLPs. These downstream lanes can enter the L0s state to conserve power during the stretches of idle bus time.

Entry into L0s

A transmitting port initiates the transition from L0 to L0s after detecting a period of idle time on the transmit link. Details regarding the meaning of idle, how L0s is entered, and the resulting transmitter and receiver states after L0s has been entered are discussed in this section.
Entry into L0s Triggered by Link Idle Time. Entry into L0s is managed for a single direction of the link based on detecting a period of link idle time. Ports are required to enter L0s after no more than 7μs of link idle time. Idle is defined by the specification differently for each category of device. Each category must satisfy the bulleted items listed below to be considered in the idle state (a code sketch of these checks follows the list):
  • Endpoint Port or Root Port:
o No TLPs are pending transmission, or Flow Control credits for a pending TLP are temporarily unavailable.
o No DLLPs are pending transmission.
  • Upstream Switch Port:
o The receive lanes of all of the Switch's downstream ports are already in the L0s state.
o No TLPs are pending transmission, or no FC credits are available for pending TLPs.
o No DLLPs are pending transmission.
  • Downstream Switch Port:
o The Switch's Upstream Port's receive lanes are in the L0s state.
o No TLPs are pending transmission, or no FC credits are available for pending TLPs.
o No DLLPs are pending transmission.
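The idle tests above can be modeled compactly. The following C sketch uses an invented per-port status structure; the field names are assumptions made only for this illustration.

    #include <stdbool.h>

    /* Hypothetical per-port status flags used only for this sketch. */
    struct port_status {
        bool tlp_pending;            /* a TLP is queued for transmission              */
        bool fc_credits_available;   /* credits exist for the pending TLP             */
        bool dllp_pending;           /* a DLLP is queued for transmission             */
        bool other_ports_rx_in_l0s;  /* for switch ports: the relevant receive lanes
                                        of the companion ports are already in L0s     */
    };

    /* Idle test for an endpoint or Root Port, per the first bullet list above. */
    bool endpoint_or_root_port_idle(const struct port_status *p)
    {
        bool tlp_blocked_or_none = !p->tlp_pending || !p->fc_credits_available;
        return tlp_blocked_or_none && !p->dllp_pending;
    }

    /* A switch port adds the requirement that the receive lanes of the companion
       ports (downstream ports for an upstream switch port, the upstream port for
       a downstream switch port) are already in the L0s state. */
    bool switch_port_idle(const struct port_status *p)
    {
        return p->other_ports_rx_in_l0s && endpoint_or_root_port_idle(p);
    }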
The Transaction and Data Link Layers have no knowledge of whether the transmitting side of the Physical Layer has entered L0s; however, the conditions that trigger a transition to L0s must be continuously reported from the Transaction and Link layers to the Physical Layer.
Note that the receiving side of a port must always support entering L0s even if software has disabled ASPM for this port. This allows a device at the other end of the link (that is enabled for ASPM) to still transition one side of the link to the L0s state.
Flow Control Credits Must be Delivered. A pending TLP that cannot be sent due to insufficient FC credits satisfies one of the requirements for an idle condition for all device categories listed above. Consequently, if flow control credits are received during L0s that permit delivery of the pending TLP, the transmitting port must initiate the return to L0. Also, if the receive buffer (associated with the transmit side that is in L0s) makes additional flow control credits available, the transmitter must initiate the return to L0 and deliver the related FC_Update DLLP to the port at the opposite end of the link.
Transmitter Initiates Entry to L0s. When sufficient idle time has been observed on the transmit side of the link, the transmitter forces the transition from L0 to L0s by taking the following steps. Following the sequence of events, both the transmitter and receiver will have transitioned to L0s:
  1. The transmitter delivers an "electrical idle" ordered set to the receiver and places its transmitter into the Hi-Z state.
  2. When the receiver detects the "electrical idle" ordered set, it places its receiver into the Lo-Z state.
The transmitter and receiver are now in their electrical idle states and have reduced power consumption. Synchronization between the transmitter and receiver has been lost and retraining is required. The specification requires that the PLL in the receiver must remain active to allow quick re-synchronization and recovery from L0s back to L0.

Exit from L0s State

If the transmitter detects that the idle condition has disappeared, it must initiate the sequence necessary to exit L0s and return to L0. The specification encourages designers to monitor events that give an early indication that L0s exit is imminent and to start the recovery process early to speed up the transition back to L0. For example, if the receiving side of the link receives a non-posted TLP, the transmitter side knows that it will shortly be required to send a completion. Consequently, the transmitter can start the transition back to L0 prior to receiving the completion request.
Transmitter Initiates L0s Exit. When the transmitter (whether in the upstream or downstream component) recognizes that it must transition the link from L0s to L0, it initiates a sequence that re-establishes the connection with the receiver:
  1. The transmitter exits the Hi-Z state and issues the one or more Fast Training Sequence (FTS) Ordered Sets needed by the receiver. The number of FTS Ordered Sets required by the receiver to re-synchronize (N_FTS) was previously communicated during link training following fundamental reset.
  2. Following the N_FTS Ordered Sets, one Skip Ordered Set is delivered.
  3. The receiver receives the number of FTS Ordered Sets (N_FTS) it needs to establish bit lock (PLL), symbol lock (alignment of 10-bit symbols), and lane-to-lane deskew. After receiving the Skip Ordered Set, the receiver is ready to resume normal operation.
Actions Taken by Switches that Receive L0s Exit. A switch's receiving port in the L0s state that receives the L0s to L0 transition sequence must also transmit an L0s exit to other switch ports currently in the L0s state. Two specific cases must be considered:
  • Switch Port Receives L0s to L0 transition from Downstream. The switch must signal an L0s to L0 transition on the upstream port if it is currently in the L0s state. This prepares the link facing the Root Complex for transmission of a transaction that will likely be coming from the endpoint or downstream switch that signaled the transition.
  • Switch Port Receives L0s to L0 transition from Upstream. The switch must signal an L0s to L0 transition on all downstream ports currently in the L0s state.
Any switch port in the L1 state (not L1 ASPM) has been placed into L1 because software previously transitioned the device into D1 or a deeper power-saving state. These ports remain unaffected by L0s to L0 transitions. However, once the upstream link has completed the transition to L0, a subsequent transaction may target this port, causing a transition from L1 to L0.

L1 ASPM State

The optional L1 ASPM state provides power savings greater than L0s, but at the cost of much greater recovery latency. This state also results in both directions of the link being placed into the L1 state and results in Link and Transaction Layer deactivation within each device.
Entry into this state is initiated only by the downstream component (an endpoint or the upstream port of a switch). Note that a switch may support L1 ASPM on any combination of its ports. The port at the opposite end of the link can be a root port or the downstream port of a switch. In either case, the upstream component must agree to enter the L1 state through a negotiation process with the downstream component. (See Figure 16-14 on page 615) Note that exiting the L1 ASPM state can be initiated by either the downstream or upstream port.
Figure 16-14: Ports that Initiate L1 ASPM Transitions

Downstream Component Decides to Enter L1 ASPM

The specification does not precisely define all conditions under which an endpoint or upstream port of a switch decides to attempt entry into the L1 ASPM state. The specification does suggest that one requirement might be that both directions of the link have entered L0s and have been in this state for a preset amount of time. The requirements specified include:
  • ASPM L1 entry is supported and enabled
  • Device-specific requirements for entering L1 have been satisfied
  • No TLPs are pending transmission
  • No DLLPs are pending transmission
  • If the downstream component is a switch, then all of the switch's downstream ports must be in the L1 or higher power conservation state, before the upstream port can initiate L1 entry.

Negotiation Required to Enter L1 ASPM

Because of the long latency required to recover from L1 ASPM, a negotiation process is employed to ensure that the port at the other end of the link is enabled for L1 ASPM entry and is prepared to enter it. The negotiation involves sending several transactions:
  • PM_Active_State_Request_L1 - this DLLP is issued by the downstream port to start the negotiation process.
  • PM_Request_Ack - this DLLP is returned by the upstream port when all of its requirements to enter L1 ASPM have been satisfied.
  • PM_Active_State_Nak - this TLP is returned by the upstream port when it is unable to enter the L1 ASPM state.
The upstream component may or may not accept the transition to the L1 ASPM state. The following scenarios describe a variety of circumstances that result in both conditions.

Scenario 1: Both Ports Ready to Enter L1 ASPM State

Figure 16-15 on page 618 summarizes the sequence of events that must occur to enable transition to the L1 ASPM state. This scenario assumes that all transactions have completed in both directions and no new transaction requirements emerge during the negotiation.
Downstream Component Issues Request to Enter L1 State. Once the downstream component has fulfilled all the requirements to transition to the L1 state, it issues the request to enter L1 after the following steps have completed:
  1. TLP scheduling is blocked at the Transaction Layer.
  2. The Link Layer has received acknowledgement for the last TLP it had previously sent (i.e., the replay buffer is empty).
  3. Sufficient flow control credits are available to allow transmission of the largest possible packet for any FC type. This ensures that the component can issue a TLP immediately upon exiting the L1 state.
The downstream component delivers the PM_Active_State_Request_L1 DLLP to notify the upstream component of the request to enter the L1 state. This transaction is sent repeatedly until the upstream component returns a response: either a PM_Request_ACK DLLP or a PM_Active_State_NAK TLP.
Upstream Component Requirements to Enter L1 ASPM. As illustrated in Figure 16-14 on page 615 the upstream component may be either a Root Complex Port, or a Switch Downstream Port. These ports must accept a request to enter a low power L1 state if all of the following conditions are true:
  • The Port supports ASPM L1 entry and is enabled to do so
  • No TLP is scheduled for transmission
  • No Ack or Nak DLLP is scheduled for transmission
Upstream Component Acknowledges Request to Enter L1. The upstream component sends a PM_Request_ACK DLLP to notify the downstream component of its agreement to enter the L1 ASPM state. Prior to sending this acknowledgement, it must complete the following:
  1. Block scheduling of any TLPs.
  2. The Upstream component must have received acknowledgement for the last TLP previously sent (i.e., its replay buffer is empty).
  3. Sufficient flow control credits are available to allow transmission of the largest possible packet for any FC type. This ensures that the component can issue a TLP immediately upon exiting the L1 state.
The Upstream component then sends PM_Request_Ack DLLPs continuously until it receives the Electrical Idle ordered set on its receive lanes.
Downstream Component Detects Acknowledgement. When the Downstream component detects a PM_Request_Ack DLLP, it knows that the upstream device has accepted the request. In response, the downstream component stops sending the PM_Active_State_Request_L1 DLLP, disables DLLP and TLP transmission, and places its transmit (upstream) lanes into the Electrical Idle state.
Upstream Component Receives Electrical Idle. When the Upstream component receives an Electrical Idle ordered set on its Receive Lanes (signaling that the Downstream component has entered the L1 state), it then stops sending the PM_Request_Ack DLLP, disables DLLP and TLP transmission, and places its transmit (downstream) lanes into the Electrical Idle state.
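The Scenario 1 exchange can be summarized as a small state machine. The C sketch below models only the downstream component's decision logic; the message names mirror the DLLP/TLP names used above, but the state names, function name, and "ready" flag are invented for illustration.

    #include <stdbool.h>

    /* Responses the downstream component may observe (names match the text). */
    typedef enum { RSP_NONE, RSP_PM_REQUEST_ACK, RSP_PM_ACTIVE_STATE_NAK } response_t;

    typedef enum { LINK_L0, LINK_NEGOTIATING, LINK_L0S, LINK_L1_ASPM } link_state_t;

    /* One evaluation of the downstream component's L1 ASPM entry logic.
       'ready' reflects the three prerequisites in the numbered list above:
       TLP scheduling blocked, replay buffer empty, sufficient FC credits. */
    link_state_t downstream_l1_negotiation(bool ready, response_t rsp)
    {
        if (!ready)
            return LINK_L0;            /* keep operating normally                         */

        if (rsp == RSP_NONE)
            return LINK_NEGOTIATING;   /* keep resending PM_Active_State_Request_L1       */

        if (rsp == RSP_PM_REQUEST_ACK)
            return LINK_L1_ASPM;       /* stop DLLP/TLP transmission, go electrical idle   */

        return LINK_L0S;               /* PM_Active_State_Nak: drop to L0s if possible     */
    }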


Figure 16-15: Negotiation Sequence Required to Enter L1 Active State PM

Scenario 2: Upstream Component Transmits TLP Just Prior to Receiving L1 Request

This scenario presumes that the upstream component has just received a request to send a TLP to the downstream component just as the downstream component prepares to request entry into the L1 state. At this point, the downstream device is unaware of the TLP being sent and the upstream device is unaware of the request to enter L1. Several negotiation rules define the actions that ensure that this situation is managed correctly.
TLP Must Be Accepted by Downstream Component. Note that after the downstream device sends the PM_Active_State_Request_L1 DLLP it must wait for a response from the upstream component. While waiting, the receive side of the downstream component must be able to accept TLPs and DLLPs from the upstream device. Furthermore, it must also be able to send a DLLP as required. In this example the downstream component must respond to the TLP. Two possibilities exist:
  • an ACK DLLP is returned to verify successful receipt of the TLP.
  • a NAK DLLP is returned if a TLP transmission error is detected. This results in a transaction retry of the TLP from the upstream component. Retries are permitted during negotiation.
In summary, the specification requires that all TLPs be acknowledged prior to entering the L1 state.
Upstream Component Receives Request to Enter L1. The specification requires that the upstream component immediately accept or reject the request to enter the L1 state. However, it further states that prior to sending a PM_Request_ACK DLLP it must:
  1. Block scheduling of new TLPs.
  2. Wait for acknowledgement of the last TLP previously sent, if necessary. The specification further states that the upstream component may issue retries in the event that a NAK DLLP is received from the downstream component, or a Link Acknowledgement timeout condition occurs.
Once all outstanding TLPs have been acknowledged, and all other conditions are satisfied, the upstream device must return a PM_Request_ACK DLLP.

Scenario 3: Downstream Component Receives TLP During Negotiation

During the negotiation sequence the downstream device may receive a new TLP targeting the upstream device. Recall that when a device begins the L1 ASPM negotiation process, it must block new TLP scheduling. This prevents a race condition between completing the transition to L1 and sending a new TLP that would otherwise prevent entry into L1 ASPM. Consequently, once all requirements to enter L1 have been satisfied and the downstream device has scheduled delivery of the PM_Active_State_Request_L1 DLLP, it must complete the transition to the L1 state (if a PM_Request_ACK is received). Only then can it initiate the transition from L1 ASPM back to L0 and send the TLP.

Scenario 4: Upstream Component Receives TLP During Negotiation

Note that in the event that the upstream component needs to transfer a TLP or DLLP after sending the PM_Request_Ack DLLP, it is required to complete the transition to L1. It must then initiate a transition from L1 to L0, after which the TLP or DLLP can be sent.

Scenario 5: Upstream Component Rejects L1 Request

Figure 16-16 on page 621 summarizes the negotiation sequence when the upstream component rejects the request to enter the L1 ASPM state.
The negotiation begins normally with the downstream component sending the request DLLP to enter L1. However, the upstream device returns a PM_Active_State_Nak TLP to indicate rejection of the request. The reasons for the upstream component rejecting the request to enter L1 include:
  • It does not support L1 ASPM.
  • It supports L1 ASPM, but software has not enabled this feature within the Link Control register.
  • One or more TLPs are scheduled for transfer across the link.
  • ACK or NAK DLLPs are scheduled for transfer.
Once the upstream component sends the rejection message, it can send TLPs and DLLPs as required.
If the downstream component receives a rejection, it must transition to L0s if possible.
Figure 16-16: Negotiation Sequence Resulting in Rejection to Enter L1 ASPM State

Exit from L1 ASPM State

Either component can initiate the transition from L1 to L0 when it needs to communicate over the link. Whether the upstream or downstream component initiates the exit from L1, the procedure is the same and, unlike L1 entry, does not involve any negotiation. When switches are involved in exiting from L1, the specification requires that other switch ports in the ASPM low power states must also transition to the L0 state if they are possibly in the path of the transaction causing the exit. These issues are discussed in subsequent sections.
L1 ASPM Exit Signaling. The specification states that exit from L1 is invoked by exiting electrical idle, which could consist of a variety of signaled states. However, because re-training (recovery) is required to transition the link back to the L0 state, it seems reasonable that exit signaling would begin by transmitting the TS1 ordered set to the opposite port. The receiving port, in turn, initiates recovery by signaling the TS1 ordered set back to the originating device's receive port. The Physical Layer's Link Training State Machine completes the Recovery state, after which the link will be returned to L0. Refer to "Recovery State" on page 532 for details.
Switch Receives L1 Exit from Downstream Component. This section describes the switch behavior when a downstream component signals exit from L1 (TS1 ordered set) to the switch. As pictured in Figure 16-17 on page 623, the Switch must respond to L1 exit signaling by returning the TS1 ordered set to the downstream component, and within 1μs of detecting the downstream L1 exit it must also transmit the TS1 ordered set on its upstream link (but only when the upstream port is also in the L1 ASPM state).
The expectation is that the downstream component, having initiated L1 exit, is preparing to send a TLP traveling in the upstream direction. Because L1 exit latencies are relatively long, the specification states that a switch "must not wait until its Downstream Port Link has fully exited to L0 before initiating an L1 exit transition on its Upstream Port Link." This prevents accumulated latencies that would otherwise result if all L1 to L0 transitions occurred in a linear fashion.
Figure 16-17: Switch Behavior When Downstream Component Signals L1 Exit
Switch Receives L1 Exit from Upstream Component. This section describes the switch behavior when an upstream component signals L1 exit (TS1 ordered set) to a switch. In this case, the switch must send the TS1 ordered set back upstream, and within 1μs it must also signal the TS1 ordered set to force all downstream ports that are in the L1 ASPM state to also return to L0. The goal, as in the previous example, is to shorten the overall latency in returning to the L0 state. Figure 16-18 on page 624 summarizes these requirements. Note that the link attaching Switch F and EndPoint (EP) E is in the L1 state due to software having previously placed EP E into the D1 state, which caused the link to transition to L1. Only links in the L1 ASPM state are transitioned to L0 as a result of the Root Complex (RC) initiating the exit from L1 ASPM.
Figure 16-18: Switch Behavior When Upstream Component Signals L1 Exit

ASPM Exit Latency

PCI Express provides the mechanisms to ensure that the ASPM exit latencies for L0s and L1 do not exceed the latency requirements of the endpoint devices. Each device must report its L0s and L1 exit latencies, measured from the moment ASPM exit is signaled. Endpoints also report the total acceptable latency that they can tolerate when performing accesses (typically to and from main memory). This latency is a function of the data buffer size within the device. If the chain of devices that resides between the endpoint and target device has a total latency that exceeds the acceptable latency reported by the endpoint, software can disable ASPM to avoid unacceptable latency for a given endpoint.
The exit latencies reported by a device will change depending on whether the devices on each end of a link share a common reference clock or not. Consequently, the Link Status register includes a bit called Slot Clock that specifies whether the component uses an external reference clock provided by the platform, or an independent reference clock (perhaps generated internally). Software checks these bits within devices at both ends of each link to determine whether they use a common clock. If so, software sets the Common Clock bit to report the common clock implementation to both devices. Figure 16-20 on page 628 illustrates the registers and related bit fields involved in managing the ASPM exit latency.

Reporting a Valid ASPM Exit Latency

Because the clock configuration affects the exit latency that a device will experience, devices must report the source of their reference clock via the Slot Clock status bit within the Link Status register. This bit is initialized by the component to report the source of its reference clock. If this bit is set (1), the component uses the platform-generated reference clock; if cleared (0), it uses an independent clock.
If system firmware or software determines that the components at each end of the link use the platform clock then the reference clocks within both devices will be in phase. This results in shorter exit latencies from L0s and L1, and is reported in the Common Clock field of the Link Control register. Components must then update their reported exit latencies to reflect the correct value. Note that if the clocks are not common then the default values will be correct and no further action is required.
L0s Exit Latency Update. Exit latency for L0s is reported in the Link Capability register based on the default assumption that a common clock implementation does not exist. L0s exit latency is also reported during link training (via the TS1 Ordered Sets) by specifying the number of FTS Ordered Sets (N_FTS) required to exit L0s. Consequently, if software detects a common clock implementation, the Common Clock field is set and system firmware or software must write to the Retrain Link bit in the Link Control register to force retraining. During retraining, new N_FTS values are reported to the transmitter at the opposite end of the link and new values are also reported in the L0s Latency field of the Link Capability register.
L1 Exit Latency Update. Following link retraining, new values will also be reported in the L1 Latency field.
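The common-clock discovery and retraining sequence described above might look roughly like the C sketch below. The bit positions for Slot Clock (Link Status), Common Clock Configuration, and Retrain Link (Link Control) are assumptions made for this illustration and should be verified against the register figures; the register images are passed in as plain values rather than read from configuration space.

    #include <stdbool.h>
    #include <stdint.h>

    /* Bit positions assumed for this sketch only. */
    #define LNKSTA_SLOT_CLOCK   (1u << 12)
    #define LNKCTL_COMMON_CLOCK (1u << 6)
    #define LNKCTL_RETRAIN_LINK (1u << 5)

    /* If both ends of a link report that they use the platform reference clock,
       set the Common Clock Configuration bit on both and request retraining so
       that updated N_FTS and exit-latency values are reported. */
    void configure_common_clock(uint16_t up_lnksta, uint16_t dn_lnksta,
                                uint16_t *up_lnkctl, uint16_t *dn_lnkctl)
    {
        bool common = (up_lnksta & LNKSTA_SLOT_CLOCK) && (dn_lnksta & LNKSTA_SLOT_CLOCK);
        if (!common)
            return;                          /* default latency values remain correct */

        *up_lnkctl |= LNKCTL_COMMON_CLOCK;
        *dn_lnkctl |= LNKCTL_COMMON_CLOCK;
        *up_lnkctl |= LNKCTL_RETRAIN_LINK;   /* the downstream-facing port initiates retraining */
    }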

Calculating Latency Between Endpoint to Root Complex

Figure 16-19 on page 627 illustrates an endpoint whose transactions must traverse two switches in the path between the endpoint and the Root Complex. This example presumes that all links in the path are in the L1 state.
  1. Endpoint B needs to send a packet to main memory and begins the wake sequence by initiating a TS1 ordered set on link B/C at time "T." The L1 exit latency for EP B is a maximum of 8μs, but Switch C has a maximum exit latency of 16μs. Therefore, the exit latency for this link is 16μs.
  2. Within 1μs of detecting the L1 exit on link B/C, Switch C signals L1 exit on link C/F at T+1μs.
  3. Link C/F completes exit from L1 in 16μs, completing at T+17μs.
  4. Switch F signals exit from L1 to the Root Complex within 1μs of detecting L1 exit from Switch C (T+2μs).
  5. Link F/RC completes exit from L1 in 8μs, completing at T+10μs.
  6. Total latency to transition the path to the target back to L0 = T+17μs (this arithmetic is sketched in code below).
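The arithmetic in the example above can be checked with a few lines of C. The sketch assumes, as the example does, that each switch forwards L1 exit signaling to the next link within 1μs of detecting it; the function and array names are invented for illustration.

    #include <stdio.h>

    /* Worst-case time (in microseconds) for a path of links to return to L0,
       assuming each successive link starts its exit 1 us after the previous one. */
    static unsigned total_l1_exit_latency_us(const unsigned link_latency_us[], unsigned links)
    {
        unsigned worst = 0;
        for (unsigned i = 0; i < links; i++) {
            unsigned done = i * 1u + link_latency_us[i];  /* start offset + link exit time */
            if (done > worst)
                worst = done;
        }
        return worst;
    }

    int main(void)
    {
        /* The example of Figure 16-19: link B/C = 16 us, link C/F = 16 us, link F/RC = 8 us. */
        unsigned path[] = { 16, 16, 8 };
        printf("Total L1 exit latency: T + %u us\n",
               total_l1_exit_latency_us(path, 3));        /* prints T + 17 us */
        return 0;
    }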
Figure 16-19: Example of Total L1 Latency


Figure 16-20: Config. Registers Used for ASPM Exit Latency Management and Reporting

Software Initiated Link Power Management

When software initiates configuration write transactions to transition the power state of a device to conserve power, devices must respond by transitioning their link to the corresponding low power state.

D1/D2/D3hot and the L1 State

The specification requires that when all functions within a device have been placed into any of the low power states (D1, D2, or D3hot) by software, the device must initiate a transition to the L1 state. A device returns to L0 as a result of software initiating a configuration access to the device or due to a device-initiated Power Management Event (PME). See Figure 16-21.
Figure 16-21: Devices Transition to L1 When Software Changes their Power Level from D0
Upon receiving a configuration write transaction to the Power State field of the PMCSR register, a device initiates the transition from L0 to L1 by sending a PM_Enter_L1 DLLP to the upstream component. Figure 16-22 on page 630 illustrates the sequence of events. In the example, software places the EndPoint (EP) device into the D2 state.
Figure 16-22: Software Placing a Device into a D2 State and Subsequent Transition to L1

Entering the L1 State

The procedure required to place the link into an L1 state is illustrated in Figure 16-23 on page 632. Each step referenced in the figure is described in greater detail below:
  1. Once the device recognizes that all of its functions are in the D2 state, the device must prepare to transition the link into the L1 state. This process begins with blocking new TLPs from being scheduled.
  2. A TLP may have been sent by endpoint A prior to receiving the request to enter D2 and may not yet have been acknowledged. The device must not attempt to signal a link transition request until all outstanding TLPs have been acknowledged. This means that the Replay Buffer must be empty before proceeding to the L1 state.
  3. Because of the long latencies required to return the device and link to their active states, a device must be prepared to send a maximum-sized TLP immediately upon return to the active state. Recall that insufficient Flow Control credits result in TLP transmission being blocked; therefore, before entering L1 the endpoint must have sufficient credits to permit transmission of the maximum-sized packet supported for each Flow Control type. (The readiness checks in steps 1-3 and 6-8 are sketched in code after this list.)
  4. When the above items have been completed, the device sends a PM_Enter_L1 DLLP to the upstream device. This DLLP acts as a command to instruct the upstream component to place its transmit link into the L1 state. The PM_Enter_L1 DLLP is sent continuously on the link until a PM_Request_ACK DLLP is received from the upstream device.
  5. The upstream component, upon receipt of the PM_Enter_L1 DLLP, begins its preparation for entering L1 by performing steps 6, 7, and 8. This is the same preparation as performed by the downstream component prior to signaling the L1 transition.
  6. All new TLP scheduling is blocked.
  7. In the event that a previous TLP has not yet been acknowledged, the upstream device waits until all transactions in the Replay Buffer have been acknowledged before proceeding.
  8. Flow Control credits must be accumulated to ensure that the largest TLP can be transmitted for each Flow Control type before entering L1.
  9. The upstream component sends a PM_Request_ACK DLLP to confirm that it is ready to enter the L1 state. This DLLP is sent continuously until an Electrical Idle ordered set is received, indicating that the acknowledgement has been accepted.
  10. The downstream component, upon receiving the acknowledgement DLLP, knows that the upstream component is prepared to enter the L1 state.
  11. The downstream device sends an Electrical Idle ordered set, after which it places its transmit lanes into electrical idle (transmitter is in Hi-Z state).
  12. The upstream component recognizes the Electrical Idle ordered set and places its transmit lanes into electrical idle. The link has now entered the L1 state.
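The preparation performed by each component before signaling (steps 1-3 for the downstream component, steps 6-8 for the upstream component) amounts to a simple readiness test. A minimal C sketch, using an invented bookkeeping structure:

    #include <stdbool.h>

    /* Invented structure for this sketch only. */
    struct pm_entry_state {
        bool tlp_scheduling_blocked;    /* steps 1 / 6: new TLP scheduling blocked          */
        unsigned replay_buffer_entries; /* steps 2 / 7: unacknowledged TLPs remaining       */
        bool max_credits_all_fc_types;  /* steps 3 / 8: credits for a max-sized TLP of
                                           every Flow Control type are available            */
    };

    /* True when the component may send (or acknowledge) the L1 entry DLLP. */
    bool ready_to_enter_l1(const struct pm_entry_state *s)
    {
        return s->tlp_scheduling_blocked &&
               s->replay_buffer_entries == 0 &&
               s->max_credits_all_fc_types;
    }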


Figure 16-23: Procedure Used to Transition a Link from the L0 to L1 State

Exiting the L1 State

The exit from the L1 state can be initiated by either the upstream or downstream component. The trigger that causes an exit from L1 back to L0 is different for upstream and downstream devices as discussed below. This section also summarizes the signaling protocol used to exit L1.
Upstream Component Initiates L1 to L0 Transition. Software, having placed a device into a power saving state (D1, D2, or D3), may need to transition the device back to D0 to permit device access. Power Management software must issue a configuration write transaction to change the power state back to D0. When the configuration transaction arrives at the upstream component (a Root Port or downstream Switch Port), the port exits the electrical idle state, which initiates re-training and return of the link to the L0 state.
Once the link is active, the configuration write transaction can be delivered to the device causing the transition back to D0. The device is now ready for normal operation again.
Downstream Component Initiates L1 to L0 Transition. When a link is in the L1 state the reference clock is still active and power is still applied to devices attached to the link. A downstream device may be designed to monitor external events that would trigger a Power Management Event (PME). In conventional PCI, a PME is reported via a signal of the same name - PME#. This signal is routed to system board logic that is responsible for notifying software (typically via an interrupt) of the need to exit L1. PCI Express uses the same concept but eliminates the sideband signal with a virtual wire message that reports the PME. (See "The PME Message" on page 639 for details.)
The L1 Exit Protocol. When in the L1 state both directions of the link are in the electrical idle state. A device signals an exit from L1 by transmitting the TS1 Ordered Sets, thereby causing the exit from electrical idle. When the device at the other end of the link detects the exit from electrical idle it sends the TS1 Ordered Sets back to the originating device. This sequence triggers both devices to enter re-training (recovery). Following recovery both devices will have returned to the L0 state.

L2/L3 Ready — Removing Power from the Link

Once software has placed all functions within a device into the D3hot state, power can be removed from the device. A typical application would be to place all devices in the fabric into D3 and then put the system to sleep by removing power from all devices. Depending on the system design, power can also be removed from devices selectively if separate power planes are implemented. The specification does not specify the actual mechanism that would be used to remove clock and power (main power rails).
The state transitions required to prepare devices for removing power involve the preliminary steps of entering L1 and then via a handshake protocol returning to L0 and then to the L2/L3 Ready state as illustrated in Figure 16-24.
Figure 16-24: Link States Transitions Associated with Preparing Devices for Removal of the Reference Clock and Power

L2/L3 Ready Handshake Sequence

The specification does, however, require a handshake sequence when transitioning to the L2/L3 Ready state. This handshake has two purposes:
  • to ensure that all devices are ready for reference clock and power removal.
  • to ensure that inband PME messages being sent to the Root Complex are not lost when power is removed.
Below is an example of the handshake sequence that is required before removing the reference clock and power from all PCI Express devices in the fabric. This example assumes a system-wide power down is being initiated. However, the sequence can also apply to smaller segments of the PCI Express fabric or to individual devices. The required steps are summarized below and in Figure 16-25 on page 636 (which illustrates a single Root Port). The overall sequence is represented in two parts labeled A and B. The Link transitions involved in the complete sequence include:
  • L0 -> L1 (caused by software placing a device into D3)
  • L1 -> L0 (caused by software initiating a PME_Turn_Off message)
  • L0 -> L2/L3 Ready (caused by completion of the PME_Turn_Off message handshake sequence, which culminates in a PM_Enter_L23 DLLP being sent by the device and the link going to electrical idle)
The following steps detail the sequence illustrated in Figure 16-25.
  1. Power Management software must first place all functions within the PCI Express fabric into their D3 state.
  2. All devices initiate transitions of their links to the L1 state upon entering D3.
  3. Power Management software initiates a PME_Turn_Off TLP message that is broadcast from all Root Complex ports to all devices. (This prevents PME Messages from being sent upstream when power is removed. Otherwise, a message would be lost if it is being sent when power is cut.) Note that delivery of this TLP requires each link to transition from L1 to L0 as it is forwarded downstream.
  4. All devices must receive and acknowledge the PME_Turn_Off message by returning a PME_TO_ACK TLP message while in the D3 state.
  5. Switches must collect the PME_TO_ACK messages from all of their enabled downstream ports and forward an aggregate PME_TO_ACK message upstream toward the Root Complex (a sketch of this aggregation rule follows the list).
  6. Subsequently, each device sends a PM_Enter_L23 DLLP when it is ready to have the reference clock and power removed. This causes each link to enter the L2/L3 Ready state. The specification states that the PM_Enter_L23 DLLP must be sent repeatedly until a PM_Request_ACK DLLP is returned. The links that enter the L2/L3 Ready state last are those attached to the device originating the PME_Turn_Off message (the Root Complex in this example).
  7. The reference clock and power can finally be removed when all links have transitioned to the L2/L3 Ready state. The specification further requires that clock and power cannot be removed sooner than 100ns after all links attached directly to the Root Port (i.e., point of origin) have transitioned to the L2/L3 Ready state. If auxiliary (AUX) power is supplied to the devices, the link transitions to L2; if no AUX power is available, the devices are referred to as being in the L3 state.
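The aggregation rule in step 5 can be expressed compactly: a switch forwards a single PME_TO_ACK upstream only after every enabled downstream port has received one. A minimal C sketch with invented names and an arbitrary port limit:

    #include <stdbool.h>

    #define MAX_DOWNSTREAM_PORTS 8   /* arbitrary limit for this sketch */

    struct switch_pme_state {
        unsigned enabled_ports;                /* number of enabled downstream ports (<= MAX) */
        bool ack_seen[MAX_DOWNSTREAM_PORTS];   /* PME_TO_ACK received on each port            */
    };

    /* A switch sends one aggregate PME_TO_ACK upstream only when all of its
       enabled downstream ports have returned their own PME_TO_ACK messages. */
    bool send_aggregate_pme_to_ack(const struct switch_pme_state *sw)
    {
        for (unsigned i = 0; i < sw->enabled_ports; i++)
            if (!sw->ack_seen[i])
                return false;
        return true;
    }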
Figure 16-25: Negotiation for Entering L2/L3 Ready State

Exiting the L2/L3 Ready State — Clock and Power Removed

As illustrated in the state diagram in Figure 16-26, a device may only exit the L2/L3 Ready state when power is removed. Note that when Vaux is available the transition is to L2, and when all power is removed the transition is to L3.
Link state transitions are normally under control of the Link Training and Status State Machine (LTSSM) within the Physical Layer. However, transitions to the L2 and L3 states result from main power being removed. Because the LTSSM typically operates on main power only, the specification refers to the L2 and L3 states as pseudo-states. These states are defined to explain the resulting condition of a device when power is removed under Power Management software control, and are not associated with LTSSM actions.
Figure 16-26: State Transitions from L2/L3 Ready When Power is Removed

The L2 State

Some devices are designed to monitor external events and initiate a wakeup sequence so that an external event can be handled normally. Because main power is removed from the device, these devices may need AUX power to monitor the events and to signal wakeup to notify software that the device needs to be revived.

The L3 State

When in this state the device has no power and therefore no means of communication. Recovery from this state requires the system to re-apply power and the reference clock and to assert fundamental reset.

Link Wake Protocol and PME Generation

The wake protocol provides a method for devices to reactivate the upstream link and request that Power Management software return the devices to D0 so they can perform required operations. The procedures and signaling methods used in PCI Express are different from the PCI-PM specified methods. However, PCI Express PM is designed to be compatible with PCI-PM software.
Rather than using the PCI-defined PME# sideband signal, PCI Express devices employ an inband PME message to notify PM software of a request to return the device to the full power state (D0). The ability to generate PME messages may be supported optionally within any of the low power states. Recall that devices report the PM states they support and from which of these states they can send a PME message. See Figure 16-8 on page 597.
PME messages can only be delivered once the link power state transitions to L0. The level of difficulty and latency required to reactivate the link so that a PME message can be sent is a function of a device's PM and Link state. Consequently, the steps required to complete a wakeup can include the following depending on the current link state:
  1. Link is in non-communicating (L2) state - when a link is in the L2 state it cannot communicate because the reference clock and main power have been removed. Thus, a PME message cannot be sent until clock and power are restored, Fundamental Reset is asserted, and the link is re-trained. These events are triggered when a device signals a wakeup. This may result in the re-awakening of all links in the path between the device needing to communicate and the Root Complex.
  2. Link is in communicating (L1) state - when a link is in the L1 state, clock and main power are still active; thus, a device simply exits the L1 state, re-trains the link (via the Recovery state), and returns the link to L0. This is the procedure discussed earlier in this chapter. (See "Exiting the L1 State" on page 632.) Once the link is in L0, the PME message is delivered. Note that devices never send a PME message while in the L2/L3 Ready state because entry into that state only occurs after PME notification has been turned off, in preparation for clock and power removal. (See "L2/L3 Ready Handshake Sequence" on page 634.)
  3. PME is delivered (L0) - once the link is in the L0 state, the device transfers the PME message to the Root Complex, thereby notifying Power Management software that the device has observed an event that requires the device be placed back into its D0 state. Note that the message contains the ID (Bus#, Device#, and Function#) of the device that originated the message. This permits software to immediately target the device's power state registers, permitting a quicker return to the active state.

The PME Message

The PME message is delivered by devices that support PME notification. The message format is illustrated in Figure 16-27 on page 639. The message may be initiated by a device in a low power state (D1, D2, D3hot, and D3cold) and is sent immediately upon return of the link to L0.
Figure 16-27: PME Message Format
The PME message is a Transaction Layer Packet that has the following characteristics:
  • TC and VC values are zero (000b)
  • Routed implicitly to the Root Complex
  • Handled as Posted Transaction
  • Relaxed Ordering is not permitted, forcing all transactions in the fabric between the signaling device and the Root Complex to be delivered to the Root Complex ahead of the PME message

The PME Sequence

Devices may support PME in any of the low power states as specified in the PM Capabilities register. This register also specifies the amount of AUX current required by the device if it supports wakeup in the D3cold state. The basic sequence of events associated with signaling a PME to software is specified below. It presumes that the device and system are enabled to generate PME (see "Scenario-Setup a Function-Specific System WakeUp Event" on page 583) and that the link has already been transitioned to the L0 state:
  1. The device issues the PME message on its upstream port.
  2. PME messages are implicitly routed to the Root Complex. Any switches in the path transition their upstream ports to L0 (if necessary) and send the packet upstream.
  3. A root port receives the PME and forwards it to the Power Management Controller.
  4. The controller calls power management software (typically via an interrupt). Software uses the Requester ID contained within the message to read and clear the PME_Status bit in the PMCSR and return the device to the D0 state. Depending on the degree of power conservation, the PCI Express driver may also need to restore the device's configuration registers. (A sketch of this PMCSR update follows the list.)
  5. PM Software may also call the device's software driver in the event that device context was lost as a result of the device being placed in a low power state. In this case, device software restores information within the device.
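The PMCSR update performed in step 4 might look like the C sketch below. The bit layout (PowerState in bits [1:0], PME_Status in bit 15, cleared by writing a 1 back to it) is assumed for this illustration and should be verified against the register figure; the configuration-space read and write themselves are omitted on purpose.

    #include <stdint.h>

    /* Assumed PMCSR bit layout for this sketch only. */
    #define PMCSR_POWER_STATE_MASK 0x0003u
    #define PMCSR_PME_STATUS       (1u << 15)

    /* Given the PMCSR value read from the requester named in the PM_PME message,
       return the value to write back: PME_Status cleared (write-1-to-clear) and
       the function returned to the D0 state (PowerState = 00b). */
    uint16_t service_pme(uint16_t pmcsr)
    {
        uint16_t out = pmcsr;
        out |= PMCSR_PME_STATUS;                   /* write 1 to clear the sticky status bit */
        out &= (uint16_t)~PMCSR_POWER_STATE_MASK;  /* 00b selects the D0 state               */
        return out;
    }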

PME Message Back Pressure Deadlock Avoidance

The specification describes a potential deadlock scenario that is solved by specifying a PCI Express rule. The problem and solution are described below:

Background

The Root Complex typically stores the PME messages it receives in a queue, and calls PM software to handle each PME. A PME is held in this queue until PM software reads the PME_Status bit from the requesting device's PMCSR register. Once the configuration read transaction completes, this PME message can be removed from the internal queue.

The Problem

Deadlock can occur if the following scenario develops:
  1. Incoming PME Messages have filled the PME message queue. Additional PME messages have been issued by other devices that are in the same hierarchy (downstream from the same root port) as the oldest message in the queue.
  2. PM software initiates a configuration read request from the Root Complex to read PME_Status from the oldest PME requester's PMCSR.
  3. The corresponding split completion must push all previously posted PM_PME messages ahead of it (based on ordering rules).
  4. The Root Complex cannot accept the incoming PME messages because the queue is full, and the read completion, stuck behind the PME messages, cannot reach the Root Complex to clear an entry from the queue.
  5. No progress can be made, thus deadlock occurs.

The Solution

The deadlock is avoided if the Root Complex accepts any arriving PME messages, even when these messages would overflow the queue. However, the Root Complex in this case simply discards the incoming PME message, because there is no place to store it. Consequently, the PME message is lost. Note that acceptance of a PME message still requires sufficient flow control credits.
To prevent a discarded PME message from being lost permanently, the device that sends a PME message is required to re-send it following a time-out interval, called the PME Service Time-out. If, after sending a PME message, the device's PME_Status bit is not cleared within 100ms (+50%/-5%), it must re-issue the message.
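The re-send rule can be modeled as a simple timer check. The following C sketch assumes a monotonic millisecond time source and uses only the nominal 100ms value; the helper and parameter names are invented for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define PME_SERVICE_TIMEOUT_MS 100u   /* nominal; the spec allows +50%/-5% */

    /* The PME message must be re-issued when the device's PME_Status bit is
       still set and the service time-out has expired since the last send. */
    bool must_resend_pme(bool pme_status_still_set,
                         uint64_t now_ms, uint64_t pme_sent_ms)
    {
        return pme_status_still_set &&
               (now_ms - pme_sent_ms) >= PME_SERVICE_TIMEOUT_MS;
    }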

The PME Context

Devices that generate PME must continue to power portions of the device that are used for detecting, signaling, and handling PME events. These items are called the PME context. Devices that support PME in the D3cold state use AUX power to maintain the PME context when the main power is removed. Following is a list of items that are typically part of the PME context.
  • the function's PME_Status bit (required) - this bit is set when a device sends a PME message and is cleared by PM software. Devices that support PME in the D3cold state must implement the PME_Status bit as "sticky," meaning that the value is maintained across a fundamental reset.
  • the function's PME_Enable bit (required) - this bit must remain set to continue enabling a function's ability to generate PME messages and signal wakeup (if required). Devices that support PME in the D3cold state must implement PME_Enable as "sticky," meaning that the value is maintained across a fundamental reset.
  • device-specific status information - for example, a device might preserve event status information in cases where several different types of events can trigger a PME.
  • application-specific information - for example, modems that initiate wakeup would preserve Caller ID information if supported.

Waking Non-Communicating Links

When a device that supports PME in the D3cold state needs to send a PME message, it must first initiate the sequence of events needed to transition the link to the L0 state so that the message can be sent. This is typically referred to as wakeup. PCI Express defines two methods of triggering the wakeup of non-communicating links:
  • Beacon - a signaling technique that is driven by AUX power
  • WAKE# Signal - a sideband signal that is driven by AUX power
In both cases, PM software must be notified so that it can re-apply main power and restart the reference clock. This also causes generation of fundamental reset that forces a device into the D0uninitialized state. Once the link transitions to the L0 state, the device sends the PME message. Because reset is required to re-activate the link so that PME can be signaled, devices must maintain PME context across the reset sequence described above.

Beacon

PCI Express includes a signaling mechanism designed to operate on AUX power that does not require the differential drivers and receivers to be used. The beacon is simply a way of notifying the upstream component that software should be notified of the wakeup request. Switches, upon receiving a beacon on one of their downstream ports, signal beacon on their upstream port. Ultimately, the beacon signal reaches the root complex, causing an interrupt that calls PM software.
Some form factors require support for beacon signaling for waking the system, while others do not. The specification requires compliance with the specific form-factor specifications, and does not require beacon support for devices used in form factors not requiring this support. However, for "universal" PCI Express components (those designed for use in a variety of form factors), beacon support is required.
See "Beacon Signaling" on page 469 for details.

WAKE# (AUX Power)

PCI Express also provides a sideband signal called WAKE#, as an alternative to beacon signaling. This signal may be routed directly to the Root Complex or other motherboard logic, thereby causing an interrupt that will call PM software. It's also possible that a hybrid implementation can be used. In this case, WAKE# is sent to a switch, which in turn signals beacon on its upstream port. The options are illustrated in Figure 16-28 on page 644, parts A and B. Note that when asserted, the WAKE# signal remains low until the PME_Status bit is cleared by software.
This signal must be implemented for ATX and ATX-based form factor motherboard connectors and cards, as well as for the Mini Card form factor. No requirement is specified for embedded devices to use the WAKE# signal.
Figure 16-28: WAKE# Signal Implementations

Auxiliary Power

Devices that support PME in the D3cold state must support the wakeup sequence (via beacon signaling or the sideband WAKE# pin) and are allowed to consume the maximum auxiliary current of 375mA (20mA maximum otherwise). The amount of current that they need is reported via the Aux_Current field within the PM Capability registers. Auxiliary power is enabled when the PME_Enable bit is set within the PMCSR register. PCI-PM limits the use of auxiliary current as specified above.
PCI Express extends the use of auxiliary power beyond the limitations specified by PCI-PM. Now devices that have PME disabled or that do not support PME can also consume the maximum amount of auxiliary current allowed. This new capability is enabled via software by setting the Aux Power PM Enable bit in the Device Control register, illustrated in Figure 16-29 on page 645. This capability permits devices the opportunity to support other functions such as SM Bus functionality while in a low power state. As in PCI-PM the amount of current consumed by a device is reported in the Aux_Current field in the PMC register.
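The PCI Express extension described above might be exercised with something like the C sketch below. The assumption that Aux Power PM Enable occupies bit 10 of the Device Control register is made only for this illustration and should be verified against Figure 16-29.

    #include <stdint.h>

    /* Aux Power PM Enable bit of the Device Control register (assumed position). */
    #define DEVCTL_AUX_POWER_PM_ENABLE (1u << 10)

    /* Returns the Device Control value with auxiliary power draw enabled,
       allowing a device that does not have PME enabled (or does not support PME)
       to consume up to the AUX current it reports in the Aux_Current field. */
    uint16_t enable_aux_power(uint16_t device_control)
    {
        return (uint16_t)(device_control | DEVCTL_AUX_POWER_PM_ENABLE);
    }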
Figure 16-29: Auxiliary Current Enable for Devices Not Supporting PMEs
Part Five
Optional Topics

17 Hot Plug

The Previous Chapter

The previous chapter provided a detailed description of PCI Express power management, which is compatible with revision 1.1 of the PCI Bus PM Interface Specification and the Advanced Configuration and Power Interface, revision 2.0 (ACPI). In addition, PCI Express defines extensions that are orthogonal to the PCI-PM specification. These extensions focus primarily on Link Power and PM event management. That chapter also provided an overall context for the discussion of power management by describing the OnNow Initiative, ACPI, and the involvement of the Windows OS.

This Chapter

PCI Express includes native support for hot plug implementations. This chapter discusses hot plug and hot removal of PCI Express devices. The specification defines a standard usage model for all device and platform form factors that support hot plug capability. The usage model defines, as an example, how push buttons and indicators (LEDs) behave, if implemented on the chassis, add-in card, or module. The definitions assigned to the indicators and push buttons, described in this chapter, apply to all models of hot plug implementations.

The Next Chapter

The next chapter provides an introduction to the PCI Express add-in card electromechanical specifications. It describes the card form factor, the connector details, and the auxiliary signals with a description of their function. Other card form factors are also briefly described.

Background

Some systems that employ PCI and PCI-X require high availability or non-stop operation. For example, many customers require computer systems that experience downtimes of just a few minutes a year, or less. Clearly, manufacturers must focus on equipment reliability, and also provide a method of identifying and repairing equipment failures quickly. An important feature in supporting these goals is the Hot Plug/Hot Swap solutions that provide three important capabilities:
  1. a method of replacing failed expansion cards without turning the system off
  2. keeping the O/S and other services running during the repair
  3. shutting down and restarting software associated with the failed device
Prior to the widespread acceptance of PCI, many proprietary Hot Plug solutions were available to support this type of removal and replacement of expansion cards. However, the original PCI implementation was not designed to support hot removal and insertion of cards, so a standardized solution for supporting this capability in PCI was needed. Consequently, two major approaches to hot replacement of PCI expansion devices have been developed. These approaches are:
  • Hot Plug PCI Card - used in PC Server motherboard and expansion chassis implementations
  • Hot Swap - used in CompactPCI systems based on a passive PCI backplane implementation.
In both solutions, control logic is implemented to isolate the card logic from the PCI bus via electronic switches. In conjunction with isolation logic, power, reset, and clock are controlled to ensure an orderly power down and power up of cards when they are removed and replaced. Also, status and power LEDs provide indications to the user that it is safe to remove or install the card.
The need to extend hot plug support to PCI Express cards is clear. Designers of PCI Express have incorporated Hot removal and replacement of cards as a "native" feature. The specification defines configuration registers, Hot Plug Messages, and procedures to support Hot Plug solutions.

Hot Plug in the PCI Express Environment

PCI Express Hot Plug is derived from the 1.0 revision of the Standard Hot Plug Controller specification (SHPC 1.0) for PCI. The goals of PCI Express Hot Plug are to:
  • support the same "Standardized Usage Model" as defined by the Standard Hot Plug Controller specification. This ensures that the PCI Express hot plug is identical from the user perspective to existing implementations based on the SHPC 1.0 specification
  • support the same software model implemented by existing operating systems. However, if the OS includes a SHPC 1.0 compliant driver, it will not work with PCI Express Hot Plug controllers, which have a different programming interface.
PCI Express defines the registers necessary to support the integration of a Hot Plug Controller within individual root and switch ports. Under Hot Plug software control, these Hot Plug controllers and the associated port interface within the root or switch port must control the card interface signals to ensure orderly power down and power up as cards are removed and replaced. Hot Plug controllers must:
  • assert and deassert the PERST# signal to the PCI Express card connector
  • remove or apply power to the card connector.
  • Selectively turn on or turn off the Power and Attention Indicators associated with a specific card connector to draw the user's attention to the connector and advertise whether power is applied to the slot.
  • Monitor slot events (e.g. card removal) and report these events to software via interrupts.
PCI Express Hot-Plug (like PCI) is designed as a "no surprises" Hot-Plug methodology. In other words, the user is not permitted to install or remove a PCI Express card without first notifying software. System software then prepares both the card and slot for the card's removal and replacement, and finally indicates to the end user (via visual indicators) status of the hot plug process and notification that installation or removal may be performed.

Surprise Removal Notification

PCI Express cards (unlike PCI) must implement edge contacts with card presence detect pins (PRSNT1# and PRSNT2#) that break contact first when the card is removed from the slot. This gives software advance notice of a "surprise" removal and enough time to remove power prior to the signal contacts breaking.

Differences between PCI and PCI Express Hot Plug

The elements needed to support hot plug are essentially the same between PCI and PCI Express hot plug solutions. Figure 17-1 on page 653 depicts the PCI hardware and software elements required to support hot plug. PCI solutions implement a single standardized hot plug controller on the system board that permits all hot plug slots on the bus to be controlled by a single controller. Also, isolation logic is needed in the PCI environment to electrically disconnect a single card slot from the bus prior to card removal.
PCI Express Hot Plug differs from the PCI implementation due to its point-to-point connections. (See Figure 17-2 on page 654.) Point-to-point connections eliminate the need for isolation logic and permit the hot plug controller to be distributed to each port interface to which a connector is attached. A standardized register interface defined for each root and switch port permits a common software approach to controlling hot plug operations. Note that the programming interfaces for the PCI Express and PCI Hot Plug Controllers differ and require different software drivers.
Figure 17-1: PCI Hot Plug Elements
Figure 17-2: PCI Express Hot-Plug Hardware/Software Elements

Elements Required to Support Hot Plug

This section describes the hardware and software elements required to support the Hot Plug environment. Refer to Figure 17-2 on page 654 during this discussion.

Software Elements

Table 17-1 on page 655 describes the major software elements that must be modified to support Hot-Plug capability. Also refer to Figure 17-2 on page 654.
Table 17-1: Introduction to Major Hot-Plug Software Elements

  Software Element | Supplied by | Description
  User Interface | OS vendor | An OS-supplied utility that permits the end-user to request that a card connector be turned off in order to remove a card, or turned on to use a card that has just been installed.
  Hot-Plug Service | OS vendor | A service that processes requests (referred to as Hot-Plug Primitives) issued by the OS. This includes requests to: provide slot identifiers; turn a card On or Off; turn the Attention Indicator On or Off; return the current state of a slot (On or Off). The Hot-Plug Service interacts with the Hot-Plug System Driver to satisfy the requests. The interface (i.e., API) with the Hot-Plug System Driver is defined by the OS vendor.
  Standardized Hot-Plug System Driver | System Board vendor or OS | Receives requests (aka Hot-Plug Primitives) from the Hot-Plug Service within the OS. Interacts with the hardware Hot-Plug Controllers to accomplish requests.
  Device Driver | Adapter card vendor | Some special, Hot-Plug-specific capabilities must be incorporated in a Hot-Plug capable device driver. This includes: support for the Quiesce command; optional implementation of the Pause command; support for the Start command or optional Resume command.
A Hot-Plug-capable system may be loaded with an OS that doesn't support Hot-Plug capability. In this case, although the system BIOS would contain Hot-Plug-related software, the Hot-Plug Service would not be present. Assuming that the user doesn't attempt hot insertion or removal of a card, the system will operate as a standard, non-Hot-Plug system. Two requirements apply in this case:
  • The system startup firmware must ensure that all Attention Indicators are Off.
  • The spec also states: "the Hot-Plug slots must be in a state that would be appropriate for loading non-Hot-Plug system software."

Hardware Elements

Table 17-2 on page 656 and Figure 17-2 on page 654 illustrate the major hardware elements necessary to support PCI Express Hot-Plug operation.
Table 17-2: Major Hot-Plug Hardware Elements
Hardware Element | Description
Hot-Plug Controller | Receives and processes commands issued by the Hot-Plug System Driver. One Controller is associated with each root or switch port that supports hot plug operation. The PCI Express Specification defines a standard software interface for the Hot-Plug Controller.
Card Slot Power Switching Logic | Permits the power supply voltages to a slot to be turned on or off under program control. Controlled by the Hot-Plug Controller under the direction of the Hot-Plug System Driver.
Card Reset Logic | Permits the selective assertion or deassertion of the PERST# signal to a specific slot under program control. Controlled by the Hot-Plug Controller under the direction of the Hot-Plug System Driver.
Power Indicator | One per slot. Indicates whether power is currently applied to the card slot or not. Controlled by the Hot Plug logic associated with each port, at the direction of the Hot-Plug System Driver.
Attention Indicator | One per slot. The Attention Indicator is used to draw the attention of the operator to indicate a Hot Plug problem or failure. Controlled by the Hot Plug logic associated with this port, at the direction of the Hot-Plug System Driver.
Attention Button | One per slot. This button is pressed by the operator to notify Hot Plug software of a Hot Plug request.
Card Present Detect Pins | Two Card Present signals are defined by the PCI Express specification: PRSNT1# and PRSNT2#. PRSNT1# is located at one end of the card slot and PRSNT2# at the opposite end. These two pins are shorter than the other slot pins, allowing break-first capability upon card removal. The system board must tie PRSNT1# to ground and connect PRSNT2# to a pull-up resistor on the system board. Additional PRSNT2# pins are defined for wider connectors to support the insertion and recognition of shorter cards installed into longer connectors. The card must connect PRSNT1# to PRSNT2# to complete the current path between ground and Vcc. See "Auxiliary Signals" on page 693.

Card Removal and Insertion Procedures

The descriptions of typical card removal and insertion that follow are intended to be introductory in nature. Additional detail can be found later in this chapter.
It should be noted that the procedures described in the following sections assume that the OS, rather than the Hot-Plug System Driver, is responsible for configuring a newly-installed device. If the Hot-Plug System Driver has this responsibility, the Hot-Plug Service will call the Hot-Plug System Driver and instruct it to configure the newly-installed device.

On and Off States

Definition of On and Off

A slot in the On state has the following characteristics:
  • Power is applied to the slot.
  • REFCLK is on.
  • The link is active or in the standby (L0s or L1) low power state due to Active State Power Management.
  • The PERST# signal is deasserted.
A slot in the Off state has the following characteristics:
  • Power to the slot is turned off.
  • REFCLK is off.
  • The link is inactive (the driver at the root or switch port is in the HiZ state).
  • The PERST# signal is asserted.

Turning Slot Off

Steps required to turn off a slot that is currently in the On state:
  1. Deactivate the link. This may involve issuing an Electrical Idle Ordered-Set (a sequence initiated at the Physical Layer) that forces the card's driver to enter the HiZ state.
  2. Assert the PERST# signal to the slot.
  3. Turn off REFCLK to the slot.
  4. Remove power from the slot.

Turning Slot On

Steps to turn on a slot that is currently in the off state:
  1. Apply power to the slot.
  2. Turn on REFCLK to the slot.
  3. Deassert the PERST# signal to the slot. The system must meet the setup and hold timing requirements (specified in the PCI Express spec) relative to the rising edge of PERST#.
Once power and clock have been restored and PERST# removed, the physical layers at both ports will perform link training and initialization. When the link is active, the devices will initialize VC0 (including flow control), making the link ready to transfer TLPs.
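The two sequences just described can be summarized in code. The following C sketch is illustrative only; the helper functions (link_force_electrical_idle, slot_set_perst, and so on) are hypothetical stand-ins for whatever mechanism a particular Hot-Plug Controller implementation exposes, not spec-defined interfaces.

    /* Illustrative sketch of the slot Off and On sequences described above.
     * All helpers are hypothetical placeholders for implementation-specific
     * Hot-Plug Controller operations (typically Slot Control register writes). */
    typedef struct hp_slot hp_slot_t;              /* opaque per-slot handle */

    void link_force_electrical_idle(hp_slot_t *s); /* card driver -> HiZ */
    void slot_set_perst(hp_slot_t *s, int assert); /* PERST# control */
    void slot_set_refclk(hp_slot_t *s, int on);    /* REFCLK gating */
    void slot_set_power(hp_slot_t *s, int on);     /* main power switching */
    void wait_for_link_up_and_vc0(hp_slot_t *s);   /* link training + VC0 init */

    void slot_turn_off(hp_slot_t *s)
    {
        link_force_electrical_idle(s);  /* 1. deactivate the link */
        slot_set_perst(s, 1);           /* 2. assert PERST# to the slot */
        slot_set_refclk(s, 0);          /* 3. turn off REFCLK */
        slot_set_power(s, 0);           /* 4. remove power */
    }

    void slot_turn_on(hp_slot_t *s)
    {
        slot_set_power(s, 1);           /* 1. apply power */
        slot_set_refclk(s, 1);          /* 2. turn on REFCLK */
        slot_set_perst(s, 0);           /* 3. deassert PERST# (timing per spec) */
        wait_for_link_up_and_vc0(s);    /* link trains; VC0 flow control initializes */
    }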

Card Removal Procedure

When a card must be removed, a number of steps must occur to not only prepare software and hardware for safe removal of the card, but also to control the indicators that give the operator visual evidence that the removal request is being processed. The conditions of the indicators during normal operation are:
  • Attention Indicator (Amber or Yellow) - "Off" during normal operation.
  • Power Indicator (Green) - "On" during normal operation.
Software issues "Requests" to the Hot Plug Controller via configuration write transactions that target the "Slot Control Registers implemented by Hot-Plug capable ports." These requests control power to the slot and the state of the indicators.
The exact sequence of events that occurs when performing a Hot Plug card removal varies slightly depending on whether the Hot Plug operation is initiated by pressing the Attention Button or via the User Interface software utility. Each sequence is described below.

Attention Button Used to Initiate Hot Plug Removal

The sequence of events is as follows:
  1. The operator initiates the card removal request by depressing the slot's "attention button." The Hot-Plug Controller detects this event and delivers an interrupt to the root complex. As a result of the interrupt, the Hot-Plug Service calls the Hot-Plug System Driver, which reads slot status information and detects the Attention Button request.
  2. Next, the Hot-Plug Service issues a request to the Hot-Plug System Driver commanding the Hot Plug Controller to blink the slot's Power Indicator. The operator is granted a 5-second abort interval, from the time that the indicator starts to blink, during which the operator may press the button a second time to abort the request.
  3. The Power Indicator continues to blink while the Hot Plug software validates the request. Note that software may fail to validate the request (e.g., the card may currently be used for some critical system operation).*
  4. If the request is validated, the Hot-Plug Service utility commands the card's device driver to quiesce. That is, the driver must stop issuing requests to the card and complete or terminate all outstanding requests, as well as disable its ability to generate transactions (including interrupt messages).
  5. Software then issues a command to disable the card's link via the Link Control register within the root or switch port to which the slot is attached. This causes ports at both ends of the link to be disabled.
  6. Next, software commands the Hot Plug Controller to turn the slot off.
  7. Following successful power down, software issues the Power Indicator Off request. The operator knows that the card may be removed safely from the slot when the Power Indicator is Off.
  8. The operator releases the Manual Retention Latch (MRL), causing the Hot Plug Controller to remove all switched signals from the slot (e.g., SMBus and JTAG signals). The card can now be removed.
  9. The OS deallocates the memory space, IO space, interrupt line, etc. that had been assigned to the device and makes these resources available for assignment to other devices in the future.
  • If the request is not validated, software will deny the request and issue a command to the Hot Plug controller to turn the Power Indicator back ON. The specification also recommends that software notify the operator via a message or by logging an entry indicating the cause of the request denial.

Hot Plug Removal Request Issued via User Interface

The sequence of events is as follows:
  1. The operator initiates the card removal request by selecting the Physical Slot number associated with the card to be removed. Software opens a window or presents a message requesting that the operator confirm the request. Note that the Power Indicator remains on during this process.
  2. When the operator confirms the request, the Hot-Plug Service issues a request to the Hot-Plug System Driver commanding the Hot Plug Controller to blink the slot's Power Indicator. During this time, software validates the Hot Plug request. Note that software may fail to validate the request (e.g., the card may currently be in use for some critical system operation).*
  3. If the request is validated, the Hot-Plug Service utility commands the card's device driver to quiesce. That is, the driver must stop issuing requests to the card and complete or terminate all outstanding requests, as well as disable its ability to generate transactions (including interrupt messages).
  4. Software then issues a command to disable the card's link via the Link Control register located in the root or switch port to which the slot connects. This causes ports at both ends of the link to be disabled.
  5. Next, software commands the Hot Plug Controller to disable the slot.
  6. Following successful power down, software issues the Power Indicator Off request. The operator knows that the card may be removed safely from the slot when the Power Indicator is Off.
  7. The operator releases the Manual Retention Latch (MRL), causing the Hot Plug Controller to remove all switched signals from the slot (e.g., SMBus and Vaux signals). The card can now be removed.
  8. The OS deallocates the memory space, IO space, interrupt line, etc. that had been assigned to the device and makes these resources available for assignment to other devices in the future.
  • If the request is not validated, software will deny the request and issue a command to the Hot Plug controller to turn the Power Indicator back ON. The specification also recommends that software notify the operator via a message or by logging an entry indicating the cause of the request denial.

Card Insertion Procedure

The procedure for installing a new card basically reverses the steps listed for card removal. The following steps assume that the card slot was left in the same state that it was in immediately after a card was removed from the connector (in other words, the Power Indicator is in the Off state, indicating the slot is ready for card insertion). Variations between the two methods of initiation are described below.

Card Insertion Initiated by Pressing Attention Button

The steps taken to insert and enable a card are as follows:
  1. The operator installs the card and secures the MRL. If implemented, the MRL sensor will signal the Hot-Plug Controller that the latch is closed, causing the switched auxiliary signals and Vaux to be connected to the slot.
  2. Next, the operator notifies the Hot-Plug Service that the card has been installed by pressing the Attention Button. This signals the Hot Plug Controller of the event, resulting in status register bits being set and causing a system interrupt to be sent to the Root Complex. Subsequently, Hot Plug software reads slot status from the port and recognizes the request.
  3. The Hot-Plug Service issues a request to the Hot-Plug System Driver commanding the Hot Plug Controller to blink the slot's Power Indicator to inform the operator that the card must not be removed. The operator is granted a 5-second abort interval, from the time that the indicator starts to blink, to abort the request by pressing the button a second time.
  4. The Power Indicator continues to blink while Hot Plug software validates the request. Note that software may fail to validate the request (e.g., the security policy settings may prohibit the slot being enabled).*
  5. The Hot-Plug Service issues a request to the Hot-Plug System Driver commanding the Hot Plug Controller to turn the slot on.
  6. Once power is applied, software issues a command to turn the Power Indicator ON.
  7. Once link training is complete, the OS commands the Platform Configuration Routine to configure the card function(s) by assigning the necessary resources.
  8. The OS locates the appropriate driver(s) (using the Vendor ID and Device ID, or the Class Code, or the Subsystem Vendor ID and Subsystem ID configuration register values as search criteria) for the function(s) within the PCI Express device and loads it (or them) into memory.
  9. The OS then calls the driver's initialization code entry point, causing the processor to execute the driver's initialization code. This code finishes the setup of the device and then sets the appropriate bits in the device's PCI configuration Command register to enable the device.
  • If the request is not validated, software will deny the request and issue a command to the Hot Plug controller to turn the Power Indicator back OFF. The specification also recommends that software notify the operator via a message or by logging an entry indicating the cause of the request denial.

Card Insertion Initiated by User Interface

The steps taken to re-enable the card are as follows:
  1. The operator installs the card and secures the MRL. The MRL sensor signals the Hot Plug Controller to connect the switched signals to the slot.
  2. Next, the operator informs the Hot-Plug Service (via the Hot Plug Utility program) that the card has been installed and is ready to be re-enabled. Software then prompts the user to verify that it is safe to re-enable the slot.
  3. After the operator requests card insertion, the Hot-Plug Service issues a request to the Hot-Plug System Driver commanding the Hot Plug Controller to blink the slot's Power Indicator to inform the operator that the card must not be removed.
  4. The Power Indicator continues to blink while Hot Plug software validates the request. Note that software may fail to validate the request (e.g., the security policy settings may prohibit the slot being enabled).*
  5. The Hot-Plug Service issues a request to the Hot-Plug System Driver commanding the Hot Plug Controller to reapply power to the slot.
  6. Once power is applied, software issues a command to turn the Power Indicator ON.
  7. Once link training is complete, the OS commands the Platform Configuration Routine to configure the card function(s) by assigning the necessary resources.
  8. The OS locates the appropriate driver(s) (using the Vendor ID and Device ID, or the Class Code, or the Subsystem Vendor ID and Subsystem ID configuration register values as search criteria) for the function(s) within the PCI Express device and loads it (or them) into memory.
  9. The OS then calls the driver's initialization code entry point, causing the processor to execute the driver's initialization code. This code finishes the setup of the device and then sets the appropriate bits in the device's PCI configuration Command register to enable the device.
  • If the request is not validated, software will deny the request and issue a command to the Hot Plug controller to turn the Power Indicator back OFF. The specification also recommends that software notify the operator via a message or by logging an entry indicating the cause of the request denial.

Standardized Usage Model

Background

Systems based on the original 1.0 version of the PCI Hot Plug specification implemented hardware and software designs that varied widely because the specification did not define standardized registers or user interfaces. Consequently, customers who purchased Hot Plug capable systems from different vendors were confronted with a wide variation in user interfaces that required retraining operators when new systems were purchased. Furthermore, every board designer was required to write software to manage their implementation-specific hot plug controller. The 1.0 revision of the PCI Hot-Plug Controller (HPC) specification defines:
  • a standard user interface that eliminates retraining of operators
  • a standard programming interface for the hot plug controller, which permits a standardized hot plug driver to be incorporated into the operating system. PCI Express implements registers not defined by the HPC specification, hence the standard Hot Plug Controller driver implementations for PCI and PCI Express are slightly different.
The following sections discuss the standard user interface.

Standard User Interface

The user interface includes the following features:
  • Attention Indicator - shows the attention state of the slot. The indicators are specified to be on, off, or blinking. The specification defines the blinking frequency as 1 to 2 Hz and a 50% (+/-5%) duty cycle. The state of this indicator is strictly under software control.
  • Power Indicator (called Slot State Indicator in PCI HP 1.1) - shows the power status of the slot. Power indicator states are on, off, or blinking. The specification defines the blinking frequency as 1 to 2 Hz and a 50% (+/-5%) duty cycle. This indicator is controlled by software; however, the specification permits an exception in the event of a power fault condition.
  • Manually Operated Retention Latch and Optional Sensor - secures the card within the slot and notifies the system when the latch is released.
  • Electromechanical Interlock (optional) - prevents a card from being removed from a slot while power is applied.
  • Software User Interface - allows the operator to request hot plug operations.
  • Attention Button (optional) - allows the operator to manually request a hot plug operation.
  • Slot Numbering Identification - provides visual identification of the slot on the board.

Attention Indicator

As mentioned in the previous section, the specification requires the system vendor to include an Attention Indicator associated with each Hot-Plug slot. This indicator must be located in close proximity to the corresponding slot and is yellow or amber in color. The indicator draws the end user's attention to a slot where a hot plug request has failed due to an operational problem. The specification makes a clear distinction between operational and validation errors and does not permit the Attention Indicator to report validation errors. Validation errors are problems detected and reported by software prior to beginning the hot plug operation. The behavior of the Attention Indicator is listed in Table 17-3 on page 665.
Table 17-3: Behavior and Meaning of the Slot Attention Indicator
Indicator Behavior | Attention State
Off | Normal - normal operation.
On | Attention - Hot Plug operation failed due to an operational problem (e.g., problems with external cabling, add-in cards, software drivers, or power faults).
Blinking | Locate - slot is being identified at the operator's request.

Power Indicator

The power indicator simply reflects the state of main power at the slot and is controlled by Hot Plug software. This indicator is green and is illuminated when power to the slot is "on."
The specification specifically prohibits root or switch port hardware from changing the power indicator state autonomously as a result of a power fault or other events. A single exception to this rule allows a platform implementation that is capable of detecting stuck-on power faults. A stuck-on fault is simply a condition in which commands issued to remove slot power are ineffective. If the system is designed to detect this condition, it may override the root or switch port's command to turn the power indicator off and force it to the "on" state. This notifies the operator that the card should not be removed from the slot even though the operator has requested that the slot be powered down. The specification further states that supporting stuck-on faults is optional and, if handled via system software, "the platform vendor must ensure that this optional feature of the Standard Usage Model is addressed via other software, platform documentation, or by other means."
The behavior of the power indicator and the related power states are listed in Table 17-4 on page 666. Note that Vaux remains on and the switched signals are still connected until the retention latch is released or the card is removed, as detected by the PRSNT1# and PRSNT2# signals.
Table 17-4: Behavior and Meaning of the Power Indicator
Indicator Behavior | Power State
Off | Power Off - it is safe to remove or insert a card. All power has been removed as required for hot plug operation. Vaux is only removed when the Manual Retention Latch is released.
On | Power On - removal or insertion of a card is not allowed. Power is currently applied to the slot.
Blinking | Power Transition - card removal or insertion is not allowed. This state notifies the operator that software is currently removing or applying slot power in response to a hot plug request.

Manually Operated Retention Latch and Sensor

The Manual Retention Latch (MRL) is required and it holds PCI Express cards rigidly in the slot. Each MRL can implement an optional sensor that notifies the Hot-Plug Controller that the latch has been closed or opened. The specification also allows a single latch that can hold down multiple cards. Such implementations do not support the MRL sensor.
An MRL Sensor is a switch, optical device, or other type of sensor. The sensor reports only two conditions: fully closed and open. If an unexpected latch release is detected, the port automatically disables the slot and notifies system software. Note, however, that the specification prohibits ports from changing the state of the Power or Attention Indicators autonomously.
The switched signals and auxiliary power (Vaux) must be automatically removed from the slot when the MRL Sensor indicates that the MRL is open, and must be restored to the slot when the MRL Sensor indicates that the latch is reestablished. The switched signals are:
  • Vaux 
  • SMBCLK
  • SMBDAT
The specification also describes an alternate method for removing Vaux and SMBus power when an MRL sensor is not present. In this case, the PRSNT1# and PRSNT2# pins, which indicate whether a card is installed into the slot, can be used to trigger the port to remove the switched signals.

Electromechanical Interlock (optional)

The optional electromechanical card interlock mechanism provides a more sophisticated method of ensuring that a card is not removed when power is still applied to the slot. The specification does not define the specific nature of the interlock, but states that it can physically lock the add-in card or lock the MRL in place.
The lock mechanism is controlled via software; however, there is no specific programming interface defined to control the electromechanical interlock. Instead, an interlock is controlled by the same port output signal that enables main power to the slot.

Software User Interface

An operator may use a software interface to request card removal or insertion. This interface is provided by system software, which also monitors slots and reports status information to the operator. The specification states that the user interface is implemented by the Operating System and consequently is beyond the scope of the specification.
The operator must be able to initiate operations at each slot independent of all other slots. Consequently, the operator may initiate a hot-plug operation on one slot using the software user interface or attention button while a hot-plug operation on another slot is in process. This can be done regardless of which interface the operator used to start the first Hot-Plug operation.

Attention Button

The Attention Button is a momentary-contact push-button switch, located near the corresponding Hot-Plug slot or on a module. The operator presses this button to initiate a hot-plug operation for this slot (e.g., card removal or insertion). Once the Attention Button is depressed, the Power Indicator starts to blink. From the time the blinking begins the operator has 5 seconds to abort the Hot Plug operation by depressing the button a second time.
The specification recommends that if an operation initiated by an Attention Button fails, the system software should notify the operator of the failure. For example, a message explaining the nature of the failure can be reported or logged.

Slot Numbering Identification

Software and operators must be able to identify a physical slot based on its slot number. Each hot-plug capable port must implement registers that software uses to identify the physical slot number. The registers include a Physical Slot Number and a chassis number. The main chassis is always labeled chassis 0. The chassis numbers for other chassis must be non-zero values and are assigned via the PCI-to-PCI bridge's Chassis Number register ("Introduction To Chassis/Slot Numbering Registers" on page 859).

Standard Hot Plug Controller Signaling Interface

Figure 17-3 on page 669 represents a more detailed view of the logic within root and switch ports, along with the signals routed between the slot and port. The importance of the standardized Hot Plug Controller is the common software interface that allows the device driver to be integrated into operating systems.
The PCI Express specification, in conjunction with the Card ElectroMechanical (CEM) specification, defines the slot signals and the support required for Hot Plug PCI Express. Following is a list of required and optional port interface signals needed to support the Standard Usage Model:
  • PWRLED# (required) - port output that controls state of Power Indicator
  • ATNLED# (required) - port output that controls the state of the Attention Indicator
  • PWREN (required, if reference clock is implemented) - port output that controls main power to slot
  • REFCLKEN# (required) - port output that controls delivery of reference clock to the slot
  • PERST# (required) - port output that controls PERST# at slot
  • PRSNT1# (required) - grounded at the connector.
  • PRSNT2# (required) - port input that indicates the presence of a card in the slot; pulled up on the system board.
  • PWRFLT# (required) - port input that notifies the Hot-Plug controller of a power fault condition detected by external logic
  • AUXEN# (required if AUX power is implemented) - port output that controls the switched AUX signals and AUX power to the slot as the MRL is opened and closed. The MRL# signal is required when AUX power is present.
  • MRL# (required if MRL Sensor is implemented, otherwise it's optional) - port input from the MRL sensor
  • BUTTON# (required if Attention Button is implemented, otherwise it's optional) - port input indicating operator wishes to perform a Hot-Plug operation
Figure 17-3: Hot Plug Control Functions within a Switch

The Hot-Plug Controller Programming Interface

The standard programming interface to the Hot-Plug Controller is provided via the PCI Express Capability register block. Figure 17-4 on page 670 illustrates these registers and highlights the registers that are implemented by the different types of devices. Hot Plug features are primarily provided via Slot Registers that are defined for root and switch ports. The Device Capability register is also used in some implementations as described later in this chapter.
Figure 17-4: PCI Express Configuration Registers Used for Hot-Plug

Slot Capabilities

Figure 17-5 on page 671 illustrates the Slot Capability register and its bit fields. Hardware must initialize the capability register fields to reflect the features implemented by this port. This register applies to both card slot and rack mount implementations, except for the indicators and attention button. Software must read the Device Capability register within the module to determine if indicators and attention buttons are implemented. Table 17-5 on page 671 lists and defines the Slot Capability fields; a C sketch that decodes these fields follows the table.
Figure 17-5: Attention Button and Hot Plug Indicators Present Bits
Table 17-5: Slot Capability Register Fields and Descriptions
Bit(s) | Register Name and Description
0 | Attention Button Present - when set, indicates that an attention button is located on the chassis adjacent to the slot.
1 | Power Controller Present - when set, indicates that a power controller is implemented for this slot.
2 | MRL Sensor Present - when set, indicates that an MRL Sensor is located on the slot.
3 | Attention Indicator Present - when set, indicates that an attention indicator is located on the chassis adjacent to the slot.
4 | Power Indicator Present - when set, indicates that a power indicator is located on the chassis adjacent to the slot.
5 | Hot-Plug Surprise - when set, indicates that it is possible for the user to remove the card from the system without notification.
6 | Hot-Plug Capable - when set, indicates that this slot supports hot plug operation.
14:7 | Slot Power Limit Value - specifies the maximum power that can be supplied by this slot. This limit value is multiplied by the scale specified in the next field.
16:15 | Slot Power Limit Scale - specifies the scaling factor for the Slot Power Limit Value.
31:19 | Physical Slot Number - indicates the physical slot number associated with this port.
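As a concrete illustration of the layout in Table 17-5, the fields can be extracted with simple masks and shifts. This is only a sketch; the read helper (pcie_read_slot_capabilities) is a hypothetical placeholder, not a spec-defined interface.

    #include <stdint.h>
    #include <stdbool.h>

    /* Slot Capabilities bit assignments, per Table 17-5. */
    #define SLOT_CAP_ATTN_BUTTON     (1u << 0)
    #define SLOT_CAP_POWER_CTRL      (1u << 1)
    #define SLOT_CAP_MRL_SENSOR      (1u << 2)
    #define SLOT_CAP_ATTN_INDICATOR  (1u << 3)
    #define SLOT_CAP_PWR_INDICATOR   (1u << 4)
    #define SLOT_CAP_HP_SURPRISE     (1u << 5)
    #define SLOT_CAP_HP_CAPABLE      (1u << 6)
    #define SLOT_CAP_PWR_LIMIT(v)    (((v) >> 7)  & 0xFFu)   /* bits 14:7  */
    #define SLOT_CAP_PWR_SCALE(v)    (((v) >> 15) & 0x3u)    /* bits 16:15 */
    #define SLOT_CAP_PHYS_SLOT(v)    (((v) >> 19) & 0x1FFFu) /* bits 31:19 */

    uint32_t pcie_read_slot_capabilities(void);  /* hypothetical helper */

    bool slot_supports_hot_plug(void)
    {
        uint32_t cap = pcie_read_slot_capabilities();
        return (cap & SLOT_CAP_HP_CAPABLE) != 0;
    }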

Slot Power Limit Control

The specification provides a method for software to limit the amount of power consumed by a card installed into an expansion slot or backplane implementation. The registers needed to support this feature are located in the hot-plug capable port (in the Slot Capability register) and in the expansion card or module (in the Device Capability register).
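To make the arithmetic concrete, the slot's power budget is the Slot Power Limit Value multiplied by the factor selected by the Slot Power Limit Scale field (1.0, 0.1, 0.01, or 0.001, assuming the Base spec encodings). A minimal sketch:

    /* Slot power limit in watts = Value x Scale.
     * Assumed scale encodings: 00b = 1.0x, 01b = 0.1x, 10b = 0.01x, 11b = 0.001x.
     * Example: Value = 250, Scale = 01b -> 250 x 0.1 = 25.0 W. */
    double slot_power_limit_watts(unsigned value, unsigned scale)
    {
        const double factor[4] = { 1.0, 0.1, 0.01, 0.001 };
        return value * factor[scale & 0x3u];
    }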

Slot Control

Software controls Hot Plug events via the Slot Control register. This register permits software to enable various Hot-Plug features and to control hot plug operations. Figure 17-6 on page 673 depicts the Slot Control register and its bit fields, and Table 17-6 on page 673 lists and describes each field. The register acts as the programming interface for controlling Hot-Plug features, enabling interrupt generation, and selecting which Hot-Plug event sources can generate interrupts. A register-write sketch follows Table 17-6.
Table 17-6: Slot Control Register Fields and Descriptions
Bit(s) | Register Name and Description
0 | Attention Button Pressed Enable. When set, this bit enables the generation of a hot-plug interrupt (if enabled) or assertion of the Wake# message when the attention button is pressed.
1 | Power Fault Detected Enable. When set, enables generation of a hot-plug interrupt (if enabled) or Wake# message upon detection of a power fault.
2 | MRL Sensor Changed Enable. When set, enables generation of a hot-plug interrupt (if enabled) or Wake# message upon detection of an MRL sensor changed event.
3 | Presence Detect Changed Enable. When set, this bit enables the generation of the hot-plug interrupt or a Wake message when the Presence Detect Changed bit in the Slot Status register is set.
4 | Command Completed Interrupt Enable. When set, enables a Hot-Plug interrupt to be generated that informs software that the hot-plug controller is ready to receive the next command.
5 | Hot-Plug Interrupt Enable. When set, enables the generation of Hot-Plug interrupts.
7:6 | Attention Indicator Control. Writes to the field control the state of the attention indicator and reads return the current state, as follows: 00b = Reserved, 01b = On, 10b = Blink, 11b = Off.
9:8 | Power Indicator Control. Writes to the field control the state of the power indicator and reads return the current state, as follows: 00b = Reserved, 01b = On, 10b = Blink, 11b = Off.
10 | Power Controller Control. Writes to the field switch main power to the slot and reads return the current state, as follows: 0b = Power On, 1b = Power Off.
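As a register-level illustration of the field positions in Table 17-6, the sketch below blinks the Power Indicator and then removes slot power. The 16-bit read/write helpers and their names are assumptions; only the field positions and encodings come from the table above.

    #include <stdint.h>

    /* Slot Control field positions and encodings, per Table 17-6. */
    #define SLOT_CTRL_PWR_IND_SHIFT  8           /* Power Indicator Control, bits 9:8 */
    #define SLOT_CTRL_PWR_IND_MASK   (0x3u << SLOT_CTRL_PWR_IND_SHIFT)
    #define SLOT_CTRL_PWR_OFF        (1u << 10)  /* Power Controller Control: 1b = off */
    #define IND_ON                   0x1u
    #define IND_BLINK                0x2u
    #define IND_OFF                  0x3u

    uint16_t slot_control_read(void);            /* hypothetical helpers */
    void     slot_control_write(uint16_t value);

    /* Blink the Power Indicator while a hot-plug request is being validated. */
    void power_indicator_blink(void)
    {
        uint16_t ctrl = slot_control_read();
        ctrl &= (uint16_t)~SLOT_CTRL_PWR_IND_MASK;
        ctrl |= (uint16_t)(IND_BLINK << SLOT_CTRL_PWR_IND_SHIFT);
        slot_control_write(ctrl);
    }

    /* Remove main power from the slot (after the link has been disabled). */
    void slot_power_off(void)
    {
        slot_control_write((uint16_t)(slot_control_read() | SLOT_CTRL_PWR_OFF));
    }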

Slot Status and Events Management

The Hot Plug Controller monitors a variety of events and reports these events to the Hot Plug System Driver. Software can use the "detected" and "changed" bits to determine which event has occurred, while the state bits identify the nature of the change. The changed bits must be cleared by software in order to detect a subsequent change. Note that whether these events get reported to the system (via a system interrupt) is determined by the related enable bits in the Slot Control register. An interrupt-handler sketch based on these bits follows Table 17-7.
Table 17-7: Slot Status Register Fields and Descriptions
Bit Location | Register Name and Description
0 | Attention Button Pressed - set when the Attention Button is pressed. Notification of the attention button being pushed depends on the form factor implemented: standard card slots use a signal trace to report the event; rack and backplane implementations may rely on the Attention_Button_Pressed message. Refer to the other form-factor specs for details regarding those implementations.
1 | Power Fault Detected - set when the Power Controller detects a power fault at this port.
2 | MRL Sensor Changed - set when an MRL Sensor state change is detected.
3 | Presence Detect Changed - set when a change has been detected in the state of the PRSNT1# or PRSNT2# signals.
4 | Command Completed - set when the Hot Plug Controller completes a software command.
5 | MRL Sensor State - indicates the current state of the MRL sensor, if implemented: 0b = MRL Closed, 1b = MRL Open.
6 | Presence Detect State - this bit reflects whether a card is installed into a slot or not (set if card present, clear if card not present). It is required for all root and switch ports that have a slot attached to the link. The specification also states that if a slot is not attached to the link, then this bit "should be hardwired to 1."
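The interrupt-handler sketch below shows how the bits in Table 17-7 are typically used: read Slot Status, determine which event(s) occurred, and clear the changed bits by writing 1s back to them (the change bits are write-1-to-clear). The helper names are hypothetical.

    #include <stdint.h>

    /* Slot Status bit positions, per Table 17-7. */
    #define SLOT_STS_ATTN_PRESSED   (1u << 0)
    #define SLOT_STS_PWR_FAULT      (1u << 1)
    #define SLOT_STS_MRL_CHANGED    (1u << 2)
    #define SLOT_STS_PRESENCE_CHG   (1u << 3)
    #define SLOT_STS_CMD_COMPLETED  (1u << 4)
    #define SLOT_STS_CHANGE_BITS    0x1Fu

    uint16_t slot_status_read(void);             /* hypothetical helpers */
    void     slot_status_write(uint16_t value);
    void     hp_service_notify(uint16_t events); /* pass events up to software */

    void hot_plug_isr(void)
    {
        uint16_t status = slot_status_read();
        uint16_t events = status & SLOT_STS_CHANGE_BITS;

        if (events) {
            /* Write the detected bits back to clear them so that a
             * subsequent change can be detected, then notify software. */
            slot_status_write(events);
            hp_service_notify(events);
        }
    }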

Card Slot vs Server IO Module Implementations

PCI Express supports two form factors that determine the location of the Hot-Plug indicators and attention button (See Figure 17-8 on page 677):
  • Standard Cards that reside in PCI-like slots - motherboard or expansion chassis implementations place Hot Plug indicators, the attention button, and MRL sensor adjacent to each slot on the board.
  • Server IO Modules (SIOMs) that install into racks - when modules are installed into a rack-mounted system, the hot-plug indicators and attention button may be located more conveniently on the PCI Express modules themselves rather than on the rack. However, the specification does not preclude indicators and buttons being located on module bays.
In addition to Server IO Modules, cards (or blades) that install into backplanes may also have indicators and the attention button located on the card. These implementations were not defined at the time of this writing. However, proposed SIOM implementations route the attention button and attention indicator signals through the connector.
The specification also defines messages that act as virtual wires for controlling the attention indicators and for reporting when the attention button has been pressed. The approach eliminates the need to route signals between the port and connector for the attention indicators and attention button, as done with card slots as illustrated in Figure 17-3 on page 669. See "Hot Plug Messages" on page 678.

Detecting Module and Blade Capabilities

Hot-Plug ports that attach to rack and backplane connectors may not know whether a given module or blade includes indicators or an attention button. Consequently, the specification includes this information within the Device Capabilities register. See Figure 17-9 on page 678.
Figure 17-9: Hot-Plug Capability Bits for Server IO Modules

Hot Plug Messages

When the Hot-Plug indicators and attention button are located on a module or blade, messages can be used as virtual wires to control the indicators and to report that the button has been pressed.
Attention and Power Indicator Control Messages. As discussed in Table 17-6 on page 673, the attention and power indicators each have three states: on, off, and blinking. The message transactions act as virtual wires to signal the indicator states. Figure 17-10 on page 680 illustrates the Hot Plug Message format and lists the values associated with each of the messages; a simple mapping from field values to messages is sketched after the list below.
  • Attention_Indicator_On. This message is issued by the Hot Plug Controller when software writes a value of 01b into the Attention Indicator Control field of the Slot Control Register indicating that the Attention Indicator is to be turned on. The endpoint device that receives the message terminates it and causes the card's attention indicator to turn on.
  • Attention_Indicator_Blink. This message is issued by the Hot Plug Controller when software writes a value of 10b into the Attention Indicator Control field of the Slot Control Register indicating that the Attention Indicator is to blink on and off. The endpoint device that receives the message terminates it and causes the card's attention indicator to start blinking.
  • Attention_Indicator_Off. This message is issued by the Hot Plug Controller when software writes a value of 11b into the Attention Indicator Control field of the Slot Control Register indicating that the Attention Indicator is to be turned off. The endpoint device that receives the message terminates it and causes the card's indicator to turn off.
  • Power_Indicator_On. This message is issued by the Hot Plug Controller when software writes a value of 01b into the Power Indicator Control field of the Slot Control Register indicating that the Power Indicator is to be turned on. The endpoint device that receives the message terminates it and causes the card's power indicator to turn on.
  • Power_Indicator_Blink. This message is issued by the Hot Plug Controller when software writes a value of 10b into the Power Indicator Control field of the Slot Control Register indicating that the Power Indicator is to blink on and off. The endpoint device that receives the message terminates it and causes the power indicator to blink.
  • Power_Indicator_Off. This message is issued by the Hot Plug Controller when software writes a value of 11b into the Power Indicator Control field of the Slot Control Register indicating that the Power Indicator is to be turned off. The endpoint device that receives the message terminates it and causes the card's power indicator to turn off.
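The correspondence between the indicator-control field values and the messages above can be tabulated directly. The enum and mapping function below are purely illustrative; only the field encodings and message names come from the text.

    /* Illustrative mapping from Power Indicator Control field values
     * (Table 17-6) to the corresponding Hot Plug messages. */
    enum power_indicator_msg {
        POWER_INDICATOR_NONE,   /* 00b is reserved; no message */
        POWER_INDICATOR_ON,     /* field value 01b */
        POWER_INDICATOR_BLINK,  /* field value 10b */
        POWER_INDICATOR_OFF     /* field value 11b */
    };

    enum power_indicator_msg power_msg_for_field(unsigned field_value)
    {
        switch (field_value & 0x3u) {
        case 0x1: return POWER_INDICATOR_ON;
        case 0x2: return POWER_INDICATOR_BLINK;
        case 0x3: return POWER_INDICATOR_OFF;
        default:  return POWER_INDICATOR_NONE;
        }
    }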
Attention Button Pressed Message. A module or blade that employs an attention button must signal the Hot Plug Controller that the button has been pressed. The module generates an Attention_Button_Pressed message (Figure 17-10 on page 680) that targets the upstream device (root or switch port). The message results in an Attention Button Pressed Event that causes the Attention Button Pressed status bit in the Slot Status register bit to be set, and may also result in an interrupt if enabled.
Limitations of the Hot Plug Messages. Note that these features function only when the card is installed and operational. Thus, indicators can be controlled prior to card removal and similarly the card can report that the attention button has been pressed (e.g., to signal a request for card removal). However, when a new card is installed, the attention button message cannot be sent from the card and attention indicator messages cannot be received until the card is re-powered, reconfigured, and enabled. This means that software will have to be notified that a card is ready to be reinstalled.
Figure 17-10: Hot Plug Message Format

Slot Numbering

Physical Slot ID

An operator who wishes to prepare a slot for card removal or insertion must specify the Physical Slot ID. The physical slot number is designated by the system designer, and this assignment must be communicated to the root or switch port because hardware must initialize the Physical Slot Number field within the Slot Capabilities register. When configuration accesses are made to read the physical slot ID, software makes an association between the Logical Slot ID (Bus# and Device#) and the Physical Slot ID.

Quiescing Card and Driver

General

Prior to removing a card from the system, two things must occur:
  1. The device's driver must cease accessing the card.
  2. The card must cease generating transactions and interrupts.
How this is accomplished is OS-specific, but the following must take place:
  • The OS must stop issuing new requests to the device's driver or must instruct the driver to stop accepting new requests.
  • The driver must terminate or complete all outstanding requests.
  • The card must be disabled from generating interrupts or transactions.
When the OS commands the driver to quiesce itself and its device, the OS must not expect the device to remain in the system (in other words, it could be removed and not replaced with a similar card).
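The quiesce requirements above translate into a driver-side sequence along the following lines. Every routine named here is a hypothetical driver internal, not a spec-defined or OS-defined interface.

    /* Hypothetical driver-side quiesce sequence for one card. */
    struct hp_device;

    void stop_accepting_new_requests(struct hp_device *dev);
    void drain_or_abort_outstanding_requests(struct hp_device *dev);
    void disable_card_transactions_and_interrupts(struct hp_device *dev);

    void driver_quiesce(struct hp_device *dev)
    {
        stop_accepting_new_requests(dev);              /* OS stops issuing requests  */
        drain_or_abort_outstanding_requests(dev);      /* complete or terminate all  */
        disable_card_transactions_and_interrupts(dev); /* no more TLPs or interrupts */
        /* From here on, the OS must not assume the card remains installed. */
    }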

Pausing a Driver (Optional)

Optionally, an OS could implement a "Pause" capability to temporarily stop driver activity in the expectation that the same card or a similar card will be reinserted. If the card is not reinstalled within a reasonable amount of time, however, the driver must be quiesced and then removed from memory.


A card may be removed and an identical card installed in its place. As an example, this could be because the currently-installed card is bad or is being replaced with a later revision as an upgrade. If it is intended that the operation appear seamless from a software and operational perspective, the driver would have to quiesce, save the current device's context (i.e., the contents of all of its registers, etc.). The new card would then be installed, the context restored, and normal operation would resume. It should be noted that if the old card had failed, it may or may not be possible to have the operation appear seamless.

Quiescing a Driver That Controls Multiple Devices

If a driver controls multiple cards and it receives a command from the OS to quiesce its activity with respect to a specific card, it must quiesce its activity with only that card, as well as quiescing the card itself.

Quiescing a Failed Card

If a card has failed, it may not be possible for the driver to complete requests previously issued to the card. In this case, the driver must detect the error and must terminate the requests without completion and attempt to reset the card.

The Primitives

This section discusses the hot-plug software elements and the information passed between them. For a review of the software elements and their relationships to each other, refer to Table 17-1 on page 655. Communication between the Hot-Plug Service within the OS and the Hot-Plug System Driver is in the form of requests. The spec doesn't define the exact format of these requests, but does define the basic request types and their content. Each request type issued to the Hot-Plug System Driver by the Hot-Plug Service is referred to as a primitive. They are listed and described in Table 17-8 on page 682; a hypothetical C rendering of these primitives follows the table.
Table 17-8: The Primitives
Primitive | Parameters | Description
Query Hot-Plug System Driver | Input: none. Return: set of Logical Slot IDs for slots controlled by this driver. | Requests that the Hot-Plug System Driver return a set of Logical Slot IDs for the slots it controls.
Set Slot Status | Inputs: Logical Slot ID; new slot state (on or off); new Attention Indicator state; new Power Indicator state. Return: request completion status (status change successful; fault - wrong frequency; fault - insufficient power; fault - insufficient configuration resources; fault - power fail; fault - general failure). | This request is used to control the slots and the Attention Indicator associated with each slot. Good completion of a request is indicated by returning the Status Change Successful parameter. If a fault is incurred during an attempted status change, the Hot-Plug System Driver should return the appropriate fault message (see the fault values listed under Return). Unless otherwise specified, the card should be left in the off state.
Query Slot Status | Input: Logical Slot ID. Return: slot state (on or off); card power requirements. | This request returns the state of the indicated slot (if a card is present). The Hot-Plug System Driver must return the Slot Power status information.
Async Notice of Slot Status Change | Input: Logical Slot ID. Return: none. | This is the only primitive (defined by the spec) that is issued to the Hot-Plug Service by the Hot-Plug System Driver. It is sent when the Driver detects an unsolicited change in the state of a slot. Examples would be a run-time power fault or a card installed in a previously-empty slot with no warning.
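Because the spec defines the primitives but not their exact format, the interface between the Hot-Plug Service and the Hot-Plug System Driver is OS-specific. The prototypes below are one hypothetical C rendering of Table 17-8; none of these names are defined by the spec.

    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t logical_slot_id_t;

    enum slot_state      { SLOT_OFF, SLOT_ON };
    enum indicator_state { INDICATOR_OFF, INDICATOR_ON, INDICATOR_BLINK };

    enum hp_status {                      /* Set Slot Status return values */
        HP_STATUS_CHANGE_SUCCESSFUL,
        HP_FAULT_WRONG_FREQUENCY,
        HP_FAULT_INSUFFICIENT_POWER,
        HP_FAULT_INSUFFICIENT_CONFIG_RESOURCES,
        HP_FAULT_POWER_FAIL,
        HP_FAULT_GENERAL_FAILURE
    };

    /* Query Hot-Plug System Driver: returns the set of slots this driver controls. */
    size_t hp_query_driver(logical_slot_id_t *slots, size_t max_slots);

    /* Set Slot Status: control slot power and the indicators. */
    enum hp_status hp_set_slot_status(logical_slot_id_t slot,
                                      enum slot_state new_state,
                                      enum indicator_state attention,
                                      enum indicator_state power);

    /* Query Slot Status: return the slot state and card power requirements. */
    enum hp_status hp_query_slot_status(logical_slot_id_t slot,
                                        enum slot_state *state,
                                        unsigned *card_power_milliwatts);

    /* Async Notice of Slot Status Change: issued by the driver to the service. */
    void hp_async_slot_status_change(logical_slot_id_t slot);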

Add-in Cards and Connectors

The Previous Chapter

PCI Express includes native support for hot plug implementations. The previous chapter discussed hot plug and hot removal of PCI Express devices. The specification defines a standard usage model for all device and platform form factors that support hot plug capability. The usage model defines, as an example, how push buttons and indicators (LEDs) behave, if implemented on the chassis, add-in card, or module. The definitions assigned to the indicators and push buttons apply to all models of hot plug implementations.

This Chapter

This chapter provides an introduction to the PCI Express add-in card electromechanical specifications. It describes the card form factor, the connector details, and the auxiliary signals with a description of their function. Other card form factors are also briefly described, but it should be stressed that some of them have not yet been approved by the SIG as of this writing.

The Next Chapter

The next chapter provides an introduction to configuration in the PCI Express environment. It introduces the configuration space in which a function's configuration registers are implemented, how a function is discovered, how configuration transactions are routed, PCI-compatible space, PCI Express extended configuration space, and how to differentiate between a normal function and a bridge.

Introduction

One goal of the PCI Express add-in card electromechanical spec was to encourage migration from the PCI architecture found in many desktop and mobile devices today by making the migration path straightforward and minimizing the required hardware changes. Towards this end, PCI Express add-in cards are defined to be very similar to the current PCI add-in card form factor, allowing them to readily coexist with PCI slots in system boards designed to the ATX or micro-ATX standard. PCI Express features like automatic polarity inversion and lane reversal also help reduce layout issues on system boards, so they can still be designed using the four-layer FR4 board construction commonly used today. As a result, much of an existing system board design can remain the same when it is modified to use the new architecture, and no changes are required for existing chassis designs.

Add-in Connector

The PCI Express add-in card connector (see Figure 18-1 on page 687 and Figure 18-2 on page 688) is physically very similar to the legacy PCI connector, but uses a different pinout and does not supply -12V or 5V power. The physical dimensions of a card are the same as the PCI add-in cards and the same IO bracket is used. Table 18-1 on page 689 shows the pinout for a connector that supports PCI Express cards up to ×16 (16 lanes wide). Several signals are referred to as auxiliary signals in the spec, and these are highlighted and described in more detail in the section that follows the table.
Note that cards with fewer lanes can be plugged into larger connectors that will accommodate more lanes. This is referred to as Up-plugging. The opposite case, installing a larger card into a smaller slot, is called Down-plugging and, unlike PCI, is physically prevented in PCI Express by the connector keying. Consequently, the connector described by the table will accommodate a card that is x1, x4, x8, or x16. This flexibility in the connector is highlighted by notes in the table that indicate each group of signals. For example, an x4 card plugged into this slot would only make use of pins 1 through 32, and so the note indicating the end of the x4 group of signals appears after pin 32. These segment indicators do not represent physical spaces or keys, however, because there is only one mechanical key on the connector, located between pins 11 and 12.


Figure 18-1: PCI Express x1 connector


Figure 18-2: PCI Express Connectors on System Board
Table 18-1: PCI Express Connector Pinout
Pin # | Side B Name | Side B Description | Side A Name | Side A Description
1 | +12V | 12V Power | PRSNT1# | Hot-Plug presence detect
2 | +12V | 12V Power | +12V | 12V Power
3 | RSVD | Reserved | +12V | 12V Power
4 | GND | Ground | GND | Ground
5 | SMCLK | SMBus (System Management Bus) Clock | JTAG2 | TCK (Test Clock), clock input for JTAG interface
6 | SMDAT | SMBus (System Management Bus) Data | JTAG3 | TDI (Test Data Input)
7 | GND | Ground | JTAG4 | TDO (Test Data Output)
8 | +3.3V | 3.3V Power | JTAG5 | TMS (Test Mode Select)
9 | JTAG1 | TRST# (Test Reset), resets the JTAG interface | +3.3V | 3.3V Power
10 | 3.3VAUX | 3.3V Auxiliary Power | +3.3V | 3.3V Power
11 | WAKE# | Signal for link reactivation | PERST# | Fundamental reset
Mechanical Key
12 | RSVD | Reserved | GND | Ground
13 | GND | Ground | REFCLK+ | Reference Clock (differential pair)
14 | PETp0 | Transmitter differential pair, Lane 0 | REFCLK- | Reference Clock (differential pair)
15 | PETn0 | Transmitter differential pair, Lane 0 | GND | Ground
16 | GND | Ground | PERp0 | Receiver differential pair, Lane 0
17 | PRSNT2# | Hot-Plug presence detect | PERn0 | Receiver differential pair, Lane 0
18 | GND | Ground | GND | Ground
End of the x1 connector
19 | PETp1 | Transmitter differential pair, Lane 1 | RSVD | Reserved
20 | PETn1 | Transmitter differential pair, Lane 1 | GND | Ground
21 | GND | Ground | PERp1 | Receiver differential pair, Lane 1
22 | GND | Ground | PERn1 | Receiver differential pair, Lane 1
23 | PETp2 | Transmitter differential pair, Lane 2 | GND | Ground
24 | PETn2 | Transmitter differential pair, Lane 2 | GND | Ground
25 | GND | Ground | PERp2 | Receiver differential pair, Lane 2
26 | GND | Ground | PERn2 | Receiver differential pair, Lane 2
27 | PETp3 | Transmitter differential pair, Lane 3 | GND | Ground
28 | PETn3 | Transmitter differential pair, Lane 3 | GND | Ground
29 | GND | Ground | PERp3 | Receiver differential pair, Lane 3
30 | RSVD | Reserved | PERn3 | Receiver differential pair, Lane 3
31 | PRSNT2# | Hot-Plug presence detect | GND | Ground
32 | GND | Ground | RSVD | Reserved
End of the x4 connector
33 | PETp4 | Transmitter differential pair, Lane 4 | RSVD | Reserved
34 | PETn4 | Transmitter differential pair, Lane 4 | GND | Ground
35 | GND | Ground | PERp4 | Receiver differential pair, Lane 4
36 | GND | Ground | PERn4 | Receiver differential pair, Lane 4
37 | PETp5 | Transmitter differential pair, Lane 5 | GND | Ground
38 | PETn5 | Transmitter differential pair, Lane 5 | GND | Ground
39 | GND | Ground | PERp5 | Receiver differential pair, Lane 5
40 | GND | Ground | PERn5 | Receiver differential pair, Lane 5
41 | PETp6 | Transmitter differential pair, Lane 6 | GND | Ground
42 | PETn6 | Transmitter differential pair, Lane 6 | GND | Ground
43 | GND | Ground | PERp6 | Receiver differential pair, Lane 6
44 | GND | Ground | PERn6 | Receiver differential pair, Lane 6
45 | PETp7 | Transmitter differential pair, Lane 7 | GND | Ground
46 | PETn7 | Transmitter differential pair, Lane 7 | GND | Ground
47 | GND | Ground | PERp7 | Receiver differential pair, Lane 7
48 | PRSNT2# | Hot-Plug presence detect | PERn7 | Receiver differential pair, Lane 7
49 | GND | Ground | GND | Ground
End of the x8 connector
50 | PETp8 | Transmitter differential pair, Lane 8 | RSVD | Reserved
51 | PETn8 | Transmitter differential pair, Lane 8 | GND | Ground
52 | GND | Ground | PERp8 | Receiver differential pair, Lane 8
53 | GND | Ground | PERn8 | Receiver differential pair, Lane 8
54 | PETp9 | Transmitter differential pair, Lane 9 | GND | Ground
55 | PETn9 | Transmitter differential pair, Lane 9 | GND | Ground
56 | GND | Ground | PERp9 | Receiver differential pair, Lane 9
57 | GND | Ground | PERn9 | Receiver differential pair, Lane 9
58 | PETp10 | Transmitter differential pair, Lane 10 | GND | Ground
59 | PETn10 | Transmitter differential pair, Lane 10 | GND | Ground
60 | GND | Ground | PERp10 | Receiver differential pair, Lane 10
61 | GND | Ground | PERn10 | Receiver differential pair, Lane 10
62 | PETp11 | Transmitter differential pair, Lane 11 | GND | Ground
63 | PETn11 | Transmitter differential pair, Lane 11 | GND | Ground
64 | GND | Ground | PERp11 | Receiver differential pair, Lane 11
65 | GND | Ground | PERn11 | Receiver differential pair, Lane 11
66 | PETp12 | Transmitter differential pair, Lane 12 | GND | Ground
67 | PETn12 | Transmitter differential pair, Lane 12 | GND | Ground
68 | GND | Ground | PERp12 | Receiver differential pair, Lane 12
69 | GND | Ground | PERn12 | Receiver differential pair, Lane 12
70 | PETp13 | Transmitter differential pair, Lane 13 | GND | Ground
71 | PETn13 | Transmitter differential pair, Lane 13 | GND | Ground
72 | GND | Ground | PERp13 | Receiver differential pair, Lane 13
73 | GND | Ground | PERn13 | Receiver differential pair, Lane 13
74 | PETp14 | Transmitter differential pair, Lane 14 | GND | Ground
75 | PETn14 | Transmitter differential pair, Lane 14 | GND | Ground
76 | GND | Ground | PERp14 | Receiver differential pair, Lane 14
77 | GND | Ground | PERn14 | Receiver differential pair, Lane 14
78 | PETp15 | Transmitter differential pair, Lane 15 | GND | Ground
79 | PETn15 | Transmitter differential pair, Lane 15 | GND | Ground
80 | GND | Ground | PERp15 | Receiver differential pair, Lane 15
81 | PRSNT2# | Hot-Plug presence detect | PERn15 | Receiver differential pair, Lane 15
82 | RSVD | Reserved | GND | Ground

Auxiliary Signals

General

Several signals highlighted in Table 18-1 as auxiliary signals are described here in more detail. These signals are provided to assist with certain system level functions and are not required by the general PCI Express architecture, although some are required for add-in cards. For reference, these signals are summarized in Table 18-2.
Table 18-2: PCI Express Connector Auxiliary Signals
Signal Name | Required or Optional | Signal Type | Definition
REFCLK+, REFCLK- | Required | Low-voltage differential clock | 100MHz (+/-300ppm) reference clock used to synchronize devices on both ends of a link.
PERST# | Required | Low speed | Indicates when main power is within tolerance and stable. PERST# goes inactive after a delay of TPVPERL once power is stable.
WAKE# | Required if wakeup functionality is supported | Open-drain | Driven low by a function to request that the main power and reference clock be reactivated.
SMBCLK | Optional | Open-drain | SMBus clock signal.
SMBDAT | Optional | Open-drain | SMBus address/data signal.
JTAG Group | Optional | Low speed | This group of signals (TCK, TDI, TDO, TMS, and TRST#) can optionally be used to support the IEEE 1149.1 boundary scan spec.
PRSNT1#, PRSNT2# | Required | - | These signals are used to indicate that a card is installed into the connector.

Reference Clock

This differential clock must be provided by the system board (although its use is optional for add-in cards). Its purpose is to allow both the transmitter and the receiver on a link to derive their internal clocks from the same source clock. While using the reference clock is not required, it does simplify the task of keeping the internal clocks between devices on either end of a link within the specified 600ppm of each other, since any two reference clocks are required to be within +/-300ppm of their nominal 100MHz frequency. In addition, the base spec states that minimizing the L0s exit latency (i.e., the time required for the link to transition from the lower power L0s state back to L0) requires using a common reference clock. Finally, if Spread Spectrum Clocking (SSC) is to be used, it generally requires that both transmitters and receivers on a link use the same reference clock. SSC allows the clock to be "down-modulated", or reduced in frequency, by as much as 0.5% and then brought back up to its nominal frequency at a rate not higher than 33KHz. Trying to modulate the clock frequency among devices that were not using the same reference clock would clearly be very difficult.
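As a quick sanity check on these numbers: 300 ppm of a 100MHz reference is only 30 kHz, so two worst-case clocks differ by 60 kHz (600 ppm), while a 0.5% SSC down-spread lowers the clock by as much as 500 kHz. The short program below simply works through that arithmetic.

    #include <stdio.h>

    /* Worked example of the reference clock tolerances quoted above. */
    int main(void)
    {
        const double nominal_hz = 100e6;   /* 100 MHz reference clock     */
        const double ppm        = 300.0;   /* +/-300 ppm per clock source */
        const double ssc_spread = 0.005;   /* 0.5% SSC down-spread        */

        double offset_hz = nominal_hz * ppm / 1e6;          /* 30 kHz   */
        double ssc_floor = nominal_hz * (1.0 - ssc_spread); /* 99.5 MHz */

        printf("+/-300 ppm of 100 MHz      = +/-%.0f kHz\n", offset_hz / 1e3);
        printf("worst-case two-clock delta = %.0f kHz (600 ppm)\n",
               2.0 * offset_hz / 1e3);
        printf("SSC down-spread floor      = %.1f MHz\n", ssc_floor / 1e6);
        return 0;
    }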

PERST#

This signal, similar in function to an inverted version of the POWERGOOD signal in a typical PC, is deasserted at least 100 ms after the power supply is stable and within tolerance (see Figure 18-3 on page 695). PERST# is also aware of power management activity and so can also be used to give PCI Express devices some advance notice that power is about to be removed as a result of a power management operation (see Figure 18-4 on page 696). As long as PERST# remains asserted, all PCI Express functions are held in reset.
Figure 18-3: PERST Timing During Power Up
  1. 3.3Vaux stable to SMBus driven (optional). If no 3.3Vaux on platform, the delay is from +3.3V stable.
  2. Minimum time from power rails within specified tolerance to PERST# inactive (TPVPERL).
  3. Minimum clock valid to PERST# inactive (TPERST-CLK).
  4. Minimum PERST# inactive to PCI Express link out of electrical idle.
  5. Minimum PERST# inactive to JTAG driven (optional).


Figure 18-4: PERST# Timing During Power Management States
  1. The PCI Express link will be put into electrical idle prior to PERST# going active.
  1. PERST# goes active before the power on the connector is removed.
  1. Clock and JTAG go inactive after PERST# goes active.
  1. A wakeup event resumes the power to the connector, restarts the clock, and the sequence proceeds as in power up.
  1. The minimum active time for PERST# is TPERST.

WAKE#

This open-drain signal is driven by a PCI Express device that supports the wakeup function to request reactivation of the main power and reference clock. If an add-in card supports the wakeup process, it must implement this pin, and a system board must support the function if it connects to the WAKE# pin on the slot. There are actually two defined wakeup mechanisms, the side-band WAKE# signal and an in-band indicator called the Beacon. The Beacon is required for all components with the exception of certain form factors, of which the PCI Express add-in card is one example. Systems that support wakeup for these form factors are required to support the WAKE# signal for them, although
they are also encouraged to support the Beacon. Add-in cards that can generate a wakeup event are also required to support the Beacon operation. It is not clear why two mechanisms have been defined. One emphasis in PCI Express has been to reduce side-band signals, which would argue against adding a sideband wakeup signal. On the other hand, the use of the WAKE# signal may serve to reduce the latency involved in waking up the system enough to justify its use for add-in cards.
If a slot supports WAKE#, the signal is routed to the platform power management controller, which might reside, for example, inside the Root Complex. The WAKE# signals from all the slots can be bussed together into a single input or they can each be used as separate inputs to the controller. WAKE# must have a system board pullup to a reference voltage that will be present when the main power rails are turned off, and the pullup must be a value that will allow it to pull WAKE# high in no more than 100ns. Note that Hot plug requires WAKE# to be isolated (between connectors) and driven inactive during hot-add or hot-remove operations.
WAKE# functions in a way that is similar to PME# in a conventional PCI system, but it is not the same and must not be connected directly to the PME# signal. The spec also makes it clear that WAKE# must not directly cause an interrupt. As was true of the PME# signal in PCI, care must be taken to ensure that the generation of WAKE# in one device does not damage the WAKE# generation circuitry in another device. This could present a problem if one device has 3.3VAUX supplied while another does not, permitting the output buffers of the device without power to be reverse-biased by the assertion of WAKE# and possibly damaged. One solution to this problem is to add a circuit like the one shown in Figure 18-5. As would be expected, a card can only initiate a wakeup event if 3.3VAUX is supplied to it, since the other power rails may be turned off when the link is put into a sleep state.
Figure 18-5: Example of WAKE# Circuit Protection

SMBus

This optional 2-wire bus provides a simple, inexpensive bus for system control and power management, reducing pin count and improving flexibility. One purpose for this bus is to reduce the number of control lines needed for the PCI Express bus, since it can be used to send SMBus messages between system devices. These messages can report manufacturer information, save state for a suspended event, report errors, accept control parameters, or supply status. The
operation and requirements of the SMBus are described in detail in the System Management Bus Specification, Version 2.0.

JTAG

This optional interface provides a Test Access Port (TAP) to facilitate testing of a card that implements it. The TAP pins operate at 3.3V, as do the other single-ended IO signals of the PCI Express connector. JTAG stands for Joint Test Action Group and is commonly used to refer to the IEEE Standard 1149.1, Test Access Port and Boundary Scan Architecture.

PRSNT Pins

Refer to Figure 18-6 on page 700. These pins are used by the system to indicate whether a card has been plugged into a connector. On the add-in card, the PRSNT1# pin is wired to the farthest available PRSNT2# pin on the connector. For example, a x4 card would wire pins 1A (PRSNT1#) and 31B (PRSNT2#) together on the card. On the system board the PRSNT1# pin on the slot is grounded, while all the PRSNT2# pins of the slot are bussed together and pulled high, so the system is able to detect that a card has been installed in the slot by observing that the PRSNT2# signal has been pulled low.
Detecting that a card has been added is useful in a system that implements either hot-plug or hot-swap mechanisms, since a slot could be left powered off when no card is detected. Upon insertion of a new card, the hardware could detect the change and begin the process of preparing the system to bring the new card online. When the new card goes active, the link will automatically detect that a device is present and begin the process of training the link.
As an aside, the fact that an add-in card is required to connect PRSNT1# to the farthest possible PRSNT2# pin may mean that the spec designers considered using the presence detect pins to indicate information such as the link width on an add-in card. However, if the system board simply connects all the PRSNT2# pins together, this indication is not available. Visibility of the link width may have presented no real advantage anyway, since the link will automatically establish the usable link width during training.
Figure 18-6: Presence Detect

Electrical Requirements

Power Supply Requirements

Table 18-3 describes the power supplied to an add-in card. Note that the current provided by the +3.3V and +3.3VAUX supplies does not change as a function of the link width, while it does for the +12V supply, indicating that the +12V supply provides the power needed for add-in cards that have higher wattage requirements. Both the +3.3V and +12V power supplies are required for an add-in connector, while +3.3VAUX is optional. The current limits shown in the table for +3.3VAUX indicate that the higher allowance is only for devices that support wakeup. This resembles the power limits in PCI assigned for 3.3VAUX, in which the limit is based on whether a card is PME enabled, but there is an exception to the rule implied by this table in PCI Express. The configuration bit called Auxiliary Power PM Enable found in the Device Control Register (see "Device Control Register" on page 905), when set, indicates that a device has permission to use the full 375mA of auxiliary power regardless of whether it supports the wakeup function.
Table 18-3: Power Supply Requirements

+3.3V (all connector widths): Voltage Tolerance +/- 9% (max), Supply Current 3.0A (max), Capacitive Load 1000uF (max)
+12V: Voltage Tolerance +/- 8%, Capacitive Load 300uF (max); Supply Current 0.5A (x1 connector), 2.1A (x4/x8 connector), 4.4A (x16 connector)
+3.3VAUX (all connector widths): Voltage Tolerance +/- 9% (max), Supply Current 375mA (max) wakeup-enabled or 20mA (max) non-wakeup-enabled, Capacitive Load 150uF (max)

Power Dissipation Limits

The power consumption limits for different link widths and card types are listed in Table 18-4. The table indicates, for example, that a x1 card cannot exceed 10W unless it is a high power device intended for server applications, in which case the maximum is 25W. At the high end, a x16 graphics card is allowed to consume up to 60W (increasing this value to 75W is currently under consideration).
Table 18-4: Add-in Card Power Dissipation
Card Type | x1 | x4/x8 | x16
Standard Height | 10W (max) desktop application, 25W (max) server application | 25W (max) | 25W (max) server application, 60W (max) graphics application
Low Profile card | 10W (max) | 10W (max) | 25W (max)
The difference between the types of cards is described in more detail in the electromechanical spec, but basically, the standard height cards intended for desktop applications are limited to half-length add-in cards with lower wattages, while cards intended for server applications range from at least 7.0 inches long up to full-length cards and are allowed to use higher wattages. Low-profile cards are limited to half-length and lower wattages.
Note that devices designated as high power are constrained to start up using the low power limits until they have been configured as high power devices. As a result, all x1 cards are initially limited to 10W, and cards intended for graphics applications are limited to 25W at initial power up until configured as a high power device, at which time they can use up to 60W (this may be increased to 75W in the future). See section "Slot Power Limit Control" on page 562 for more information on power configuration.
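As a rough back-of-the-envelope cross-check (a sketch using the supply currents from Table 18-3, not a spec-defined budgeting calculation), the power available from a x16 connector lands in the neighborhood of the 60W graphics allowance:

```c
#include <stdio.h>

int main(void) {
    /* Maximum supply currents for a x16 connector, from Table 18-3. */
    double p_3v3 =  3.3 * 3.0;   /* +3.3V rail: 3.0A max -> ~9.9W  */
    double p_12v = 12.0 * 4.4;   /* +12V rail:  4.4A max -> ~52.8W */

    printf("x16 connector budget: about %.1f W\n", p_3v3 + p_12v);  /* ~62.7W */
    return 0;
}
```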

Add-in Card Interoperability

As mentioned earlier, it is possible for a PCI Express add-in card to be plugged into a slot that was intended for a wider card. This is illustrated in Table 18-5, which also points out that all the slot widths must support the basic ×1 card. There are basically three size-mismatch scenarios to consider:
  1. Up-plugging. Inserting a smaller link card into a larger link slot is fully allowed.
  1. Down-plugging. Inserting a larger link card into a smaller link slot is not allowed and is physically prevented.
  1. Down-shifting. Installing a card into a slot that is not fully routed for all of the lanes. This is not allowed except for the case of a ×8 connector for which the system designer may choose to route only the first four lanes. A ×8 card functions as a ×4 card in this situation.
Table 18-5: Card Interoperability
Card | x1 Slot | x4 Slot | x8 Slot | x16 Slot
x1 Card | Required | Required | Required | Required
x4 Card | No | Required | Allowed | Allowed
x8 Card | No | No | Required | Allowed
x16 Card | No | No | No | Required

Form Factors Under Development

General

In addition to the cards and slots that are defined within the PCI Express Card Electromechanical spec, there are other form factors currently under development by various groups. These form factors are usually designed for a particular class of applications and thus have different constraints. A card designed for a server application, for example, will typically have a bigger power budget and more space available than one designed for a mobile application. The specs for these form factors were still under development at the time of this publication and are therefore subject to change.

Server IO Module (SIOM)

The PCI SIG is currently developing a module for the server environment called the Server IO Module (SIOM). It has four form factors: a base version and a full version, both of which can use a single- or double-width card. The built-in hot-swap capability of PCI Express helps make this module attractive to the server market because it allows changes to be readily made to the system on the fly.


Riser Card

Figure 18-7 on page 704 illustrates an example of a PCI Express riser card.
Figure 18-7: PCI Express Riser Card

Mini PCI Express Card

Refer to Figure 18-8 on page 705, Figure 18-9 on page 706 and Figure 18-10 on page 706. The Mini PCI Express add-in card is similar to the conventional Mini PCI card, but is about half the size, uses smaller connectors and is optimized for
mobile computing platforms and communications applications. It is designed with an interface that includes a x1 PCI Express connector, a USB 2.0 connector, and several LED indicators. While this form factor is designed for adding functionality internal to the system and therefore does not allow the user to readily make changes to the system, it does facilitate build-to-order or configure-to-order manufacture by making it easy to customize the functionality of a machine simply by choosing which cards to add or leave out of a product during assembly. This spec is being developed by the PCI SIG.
Figure 18-8: Mini PCI Express Add-in Card Installed in a Mobile Platform


Figure 18-9: Mini PCI Express Add-in Card Photo 1 (this illustration, from a presentation slide at the PCI Developer's Conference, was supplied courtesy of the PCI SIG)
Figure 18-10: Mini PCI Express Add-in Card Photo 2 (this illustration, from a presentation slide at the PCI Developer's Conference, was supplied courtesy of the PCI SIG)

NEWCARD Form Factor

NEWCARD is a standard that is similar to the Mini PCI Express in functionality but is designed for the user to readily install or remove. This spec is being developed by the PCMCIA group (Personal Computer Memory Card International Association) for use in desktop and mobile devices, and is expected to ultimately replace the existing CardBus PC card solution for these computers. Like the Mini PCI Express standard, the NEWCARD interface is defined to contain a x1 PCI Express connector, a USB 2.0 connector, and several LED indicators.
For desktop machines, NEWCARD will offer the hot plug and hot swap capabilities of PCI Express and USB, and allow a user to upgrade or expand the machine without having to open it. Since the card will be able to fit into mobile computers, it will also be possible for the user to share a NEWCARD device between desktop and mobile computers.
For a communications-specific card, the IO interface side of the card might include wired connections such as a modem or Ethernet interface, or a wireless port such as a cellular or Bluetooth connection.
Part Six

PCI Express Configuration

19

Configuration Overview

The Previous Chapter

The previous chapter provided an introduction to the PCI Express add-in card electromechanical specifications. It described the card form factor, the connector details, and the auxiliary signals with a description of their function. Other card form factors were also briefly described, but it should be stressed that some of them have not yet been approved by the SIG as of this writing.

This Chapter

This chapter provides an introduction to configuration in the PCI Express environment. It introduces the configuration space in which a function's configuration registers are implemented, how a function is discovered, how configuration transactions are routed, PCI-compatible space, PCI Express extended configuration space, and how to differentiate between a normal function and a bridge.

The Next Chapter

The next chapter provides a detailed description of the two configuration mechanisms used in a PCI Express platform: the PCI-compatible configuration mechanism, and the PCI Express enhanced configuration mechanism. It provides a detailed description of the initialization period immediately following power-up, as well as error handling during this period.

Definition of Device and Function

Just as in the PCI environment, a device resides on a bus and contains one or more functions (a device containing multiple functions is referred to as a multifunction device). Each of the functions within a multifunction device provides a stand-alone functionality. As an example, one function could be a graphics controller while another might be a network interface.
Just as in PCI, a device may contain up to a maximum of eight functions numbered 0-through-7:
  • The one-and-only function implemented in a single-function device must be function 0 .
  • In a multifunction device, the first function must be function 0, while the remaining functions do not have to be implemented in a sequential manner. In other words, a device could implement functions 0, 2, and 7.
In Figure 19-1 on page 713, Device 0 on Bus 3 is a multifunction device containing two functions, each of which implements its own set of configuration registers.
Figure 19-1: Example System

Definition of Primary and Secondary Bus

The bus connected to the upstream side of a bridge is referred to as its primary bus, while the bus connected to its downstream side is referred to as its secondary bus.

Topology Is Unknown At Startup

Refer to Figure 19-2 on page 714. When the system is first powered up, the configuration software has not yet scanned the PCI Express fabric to discover the machine topology and how the fabric is populated. The configuration software is only aware of the existence of the Host/PCI bridge within the Root Complex and that bus number 0 is directly connected to the downstream (i.e., secondary) side of the bridge.
It has not yet scanned bus 0 and therefore does not yet know how many PCI Express ports are implemented on the Root Complex. The process of scanning the PCI Express fabric to discover its topology is referred to as the enumeration process.
Figure 19-2: Topology View At Startup

Each Function Implements a Set of Configuration Registers

Introduction

At the behest of software executing on the processor, the Root Complex initiates configuration transactions to read from or write to a function's configuration registers. These registers are accessed to discover the existence of a function as well as to configure it for normal operation. In addition to memory, IO, and message space, PCI Express also defines a dedicated block of configuration space allocated to each function within which its configuration registers are implemented.

Function Configuration Space

Refer to Figure 19-3 on page 717. Each function's configuration space is 4KB in size and is populated as described in the following two subsections.

PCI-Compatible Space

The 256 byte (64 dword) PCI-compatible space occupies the first 256 bytes of this 4KB space. It contains the function’s PCI-compatible configuration registers. This area can be accessed using either of two mechanisms (both of which are described later):
  • The PCI configuration access mechanism (see "PCI-Compatible Configuration Mechanism" on page 723).
  • The PCI Express Enhanced Configuration mechanism (see "PCI Express Enhanced Configuration Mechanism" on page 731).
The first 16 dwords comprise the PCI configuration header area, while the remaining 48-dword area is reserved for the implementation of function-specific configuration registers as well as PCI New Capability register sets. Each PCI Express function must implement the PCI Express Capability Structure (defined later) within this area. A full description of the PCI-compatible registers may be found in "PCI Compatible Configuration Registers" on page 769.
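For orientation, the first few registers of this header area can be pictured as a packed structure. This is a partial sketch based on the standard PCI header layout described in the chapter referenced above, not a complete definition:

```c
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint16_t vendor_id;        /* offset 00h: FFFFh means no function present  */
    uint16_t device_id;        /* offset 02h                                   */
    uint16_t command;          /* offset 04h                                   */
    uint16_t status;           /* offset 06h                                   */
    uint8_t  revision_id;      /* offset 08h                                   */
    uint8_t  class_code[3];    /* offsets 09h-0Bh                              */
    uint8_t  cache_line_size;  /* offset 0Ch                                   */
    uint8_t  latency_timer;    /* offset 0Dh                                   */
    uint8_t  header_type;      /* offset 0Eh: bits 6:0 = layout, bit 7 = multi */
    uint8_t  bist;             /* offset 0Fh                                   */
    /* BARs, capability pointer, etc. complete the 16-dword header area. */
} pci_cfg_header_common;
#pragma pack(pop)
```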

PCI Express Extended Configuration Space

The remaining 3840 byte (960 dword) area is referred to as the PCI Express Extended Configuration Space. It is utilized to implement the optional PCI Express Extended Capability registers:
  • Advanced Error Reporting Capability register set.
  • Virtual Channel Capability register set.
  • Device Serial Number Capability register set.
  • Power Budgeting Capability register set.
A full description of these optional register sets may be found in "Express-Specific Configuration Registers" on page 893.

Host/PCI Bridge's Configuration Registers

The Host/PCI bridge's configuration register set does not have to be accessed using either of the spec-defined configuration mechanisms mentioned in the previous section. Rather, it is mapped into a Root Complex design-specific address space (almost certainly memory space) that is known to the platform-specific BIOS firmware. However, its configuration register layout and usage must adhere to the standard Type 0 template defined by the PCI 2.3 spec (see "Header Type 0" on page 770 for details on the Type 0 register template).


Figure 19-3: 4KB Configuration Space per PCI Express Function

Configuration Transactions Are Originated by the Processor

Only the Root Complex Can Originate Configuration Transactions

The spec states that only the Root Complex is permitted to originate configuration transactions. The Root Complex acts as the processor's surrogate to inject transaction requests into the fabric, as well as to pass completions back to the processor. The configuration software executing on the processor is responsible for detecting and configuring all devices in the system.
The ability to originate configuration transactions is restricted to the processor/ Root Complex to avoid the anarchy that would result if any device had the ability to change the configuration of other devices.

Configuration Transactions Only Move Downstream

This restriction exists for the same reason stated in the previous section.

No Peer-to-Peer Configuration Transactions

The following rule applies to Root Ports, Switches, and PCI Express-to-PCI Bridges: peer-to-peer propagation of Configuration Requests is not supported.

Configuration Transactions Are Routed Via Bus, Device, and Function Number

The transaction types that are routed via a bus, device, and function number (i.e., they use ID routing rather than address-based routing) are:
  • Configuration transactions.
  • Vendor-defined Messages may optionally be routed in this manner.
  • Completion transactions.
This chapter focuses on configuration-related issues.


How a Function Is Discovered

The configuration software executing on the processor typically discovers the existence of a function by performing a read from its PCI-compatible Vendor ID register. A unique 16-bit value is assigned to each vendor by the PCI-SIG and is hardwired into the Vendor ID register of each function designed by that vendor. The Vendor ID of FFFFh is reserved and will never be assigned to any vendor.
A function is considered present if the value read from its Vendor ID register is a value other than FFFFh. In a system that depends on a return of all ones for a configuration read from a non-existent register, the Root Complex must be designed to return all ones for a configuration read request that results in a UR (Unsupported Request) completion status.
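A minimal sketch of this presence test follows; cfg_read16() is a hypothetical helper standing in for whichever of the two configuration mechanisms (described in the next chapter) the platform uses:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: reads a 16-bit register at byte offset 'reg' in the
 * configuration space of the given bus/device/function. */
uint16_t cfg_read16(uint8_t bus, uint8_t dev, uint8_t func, uint16_t reg);

/* A function is present if its Vendor ID register (offset 00h) returns
 * anything other than the reserved FFFFh value. */
bool function_present(uint8_t bus, uint8_t dev, uint8_t func)
{
    return cfg_read16(bus, dev, func, 0x00) != 0xFFFF;
}
```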

How To Differentiate a PCI-to-PCI Bridge From a Non-Bridge Function

Refer to Figure 19-3 on page 717 and Figure 19-4 on page 719. The lower 7 bits of the Header Type register identify the basic category of the function:
  • 0 = the function is not a bridge.
  • 1 = the function is a PCI-to-PCI bridge (aka P2P) interconnecting two buses.
  • 2 = the function is a CardBus bridge.
In Figure 19-1 on page 713, the Header Type field in each of the Virtual P2Ps would return a value of 1, as would the PCI Express-to-PCI bridge (Bus 8, Device 0), while those in the following Endpoint functions would return 0:
  • Bus 3, Device 0.
  • Bus 4, Device 0.
  • Bus 7, Device 0.
  • Bus 10, Device 0.
Figure 19-4: Header Type Register
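Decoding the register shown in Figure 19-4 might look like the sketch below; cfg_read8() is a hypothetical configuration-read helper (analogous to the cfg_read16() sketch earlier in this chapter), and 0Eh is the Header Type register's byte offset within the header:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: reads an 8-bit configuration register. */
uint8_t cfg_read8(uint8_t bus, uint8_t dev, uint8_t func, uint16_t reg);

#define PCI_HEADER_TYPE_REG  0x0E

void classify_function(uint8_t bus, uint8_t dev, uint8_t func)
{
    uint8_t hdr   = cfg_read8(bus, dev, func, PCI_HEADER_TYPE_REG);
    uint8_t type  = hdr & 0x7F;   /* lower 7 bits: 0 = non-bridge, 1 = P2P bridge, 2 = CardBus bridge */
    bool    multi = hdr & 0x80;   /* bit 7: set if the device implements more than one function       */

    (void)type;
    (void)multi;   /* both values are used again during enumeration (see Chapter 21) */
}
```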

20 Configuration Mechanisms

The Previous Chapter

The previous chapter provided an introduction to configuration in the PCI Express environment. It introduced the configuration space in which a function's configuration registers are implemented, how a function is discovered, how configuration transactions are routed, PCI-compatible space, PCI Express extended configuration space, and how to differentiate between a normal function and a bridge.

This Chapter

This chapter provides a detailed description of the two configuration mechanisms used in a PCI Express platform: the PCI-compatible configuration mechanism, and the PCI Express enhanced configuration mechanism. It provides a detailed description of the initialization period immediately following power-up, as well as error handling during this period.

The Next Chapter

The next chapter provides a detailed description of the discovery process and bus numbering. It describes:
  • Enumerating a system with a single Root Complex
  • Enumerating a system with multiple Root Complexes
  • A multifunction device within a Root Complex or a Switch
  • An Endpoint embedded in a Switch or Root Complex
  • Automatic Requester ID assignment.
  • Root Complex Register Blocks (RCRBs)


Introduction

Refer to Figure 20-1 on page 723. Each function implements a 4KB configuration space. The lower 256 bytes (64 dwords) is the PCI-compatible configuration space, while the upper 960 dwords is the PCI Express extended configuration space.
There are two mechanisms available that allow configuration software running on the processor to stimulate the Root Complex to generate configuration transactions:
  • The PCI 2.3-compatible configuration access mechanism.
  • The PCI Express enhanced configuration mechanism.
These two mechanisms are described in this chapter.
Intel x86 and PowerPC processors (as two example processor families) do not possess the ability to perform configuration read and write transactions. They use memory and IO (IO only in the x86 case) read and write transactions to communicate with external devices. This means that the Root Complex must be designed to recognize certain IO or memory accesses initiated by the processor as requests to perform configuration accesses.


Figure 20-1: A Function's Configuration Space

PCI-Compatible Configuration Mechanism

For x86-based PC-AT compatible systems, the PCI 2.3 spec defines a method that utilizes processor-initiated IO accesses to instruct the host/PCI bridge (in this case, within the Root Complex) to perform PCI configuration accesses. The spec does not define a configuration mechanism to be used in systems other than PC-AT compatible systems.

Background

The x86 processor family is capable of addressing up to, but no more than, 64KB of IO address space. In the EISA spec, the usage of this IO space was defined in such a manner that the only IO address ranges available for the implementation of the PCI Configuration Mechanism (without conflicting with an ISA or EISA device) were 0400h - 04FFh, 0800h - 08FFh, and 0C00h - 0CFFh. Many EISA system board controllers already resided within the 0400h - 04FFh address range, making it unavailable.
Consider the following:
  • As with any other PCI function, a host/PCI bridge may implement up to 64 dwords of configuration registers.
  • Each PCI function on each PCI bus requires 64 dwords of dedicated configuration space.
Due to the lack of available IO real estate within the 64KB of IO space, it wasn't feasible to map each configuration register directly into the processor's IO address space. Alternatively, the system designer could implement the configuration registers within the processor's memory space. The amount of memory space consumed aside, the address range utilized would be unavailable for allocation to regular memory. This would limit the system's flexibility regarding the mapping of actual memory.

PCI-Compatible Configuration Mechanism Description

General

The PCI-Compatible Configuration Mechanism utilizes two 32-bit IO ports implemented in the Host/PCI bridge within the Root Complex, located at IO addresses 0CF8h and 0CFCh. These two ports are:
  • The 32-bit Configuration Address Port, occupying IO addresses 0CF8h through 0CFBh.
  • The 32-bit Configuration Data Port, occupying IO addresses 0CFCh through 0CFFh.
Accessing one of a function's PCI-compatible configuration registers is a two step process:
  1. Write the target bus number, device number, function number and dword number to the Configuration Address Port and set the Enable bit in it to one.
  1. Perform a one-byte, two-byte, or four-byte IO read from or a write to the Configuration Data Port.
In response, the host/PCI bridge within the Root Complex compares the specified target bus to the range of buses that exist on the other side of the bridge and, if the target bus resides beyond the bridge, it initiates a configuration read or write transaction (based on whether the processor is performing an IO read or write with the Configuration Data Port).

Configuration Address Port

Refer to Figure 20-2 on page 726. The Configuration Address Port only latches information when the processor performs a full 32-bit write to the port. A 32-bit read from the port returns its contents. The assertion of reset clears the port to all zeros. Any 8- or 16-bit access within this IO dword is treated as an 8- or 16-bit IO access. The 32-bits of information written to the Configuration Address Port must conform to the following template (illustrated in Figure 20-2 on page 726):
  • bits [1:0] are hard-wired, read-only and must return zeros when read.
  • bits [7:2] identify the target dword (1-of-64) within the target function's PCI-compatible configuration space. When the Root Complex subsequently generates the resultant configuration request packet, this bit field supplies the content of the packet's Register Number field and the packet's Extended Register Number field is set to all zeros. This configuration access mechanism is therefore limited to addressing the first 64 dwords of the targeted function's configuration space (i.e., the function's PCI-compatible address space).
  • bits [10:8] identify the target function number (1-of-8) within the target device.
  • bits [15:11] identify the target device number (1-of-32).
  • bits [23:16] identify the target bus number (1-of-256).
  • bits [30:24] are reserved and must be zero.
  • bit 31 must be set to a one, enabling the translation of a subsequent processor IO access to the Configuration Data Port into a configuration access. If bit 31 is zero and the processor initiates an IO read from or IO write to the Configuration Data Port, the transaction is treated as an IO transaction request.
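The two-step access can be sketched in C as follows. This is a minimal illustration, not spec text: io_write32()/io_read16() are hypothetical port-IO primitives (placeholders for whatever in/out instruction wrappers the platform provides), and the sketch only reads the low 16 bits of the addressed dword.

```c
#include <stdint.h>

#define CONFIG_ADDRESS_PORT  0x0CF8
#define CONFIG_DATA_PORT     0x0CFC

/* Hypothetical port-IO primitives (thin wrappers around out/in instructions). */
void     io_write32(uint16_t port, uint32_t value);
uint16_t io_read16(uint16_t port);

/* Compose the 32-bit Configuration Address Port value from the template above. */
static uint32_t cf8_address(uint8_t bus, uint8_t dev, uint8_t func, uint8_t dword)
{
    return (1u << 31)                          /* bit 31: Enable               */
         | ((uint32_t)bus << 16)               /* bits 23:16: bus number       */
         | ((uint32_t)(dev  & 0x1F) << 11)     /* bits 15:11: device number    */
         | ((uint32_t)(func & 0x07) << 8)      /* bits 10:8: function number   */
         | ((uint32_t)(dword & 0x3F) << 2);    /* bits 7:2: dword number       */
}

/* Read the low 16 bits of one of the first 64 dwords of a function's
 * configuration space using the PCI-compatible mechanism. */
uint16_t pci_cfg_read16(uint8_t bus, uint8_t dev, uint8_t func, uint8_t dword)
{
    io_write32(CONFIG_ADDRESS_PORT, cf8_address(bus, dev, func, dword));
    return io_read16(CONFIG_DATA_PORT);        /* 2-byte read from 0CFCh       */
}
```

For example, cf8_address(4, 0, 0, 0) evaluates to 80040000h, the same value used in the worked example later in this chapter.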


Figure 20-2: Configuration Address Port at 0CF8h

Bus Compare and Data Port Usage

Refer to Figure 20-3 on page 728. The Host/PCI bridge within the Root Complex implements a Bus Number register and a Subordinate Bus Number register. In a chipset that only supports one Root Complex, the bridge may have a bus number register that is hardwired to 0, a read/write register that reset forces to 0, or it may just implicitly know that it is the bridge to bus 0. If bit 31 in the Configuration Address Port (see Figure 20-2 on page 726) is enabled (i.e., set to one), the bridge compares the target bus number to the range of buses that exists beyond the bridge.
Target Bus = 0. If the target bus is the same as the value in the Bus Number register, this is a request to perform a configuration transaction on bus 0. A subsequent IO read from or write to the bridge's Configuration Data Port at 0CFCh causes the bridge to generate a Type 0 configuration read or write transaction. When devices that reside on a PCI bus detect a Type 0 configuration transaction in progress, this informs them that one of them is the target device (rather than a device on one of the subordinate buses beneath the bus the Type 0 transaction is being performed on).
Bus Number < Target Bus ≤ Subordinate Bus Number. If the target bus specified in the Configuration Address Port is greater than the value in the bridge's Bus Number register, but less than or equal to the value in the bridge's Subordinate Bus Number register, the bridge converts the subsequent processor IO access to its Configuration Data Port into a Type 1 configuration transaction on bus 0. When devices (other than PCI-to-PCI bridges) that reside on a bus detect a Type 1 configuration access in progress, they ignore the transaction.
The only devices on a bus that pay attention to the Type 1 configuration transaction are PCI-to-PCI bridges. Each of them must determine if the target bus number (delivered in the packet's header) is within the range of buses that reside behind them:
  • If the target bus is not within range, then a PCI-to-PCI bridge ignores the Type 1 access.
  • If it's in range, the access is passed through the PCI-to-PCI bridge either as a Type 0 configuration transaction (if the target bus = the bus number in the bridge's Secondary Bus Number register), or as
  • a Type 1 transaction (if the target bus number is ≤ the value in the bridge's Subordinate Bus Number register and > the value in its Secondary Bus Number register).
The subject of Type 0 configuration accesses is covered in detail in "Type 0 Configuration Request" on page 732. The subject of Type 1 configuration accesses is covered in detail in "Type 1 Configuration Request" on page 733.
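The bus-compare decision described above can be condensed into a small sketch; the parameter names mirror the bridge's Secondary and Subordinate Bus Number registers:

```c
#include <stdint.h>

typedef enum { IGNORE_REQUEST, FORWARD_AS_TYPE0, FORWARD_AS_TYPE1 } cfg_action;

/* Decision a PCI-to-PCI bridge makes when it sees a Type 1 configuration
 * request on its primary bus. */
cfg_action bridge_bus_compare(uint8_t target_bus,
                              uint8_t secondary_bus,
                              uint8_t subordinate_bus)
{
    if (target_bus == secondary_bus)
        return FORWARD_AS_TYPE0;   /* destination bus reached: convert to Type 0    */

    if (target_bus > secondary_bus && target_bus <= subordinate_bus)
        return FORWARD_AS_TYPE1;   /* destination is further downstream: pass as is */

    return IGNORE_REQUEST;         /* target bus does not reside behind this bridge */
}
```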

Single Host/PCI Bridge

Refer to Figure 20-3 on page 728. The information written to the Configuration Address Port is latched by the Host/PCI bridge within the Root Complex. If bit 31 is set to one and the target bus number = the value in the bridge's Bus Number register (or is ≤ the value in the bridge's Subordinate Bus Number register), the bridge is enabled to convert a subsequent processor access targeting its Configuration Data Port into a configuration access on bus 0 within the Root Complex. The processor then initiates a one-byte, two-byte, or four-byte (for an x86 processor, indicated by the processor's byte enable signals; or, if a PowerPC 60x processor, by A[29:31] and TSIZ[0:2]) IO read or write transaction to the Configuration Data Port at 0CFCh. This stimulates the bridge to perform a configuration read (if the processor is reading from the Configuration Data Port) or a configuration write (if the processor is writing to the Configuration Data Port). It will be a Type 0 configuration transaction if the target bus is bus 0, or a Type 1 configuration transaction if the target bus is further out in the bus hierarchy beyond bus 0.
Figure 20-3: Example System

Multiple Host/PCI Bridges

If there are multiple Root Complexes present on the processor's FSB (refer to Figure 20-4 on page 730), the Configuration Address and Data ports are duplicated at the same IO addresses in each of their respective host/PCI bridges. In order to prevent contention on the processor's FSB signals, only one of the bridges responds to the processor's accesses to the configuration ports.
  1. When the processor initiates the IO write to the Configuration Address Port, only one of the host/PCI bridges actively participates in the transaction. The other bridge quietly snarfs the data as it's written to the active participant.
  1. Both bridges then compare the target bus number to their respective Bus Number and Subordinate Bus Number registers. If the target bus doesn't reside behind a particular host/PCI bridge, that bridge doesn't convert the subsequent access to its Configuration Data Port into a configuration access on its bus (in other words, it ignores the transaction).
  1. A subsequent read or write access to the Configuration Data Port is only accepted by the host/PCI bridge that is the gateway to the target bus. This bridge responds to the processor's transaction and the other ignores it.
  1. When the access is made to the Configuration Data Port, the bridge with a bus compare tests the state of the Enable bit in its Configuration Address Port. If the Enable bit = 1, the bridge converts the processor's IO access into a configuration access:
o If the target bus is the bus immediately on the other side of the Host/ PCI bridge, the bridge converts the access to a Type 0 configuration access on its secondary bus.
o Otherwise, it converts it into a Type 1 configuration access.
Figure 20-4: Peer Root Complexes

PCI Express Enhanced Configuration Mechanism

Description

Refer to Table 20-1 on page 732. Each function's 4KB configuration space starts at a 4KB-aligned address within the 256MB memory space set aside as configuration space:
  • Address bits 63:28 indicate the 256MB-aligned base address of the overall Enhanced Configuration address range.
  • Address bits 27:20 select the target bus (1-of-256).
  • Address bits 19:15 select the target device (1-of-32) on the bus.
  • Address bits 14:12 select the target function (1-of-8) within the device.
  • Address bits 11:2 select the target dword (1-of-1024) within the selected function's configuration space.
  • Address bits 1:0 define the start byte location within the selected dword.
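A comparable sketch for the enhanced mechanism composes a memory address instead of an IO port value. Here ecam_base stands for the firmware-supplied, 256MB-aligned base address, and mmio_read16() is a hypothetical memory-mapped IO read helper; both are assumptions for illustration only.

```c
#include <stdint.h>

/* Hypothetical memory-mapped IO read helper. */
uint16_t mmio_read16(uint64_t address);

/* Compose the enhanced-configuration address from the bit fields listed above. */
static uint64_t enhanced_cfg_address(uint64_t ecam_base,   /* 256MB-aligned base  */
                                     uint8_t bus, uint8_t dev, uint8_t func,
                                     uint16_t dword)        /* 0..1023             */
{
    return ecam_base
         | ((uint64_t)bus             << 20)    /* A[27:20]: target bus      */
         | ((uint64_t)(dev   & 0x1F)  << 15)    /* A[19:15]: target device   */
         | ((uint64_t)(func  & 0x07)  << 12)    /* A[14:12]: target function */
         | ((uint64_t)(dword & 0x3FF) <<  2);   /* A[11:2]:  target dword    */
}

/* Read the low 16 bits of any of the 1024 dwords of a function's configuration space. */
uint16_t pcie_cfg_read16(uint64_t ecam_base,
                         uint8_t bus, uint8_t dev, uint8_t func, uint16_t dword)
{
    return mmio_read16(enhanced_cfg_address(ecam_base, bus, dev, func, dword));
}
```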

Some Rules

A Root Complex design is not required to support an access to the enhanced configuration memory space that crosses a dword address boundary (i.e., an access that straddles two adjacent memory dwords).
In addition, some processor types can perform a series of memory accesses as an atomic, locked, transaction series. A Root Complex design is not required to support an access to the enhanced configuration memory space using this locking mechanism.
This being the case, software should avoid both of the scenarios just described unless it is known that the Root Complex implementation being used supports the translation.
Table 20-1: Enhanced Configuration Mechanism Memory-Mapped IO Address Range

Memory Address Bit Field | Description
A[63:28] | Upper bits of the 256MB-aligned base address of the 256MB memory-mapped IO address range allocated for the Enhanced Configuration Mechanism. The manner in which the base address is allocated is implementation-specific. It is supplied to the OS by system firmware.
A[27:20] | Target Bus Number (1-of-256).
A[19:15] | Target Device Number (1-of-32).
A[14:12] | Target Function Number (1-of-8).
A[11:2] | Target Dword Number (1-of-1024). A[11:8] supplies the upper four bits and A[7:2] the lower six bits.
A[1:0] | Along with the access size, defines the Byte Enable setting.

Type 0 Configuration Request

A configuration read or write takes the form of a Type 0 configuration read or write when it arrives on the destination bus. On discerning that it is a Type 0 configuration operation:
  1. The devices on the bus decode the header's Device Number field to determine which of them is the target device.
  1. The selected device decodes the header's Function Number field to determine the selected function within the device.
  1. The selected function uses the concatenated Extended Register Number and Register Number fields to select the target dword in the function's configuration space.
  1. Finally, the function uses the First Dword Byte Enable field to select the byte(s) to be read or written within the selected dword.


Figure 20-5 and Figure 20-6 illustrate the Type 0 configuration read and write request header formats. In both cases, the Type field = 00100b, while the state of the Fmt field's msb indicates whether it's a read or a write.
Figure 20-5: Type 0 Configuration Read Request Packet Header
Figure 20-6: Type 0 Configuration Write Request Packet Header

Type 1 Configuration Request

While in transit to the destination bus, a configuration read or write takes the form of a Type 1 configuration read or write when it is performed on each bus on the way to the destination bus. The only devices that pay attention to a Type 1 configuration read or write are PCI-to-PCI bridges. Upon receipt of a Type 1 configuration read or write request packet, a PCI-to-PCI bridge compares the target bus number in the packet header to the range of buses that reside behind the bridge (as defined by the contents of the bridge's Secondary Bus Number and Subordinate Bus Number configuration registers; see Figure 20-3 on page 728 and Figure 20-4 on page 730).


  • If the target bus is the bridge's secondary bus, the packet is converted from a Type 1 to a Type 0 configuration request when it is passed to the secondary bus. The devices on that bus then decode the packet header as previously described in "Type 0 Configuration Request" on page 732.
  • If the target bus is not the bridge's secondary bus but is a bus that resides beneath its secondary bus, the Type 1 request is passed through to the bridge's secondary bus as is.
Figure 20-7 and Figure 20-8 illustrate the Type 1 configuration read and write request header formats. In both cases, the Type field = 00101b, while the state of the Fmt field's msb indicates whether it's a read or a write.
Figure 20-7: Type 1 Configuration Read Request Packet Header
Figure 20-8: Type 1 Configuration Write Request Packet Header

Example PCI-Compatible Configuration Access

Refer to Figure 20-9 on page 737. The following x86 code sample will cause the Root Complex to perform a read from Bus 4, Device 0, Function 0's Vendor ID configuration register:
mov  dx, 0CF8h       ; dx = Configuration Address Port address
mov  eax, 80040000h  ; enable = 1, bus 4, dev 0, func 0, dword 0
out  dx, eax         ; set up the address port
mov  dx, 0CFCh       ; dx = Configuration Data Port address
in   ax, dx          ; 2-byte read from the Configuration Data Port
  1. On execution of the out (IO Write) instruction, the processor generates an IO write transaction on its FSB targeting the Configuration Address Port in the Root Complex Host/PCI bridge. The data sourced from the eax register is latched into the Configuration Address Port (see Figure 20-2 on page 726).
  1. The Host/PCI bridge compares the target bus number (4) specified in the Configuration Address Port to the range of buses (0-through-10) that reside downstream of the bridge. The target bus falls within the range, so the bridge is primed.
  1. On execution of the in (IO Read) instruction, the processor generates an IO read transaction on its FSB targeting the Configuration Data Port in the Root Complex Host/PCI bridge. It's a 2-byte read from the first two locations in the Configuration Data Port.
  1. Since the target bus is not bus 0 , the Host/PCI bridge initiates a Type 1 Configuration read on bus 0 .
  1. All of the devices on bus 0 latch the transaction request and determine that it is a type 1 Configuration Read request. As a result, both of the virtual PCI-to-PCI bridges in the Root Complex compare the target bus number in the Type 1 request to the range of buses that reside downstream of each of them.
  1. The destination bus (4) is within the range of buses downstream of the left-hand bridge, so it passes the packet through to its secondary bus (bus 1). It is passed through as a Type 1 request because this is not the destination bus.
  1. The upstream port on the left-hand switch receives the packet and delivers it to the upstream PCI-to-PCI bridge.
  1. The bridge determines that the destination bus resides beneath it, so it passes the packet through to bus 2 as a Type 1 request.
  1. Both of the bridges within the switch receive the Type 1 request packet and the right-hand bridge determines that the destination bus is directly beneath it.
  1. The bridge passes the Type 1 request packet through to bus 4, but converts it into a Type 0 Configuration Read request (because the packet has arrived at the destination bus).
  1. Device 0 on bus 4 receives the packet and decodes the target device number.
  1. Device 0 decodes the target function number.
  1. Function 0 in Device 0 uses the concatenated Extended Register Number and Register Number fields to select the target dword (dword 0; see Figure 20-1 on page 723) in the function's configuration space.
  1. The first two Byte Enables in the First Dword Byte Enable field are asserted, so the function returns its Vendor ID in the resulting Completion packet. The Completion packet is routed back to the Host/PCI bridge using the Requester ID field obtained from the Type 0 request packet.
  1. The two bytes of read data are delivered to the processor over its FSB, thereby completing the execution of the in instruction. The Vendor ID is placed in the processor's ax register.

Example Enhanced Configuration Access

Refer to Figure 20-9 on page 737. The following x86 code sample will cause the Root Complex to perform a read from Bus 4, Device 0, Function 0's Vendor ID configuration register. The example assumes that the 256MB-aligned base address of the Enhanced Configuration memory-mapped IO range is 50000000h:
mov  ax, [50400000h]  ; memory-mapped IO read
  • Address bits 63:28 indicate the upper 36 bits of the 256MB-aligned base address of the overall Enhanced Configuration address range (in this case, 000000005h).
  • Address bits 27:20 select the target bus (in this case, 4).
  • Address bits 19:15 select the target device (in this case, 0) on the bus.
  • Address bits 14:12 select the target function (in this case, 0) within the device.
  • Address bits 11:2 select the target dword (in this case, 0) within the selected function's configuration space.
  • Address bits 1:0 define the start byte location within the selected dword (in this case, 0).
The processor initiates a 2-byte memory read from memory locations 50400000h and 50400001h on its FSB. The request is latched by the Host/PCI bridge in the Root Complex. It decodes the address and determines that it is a configuration read request for the first two bytes in dword 0, function 0, device 0, bus 4. The remainder of the operation is the same as that described in the previous section.
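As a quick arithmetic check, the example address is composed exactly as the bullets above describe; only the bus field is non-zero:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Base 50000000h OR'd with (bus 4 << 20); device, function and dword fields are all 0. */
    uint64_t addr = 0x50000000ull | ((uint64_t)4 << 20);
    printf("%llxh\n", (unsigned long long)addr);   /* prints 50400000h */
    return 0;
}
```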


Figure 20-9: Example Configuration Access

Initial Configuration Accesses

What's Going On During Initialization Time?

During initialization time, the startup configuration software is accessing the configuration registers within each function to determine the presence of a function as well as its resource requirements. Immediately after RST# is removed from a PCI or a PCI-X function, it may not be prepared to service configuration accesses on a timely basis. As an example, a function's configuration registers might not contain valid default values immediately after RST# is removed. Perhaps the function must load this information into its configuration registers from a serial EEPROM. In this case, it could be a substantial amount of time after RST# removal before the function can provide read data from or accept write data into its configuration registers. For this reason, functions do not have to obey the 16-clock first Data Phase completion rule during initialization time.

Definition of Initialization Period In PCI

As defined in the PCI 2.3 spec, Initialization Time (Trhfa) begins when RST# is deasserted and completes 2^25 PCI clocks later (32 mega-cycles). This parameter is referred to in the spec as Trhfa (Time from Reset High-to-First-Access). At a bus speed of 33MHz, this equates to 1.0066 seconds, while it equates to 0.5033 seconds at a bus speed of 66MHz. Run-time follows initialization-time. If a target is accessed during initialization-time, it is allowed to do any of the following:
  • Ignore the request (except if it is a boot device). A boot device is one that must respond as a target in order to allow the processor to access the boot ROM. In a typical PC design, this would be the ICH (IO Controller Hub). Devices in the processor's path to the boot ROM should be prepared to be the target of a transaction immediately after Trhff expires (five clock cycles after RST# is deasserted).
  • Claim the access and hold in Wait States until it can complete the request, not to exceed the end of Initialization Time.
  • Claim the access and terminate with Retry.

Definition of Initialization Period In PCI-X

In PCI-X, Trhfa is 2^26 clocks (64 mega-cycles) in duration rather than 2^25 as it is in PCI. This is because the PCI-X clock speed can be substantially faster (up to 133MHz) than the PCI clock speed, and if this parameter remained the same as the PCI Trhfa spec, Initialization Time would be reduced to 0.25 seconds (at a clock speed of 133MHz).
During Initialization Time, a PCI-X target has the same options available as a PCI target does (see previous section).
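As a quick numeric check of these figures (a sketch that assumes nominal 33.33MHz, 66.66MHz, and 133.33MHz clock rates):

```c
#include <stdio.h>

int main(void) {
    double trhfa_pci  = (double)(1ull << 25);   /* 2^25 clocks (PCI)   */
    double trhfa_pcix = (double)(1ull << 26);   /* 2^26 clocks (PCI-X) */

    printf("PCI    2^25 @  33.33MHz: %.4f s\n", trhfa_pci  /  33.333e6); /* ~1.0066 s */
    printf("PCI    2^25 @  66.66MHz: %.4f s\n", trhfa_pci  /  66.666e6); /* ~0.5033 s */
    printf("       2^25 @ 133.33MHz: %.4f s\n", trhfa_pci  / 133.333e6); /* ~0.25 s   */
    printf("PCI-X  2^26 @ 133.33MHz: %.4f s\n", trhfa_pcix / 133.333e6); /* ~0.5033 s */
    return 0;
}
```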

PCI Express and Initialization Time

Just as in PCI or PCI-X, some devices in a PCI Express environment may go through a rather long self-initialization sequence before they are able to service configuration access requests.
When a PCI Express device receives a configuration request, it may respond with a Configuration Request Retry Status (CRS) completion. Requester receipt of a Completion with Configuration Request Retry Status terminates the configuration access request on PCI Express.

Initial Configuration Access Failure Timeout

After a PCI Express device is reset, the Root Complex and/or system software must allow 1.0s (+50%/-0%) for the device to return a Successful Completion status before deciding that the device has malfunctioned. This is analogous to the PCI/PCI-X Trhfa parameter.
When attempting a configuration access to a device on a PCI or PCI-X bus downstream of a PCI Express-to-PCI or -PCI-X bridge, Trhfa must be taken into account.

Delay Prior To Initial Configuration Access to Device

After system hardware or software causes one or more devices to be reset, software must wait at least 100ms from the end of reset before issuing any configuration requests to those devices. This time period is allocated to allow the device(s) to complete internal initialization.


The system design must guarantee (in a design-specific manner) that all components that must be software visible at boot time are ready to receive configuration requests within 100ms of the deassertion of Fundamental Reset at the Root Complex.

A Device With a Lengthy Self-Initialization Period

If a PCI Express device requires additional time to finish its self-initialization, the system design must provide a design-specific mechanism for re-issuing configuration requests terminated with CRS status after the initial 1s timeout has elapsed.
To ensure proper enumeration in a system running legacy PCI/PCI-X based software, the Root Complex hardware must re-issue the configuration request.

RC Response To CRS Receipt During Run-Time

After initialization time has elapsed, the action(s) taken by the Root Complex upon receipt of a Configuration Request Retry Completion Status is implementation-specific. It may re-issue the configuration request as a new request or may indicate failed completion to the processor.
If the Root Complex is designed to automatically retry the request, the number of retries attempted before indicating a failure to the processor is design-specific.
During Run-Time, support for a Completion Timeout (and the duration of the timeout) for configuration requests are implementation-specific.
The default setting in a PCI Express-to-PCI or -PCI-X bridge prevents it from returning a Configuration Request Retry Status (CRS) for a configuration request that targets a PCI or PCI-X device downstream of the bridge. This can result in a lengthy completion delay that must be taken into account by the Completion Timeout value used by the Root Complex. Configuration software can enable such a bridge to return Configuration Request Retry Status by setting the Bridge Configuration Retry Enable bit in the bridge's Device Control register.

21 PCI Express Enumeration

The Previous Chapter

The previous chapter provided a detailed description of the two configuration mechanisms used in a PCI Express platform: the PCI-compatible configuration mechanism, and the PCI Express enhanced configuration mechanism. It provided a detailed description of the initialization period immediately following power-up, as well as error handling during this period.

This Chapter

This chapter provides a detailed description of the discovery process and bus numbering. It describes:
  • Enumerating a system with a single Root Complex
  • Enumerating a system with multiple Root Complexes
  • A multifunction device within a Root Complex or a Switch
  • An Endpoint embedded in a Switch or Root Complex
  • Automatic Requester ID assignment.
  • Root Complex Register Blocks (RCRBs)

The Next Chapter

The next chapter provides a detailed description of the configuration registers residing in a function's PCI-compatible configuration space. This includes the registers for both non-bridge and bridge functions.

Introduction

The discussions associated with Figure 19-1 on page 713 and Figure 20-4 on page 730 assumed that each of the buses had been discovered and numbered earlier in time.


In reality, at power up time, the configuration software only knows of the existence of bus 0 (the bus that resides on the downstream side of the Host/PCI bridge) and does not even know what devices reside on bus 0 (see Figure 21-1 on page 742).
This chapter describes the enumeration process: the process of discovering the various buses that exist and the devices and functions which reside on each of them.
Figure 21-1: Topology View At Startup
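The depth-first procedure walked through below can be condensed into a short sketch. The cfg_read16()/cfg_read8()/cfg_write8() calls are hypothetical helpers built on either configuration mechanism from the previous chapter, and only the bus-numbering skeleton is shown (capability probing and resource assignment are omitted); a real enumerator follows the spec's ordering more carefully, but the shape of the recursion is the same:

```c
#include <stdint.h>

uint16_t cfg_read16(uint8_t bus, uint8_t dev, uint8_t func, uint16_t reg);
uint8_t  cfg_read8 (uint8_t bus, uint8_t dev, uint8_t func, uint16_t reg);
void     cfg_write8(uint8_t bus, uint8_t dev, uint8_t func, uint16_t reg, uint8_t val);

#define HEADER_TYPE_REG    0x0E   /* Header Type register                    */
#define PRIMARY_BUS_REG    0x18   /* bridge: Primary Bus Number register     */
#define SECONDARY_BUS_REG  0x19   /* bridge: Secondary Bus Number register   */
#define SUBORDINATE_REG    0x1A   /* bridge: Subordinate Bus Number register */

/* Depth-first scan of one bus. 'next_bus' is the next unassigned bus number.
 * Returns the highest bus number assigned at or below 'bus', i.e. the value
 * to program into the Subordinate Bus Number registers upstream of it. */
uint8_t scan_bus(uint8_t bus, uint8_t next_bus)
{
    uint8_t last_bus = bus;

    for (uint8_t dev = 0; dev < 32; dev++) {
        if (cfg_read16(bus, dev, 0, 0x00) == 0xFFFF)      /* Vendor ID probe  */
            continue;                                     /* device absent    */

        uint8_t hdr0    = cfg_read8(bus, dev, 0, HEADER_TYPE_REG);
        uint8_t maxfunc = (hdr0 & 0x80) ? 8 : 1;          /* multifunction?   */

        for (uint8_t func = 0; func < maxfunc; func++) {
            if (cfg_read16(bus, dev, func, 0x00) == 0xFFFF)
                continue;

            uint8_t hdr = cfg_read8(bus, dev, func, HEADER_TYPE_REG);
            if ((hdr & 0x7F) == 1) {                      /* PCI-to-PCI bridge */
                uint8_t secondary = next_bus++;
                cfg_write8(bus, dev, func, PRIMARY_BUS_REG,   bus);
                cfg_write8(bus, dev, func, SECONDARY_BUS_REG, secondary);
                cfg_write8(bus, dev, func, SUBORDINATE_REG,   0xFF);   /* temporary */

                last_bus = scan_bus(secondary, next_bus); /* recurse downstream */
                next_bus = (uint8_t)(last_bus + 1);
                cfg_write8(bus, dev, func, SUBORDINATE_REG, last_bus); /* final value */
            }
        }
    }
    return last_bus;
}

/* Typically invoked as scan_bus(0, 1) once the Host/PCI bridge to bus 0 is known. */
```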

Enumerating a System With a Single Root Complex

Figure 21-2 on page 748 illustrates an example system before the buses and devices have been enumerated, while Figure 21-3 on page 749 shows the same system after the buses and devices have been enumerated. The discussion that follows assumes that the configuration software uses either of the two configuration mechanisms defined in the previous chapter. At startup time, the configuration software executing on the processor performs bus/device/function enumeration in the following manner:
  1. Starting with device 0 (bridge A), the enumeration software attempts to read the Vendor ID from function 0 in each of the 32 possible devices on bus 0.
  • If a valid (not FFFFh) Vendor ID is returned from bus 0, device 0 , function 0 , this indicates that the device is implemented and contains at least one function. Proceed to the next step.
  • If a value of FFFFh were returned as the Vendor ID, this would indicate that function 0 is not implemented in device 0 . Since it is a rule that the first function implemented in any device must be function 0 , this would mean that device was not implemented and the enumeration software would proceed to probe bus 0 , device 1 , function 0 .
  1. The Header Type field (see Figure 21-6 and Figure 21-7) in the Header register (see Figure 21-4) contains the value one (0000001b), indicating that this is a PCI-to-PCI bridge with the PCI-compatible register layout shown in Figure 21-7 on page 752. This discussion assumes that the Multifunction bit (bit 7) in the Header Type register is 0, indicating that function 0 is the only function in this bridge. It should be noted that the spec does not preclude implementing multiple functions within this bridge and each of these functions, in turn, could represent virtual PCI-to-PCI bridges.
  1. Software now performs a series of configuration writes to set the bridge's bus number registers as follows:
  • Primary Bus Number Register =0 .
  • Secondary Bus Number Register =1 .
  • Subordinate Bus Number Register =1 .
The bridge is now aware that the number of the bus directly attached to its downstream side is 1 (Secondary Bus Number =1 ) and the number of the bus farthest downstream of it is 1 (Subordinate Bus Number =1 ).
  1. Software updates the Host/PCI bridge's Subordinate Bus Number register to 1.
  1. The enumeration software reads bridge A's Capability Register (Figure 21-5 on page 750 and Table 21-1 on page 753; a detailed description of this register can be found in "PCI Express Capabilities Register" on page 898). The value 0100b in the register's Device/Port Type field indicates that this is a Root Port on the Root Complex.
  1. The specification states that the enumeration software must perform a depth-first search, so before proceeding to discover additional functions/ devices on bus 0 , it must proceed to search bus 1 .
  1. Software reads the Vendor ID of bus 1, device 0 , function 0 . A valid Vendor ID is returned, indicating that bus 1, device 0 , function 0 exists.
  1. The Header Type field in the Header register contains the value one (0000001b) indicating that this is a PCI-to-PCI bridge. In addition, bit 7 is a 0, indicating that bridge C is a single-function device.
  1. Bridge C's Capability Register contains the value 0101b in the Device/Port Type field indicating that this is the upstream Port on a switch.
  1. Software now performs a series of configuration writes to set bridge C's bus number registers as follows:
  • Primary Bus Number Register =1 .
  • Secondary Bus Number Register =2 .
  • Subordinate Bus Number Register =2 .
Bridge C is now aware that the number of the bus directly attached to its downstream side is 2 (Secondary Bus Number =2 ) and the number of the bus farthest downstream of it is 2 (Subordinate Bus Number =2 ).
  1. Software updates the Subordinate Bus Number registers in the Host/PCI bridge and in bridge A to 2.
  1. Continuing with its depth-first search, a read is performed from bus 2, device 0 , function 0 's Vendor ID register. The example assumes that bridge D is device 0, function 0 on bus 2.
  1. A valid Vendor ID is returned, indicating that bus 2, device 0 , function 0 exists.
  1. The Header Type field in the Header register contains the value one (0000001b) indicating that this is a PCI-to-PCI bridge. In addition, bit 7 is a 0 , indicating that bridge D is a single-function device.
  1. Bridge D's Capability Register contains the value 0110b in the Device/Port Type field indicating that this is the downstream Port on a switch.
  1. Software now performs a series of configuration writes to set bridge D's bus number registers as follows:
  • Primary Bus Number Register =2 .
  • Secondary Bus Number Register =3 .
  • Subordinate Bus Number Register =3 .
Bridge D is now aware that the number of the bus directly attached to its downstream side is 3 (Secondary Bus Number =3 ) and the number of the bus farthest downstream of it is 3 (Subordinate Bus Number =3 ).
  1. Software updates the Subordinate Bus Number registers in the Host/PCI bridge, bridge A, and bridge C to 3.
  1. Continuing with its depth-first search, a read is performed from bus 3, device 0 , function 0 's Vendor ID register.
  1. A valid Vendor ID is returned, indicating that bus 3, device 0 , function 0 exists.
  1. The Header Type field in the Header register contains the value zero (0000000b) indicating that this is an Endpoint device. In addition, bit 7 is a 1, indicating that this is a multifunction device.
  1. The device's Capability Register contains the value 0000b in the Device/ Port Type field indicating that this is an Endpoint device.
  1. The enumeration software performs accesses to the Vendor ID of functions
1-through-7 in bus 3, device 0 and determines that only function 1 exists in addition to function 0 .
  1. Having exhausted the current leg of the depth-first search, the enumeration software backs up one level (to bus 2) and moves on to read the Vendor ID of the next device (device 1). The example assumes that bridge E is device 1, function 0 on bus 2.
  1. A valid Vendor ID is returned, indicating that bus 2, device 1, function 0 exists.
  1. The Header Type field in bridge E's Header register contains the value one (0000001b) indicating that this is a PCI-to-PCI bridge. In addition, bit 7 is a 0 , indicating that bridge E is a single-function device.
  1. Bridge E's Capability Register contains the value 0110b in the Device/Port Type field indicating that this is the downstream Port on a switch.
  1. Software now performs a series of configuration writes to set bridge E's bus number registers as follows:
  • Primary Bus Number Register =2 .
  • Secondary Bus Number Register =4 .
  • Subordinate Bus Number Register =4 .
Bridge E is now aware that the number of the bus directly attached to its downstream side is 4 (Secondary Bus Number =4 ) and the number of the bus farthest downstream of it is 4 (Subordinate Bus Number =4 ).
  1. Software updates the Subordinate Bus Number registers in the Host/PCI bridge, bridge A, and bridge C to 4.
  1. Continuing with its depth-first search, a read is performed from bus 4, device 0 , function 0 's Vendor ID register.
  1. A valid Vendor ID is returned, indicating that bus 4, device 0 , function 0 exists.
  1. The Header Type field in the Header register contains the value zero (0000000b) indicating that this is an Endpoint device. In addition, bit 7 is a 0, indicating that this is a single-function device.
  1. The device's Capability Register contains the value 0000b in the Device/ Port Type field indicating that this is an Endpoint device.
  1. Having exhausted the current leg of the depth-first search, the enumeration software backs up one level (to bus 2) and moves on to read the Vendor ID of the next device (device 2). The example assumes that devices 2-through-31 are not implemented on bus 2, so no additional devices are discovered on bus 2.
  1. The enumeration software backs up to the bus within the Root Complex (bus 0) and moves on to read the Vendor ID of the next device (device 1). The example assumes that bridge B is device 1, function 0 on bus 0 .
  1. In the same manner as previously described, the enumeration software discovers bridge B and performs a series of configuration writes to set bridge
B's bus number registers as follows:
  • Primary Bus Number Register =0 .
  • Secondary Bus Number Register =5 .
  • Subordinate Bus Number Register =5 .
Bridge B is now aware that the number of the bus directly attached to its downstream side is 5 (Secondary Bus Number =5 ) and the number of the bus farthest downstream of it is 5 (Subordinate Bus Number =5 ).
  1. The Host/PCI bridge's Subordinate Bus Number register is updated to 5.
  1. Bridge F is then discovered and a series of configuration writes are per-
formed to set its bus number registers as follows:
  • Primary Bus Number Register =5 .
  • Secondary Bus Number Register =6 .
  • Subordinate Bus Number Register =6 .
Bridge F is now aware that the number of the bus directly attached to its downstream side is 6 (Secondary Bus Number =6 ) and the number of the bus farthest downstream of it is 6 (Subordinate Bus Number =6 ).
  1. The Host/PCI bridge's and bridge B's Subordinate Bus Number registers are updated to 6.
  1. Bridge G is then discovered and a series of configuration writes are performed to set its bus number registers as follows:
  • Primary Bus Number Register =6 .
  • Secondary Bus Number Register =7 .
  • Subordinate Bus Number Register =7 .
Bridge G is now aware that the number of the bus directly attached to its downstream side is 7 (Secondary Bus Number =7 ) and the number of the bus farthest downstream of it is 7 (Subordinate Bus Number =7 ).
  1. The Host/PCI bridge's Subordinate Bus Number register is updated to 7. Bridge B's and F's Subordinate Bus Number registers are also updated to 7.
  1. A single-function Endpoint device is discovered at bus 7, device 0 , function 0 .
  1. Bridge H is then discovered and a series of configuration writes are performed to set its bus number registers as follows:
  • Primary Bus Number Register =6 .
  • Secondary Bus Number Register =8 .
  • Subordinate Bus Number Register =8 .
Bridge H is now aware that the number of the bus directly attached to its downstream side is 8 (Secondary Bus Number =8 ) and the number of the bus farthest downstream of it is 8 (Subordinate Bus Number =8 ).
  1. The Host/PCI bridge's Subordinate Bus Number register is updated to 8. Bridge B's and F's Subordinate Bus Number registers are also updated to 8.
  1. Bridge J is discovered and its Capability register's Device/Port Type field identifies it as a PCI Express-to-PCI bridge.
  1. A series of configuration writes are performed to set bridge J's bus number registers as follows:
  • Primary Bus Number Register =8 .
  • Secondary Bus Number Register =9 .
  • Subordinate Bus Number Register =9 .
Bridge J is now aware that the number of the bus directly attached to its downstream side is 9 (Secondary Bus Number =9 ) and the number of the bus farthest downstream of it is 9 (Subordinate Bus Number =9 ).
  1. The Host/PCI bridge's Subordinate Bus Number register is updated to 9. Bridge B's, bridge F's, and bridge H's Subordinate Bus Number registers are also updated to 9.
  1. All devices and their respective functions on bus 9 are discovered and none of them are bridges.
  1. Bridge I is then discovered and a series of configuration writes are performed to set its bus number registers as follows:
  • Primary Bus Number Register =6 .
  • Secondary Bus Number Register =10 .
  • Subordinate Bus Number Register =10 .
Bridge I is now aware that the number of the bus directly attached to its downstream side is 10 (Secondary Bus Number =10 ) and the number of the bus farthest downstream of it is 10 (Subordinate Bus Number =10 ).
  1. The Host/PCI bridge's Subordinate Bus Number register is updated to 10. Bridge B's and bridge F's Subordinate Bus Number registers are also updated to 10 .
  1. A single-function Endpoint device is discovered at bus 10, device 0 , function 0 .
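To tie the preceding steps together, the following C sketch outlines the depth-first search in general terms. The cfg_read16/cfg_read8/cfg_write8 helpers are hypothetical stand-ins for whichever configuration access mechanism the platform provides (they are not defined by any specification), and the sketch uses the common shortcut of temporarily opening a bridge's Subordinate Bus Number register to FFh while its sub-tree is searched, rather than repeatedly re-writing every upstream bridge as the step-by-step walk above does.

```c
#include <stdint.h>

/* Hypothetical platform-provided configuration accessors (assumed). */
extern uint16_t cfg_read16(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off);
extern uint8_t  cfg_read8 (uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off);
extern void     cfg_write8(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off, uint8_t val);

#define VENDOR_ID       0x00
#define HEADER_TYPE     0x0E
#define PRIMARY_BUS     0x18
#define SECONDARY_BUS   0x19
#define SUBORDINATE_BUS 0x1A

/* Returns the highest bus number assigned beneath 'bus' (depth-first). */
static uint8_t enumerate_bus(uint8_t bus, uint8_t next_bus)
{
    for (uint8_t dev = 0; dev < 32; dev++) {
        if (cfg_read16(bus, dev, 0, VENDOR_ID) == 0xFFFF)
            continue;                        /* device not implemented     */

        uint8_t hdr = cfg_read8(bus, dev, 0, HEADER_TYPE);
        uint8_t fns = (hdr & 0x80) ? 8 : 1;  /* bit 7: multifunction device */

        for (uint8_t fn = 0; fn < fns; fn++) {
            if (cfg_read16(bus, dev, fn, VENDOR_ID) == 0xFFFF)
                continue;

            hdr = cfg_read8(bus, dev, fn, HEADER_TYPE);
            if ((hdr & 0x7F) != 0x01)        /* not a PCI-to-PCI bridge    */
                continue;

            uint8_t secondary = ++next_bus;
            cfg_write8(bus, dev, fn, PRIMARY_BUS,     bus);
            cfg_write8(bus, dev, fn, SECONDARY_BUS,   secondary);
            cfg_write8(bus, dev, fn, SUBORDINATE_BUS, 0xFF); /* open range */

            next_bus = enumerate_bus(secondary, next_bus);   /* depth-first */

            /* Close the range: highest bus found below this bridge.       */
            cfg_write8(bus, dev, fn, SUBORDINATE_BUS, next_bus);
        }
    }
    return next_bus;
}
```

Calling enumerate_bus(0, 0) from the Host/PCI bridge's secondary bus produces a numbering equivalent to the one walked through above.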
Figure 21-2: Example System Before Bus Enumeration
Figure 21-3: Example System After Bus Enumeration

Figure 21-4: Header Type Register
Figure 21-5: Capability Register
Figure 21-6: Header Type 0
Figure 21-7: Header Type 1
Table 21-1: Capability Register's Device/Port Type Field Encoding

| Value | Description |
|-------|-------------|
| 0000b | PCI Express Endpoint device. |
| 0001b | Legacy PCI Express Endpoint device. |
| 0100b | Root Port of PCI Express Root Complex. This value is only valid for devices/functions that implement a Type 01h PCI Configuration Space header. |
| 0101b | Upstream Port of PCI Express Switch. This value is only valid for devices/functions that implement a Type 01h PCI Configuration Space header. |
| 0110b | Downstream Port of PCI Express Switch. This value is only valid for devices/functions that implement a Type 01h PCI Configuration Space header. |
| 0111b | PCI Express-to-PCI/PCI-X Bridge. This value is only valid for devices/functions that implement a Type 01h PCI Configuration Space header. |
| 1000b | PCI/PCI-X to PCI Express Bridge. This value is only valid for devices/functions that implement a Type 01h PCI Configuration Space header. |

All other encodings are reserved.

Enumerating a System With Multiple Root Complexes

Refer to Figure 21-8 on page 757. In a system with multiple Root Complexes, each Root Complex:
  • Implements the Configuration Address Port and the Configuration Data Port at the same IO addresses (if it's an x86-based system).
  • Implements the Enhanced Configuration Mechanism.
  • Contains a Host/PCI bridge.
  • Implements the Bus Number and Subordinate Bus Number registers at separate addresses known to the configuration software.


In the example illustration, each Root Complex is a member of the chipset and one of them is designated as the bridge to bus 0 (let's call this the primary Root Complex) while the other one is designated as the bridge to bus 255 (bus FFh; let's call it the secondary Root Complex). The default Bus Number and Subordinate Bus Number register values at startup time are:
  • In the primary Root Complex, both the Bus Number and Subordinate Bus Number registers are set to 0 .
  • In the secondary Root Complex, both the Bus Number and Subordinate Bus Number registers are set to FFh (255d).

Operational Characteristics of the PCI-Compatible Mechanism

In order to prevent contention on the processor's FSB signals, only one of the bridges responds to the processor's accesses to the configuration ports:
  1. When the processor initiates the IO write to the Configuration Address Port, only one of the Host/PCI bridges (typically the one in the primary Root Complex) actively participates in the transaction. The other bridge quietly snarfs the data as it's written to the active participant.
  1. Both bridges then compare the target bus number to their respective Bus Number and Subordinate Bus Number registers. If the target bus doesn't reside behind a particular Host/PCI bridge, that bridge doesn't convert the subsequent access to its Configuration Data Port into a configuration access on its bus (in other words, it ignores the transaction).
  1. A subsequent read or write access to the Configuration Data Port is only accepted by the Host/PCI bridge that is the gateway to the target bus. This bridge responds to the processor's transaction and the other ignores it.
  1. When the access is made to the Configuration Data Port, the selected bridge tests the state of the Enable bit in its Configuration Address Port. If the Enable bit = 1, the bridge converts the processor's IO access into a configuration access (a sketch of the address-port format follows this list):
o If the target bus is the bus immediately on the other side of the Host/ PCI bridge, the bridge converts the access to a Type 0 configuration access on its secondary bus.
o Otherwise, it converts it into a Type 1 configuration access.
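As a rough illustration of the PCI-compatible mechanism, the C sketch below composes the Configuration Address Port dword (Enable bit in bit 31, bus in bits [23:16], device in bits [15:11], function in bits [10:8], dword number in bits [7:2]) and then reads the Data Port. The outl/inl port-I/O helpers are platform-specific assumptions, not part of the specification.

```c
#include <stdint.h>

#define CFG_ADDR_PORT 0x0CF8
#define CFG_DATA_PORT 0x0CFC

/* Hypothetical x86 port-I/O helpers (assumed to exist on the platform). */
extern void     outl(uint16_t port, uint32_t val);
extern uint32_t inl(uint16_t port);

/* Read a dword from PCI-compatible configuration space. Only the Host/PCI
 * bridge whose Bus Number/Subordinate Bus Number range covers 'bus'
 * converts the Data Port access into a Type 0 or Type 1 configuration
 * access; the other Root Complex ignores it.                              */
static uint32_t cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t dword)
{
    uint32_t addr = (1u << 31)                        /* Enable bit        */
                  | ((uint32_t)bus          << 16)    /* target Bus Number */
                  | ((uint32_t)(dev   & 0x1F) << 11)
                  | ((uint32_t)(fn    & 0x07) <<  8)
                  | ((uint32_t)(dword & 0x3F) <<  2);

    outl(CFG_ADDR_PORT, addr);
    return inl(CFG_DATA_PORT);
}
```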

Operational Characteristics of the Enhanced Configuration Mechanism

In order to prevent contention on the processor's FSB signals, only one of the bridges responds to the processor's accesses to the enhanced configuration memory-mapped IO space:
  1. When the processor initiates a memory-mapped IO access to a memory location within the enhanced configuration memory-mapped IO address range, the Host/PCI bridge in each Root Complex examines address bits A[27:20] to determine the target bus number (a small address-computation sketch follows this list).
  1. The bridge whose bus range (as defined by the contents of its Bus Number and Subordinate Bus Number registers) includes the target bus acts as the target of the processor's FSB transaction, while the other bridge does not actively participate in the transaction.
  1. The bridge with a bus compare converts the processor's memory access into a configuration access:
o If the target bus is the bus immediately on the other side of the Host/ PCI bridge, the bridge converts the access to a Type 0 configuration access on its secondary bus.
o Otherwise, it converts it into a Type 1 configuration access.
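The sketch below computes the memory-mapped address used by the enhanced mechanism. The text above only calls out A[27:20] as the bus number; the device and function fields are assumed here to occupy A[19:15] and A[14:12], following the standard enhanced-mechanism layout, and ECAM_BASE is an example value only (the real base address is platform-specific).

```c
#include <stdint.h>
#include <stdio.h>

/* Example base of the enhanced configuration memory-mapped IO range
 * (an assumption for illustration; the real value is platform-specific). */
#define ECAM_BASE 0xE0000000u

/* A[27:20] = bus, A[19:15] = device, A[14:12] = function,
 * A[11:0]  = register offset within the function's configuration space.  */
static uint64_t ecam_address(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off)
{
    return (uint64_t)ECAM_BASE
         | ((uint64_t)bus << 20)
         | ((uint64_t)(dev & 0x1F) << 15)
         | ((uint64_t)(fn  & 0x07) << 12)
         | (off & 0xFFF);
}

int main(void)
{
    /* Vendor ID (offset 0) of bus 4, device 0, function 0. */
    printf("0x%llx\n", (unsigned long long)ecam_address(4, 0, 0, 0x00));
    return 0;
}
```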

The Enumeration Process

Refer to Figure 21-8 on page 757. The process of enumerating the buses downstream of the primary Root Complex is identical to that described in "Enumerating a System With a Single Root Complex" on page 742. During the enumeration of the left-hand tree structure, the Host/PCI bridge in the secondary Root Complex ignored all of the memory-mapped IO configuration accesses because, in each case, the target bus number that was specified was less than bus 255. It should be noted that, although detected and numbered, bus 8 has no device attached.
Once that enumeration process has been completed, the enumeration software takes the following steps to enumerate the buses and devices downstream of the secondary Root Complex:
  1. The enumeration software changes both the Bus Number and Subordinate Bus Number register values in the secondary Root Complex's Host/PCI bridge to bus 11 (one greater than the highest-numbered bus beneath the primary Root Complex).
  1. The enumeration software then starts searching on bus 11 and discovers the PCI-to-PCI bridge attached to the downstream Root Port.
  1. A series of configuration writes are performed to set its bus number registers as follows:
  • Primary Bus Number Register =11 .
  • Secondary Bus Number Register =12 .
  • Subordinate Bus Number Register =12 .
The bridge is now aware that the number of the bus directly attached to its downstream side is 12 (Secondary Bus Number =12 ) and the number of the bus farthest downstream of it is 12 (Subordinate Bus Number =12 ).
  1. The secondary Root Complex's Host/PCI bridge's Subordinate Bus Number register is updated to 12.
  1. A single-function Endpoint device is discovered at bus 12, device 0 , function 0 .
  1. Enumeration continues on bus 11 and no additional devices are discovered. This completes the bus/device enumeration process.
Figure 21-8: Peer Root Complexes

A Multifunction Device Within a Root Complex or a Switch

A Multifunction Device Within a Root Complex

Refer to Figure 21-9 on page 759.
The spec is unclear on whether or not a P2P on the root bus within a Root Complex can be a multifunction device. It states that the rules regarding the implementation of the Header Register (see Figure 21-7 on page 752) in a PCI-to-PCI bridge within a Root Complex or a Switch are defined by the PCI 2.3 spec rather than the PCI Express spec. This being the case, it would be legal for function 0 in a bridge that resides on the internal bus of a Root Complex to have the Multifunction bit (bit 7) in the Header Register set to 1. This would indicate that up to seven additional functions could reside within this Root Complex device and each of them could be PCI-to-PCI bridges.
The only open issue in the authors' eyes is the contents of the Device/Port Type field in the Capability register (see Figure 21-5 on page 750 and Table 21 - 1 on page 753) of each of these bridge functions. It is assumed that it would have to be 0100b (i.e., Root Port of PCI Express Root Complex).
Figure 21-9: Multifunction Bridges in Root Complex

A Multifunction Device Within a Switch

Refer to Figure 21-10 on page 760 and Figure 21-11 on page 761. The spec doesn't preclude the inclusion of a multifunction device within a switch wherein each of the functions represents a PCI-to-PCI bridge to a downstream link.
In the first example the switch's internal bus implements two multifunction devices each of which contains four functions, and each function is the bridge to one of the switch's downstream ports. In the switch's upstream port bridge, the contents of the Device/Port Type field in the Capability register (see Figure 21-5 on page 750 and Table 21 - 1 on page 753) is 0101b (i.e., Upstream Port of PCI Express Switch).
In the second example, the bridge representing the switch's upstream port is device 0 on the link (i.e. bus) entering the switch and it is a multifunction device containing two functions each of which is a bridge to a separate internal switch bus. The contents of the Device/Port Type field in the Capability register (see Figure 21-5 on page 750 and Table 21-1 on page 753) of each of these functions is 0101b (i.e., Upstream Port of PCI Express Switch). Each of the internal buses has two multifunction devices attached, and each function is the bridge to one of the switch's downstream ports. The contents of the Device/Port Type field in the Capability register (see Figure 21-5 on page 750 and Table 21-1 on page 753) of each of these functions is 0110b (i.e., Downstream Port of PCI Express Switch).
Figure 21-10: First Example of a Multifunction Bridge In a Switch
Figure 21-11: Second Example of a Multifunction Bridge In a Switch

An Endpoint Embedded in a Switch or Root Complex

The spec contains the following two statements:
  • "Endpoint devices (represented by Type 00h Configuration Space headers) may not appear to configuration software on the switch's internal bus as peers of the virtual PCI-to-PCI Bridges representing the Switch Downstream Ports."
  • "Switch Downstream Ports are PCI-PCI Bridges bridging from the internal bus to buses representing the Downstream PCI Express Links from a PCI Express Switch. Only the PCI-PCI Bridges representing the Switch Downstream Ports may appear on the internal bus. Endpoints, represented by Type 0 configuration space headers, may not appear on the internal bus."
Nothing in this text forbids the implementation of an Endpoint device within a switch. In addition, nothing in the spec forbids the implementation of an Endpoint device within a Root Complex. Figure 21-12 on page 762 and Figure 21-13 on page 763 illustrate examples of these design cases.
Figure 21-12: Embedded Root Endpoint


Figure 21-13: Embedded Switch Endpoint

Memorize Your Identity

General

Whenever a function initiates a transaction as a Requester, it must supply its Requester ID in the packet's header. Likewise, when a function returns a Completion as a Completer, it must supply its Completer ID in the packet header. Both of these IDs are composed of the Bus Number, Device Number, and Function Number.
The function "knows" which function number it is within its device, but how does it know the device it resides within and what bus it resides on?


Each time that the source bridge for a bus initiates a type 0 configuration write transaction (see Figure 21-14 on page 764), it supplies the targeted function with the bus number from its Secondary Bus Number register and the number of the device that the function resides within. The function is required to save this information for use in forming its IDs when it initiates a transaction as either a Requester or as a Completer. The information is not saved in program readable registers but rather in a function-specific manner.
A hot-plug event such as the installation or the removal of a device can cause the enumeration software to re-assign bus numbers in a portion of the bus hierarchy. If and when this should occur, the enumeration software is required to perform a configuration write to at least one register (any register) within each function in each device that resides on a bus that has received a new number. In this manner, each function on the bus is provided with the new bus number (as well as its device number) to be used in their respective IDs.
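A short sketch of how the captured values are combined into the 16-bit ID follows. The field positions (Bus Number in bits [15:8], Device Number in bits [7:3], Function Number in bits [2:0]) follow the standard Requester/Completer ID layout; the helper name is illustrative only.

```c
#include <stdint.h>

/* Compose the 16-bit Requester/Completer ID a function places in a TLP
 * header. The bus and device numbers are the ones the function captured
 * from the most recent Type 0 configuration write it received.           */
static uint16_t make_id(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return ((uint16_t)bus << 8)            /* Bus Number, bits [15:8]      */
         | ((uint16_t)(dev & 0x1F) << 3)   /* Device Number, bits [7:3]    */
         | (fn & 0x07);                    /* Function Number, bits [2:0]  */
}
```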
Figure 21-14: Type 0 Configuration Write Request Packet Header

Root Complex Bus Number/Device Number Assignment

The manner in which the bus number and device number are assigned to functions residing on the internal bus of a Root Complex is design-specific.

Initiating Requests Prior To ID Assignment

Before the first type 0 configuration write is performed to a function, it does not know the bus number and device number portion of its ID. While it remains in this state, a function is not permitted to initiate non-posted requests (requests, such as memory or IO reads, that require Completions to be routed back to the Requester). There is one exception:


  • Functions within a Root Complex are permitted to initiate requests for accesses to system boot device(s).

Initiating Completions Prior to ID Assignment

If a function must generate a Completion before the first type 0 configuration write is performed to it, the Bus Number and Device Number fields in its Completer ID must be set to zeros. The Request issuer (the Requester) must ignore the value returned in the Completer ID field.

Root Complex Register Blocks (RCRBs)

What Problem Does an RCRB Address?

Refer to Figure 21-15 on page 766. Main memory is an extremely popular target. In addition to the processor, it is accessed by the graphics controller and is frequently accessed by PCI, PCI-X and PCI Express devices. This being the case, at a given moment in time, the Root Complex may simultaneously receive memory access requests through multiple ingress ports:
  • The FSB interface.
  • The graphics link.
  • One or more Root Ports.
It should be obvious that a traffic director (labeled "Port/VC Arbitration") must be implemented within the Root Complex. In its simplest form, the Root Complex may implement a hardwired traffic director.
In addition to simply handling multiple simultaneous requests, the traffic director may also have to deal with QoS issues. Some of the memory access requesters may require faster access than others. The Root Complex Register Block would be used to program the memory controller's egress port logic regarding TC-to-VC mapping and the VC arbitration algorithm.

Additional Information on RCRBs

Additional information on RCRBs may be found in "RCRB" on page 957.
Figure 21-15: RCRB Example

Miscellaneous Rules

A Split Configuration Transaction Requires a Single Completion

If a configuration transaction is split, the Requester expects to receive one and only one Completion transaction containing the dword or less of requested read data.

An Issue For PCI Express-to-PCI or -PCI-X Bridges

If such a bridge receives a configuration request targeting a PCI or PCI-X function and the Extended Register Address field is non-zero, the bridge must return a UR (Unsupported Request) status to the Requester.

PCI Special Cycle Transactions

If software must cause a PCI Special Cycle transaction to be generated on a PCI or PCI-X bus, it takes the following actions.
To prime the Host/PCI bridge to generate a PCI Special Cycle transaction, software must write a 32-bit value with the following content to the Configuration Address Port at IO address 0CF8h:
  • Bus Number = the target PCI Bus that the Special Cycle transaction is to be performed on.
  • Device Number = all ones (31d, or 1Fh).
  • Function Number = all ones (7d).
  • Dword Number = all zeros.
After this has been accomplished, the next write to the Configuration Data Port at IO port 0CFCh causes the Host/PCI bridge to pass the transaction through as a Type 1 configuration write (so that it can be submitted to PCI-to-PCI bridges farther out in the hierarchy). The Type 1 configuration write request will flow unchanged through all of the bridges in the path to the target PCI/PCI-X bus until it finally arrives at the destination PCI Express-to-PCI/PCI-X bridge. This bridge converts the request into a PCI Special Cycle transaction and the data written to the Host/PCI bridge's Configuration Data Port is supplied as the message in the Data Phase of the resultant PCI or PCI-X transaction.
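The following sketch combines the two steps just described: prime the Configuration Address Port with the all-ones Device and Function Numbers, then write the message data to the Configuration Data Port. The outl helper is a platform-specific assumption, as in the earlier configuration-access sketch.

```c
#include <stdint.h>

/* Hypothetical x86 port-I/O helper (assumed to exist on the platform). */
extern void outl(uint16_t port, uint32_t val);

/* Prime the Host/PCI bridge and issue the Special Cycle message.
 * Device Number = all ones (1Fh), Function Number = all ones (7),
 * Dword Number = 0; the next write to the Data Port carries the message. */
static void pci_special_cycle(uint8_t target_bus, uint32_t message)
{
    uint32_t addr = (1u << 31)                 /* Enable bit               */
                  | ((uint32_t)target_bus << 16)
                  | (0x1Fu << 11)              /* device = all ones        */
                  | (0x07u <<  8)              /* function = all ones      */
                  | (0x00u <<  2);             /* dword = 0                */

    outl(0x0CF8, addr);
    outl(0x0CFC, message);  /* forwarded as a Type 1 config write and
                               converted to a Special Cycle at the
                               destination PCI Express-to-PCI/PCI-X bridge */
}
```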

22 PCI Compatible Configuration Registers

The Previous Chapter

The previous chapter provided a detailed description of the discovery process and bus numbering. It described:
  • Enumerating a system with a single Root Complex
  • Enumerating a system with multiple Root Complexes
  • A multifunction device within a Root Complex or a Switch
  • An Endpoint embedded in a Switch or Root Complex
  • Automatic Requester ID assignment.
  • Root Complex Register Blocks (RCRBs)

This Chapter

This chapter provides a detailed description of the configuration registers residing in a function's PCI-compatible configuration space. This includes the registers for both non-bridge and bridge functions.

The Next Chapter

The next chapter provides a detailed description of device ROMs associated with PCI, PCI Express, and PCI-X functions. This includes the following topics:

  • device ROM detection.
  • internal code/data format.
  • shadowing.


  • initialization code execution.
  • interrupt hooking.

Header Type 0

General

Figure 22-1 on page 771 illustrates the format of a function's Header region (for functions other than PCI-to-PCI bridges and CardBus bridges). The registers marked in black are always mandatory. Note that although many of the configuration registers in the figure are not marked mandatory, a register may be mandatory for a particular type of device. The subsequent sections define each register and any circumstances wherein it may be mandatory.
As noted earlier, this format is defined as Header Type 0 . The registers within the Header are used to identify the device, to control its functionality and to sense its status in a generic manner. The usage of the device's remaining 48 dwords of PCI-compatible configuration space is intended for device-specific registers, but, with the advent of the 2.2 PCI spec, is also used as an overflow area for some new registers defined in the PCI spec (for more information, refer to "Capabilities Pointer Register" on page 779).


Figure 22-1: Header Type 0

Header Type 0 Registers Compatible With PCI

The Header Type 0 PCI configuration registers that are implemented and used identically in PCI and PCI Express are:
  • Vendor ID register.
  • Device ID register.
  • Revision ID register.
  • Class Code register.
  • Subsystem Vendor ID register.
  • Subsystem ID register.
  • Header Type register.
  • BIST register.
  • Capabilities Pointer register.
  • CardBus CIS Pointer register.
  • Expansion ROM Base Address register.
The sections that follow provide a description of each of these registers.

Header Type 0 Registers Incompatible With PCI

In a non-bridge PCI Express function, the definitions of the following configuration registers in the function's PCI-compatible configuration space differ from their definitions in the PCI spec:
  • Command Register
  • Status Register
  • Cache Line Size Register
  • Master Latency Timer Register
  • Interrupt Line Register
  • Interrupt Pin Register
  • Base Address Registers
  • Min_Gnt/Max_Lat Registers
The sections that follow define the implementation/usage differences of these registers. For a full description of their implementation in a PCI function, refer to the MindShare book entitled PCI System Architecture, Fourth Edition (published by Addison-Wesley). For a full description of their implementation in a PCI-X function, refer to the MindShare book entitled PCI-X System Architecture, First Edition (published by Addison-Wesley).


Registers Used to Identify Device's Driver

The OS uses some combination of the following mandatory registers to determine which driver to load for a device:
  • Vendor ID.
  • Device ID.
  • Revision ID.
  • Class Code.
  • SubSystem Vendor ID.
  • SubSystem ID.

Vendor ID Register

PCI-Compatible register. Always mandatory. This 16-bit register identifies the manufacturer of the function. The value hardwired in this read-only register is assigned by a central authority (the PCI SIG) that controls issuance of the numbers. The value FFFFh is reserved and must be returned by the Host/PCI bridge when an attempt is made to perform a configuration read from a configuration register in a non-existent function. In PCI or PCI-X, the read attempt results in a Master Abort, while in PCI Express it results in the return of UR (Unsupported Request) completion status. In either case, the bridge must return a Vendor ID of FFFFh. This outcome is not considered an error, but the specification says that the bridge must nonetheless set the Received Master Abort bit in its configuration Status register.

Device ID Register

PCI-Compatible register. Always mandatory. This 16-bit value is assigned by the function manufacturer and identifies the type of function. In conjunction with the Vendor ID and possibly the Revision ID, the Device ID can be used to locate a function-specific (and perhaps revision-specific) driver for the function.

Revision ID Register

PCI-Compatible register. Always mandatory. This 8-bit value is assigned by the function manufacturer and identifies the revision number of the function. If the vendor has supplied a revision-specific driver, this is handy in ensuring that the correct driver is loaded by the OS.


Class Code Register

General. PCI-Compatible register. Always mandatory. The Class Code register is pictured in Figure 22-2 on page 775. It is a 24-bit, read-only register divided into three fields: base Class, Sub Class, and Programming Interface. It identifies the basic function of the function (e.g., a mass storage controller), a more specific function sub-class (e.g., IDE mass storage controller), and, in some cases, a register-specific programming interface (such as a specific flavor of the IDE register set).
  • The upper byte defines the base Class of the function,
  • the middle byte defines a sub-class within the base Class,
  • and the lower byte defines the Programming Interface.
The currently-defined base Class codes are listed in Table 22-1 on page 775. Table 2 on page 1020 through Table 19 on page 1031 define the Subclasses within each base Class. For many Class/SubClass categories, the Programming Interface byte is hardwired to return zeros (in other words, it has no meaning). For some, such as VGA-compatible functions and IDE controllers, it does have meaning.
This register is useful when the OS is attempting to locate a function that a Class driver can work with. As an example, assume that a particular device driver has been written to work with any display adapter that is 100% XGA register set-compatible. If the OS can locate a function with a Class of 03h (see Table 22-1 on page 775) and a Sub Class of 01h (see Table 5 on page 1022), the driver will work with that function. A Class driver is more flexible than a driver that has been written to work only with a specific function from a specific vendor.
The Programming Interface Byte. For some functions (such as the XGA display adapter used as an example in the previous section) the combination of the Class Code and Sub Class Code is sufficient to fully-define its level of register set compatibility. The register set layout for some function types, however, can vary from one implementation to another. As an example, from a programming interface perspective there are a number of flavors of IDE mass storage controllers, so it's not sufficient to identify yourself as an IDE mass storage controller. The Programming Interface byte value (see Table 20 on page 1031) provides the final level of granularity that identifies the exact register set layout of the function.
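Since the register simply packs three read-only byte fields into 24 bits, decoding it is straightforward. The sketch below is illustrative only; the example value follows the XGA display controller case (base Class 03h, Sub Class 01h) used in the text.

```c
#include <stdint.h>
#include <stdio.h>

/* Split the 24-bit Class Code register into its three fields. */
static void decode_class_code(uint32_t class_code)
{
    uint8_t base_class = (class_code >> 16) & 0xFF;  /* upper byte  */
    uint8_t sub_class  = (class_code >>  8) & 0xFF;  /* middle byte */
    uint8_t prog_if    =  class_code        & 0xFF;  /* lower byte  */

    printf("Base Class %02Xh, Sub Class %02Xh, Programming Interface %02Xh\n",
           base_class, sub_class, prog_if);
}

int main(void)
{
    decode_class_code(0x030100);  /* 03h/01h: display controller, XGA sub-class */
    return 0;
}
```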


Detailed Class Code Description. A detailed description of the currently-defined Classes, SubClasses, and Programming Interface Byte values can be found in Appendix D.
Figure 22-2: Class Code Register
Table 22-1: Defined Class Codes

| Class | Description |
|-------|-------------|
| 00h | Function built before class codes were defined (in other words, before rev 2.0 of the PCI spec). |
| 01h | Mass storage controller. |
| 02h | Network controller. |
| 03h | Display controller. |
| 04h | Multimedia device. |
| 05h | Memory controller. |
| 06h | Bridge device. |
| 07h | Simple communications controllers. |
| 08h | Base system peripherals. |
| 09h | Input devices. |
| 0Ah | Docking stations. |
| 0Bh | Processors. |
| 0Ch | Serial bus controllers. |
| 0Dh | Wireless controllers. |
| 0Eh | Intelligent IO controllers. |
| 0Fh | Satellite communications controllers. |
| 10h | Encryption/Decryption controllers. |
| 11h | Data acquisition and signal processing controllers. |
| 12h-FEh | Reserved. |
| FFh | Device does not fit any of the defined class codes. |

Subsystem Vendor ID and Subsystem ID Registers

General. PCI-Compatible register. Mandatory. This register pair was added in revision 2.1 of the PCI spec and was optional. The 2.2 PCI spec and the PCI-X spec state that they are mandatory except for those functions that have a base Class of 06h (a Bridge) with a Sub Class of 00h-04h (refer to Table 8 on page 1023), or a base Class of 08h (Base System Peripherals) with a Sub Class of 00h-03h (see Table 10 on page 1026). This excludes bridges of the following types:
  • Host/PCI
  • PCI-to-EISA
  • PCI-to-ISA
  • PCI-to-Micro Channel
  • PCI-to-PCI
It also excludes the following generic system peripherals:
  • Interrupt Controller
  • DMA Controller
  • Programmable Timers
  • RTC Controller
The Subsystem Vendor ID is obtained from the SIG, while the vendor supplies its own Subsystem ID (the full name of this register is really "Subsystem Device ID", but the "device" is silent). A value of zero in these registers indicates there isn't a Subsystem Vendor and Subsystem ID associated with the function.
The Problem Solved by This Register Pair. A function may reside on a card or within an embedded device. Functions designed around the same
PCI/PCI-X, or PCI Express core logic (produced by a third-party) may have the same Vendor and Device IDs (if the core logic vendor hardwired their own IDs into these registers). If this is the case, the OS would have a problem identifying the correct driver to load into memory for the function.
These two mandatory registers (Subsystem Vendor ID and Subsystem ID) are used to uniquely identify the add-in card or subsystem that the function resides within. Using these two registers, the OS can distinguish the difference between cards or subsystems manufactured by different vendors but designed around the same third-party core logic. This permits the Plug-and-Play OS to locate the correct driver to load into memory.
Must Contain Valid Data When First Accessed. These two registers must contain their assigned values before the system first accesses them. If software attempts to access them before they have been initialized, the device must issue:
  • a Retry to the master (in PCI).
  • a Completion with CRS (Configuration Request Retry Status) in PCI Express.
The values in these registers could be hardwired, loaded from a serial EEPROM, determined from hardware strapping pins, etc.

Header Type Register

PCI-Compatible register. Always mandatory. Figure 22-3 on page 778 illustrates the format of the Header Type register. Bits [6:0] of this one-byte register define the format of dwords 4-through-15 of the function's configuration Header (see Figure 22-1 on page 771 and Figure 22-13 on page 803). In addition, bit seven defines the device as a single-function (bit 7 = 0) or multifunction (bit 7 = 1) device. During configuration, the programmer determines if there are any other functions in this device that require configuration by testing the state of bit seven.
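A minimal sketch of the two tests just described (the register offset and helper names are illustrative assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

/* The Header Type register is the byte at offset 0Eh of the configuration
 * Header: bits [6:0] select the Header format (0 = Type 0, 1 = PCI-to-PCI
 * bridge, 2 = CardBus bridge) and bit 7 flags a multifunction device.     */
#define HDR_TYPE_OFFSET 0x0E

static bool is_multifunction(uint8_t header_type)
{
    return (header_type & 0x80) != 0;      /* bit 7 = 1: multifunction     */
}

static uint8_t header_format(uint8_t header_type)
{
    return header_type & 0x7F;             /* bits [6:0]: header layout    */
}
```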
Currently, the only Header formats defined other than that pictured in Figure 22-1 on page 771 (Header Type Zero) are:
  • Header Type One (PCI-to-PCI bridge Header format; description can be found in "Header Type 1" on page 802).
  • and Header Type Two (CardBus bridge; detail can be found in the PC Card specification and in the MindShare book entitled CardBus System Architecture, published by Addison-Wesley).
Future versions of the specification may define other formats.
Figure 22-3: Header Type Register Bit Assignment

BIST Register

PCI-Compatible register. Optional. This register may be implemented by both Requester and Completer functions. If a function implements a Built-In Self-Test (BIST), it must implement this register as illustrated in Figure 22-4 on page 778. Table 22-2 on page 779 describes each bit's function. If the function doesn't support a BIST, this register must return zeros when read. The function's BIST is invoked by setting bit six to one. The function resets bit six upon completion of the BIST. Configuration software must fail the function if it doesn't reset bit six within two seconds. At the conclusion of the BIST, the test result is indicated in the lower four bits of the register. A completion code of zero indicates successful completion. A non-zero value represents a function-specific error code.
The time limit of two seconds may not be sufficient time to test a very complex function or one with an extremely large buffer that needs to be tested. In that case, the remainder of the test could be completed in the initialization portion of the function's device driver when the OS loads it into memory and calls it.
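The invocation procedure described above (set bit six, poll until the function clears it, enforce the two-second limit, then read the completion code) can be sketched as follows. The cfg_read8/cfg_write8 and sleep_ms helpers are platform-specific assumptions.

```c
#include <stdint.h>

/* Hypothetical configuration accessors and delay helper (assumed). */
extern uint8_t cfg_read8 (uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off);
extern void    cfg_write8(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off, uint8_t val);
extern void    sleep_ms(unsigned ms);

#define BIST_OFFSET   0x0F
#define BIST_CAPABLE  0x80   /* bit 7 */
#define BIST_START    0x40   /* bit 6 */

/* Returns the 4-bit completion code (0 = pass), or -1 if the function is
 * not BIST-capable or does not finish within the two-second limit.        */
static int run_bist(uint8_t bus, uint8_t dev, uint8_t fn)
{
    uint8_t bist = cfg_read8(bus, dev, fn, BIST_OFFSET);
    if (!(bist & BIST_CAPABLE))
        return -1;

    cfg_write8(bus, dev, fn, BIST_OFFSET, bist | BIST_START);

    for (int waited = 0; waited < 2000; waited += 10) {   /* 2-second limit */
        bist = cfg_read8(bus, dev, fn, BIST_OFFSET);
        if (!(bist & BIST_START))
            return bist & 0x0F;            /* completion code               */
        sleep_ms(10);
    }
    return -1;                             /* fail the function             */
}
```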
Figure 22-4: BIST Register Bit Assignment
Table 22-2: BIST Register Bit Assignment

| Bit | Function |
|-----|----------|
| 3:0 | Completion Code. A value of zero indicates successful completion, while a non-zero result indicates a function-specific error. |
| 5:4 | Reserved. |
| 6 | Start BIST. Writing a one into this bit starts the function's BIST. The function resets this bit automatically upon completion. Software should fail the function if the BIST does not complete within two seconds. |
| 7 | BIST Capable. Should return a one if the function implements a BIST, a zero if it doesn't. |

Capabilities Pointer Register

PCI-Compatible register. Optional for a PCI function. Mandatory for a PCI-X or PCI Express function.

Configuration Header Space Not Large Enough

The 2.1 PCI spec defined the first 16 dwords of a function's PCI-compatible configuration space as its configuration Header space. It was originally intended that all of the function's PCI spec-defined configuration registers would reside within this region and that all of its function-specific configuration registers would reside within the lower 48 dwords of its PCI-compatible configuration space. Unfortunately, they ran out of space when defining new configuration registers in the 2.2 PCI spec. For this reason, the 2.2 and 2.3 PCI specs permit some spec-defined registers to be implemented in the lower 48 dwords of a function's PCI-compatible configuration space.

Discovering That Capabilities Exist

If the Capabilities List bit in the Status register (see Figure 22-5 on page 780) is set to one, the function implements the Capabilities Pointer register in byte zero of dword 13 in its PCI-compatible configuration space (see Figure 22-1 on page 771). This implies that the pointer contains the dword-aligned start address of the Capabilities List within the function's lower 48 dwords of PCI-compatible configuration space. It is a rule that the two least-significant bits must be hardwired to zero and must be ignored (i.e., masked) by software when reading the register. The upper six bits represent the upper six bits of the 8-bit, dword-aligned start address of the new registers implemented in the lower 48 dwords of the function's PCI-compatible space. The two least-significant bits are assumed to be zero.
Figure 22-5: Status Register

What the Capabilities List Looks Like

The configuration location pointed to by the Capabilities Pointer register is the first entry in a linked series of one or more configuration register sets, each of which supports a feature. Each entry has the general format illustrated in Figure 22-6 on page 782. The first byte is referred to as the Capability ID (assigned by the PCI SIG) and identifies the feature associated with this register set (e.g., 2= AGP), while the second byte either points to another feature's register set, or indicates that there are no additional register sets (with a pointer value of zero) associated with this function. In either case, the least-significant two bits must return zero. If a pointer to the next feature's register set is present in the second byte, it points to a dword within the function's lower 48 dwords of PCI-compatible configuration space (it can point either forward or backward in the function's configuration space). The respective feature's register set always immediately follows the first two bytes of the entry, and its length and format are defined by what type of feature it is. The Capabilities currently defined in the 2.3 PCI spec are those listed in Table 22-3 on page 781.
Table 22-3: Currently-Assigned Capability IDs

| ID | Description |
|----|-------------|
| 00h | Reserved. |
| 01h | PCI Power Management Interface. Refer to "The PM Capability Register Set" on page 585. |
| 02h | AGP. Refer to "AGP Capability" on page 845. Also refer to the MindShare book entitled AGP System Architecture, Second Edition (published by Addison-Wesley). |
| 03h | VPD. Refer to "Vital Product Data (VPD) Capability" on page 848. |
| 04h | Slot Identification. This capability identifies a bridge that provides external expansion capabilities (i.e., an expansion chassis containing add-in card slots). Full documentation of this feature can be found in the revision 1.1 PCI-to-PCI Bridge Architecture Specification. For a detailed, Express-oriented description, refer to "Introduction To Chassis/Slot Numbering Registers" on page 859 and "Chassis and Slot Number Assignment" on page 861. |
| 05h | Message Signaled Interrupts. Refer to "The MSI Capability Register Set" on page 332. |
| 06h | CompactPCI Hot Swap. Refer to the chapter entitled Compact PCI and PMC in the MindShare book entitled PCI System Architecture, Fourth Edition (published by Addison-Wesley). |
| 07h | PCI-X device. For a detailed description, refer to the MindShare book entitled PCI-X System Architecture (published by Addison-Wesley). |
| 08h | Reserved for AMD. |
| 09h | Vendor Specific capability register set. The layout of the register set is vendor specific, except that the byte immediately following the "Next" pointer indicates the number of bytes in the capability structure (including the ID and Next pointer bytes). An example vendor-specific usage is a function that is configured in the final manufacturing steps as either a 32-bit or 64-bit PCI agent; the Vendor Specific capability structure tells the device driver which features the device supports. |
| 0Ah | Debug port. |
| 0Bh | CompactPCI central resource control. A full definition of this capability can be found in the PICMG 2.13 Specification (http://www.picmg.com). |
| 0Ch | PCI Hot-Plug. This ID indicates that the associated device conforms to the Standard Hot-Plug Controller model. |
| 0Dh-0Fh | Reserved. |
| 10h | PCI Express Capability register set (aka PCI Express Capability Structure). For a detailed explanation, refer to "PCI Express Capability Register Set" on page 896. |
| 11h-FFh | Reserved. |
Figure 22-6: General Format of a New Capabilities List Entry
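The list structure just described (a Capability ID byte, a "Next" pointer byte, then the feature's registers) lends itself to a simple traversal loop. A rough sketch follows; the cfg_read8/cfg_read16 helpers are platform-specific assumptions, and the masking of the two least-significant pointer bits follows the rule stated earlier.

```c
#include <stdint.h>

/* Hypothetical configuration accessors (assumed). */
extern uint8_t  cfg_read8 (uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off);
extern uint16_t cfg_read16(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off);

#define STATUS_REG      0x06
#define STATUS_CAP_LIST (1u << 4)
#define CAP_POINTER     0x34   /* byte 0 of dword 13 */

/* Walk the Capabilities List looking for 'wanted_id' (e.g., 10h for the
 * PCI Express Capability register set). Returns its offset, or 0 if the
 * capability is not present.                                              */
static uint8_t find_capability(uint8_t bus, uint8_t dev, uint8_t fn,
                               uint8_t wanted_id)
{
    if (!(cfg_read16(bus, dev, fn, STATUS_REG) & STATUS_CAP_LIST))
        return 0;                          /* no Capabilities List          */

    uint8_t ptr = cfg_read8(bus, dev, fn, CAP_POINTER) & 0xFC; /* mask bits 1:0 */
    while (ptr != 0) {
        uint8_t id   = cfg_read8(bus, dev, fn, ptr);      /* Capability ID  */
        uint8_t next = cfg_read8(bus, dev, fn, ptr + 1);  /* Next pointer   */
        if (id == wanted_id)
            return ptr;
        ptr = next & 0xFC;
    }
    return 0;
}
```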

CardBus CIS Pointer Register

PCI-Compatible register. Optional. This optional register is implemented by functions that share silicon between a CardBus device and a PCI or PCI Express function. This field points to the Card Information Structure (CIS) on the CardBus card. The register is read-only and indicates that the CIS can be accessed from the indicated offset within one of the following address spaces:
  • Offset within the function's function-specific PCI-compatible configuration space (after dword 15d in the function’s PCI-compatible configuration space).
  • Offset from the start address indicated in one of the function's Memory Base Address Registers (see Figure 22-10 on page 796 and Figure 22-11 on page 797).
  • Offset within a code image in the function's expansion ROM (see "Expansion ROM Base Address Register" on page 783 and "Expansion ROMs" on page 871).


The format of the CardBus CIS Pointer register is defined in the revision 3.0 PC Card specification. A detailed description of the CIS can be found in the MindShare architecture series book entitled CardBus System Architecture (published by Addison-Wesley).

Expansion ROM Base Address Register

PCI-Compatible register. Required if a function incorporates a device ROM. Many PCI functions incorporate a device ROM (the spec refers to it as an expansion ROM) that contains a device driver for the function. The expansion ROM start memory address and size is specified in the Expansion ROM Base Address Register at configuration dword 12d in the configuration Header region. As described in the section entitled "Base Address Registers" on page 792, on power-up the system must be automatically configured so that each function's IO and memory decoders recognize mutually-exclusive address ranges. The configuration software must be able to detect how much memory space an expansion ROM requires. In addition, the system must have the capability of programming a ROM's address decoder in order to locate its ROM in a non-conflicting address range.
When the start-up configuration program detects that a function has an Expansion ROM Base Address Register implemented (by writing all ones to it and reading it back), it must then check the first two locations in the ROM for an Expansion ROM signature to determine if a ROM is actually installed (i.e., there may be an empty ROM socket). If installed, the configuration program must shadow the ROM and execute its initialization code. This process is described in "Expansion ROMs" on page 871.
The format of the expansion ROM Base Address Register is illustrated in Figure 22-7 on page 785:
  • A one in bit zero enables the function's ROM address decoder (assuming that the Memory Space bit in the Command register is also set to one).
  • Bits [10:1] are reserved.
  • Bits [31:11] are used to specify the ROM's start address (starting on an address divisible by the ROM's size).
As an example, assume that the programmer writes FFFFFFFEh to the ROM's Base Address Register (bit 0, the Expansion ROM Enable bit, is cleared so as not to enable the ROM address decoder until a start memory address has been assigned). A subsequent read from the register in the example yields FFFE0000h. This indicates the following:
  • Bit 0 is a zero, indicating that the ROM address decoder is currently disabled.
  • Bits [10:1] are reserved.
  • In the Base Address field (bits [31:11]), bit 17 is the least-significant bit that the programmer was able to set to one. It has a binary-weighted value of 128K, indicating that the ROM decoder requires 128KB of memory space be assigned to the ROM. The programmer then writes a 32-bit start address into the register to assign the ROM start address on a 128K address boundary.
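The size determination in this example can be expressed as a small calculation: mask off the reserved and enable bits, then take the two's complement of the remaining base-address field. This sketch reproduces the 128KB result from the read-back value above.

```c
#include <stdint.h>
#include <stdio.h>

/* Size an Expansion ROM Base Address Register from the value read back
 * after writing FFFFFFFEh. Bits [31:11] hold the base address, bit 0 is
 * the Expansion ROM Enable bit, and bits [10:1] are reserved.            */
static uint32_t expansion_rom_size(uint32_t readback)
{
    uint32_t base_field = readback & 0xFFFFF800u;  /* keep bits [31:11]       */
    return ~base_field + 1u;                       /* two's complement = size */
}

int main(void)
{
    printf("%u KB\n", expansion_rom_size(0xFFFE0000u) / 1024);  /* prints 128 KB */
    return 0;
}
```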
The PCI 2.3 spec recommends that the designer of the Expansion ROM Base Address Register should request a memory block slightly larger than that required by the current revision ROM to be installed. This permits the installation of subsequent ROM revisions that occupy more space without requiring a redesign of the logic associated with the function's Expansion ROM Base Address Register. The spec sets a limit of 16MB as the maximum expansion ROM size.
The Memory Space bit in the Command register has precedence over the Expansion ROM Enable bit. The function's expansion ROM should respond to memory accesses only if both its Memory Space bit (in its Command register) and the Expansion ROM Enable bit (in its expansion ROM Base Address register) are both set to one.
In order to minimize the number of address decoders that a function must implement, one address decoder can be shared between the Expansion ROM Base Address Register and one of the function's Memory Base Address Registers. The two Base Address Registers must be able to hold different values at the same time, but the address decoder will not decode ROM accesses unless the Expansion ROM Enable bit is set in the Expansion ROM Base Address Register.
A more detailed description of expansion ROM detection, shadowing and usage can be found in "Expansion ROMs" on page 871.


Figure 22-7: Expansion ROM Base Address Register Bit Assignment

Command Register

Differs from the PCI spec. Mandatory. Refer to Figure 22-8 on page 785. Table 22-4 on page 786 provides a description of each bit in the Command register of a non-bridge function (i.e., one with a Type 0 Header format).
Figure 22-8: Command Register
Table 22-4: Command Register

| Bit | Type | Description |
|-----|------|-------------|
| 0 | RW | IO Address Space Decoder Enable. Endpoints: 0 = the IO decoder is disabled and IO transactions targeting this device return a completion with Unsupported Request status. 1 = the IO decoder is enabled and IO transactions targeting this device are accepted. |
| 1 | RW | Memory Address Space Decoder Enable. Endpoints and memory-mapped devices within a Switch: 0 = the Memory decoder is disabled and Memory transactions targeting this device return a completion with Unsupported Request status. 1 = the Memory decoder is enabled and Memory transactions targeting this device are accepted. |
| 2 | RW | Bus Master Enable. Endpoints: 0 = disables an Endpoint function from issuing memory or IO requests and also disables the ability to generate MSI messages. 1 = enables the Endpoint to issue memory or IO requests, including MSI messages. Requests other than memory or IO requests are not controlled by this bit. Hardwired to 0 if an Endpoint function does not generate memory or IO requests. Root and Switch Ports: controls the forwarding of memory or IO requests by a Switch or Root Port in the upstream direction. 0 = memory and IO requests received at a Root Port or the downstream side of a Switch Port must return an Unsupported Request (UR) Completion. Does not affect the forwarding of Completions in either the upstream or downstream direction, and does not control the forwarding of requests other than memory or IO requests. The default value of this bit is 0. |
| 3 | RO | Special Cycle Enable. Does not apply to PCI Express and must be 0. |
| 4 | RO | Memory Write and Invalidate. Does not apply to PCI Express and must be 0. |
| 5 | RO | VGA Palette Snoop. Does not apply to PCI Express and must be 0. |
| 6 | RW | Parity Error Response. In the Status register (see Figure 22-5 on page 780), the Master Data Parity Error bit is set by a Requester if its Parity Error Response bit is set and either of the following occurs: the Requester receives a poisoned Completion, or the Requester poisons a write request. If the Parity Error Response bit is cleared, the Master Data Parity Error status bit is never set. The default value of this bit is 0. |
| 7 | RO | IDSEL Stepping/Wait Cycle Control. Does not apply to PCI Express and must be 0. |
| 8 | RW | SERR Enable. When set, this bit enables the non-fatal and fatal errors detected by the function to be reported to the Root Complex. The function reports such errors to the Root Complex if it is enabled to do so either through this bit or through the PCI Express-specific bits in the Device Control register (see "Device Control Register" on page 905). The default value of this bit is 0. |
| 9 | RO | Fast Back-to-Back Enable. Does not apply to PCI Express and must be 0. |
| 10 | RW | Interrupt Disable. Controls the ability of a PCI Express function to generate INTx interrupt messages. 0 = the function is enabled to generate INTx interrupt messages. 1 = the function's ability to generate INTx interrupt messages is disabled. If the function had already transmitted any Assert_INTx emulation interrupt messages and this bit is then set, it must transmit a corresponding Deassert_INTx message for each assert message transmitted earlier. Note that INTx emulation interrupt messages forwarded by Root and Switch Ports from devices downstream of the Root or Switch Port are not affected by this bit. The default value of this bit is 0. |
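As a brief illustration of how software typically uses the enable bits in the table above, the sketch below sets the Memory decoder and Bus Master Enable bits for an Endpoint. The cfg_read16/cfg_write16 helpers are platform-specific assumptions.

```c
#include <stdint.h>

/* Hypothetical configuration accessors (assumed). */
extern uint16_t cfg_read16 (uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off);
extern void     cfg_write16(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off, uint16_t val);

#define COMMAND_REG     0x04
#define CMD_IO_ENABLE   (1u << 0)   /* IO Address Space Decoder Enable     */
#define CMD_MEM_ENABLE  (1u << 1)   /* Memory Address Space Decoder Enable */
#define CMD_BUS_MASTER  (1u << 2)   /* Bus Master Enable                   */
#define CMD_SERR_ENABLE (1u << 8)   /* SERR Enable                         */

/* Enable an Endpoint's memory decoder and its ability to issue requests. */
static void enable_endpoint(uint8_t bus, uint8_t dev, uint8_t fn)
{
    uint16_t cmd = cfg_read16(bus, dev, fn, COMMAND_REG);
    cmd |= CMD_MEM_ENABLE | CMD_BUS_MASTER;
    cfg_write16(bus, dev, fn, COMMAND_REG, cmd);
}
```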

Status Register

Differs from the PCI spec. Mandatory. Table 22-5 on page 789 provides a description of each bit in the Status register (also refer to Figure 22-9 on page 788). The bit fields with the RW1C attribute have the following characteristics:
  • Register bits return status when read, and a status bit may be cleared by writing a one to it. Writing a 0 to RW1C bits has no effect.
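One practical consequence of the RW1C behavior is that a specific status bit can be cleared without a read-modify-write of the whole register, as the sketch below shows. The cfg_write16 helper is a platform-specific assumption.

```c
#include <stdint.h>

/* Hypothetical configuration accessor (assumed). */
extern void cfg_write16(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off, uint16_t val);

#define STATUS_REG              0x06
#define STATUS_MASTER_DATA_PERR (1u << 8)   /* RW1C bit */

/* Clear only the Master Data Parity Error bit by writing a one to that
 * bit position. Writing zeros to the other RW1C bits has no effect, so
 * there is no risk of accidentally clearing them.                         */
static void clear_master_data_parity_error(uint8_t bus, uint8_t dev, uint8_t fn)
{
    cfg_write16(bus, dev, fn, STATUS_REG, STATUS_MASTER_DATA_PERR);
}
```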
Figure 22-9: PCI Configuration Status Register
Table 22-5: Status Register

| Bit | Attributes | Description |
|-----|------------|-------------|
| 3 | RO | Interrupt Status. Indicates that the function has an interrupt request outstanding (that is, the function transmitted an interrupt message earlier in time and is awaiting servicing). Note that INTx emulation interrupts forwarded by Root and Switch Ports from devices downstream of the Root or Switch Port are not reflected in this bit. The default state of this bit is 0. Note: this bit is only associated with INTx messages, and has no meaning if the device is using Message Signaled Interrupts. |
| 4 | RO | Capabilities List. Indicates the presence of one or more extended capability register sets in the lower 48 dwords of the function's PCI-compatible configuration space. Since, at a minimum, all PCI Express functions are required to implement the PCI Express capability structure, this bit must be set to 1. |
| 5 | RO | 66MHz-Capable. Does not apply to PCI Express and must be 0. |
| 7 | RO | Fast Back-to-Back Capable. Does not apply to PCI Express and must be 0. |
| 8 | RW1C | Master Data Parity Error. This bit is set by a Requester if the Parity Error Response bit is set in its Command register and either of the following occurs: the Requester receives a poisoned Completion, or the Requester poisons a write request. If the Parity Error Response bit is cleared, the Master Data Parity Error status bit is never set. The default value of this bit is 0. |
| 10:9 | RO | DEVSEL Timing. Does not apply to PCI Express and must be 0. |
| 11 | RW1C | Signaled Target Abort. This bit is set when a function acting as a Completer terminates a request by issuing Completer Abort Completion Status to the Requester. The default value of this bit is 0. |
| 12 | RW1C | Received Target Abort. This bit is set when a Requester receives a Completion with Completer Abort Completion Status. The default value of this bit is 0. |
| 13 | RW1C | Received Master Abort. This bit is set when a Requester receives a Completion with Unsupported Request Completion Status. The default value of this bit is 0. |
| 14 | RW1C | Signaled System Error. This bit is set when a function sends an ERR_FATAL or ERR_NONFATAL message and the SERR Enable bit in the Command register is set to one. The default value of this bit is 0. |
| 15 | RW1C | Detected Parity Error. Regardless of the state of the Parity Error Response bit in the function's Command register, this bit is set if the function receives a Poisoned TLP. The default value of this bit is 0. |

Cache Line Size Register

Differs from the PCI spec. Optional.
This field is implemented by PCI Express devices as a read-write field for legacy compatibility purposes but has no impact on any PCI Express device functionality.

Master Latency Timer Register

Differs from the PCI spec. Optional.
This register does not apply to PCI Express and must be hardwired to 0 .


Interrupt Line Register

Differs from the PCI spec. Optional.

Usage In a PCI Function

Required if a PCI function is capable of generating interrupt requests via an INTx# pin (i.e., INTA#, INTB#, INTC#, or INTD#). The PCI spec allows a function to generate interrupts either using an interrupt pin, or using MSI-capability (for more information, see "Message Signaled Interrupts" on page 331).
The read/writable Interrupt Line register is used to identify which input on the interrupt controller the function's PCI interrupt request pin (as specified in its Interrupt Pin register; see "Interrupt Pin Register" on page 792) is routed to. For example, in a PC environment the values 00h-through-0Fh in this register correspond to the IRQ0-through-IRQ15 inputs on the interrupt controller. The value 255d (FFh) indicates "unknown" or "no connection." The values from 10h-through-FEh, inclusive, are reserved. Although it doesn't state this in the PCI spec, it is the author's opinion that RST# should initialize the Interrupt Line register to a value of FFh, thereby indicating that interrupt routing has not yet been assigned to the function.
The OS or device driver can examine a device's Interrupt Line register to determine which system interrupt request line the device uses to issue requests for service (and, therefore, which entry in the interrupt table to "hook").
In a non-PC environment, the value written to this register is architecture-specific and therefore outside the scope of the specification.

Usage In a PCI Express Function

A PCI Express function may generate interrupts in the legacy PCI/PCI-X manner. As an example, when a PCI Express-to-PCI or PCI-X bridge detects the assertion or deassertion of one of its INTA#, INTB#, INTC#, or INTD# inputs on the legacy side of the bridge, it sends an INTx Assert or Deassert message upstream towards the Root Complex (specifically, to the interrupt controller within the Root Complex).
As in PCI, the Interrupt Line register communicates interrupt line routing information. The register is read/write and must be implemented by any function that contains a valid non-zero value in its Interrupt Pin configuration register (described in the next section). The OS or device driver can examine a device's Interrupt Line register to determine which system interrupt request line the device uses to issue requests for service (and, therefore, which entry in the interrupt table to "hook").
In a non-PC environment, the value written to this register is architecture-specific and therefore outside the scope of the specification.

Interrupt Pin Register

Differs from the PCI spec. Optional.

Usage In a PCI Function

Required if a PCI function is capable of generating interrupt requests via an INTx# pin. The PCI spec allows a function to generate interrupts either using an interrupt pin, or using MSI-capability (for more information, see "Two Methods of Interrupt Delivery" on page 330).
The read-only Interrupt Pin register defines which of the four PCI interrupt request pins, INTA#-through-INTD#, a PCI function is connected (i.e., bonded) to. The values 01h-through-04h correspond to PCI interrupt request pins INTA#-through-INTD#. A return value of zero indicates that the device doesn't generate interrupts. All other values (05h-FFh) are reserved.

Usage In a PCI Express Function

This read-only register identifies the legacy INTx interrupt Message (INTA, INTB, INTC, or INTD) the function transmits upstream to generate an interrupt. The values 01h-through-04h correspond to legacy INTx interrupt Messages INTA-through-INTD. A return value of zero indicates that the device doesn't generate interrupts using the legacy method. All other values (05h-FFh) are reserved. Note that, although the function may not generate interrupts via the legacy method, it may generate them via the MSI method (see "Determining if a Function Uses INTx# Pins" on page 343 for more information).

Base Address Registers

Differ from the PCI spec. Required if a function implements memory and/or IO decoders.

Introduction

Virtually all functions implement some memory, and/or a function-specific register set to control the function and sense its status. Some examples are:
  • A parallel port's Status, Command and Data registers could reside in IO or memory-mapped IO space.
  • A network interface's control registers (Command/Status, etc.) could reside in IO or memory-mapped IO space.
  • The network interface may also incorporate a RAM memory buffer that must be mapped into the system's memory space.
  • In addition, a ROM containing the function's BIOS and interrupt service routine may be present in a function.
On power-up, the system must be automatically configured so that each function's IO and memory functions occupy mutually-exclusive address ranges. In order to accomplish this, the system must be able to detect how many memory and IO address ranges a function requires and the size of each. Obviously, the system must then be able to program the function's address decoders in order to assign non-conflicting address ranges to them.
The Base Address Registers (BARs), located in dwords 4-through-9 of the function's configuration Header space (see Figure 22-1 on page 771), are used to implement a function's programmable memory and/or IO decoders. Each register is 32-bits wide (or 64-bits wide if it's a memory decoder and its associated memory block can be located above the 4GB address boundary). Figure 22-10 on page 796, Figure 22-11 on page 797, and Figure 22-12 on page 798 illustrate the three possible formats of a Base Address Register. Bit 0 is a read-only bit and indicates whether it's a memory or an IO decoder:
  • If bit 0 = 0, the register is a memory address decoder.
  • If bit 0 = 1, the register is an IO address decoder.
Decoders may be implemented in any of the Base Address Register positions. If more than one decoder is implemented, there may be holes. During configuration, the configuration software must therefore look at all six of the possible Base Address Register positions in a function's Header to determine which registers are actually implemented.

IO Space Usage

In a PC environment, IO space is densely populated and will only become more so in the future. For this reason and because some processors are only capable of performing memory transactions, the following rules related to IO space usage are defined in the PCI Express spec:
  • Native PCI Express Endpoint Function (as indicated by a value of 0000b in the Device/Port Type field in the function's PCI Express Capabilities Register; see Figure 22-31 on page 865). Some operating systems and/or processors may not support IO accesses (i.e., accesses using IO rather than memory addresses). This being the case, the designer of a native PCI Express function should avoid the use of IO BARs.
However, the target system that a function is designed for may use the function as one of the boot devices (i.e., the boot input device, output display device, or boot mass storage device) and may utilize a legacy device driver for the function at startup time. The legacy driver may assume that the function's device-specific register set resides in IO space. In this case, the function designer would supply an IO BAR to which the configuration software will assign an IO address range. When the OS boot has completed and the OS has loaded a native PCI Express driver for the function, however, the OS may deallocate all legacy IO address ranges previously assigned to the selected boot devices. From that point forward and for the duration of the power-up session, the native driver will utilize memory accesses to communicate with its associated function through the function's memory BARs.
  • Legacy PCI Express Endpoint Function (as indicated by a value of 0001b in the Device/Port Type field in the function's PCI Express Capabilities Register; see Figure 22-31 on page 865). A legacy PCI Express Endpoint function consists of a legacy PCI or PCI-X function supplied with a PCI Express front end to interface it to the PCI Express fabric. As many legacy functions implemented IO BARs, IO BARs are tolerated in this type of function.

Memory Base Address Register

This section provides a detailed description of the bit fields within a Memory BAR. The section entitled "Finding Block Size and Assigning Address Range" on page 799 describes how the register is probed to determine its existence, the size of the memory associated with the decoder, and the assignment of the base address to the decoder.
Decoder Width Field. In a Memory Base Address Register, bits [2:1] define whether the decoder is 32- or 64-bits wide:
  • If 00b, it's a 32-bit register (see Figure 22-10 on page 796). The configuration software therefore will write a 32-bit start memory address into it specifying any address in the first 4GB of memory address space.
  • If 10b, it's a 64-bit register (see Figure 22-11 on page 797). The configuration software therefore writes a 64-bit start memory address into it that specifies a start address in a 2^64 memory address space. This means that this Base Address Register consumes two consecutive dwords of the configuration Header space. The first dword is used to set the lower 32 bits of the start address and the second dword is used to specify the upper 32 bits of the start address.
Prefetchable Attribute Bit. Bit three defines the block of memory as Prefetchable or not. A block of memory space may be marked as Prefetchable only if it can guarantee that:
  • there are no side effects from reads (e.g., the read doesn't alter the contents of the location or alter the state of the function in some manner). It's permissible for a bridge that resides between a Requester and a memory target to prefetch read data from memory that has this characteristic. If the Requester doesn't end up asking for all of the data that the bridge read into a read-ahead buffer, the bridge must discard the data (see "Bridge Must Discard Unconsumed Prefetched Data" on page 801). The data remains unchanged in the target's memory locations.
  • on a read, it always returns all bytes irrespective of the byte enable settings.
  • the memory device continues to function correctly if a bridge that resides between the Requester and the memory target performs byte merging (for more information, refer to "Byte Merging" on page 801) in its posted memory write buffer when memory writes are performed within the memory target's range.
In a nutshell, regular memory is prefetchable while memory-mapped IO (or any other badly-behaved memory region) is not. The configuration software can determine that a memory target is prefetchable or not by checking the Prefetchable bit in the memory target's Base Address Register (BAR).
All memory BAR registers in PCI Express Endpoint functions with the Prefetchable bit set to one must be implemented as 64-bit memory BARs. Memory BARs that do not have the prefetchable bit set to one may be implemented as 32-bit BARs.
As an example, the address decoder for a block of memory-mapped IO ports may hardwire the Prefetchable bit to zero, while the address decoder for well-behaved memory would hardwire it to one. For performance reasons, the spec urges that, wherever possible, memory-mapped resources be designed as prefetchable memory.
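Assuming the raw 32-bit contents of a BAR have already been read, the low-order fields described above can be picked apart as in the following C sketch (the function name and sample values are illustrative only):

    #include <stdint.h>
    #include <stdio.h>

    /* Decode the read-only low-order BAR fields: bit 0 selects memory vs. IO,
     * bits [2:1] select a 32- or 64-bit memory decoder, and bit 3 marks the
     * associated memory range as prefetchable. */
    static void describe_bar(uint32_t bar)
    {
        if (bar & 0x1) {
            printf("IO decoder\n");
            return;
        }
        unsigned type     = (bar >> 1) & 0x3;   /* 00b = 32-bit, 10b = 64-bit */
        unsigned prefetch = (bar >> 3) & 0x1;
        printf("Memory decoder, %s, %sprefetchable\n",
               (type == 0x2) ? "64-bit" : "32-bit",
               prefetch ? "" : "non-");
    }

    int main(void)
    {
        describe_bar(0xFFF00000);  /* memory, 32-bit, non-prefetchable */
        describe_bar(0x0000000C);  /* memory, 64-bit, prefetchable     */
        describe_bar(0xFFFFFF01);  /* IO decoder                       */
        return 0;
    }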

The configuration software checks this bit to determine a memory target's operational characteristics, assigns a memory range to its decoder (i.e., its Memory BAR), and then backtracks to all upstream bridges between the memory target and the processor and configures the bridges to treat the assigned memory range in the appropriate manner:
  • If it's Prefetchable memory, it's permissible for a bridge to perform read prefetching to yield better performance, and it's also permissible for the bridge to perform byte merging in its posted memory write buffer for writes performed to the memory.
  • If it's non-Prefetchable memory, bridge read prefetching and byte merging are not allowed within the assigned region of memory space. This will not allow bridges to optimize accesses to the function, but you're assured the function will work correctly (and that's pretty important!).
Base Address Field. This field consists of bits [31:7] for a 32-bit memory decoder and bits [63:7] for a 64-bit memory decoder. It is used:
  • to determine the size of the memory associated with this decoder, and
  • to assign a start (i.e., base) address to the decoder.
Programming of an example Memory Base Address Register is provided in "Finding Block Size and Assigning Address Range" on page 799.
The minimum memory range requested by a BAR is 128 bytes.
Figure 22-10: 32-Bit Memory Base Address Register Bit Assignment
Figure 22-11: 64-Bit Memory Base Address Register Bit Assignment

IO Base Address Register

Introduction. This section provides a detailed description of the bit fields within an IO Base Address Register. The section entitled "Finding Block Size and Assigning Address Range" on page 799 describes:
  • how the register is probed to determine its existence,
  • how to determine the size of the IO register set associated with the decoder and therefore the amount of IO space that must be assigned to it, and
  • how to assign the base address to the decoder.
IO BAR Description. Refer to Figure 22-12 on page 798. Bit zero returns a one, indicating that this is an IO, rather than a memory, decoder. Bit one is reserved and must always return zero. Bits [31:2] comprise the Base Address field, which is used to:
  • determine the size of the IO block required and
  • to set its start address.
The PCI spec requires that a device that maps its control register set into IO space must not request more than 256 locations per IO Base Address Register.
PC-Compatible IO Decoder. The upper 16 bits of the IO BAR may be hardwired to zero when a function is designed specifically for a PC-compatible, x86-based machine (because Intel x86 processors are incapable of generating IO addresses over 64KB). The function must still perform a full 32-bit decode of the IO address, however.
Legacy IO Decoders. Legacy PC-compatible devices such as VGA and IDE controllers frequently expect to be located within fixed legacy IO ranges. Such functions do not implement Base Address Registers. Instead, the configuration software identifies them as legacy functions via their respective Class Code and then enables their IO decoder(s) by setting the IO Space bit in its Command register to one.
A legacy IO function may or may not own all of the byte locations within a dword of IO space:
  • A legacy IO function that does own all of the bytes within the currently-addressed dword can perform its decode using the dword-aligned address supplied by A[31:2].
  • A legacy IO function that does not own all of the byte locations within a dword must decode the byte enables to determine if it owns the byte-specific location being addressed. It must examine the byte enables to determine if the Requester is addressing additional, higher byte locations within the target IO dword (identified via A[31:2]). If it owns all of the addressed IO ports, the function can honor the request. However, if it doesn't own them all it must issue a Completer Abort to the Requester.
Figure 22-12: IO Base Address Register Bit Assignment

Finding Block Size and Assigning Address Range

How It Works. The configuration program must probe each of a function's possible Base Address Registers to determine:
  • Is the Base Address Register implemented?
  • Is it a memory or an IO address decoder?
  • If it's a memory decoder, is it a 32- or 64-bit Base Address Register?
  • If it's a memory decoder, is the memory associated with the register Prefetchable or non-Prefetchable?
  • How much memory or address space does it require and with what alignment?
All of this information can be ascertained simply by writing all ones to the Base Address Register and then reading it back. A return value of zero indicates that the Base Address Register isn't implemented. Assuming that the value read is non-zero, the programmer scans the returned value upwards starting at the least-significant bit of the Base Address field and determines the size of the required memory or IO space by finding the least-significant bit that was successfully set to one. Bit zero of the register has a binary-weighted value of one, bit one a value of two, bit two a value of four, etc. The binary-weighted value of the least-significant bit set to one in the Base Address field indicates the required amount of space. This is also the first read/writable bit in the register and all of the bits above it are by definition read/writable. After discovering this information, the program then writes a base 32- or 64-bit memory address, or the base 32-bit IO address, into the Base Address Register.
A Memory Example. As an example, assume that FFFFFFFFh is written to the Base Address Register at configuration dword 04d and the value read back is FFF00000h. The fact that any bits could be changed to one indicates that the Base Address Register is implemented.
  • Bit 0 = 0, indicating that this is a memory address decoder.
  • Bits [2:1] = 00b, indicating that it's a 32-bit memory decoder.
  • Bit 3 = 0, indicating that it's not Prefetchable memory.
  • Bit 20 is the first one bit found in the Base Address field. The binary-weighted value of this bit is 1,048,576, indicating that this is an address decoder for 1MB of memory.
The programmer then writes a 32-bit base address into the register. However, only bits [31:20] are writable. The decoder accepts bits [31:20] and assumes that bits [19:0] of the assigned base address are zero. This means that the base address is divisible by 1MB, the size of the requested memory range. It is a characteristic of PCI, PCI-X, and PCI Express decoders that the assigned start address is always divisible by the size of the requested range.
As an example, it is possible to program the example memory address decoder for a 1MB block of memory to start on the one, two, or three meg boundary, but it is not possible to set its start address at the 1.5, 2.3, or 3.7 meg boundary.
An IO Example. As a second example, assume that FFFFFFFFh is written to a function's Base Address Register at configuration dword address 05d and the value read back is FFFFFF01h. Bit 0 is a one, indicating that this is an IO address decoder. Scanning upwards starting at bit 2 (the least-significant bit of the Base Address field), bit 8 is the first bit that was successfully changed to one. The binary-weighted value of this bit is 256, indicating that this is an IO address decoder requesting 256 bytes of IO space.
The programmer then writes a 32-bit base IO address into the register. However, only bits [31:8] are writable. The decoder accepts bits [31:8] and assumes that bits [7:0] of the assigned base address are zero. This means that the base address is divisible by 256, the size of the requested IO range.
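Both of the worked examples above follow the same arithmetic, which a configuration program might code along the lines of the following minimal sketch (it operates on the value read back after writing FFFFFFFFh; the function name is hypothetical):

    #include <stdint.h>
    #include <stdio.h>

    /* Size requested by a 32-bit BAR, computed from the value read back after
     * writing FFFFFFFFh: mask off the read-only attribute bits (bits [3:0] for
     * a memory BAR, bits [1:0] for an IO BAR), then the binary-weighted value
     * of the least-significant writable bit is ~masked + 1. A readback of zero
     * means the BAR isn't implemented. */
    static uint32_t bar_size(uint32_t readback)
    {
        if (readback == 0)
            return 0;
        uint32_t mask = (readback & 0x1) ? ~0x3u : ~0xFu;
        return ~(readback & mask) + 1;
    }

    int main(void)
    {
        printf("0x%X bytes\n", (unsigned)bar_size(0xFFF00000));  /* memory example: 100000h = 1MB */
        printf("0x%X bytes\n", (unsigned)bar_size(0xFFFFFF01));  /* IO example: 100h = 256 bytes  */
        return 0;
    }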

Smallest/Largest Decoder Sizes

Smallest/Largest Memory Decoders. The smallest memory address decoder is implemented as a Base Address Register that permits bits [31:7] to be written. Since the binary-weighted value of bit seven is 128, 128 bytes is the smallest memory block a memory decoder can be designed for.
If a 32-bit memory BAR only permits bit 31 to be written, it is requesting 2GB of memory space.
A 64-bit memory BAR could request more than 2GB of memory address space, resulting in none of the lower 32 bits in the BAR being writable. If this is the case, the programmer must also write all ones in the high dword of the BAR to determine how big a memory space the decoder requires.
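For that 64-bit case, the two dwords read back are simply treated as one 64-bit value before the same size computation is applied. A hedged sketch (the example readback values are invented) follows:

    #include <stdint.h>
    #include <stdio.h>

    /* Size a 64-bit memory BAR from the low and high dwords read back after
     * writing all ones to both; the low dword's attribute bits [3:0] are
     * masked off before the halves are combined. */
    static uint64_t bar64_size(uint32_t low_readback, uint32_t high_readback)
    {
        uint64_t value = ((uint64_t)high_readback << 32) | (low_readback & ~0xFu);
        if (value == 0)
            return 0;
        return ~value + 1;
    }

    int main(void)
    {
        /* Example: nothing below 2^32 is writable and the entire high dword
         * reads back as ones, so the least-significant writable bit has a
         * binary weight of 2^32 and the BAR is requesting 4GB. */
        printf("%llu bytes\n", (unsigned long long)bar64_size(0x0000000C, 0xFFFFFFFF));
        return 0;
    }
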
Smallest/Largest IO Decoders. The smallest IO decoder would be implemented as a Base Address Register that permitted bits [31:2] to be programmed. Since the binary-weighted value of bit two is 4, 4 bytes (a dword) is the smallest IO block an IO decoder can be designed for.
The largest IO decoder would permit bits [31:8] to be written. The binary-weighted value of bit 8 is 256 and this is therefore the largest range that an IO decoder can request.

Byte Merging

A bridge may combine writes to a single dword within one entry in the posted-write buffer. This feature is recommended to improve performance and is only permitted in memory address ranges that are designated as prefetchable.
As an example, assume that a Requester performs two memory writes:
  • the first writes to locations 00000100h and 00000101h and
  • the second writes to locations 00000102h and 00000103h.
These four locations reside within the same dword. The bridge could absorb the first two-byte write into a dword buffer entry and then absorb the second two-byte write into the same dword buffer entry. When the bridge performs the memory write, it can complete it as a single access. It is a violation of the spec, however, for a bridge to combine separate byte writes to the same location into a single write. As an example, assume that a Requester performs four separate memory writes to the same dword: the first writes to location zero in the dword, the second to location zero again, the third to location one and the fourth to location two. When the bridge performs the posted writes, it has to perform a single memory write transaction to write the first byte to location zero. It then performs a second memory write transaction to write to locations zero (the second byte written to it by the Requester), one and two.
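One way to picture the rule is to track, per posted dword entry, which byte lanes already hold data; a merge is legal only into empty lanes. The following toy C model (all names invented for the example) captures that behavior:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of one posted-write buffer entry covering a single dword,
     * tracking which byte lanes already hold write data. */
    struct dword_entry {
        uint32_t data;
        uint8_t  byte_enables;   /* bit n set => byte n of the dword is valid */
    };

    /* Merge a byte write (offset 0-3 within the dword) into the entry.
     * Merging is only legal if that byte lane is still empty; a second write
     * to the same byte must be performed as a separate transaction. */
    static bool try_merge_byte(struct dword_entry *e, unsigned offset, uint8_t value)
    {
        uint8_t lane = 1u << offset;
        if (e->byte_enables & lane)
            return false;
        e->data &= ~(0xFFu << (offset * 8));
        e->data |=  (uint32_t)value << (offset * 8);
        e->byte_enables |= lane;
        return true;
    }

    int main(void)
    {
        struct dword_entry e = { 0, 0 };
        /* Bytes 0-1 from the first write and bytes 2-3 from the second all merge. */
        try_merge_byte(&e, 0, 0x11); try_merge_byte(&e, 1, 0x22);
        try_merge_byte(&e, 2, 0x33); try_merge_byte(&e, 3, 0x44);
        /* A repeat write to byte 0 cannot be merged into the same entry. */
        printf("repeat write to byte 0: %s\n",
               try_merge_byte(&e, 0, 0x55) ? "merged" : "needs a separate write");
        return 0;
    }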

Bridge Must Discard Unconsumed Prefetched Data

A bridge that has prefetched memory read data for a Requester must discard any prefetched read data that the Requester doesn't actually end up reading. The following is an example scenario that demonstrates a problem that will result if a bridge doesn't discard prefetched data that wasn't consumed:
  1. The processor has two buffers in main memory that occupy adjacent memory regions. The memory is designated as prefetchable memory.
  2. The processor writes data into the first memory buffer and then instructs a PCI Express Requester to read and process the data.
  3. The Requester starts its memory read and the bridge between the Requester and the target memory performs read-aheads from the memory because it is prefetchable, well-behaved memory. The bridge ends up prefetching past the end of the first memory buffer into the second one, but the Requester only actually reads the data from the first buffer area.
  4. The bridge does not discard the unused data that was prefetched from the second buffer.
  5. The processor writes data into the second memory buffer and then instructs a Requester (the same Requester or a different one) beyond the same bridge to read and process the data.
  6. The Requester starts its memory read at the start address of the second buffer. The bridge delivers the data that it prefetched from the beginning of the second buffer earlier. This is stale data and doesn't reflect the latest data written into the second memory buffer.

Min_Gnt/Max_Lat Registers

Differs from the PCI spec. Optional.
These registers do not apply to PCI Express. They must be read-only and hardwired to 0.

Header Type 1

General

Figure 22-13 on page 803 illustrates the layout of a PCI-to-PCI bridge's configuration header space.

Figure 22-13: Header Type 1

Header Type 1 Registers Compatible With PCI

The Header Type 1 PCI configuration registers that are implemented and used identically in both PCI and PCI Express are:
  • Vendor ID register.
  • Device ID register.
  • Revision ID register.
  • Class Code register.
  • Header Type register.
  • BIST register.
  • Capabilities Pointer register.
  • Subordinate Bus Number register.
  • Secondary Bus Number register.
  • Primary Bus Number register.
  • IO Base, Limit and Upper registers.
  • Memory Base and Limit registers.
  • Expansion ROM Base Address register.
The sections that follow provide a description of each of these registers.

Header Type 1 Registers Incompatible With PCI

In a Header Type 1 PCI Express bridge function, the definitions of the following configuration registers in the function's PCI-compatible configuration space differ from their definitions in the PCI spec:
  • Command Register
  • Status Register
  • Cache Line Size Register
  • Master Latency Timer Register
  • Interrupt Line Register
  • Interrupt Pin Register
  • Base Address Registers
  • Secondary Latency Timer register.
  • Secondary Status register.
  • Prefetchable Memory Base, Limit, and Upper registers.
  • Bridge Control register.
The sections that follow define the implementation/usage differences of these registers. For a full description of their implementation in a PCI-to-PCI bridge function, refer to the MindShare book entitled PCI System Architecture, Fourth Edition (published by Addison-Wesley). For a full description of their implementation in a PCI-X to PCI-X bridge function, refer to the MindShare book entitled PCI-X System Architecture, First Edition (published by Addison-Wesley).

Terminology

Before proceeding, it's important to define some basic terms associated with an actual or a virtual PCI-to-PCI bridge. Each PCI-to-PCI bridge is connected to two buses, referred to as its primary and secondary buses:
  • Downstream. When a transaction is initiated and is passed through one or more PCI-to-PCI bridges flowing away from the host processor, it is said to be moving downstream.
  • Upstream. When a transaction is initiated and is passed through one or more PCI-to-PCI bridges flowing towards the host processor, it is said to be moving upstream.
  • Primary bus. PCI bus that is directly connected to the upstream side of a bridge.
  • Secondary bus. PCI bus that is directly connected to the downstream interface of a PCI-to-PCI bridge.
  • Subordinate bus. Highest-numbered PCI bus on the downstream side of the bridge.

Bus Number Registers

PCI-Compatible registers. Mandatory.

Introduction

Each PCI-to-PCI bridge must implement three mandatory bus number registers. All of them are read/writable and are cleared to zero by reset. During configuration, the configuration software initializes these three registers to assign bus numbers. These registers are:
  • the Primary Bus Number register.
  • the Secondary Bus Number register.
  • the Subordinate Bus Number register.
The combination of the Secondary and the Subordinate Bus Number register values defines the range of buses that exists on the downstream side of the bridge. The information supplied by these three registers is used by the bridge to determine whether or not to pass a packet through to the opposite interface.

Primary Bus Number Register

PCI-Compatible register. Mandatory. Located in Header byte zero of dword six. The Primary Bus Number register is initialized by software with the number of the bus that is directly connected to the bridge's primary interface. This register exists for three reasons:
  • To route Completion packets.
  • To route a Vendor-defined message that uses ID-based routing.
  • To route a PCI Special Cycle Request that is moving upstream. A bridge that connects a PCI Express link to a PCI or PCI-X bus receives a Special Cycle Request (as defined in the PCI spec) on its secondary interface. A Special Cycle Request is a request to perform a Special Cycle transaction on the destination PCI or PCI-X bus. The request takes the form of a Type 1 Configuration write request packet (see Figure 20-8 on page 734) wherein the destination ID has the following characteristics:
  • All ones in the Device Number and Function Number fields, and
  • All zeros in the Register Number and Extended Register Number fields.
If the destination bus number in the request packet matches the value in the Primary Bus Number register and the other fields are as stated above, the request is converted into a Special Cycle transaction on the primary bus and the write data is delivered as the message in the transaction's Data Phase.
If it doesn't match the bridge's Primary Bus Number register and it's outside the range of buses defined by the bridge's Secondary Bus Number and Subordinate Bus Number registers, the target bus is not on the downstream side of the bridge and the packet must therefore be passed upstream. The bridge accepts the packet and passes it to its opposite interface.

Secondary Bus Number Register

PCI-Compatible register. Mandatory. Located in Header byte one of dword six. The Secondary Bus Number register is initialized by software with the number of the bus that is directly connected to the bridge's secondary interface. This register exists for three reasons:
  • When a Special Cycle Request is latched on the primary side, the bridge uses this register (and, possibly, the Subordinate Bus Number register) to determine if it should be passed to the bridge's secondary interface as either a PCI Special Cycle transaction (if the bus connected to the secondary interface is the destination PCI or PCI-X bus) or as is (i.e., as a Type 1 configuration write request packet).
  • When a Type 1 Configuration transaction (read or write and not a PCI Special Cycle Request) is latched on the primary side, the bridge uses this register (and, possibly, the Subordinate Bus Number register) to determine if it should be passed to the bridge's secondary interface as either a Type 0 configuration transaction (if the bus connected to the secondary interface is the destination PCI or PCI-X bus) or as is (i.e., as a Type 1 configuration write request packet).
  • When a Completion packet is latched on the primary side, the bridge uses this register (and, possibly, the Subordinate Bus Number register) to determine if it should be passed to the bridge's secondary interface.

Subordinate Bus Number Register

PCI-Compatible register. Mandatory. Located in Header byte two of dword six. The Subordinate Bus Number register is initialized by software with the number of the highest-numbered bus that exists on the downstream side of the bridge. If there are no PCI-to-PCI bridges on the secondary bus, the Subordinate Bus Number register is initialized with the same value as the Secondary Bus Number register.

Bridge Routes ID Addressed Packets Using Bus Number Registers

When one of the bridge's interfaces latches a Completion packet, an ID-routed Vendor-defined message, or a PCI Special Cycle request, it uses its internal bus number registers to decide whether or not to accept the packet and pass it to the opposite bridge interface:
  • The routing of PCI Special Cycle requests was described in the previous sections.
  • When the bridge latches a Completion packet or an ID-routed Vendor-defined message on its primary interface, it compares the Bus Number portion of destination ID to its Secondary Bus Number and Subordinate Bus Number register values. If the target bus number falls within the range of buses defined by the bridge's Secondary Bus Number and Subordinate Bus Number registers, the bridge accepts the packet and passes it to its opposite interface.
  • When the bridge latches a Completion packet or an ID-routed Vendor-defined message on its secondary interface, it compares the Bus Number portion of the destination ID to its Primary Bus Number register.
  • If it matches, the bridge accepts the packet and passes it to the primary interface.
  • If it doesn't match the bridge's Primary Bus Number register and it's outside the range of buses defined by the bridge's Secondary Bus Number and Subordinate Bus Number registers, the bridge accepts the packet (the target bus is not on the downstream side of the bridge and therefore it must be passed upstream) and passes it to its primary interface.
  • If the destination bus falls within the range of buses defined by the bridge's Secondary Bus Number and Subordinate Bus Number registers, then the target bus is on the downstream side of the bridge. The bridge therefore does not accept the packet.
  • These registers are also used to route Type 1 configuration packets.
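Stated as code, the forwarding decision for ID-routed packets reduces to a couple of comparisons against the three bus number registers. The following C sketch summarizes the rules listed above (the bus numbers in the example are arbitrary):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Decide whether a bridge passes an ID-routed packet (a Completion or an
     * ID-routed Vendor-defined message) to its opposite interface, based on
     * the Primary, Secondary, and Subordinate Bus Number registers. */
    static bool forward_id_routed(uint8_t target_bus, bool latched_on_primary,
                                  uint8_t primary, uint8_t secondary, uint8_t subordinate)
    {
        bool downstream = (target_bus >= secondary) && (target_bus <= subordinate);

        if (latched_on_primary)
            return downstream;   /* forward downstream only if the target bus
                                    lies below this bridge */
        /* Latched on the secondary interface: forward upstream if the target
         * bus is the primary bus or lies outside the downstream bus range. */
        return (target_bus == primary) || !downstream;
    }

    int main(void)
    {
        /* Example bridge: primary bus 0, secondary bus 1, subordinate bus 4. */
        printf("target bus 3, seen on primary:   %s\n",
               forward_id_routed(3, true, 0, 1, 4) ? "pass downstream" : "ignore");
        printf("target bus 0, seen on secondary: %s\n",
               forward_id_routed(0, false, 0, 1, 4) ? "pass upstream" : "ignore");
        return 0;
    }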

Vendor ID Register

PCI-Compatible register. Mandatory. See "Vendor ID Register" on page 773.

Device ID Register

PCI-Compatible register. Mandatory. See "Device ID Register" on page 773.

Revision ID Register

PCI-Compatible register. Mandatory. See "Revision ID Register" on page 773.

Class Code Register

PCI-Compatible register. Mandatory. Refer to Figure 22-2 on page 775. The Class field in the Class Code register of a Virtual PCI-to-PCI bridge, or a PCI Express bridge to a PCI or PCI-X bus will contain the value 06h (see Table 22-1 on page 775), the SubClass field will contain the value 04h (see Table 8 on page 1023), and the Programming Interface Byte will contain 00h.

Header Type Register

PCI-Compatible register. Mandatory. Refer to "Header Type Register" on page 777. The Header Type field in the Header Type register of a Virtual PCI-to-PCI bridge, or a PCI Express bridge to a PCI or PCI-X bus will be 01h, thereby indicating the register layout shown in Figure 22-13 on page 803.

BIST Register

PCI-Compatible register. Optional. Refer to "BIST Register" on page 778.

Capabilities Pointer Register

PCI-Compatible register. Mandatory. Refer to "Capabilities Pointer Register" on page 779.

Basic Transaction Filtering Mechanism

PCI devices that reside on the downstream side of a PCI-to-PCI bridge may incorporate internal memory (mapped into memory space) and/or an internal, device-specific register set mapped into either IO or memory-mapped IO space. The configuration program automatically detects the presence, type and address space requirements of these devices and allocates space to them by programming their address decoders to recognize the address ranges it assigns to them.
The configuration program assigns all IO devices that reside behind a PCI-to-PCI bridge mutually-exclusive address ranges that are blocked together within a common overall range of IO locations. The PCI-to-PCI bridge is then programmed to pass any IO transactions detected on the primary side of the bridge to the secondary side if the target address is within the range associated with the community of IO devices that reside behind the bridge. Conversely, any IO transactions detected on the secondary side of the bridge are passed to the primary side if the target address is outside the range associated with the community of IO devices that reside on the secondary side (because the target device doesn't reside on the secondary side, but may reside on the primary side).
All memory-mapped IO devices (i.e., non-prefetchable memory) that reside behind a PCI-to-PCI bridge are assigned mutually-exclusive memory address ranges within a common block of memory locations. The PCI-to-PCI bridge is then programmed to pass any memory-mapped IO transactions detected on the primary side of the bridge to the secondary side if the target address is within the range associated with the community of memory-mapped IO devices that reside behind the bridge. Conversely, any memory-mapped IO transactions detected on the secondary side of the bridge are passed to the primary side if the target address is outside the range associated with the community of memory-mapped IO devices that reside on the secondary side (because the target device doesn't reside on the secondary side, but may reside on the primary side).
All memory devices (i.e., regular memory, not memory-mapped IO) that reside behind a PCI-to-PCI bridge are assigned mutually-exclusive memory address ranges within a common overall range of memory locations. The PCI-to-PCI bridge is then programmed to pass any memory transactions detected on the primary side of the bridge to the secondary side if the target address is within the range associated with the community of memory devices that reside behind the bridge. Conversely, any memory transactions detected on the secondary side of the bridge are passed to the primary side if the target address is outside the range associated with the community of memory devices that reside on the secondary side (because the target device doesn't reside on the secondary side, but may reside on the primary side).
The bridge itself may incorporate:
  • a memory buffer.
  • an IO register set that is used to control the bridge.
  • a device ROM that contains a device driver for the bridge.
The bridge must incorporate programmable address decoders for these devices.

Bridge's Memory, Register Set and Device ROM

Introduction

A PCI-to-PCI bridge designer may choose to incorporate the following entities within the bridge:
  • A set of internal, device-specific registers that are used to control the bridge's operational characteristics or check its status. These registers are outside the scope of the PCI specification.
  • A memory buffer within the bridge.
  • A device ROM that contains a device driver for the bridge.
The register set must be mapped into memory or IO address space (or both). The designer implements one or two Base Address Registers (programmable address decoders) for this purpose.

If there is a device ROM within the bridge, the designer must implement an Expansion ROM base address register used by configuration software to map the ROM into memory space.
Likewise, if the bridge incorporates a memory buffer, the design must include a Base Address Register used to assign a base address to the memory.

Base Address Registers

Differs from PCI. Optional. Only necessary if the bridge implements a device-specific register set and/or a memory buffer.
Located in Header dwords four and five. If the designer doesn't implement any internal, device-specific register set or memory, then these address decoders aren't necessary. These Base Address Registers are used in the same manner as those described for a non-bridge PCI function (see "Base Address Registers" on page 792). If implemented, both may be implemented as memory decoders, both as IO decoders, one as memory and one as IO, or only one may be implemented as either IO or memory.
If a BAR is implemented as a memory BAR with the prefetchable bit set to one, it must be implemented as a 64-bit memory BAR and would therefore consume both dwords four and five.

Expansion ROM Base Address Register

PCI-Compatible register. Optional. Only necessary if the bridge implements a bridge-specific device ROM. Located in Header dword 14. This register is optional (because there may not be a device ROM incorporated within the bridge). The format and usage of this register is precisely the same as that described for a non-bridge PCI function (see "Expansion ROM Base Address Register" on page 783).

Bridge's IO Filter

PCI-Compatible registers. Optional.

Introduction

There is no requirement for a bridge to support devices that reside in IO space within or behind the bridge. For this reason, implementation of the IO decode-related configuration registers is optional.
When the bridge detects an IO transaction initiated on either of its bus interfaces, it must determine which of the following actions to take:
  1. Ignore the transaction because the target IO address isn't located on the other side of the bridge, nor is it targeting an IO location embedded within the bridge itself.
  2. When the target IO address is one of the bridge's internal IO registers, the Requester is permitted to access the targeted internal register and the transaction is not passed through the bridge.
  3. When the target IO location is located on the other side of the bridge, the transaction is passed through the bridge and is initiated on the opposite bus.
The optional configuration registers within the bridge that support this "filtering" capability are:
  • Base Address Registers. If present, the Base Address Register or registers can be designed as IO or memory decoders for an internal register set or memory.
  • IO Base and IO Limit registers. If the bridge supports IO space on the downstream side of the bridge, the IO Base register defines the start address and the IO Limit register defines the end address of the range to recognize and pass through to the secondary bus.
  • IO Extension registers (IO Base Upper 16-Bits and IO Limit Upper 16-Bits registers). If the bridge supports a 4GB (rather than a 64KB) IO address space on the downstream side of the bridge (as indicated in the IO Base and IO Limit registers), the combination of the IO Base and the IO Base Upper 16-Bits registers define the start address, and the combination of the IO Limit and the IO Limit Upper 16-Bits registers define the end address of the range to recognize and pass to the secondary side.
The sections that follow describe each of these scenarios.

Bridge Doesn't Support Any IO Space Behind Bridge

Assume that a bridge doesn't support any devices that reside in IO space on the downstream side of the bridge. In other words, it doesn't recognize any IO addresses as being implemented behind the bridge and therefore ignores all IO transactions detected on its primary bus. In this case, the bridge designer does not implement the optional IO Base, IO Limit, or IO Extension registers (i.e., IO Base Upper 16-bits and IO Limit Upper 16-Bits registers).
The bridge ignores all IO request packets detected on the primary bus (other than transactions that may target an optional set of bridge-specific registers contained within the bridge itself).
Any IO transactions detected on the bridge's secondary bus would be claimed and passed through to the primary bus in case the target IO device is implemented somewhere upstream of the bridge.

Bridge Supports 64KB IO Space Behind Bridge

Assume that a bridge is designed to support IO transactions initiated on the primary bus that may target locations within the first 64KB of IO space (IO locations 00000000h through 0000FFFFh) on the secondary side of the bridge. It ignores any primary side IO accesses over the 64KB address boundary. In other words, the bridge supports a 64KB IO space, but not a 4GB IO space, on the secondary side of the bridge.
In this case, the bridge designer must implement the IO Base and the IO Limit registers, but does not implement the IO Extension registers (i.e., the IO Base Upper 16-Bits and the IO Limit Upper 16-Bits registers).
The IO Base and IO Limit register pair comprise the global IO address decoder for all IO targets that reside on the secondary side of the bridge:
  1. Before the registers are initialized by the configuration software, they are first read to determine whether they support a 64KB or a 4GB IO space on the secondary side of the bridge. In this scenario, assume that the registers are hardwired to indicate that the bridge only supports a 64KB IO space.
  2. The configuration software then walks the secondary bus (and any subordinate buses it discovers) and assigns to each IO decoder it discovers an exclusive IO address range within the first 64KB of IO space. The subranges assigned to the devices are assigned in sequential blocks to make efficient use of IO space.
  3. The IO Base and Limit register pair are then initialized by the startup configuration software with the start and end address of the IO range that all IO devices that were discovered behind the bridge (on the secondary and on any subordinate buses) have been programmed to reside within. In this case, since the bridge only supports the first 64KB of IO space, the defined range will be a subset of the first 64KB of IO space.
  4. After they have been initialized, these two registers provide the bridge with the start and the end address of the IO address range to recognize for passing IO transactions through the bridge.
The bridge only supports the lower 64KB of IO space, but the IO address decoder comprised of the IO Base and Limit registers must perform a full IO address decode of address bits [31:2] to determine whether or not to accept an IO access on the primary bus and pass it to the secondary bus.
The format of the IO Base and IO Limit registers is illustrated in Figure 22-14 on page 815 and Figure 22-15 on page 815. Both registers have the same format:
  • the upper hex digit, bits [7:4], defines the most-significant hex digit of a 16- bit IO address;
  • the lower hex digit, bits [3:0], defines whether the bridge performs a 16-bit or 32-bit IO address decode.
In the scenario under discussion, the lower hex digit of both registers is hardwired with the value 0h, indicating that it performs a 16-bit IO address decode and therefore only supports addresses within the first 64KB of IO space.
Assume that the configuration software programs the upper digit of the IO Base register with the value 2h and the upper digit of the IO Limit register with the value 3h. This indicates that the start of the IO range to recognize is 2000h and the end address is 3FFFh (an 8KB block). As another example, assume that the upper digit in the base and limit registers are both set to 3h. The IO address range to recognize is then 3000h through 3FFFh (a 4KB block). In other words, this register pair defines the start address aligned on a 4KB address boundary, and the size, also referred to as the granularity, of the defined block is in increments of 4KB.
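In other words, the recognized range can be computed mechanically from the two upper digits. A small C sketch of that arithmetic (the function and variable names are invented for the example) is shown below:

    #include <stdint.h>
    #include <stdio.h>

    /* Derive the IO range a bridge recognizes from its IO Base and IO Limit
     * registers when only 16-bit decode is supported (lower digit = 0h):
     * bits [7:4] of each register supply the top hex digit of a 16-bit IO
     * address, the base is 4KB-aligned, and the limit ends at the top of its
     * 4KB block. */
    static void io_range_16(uint8_t io_base, uint8_t io_limit,
                            uint32_t *start, uint32_t *end)
    {
        *start = (uint32_t)(io_base  & 0xF0) << 8;             /* e.g. 20h -> 2000h */
        *end   = ((uint32_t)(io_limit & 0xF0) << 8) | 0x0FFF;  /* e.g. 30h -> 3FFFh */
    }

    int main(void)
    {
        uint32_t start, end;
        io_range_16(0x20, 0x30, &start, &end);   /* base digit 2h, limit digit 3h */
        printf("IO range %04Xh-%04Xh\n", (unsigned)start, (unsigned)end);  /* 2000h-3FFFh */
        return 0;
    }
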
It should be noted that, if there aren't any IO devices on the bridge's secondary side, the IO Limit register can be programmed with a numerically lower IO address than the IO Base register. The bridge will not pass any IO transactions latched on the primary side through to the secondary side, but will pass any IO transactions latched on the secondary side through to the primary side.

Figure 22-14: IO Base Register
Figure 22-15: IO Limit Register

Example. Assume that the IO Base is set to 2h and the IO Limit is set to 3h. The bridge is now primed to recognize any IO transaction on the primary bus that targets an IO address within the range consisting of 2000h through 3FFFh. Refer to Figure 22-16 on page 817.
Anytime that the bridge detects an IO transaction on the primary bus with an address inside the 2000h through 3FFFh range, it accepts the transaction and passes it through (because it's within the range defined by the IO Base and Limit registers and may therefore be for an IO device that resides behind the bridge).
Anytime that the bridge detects an IO transaction on the primary bus with an address outside the 2000h through 3FFFh range, it ignores the transaction (because the target IO address is outside the range of addresses assigned to IO devices that reside behind the bridge).
Anytime that the bridge detects an IO transaction on the secondary bus with an address inside the 2000h through 3FFFh range, it ignores the transaction (because the target address falls within the range assigned to IO devices that reside on the secondary side of the bridge).
Anytime that the bridge detects an IO transaction on the secondary bus with an address outside the 2000h through 3FFFh range, it accepts the transaction and passes it through to the primary side (because the target address falls outside the range assigned to IO devices that reside on the secondary side of the bridge, but it may be for an IO device on the primary side).
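The four cases above boil down to one test: is the address inside the secondary-side range, and on which bus was the transaction seen? A compact C sketch of that decision, using the example 2000h-3FFFh range, follows (the function name is invented):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Decide whether the bridge claims an IO transaction and passes it to its
     * opposite interface, given the programmed secondary-side IO range. */
    static bool bridge_forwards_io(uint32_t addr, bool seen_on_primary,
                                   uint32_t io_start, uint32_t io_end)
    {
        bool inside = (addr >= io_start) && (addr <= io_end);

        if (seen_on_primary)
            return inside;    /* pass downstream only if it targets the range
                                 assigned to devices behind the bridge */
        return !inside;       /* seen on the secondary bus: pass upstream only
                                 if it targets something outside that range */
    }

    int main(void)
    {
        printf("2500h on primary bus:   %s\n",
               bridge_forwards_io(0x2500, true,  0x2000, 0x3FFF) ? "pass downstream" : "ignore");
        printf("0400h on secondary bus: %s\n",
               bridge_forwards_io(0x0400, false, 0x2000, 0x3FFF) ? "pass upstream" : "ignore");
        return 0;
    }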

Figure 22-16: Example of IO Filtering Actions

Bridge Supports 4GB IO Space Behind Bridge

Assume a bridge is designed to recognize IO transactions initiated on the primary bus that target locations anywhere within the 4GB of IO space (IO locations 00000000h through FFFFFFFFh) on the downstream side of the bridge.
In this case, in addition to the IO Base and the IO Limit registers, the bridge designer must also implement the IO Extension registers (the IO Base Upper 16-bits and the IO Limit Upper 16-bits registers):
  • The IO Base register is initialized with the fourth digit of the 32-bit start IO address.
  • The IO Base Upper 16 bits register is initialized with the fifth through the eighth digits of the 32-bit start address of the range.
  • The IO Limit register is initialized with the fourth digit of the 32-bit end IO address.
  • The IO Limit Upper 16 bits register is initialized with the fifth through eighth digits of the 32-bit end address of the range.
The IO Base and IO Limit register pair comprise an IO address decoder. They are used as follows:
  1. Before the registers are initialized by the configuration software, they are read to determine if they are capable of supporting a 64KB or a 4GB IO address space behind the bridge. In this scenario, the Address Decode Type field (see Figure 22-14 on page 815) within each of the registers is hardwired (with a value of 1h) to indicate that a 4GB IO space is supported on the secondary side.
  2. The configuration software then walks the secondary bus (and any subordinate buses it discovers beneath the secondary bus) and assigns each IO device that it discovers an exclusive IO address range within the 4GB IO space. The sub-ranges assigned to the devices are assigned in sequential blocks to make efficient use of IO space.
  3. The IO Base and IO Base Extension (i.e., the IO Base Upper 16-bits) register pair is then initialized by the startup configuration software with the upper five digits of the 4KB-aligned, 32-bit start address of the IO range that all IO devices that were discovered behind the bridge (on the secondary and on any subordinate buses) have been programmed to reside within.
  4. The IO Limit and IO Limit Extension (i.e., the IO Limit Upper 16-bits) register pair is initialized with the 4KB-aligned end address of the range that the devices occupy.
In the scenario under discussion, since the bridge supports the entire 4GB IO space, the defined range is a subset of the overall 4GB IO space. After they have been initialized, these four registers provide the bridge with the start and the end address of the IO address range to recognize.
Since the bridge supports the entire 4GB IO space, the IO address decoder comprised of the four registers (Base and Limit registers plus their Extension registers) performs an IO address decode within address bits [31:12] to determine whether or not to pass an IO access detected on the primary bus through to the secondary bus and vice versa.

The format of the IO Base and IO Limit registers was illustrated earlier in Figure 22-14 on page 815 and Figure 22-15 on page 815. In the scenario under discussion, the lower hex digit of the Base and Limit registers is hardwired with the value 1h, indicating a 32-bit IO address decode, supporting address recognition within the entire 4GB IO space. Simply put, the IO Base Upper 16-bits and IO Limit Upper 16-bits registers are used to hold the upper four digits of the start and end IO address boundaries, respectively.
Assume that the configuration software programs the registers as follows:
  • Upper digit of the IO Base register = 2h.
  • IO Base Upper 16-bits register = 1234h.
  • Upper digit of the IO Limit register = 3h.
  • IO Limit Upper 16-bits register = 1235h.
This indicates a 72KB range consisting of:
  • start of IO range = 12342000h
  • end address = 12353FFFh.
As another example, assume the following:
  • Upper digit of the IO Base register = 3h.
  • IO Base Upper 16-bits register = 1234h.
  • Upper digit of the IO Limit register = 3h.
  • IO Limit Upper 16-bits register = 1234h.
This indicates a 4KB range consisting of:
  • start of IO range = 12343000h
  • end address = 12343FFFh.
In other words, the four registers define the start address aligned on a 4KB address boundary, and the size of the defined block is an increment of 4KB.
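The same arithmetic, extended with the Upper 16-Bits registers, reproduces the 12342000h-12353FFFh example above. A brief C sketch (names invented for the example) follows:

    #include <stdint.h>
    #include <stdio.h>

    /* Derive the 32-bit IO range when 4GB decode is supported (lower digit of
     * IO Base/Limit = 1h): the Upper 16-Bits registers hold address bits
     * [31:16], bits [7:4] of IO Base/Limit hold bits [15:12], and the range
     * is 4KB-aligned at both ends. */
    static void io_range_32(uint8_t io_base, uint16_t io_base_upper,
                            uint8_t io_limit, uint16_t io_limit_upper,
                            uint32_t *start, uint32_t *end)
    {
        *start = ((uint32_t)io_base_upper  << 16) | ((uint32_t)(io_base  & 0xF0) << 8);
        *end   = ((uint32_t)io_limit_upper << 16) | ((uint32_t)(io_limit & 0xF0) << 8) | 0x0FFF;
    }

    int main(void)
    {
        uint32_t start, end;
        io_range_32(0x21, 0x1234, 0x31, 0x1235, &start, &end);
        printf("IO range %08Xh-%08Xh\n", (unsigned)start, (unsigned)end);  /* 12342000h-12353FFFh */
        return 0;
    }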

Bridge's Prefetchable Memory Filter

Differs from PCI. Optional.

An Important Note From the Authors

Although the Express spec appears to support the concept of a bridge within a Root Complex or a Switch reading more data than is actually requested from areas of memory defined as prefetchable (see the spec statements included below), it is the opinion of the authors that a bridge within a Root Complex or a Switch will not do so. Our rationale is provided below.
In PCI. When a PCI bus master initiates a memory read transaction, it issues the start memory address but does not indicate how much data is to be read. The only exception is a single Data Phase memory read. In that case, the total amount of data to be read is represented by the Byte Enables that are presented in that Data Phase.
When the device acting as the target on the initiating bus receives the transaction request, the manner in which the request is handled depends on the device type (bridge or ultimate memory target), as well as the transaction type used by the bus master:
  • If the device acting as the target is the ultimate target of the read:
  • and the memory is well-behaved, prefetchable memory, then the target may perform internal prefetches (i.e., read-aheads) and queue up data to be supplied to the requester if the transaction ends up asking for the data. This is to enhance performance. If the transaction ends without all of the prefetched data being asked for, the remaining data in the target's read-ahead buffer should be discarded (unless the target can guarantee the continued freshness of the data).
  • and the memory is not prefetchable memory (e.g., it's a memory-mapped IO register set), then the memory target must wait until the Byte Enables are presented in each Data Phase and only read and supply the requested bytes. No prefetching is permitted.
  • If the device acting as the target is a bridge in the path to the target, the bridge latches the request and issues a Retry to the initiating master:
  • and the transaction type is Memory Read, there are two possibilities:
    • If the memory address is within a range defined as prefetchable memory, the bridge may turn the read into a burst read when it initiates the request on the other side of the bridge and prefetch data into a bridge buffer. When the original master then retries the transaction, the bridge sources data from the fast read-ahead buffer, yielding better performance. If the master ultimately doesn't consume all of the data, it is discarded by the bridge.
    • If the memory is not prefetchable memory (e.g., it's a memory-mapped IO register set), then no prefetching by the bridge is permitted when it re-initiates the read on the opposite side of the bridge.
  • and the transaction type is Memory Read Line (MRL), this tells the bridge that the master has specific knowledge that the memory range from the transaction's start address up to the end of the addressed line of memory space is prefetchable memory. Even if the bridge's prefetchable memory range registers indicate this is not prefetchable memory, the bridge may turn the read into a burst read when it initiates the request on the other side of the bridge and prefetch data up to the end of the current line into a bridge buffer. When the original master then retries the transaction, the bridge sources data from the fast read-ahead buffer yielding better performance. If the master ultimately doesn't consume all of the data, it is discarded by the bridge.
  • and the transaction type is Memory Read Multiple (MRM), this tells the bridge that the master has specific knowledge that the memory range from the transaction's start address and up to the end of the line immediately following the addressed line of memory space is prefetchable memory. Even if the bridge's prefetchable memory range registers indicate this is not prefetchable memory, the bridge may turn the read into a burst read when it initiates the request on the other side of the bridge and prefetch data across cache line boundaries into a bridge buffer. When the original master then retries the transaction, the bridge sources data from the fast read-ahead buffer yielding better performance. If the master ultimately doesn't consume all of the data, it is discarded by the bridge.
In PCI Express. When a PCI Express Requester issues a memory read request, it indicates the exact amount of data it wishes to read:
  • The First DW Byte Enable field in the request packet header indicates the byte(s) to be read from the first dword.
  • The Length field in the request packet header indicates the overall number of dwords in the transfer.
  • The Last DW Byte Enable field in the request packet header indicates the byte(s) to be read from the last dword.
Since the exact amount of requested data is known at the onset of a memory read request, there is no reason for prefetching to achieve better performance (as there is in PCI).
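As an illustration of that point, the exact byte count can be computed directly from those three header fields. The following sketch is illustrative only, not spec text; it assumes the usual encoding in which a Length field of 0 means 1024 dwords:

    #include <stdint.h>
    #include <stdio.h>

    static unsigned count_bits(uint8_t v)
    {
        unsigned n = 0;
        while (v) { n += v & 1u; v >>= 1; }
        return n;
    }

    /* Total bytes requested by a memory read, from the Length and the First/
     * Last DW Byte Enable header fields. For a single-dword request the Last
     * DW BE field is 0000b and only the First DW BE applies. */
    static unsigned read_request_bytes(unsigned length_dw, uint8_t first_be, uint8_t last_be)
    {
        if (length_dw == 0)
            length_dw = 1024;            /* Length field of 0 encodes 1024 dwords */
        if (length_dw == 1)
            return count_bits(first_be & 0xF);
        return count_bits(first_be & 0xF) + count_bits(last_be & 0xF) + 4 * (length_dw - 2);
    }

    int main(void)
    {
        /* A 3-dword read with all byte enables set in the first and last dwords. */
        printf("%u bytes\n", read_request_bytes(3, 0xF, 0xF));   /* 12 bytes */
        return 0;
    }
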
Spec References To Prefetchable Memory. The following Express spec 1.0a references represent all of its references to prefetching:
  • Page 33, line 5: "A PCI Express Endpoint requesting memory resources through a BAR must set the BAR's Prefetchable bit unless the range contains locations with read side-effects or locations in which the device does not tolerate write merging".
  • Page 33, line 8: "For a PCI Express Endpoint, 64-bit addressing must be supported for all BARs that have the prefetchable bit set. 32-bit addressing is permitted for all BARs that do not have the prefetchable bit set."
  • Page 52, line 13: "For each bit of the Byte Enables fields: a value of 0b indicates that the corresponding byte of data must not be written or, if non prefetchable, must not be read at the Completer."
  • Page 54, line 9: "This is really just a specific case of the rule that in a non-prefetchable space, non-enabled bytes must not be read at the Completer."
  • Page 265, line 26: "For example, if a Read is issued to prefetchable memory space and the Completion returns with a Unsupported Request Completion Status, perhaps due to a temporary condition, the initiator may choose to reissue the Read Request without side effects."
  • Page 326, line 5: "A PCI Express Endpoint requesting memory resources through a BAR must set the BAR's Prefetchable bit unless the range contains locations with read side-effects or locations in which the device does not tolerate write merging. It is strongly encouraged that memory-mapped resources be designed as prefetchable whenever possible. PCI Express devices other than legacy Endpoints must support 64-bit addressing for any Base Address register that requests prefetchable memory resources".
  • Page 328, line 1: "A PCI Express Endpoint requesting memory resources through a BAR must set the BAR's Prefetchable bit unless the range contains locations with read side-effects or locations in which the device does not tolerate write merging. It is strongly encouraged that memory-mapped resources be designed as prefetchable whenever possible. PCI Express devices other than legacy Endpoints must support 64-bit addressing for any Base Address register that requests prefetchable memory resources."
  • Page 329: "The Prefetchable Memory Base and Prefetchable Memory Limit registers must indicate that 64-bit addresses are supported, as defined in PCI Bridge 1.1." Please note that this is not so. Section 3.2.5.10 in the 1.1 bridge spec states "The Prefetchable Base Upper 32 Bits and Prefetchable Limit Upper 32 Bits registers are optional extensions to the Prefetchable Memory Base and Prefetchable Memory Limit registers."

Characteristics of Prefetchable Memory Devices

While the Memory Base and Limit registers are mandatory (to support memory-mapped IO behind the bridge), the Prefetchable Memory Base and Limit registers (and their extensions) are optional (there is no requirement for a bridge to support prefetchable memory on its downstream side). The PCI-to-PCI bridge specification recognizes the fact that while both groups are mapped into memory address space, memory devices and memory-mapped IO devices can have distinctly different operational characteristics.
An optional set of registers is provided in the bridge's configuration space that permits the configuration software to define the start and end address of the prefetchable memory space occupied by well-behaved memory devices behind the bridge. A mandatory register pair permits the configuration software to define the start and end address of the memory-mapped IO space occupied by memory-mapped IO devices behind the bridge.
Multiple Reads Yield the Same Data. A well-behaved memory device always returns the same data from a location no matter how many times the location is read from. In other words, reading from a memory device doesn't in any way alter the contents of memory. This is one of the characteristics of a prefetchable memory target.
Byte Merging Permitted In the Posted Write Buffer. A bridge incorporates a posted-write buffer that quickly absorbs data to be written to a memory device on the other side of the bridge. Since the initiating Requester is able to immediately complete a memory write and doesn't have to delay until the write to the memory device has actually been completed, posting yields better performance during memory write operations. The bridge would ensure that, before any subsequent memory read is permitted to propagate through the bridge, the bridge would flush its posted-write buffer to the memory device. Byte merging is permitted in a bridge's posted memory write buffer when handling writes to prefetchable memory (for more information, refer to "Byte Merging" on page 801).

Characteristics of Memory-Mapped IO Devices

Memory-mapped IO devices exhibit a different set of operational characteristics.
Read Characteristics. Performing a memory read from a memory-mapped IO location often has the effect of altering the contents of the location. As examples, one of the following may be true:
  • The location may be occupied by a memory-mapped IO status port. Reading from the location causes the IO device to deassert any status bits that were set in the register (on the assumption that they've been read and will therefore be dealt with by the device driver). If the read was caused by a prefetch and the prefetched data is never actually read by the device driver, then status information has just been discarded.
  • The location may be the front-end of a FIFO data buffer. Performing a read from the location causes the delivery of its current contents and the next data item is then automatically placed in the location by the IO device. The device assumes that the first data item has just been read by the device driver and sets up the next data item in the FIFO location. If the read was caused by a prefetch and the prefetched data is never actually read by the device driver, then the data has just been discarded.
Reads within an area of memory space occupied by memory-mapped IO devices must never result in prefetching by a bridge. A mandatory set of registers is provided that permits the configuration software to define the start and end address of the memory space that is occupied by memory-mapped IO devices that reside on the bridge's secondary side.
Write Characteristics. See "Byte Merging" on page 801.

Determining If Memory Is Prefetchable or Not

The configuration software determines that a memory target supports prefetching by testing the state of the Prefetchable attribute bit in the memory target's Base Address Register (see "Base Address Registers" on page 792 and "Prefetchable Attribute Bit" on page 795); a minimal test is sketched after the list below.
  • Prefetchable = 1 indicates that the memory is prefetchable. The memory target must be mapped into Prefetchable memory space using the bridge's Prefetchable Base and Limit configuration registers (if they are implemented).
  • Prefetchable = 0 indicates that it's not. In this case, the memory target must be mapped into memory-mapped IO space using the Memory Base and Limit registers.
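A minimal sketch of this test, assuming a hypothetical cfg_read32() configuration-read helper (not a real library call), might look as follows:

    /* Sketch: classify a memory BAR by its Prefetchable attribute (bit 3).
     * cfg_read32() is an assumed platform-specific config-space accessor. */
    #include <stdint.h>
    #include <stdbool.h>

    extern uint32_t cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t offset);

    static bool bar_is_prefetchable(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t bar_offset)
    {
        uint32_t bar = cfg_read32(bus, dev, fn, bar_offset);

        if (bar & 0x1)                 /* bit 0 set: this is an IO BAR, not memory */
            return false;
        return (bar & 0x8) != 0;       /* bit 3 is the Prefetchable attribute      */
    }

If the function returns true, the target is assigned a sub-range within the bridge's Prefetchable Base/Limit window; otherwise it is assigned a sub-range within the Memory Base/Limit window.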

Bridge Support For Downstream Prefetchable Memory Is Optional

If the bridge does not support Prefetchable memory on its secondary side, the Prefetchable Memory Base and Limit registers must be implemented as read-only registers that return zero when read, and the Prefetchable Memory Base Upper 32-bits and Prefetchable Memory Limit Upper 32-bits registers need not be implemented.

Must Support > 4GB Prefetchable Memory On Secondary Side

Support for prefetchable memory on the bridge's downstream side is optional. If the designer chooses to support this capability, then the following registers must be implemented to define the start and end address of the memory range occupied by prefetchable memory devices on the downstream side of the bridge:
  • Prefetchable Memory Base register.
  • Prefetchable Memory Limit register.
These two registers are used to define the start (base) and end (limit) address of the memory range and are illustrated in Figure 22-17 on page 827 and Figure 22-18 on page 828. Any address within the lower 4GB can be specified. The start address is 1MB-aligned and the size of the range is specified in 1MB increments. The Express spec states that all memory BARs for prefetchable memory must be implemented as 64-bit registers (see Figure 22-11 on page 797). To support this, the extensions to the Base and Limit registers must also be implemented:
  • Prefetchable Memory Base Upper 32-bits register.
  • Prefetchable Memory Limit Upper 32-bits register.
The 4-bit Address Decode Type field in the Base and Limit registers is hardwired to indicate that the extension registers are present.
The configuration software walks the secondary bus and any buses subordinate to the bridge and assigns each Prefetchable memory target a sub-range in a global overall range within the 2^64 memory space. After completing the address assignment process, the software then writes the upper eight hex digits of the range's 64-bit start address into the Prefetchable Memory Base Upper register and the next three hex digits into the upper three digits of the Base register. The upper eight hex digits of the range's 64-bit end address are written into the Prefetchable Memory Limit Upper register and the next three hex digits into the upper three digits of the Limit register.

As an example, assume that these four registers are set as follows:
  • FF000000h is written into the Prefetchable Memory Base Upper 32-bits register.
  • 123h is written into the upper three digits of the Base register.
  • FF000000h is written into the Prefetchable Memory Limit Upper 32-bits register.
  • 124h is written into the upper three digits of the Limit register.
This defines the Prefetchable memory address range as the 2MB range from FF00000012300000h through FF000000124FFFFFh. As another example, assume they are programmed as follows:
  • 00000230h is written into the Prefetchable Memory Base Upper 32-bits register.
  • 222h is written into the upper three digits of the Base register.
  • 00000230h is written into the Prefetchable Memory Limit Upper 32-bits register.
  • 222h is written into the upper three digits of the Limit register.
This defines the Prefetchable memory address range as the 1MB range from 0000023022200000h through 00000230222FFFFFh.
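The way the register fields compose into 64-bit start and end addresses can be expressed compactly. The sketch below is illustrative only (the 16-bit register values are passed in whole, with their lower digit masked off) and reproduces the first example above:

    /* Sketch: reconstruct the 64-bit prefetchable range from the Base/Limit
     * registers and their Upper 32-bits extensions.  Bits 15:4 of the 16-bit
     * Base/Limit registers supply address bits 31:20; the low 20 address bits
     * are forced to 0 (base) or FFFFF (limit). */
    #include <stdint.h>
    #include <stdio.h>

    static void prefetch_range(uint16_t base_reg,  uint32_t base_upper,
                               uint16_t limit_reg, uint32_t limit_upper,
                               uint64_t *start, uint64_t *end)
    {
        *start = ((uint64_t)base_upper  << 32) | ((uint64_t)(base_reg  & 0xFFF0) << 16);
        *end   = ((uint64_t)limit_upper << 32) | ((uint64_t)(limit_reg & 0xFFF0) << 16) | 0xFFFFF;
    }

    int main(void)
    {
        uint64_t s, e;
        prefetch_range(0x1230, 0xFF000000, 0x1240, 0xFF000000, &s, &e);
        printf("%016llXh - %016llXh\n", (unsigned long long)s, (unsigned long long)e);
        /* prints FF00000012300000h - FF000000124FFFFFh, matching the first example */
        return 0;
    }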
Figure 22-17: Prefetchable Memory Base Register
Figure 22-18: Prefetchable Memory Limit Register

Rules for Bridge Prefetchable Memory Accesses

The following rules apply to Prefetchable memory:
  1. Bridge support for Prefetchable memory on its secondary side is optional.
  2. If the bridge does not support Prefetchable memory on its secondary side, the Prefetchable Memory Base and Limit registers must be implemented as read-only registers that return zero when read.
  3. If the bridge does support prefetchable memory on its downstream side, it must implement the Prefetchable Memory Base and Limit registers, as well as the Prefetchable Memory Base and Limit Upper 32-bits registers (as indicated by hardwiring the first digit in the Base and Limit registers to a value of 1h).
  4. Memory transactions are forwarded from the primary to the secondary bus if the address is within the range defined by the Prefetchable Memory Base and Limit registers or that defined by the Memory Base and Limit registers (for memory-mapped IO).
  5. Memory transactions are forwarded from the secondary to the primary bus when the address is outside the ranges defined by the extended Prefetchable Memory Base and Limit registers and the Memory Base and Limit registers (for memory-mapped IO).
  6. When 2^64 memory is supported on the downstream side of the bridge, transactions targeting addresses within the address range specified by the Prefetchable Memory Base and Limit registers (and their extensions) are permitted to cross the 4GB boundary.
  7. The bridge designer must support memory access requests above the 4GB address boundary received by its downstream interface. Prior to the 1.1 PCI-to-PCI bridge spec, it was optional on both sides. This was changed to ensure that Requesters on the secondary side can access main memory above the 4GB address boundary.
  8. Assume that the bridge supports Prefetchable memory anywhere in 2^64 memory space on the downstream side, but the configuration software maps all Prefetchable memory behind the bridge below the 4GB boundary. In this case, the upper extensions of the Prefetchable Base and Limit registers must be set to zero and the bridge does not respond to memory access requests above the 4GB address boundary received on the upstream interface. Those received by the bridge's downstream interface would be passed to the upstream interface (in case the Requester is addressing main memory above the 4GB boundary).
  9. Assume that the bridge supports Prefetchable memory anywhere in 2^64 memory space on the downstream side and that the configuration software maps all Prefetchable memory on the downstream side above the 4GB boundary. In this case, the upper extensions of the Prefetchable Base and Limit registers contain non-zero values and the bridge responds only to prefetchable memory access requests received on its upstream interface that are above the 4GB address boundary and within the defined prefetchable memory address range.
  10. Assume that the bridge supports Prefetchable memory anywhere in 2^64 memory space behind the bridge and that the configuration software maps the Prefetchable memory on the bridge's downstream side into a space that straddles the 4GB boundary. In this case, the extension to the Prefetchable Base register is set to zero and the extension to the Limit register contains a non-zero value. When a memory request with an address below the 4GB boundary is detected on either interface, the bridge compares the address only to the Prefetchable Memory Base register. If the address is greater than or equal to the start address specified in the register, the address is in range. When a memory request with an address above the 4GB boundary is detected on either interface, the bridge compares the lower 32 bits of the address to the Limit register and the upper 32 bits of the address to the Limit Upper 32-bits register. If the address is less than or equal to the end address specified in the two registers, the address is in range (a hit test along these lines is sketched following this list).
  11. The bridge may be designed to assume that all memory accesses received by its downstream interface that are passed to the primary bus are prefetchable. This assumes that the destination of all memory reads traveling upstream is system memory (which is prefetchable). If a bridge makes this assumption, it must implement a device-specific bit in its configuration space that allows this ability to be disabled.
  12. Memory writes received by either of the bridge's interfaces are accepted into the bridge's downstream or upstream posted memory write buffer. As described in "Byte Merging" on page 801, the bridge is permitted to perform byte merging in the buffer for writes to prefetchable memory, but not to memory-mapped IO.
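The decode behavior described in rules 8 through 10, including the case where the range straddles the 4GB boundary, can be summarized in a short hit-test sketch (illustrative only, not spec code):

    /* Sketch: does 'addr' fall within the bridge's prefetchable window?
     * 'start' and 'end' are the 64-bit addresses assembled from the
     * Base/Limit registers and their Upper 32-bits extensions. */
    #include <stdint.h>
    #include <stdbool.h>

    static bool in_prefetch_range(uint64_t addr, uint64_t start, uint64_t end)
    {
        uint32_t start_upper = (uint32_t)(start >> 32);
        uint32_t end_upper   = (uint32_t)(end   >> 32);

        if (start_upper == 0 && end_upper != 0) {       /* window straddles the 4GB boundary    */
            if (addr < 0x100000000ULL)                  /* below 4GB: compare to the Base only  */
                return addr >= start;
            return addr <= end;                         /* above 4GB: compare to the Limit only */
        }
        return addr >= start && addr <= end;            /* window entirely below or above 4GB   */
    }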

Bridge's Memory-Mapped IO Filter

PCI-Compatible registers. Mandatory. The bridge designer is required to implement the Memory Base and Limit registers used to define a memory-mapped IO range. These two registers are used to define a range of memory occupied by memory-mapped IO devices that reside on the downstream side of the bridge. Figure 22-19 on page 831 and Figure 22-20 on page 831 illustrate the Memory Base and Limit registers. The lower digit of each register is hardwired to zero and the upper three digits are used to define the upper three hex digits of the eight-digit start and end addresses, respectively. Unlike the Prefetchable Base and Limit and IO Base and Limit register pairs, there are no Extension registers associated with the Memory Base and Limit register pair. This means that all memory-mapped IO devices in the system must reside in the lower 4GB of memory address space.
As an example, assume that the configuration software has written the following values to the Memory Base and Limit registers:
  • The upper three digits of the Memory Base register contain 555h.
  • The upper three digits of the Memory Limit register contain 678h .
This defines a 292MB memory-mapped IO region on the downstream side of the bridge starting at 55500000h and ending at 678FFFFFh.
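A corresponding sketch for the memory-mapped IO window (no Upper 32-bits extensions exist for this register pair, so the window is a pure 32-bit quantity) is shown below; with the 555h/678h example values it yields the 55500000h through 678FFFFFh range just described:

    /* Sketch: derive the 32-bit memory-mapped IO window from the Memory
     * Base and Limit registers (lower digit of each register is hardwired to 0). */
    #include <stdint.h>

    static void mmio_range(uint16_t mem_base, uint16_t mem_limit,
                           uint32_t *start, uint32_t *end)
    {
        *start = (uint32_t)(mem_base  & 0xFFF0) << 16;               /* 5550h -> 55500000h */
        *end   = ((uint32_t)(mem_limit & 0xFFF0) << 16) | 0xFFFFF;   /* 6780h -> 678FFFFFh */
    }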
Figure 22-19: Memory-Mapped IO Base Register
Figure 22-20: Memory-Mapped IO Limit Register

Bridge Command Registers

Differs from PCI. Mandatory.

Introduction

The bridge designer must implement two required command registers in the bridge's configuration Header region:
  • The Command register is the standard configuration Command register defined by the spec for any function. It is associated with the bridge's primary bus interface.
  • The Bridge Control register is an extension to the standard Command register and is associated with the operation of both of the bridge's bus interfaces.
These two registers are described in the next two sections.

Bridge Command Register

Differs from PCI. Mandatory. The Command register format, pictured in Figure 22-21 on page 832, is the same as that for a non-bridge function. Some of the bits, however, have different effects. Each of the bits is described in Table 22-6 on page 833.
Figure 22-21: Command Register
Table 22-6: Bridge Command Register Bit Assignment
BitAttributesDescription
0RWIO Address Space Decoder Enable. - 0. IO transactions received at the downstream side of a bridge that are moving in the upstream direction are not forwarded, and the bridge returns a Completion with 'Unsupported Request' Completion status. - 1. IO transactions received at the downstream side of a bridge that are moving in the upstream direction are forwarded from the secondary to the primary side of the bridge.
1RWMemory Address Space Decoder Enable. - Memory-mapped devices within the bridge: - 0. Memory decoder is disabled and Memory transactions targeting this device result in the bridge returning a Completion with 'Unsupported Request' Completion status. - 1. Memory decoder is enabled and Memory transactions targeting this device are accepted. - Memory transactions targeting a device on the upstream side of a bridge: - 0. Memory transactions received at the downstream side of a bridge are not forwarded to the upstream side and the bridge returns 'Unsupported Request' Completion status. - 1. Memory transactions received at the downstream side of a bridge that target a device residing on the upstream side of a bridge are forwarded from the secondary to the primary side of the bridge.
2RWBus Master. Controls the ability of a Root Port or a downstream Switch Port to forward memory or IO requests in the upstream direction. If this bit is 0, when a Root Port or a downstream Switch Port receives an upstream-bound memory request or IO request, it returns Unsupported Request (UR) status to the requester. This bit does not affect forwarding of Completions in either the upstream or downstream direction. - The forwarding of requests other than those mentioned above is not controlled by this bit. - Default value of this bit is 0.
Table 22-6: Bridge Command Register Bit Assignment (Continued)
BitAttributesDescription
3ROSpecial Cycles. Does not apply to PCI Express and must be hardwired to 0.
4ROMemory Write and Invalidate Enable. Does not apply to PCI Express and must be hardwired to 0.
5ROVGA Palette Snoop. Does not apply to PCI Express and must be hardwired to 0.
6RWParity Error Response. When forwarding a Poisoned TLP from Primary to Secondary: - The primary side must set the Detected Parity Error bit in the bridge Status register. - If the Parity Error Response bit in the Bridge Control register is set, the secondary side must set the Master Data Parity Error bit in the Secondary Status register. When forwarding a Poisoned TLP from Secondary to Primary: - The secondary side must set the Detected Parity Error bit in the Secondary Status register. - If the Parity Error Response bit in the Bridge Control register is set, the primary side must set the Master Data Parity Error bit in the bridge Status register. If the Parity Error Response bit is cleared, the Master Data Parity Error status bit in the bridge Status register is never set. The default value of this bit is 0.
7ROStepping Control. Does not apply to PCI Express. Must be hardwired to 0.
8RWSERR# Enable. When set, this bit enables the non-fatal and fatal errors detected by the bridge's primary interface to be reported to the Root Complex. The function reports such errors to the Root Complex if it is enabled to do so either through this bit or through the PCI Express specific bits in the Device Control register (see “Device Control Register” on page 905). The default value of this bit is 0 .
9ROFast Back-to-Back Enable. Does not apply to PCI Express and must be hardwired to 0.
Table 22-6: Bridge Command Register Bit Assignment (Continued)
BitAttributesDescription
10RWInterrupt Disable. Controls the ability of a bridge to generate INTx interrupt messages: - 0 = The bridge is enabled to generate INTx interrupt messages. - 1 = The bridge's ability to generate INTx interrupt messages is disabled. If the bridge had already transmitted any Assert_INTx emulation interrupt messages and this bit is then set, it must transmit a corresponding Deassert_INTx message for each assert message transmitted earlier. Note that INTx emulation interrupt messages forwarded by Root and Switch Ports from devices downstream of the Root or Switch Port are not affected by this bit. The default value of this bit is 0.
15:11Reserved. Read-only and must return zero when read.
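As an illustration of how configuration software might program the bits just described, the following sketch enables a bridge port's IO and memory decoders, upstream request forwarding, and error reporting. The cfg_read16()/cfg_write16() accessors are assumed helpers, not part of any particular API; offset 04h is the Command register in the PCI-compatible header:

    /* Sketch: enable a bridge port using Command register bits from Table 22-6. */
    #include <stdint.h>

    extern uint16_t cfg_read16(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t offset);
    extern void     cfg_write16(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t offset, uint16_t val);

    #define CMD_OFFSET       0x04
    #define CMD_IO_ENABLE    (1 << 0)   /* IO Address Space Decoder Enable          */
    #define CMD_MEM_ENABLE   (1 << 1)   /* Memory Address Space Decoder Enable      */
    #define CMD_BUS_MASTER   (1 << 2)   /* Bus Master (upstream request forwarding) */
    #define CMD_SERR_ENABLE  (1 << 8)   /* SERR# Enable                             */

    static void enable_bridge_port(uint8_t bus, uint8_t dev, uint8_t fn)
    {
        uint16_t cmd = cfg_read16(bus, dev, fn, CMD_OFFSET);

        cmd |= CMD_IO_ENABLE | CMD_MEM_ENABLE | CMD_BUS_MASTER | CMD_SERR_ENABLE;
        cfg_write16(bus, dev, fn, CMD_OFFSET, cmd);
    }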

Bridge Control Register

Differs from PCI. Mandatory. The Bridge Control register is a required extension to the bridge's Command register and is associated with the operation of both the primary and the secondary bridge interfaces. Figure 22-22 on page 835 illustrates this register and Table 22-7 on page 836 defines its bit assignment. Bits 8-through-11 were first defined in the 2.2 PCI spec.
Figure 22-22: Bridge Control Register
Table 22-7: Bridge Control Register Bit Assignment
BitAttributesDescription
0RWParity Error Response. When forwarding a Poisoned TLP from Primary to Secondary: - The primary side must set the Detected Parity Error bit in the bridge Status register. - If the Parity Error Response bit in the Bridge Control register is set, the secondary side must set the Master Data Parity Error bit in the Secondary Status register. When forwarding a Poisoned TLP from Secondary to Primary: - The secondary side must set the Detected Parity Error bit in the Secondary Status register. - If the Parity Error Response bit in the Bridge Control register is set, the primary side must set the Master Data Parity Error bit in the bridge Status register. If the Parity Error Response bit is cleared, the Master Data Parity Error status bit in the Secondary Status register is never set. The default value of this bit is 0.
1RWSERR# Enable. This bit controls the forwarding of ERR_COR (cor- rectable errors), ERR_NONFATAL (non-fatal errors), and ERR_FATAL (fatal errors) received on the secondary side to the primary side. Default value of this field is 0.
2RWISA Enable. See page 582 of the MindShare PCI book.
3RWVGA Enable. See page 608 of the MindShare PCI book.
4ROReserved. Hardwired to zero.
5ROMaster Abort Mode. Not used in Express and must be hardwired to zero.
6RWSecondary Bus Reset. Setting this bit to one triggers a hot reset on the Express downstream port. Port configuration registers must not be affected, except as required to update port status. Default value of this field is 0.
7ROFast Back-to-Back Enable. Not used in Express and must be hardwired to zero.
Table 22-7: Bridge Control Register Bit Assignment (Continued)
BitAttributesDescription
8ROPrimary Discard Timeout. Not used in Express and must be hardwired to zero.
9ROSecondary Discard Timeout. Not used in Express and must be hardwired to zero.
10RODiscard Timer Status. Not used in Express and must be hardwired to zero.
11RODiscard Timer SERR# Enable. Not used in Express and must be hardwired to zero.
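As an example of the Secondary Bus Reset bit in use, the sketch below pulses bit 6 of the Bridge Control register (offset 3Eh in the Type 1 header) to generate a hot reset on the downstream Link. The config-space accessors, the delay_ms() helper, and the 2ms hold time are all assumptions for illustration:

    /* Sketch: trigger a hot reset on the secondary (downstream) side. */
    #include <stdint.h>

    extern uint16_t cfg_read16(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t offset);
    extern void     cfg_write16(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t offset, uint16_t val);
    extern void     delay_ms(unsigned int ms);

    #define BRIDGE_CTL_OFFSET    0x3E
    #define SECONDARY_BUS_RESET  (1 << 6)

    static void secondary_bus_hot_reset(uint8_t bus, uint8_t dev, uint8_t fn)
    {
        uint16_t ctl = cfg_read16(bus, dev, fn, BRIDGE_CTL_OFFSET);

        cfg_write16(bus, dev, fn, BRIDGE_CTL_OFFSET, ctl | SECONDARY_BUS_RESET);
        delay_ms(2);                                    /* hold reset briefly (assumed duration) */
        cfg_write16(bus, dev, fn, BRIDGE_CTL_OFFSET, ctl & ~SECONDARY_BUS_RESET);
    }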

Bridge Status Registers

Introduction

The bridge contains two required status registers, each of which is associated with one of the two interfaces.

Bridge Status Register (Primary Bus)

Differs from PCI. Mandatory. Refer to Figure 22-23 on page 838 and Table 22-8 on page 838. This required register is completely compatible with the Status register definition for a non-bridge function (see "Status Register" on page 788) and only reflects the status of the bridge's primary interface.
If the Capabilities List bit (bit 4) is set to one, this indicates that the bridge implements the Capability Pointer register in byte 0 of dword 13 in its configuration Header (see Figure 22-13 on page 803). For a general description of the New Capabilities, refer to "Capabilities Pointer Register" on page 779. In subsequently traversing the New Capabilities list, software may discover that the bridge implements the Slot Numbering registers. For a description of this feature, refer to "Introduction To Chassis/Slot Numbering Registers" on page 859 and "Chassis and Slot Number Assignment" on page 861.

Table 22-8: Bridge Primary Side Status Register
BitAttributesDescription
3ROInterrupt Status. Indicates that the bridge itself had previously transmitted an interrupt request to its driver (that is, the function transmitted an interrupt message earlier in time and is awaiting servicing). Note that INTx emulation interrupts forwarded by Root and Switch Ports from devices downstream of the Root or Switch Port are not reflected in this bit. The default state of this bit is 0.
4ROCapabilities List. Indicates the presence of one or more extended capability register sets in the lower 48 dwords of the function's PCI-compatible configuration space. Since, at a minimum, all PCI Express functions are required to implement the PCI Express capability structure, this bit must be set to 1.
5RO66MHz-Capable. Does not apply to PCI Express and must be 0.
Table 22-8: Bridge Primary Side Status Register (Continued)
BitAttributesDescription
7ROFast Back-to-Back Capable. Does not apply to PCI Express and must be 0.
8RW1CMaster Data Parity Error. When forwarding a Poisoned TLP from Primary to Secondary: - The primary side must set the Detected Parity Error bit in the bridge Status register. - If the Parity Error Response bit in the Bridge Control register is set, the secondary side must set the Master Data Parity Error bit in the Secondary Status register. When forwarding a Poisoned TLP from Secondary to Primary: - The secondary side must set the Detected Parity Error bit in the Secondary Status register. - If the Parity Error Response bit in the Bridge Control register is set, the primary side must set the Master Data Parity Error bit in the bridge Status register. If the Parity Error Response bit in the bridge Command register is cleared, the Master Data Parity Error status bit in the bridge Status register is never set. The default value of this bit is 0.
10:9RODEVSEL Timing. Does not apply to PCI Express and must be 0 .
11RW1CSignaled Target Abort. This bit is set when the bridge's primary interface completes a received request by issuing a Completer Abort Completion Status. Default value of this field is 0.
12RW1CReceived Target Abort. This bit is set when the bridge's primary interface receives a Completion with Completer Abort Completion Status. Default value of this field is 0.
13RW1CReceived Master Abort. This bit is set when the bridge's primary interface receives a Completion with Unsupported Request Completion Status. Default value of this field is 0.
Table 22-8: Bridge Primary Side Status Register (Continued)
BitAttributesDescription
14RW1CSignaled System Error. This bit is set when the bridge's primary interface sends an ERR_FATAL (fatal error) or ERR_NONFATAL (non-fatal error) message (if the SERR Enable bit in the bridge Command register is set to one). The default value of this bit is 0.
15RW1CDetected Parity Error. This bit is set by the bridge's primary interface whenever it receives a Poisoned TLP, regardless of the state of the Parity Error Response bit in the bridge Command register. Default value of this bit is 0.
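Because the error-reporting bits in this register are RW1C (write-one-to-clear), software typically reads the register and writes the same value back: any bit that was set is cleared by the one written to it, and cleared bits are unaffected by the zeros. A minimal sketch, assuming hypothetical config-space accessors and the standard 06h Status register offset:

    /* Sketch: read the primary Status register and clear any RW1C bits found set. */
    #include <stdint.h>

    extern uint16_t cfg_read16(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t offset);
    extern void     cfg_write16(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t offset, uint16_t val);

    #define STATUS_OFFSET  0x06

    static uint16_t read_and_clear_status(uint8_t bus, uint8_t dev, uint8_t fn)
    {
        uint16_t sts = cfg_read16(bus, dev, fn, STATUS_OFFSET);

        cfg_write16(bus, dev, fn, STATUS_OFFSET, sts);   /* writing 1s back clears the RW1C bits */
        return sts;
    }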

Bridge Secondary Status Register

Differs from PCI. Mandatory. Refer to Figure 22-24 on page 841. With the exception of the Received System Error bit, this required register is completely compatible with the Status register definition for a non-bridge function (see "Status Register" on page 788) and only reflects the status of the secondary side. It should be noted that the Capabilities List bit (bit 4) is never implemented in this register.
While bit 14 is the Signaled System Error bit in the primary side Status register, it is the Received System Error bit in the Secondary Status register. When set, this bit indicates that SERR# was detected asserted on the secondary side. Writing a one to it clears the bit, while a zero doesn't affect it. Reset clears this bit.

Figure 22-24: Secondary Status Register
Table 22-9: Bridge Secondary Side Status Register
BitAttributesDescription
5RO66MHz-Capable. Does not apply to Express and must be 0.
7ROFast Back-to-Back Capable. Does not apply to PCI Express and must be 0.
Table 22-9: Bridge Secondary Side Status Register (Continued)
BitAttributesDescription
8RW1CMaster Data Parity Error. When forwarding a Poisoned TLP from Primary to Secondary: - The primary side must set the Detected Parity Error bit in the bridge Status register. - If the Parity Error Response bit in the Bridge Control register is set, the secondary side must set the Master Data Parity Error bit in the Secondary Status register. When forwarding a Poisoned TLP from Secondary to Primary: - The secondary side must set the Detected Parity Error bit in the Secondary Status register. - If the Parity Error Response bit in the Bridge Control register is set, the primary side must set the Master Data Parity Error bit in the bridge Status register. If the Parity Error Response bit in the Bridge Control register is cleared, the Master Data Parity Error status bit in the Secondary Status register is never set. The default value of this bit is 0.
10:9RODEVSEL Timing. Does not apply to Express and must be 0.
11RW1CSignaled Target Abort. This bit is set when the bridge's secondary interface completes a received request by issuing a Completer Abort Completion Status. Default value of this field is 0.
12RW1CReceived Target Abort. This bit is set when the bridge's secondary interface receives a Completion with Completer Abort Completion Status. Default value of this field is 0.
13RW1CReceived Master Abort. This bit is set when the bridge's secondary interface receives a Completion with Unsupported Request Completion Status. Default value of this field is 0.
14RW1CSignaled System Error. This bit is set when the bridge's secondary interface sends an ERR_FATAL (fatal error) or ERR_NONFATAL (non-fatal error) message (if the SERR Enable bit in the Bridge Control register is set to one). The default value of this bit is 0.
Table 22-9: Bridge Secondary Side Status Register (Continued)
BitAttributesDescription
15RW1CDetected Parity Error. This bit is set by the bridge's secondary interface whenever it receives a Poisoned TLP, regardless of the state of the Parity Error Response bit in the Bridge Control register. Default value of this bit is 0.

Bridge Cache Line Size Register

Differs from PCI. Mandatory.
This field is implemented by PCI Express devices as a read-write field for legacy compatibility purposes but has no impact on any PCI Express device functionality.

Bridge Latency Timer Registers

Differs from PCI. Mandatory.

Bridge Latency Timer Register (Primary Bus)

Differs from PCI. Mandatory. This register does not apply to PCI Express and must be read-only and hardwired to 0 .

Bridge Secondary Latency Timer Register

Differs from PCI. Mandatory. This register does not apply to PCI Express and must be read-only and hardwired to 0 .

Bridge Interrupt-Related Registers

Differs from PCI. Optional. Only required if the bridge itself generates interrupts.

Interrupt Line Register

A bridge may generate interrupts in the legacy PCI/PCI-X manner due to an internal, bridge-specific event. The interrupt handler is within the bridge's device driver. When the bridge detects such an internal event, it sends an INTx Assert message upstream towards the Root Complex (specifically, to the interrupt controller within the Root Complex).
As in PCI, the Interrupt Line register communicates interrupt line routing information. The register is read/write and must be implemented if the bridge contains a valid non-zero value in its Interrupt Pin configuration register (described in the next section). The OS or device driver can examine the bridge's Interrupt Line register to determine which system interrupt request line the bridge uses to issue requests for service (and, therefore, which entry in the interrupt table to "hook").
In a non-PC environment, the value written to this register is architecture-specific and therefore outside the scope of the specification.

Interrupt Pin Register

This read-only register identifies the legacy INTx interrupt Message (INTA, INTB, INTC, or INTD) the bridge transmits upstream to generate an interrupt. The values 01h-through-04h correspond to legacy INTx interrupt Messages INTA-through-INTD. A return value of zero indicates that the bridge doesn't generate interrupts using the legacy method. All other values (05h-FFh) are reserved. Note that, although the bridge may not generate interrupts via the legacy method, it may generate them via the MSI method (see "Determining if a Function Uses INTx# Pins" on page 343 for more information).
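A driver or OS component might combine the two registers as in the sketch below (cfg_read8() is an assumed helper; offsets 3Ch and 3Dh are the Interrupt Line and Interrupt Pin registers in the PCI-compatible header):

    /* Sketch: report whether the bridge uses legacy INTx and how it is routed. */
    #include <stdint.h>
    #include <stdio.h>

    extern uint8_t cfg_read8(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t offset);

    static void report_bridge_interrupt(uint8_t bus, uint8_t dev, uint8_t fn)
    {
        uint8_t pin  = cfg_read8(bus, dev, fn, 0x3D);   /* 00h = none, 01h-04h = INTA-INTD */
        uint8_t line = cfg_read8(bus, dev, fn, 0x3C);   /* system interrupt line routing   */

        if (pin == 0 || pin > 4)
            printf("Bridge does not generate legacy INTx interrupts\n");
        else
            printf("Bridge uses INT%c, routed to system interrupt %u\n", 'A' + pin - 1, line);
    }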

PCI-Compatible Capabilities

AGP Capability

The 2.2 spec assigns the Capability ID of 02h to AGP. The remainder of this section is only included as an example of a New Capability.
Refer to Figure 22-25 on page 845.
  • The AGP's Capability ID is 02h .
  • The second byte is the register that points to the register set associated with the next New Capability (if there is one).
  • Following the pointer register are two 4-bit read-only fields designating the major and minor rev of the AGP spec that the AGP device is built to (at the time of this writing, the major rev is 2h and the minor is 0h).
  • The last byte of the first dword is reserved and must return zero when read.
  • The next two dwords contain the AGP device's AGP Status and AGP Command registers.
The sections that follow define these registers and the bits within them.
For a detailed description of AGP, refer to the MindShare book entitled AGP System Architecture (published by Addison-Wesley).
Figure 22-25: Format of the AGP Capability Register Set

AGP Status Register

The AGP Status register is defined in Table 22-10 on page 846. This is a read-only register. Writes have no effect. Reserved or unimplemented fields or bits always return zeros when read.
Table 22-10: AGP Status Register (Offset CAP_PTR + 4)
BitsFieldDescription
31:24RQThe RQ field contains the maximum depth of the AGP request queue. Therefore, this number is the maximum number of transaction requests this device can manage. A "0" is interpreted as a depth of one, while FFh is interpreted as a depth of 256.
23:10ReservedWrites have no effect. Reads return zeros.
9SBAIf set, this device supports Sideband Addressing
8:6ReservedWrites have no effect. Reads return zeros.
54GIf set, this device supports addresses greater than 4GB.
4FWIf set, this device supports Fast Write transactions.
3ReservedWrites have no effect. Reads return a zero
2:0RATEThe RATE field is a bit map that indicates the data transfer rates supported by this device. AGP devices must report all that apply. The RATE field applies to AD, C/BE#, and SBA buses.
  Bit Set    Transfer Rate
  0          1X
  1          2X
  2          4X

AGP Command Register

The AGP Command register is defined in Table 22-11 on page 847. This is a read/writable register, with reserved fields hard-wired to zeros. All bits in the AGP Command register are cleared to zero after reset. This register is programmed during configuration. With one exception, the behavior of a device if this register is modified during runtime is not specified. If the AGP_Enable bit is cleared, the AGP master is not allowed to initiate a new request.
Table 22-11: AGP Command Register (Offset CAP_PTR + 8)
BitsFieldDescription
31:24RQ_DepthMaster: The RQ_DEPTH field must be programmed with the maximum number of transaction requests the master is allowed to enqueue into the target. The value programmed into this field must be equal to or less than the value reported by the target in the RQ field of its AGP Status Register. A "0" value indicates a request queue depth of one entry, while a value of FFh indicates a request queue depth of 256. Target: The RQ_DEPTH field is reserved.
23:10ReservedWrites have no effect. Reads return zeros.
9SBA_EnableWhen set, the Sideband Address mechanism is enabled in this device.
8AGP_EnableMaster: Setting the AGP_Enable bit allows the master to initiate AGP operations. When cleared, the master cannot initiate AGP operations. Also when cleared, the master is allowed to stop driving the SBA port. If bits 1 or 2 are set, the master must perform a re-synch cycle before initiating a new request. Target: Setting the AGP_Enable bit allows the target to accept AGP operations. When cleared, the target ignores incoming AGP operations. The target must be completely configured and enabled before the master is enabled. The AGP_Enable bit is the last to be set. Reset clears this bit.
7:6ReservedWrites have no effect. Reads return zeros.
Table 22-11: AGP Command Register (Offset CAP_PTR + 8) (Continued)
BitsFieldDescription
54GMaster: Setting the 4G bit allows the master to initiate AGP requests to addresses at or above the 4GB address boundary. When cleared, the master is only allowed to access addresses in the lower 4 GB of addressable space. Target: Setting the 4G bit enables the target to accept AGP DAC (Dual-Address Commands) commands, when bit 9 is cleared. When bits 5 and 9 are set, the target can accept a Type 4 SBA command and utilize A[35:32] of the Type 3 SBA command.
4FW_EnableWhen this bit is set, memory write transactions initiated by the core logic will follow the fast write protocol. When this bit is cleared, memory write transactions initiated by the core logic will follow the PCI protocol.
3ReservedWrites have no effect. Reads return zeros
2:0Data_RateOnly one bit in the Data_Rate field must be set, indicating the maximum data transfer rate supported. The same bit must be set in both the master and the target.
  Bit Set    Transfer Rate
  1          2X
  2          4X

Vital Product Data (VPD) Capability

Introduction

The 2.1 spec defined the optional Vital Product Data as residing in a PCI function's expansion ROM.
The 2.2 spec has deleted this information from the ROM and instead places the VPD (if present) in a function's PCI configuration register space (see "Capabilities Pointer Register" on page 779). This section describes the 2.2 implementation of the VPD and provides an example from the 2.2 spec.

It's Not Really Vital

It's always brought a smile to my face that despite its name, the VPD has never been vital. It's always been named "Vital" in the spec, but its content was not initially defined. Then in the 2.1 spec, although vital, it was defined as residing in a function's ROM, but its inclusion was optional. The 2.2 spec has now moved it from the ROM to the configuration space, but it's still optional.

What Is VPD?

The configuration registers present in a PCI function's configuration Header region (the first 16 dwords of its configuration space) provide the configuration software with quite a bit of information about the function. However, additional information such as
  • a board's part number
  • the EC (Engineering Change) level of a function
  • the device's serial number
  • an asset tag identifier
could be quite useful in repair, tech support, or asset management environments. If present, the VPD list provides this type of information.

Where Is the VPD Really Stored?

It is intended that the VPD would reside in a device such as a serial EEPROM associated with the PCI function. The configuration access mechanism described in the next section defines how this information would be accessed via the PCI function's VPD feature registers.

VPD On Cards vs. Embedded PCI Devices

Each add-in card may optionally contain VPD. If it's a multifunction card, only one function may contain VPD or each function may implement it. Embedded functions may or may not contain VPD.

How Is VPD Accessed?

Figure 22-26 on page 851 illustrates the configuration registers that indicate the presence of VPD information and permit the programmer to access it. The Capability ID of the VPD registers is 03h, while the registers used to access the VPD data consist of the VPD Address and Data registers in conjunction with the one-bit Flag register. The programmer accesses the VPD information using the procedures described in the following two sections.

Reading VPD Data. Use the following procedure to read VPD data:

  1. Using a PCI configuration write, write the dword-aligned VPD address into the Address register and simultaneously set the Flag bit to zero.
  2. Hardware then reads the indicated dword from VPD storage and places the four bytes into the Data register. Upon completion of the operation, the hardware sets the Flag bit to one.
  3. When software sees the Flag bit set to one by the hardware, it can then perform a PCI configuration read to read the four VPD bytes from the Data register.
If either the Address or Data registers are written to prior to hardware setting the Flag bit to one, the results of the read are unpredictable.
Writing VPD Data. Use the following procedure to write VPD data. Please note that only Read/Write VPD Data items may be written to.
  1. Write four bytes of data into the Data register.
  2. Write the dword-aligned VPD address into the Address register and simultaneously set the Flag bit to one.
  3. When software detects that the Flag bit has been cleared to zero by hardware, the VPD write has been completed.
If either the Address or Data registers are written to prior to hardware clearing the Flag bit to zero, the results of the VPD write are unpredictable.
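The two procedures map naturally onto a pair of helper routines. In the sketch below, cap is the configuration-space offset of the VPD capability (ID 03h), the 16-bit Address register (with the Flag in its most-significant bit) occupies the upper half of dword 0, and the Data register is dword 1; the config-space accessors are assumed helpers:

    /* Sketch of the VPD read and write procedures described above. */
    #include <stdint.h>

    extern uint16_t cfg_read16(uint8_t b, uint8_t d, uint8_t f, uint8_t off);
    extern uint32_t cfg_read32(uint8_t b, uint8_t d, uint8_t f, uint8_t off);
    extern void     cfg_write16(uint8_t b, uint8_t d, uint8_t f, uint8_t off, uint16_t v);
    extern void     cfg_write32(uint8_t b, uint8_t d, uint8_t f, uint8_t off, uint32_t v);

    #define VPD_FLAG  0x8000u

    static uint32_t vpd_read_dword(uint8_t b, uint8_t d, uint8_t f, uint8_t cap, uint16_t addr)
    {
        cfg_write16(b, d, f, cap + 2, addr & 0x7FFC);              /* Flag = 0 starts a read       */
        while ((cfg_read16(b, d, f, cap + 2) & VPD_FLAG) == 0)
            ;                                                      /* wait for hardware: Flag -> 1 */
        return cfg_read32(b, d, f, cap + 4);                       /* four bytes of VPD data       */
    }

    static void vpd_write_dword(uint8_t b, uint8_t d, uint8_t f, uint8_t cap,
                                uint16_t addr, uint32_t data)
    {
        cfg_write32(b, d, f, cap + 4, data);                       /* load the Data register first */
        cfg_write16(b, d, f, cap + 2, (addr & 0x7FFC) | VPD_FLAG); /* Flag = 1 starts the write    */
        while ((cfg_read16(b, d, f, cap + 2) & VPD_FLAG) != 0)
            ;                                                      /* Flag -> 0: write complete    */
    }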
Rules That Apply To Both Read and Writes. The following rules apply to both VPD data reads and writes:
  1. Once a VPD read or write has been initiated, writing to either the Address or Data registers prior to the point at which the hardware changes the state of the Flag bit yields unpredictable results.
  2. Each VPD data read or write always encompasses all four bytes within the VPD dword indicated in the Address register.
  3. The least-significant byte in the Data register corresponds to the least-significant byte in the indicated VPD dword.
  4. The initial values in the Address and Data registers after reset are indeterminate.
  5. Reading or writing data outside the scope of the overall VPD data structure is not allowed. The spec doesn't say what the result will be if you do it, so it is hardware design-specific.
  6. The values contained in the VPD are only stored information and have no effect upon the device.
  7. The two least-significant bits in the Address register must always be zero (i.e., it is illegal to specify an address that is not aligned on a dword address boundary).
Figure 22-26: VPD Capability Registers
  Dword 0: bit 31 = Flag (F), bits 30:16 = VPD Address register, bits 15:8 = Pointer to next Capability, bits 7:0 = Capability ID (03h)
  Dword 1: bits 31:0 = VPD Data register

VPD Data Structure Made Up of Descriptors and Keywords

As mentioned earlier, the VPD actually consists of a data structure accessed using the VPD Address and Data registers. The individual data items that comprise the VPD data structure are themselves small data structures known as descriptors. The basic format of two of the descriptors used in the VPD was first defined in the version 1.0a ISA Plug and Play spec. For more information about this spec, refer to the MindShare book entitled Plug and Play System Architecture (published by Addison-Wesley). The two ISA-like descriptor types are:
  • Identifier String descriptor. This descriptor contains the alphanumeric name of the card or embedded device. If the VPD is implemented, this descriptor is mandatory and is always the first one in the VPD. It is illustrated in Table 22-13 on page 853.
  • End Tag descriptor. If the VPD is implemented, this descriptor is mandatory and is used to identify the end of VPD data structure. Its value is always 78h .
In addition to these two descriptors, the 2.2 spec has defined two new descriptor types referred to as:
  • VPD-R descriptor. This descriptor type identifies the start and overall length of a series of one or more read-only keywords within the VPD data structure. The last keyword in the list of read-only keywords must be the Checksum keyword. This checksum encompasses the VPD from its first location to the end of the read-only area. A detailed description of this descriptor can be found in "VPD Read-Only Descriptor (VPD-R) and Keywords" on page 853.

  • VPD-W descriptor. If used, this optional descriptor type is used to identify the start and overall length of the read/write descriptors within the VPD data structure. A detailed description of this descriptor can be found in "VPD Read/Write Descriptor (VPD-W) and Keywords" on page 856.
The basic format of the overall VPD data structure is illustrated in Table 22-12 on page 852. It has the following characteristics:
  1. The VPD always starts with an Identifier String descriptor, followed by an optional list of one or more read-only VPD keywords.
  2. The list of read-only keywords always begins with the VPD-R descriptor and ends with the Checksum keyword.
  3. Immediately following the list of read-only keywords is an optional list of read/write keywords. If present, the read-write keyword list is prefaced with the VPD-W descriptor. Because the VPD read-write keywords can be altered, there is no checksum at the end of the read/write keywords.
  4. The overall VPD data structure is always terminated by a special descriptor known as the End Tag. Its value is always 78h.
Table 22-12: Basic Format of VPD Data Structure
Typical Descriptor ListComments
String Identifier DescriptorAlways the first entry.
Read-Only DescriptorHeads the list of read-only keywords.
Read-Only KeywordList of Read-Only keywords.
Read-Only Keyword
Read-Only Keyword
Checksum Keyword
Read/Write DescriptorHeads the list of read-write keywords.
Read/Write KeywordList of Read/Write keywords.
Read/Write Keyword
End Tag descriptorAlways used to indicate the end of the VPD. Its value is always 78h.
Table 22-13: Format of the Identifier String Tag
ByteDescription
0Must be 82h.
1Least-significant byte of identifier string length (the length encom- passes bytes 3-through-n).
2Most-significant byte of identifier string length (the length encom- passes bytes 3-through-n).
3-through-nASCII name of function.

VPD Read-Only Descriptor (VPD-R) and Keywords

Table 22-14 on page 853 illustrates the format of the VPD-R descriptor. As mentioned earlier, this descriptor begins the list of one or more read-only keywords and indicates the length of the list. Each keyword is a minimum of four bytes in length and has the format illustrated in Table 22-15 on page 854. The read-only keywords currently-defined are listed in Table 22-16 on page 854.
Table 22-14: Format of the VPD-R Descriptor
ByteDescription
0Must be 90h.
1Least-significant byte of read-only keyword list length (the length encompasses bytes 3-through-n).
2Most-significant byte of read-only keyword list length (the length encompasses bytes 3-through-n).
3-through-nList of Read-Only keywords.
Table 22-15: General Format of a Read or a Read/Write Keyword Entry
Byte(s)Description
0 and 1ASCII Keyword (see Table 22-16 on page 854 and Table 22-20 on page 856).
2Length of Keyword field (encompassing bytes 3-through-n).
3-through-nKeyword data field.
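Given this format, a keyword list can be walked with a simple loop. The sketch below (illustrative only) prints each keyword and the size of its data field; buf would hold the body of a VPD-R or VPD-W descriptor:

    /* Sketch: walk a list of keyword entries laid out per Table 22-15
     * (two ASCII keyword bytes, one length byte, then the data field). */
    #include <stdint.h>
    #include <stdio.h>

    static void walk_keywords(const uint8_t *buf, uint32_t area_len)
    {
        uint32_t pos = 0;

        while (pos + 3 <= area_len) {
            char    kw[3] = { (char)buf[pos], (char)buf[pos + 1], '\0' };
            uint8_t len   = buf[pos + 2];

            printf("keyword %s, %u data byte(s)\n", kw, len);
            pos += 3u + (uint32_t)len;              /* advance past the data field */
        }
    }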
Table 22-16: List of Read-Only VPD Keywords
ASCII Read-Only KeywordDescription of Keyword Data Field
PNDevice Part Number in ASCII.
ECEngineering Change level (alphanumeric) of device in ASCII.
MNManufacturer ID in ASCII.
SNSerial Number (alphanumeric) in ASCII.
VxVendor-Specific field (alphanumeric) in ASCII. "x" can be any value 0-through-Z.
CPExtended Capability. If present, this keyword indicates that the function implements an additional New Capability within its IO or memory space. See Table 22-17 on page 855 for a complete description.
RVChecksum. See Table 22-18 on page 855 for complete description.
Table 22-17: Extended Capability (CP) Keyword Format
ByteDescription
0New Capability ID.
1Index of Base Address Register (value between 0 and 5) that points to space containing this capability.
2Least-significant byte of offset within BAR's range where this New Capability’s register set begins.
3Most-significant byte of offset within BAR's range where this New Capability’s register set begins.
Table 22-18: Format of Checksum Keyword
ByteDescription
0Checksum from start of VPD up to and including this byte. Checksum is correct if sum of all bytes equals zero.
1Reserved.
2Reserved.
3-through-nReserved read-only space (as much as desired).

Is Read-Only Checksum Keyword Mandatory?

The spec doesn't say if the Checksum is mandatory, but it is the author's opinion that it is. In other words, even if the VPD contained no other read-only keywords, it must contain the VPD-R descriptor followed by the Checksum keyword. This provides the programmer with the checksum for the portion of the VPD that encompasses the String Identifier descriptor, the VPD-R descriptor and the Checksum keyword itself. In other words, it provides the checksum for everything other than the read-write portion of the VPD. It stands to reason the portion of the VPD that can be written to should not be included within the checksummed area.
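Verifying the checksum is straightforward: sum every byte from the start of the VPD through the checksum byte itself (modulo 256) and confirm the result is zero. A minimal sketch, where rv_data_offset is the offset of the RV keyword's first data byte:

    /* Sketch: verify the RV checksum per Table 22-18. */
    #include <stdint.h>
    #include <stdbool.h>

    static bool vpd_checksum_ok(const uint8_t *vpd, uint32_t rv_data_offset)
    {
        uint8_t sum = 0;

        for (uint32_t i = 0; i <= rv_data_offset; i++)
            sum = (uint8_t)(sum + vpd[i]);          /* modulo-256 running sum       */

        return sum == 0;                            /* correct when the sum is zero */
    }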

VPD Read/Write Descriptor (VPD-W) and Keywords

The VPD may optionally contain a list of one or more read/write keyword fields. If present, this list begins with the VPD-W descriptor which indicates the start and length of the read/write keyword list. There is no checksum stored at the end of the read-write keyword list.
Table 22-19 on page 856 illustrates the format of the VPD-W descriptor and Table 22-20 on page 856 provides a list of the currently-defined read/write keyword fields.
Table 22-19: Format of the VPD-W Descriptor
Byte(s)Description
0Must be 91h
1Least-significant byte of read/write keyword list length (the length encompasses bytes 3-through-n).
2Most-significant byte of read/write keyword list length (the length encompasses bytes 3-through-n).
3-through-nList of Read/Write keywords.
Table 22-20: List of Read/Write VPD Keywords
ASCII Read/Write KeywordDescription of Keyword Data Field
VxVendor-Specific (alphanumeric in ASCII). "x" may be any character from 0-through-Z.
YAAsset Tag Identifier. ASCII alphanumeric code supplied by system owner.
YxSystem-specific alphanumeric ASCII item. "x" may be any character from 0-through-9 and B-through-Z.
RWRemaining read/write area. Identifies the unused portion of the r/w space. The description in the spec is very confusing and defies interpretation by the author (maybe I'm just being thick-headed).

Example VPD List

Table 22-21 on page 857 contains the sample VPD data structure provided in the spec. The author has made a few minor corrections, so it doesn't match the one in the spec exactly. In the draft version of the spec, the 3rd row, last column contained "ABC Super..." etc. and the offset in VPD-R Tag row was wrong. I fixed it by adjusting the offsets in the 1st column. It was fixed in the final version of the 2.2 spec by changing the Product Name to "ABCD Super...".
Table 22-21: Example VPD List
Offset (decimal)ItemValue
0String ID Tag82h
1-2String length (32d)0020h (32d)
3-34Product name in ASCII"ABC Super-Fast Widget Controller"
Start of VPD Read-Only Keyword Area
35VPD-R Tag. Identifies start and length of read- only keyword area within VPD.90h
36-37Length of read-only keyword area.5Ah (90d)
38-39Read-only Part Number keyword."PN"
40Length of Part Number data field.08h (8d)
41-48Part Number in ASCII."6181682A"
49-50Read-Only Engineering Change (EC) level key- word."EC"
51Length of EC data field.0Ah (10d)
52-61EC data field."4950262536"
62-63Read-only Serial Number keyword."SN"
64Serial Number length field.08h (8d)
65-72Serial Number data field."00000194"
Table 22-21: Example VPD List (Continued)
Offset (decimal)ItemValue
73-74Read-only Manufacturer ID keyword."MN"
75Manufacturer ID length field.04h (4d)
76-79Manufacturer ID"1037"
80-81Read-only Checksum keyword."RV"
82Length of reserved read-only VPD area.2Ch (44d)
83Checksum for bytes 0-through-83.Checksum.
84-127Reserved read-only area.
Start of VPD Read/Write Keyword Area
128VPD-W Tag91h
129-130Length of read/write keyword area007Eh (126d)
131-132Read/Write Vendor-Specific Keyword."V1"
133Vendor-specific data field length.05h(5d)
134-138Vendor-specific data field."65A01"
139-140System-specific keyword."Y1"
141System-specific data field length.0Dh (13d)
142-154System-specific data field."Error Code 26"
155-156Remaining Read/Write area keyword."RW"
157Length of remaining read/write area61h(97d)
158-254Remainder of read/write area.reserved.
255End Tag78h

Introduction To Chassis/Slot Numbering Registers

Assuming that the Capabilities List bit is set in the bridge's primary Status register, the bridge implements the Capabilities Pointer register (see "Capabilities Pointer Register" on page 779). When software traverses the linked list of New Capability register sets for a bridge associated with a Root Port or a switch downstream port, it may encounter the Slot Numbering registers (if this is the bridge to an expansion chassis).
Figure 22-27 on page 859 pictures the Slot Numbering register set. It consists of the registers described in Table 22-22 on page 859. For additional information, refer to "Chassis and Slot Number Assignment" on page 861.
Figure 22-27: Chassis and Slot Number Registers
Table 22-22: Slot Numbering Register Set
RegisterDescription
Capability IDRead-Only. 04h identifies this as the Slot Numbering register set.
Next Capability PointerRead-Only. 00h= Indicates that this is the last register set in the linked New Capabilities list. Non-zero value = dword-aligned pointer to the next register set in the linked list.
Table 22-22: Slot Numbering Register Set (Continued)
RegisterDescription
Expansion SlotRead-Only, automatically loaded by hardware after reset. The configuration software uses the value in this register to determine the number of expansion card slots present in the chassis. The spec doesn't define where the hardware obtains this information. It could read a set of strapping pins on the trailing-edge of reset, or could obtain the information from a serial EEPROM.
Chassis NumberRead/Write. The value in this register identifies the chassis number assigned to this chassis. At reset time, this register may: - be pre-loaded with 00h, or - be implemented as a non-volatile register that "remembers" the chassis number assigned during a previous platform configuration. The configuration software will initialize all upstream bridges within the same chassis with the same Chassis Number and must guarantee that each chassis is assigned a mutually-exclusive Chassis Number. A bridge may implement the Chassis/Slot numbering registers and yet may not have any expansion card slots residing beneath it. The Chassis Number register may be cleared to zero by reset, or may be non-volatile (i.e., the current contents of the register will survive resets and power cycles). If it is non-volatile, its initial state after the first power up will be zero. When the configuration software detects zero in an expansion chassis' Chassis Number register, it must assign a number to it. Zero is not a valid Chassis Number for an expansion chassis because Chassis Zero is reserved for the card slots embedded on the system board.

Chassis and Slot Number Assignment

Problem: Adding/Removing Bridge Causes Buses to Be Renumbered

The best way to start this discussion is to illustrate the problem with an example. Assume the following set of conditions:
  1. The system has several Express add-in card slots on the bus in the Root Complex.
  2. There are one or more Root Ports and one or more of these Root Ports are connected to chasses, each of which has add-in card slots on the downstream side of the chassis' Switch upstream port (i.e., the Switch embedded in the chassis).
  3. The system was shipped as described in items one and two and no cards have been added or removed.
  4. Diagnostic software has detected a problem with an add-in card in one of the slots.
Now, the question: When the software displays a message to identify the bad card to the end user, how will it identify the location of the card slot to the end user? The software knows the following:
  • The bus number that the device resides on.
  • Which device number is assigned to the device.
So, let's say that the software identifies the bad boy to the end user by displaying its location using the bus and device numbers and let's say that someone at the factory was nice enough to physically label each card slot using that information (bus and device number). That would work just fine-as long as no one installs or removes a card that has a bridge on it. Remember that the configuration software discovers bridges each time that the machine is restarted and assigns a bus number to each bridge's secondary bus. In other words, if a bus is added or removed, that can change the bus numbers assigned to a number of the buses. This would result in the labels on the card slots being wrong and the end user wouldn't know it.

If Buses Added/Removed, Slot Labels Must Remain Correct

As stated in this section's heading, the addition or removal of a bus must not render the physical slot labels incorrect. This requirement highlights that the bus number cannot be used as part of the slot labeling (because it can change).

The only exception would be Bus 0, which cannot be removed and is always assigned bus number 0.

Definition of a Chassis

As defined in the 1.1 PCI-to-PCI bridge specification, there are two types of chasses:
  • Main Chassis-Refer to Figure 22-28 on page 862. These add-in card slots are connected to Root Ports and are not removable. These card slots do not present a problem in that the physical labeling of the slots is always correct.
  • Expansion Chassis-Refer to Figure 22-33 on page 869. An Expansion Chassis consists of a group of one or more buses each with card slots and the entire group can be installed in or removed from the system as a single entity. The slots within an expansion chassis are numbered sequentially and are identified by chassis number and slot number.
Figure 22-28: Main Chassis

Chassis/Slot Numbering Registers

PCI-Compatible Chassis/Slot Numbering Register Set. "Introduction To Chassis/Slot Numbering Registers" on page 859 introduced the PCI-compatible configuration registers associated with chassis and slot numbering:
  • The Chassis Number register (see Figure 22-27 on page 859 and Table 22-22 on page 859). The configuration software assigns a non-zero chassis number to each upstream bridge that implements the Slot Numbering capability register set and that has a non-zero value in its Expansion Slot register (indicating the number of slots implemented within the chassis).
  • The Expansion Slot register (see Figure 22-29 on page 864 and Table 22-23 on page 864). The Expansion Slot register is preloaded with the indicated information by hardware before the configuration software is executed. As an example, the bridge could sample a set of strapping pins on the trailing edge of reset to determine the contents of the Expansion Slot register. A bridge may implement the Chassis/Slot numbering registers and yet may not have any expansion slots on its secondary bus; a value of 0 indicates that no slots are implemented on the downstream side of the bridge.
Express-Specific Slot-Related Registers. In addition to these PCI-compatible registers, the following PCI Express-specific configuration registers are also involved in the process:
  • The Slot Implemented bit in the PCI Express Capabilities register (see Figure 22-31 on page 865). This bit is only implemented in the bridge associated with a Root Port or a downstream Switch Port. If this bit is hardwired to one, this indicates that the downstream port is connected to an add-in slot rather than to an embedded device or to a disabled link.
  • The Physical Slot Number field in the PCI Express Slot Capability register (see Figure 22-30 on page 865). This hardware initialized field indicates the physical slot number attached to this Port. The assigned slot number must be globally unique within this chassis. This field must be set to 0 for a port that is connected to a device that is either integrated on the system board or within the same silicon as the Switch device or the Root Port.
The Chassis and Expansion Slot registers must be implemented in each upstream bridge in an expansion chassis that has expansion slots on its secondary bus.
Figure 22-29: Expansion Slot Register
Table 22-23: Expansion Slot Register Bit Assignment
Bits 7:6: Reserved. Read-only and must always return zero when read.
Bit 5: First-In-Chassis bit. This bit must be set to one in the first upstream bridge within each expansion chassis. This is defined as follows:
    — If there is only one expansion chassis and it contains only one upstream bridge with slots on its secondary side, that bridge is the First-In-Chassis.
    — If an expansion chassis contains a hierarchy of bridges springing from one parent upstream bridge (see Figure 22-33 on page 869), the parent upstream bridge is First-In-Chassis, while the other upstream bridges have the First-In-Chassis bit cleared to zero.
Bits 4:0: Number of Expansion Slots on the bridge's secondary bus. If there aren't any expansion slots on the bridge's secondary bus, this field must be hardwired to zero.

Figure 22-30: Slot Capability Register
Figure 22-31: PCI Express Capabilities Register

Two Examples

First Example. Figure 22-32 on page 867 illustrates a system wherein:
  • The Root Complex has four Root Ports that are connected to add-in slot connectors, and one Root Port connected to an embedded device. With the exception of the Root Port that is connected to an embedded device, in each of the Root Port bridges:
    — Slot Implemented bit = 1 (hardware initialized).
    — Physical Slot Number = the respective slot's hardware-assigned slot number (hardware initialized).
    — The Chassis register is not implemented.
    — The Expansion Slot register is not implemented.
  • The Root Port that is connected to the embedded device contains the following information:
    — Slot Implemented bit = 0 (hardware initialized).
    — Physical Slot Number = not implemented.
    — Chassis register not implemented.
    — Expansion Slot register not implemented.
  • A chassis is connected to add-in slot connector number one.
  • In the upstream bridge of the chassis:
    — The Chassis register in the chassis' upstream port is set to 01h by the configuration software.
    — The First-In-Chassis bit is set to one (hardware initialized).
    — The upstream bridge's Expansion Slot register contains the value 2 (hardware initialized), indicating that the chassis implements 2 add-in slot connectors on the downstream side of the bridge.
  • In each of the chassis' downstream port bridges:
    — Slot Implemented bit = 1 (hardware initialized).
    — Physical Slot Number = the respective slot's hardware-assigned slot number (hardware initialized).
    — The Chassis register is not implemented.
    — The Expansion Slot register is not implemented.

Figure 22-32: Chassis Example One

Second Example. Figure 22-33 on page 869 illustrates an example wherein:

  • The Root Complex has four Root Ports that are connected to add-in slot connectors, and one Root Port connected to an embedded device. With the exception of the Root Port that is connected to an embedded device, in each of the Root Port bridges:
    — Slot Implemented bit = 1 (hardware initialized).
    — Physical Slot Number = the respective slot's hardware-assigned slot number (hardware initialized).
    — The Chassis register is not implemented.
    — The Expansion Slot register is not implemented.
  • The Root Port that is connected to the embedded device contains the following information:
    — Slot Implemented bit = 0 (hardware initialized).
    — Physical Slot Number = not implemented.
    — Chassis register not implemented.
    — Expansion Slot register not implemented.
  • A chassis is connected to add-in slot connector number one.
  • In the upstream bridge of the chassis:
    — The Chassis register in the chassis' upstream port is set to 01h by the configuration software.
    — The First-In-Chassis bit is set to one (hardware initialized).
    — The upstream bridge's Expansion Slot register contains the value 7 (hardware initialized), indicating that the chassis implements 7 add-in slot connectors on the downstream side of the bridge.
  • In the bridge of each of the downstream switch ports that are connected to slots 1-through-5:
    — Slot Implemented bit = 1 (hardware initialized).
    — Physical Slot Number = the respective slot's hardware-assigned slot number (hardware initialized).
    — The Chassis register is not implemented.
    — The Expansion Slot register is not implemented.
  • In the bridge of the left-most downstream port on the same bus as slots 1-through-5:
    — Slot Implemented bit = 0 (hardware initialized).
    — Physical Slot Number = not implemented.
    — Chassis register not implemented.
    — Expansion Slot register not implemented.
  • In the upstream bridge of the lower switch within the chassis:
    — The Chassis register in the chassis' upstream port is set to 01h by the configuration software.
    — The First-In-Chassis bit is cleared to zero (hardware initialized).
    — The upstream bridge's Expansion Slot register contains the value 2 (hardware initialized), indicating that the chassis implements 2 add-in slot connectors on the downstream side of this bridge.
  • In the bridge of each of the downstream ports beneath the upstream bridge:
    — Slot Implemented bit = 1 (hardware initialized).
    — Physical Slot Number = the respective slot's hardware-assigned slot number (hardware initialized).
    — The Chassis register is not implemented.
    — The Expansion Slot register is not implemented.

Figure 22-33: Chassis Example Two

23 Expansion ROMs

The Previous Chapter

The previous chapter provided a detailed description of the configuration registers residing in a function's PCI-compatible configuration space. This included the registers for both non-bridge and bridge functions.

This Chapter

This chapter provides a detailed description of device ROMs associated with PCI, PCI Express, and PCI-X functions. This includes the following topics:
  • device ROM detection.
  • internal code/data format.
  • shadowing.
  • initialization code execution.
  • interrupt hooking.

The Next Chapter

The next chapter provides a description of:
  • The PCI Express Capability register set in a function's PCI-compatible configuration space.
  • The optional PCI Express Extended Capabilities register sets in a function's extended configuration space:
  • The Advanced Error Reporting Capability register set.
  • Virtual Channel Capability register set.
  • Device Serial Number Capability register set.
  • Power Budgeting Capability register set.
  • RCRBs.

ROM Purpose-Device Can Be Used In Boot Process

In order to boot the OS into memory, the system needs three devices:
  • A mass storage device to load the OS from. This is sometimes referred to as the IPL (Initial Program Load) device and is typically an IDE or a SCSI hard drive.
  • A display adapter to enable progress messages to be displayed during the boot process. In this context, this is typically referred to as the output device.
  • A keyboard to allow the user to interact with the machine during the boot process. In this context, this is typically referred to as the input device.
The OS must locate three devices that fall into these categories and must also locate a device driver associated with each of the devices. Remember that the OS hasn't been booted into memory yet and therefore hasn't loaded any loadable device drivers into memory from disk! This is the main reason that device ROMs exist: a device ROM contains a device driver that permits the device to be used during the boot process.

ROM Detection

When the configuration software is configuring a PCI, PCI-X, or PCI Express function, it determines if a function-specific ROM exists by checking to see if the designer has implemented an Expansion ROM Base Address Register (refer to Figure 23-1 on page 873).
As described in "Expansion ROM Base Address Register" on page 783, the programmer writes all ones (with the exception of bit zero, to prevent the enabling of the ROM address decoder; see Figure 23-1 on page 873) to the Expansion ROM Base Address Register and then reads it back. If a value of zero is returned, then the register is not implemented and there isn't an expansion ROM associated with the device.
On the other hand, the ability to set any bits to ones indicates the presence of the Expansion ROM Base Address Register. This may or may not indicate the presence of a device ROM. Although the address decoder and a socket may exist for a device ROM, the socket may not be occupied at present. The programmer determines the presence of the device ROM by:

  • assigning a base address to the register's Base Address field,
  • enabling its decoder (by setting bit 0 in the register to one),
  • and then attempting to read the first two locations from the ROM.
If the first two locations contain the ROM signature (AA55h), then the ROM is present.
Figure 23-1 on page 873 illustrates the format of the Expansion ROM Base Address Register. Assume that the register returns a value of FFFE0000h when read back after writing all ones to it. Bit 17 is the least-significant bit that was successfully changed to a one and has a binary-weighted value of 128K. This indicates that it is a 128KB ROM decoder and that bits [31:17] of the Base Address field are writable. The programmer now writes a 32-bit start address into the register and sets bit zero to one to enable its ROM address decoder. The function's ROM address decoder is then enabled and the ROM (if present) can be accessed. The maximum ROM decoder size permitted by the PCI spec is 16MB, dictating that bits [31:24] must be read/write.
The programmer then performs a read from the first two locations of the ROM and checks for a return value of AA55h. If this pattern is not received, the ROM is not present. The programmer disables the ROM address decoder (by clearing bit zero of the Expansion ROM Base Address Register to zero). If AA55h is received, the ROM exists and a device driver code image must be copied into main memory and its initialization code must be executed. This topic is covered in the sections that follow.
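The detection and sizing sequence just described can be summarized in code. The following is a minimal sketch only: cfg_read32(), cfg_write32(), and mmio_read16() are hypothetical helpers standing in for whatever configuration-space and memory access mechanisms a given platform provides, and error handling is omitted.

```c
#include <stdint.h>
#include <stdbool.h>

#define ROM_BAR_OFFSET  0x30        /* Expansion ROM BAR in a Type 0 header  */
#define ROM_ENABLE      0x00000001  /* bit 0 enables the ROM address decoder */
#define ROM_SIGNATURE   0xAA55      /* 55h at offset 0, AAh at offset 1      */

/* Hypothetical platform helpers (not defined by the PCI spec). */
extern uint32_t cfg_read32(int bus, int dev, int fn, int offset);
extern void     cfg_write32(int bus, int dev, int fn, int offset, uint32_t val);
extern uint16_t mmio_read16(uint32_t phys_addr);

/* Probe the Expansion ROM BAR. Returns the decoder size in bytes
 * (0 if the BAR is not implemented) and reports whether a ROM with
 * a valid AA55h signature is actually present at 'assign_base'.   */
uint32_t probe_expansion_rom(int bus, int dev, int fn,
                             uint32_t assign_base, bool *rom_present)
{
    uint32_t readback, size;

    *rom_present = false;

    /* Write all ones except bit 0 (keep the decoder disabled), read back.  */
    cfg_write32(bus, dev, fn, ROM_BAR_OFFSET, 0xFFFFFFFE);
    readback = cfg_read32(bus, dev, fn, ROM_BAR_OFFSET);
    if (readback == 0)
        return 0;                       /* BAR not implemented, no ROM      */

    /* The least-significant writable address bit gives the decoder size.
     * E.g., a readback of FFFE0000h -> bit 17 -> 128KB decoder.            */
    size = (~(readback & 0xFFFFF800)) + 1;

    /* Assign a base address, enable the decoder, and look for AA55h.       */
    cfg_write32(bus, dev, fn, ROM_BAR_OFFSET, assign_base | ROM_ENABLE);
    if (mmio_read16(assign_base) == ROM_SIGNATURE)
        *rom_present = true;

    /* Leave the decoder disabled until the image is actually shadowed.     */
    cfg_write32(bus, dev, fn, ROM_BAR_OFFSET, assign_base);
    return size;
}
```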
Figure 23-1: Expansion ROM Base Address Register Bit Assignment
Figure 23-2: Header Type Zero Configuration Register Format

ROM Shadowing Required

The PCI spec requires that device ROM code is never executed in place (i.e., from the ROM). It must be copied to main memory. This is referred to as "shadowing" the ROM code. This requirement exists for two reasons:
  • ROM access time is typically quite slow, resulting in poor performance whenever the ROM code is fetched for execution.
  • Once the initialization portion of the device driver in the ROM has been executed, it can be discarded and the code image in main memory can be shortened to include only the code necessary for run-time operation. The portion of main memory allocated to hold the initialization portion of the code can be freed up, allowing more efficient use of main memory.
Once the presence of the device ROM has been established (see the previous section), the configuration software must copy a code image into main memory and then disable the ROM address decoder (by clearing bit zero of the Expansion ROM Base Address Register to zero). In a non-PC environment, the area of memory the code image is copied to could be anywhere in memory space. The specification for that environment may define a particular area.
In a PC environment, the ROM code image must be copied into main memory into the range of addresses historically associated with device ROMs: 000C0000h through 000DFFFFh. If the Class Code indicates that this is the VGA's device ROM, its code image must be copied into memory starting at location 000C0000h.
The next section defines the format of the information in the ROM and how the configuration software determines which code image (yes, there can be more than one device driver) to load into main memory.

ROM Content

Multiple Code Images

The PCI spec permits the inclusion of more than one code image in a PCI device ROM. Each code image would contain a copy of the device driver in a specific machine code, or in interpretive code (explained later). The configuration software can then scan through the images in the ROM and select the one best
suited to the system processor type. The ROM might contain drivers for various types of devices made by this device's vendor. The code image copied into main memory should match up with the function's ID. To this end, each code image also contains:
  • the Vendor ID and Device ID. This is useful for matching up the driver with a function that has a vendor/device match.
  • the Class Code. This is useful if the driver is a Class driver that can work with any compatible device within a Class/SubClass. For more information, see "Class Code Register" on page 774.
Figure 23-3 on page 877 illustrates the concept of multiple code images embedded within a device ROM. Each image must start on an address evenly-divisible by 512. Each image consists of two data structures, as well as a run-time code image and an initialization code image. The configuration software interrogates the data structures in order to determine if this is the image it will copy to main memory and use. If it is, the configuration software:
  1. Copies the image to main memory,
  2. Disables the expansion ROM's address decoder,
  3. Executes the initialization code,
  4. If the initialization code shortens the length indicator in the data structure, the configuration software deallocates the area of main memory that held the initialization portion of the driver (in Figure 23-4 on page 879, notice that the initialization portion of the driver is always at the end of the image).
  5. The area of main memory containing the image is then write-protected.
The sections that follow provide a detailed discussion of the code image format and the initialization process.
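To make the image-selection walk concrete, the sketch below steps through the images in a readable copy of the ROM using only fields defined in the header and data structure tables that follow: the AA55h signature, the 16-bit pointer to the "PCIR" structure at offset 18h of each image, the Image Length at offset 10h of that structure (in 512-byte units), and the "last image" bit (bit 7 of the Indicator byte at offset 15h). The match_image() callback and the rom/rom_size arguments are placeholders for whatever matching policy (Code Type, Vendor/Device ID, Class Code) and ROM access method the configuration software actually uses; this is a sketch, not a spec-mandated algorithm.

```c
#include <stdint.h>
#include <stddef.h>

/* Little-endian 16-bit read from a byte buffer. */
static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }

/* Walk the code images in a device ROM (already readable at 'rom') and
 * return the offset of the first image acceptable to match_image(), or
 * -1 if none is found. Each image starts on a 512-byte boundary.          */
long find_code_image(const uint8_t *rom, size_t rom_size,
                     int (*match_image)(const uint8_t *pcir))
{
    size_t offset = 0;

    while (offset + 0x1A <= rom_size) {
        const uint8_t *img = rom + offset;

        if (rd16(img) != 0xAA55)                       /* ROM signature     */
            break;

        const uint8_t *pcir = img + rd16(img + 0x18);  /* ptr to "PCIR"     */
        if (pcir + 0x18 > rom + rom_size ||
            pcir[0] != 'P' || pcir[1] != 'C' || pcir[2] != 'I' || pcir[3] != 'R')
            break;

        if (match_image(pcir))
            return (long)offset;                       /* candidate found   */

        if (pcir[0x15] & 0x80)                         /* last image bit    */
            break;

        offset += (size_t)rd16(pcir + 0x10) * 512;     /* Image Length      */
    }
    return -1;
}
```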
Figure 23-3: Multiple Code Images Contained In One Device ROM

Format of a Code Image

General

Figure 23-4 on page 879 illustrates the format of a single code image. The image consists of the following components:
  • ROM Header. Described in "ROM Header Format" on page 879. Also contains a 16-bit pointer to the ROM data structure.
  • ROM Data Structure. Described in "ROM Data Structure Format" on page 881. Contains information about the device and the image.
  • Run-time code. This is the portion of the device driver that remains in main memory after the OS loads and that remains available for execution on an on-going basis.
  • Initialization code. This is the portion of the device driver that is called and executed immediately after loading the driver into main memory. It completes the setup of the device and enables it for normal operation. It must always reside at the end of the image so it can be abbreviated or discarded after its initial execution at system startup.
Figure 23-4: Code Image Format

ROM Header Format

The ROM Header must be located at the very start of each image within the ROM (so a better name for it might be the Code Image Header). Table 23-1 on page 880 defines the format of the Header and the purpose of each field is further defined in the paragraphs that follow. The offset specified in the table is the offset from the first location in this ROM code image.
Table 23-1: PCI Expansion ROM Header Format
Offset 00h (1 byte, value 55h): ROM signature byte one. The first two bytes must contain AA55h, identifying this as a device ROM. This has always been the signature used for a device ROM in any PC-compatible machine.
Offset 01h (1 byte, value AAh): ROM signature byte two.
Offsets 02h-17h (22d bytes): Reserved for processor/architecture-unique data. See Table 23-2 on page 881. For PC-compatible environments and images that identify the code as Intel x86-compatible in the Code Type field (see "Code Type" on page 885) of the ROM data structure, the PCI spec defines the structure of this processor/architecture-unique data area in the image Header. For non-PC-compatible environments, the content of this area is architecture-specific. Table 23-2 on page 881 defines the fields that must be supplied for PC compatibility; the offsets specified in that table are from the first location of this ROM code image.
Offsets 18h-19h (2 bytes): Pointer to PCI Data Structure. This is the 16-bit offset (in little-endian format), from the start address of this code image, to the ROM data structure within this code image. Because the offset is only 16 bits, the data structure must reside within 64KB forward of the first location of this code image.
Table 23-2: PC-Compatible Processor/Architecture Data Area In ROM Header
Offset 02h (1 byte): Overall size of the image (in 512-byte increments). The sum of the runtime code and the initialization code (runtime code + initialization code) is the initialization size, and that sum is not necessarily the overall size of the image: the overall size (Image Length) can be greater than the initialization size. The Image Length field specifies where the next image in the ROM starts, while this field (Initialization Size would be a better name for it) gives the actual code size that is copied into RAM.
Offsets 03h-05h (3 bytes): Entry point for the initialization code. Contains a three-byte x86 short jump to the initialization code entry point. The POST performs a far call to this location to initialize the device.
Offsets 06h-17h (18d bytes): Reserved (for application-unique data, such as the copyright notice).
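For reference, the PC-compatible header layout described in Tables 23-1 and 23-2 can be expressed as a packed structure. This is a sketch only: the field names are illustrative rather than spec-defined identifiers, and only the offsets and widths come from the tables above.

```c
#include <stdint.h>

/* PC-compatible PCI expansion ROM header (Tables 23-1 and 23-2). */
#pragma pack(push, 1)
typedef struct {
    uint8_t  signature[2];     /* 00h-01h: 55h, AAh                          */
    uint8_t  image_size_512;   /* 02h: initialization size in 512-byte units */
    uint8_t  init_entry[3];    /* 03h-05h: x86 short jump to init entry      */
    uint8_t  app_reserved[18]; /* 06h-17h: application-unique data           */
    uint16_t pcir_offset;      /* 18h-19h: offset to the PCI Data Structure  */
} pci_rom_header;
#pragma pack(pop)
```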

ROM Data Structure Format

As stated earlier, the ROM Data Structure associated with each code image must reside within the first 64KB of each code image. The Data Structure must reside within the run-time code (assuming there is one). It's possible that a ROM may not contain a device driver for the device, but only an initialization module that tests the device and gets it ready for normal operation. If there isn't a run-time code module, the Data Structure must reside within the initialization code. The Data Structure's format is defined in Table 23-3 on page 882 and the purpose of each field is further defined in the sections that follow the table.
Table 23-3: PCI Expansion ROM Data Structure Format
Offset 00h (4 bytes): Signature consisting of the ASCII string "PCIR" (PCI ROM).
Offset 04h (2 bytes): Vendor ID. This is a duplication of the Vendor ID found in the function's configuration Vendor ID register (see "Vendor ID Register" on page 773). The ROM may contain multiple code images of the desired Code Type (e.g., x86 code), but they may be for different devices produced by the same (or a different) vendor. In order to ensure that it loads the correct one, the configuration software compares the Vendor ID, Device ID, and Class Code values contained in this Data Structure to those found in the function's Vendor ID, Device ID, and Class Code configuration registers.
Offset 06h (2 bytes): Device ID. This is a duplication of the Device ID found in the function's configuration Device ID register (see "Device ID Register" on page 773). See the explanation of the Vendor ID field in this table.
Offset 08h (2 bytes): Reserved. Was the Pointer to the Vital Product Data. The pointer to the optional VPD was provided as an offset from the start location of the code image. The 2.2 PCI spec redefined this as a Reserved field, and the optional VPD (if present) was moved to the device's configuration registers. Refer to "Vital Product Data (VPD) Capability" on page 848.
Offset 0Ah (2 bytes): PCI Data Structure Length in bytes, little-endian format.
Offset 0Ch (1 byte): PCI Data Structure Revision. The Data Structure format shown in this table is revision zero.
Offset 0Dh (3 bytes): Class Code. This is a duplication of the Class Code found in the function's configuration Class Code register (see "Class Code Register" on page 774). See the explanation of the Vendor ID field in this table.
Offset 10h (2 bytes): Image Length. Code image length in increments of 512 bytes (little-endian format). The sum of the runtime code and the initialization code is the initialization size, and that sum is not necessarily the overall size of the image: the overall size (Image Length) can be greater than the initialization size. The Image Length specifies where the next image in the ROM starts, while the initialization size is the actual code size that is copied into RAM.
Offset 12h (2 bytes): Revision level of the code/data in this code image.
Offset 14h (1 byte): Code type. See "Code Type" on page 885.
Offset 15h (1 byte): Indicator byte. Bit 7 indicates whether this is the last code image in the ROM (1 = last image). Bits [6:0] are reserved and must be zero.
Offset 16h (2 bytes): Reserved.
ROM Signature. This unique signature identifies the start of the PCI Data Structure. The "P" is stored at offset 00h, the "C" at offset 01h, etc. "PCIR" stands for PCI ROM.
Vendor ID field in ROM data structure. As stated in Table 23-3 on page 882, the configuration software does not select a code image to load into system memory unless it is the correct Code Type and the Vendor ID, Device ID, and Class Code in the image's Data Structure match the function's respective configuration registers. The ROM may contain code images for variations on the device, either from the same vendor or supplied by different vendors.
Device ID in ROM data structure. Refer to the description of the Vendor ID field in the previous section.
Pointer to Vital Product Data (VPD). The 2.2 PCI spec defined this as a Reserved field and the optional VPD (if present) was moved to the device's configuration registers. Refer to "Vital Product Data (VPD) Capability" on page 848. The following description is only provided as historical information.
The VPD pointer is the offset (from the start of the code image) to the Vital Product Data area. The offset was stored in little-endian format. Because the offset is only 16 bits in size, the Vital Product Data area had to reside within the first 64KB of the image. A value of zero indicates that the image contains no Vital Product Data. The revision 2.0 PCI spec said that the pointer was required, but the 2.1 PCI spec removed that requirement. If the ROM contained only one code image, that image contained the VPD. If multiple code images were present, each image contained VPD for that device. The VPD data that described the device could be duplicated in each code image, but the VPD that pertained to software could differ from one code image to another.
PCI Data Structure Length. This 16-bit value is stored in the little-endian format. It defines the length (in bytes) of the PCI Data Structure for this image.
PCI Data Structure Revision. This 8-bit field reflects the revision of the image's Data Structure. The currently-defined data structure format (as of revision 2.2 PCI spec) is revision zero.
Class Code. The 24-bit class code field contains the same information as the Class Code configuration register within the function's configuration header. The configuration software examines this field to determine if this is a VGA-compatible interface. If it is, the ROM code image must be copied into system memory starting at location 000C0000h (for compatibility). Otherwise, it will typically be copied into the C0000h-through-DFFFFh region in a PC-compatible machine. Also refer to "Vendor ID field in ROM data structure" on page 883.
Image Length. This two-byte field indicates the length of the entire code image (refer to Figure 23-4 on page 879) in increments of 512 bytes. It is stored in little-endian format. The sum of the runtime code and the initialization code is the initialization size, and that sum is not necessarily the overall size of the image: the overall size (Image Length) can be greater than the initialization size. The Image Length specifies where the next image in the ROM starts, while the initialization size is the actual code size that is copied into RAM.
Revision Level of Code/Data. This two-byte field reflects the revision level of the code within the image.
Code Type. This one-byte field identifies the type of code contained in this image as either executable machine language for a particular processor/ architecture, or as interpretive code.
  • Code Type 00h = Intel x86 (IBM PC-AT compatible) executable code.
  • Code Type 01h = OpenBoot (Open Firmware) interpretive code. The Open Firmware standard (IEEE standard 1275-1994) defines the format and usage of the interpretive code. A basic description of the Open Firmware standard can be found in "Introduction to Open Firmware" on page 888.
  • Code Type 02h = HP PA/RISC executable code (added in the 2.2 PCI spec).
  • Code Type 03h = Extensible Firmware Interface (EFI).
The values from 04h-through-FFh are reserved.
Indicator Byte. Only bit seven is currently defined.
  • 0= not last code image in ROM.
  • 1= last code image in ROM.
Bits [6:0] are reserved.
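The fields just described can be summarized as a packed structure overlaying the "PCIR" data area. As before, this is a sketch: the field names are illustrative, and only the offsets and widths follow Table 23-3 (revision zero of the structure).

```c
#include <stdint.h>

/* PCI Expansion ROM Data Structure ("PCIR"), per Table 23-3 (revision 0). */
#pragma pack(push, 1)
typedef struct {
    char     signature[4];     /* 00h: "PCIR"                                */
    uint16_t vendor_id;        /* 04h: must match the function's Vendor ID   */
    uint16_t device_id;        /* 06h: must match the function's Device ID   */
    uint16_t reserved_vpd;     /* 08h: was Pointer to Vital Product Data     */
    uint16_t pcir_length;      /* 0Ah: length of this structure in bytes     */
    uint8_t  pcir_revision;    /* 0Ch: 0 for the format shown here           */
    uint8_t  class_code[3];    /* 0Dh-0Fh: duplicates the Class Code         */
    uint16_t image_length_512; /* 10h: image length in 512-byte units        */
    uint16_t code_revision;    /* 12h: revision of the code in this image    */
    uint8_t  code_type;        /* 14h: 00h = x86, 01h = OpenBoot, ...        */
    uint8_t  indicator;        /* 15h: bit 7 = last image in the ROM         */
    uint16_t reserved;         /* 16h                                        */
} pci_rom_data_structure;
#pragma pack(pop)
```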

Execution of Initialization Code

Prior to the discovery of the device's ROM, the configuration software has accomplished the following:
  • Assigned one or more memory and/or IO ranges to the function by programming its Base Address Registers (see "Base Address Registers" on page 792).
  • If the device is interrupt-driven, the interrupt routing information has been programmed into the device's Interrupt Line register (see "Interrupt Line Register" on page 791).
  • In addition, if the UDF bit (this bit was in the 2.1 PCI spec and was deleted from the 2.2 PCI spec) was set in the device's configuration Status register, the user has been prompted to insert the diskette containing the PCI configuration file, or PCF, and the user selected any configuration options available from the file.
After the ROM is discovered, the configuration software copies a code image from the ROM into RAM. After the appropriate code image has been copied into system memory, the device ROM's address decoder is disabled. The configuration software must keep the area of RAM the image resides in read/writable. The sequence that follows assumes that the selected ROM image's Code Type field is 00h and that the device resides on a PC-compatible platform. The configuration software then executes the following sequence:
  1. Refer to Figure 23-5 on page 888. The software calls the initialization module within the image (through location 3h in the image), supplying it with three parameters in the AX register: the bus number, device number, and function number of the function associated with the ROM:
  • The 8-bit bus number is supplied in AH,
  • the device number is supplied in the upper five bits of AL,
  • and the function number in the lower three bits of AL.
It's necessary to supply the initialization code with this information so that it can determine how the function has been configured. For example, what IO and/or memory address range the configuration software has allocated to the function (via its base address registers), what input on the interrupt controller the function's interrupt pin has been routed to, etc.
  2. The initialization code then issues a call to the PCI BIOS, supplying the bus number, device number, and function number as input parameters and requesting the contents of the function's Base Address Registers. Armed with this information, the initialization code can now communicate with the function's IO register set to initialize the device and prepare it for normal operation.
  3. If the ROM image has a device-specific Interrupt Service Routine embedded within the run-time module, it reads from the device's Interrupt Line configuration register to determine which system interrupt request input on the interrupt controller the function's PCI interrupt pin has been routed to by the configuration software. Using this routing information, the initialization code knows which entry in the interrupt table in memory must be hooked. It first reads the pointer currently stored in that interrupt table entry and saves it within the body of the run-time portion of the image. It then stores the pointer to the interrupt service routine embedded within the run-time module of the code image into that interrupt table entry. In this way, it maintains the integrity of the interrupt chain. Since the area of system memory it has been copied into must be kept read/writable until the initialization code completes execution, the initialization code has no problem saving the pointer that it read from the interrupt table entry before hooking it to its own service routine.
  4. The ROM image may also have a function-specific BIOS routine embedded within the run-time module of the code image. In this case, it needs to hook another interrupt table entry to this BIOS routine. Once again, it reads and saves the pointer currently stored in that interrupt table entry and then stores the pointer to the BIOS routine embedded within the run-time module of the code image. In this way, it maintains the integrity of the interrupt chain. Note regarding steps 3 and 4: This procedure is different if the function will deliver interrupts via MSI instead of INTx messages. It would read the MSI capability structure of the function for the message address and message data fields and use that info to find the interrupt table entry. Then it would store the pointer to the interrupt service routine embedded within the run-time module of the code image into that interrupt table entry. No interrupt chaining would have to be supported.
  5. Since the area of system memory it resides in must be kept read/writable until the initialization code completes execution, the initialization code can adjust the code image length (in location 2h of the image). Very typically, at the completion of initialization code execution the programmer will adjust the image length field to encompass the area from the image's start through the end of the run-time code. The initialization code is typically only executed once and is then discarded. It must also recompute a new checksum and store it at the end of the run-time code. If it sets the image length to zero, it doesn't need to recompute the image checksum and update it. When it returns control to the configuration software, a length of zero would indicate that the driver will not be used for some reason (perhaps a problem was detected during the setup of the device) and all of the memory it occupies can be deallocated and reused for something else.
  6. Once the initialization code has completed execution, it executes a return to the system software that called it.
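The following sketch illustrates two of the steps above for the PC-compatible, Code Type 00h case: packing the bus, device, and function numbers into AX for the call into the image (step 1), and chaining an interrupt table entry while preserving the old vector (steps 3 and 4). It is conceptual only: how the real-mode interrupt vector table at physical address 0 is actually reached (and the helper names used here) is platform- and environment-specific, not something the text above defines.

```c
#include <stdint.h>

/* Step 1: pack the target address into AX before the far call into the image:
 *   AH      = bus number
 *   AL[7:3] = device number
 *   AL[2:0] = function number                                               */
static inline uint16_t make_ax(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1F) << 3) | (fn & 0x07));
}

/* Steps 3/4: hook an interrupt vector while preserving the chain.  The
 * real-mode IVT is an array of 256 segment:offset pairs starting at
 * physical address 0; 'ivt' is assumed to be a pointer through which
 * that area is accessible (e.g., an identity mapping).                      */
typedef struct { uint16_t offset, segment; } far_ptr;

static far_ptr hook_vector(volatile far_ptr *ivt, uint8_t vector,
                           far_ptr new_handler)
{
    far_ptr previous = ivt[vector];  /* save the old pointer so the new      */
    ivt[vector] = new_handler;       /* handler can chain to it later        */
    return previous;                 /* caller stores this in the run-time   */
                                     /* portion of the shadowed image        */
}
```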
The configuration software then performs the following final actions:
  1. It interrogates the image size (at offset 2h in the image) to determine if it was altered. If it has been, the configuration software adjusts the amount of memory allocated to the image to make more efficient use of memory. The image is typically shorter than it was.
  2. It computes a new checksum for the image and stores it at the end of the image.
  3. It write-protects the area of main memory the image resides in. This keeps the OS from using the area after it takes control of the machine.
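The checksum referred to here follows the conventional PC option-ROM rule: all bytes of the image must sum to zero modulo 256. A minimal sketch of recomputing it, assuming (as is conventional, not stated above) that the checksum byte occupies the last byte of the shortened image:

```c
#include <stdint.h>
#include <stddef.h>

/* Recompute the image checksum so that all bytes of the image sum to zero
 * modulo 256, storing the adjustment in the final byte of the image.       */
void fixup_image_checksum(uint8_t *image, size_t image_len)
{
    uint8_t sum = 0;

    image[image_len - 1] = 0;               /* exclude the old checksum      */
    for (size_t i = 0; i < image_len; i++)
        sum = (uint8_t)(sum + image[i]);

    image[image_len - 1] = (uint8_t)(0x100 - sum);  /* make the total zero   */
}
```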
The ISA Plug-and-Play spec refers to the PCI method of writing device ROM code and handling its detection, shadowing, and initialization, as the DDIM (Device Driver Initialization Model). That spec stresses that this is the model that all device ROMs for other buses (i.e., other than PCI) should adhere to.

Figure 23-5: AX Contents On Entry To Initialization Code

Introduction to Open Firmware

Introduction

The IEEE standard 1275-1994 entitled Standard for Boot (Initialization, Configuration) Firmware Core Requirements and Practices addresses two areas of concern regarding the boot process:
  • The very first section in this chapter described the basic rationale for including a device ROM in the design of a device-it provides a device driver that allows the OS boot program to use the device during the OS boot process. That raises the question of what language to write the device driver in. This is one of the two major areas addressed by the OpenBoot standard. It is the one that the PCI spec is concerned with.
  • After the OS is booted into memory, the BIOS passes control to it. If it's a Plug-and-Play capable OS, it would be nice if the BIOS passed a pointer to the OS that identified a data structure defining all of the devices that the OS has at its disposal. The OS could then traverse this data structure, determine the current state of all devices, and manage them for the remainder of the power-up session. In order for this to work, the exact format of this data structure must be standardized and understood by both the BIOS that builds it and the OS that subsequently takes ownership of it. This is the other major area addressed by the OpenBoot standard.
These two areas are discussed in more detail in the two sections that follow. It should be noted that this is only intended as an introduction to this standard. There's a lot more to it than is covered here: the standard is approximately 300 pages in length, 8.5×11 in size. A detailed discussion of Open Firmware is outside the scope of this book.

Universal Device Driver Format

Historically, most PC-compatible machines have been based on Intel x86 processors. When writing ROM code for an add-in subsystem on a card, it was a simple decision that the device driver image to be stored in the ROM would be an x86 machine language code image.
A number of system vendors have created systems incorporating PCI and based on processors other than the x86 processor family. These machines would take a substantial performance hit when executing expansion ROM code that isn't written in the processor's native machine language (i.e., x86 code is "foreign" to PowerPC and other types of non-Intel-compatible processors). They would be forced to emulate the x86 code, an inherently inefficient solution.
Rather than writing an add-in device's ROM code in machine language native to a particular processor, the subsystem designer can write the ROM code in Fcode (tokenized Forth code) based on the Open Firmware specification, IEEE 1275-1994. In other words, the device driver is written in the high-order language Forth.
The Open Firmware components would consist of the following:
  • The system BIOS contains the Fcode interpreter and possibly an individual Fcode device driver associated with each of the embedded subsystems that the system Open Firmware is already cognizant of.
  • Each add-in subsystem would hopefully contain an Open Firmware Fcode image.
The Open Firmware language is based on the Forth programming language. The ROM code would be written in Forth source code. The source code is then supplied as input to a "tokenizer" program. The tokenizer processes the source code into a series of compressed commands, known as Fcode. As an example, an entire line of source code might be reduced to a single byte that represents the Forth command in a much more compact form.
The system BIOS that "discovered" the ROM (as described earlier in this chapter), incorporates an interpreter that converts the Fcode byte stream read from the ROM into machine language instructions specific to the system's processor.
The programmer only has to write this one universal version of the driver and any machine with an Fcode interpreter built into the system BIOS can then utilize this driver with the device during the boot process (allowing the device to
be selected as the Input, Output, or IPL boot device). Obviously, executing a driver written in interpretive code would yield less than optimum performance. However, once the OS is booted into memory it then loads native code drivers for the three boot devices to replace the Fcode drivers. Performance of the devices is then optimized.
The PCI spec refers the reader to another document, PCI Bus Binding to IEEE 1275-1994, for implementation of Open Firmware in a PCI-based machine. This document is available using anonymous FTP to the machine playground.sun.com with the file name /pub/p1275/bindings/postscript/PCI.ps.

Passing Resource List To Plug-and-Play OS

BIOS Calls Bus Enumerators For Different Bus Environments

A machine architecture can contain many different device environments. Examples would be PCI, CardBus, Plug-and-Play ISA, etc. The methods that must be used to access the configuration registers associated with each of these different device types are very different from each other. In addition, the layout and format of their configuration registers are quite different as well.
The BIOS includes a separate, bus-specific program for each of these environments. This program is frequently referred to as a Bus Enumerator. The Bus Enumerator knows:
  • how to access the configuration registers within devices of its specific type (e.g., PCI, PCI-X, PCI Express).
  • how to "discover" devices within its environment. For example, in a PCI, PCI-X, or PCI Express environment, the programmer reads the Vendor ID from a function's Vendor ID register. Any value other than FFFFh represents a valid ID, while FFFFh indicates that no function resides at the currently-addressed location.
  • how to probe the device's configuration registers to discover the device's resource requirements.
  • how to allocate selected resources to the device.
The system BIOS must call the Bus Enumerators for each of the bus environments supported in the platform. When a specific Enumerator is called, it discovers all of the devices within its target environment, discovers the resources each requires, and allocates non-conflicting resources to each. It does not, however, enable the devices. The Enumerator builds a data structure in memory that lists all devices of its type that were found. It then passes a pointer to the start of that data structure back to the system BIOS.
When the system BIOS has called each of the Bus Enumerators for the different environments, it now has a list of pointers to the various, bus-specific data structures that list all of the devices that it has to work with.
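A PCI-style Bus Enumerator's discovery loop, reduced to its essentials, looks roughly like the sketch below. cfg_read16() is again a hypothetical stand-in for the platform's configuration-access mechanism, and the callback is where resource probing and the building of the per-device data structure would occur. A real enumerator would also honor the multi-function bit in the Header Type register and walk bridges hierarchically rather than brute-forcing every address; this is a simplification.

```c
#include <stdint.h>

#define VENDOR_ID_OFFSET  0x00
#define INVALID_VENDOR    0xFFFF   /* no function at this address           */

extern uint16_t cfg_read16(int bus, int dev, int fn, int offset);

/* Scan every possible bus/device/function address and report each function
 * that responds with a valid (non-FFFFh) Vendor ID.                        */
void enumerate_functions(void (*found)(int bus, int dev, int fn))
{
    for (int bus = 0; bus < 256; bus++)
        for (int dev = 0; dev < 32; dev++)
            for (int fn = 0; fn < 8; fn++) {
                uint16_t vid = cfg_read16(bus, dev, fn, VENDOR_ID_OFFSET);
                if (vid != INVALID_VENDOR)
                    found(bus, dev, fn);   /* probe BARs, record the device */
            }
}
```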

BIOS Selects Boot Devices and Finds Drivers For Them

The system BIOS would then scan the data structures to locate an Input device, an Output device, and an IPL device to use in booting the OS into memory. In order to use each of these devices during the boot process, it would also require a device driver for each of them. The drivers would either be embedded within the BIOS itself or within device ROMs discovered with each of the devices.
For each of the three boot devices, the BIOS would then:
  • call the initialization code within the device driver. The initialization code would then complete the preparation of the device for use.
  • The BIOS would then set the appropriate bits in its configuration Command register (e.g., Bus Master Enable, etc.) to enable the device and bring it online.

BIOS Boots Plug-and-Play OS and Passes Pointer To It

The system BIOS then uses the three devices to boot the OS into memory and passes control to the OS. It also passes the OS a pointer that points to the head of the list of data structures that identify all of the devices that the OS has to work with.

OS Locates and Loads Drivers and Calls Init Code In Each

Note that Init code refers to the Initialization code portion of the driver. The OS then locates the disk-based drivers for each device and loads them into memory one-by-one. As it loads each driver, it then calls its initialization code entry point and the driver completes the device-specific setup of the device and brings the device on-line. The machine is now up and running and the OS manages the system devices from this point forward.

24 Express-Specific Configuration Registers

The Previous Chapter

The previous chapter provided a detailed description of device ROMs associated with PCI, PCI Express, and PCI-X functions. This included the following topics:
  • device ROM detection.
  • internal code/data format.
  • shadowing.
  • initialization code execution.
  • interrupt hooking.

This Chapter

This chapter provides a description of:
  • The PCI Express Capability register set in a function's PCI-compatible configuration space.
  • The optional PCI Express Extended Capabilities register sets in a function's extended configuration space:
  • The Advanced Error Reporting Capability register set.
  • Virtual Channel Capability register set.
  • Device Serial Number Capability register set.
  • Power Budgeting Capability register set.
  • RCRBs.

Introduction

Refer to Figure 24-1 on page 895. As described earlier in "Each Function Implements a Set of Configuration Registers" on page 715, each PCI Express function has a dedicated 4KB memory address range within which its configuration registers are implemented. Each Express function must implement the PCI Express Capability register set somewhere in the lower 48 dwords of the PCI-compatible register space (i.e., within the lower 48 dword region of the first 64 dwords of configuration space). In addition, the function may optionally implement any of the PCI Express Extended Capability register sets. The sections that follow provide a detailed description of each of these Express-specific register sets.

Figure 24-1: Function's Configuration Space Layout

PCI Express Capability Register Set

Introduction

Refer to Figure 24-2 on page 897. Otherwise referred to as the PCI Express Capability Structure, implementation of the PCI Express Capability register set is mandatory for each function. It is implemented as part of the linked list of Capability register sets that reside in the lower 48 dwords of a function's PCI-compatible register area. It should be noted however, that some portions of this register set are optional.
Register implementation requirements:
  • Every Express function must implement the registers that reside in dwords 0-through-4.
  • The bridge associated with each Root Port must implement the registers that reside in dwords seven and eight.
  • Each bridge associated with a Root Port or a downstream Switch Port that is connected to a slot (i.e., an add-in card slot) must implement the registers that reside in dwords five and six.
The sections that follow provide a detailed description of each of these registers.

Figure 24-2: PCI Express Capability Register Set

Required Registers

General

The sections that follow describe each of the required registers within the PCI Express Capability register set. The following registers must be implemented by all Express functions:

  • PCI Express Capability ID Register
  • Next Capability Pointer Register
  • PCI Express Capabilities Register
  • Device Capabilities Register
  • Device Control Register
  • Device Status Register
  • Link Capabilities Register
  • Link Control Register
  • Link Status Register

PCI Express Capability ID Register

This read-only field must contain the value 10h, indicating that this is the start of the PCI Express Capability register set.

Next Capability Pointer Register

This read-only field contains one of the following:
  • The dword-aligned, non-zero offset to the next capability register set in the lower 48 dwords of the function's PCI-compatible configuration space.
  • 00h, if the PCI Express Capability register set is the final register set in the linked list of capability register sets in the function's PCI-compatible configuration space.

PCI Express Capabilities Register

Figure 24-3 on page 898 illustrates this register and Table 24-1 on page 899 provides a description of each bit field in this register.
Figure 24-3: PCI Express Capabilities Register
Table 24-1: PCI Express Capabilities Register
Bits 3:0 (RO): Capability Version. SIG-defined PCI Express capability structure version number (must be 1h).
Bits 7:4 (RO): Device/Port Type. Express logical device type:
- 0000b: PCI Express Endpoint. Some OSs and/or processors may not support IO accesses (i.e., accesses using IO rather than memory addresses). This being the case, the designer of a native PCI Express function should avoid the use of IO BARs. However, the target system that a function is designed for may use the function as one of the boot devices (i.e., the boot input device (e.g., keyboard), output display device, or boot mass storage device) and may utilize a legacy device driver for the function at startup time. The legacy driver may assume that the function's device-specific register set resides in IO space. In this case, the function designer would supply an IO BAR to which the configuration software will assign an IO address range. When the OS boot has completed and the OS has loaded a native PCI Express driver for the function, however, the OS may deallocate all legacy IO address ranges previously assigned to the selected boot devices. From that point forward and for the duration of the power-up session, the native driver will utilize memory accesses to communicate with its associated function through the function's memory BARs.
- 0001b: Legacy PCI Express Endpoint. A function that requires IO space assignment through BARs for run-time operations. Extended configuration space capabilities, if implemented on Legacy PCI Express Endpoint devices, may be ignored by software.
- 0100b: Root Port of PCI Express Root Complex*.
- 0101b: Switch upstream port*.
- 0110b: Switch downstream port*.
- 0111b: Express-to-PCI/PCI-X bridge*.
- 1000b: PCI/PCI-X-to-Express bridge*.
- All other encodings are reserved.
* Only valid for functions with a Type 1 configuration register layout.
Bit 8 (HWInit): Slot Implemented. When set, indicates that this Root Port or Switch downstream port is connected to an add-in card slot (rather than to an integrated component or being disabled). See "Chassis and Slot Number Assignment" on page 861 for more information.
Bits 13:9 (RO): Interrupt Message Number. If this function is allocated more than one MSI interrupt message value (see "Message Data Register" on page 335), this register contains the MSI Data value that is written to the MSI destination address when any status bit in either the Slot Status register (see "Slot Status Register" on page 925) or the Root Status register (see "Root Status Register" on page 928) of this function is set. If system software should alter the number of message data values assigned to the function, the function's hardware must update this field to reflect the change.

Device Capabilities Register

Figure 24-4 on page 901 and Table 24-2 on page 901 provide a description of each bit field in this register. This register defines operational characteristics that are globally applicable to the device (and all functions that reside within it).

Figure 24-4: Device Capabilities Register
Table 24-2: Device Capabilities Register (read-only)
Bits 2:0: Max Payload Size Supported. Max data payload size that the function supports for TLPs:
- 000b = 128-byte max payload size
- 001b = 256-byte max payload size
- 010b = 512-byte max payload size
- 011b = 1KB max payload size
- 100b = 2KB max payload size
- 101b = 4KB max payload size
- 110b = Reserved
- 111b = Reserved

Bits 4:3: Phantom Functions Supported. Background: Normally, each Express function (when acting as a Requester) is limited to no more than 32 outstanding requests awaiting completion (as indicated by the lower five bits of the transaction Tag; the upper three bits of the Tag must be zero). However, a function may require more than this. If the Extended Tag Field is supported (see bit 5 in this table) and the Extended Tag Field Enable bit in the Device Control register is set (see "Device Control Register" on page 905), the max is increased to 256 and all eight bits of the Tag field are used when a function within the device issues a request packet. If a function requires a greater limit than 256, it may do so via Phantom Functions. Description: When the device within which a function resides does not implement all eight functions, a non-zero value in this field indicates that this is so. Assuming all functions are not implemented and that the programmer has set the Phantom Function Enable bit in the Device Control register (see "Device Control Register" on page 905), a function may issue request packets using its own function number as well as one or more additional function numbers. This field indicates the number of msbs of the function number portion of the Requester ID that are logically combined with the Tag identifier:
- 00b: The Phantom Function feature is not available within this device.
- 01b: The msb of the function number in the Requester ID is used for Phantom Functions. The device designer may implement functions 0-3. When issuing request packets, Functions 0, 1, 2, and 3 may also use function numbers 4, 5, 6, and 7, respectively, in the packet's Requester ID.
- 10b: The two msbs of the function number in the Requester ID are used for Phantom Functions. The device designer may implement functions 0 and 1. When issuing request packets, Function 0 may also use function numbers 2, 4, and 6 in the packet's Requester ID. Function 1 may also use function numbers 3, 5, and 7 in the packet's Requester ID.
- 11b: All three bits of the function number in the Requester ID are used for Phantom Functions. The device designer must only implement Function 0 (and it may use any function number in the packet's Requester ID).

Bit 5: Extended Tag Field Supported. Max supported size of the Tag field when this function acts as a Requester.
- 0 = 5-bit Tag field supported (max of 32 outstanding requests per Requester).
- 1 = 8-bit Tag field supported (max of 256 outstanding requests per Requester).
If 8-bit Tags are supported and will be used, this feature is enabled by setting the Extended Tag Field Enable bit in the Device Control register (see "Device Control Register" on page 905) to one.
Bits 8:6: Endpoint L0s Acceptable Latency. Acceptable total latency that an Endpoint can withstand due to the transition from the L0s state to the L0 state (see "L0s Exit Latency Update" on page 625). This value is an indirect indication of the amount of the Endpoint's internal buffering. Power management software uses this value to compare against the L0s exit latencies reported by all components in the path between this Endpoint and its parent Root Port to determine whether ASPM L0s entry can be used with no loss of performance.
- 000b = Less than 64ns
- 001b = 64ns to less than 128ns
- 010b = 128ns to less than 256ns
- 011b = 256ns to less than 512ns
- 100b = 512ns to less than 1µs
- 101b = 1µs to less than 2µs
- 110b = 2µs to less than 4µs
- 111b = More than 4µs

Table 24 - 2: Device Capabilities Register (read-only) (Continued)
Bit(s)Description
11:9Endpoint L1 Acceptable Latency. Acceptable latency that an Endpoint ca withstand due to the transition from L1 state to the L0 state (see “L1 Exit Latency Update" on page 626). This value is an indirect indication of the amount of the Endpoint’s internal buffering. Power management softwa uses this value to compare against the L1 Exit Latencies reported by all components in the path between this Endpoint and its parent Root Port 1 determine whether ASPM L1 entry can be used with no loss of perfor- mance. - 000b= Less than 1μs - 001b = 1µs to less than 2µs - 010b = 2µs to less than 4µs - 011b = 4 µs to less than 8 µs - 100b = 8μs to less than 16μs - 101b=16μs to less than 32μs - 110b=32μs - 64μs - 111b= More than 64μs
12Attention Button Present. When set to one, indicates an Attention Button is implemented on the card or module. Valid for the following PCI Express device Types: - Express Endpoint device - Legacy Express Endpoint device - Switch upstream port - Express-to-PCI/PCI-X bridge
13Attention Indicator Present. When set to one, indicates an Attention Indicator is implemented on the card or module. Valid for the following PCI Express device Types: - Express Endpoint device - Legacy Express Endpoint device - Switch upstream port - Express-to-PCI/PCI-X bridge

14Power Indicator Present. When set to one, indicates a Power Indicator is implemented on the card or module. Valid for the following PCI Express device Types: - Express Endpoint device - Legacy Express Endpoint device - Switch upstream port - Express-to-PCI/PCI-X bridge
25:18Captured Slot Power Limit Value (upstream ports only). In combination with the Slot Power Limit Scale value (see the next row in this table), specifies the upper limit on power supplied by the slot: Power limit (in Watts) = Slot Power Limit Value × Slot Power Limit Scale value. This value is either automatically set by the receipt of a Set Slot Power Limit Message received from the port on the downstream end of the link, or is hardwired to zero. Refer to "Slot Power Limit Control" on page 562 for a detailed description.
27:26Captured Slot Power Limit Scale (upstream ports only). Specifies the scale used for the calculation of the Power Limit (see the previous row in this table): - 00b=1.0x - 01b=0.1x - 10b=0.01x - 11b=0.001x This value is either automatically set by the receipt of a Set Slot Power Limit Message received from the port on the downstream end of the link, or is hardwired to zero.
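The captured Value and Scale fields combine by simple multiplication, and the same formula applies to the Slot Power Limit Value and Scale fields of the Slot Capabilities register later in this chapter. The following is a minimal sketch in C, assuming the bit positions listed above; the function and variable names are illustrative only.

#include <stdint.h>
#include <stdio.h>

/* Compute the slot power limit in Watts from the Captured Slot Power
 * Limit Value (bits 25:18) and Captured Slot Power Limit Scale
 * (bits 27:26) of the Device Capabilities register. */
static double slot_power_limit_watts(uint32_t dev_caps)
{
    uint32_t value = (dev_caps >> 18) & 0xFF;  /* bits 25:18 */
    uint32_t scale = (dev_caps >> 26) & 0x3;   /* bits 27:26 */
    static const double factor[4] = { 1.0, 0.1, 0.01, 0.001 };
    return value * factor[scale];
}

int main(void)
{
    /* Example: value = 25, scale = 01b (0.1x)  ->  2.5 W */
    uint32_t dev_caps = (25u << 18) | (1u << 26);
    printf("Slot power limit: %.3f W\n", slot_power_limit_watts(dev_caps));
    return 0;
}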

Device Control Register

Figure 24-5 on page 906 and Table 24 - 3 on page 906 provide a description of each bit field in this register.
Figure 24-5: Device Control Register
Table 24 - 3: Device Control Register (read/write)
Bit(s)Description
0Correctable Error Reporting Enable. For a multifunction device, this bit controls error reporting for all functions. For a Root Port, the reporting of correctable errors occurs internally within the Root Complex. No external ERR_COR Message is generated. Default value of this field is 0.
1Non-Fatal Error Reporting Enable. This bit controls the reporting of non-fatal errors. For a multifunction device, it controls error reporting for all functions. For a Root Port, the reporting of non-fatal errors occurs internally within the Root Complex. No external ERR_NONFATAL Message is generated. Default value of this field is 0.

2Fatal Error Reporting Enable. This bit controls the reporting of fatal errors. For a multifunction device, it controls error reporting for all functions within the device. For a Root Port, the reporting of fatal errors occurs internally within the Root Complex. No external ERR_FATAL Message is generated. Default value of this bit is 0.
3Unsupported Request (UR) Reporting Enable. When set to one, this bit enables the reporting of Unsupported Requests. For a multifunction device, it controls UR reporting for all functions. The reporting of error messages (ERR_COR, ERR_NONFATAL, ERR_FATAL) received by a Root Port is controlled exclusively by the Root Control register (see "Root Control Register" on page 926). Default value of this bit is 0.
4Enable Relaxed Ordering. When set to one, the device is permitted to set the Relaxed Ordering bit (refer to "Relaxed Ordering" on page 319) in the Attributes field of requests it initiates that do not require strong write ordering. Default value of this bit is 1, but it may be hardwired to 0 if a device never sets the Relaxed Ordering attribute in requests it initiates as a Requester.
7:5Max Payload Size. Sets the max TLP data payload size for the device. As a Receiver, the device must handle TLPs as large as the set value; as a Transmitter, the device must not generate TLPs exceeding the set value. Permissible values that can be programmed are indicated by the Max Payload Size Supported field in the Device Capabilities register (see "Device Capabilities Register" on page 900). - 000b = 128 byte max payload size - 001b = 256 byte max payload size - 010b = 512 byte max payload size - 011b = 1024 byte max payload size - 100b = 2048 byte max payload size - 101b = 4096 byte max payload size - 110b = Reserved - 111b = Reserved Default value of this field is 000b.

8Extended Tag Field Enable. When set to one, enables a device to use an 8-bit Tag field as a Requester. If cleared to zero, the device is restricted to a 5-bit Tag field. Also refer to the description of the Phantom Functions Supported field in Table 24 - 2 on page 901. The default value of this bit is 0. Devices that do not implement this capability hardwire this bit to 0.
9Phantom Functions Enable. See the description of the Phantom Functions Supported field in Table 24 - 2 on page 901. Default value of this bit is 0. Devices that do not implement this capability hardwire this bit to 0.
10Auxiliary (AUX) Power PM Enable. When set to one, this bit enables a device to draw Aux power independent of PME Aux power. In a legacy OS environment, devices that require Aux power should continue to indicate PME Aux power requirements. Aux power is allocated as requested in the Aux Current field of the Power Management Capabilities register (PMC; see "Auxiliary Power" on page 645), independent of the PME Enable bit in the Power Management Control/Status register (PMCSR; see "Control/Status (PMCSR) Register" on page 599). For multifunction devices, a component is allowed to draw Aux power if at least one of the functions has this bit set. - Note: Devices that consume Aux power must preserve the value in this field when Aux power is available. In such devices, this register value is not modified by hot, warm, or cold reset. - Devices that do not implement this capability hardwire this bit to 0.

11Enable No Snoop. Software sets this bit to one if the area of memory this Requester will access is not cached by the processor(s). When a request packet that targets system memory (i.e., the memory that the processors cache from) is received by the Root Complex, the Root Complex does not have to delay the access to memory to perform a snoop transaction on the processor bus if the No Snoop attribute bit is set. This speeds up the memory access. - Note that setting this bit to one should not cause a function to unequivocally set the No Snoop attribute on every memory request that it initiates. The function may only set the bit when it knows that the processor(s) are not caching from the area of memory being accessed. - Default value of this bit is 1, and it may be hardwired to 0 if a device never sets the No Snoop attribute in Request transactions that it initiates.
14:12Max_Read_Request_Size. Max read request size for the device when acting as the Requester. The device must not generate read requests with a size greater than this value. - 000b = 128 byte max read request size - 001b = 256 byte max read request size - 010b = 512 byte max read request size - 011b = 1KB max read request size - 100b = 2KB max read request size - 101b = 4KB max read request size - 110b = Reserved - 111b = Reserved Devices that do not generate read requests larger than 128 bytes are permitted to implement this field as Read Only (RO) with a value of 000b. Default value of this field is 010b.
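Both the Max Payload Size and Max_Read_Request_Size encodings follow the same 128 × 2^n pattern, so decoding and programming them is mechanical. The sketch below assumes hypothetical cfg_read16()/cfg_write16() configuration-space helpers (not a real API) and is illustrative only.

#include <stdint.h>

/* Hypothetical configuration-space accessors (not a real API). */
extern uint16_t cfg_read16(uint16_t offset);
extern void     cfg_write16(uint16_t offset, uint16_t value);

/* Encodings 000b-101b of Max Payload Size (bits 7:5) and
 * Max_Read_Request_Size (bits 14:12) both decode to 128 << n bytes;
 * 110b and 111b are reserved. */
static int tlp_size_bytes(unsigned encoding)
{
    return (encoding <= 5) ? (128 << encoding) : -1;   /* -1 = reserved */
}

/* Program Max Payload Size (Device Control bits 7:5) with a
 * read-modify-write so the other control bits are preserved. */
static void set_max_payload(uint16_t devctl_offset, uint16_t encoding)
{
    uint16_t ctl = cfg_read16(devctl_offset);
    ctl = (uint16_t)((ctl & ~(0x7u << 5)) | ((encoding & 0x7u) << 5));
    cfg_write16(devctl_offset, ctl);
}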

Device Status Register

Figure 24-6 on page 910 and Table 24 - 4 on page 910 provide a description of each bit field in this register.
Table 24 - 4: Device Status Register
Bit(s)TypeDescription
0RW1CCorrectable Error Detected. A one indicates that one or more correctable errors were detected since the last time this bit was cleared by software. Correctable errors are reflected by this bit regardless of whether error reporting is enabled or not in the Device Control register (see "Device Control Register" on page 905). In a multifunction device, each function indicates whether or not that function has detected any correctable errors using this bit. For devices supporting Advanced Error Handling (see "Advanced Error Reporting Mechanisms" on page 382), errors are logged in this register regardless of the settings of the Correctable Error Mask register. Default value of this bit is 0.

1RW1CNon-Fatal Error Detected. A one indicates that one or more non-fatal errors were detected since the last time this bit was cleared by software. Non-fatal errors are reflected in this bit regardless of whether error reporting is enabled or not in the Device Control register (see "Device Control Register" on page 905). In a multifunction device, each function indicates whether or not that function has detected any non-fatal errors using this bit. For devices supporting Advanced Error Handling, errors are logged in this register regardless of the settings of the Uncorrectable Error Mask register (note that the 1.0a spec says "Correctable Error Mask register," but the authors think this is incorrect). Default value of this bit is 0.
2RW1CFatal Error Detected. A one indicates that one or more fatal errors were detected since the last time this bit was cleared by software. Fatal errors are reflected in this bit regardless of whether error reporting is enabled or not in the Device Control register (see "Device Control Register" on page 905). In a multifunction device, each function indicates whether or not that function has detected any fatal errors using this bit. For devices supporting Advanced Error Handling (see "Advanced Error Reporting Capability" on page 930), errors are logged in this register regardless of the settings of the Uncorrectable Error Mask register (note that the 1.0a spec erroneously says "Correctable Error Mask register"). Default value of this bit is 0.
3RW1CUnsupported Request (UR) Detected. When set to one, indicates that the function received an Unsupported Request. Errors are reflected in this bit regardless of whether error reporting is enabled or not in the Device Control register (see "Device Control Register" on page 905). In a multifunction device, each function indicates whether or not that function has detected any UR errors using this bit. Default value of this field is 0.
4ROAux Power Detected. Devices that require Aux power set this bit to one if Aux power is detected by the device.
5ROTransactions Pending. When set to one, indicates that this function has issued non-posted request packets which have not yet been completed (either by the receipt of a corresponding Completion, or by the Completion Timeout mechanism). A function reports this bit cleared only when all outstanding non-posted requests have completed or have been terminated by the Completion Timeout mechanism. - Root and Switch Ports: Root and Switch Ports adhering solely to the 1.0a Express spec never issue non-posted requests on their own behalf. Such Root and Switch Ports hardwire this bit to 0b.
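Because the error-detected bits in this register are RW1C, software clears them by writing back ones to the set positions rather than zeros. A minimal sketch, again assuming hypothetical cfg_read16()/cfg_write16() helpers:

#include <stdint.h>

/* Hypothetical configuration-space accessors (not a real API). */
extern uint16_t cfg_read16(uint16_t offset);
extern void     cfg_write16(uint16_t offset, uint16_t value);

/* Device Status bits 3:0 (Correctable, Non-Fatal, Fatal, UR Detected)
 * are RW1C: writing a 1 clears a bit, writing a 0 leaves it alone, so
 * writing back what was read clears exactly the bits that were set. */
static void clear_device_status_errors(uint16_t devsta_offset)
{
    uint16_t sta = cfg_read16(devsta_offset);
    cfg_write16(devsta_offset, sta & 0x000F);
}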

Link Registers (Required)

There are three link-related registers:
  • The Link Capabilities Register.
  • The Link Control Register.
  • The Link Status Register.
Link Capabilities Register. Figure 24-7 on page 913 and Table 24 - 5 on page 913 provide a description of each bit field in this register.

Figure 24-7: Link Capabilities Register
Table 24 - 5: Link Capabilities Register
Bit(s)TypeDescription
3:0ROMaximum Link Speed. - 0001b=2.5Gb/s - All other encodings are reserved.
9:4ROMaximum Link Width. - 000000b = Reserved - 000001b = x1 - 000010b = x2 - 000100b = x4 - 001000b = x8 - 001100b = x12 - 010000b = x16 - 100000b=x32 - All other values are reserved.

11:10ROActive State Power Management (ASPM) Support. Indicates the level of ASPM supported on this Link. - 00b = Reserved - 01b = L0s Entry Supported - 10b = Reserved - 11b = L0s and L1 Supported Refer to "Link Active State Power Management" on page 608 for more information.
14:12ROL0s Exit Latency. Indicates the L0s exit latency for the Link (i.e., the length of time this Port requires to complete a transition from L0s to L0). - 000b = Less than 64ns - 001b = 64ns to less than 128ns - 010b = 128ns to less than 256ns - 011b = 256ns to less than 512ns - 100b = 512ns to less than 1μs - 101b = 1μs to less than 2μs - 110b = 2μs to less than 4μs - 111b = More than 4μs Note: Exit latencies may be influenced by a port's reference clock configuration (i.e., whether the port uses the reference clock supplied by the port at the remote end of the link or it provides its own local reference clock). Refer to "ASPM Exit Latency" on page 624 for more information.

17:15ROL1 Exit Latency. Indicates the L1 exit latency for the Link (i.e., the length of time this Port requires to complete a transition from L1 to L0). - 000b = Less than 1μs - 001b = 1μs to less than 2μs - 010b = 2μs to less than 4μs - 011b = 4μs to less than 8μs - 100b = 8μs to less than 16μs - 101b = 16μs to less than 32μs - 110b = 32μs to less than 64μs - 111b = More than 64μs Note: Exit latencies may be influenced by a port's reference clock configuration (i.e., whether the port uses the reference clock supplied by the port at the remote end of the link or it provides its own local reference clock). Refer to "ASPM Exit Latency" on page 624 for more information.
31:24HWInitPort Number. Indicates the Port number associated with this Link. The port number is assigned by the hardware designer.
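All of the fields above are read-only and can be extracted with simple shifts and masks. A minimal sketch using the bit positions from Table 24 - 5; the function name and print formatting are illustrative only.

#include <stdint.h>
#include <stdio.h>

/* Extract the read-only fields of the Link Capabilities register using
 * the bit positions listed in Table 24 - 5. */
static void decode_link_caps(uint32_t lnkcap)
{
    unsigned max_speed = lnkcap & 0xF;          /* 3:0,  0001b = 2.5 Gb/s   */
    unsigned max_width = (lnkcap >> 4) & 0x3F;  /* 9:4,  e.g. 010000b = x16 */
    unsigned aspm      = (lnkcap >> 10) & 0x3;  /* 11:10                    */
    unsigned l0s_lat   = (lnkcap >> 12) & 0x7;  /* 14:12                    */
    unsigned l1_lat    = (lnkcap >> 15) & 0x7;  /* 17:15                    */
    unsigned port_num  = (lnkcap >> 24) & 0xFF; /* 31:24                    */

    printf("speed=%u width=x%u aspm=%u l0s=%u l1=%u port=%u\n",
           max_speed, max_width, aspm, l0s_lat, l1_lat, port_num);
}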
Link Control Register. Figure 24-8 on page 916 and Table 24 - 6 on page 916 provide a description of each bit field in this register.
Table 24 - 6: Link Control Register
Bit(s)TypeDescription
1:0RWActive State Power Management (ASPM) Control. Controls the level of ASPM supported on the Link. - 00b = Disabled. - 01b = L0s Entry Enabled. Indicates the Transmitter entering L0s is supported. The Receiver must be capable of entering L0s even when this field is disabled (00b). - 10b = L1 Entry Enabled. - 11b = L0s and L1 Entry Enabled. Default value of this field is 00b or 01b depending on form factor. At the time of writing, only the Electromechanical specification had been released, and it makes no mention of the default state of the ASPM bits.

3RO for Root and Switch Ports; RW for EndpointsRead Completion Boundary (RCB). - Root Ports: Hardwired. Indicates the RCB value for the Root Port. It is a hardwired, read-only value indicating the RCB support capabilities: - 0b = 64 byte - 1b = 128 byte - Endpoints: Set by configuration software to indicate the RCB value of the Root Port upstream from the Endpoint. Devices that do not implement this feature must hardwire the field to 0b. - 0b = 64 byte - 1b = 128 byte - Switch Ports: Reserved and hardwired to 0b.
4RWLink Disable. 1 = disable the Link. Reserved on Endpoint devices and Switch upstream ports. The value written can be read back immediately, before the link has actually changed state. Default value of this bit is 0b.
5RWRetrain Link. - 1 = initiate Link retraining by changing the Physical Layer LTSSM to the Recovery state. - Reads of this bit always return 0b. - Reserved on Endpoint devices and Switch upstream ports. See "Link Errors" on page 379 for more information.
6RWCommon Clock Configuration. - 1 indicates that this component and the component at the opposite end of this Link are using a common reference clock. - 0 indicates that this component and the component at the opposite end of this Link are using separate reference clocks. A component factors this bit setting into its calculation of the L0s and L1 Exit Latencies (see Table 24 - 5 on page 913) that it reports in the Link Capabilities register. - After changing this bit in a component on either end of a Link, software must trigger the Link to retrain by setting the Retrain Link bit to one in this register. - Default value of this field is 0b. See "ASPM Exit Latency" on page 624 for more information.
7RWExtended Sync. When set to one, this bit forces the transmission of: - 4096 FTS Ordered Sets during the L0s state, - followed by a single SKP Ordered Set prior to entering the L0 state, - as well as the transmission of 1024 TS1 Ordered Sets in the L1 state prior to entering the Recovery state. This mode gives external devices (e.g., logic analyzers) that may be monitoring Link activity time to achieve bit and symbol lock before the Link enters the L0 or Recovery state and resumes communication. Default value for this bit is 0b. See "L0s State" on page 611 for more information.
Link Status Register. Figure 24-9 on page 918 and Table 24 - 7 on page 919 provide a description of each bit field in this register.
Figure 24-9: Link Status Register
Table 24 - 7: Link Status Register
Bit(s)TypeDescription
3:0ROLink Speed. The negotiated Link speed. - 0001b=2.5Gb/s All other encodings are reserved.
9:4RONegotiated Link Width. The negotiated Link width. - 000001b = x1 - 000010b = x2 - 000100b = x4 - 001000b = x8 - 001100b = x12 - 010000b = x16 - 100000b = x32 - All other encodings are reserved. See "Negotiate Link Width[9:4]" on page 551 for more information.
10ROTraining Error. 1 = indicates that a Link training error occurred. Reserved on Endpoint devices and Switch upstream ports. Cleared by hardware upon successful training of the Link to the L0 Link state. See "Link Errors" on page 379 for more information.
11ROLink Training. When set to one, indicates that Link training is in progress (the Physical Layer LTSSM is in the Configuration or Recovery state) or that the Retrain Link bit was set to one but Link training has not yet begun. - Hardware clears this bit once Link training is complete. - This bit is not applicable and reserved on Endpoint devices and the Upstream Ports of Switches. See "Link Errors" on page 379 for more information.
12HWInitSlot Clock Configuration. This bit indicates that the component uses the same physical reference clock that the platform provides on the connector. If the device uses an independent clock irrespective of the presence of a reference on the connector, this bit must be clear. See "Config. Registers Used for ASPM Exit Latency Management and Reporting" on page 628 for more information.
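The Retrain Link bit in the Link Control register and the Link Training bit above are typically used together: software sets Retrain Link and then polls Link Training until hardware clears it. A sketch of that sequence, assuming the same hypothetical configuration-access helpers as the earlier examples:

#include <stdint.h>

/* Hypothetical configuration-space accessors (not a real API). */
extern uint16_t cfg_read16(uint16_t offset);
extern void     cfg_write16(uint16_t offset, uint16_t value);

/* Set the Retrain Link bit (Link Control bit 5), then poll the Link
 * Training bit (Link Status bit 11) until hardware clears it.  Both
 * bits are reserved on Endpoint devices and Switch upstream ports. */
static void retrain_link(uint16_t lnkctl_off, uint16_t lnksta_off)
{
    cfg_write16(lnkctl_off, (uint16_t)(cfg_read16(lnkctl_off) | (1u << 5)));
    while (cfg_read16(lnksta_off) & (1u << 11))
        ;   /* a real driver would bound this loop with a timeout */
}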

Slot Registers

Introduction

The slot-specific register set must be implemented for Root Port bridges and Switch downstream port bridges that are connected to add-in slot connectors. The registers are:
  • Slot Capabilities Register
  • Slot Control Register
  • Slot Status Register
They are described in the sections that follow.

Slot Capabilities Register

Figure 24-10 on page 921 and Table 24 - 8 on page 921 provide a description of each bit field in this register.

Figure 24-10: Slot Capabilities Register
Table 24 - 8: Slot Capabilities Register (all fields are HWInit)
Bit(s)Description
0Attention Button Present. 1 = An Attention Button is implemented on the chassis for this slot.
1Power Controller Present. 1 = A Power Controller is implemented for this slot.
2MRL (Manually-operated Retention Latch) Sensor Present. 1= An MRL Sensor is implemented on the chassis for this slot.
3Attention Indicator Present. 1 = An Attention Indicator is implemented on the chassis for this slot.

4Power Indicator Present. 1 = A Power Indicator is implemented on the chassis for this slot.
5Hot-Plug Surprise. 1 = A device installed in this slot may be removed from the system without any prior notification.
6Hot-Plug Capable. 1 = This slot supports Hot-Plug operations.
14:7Slot Power Limit Value. In combination with the Slot Power Limit Scale value (see the next row in this table), specifies the max power (in Watts) available to the device installed in this slot. - Max power limit = Slot Power Limit Value × Slot Power Limit Scale value. - This field must be implemented if the Slot Implemented bit is set to one in the PCI Express Capabilities Register (see "PCI Express Capabilities Register" on page 898). - A write to this field causes the Port to send the Set Slot Power Limit Message upstream to the port at the other end of the Link. - The default value prior to hardware/firmware initialization is 0000 0000b. See "The Power Budget Capabilities Register Set" on page 564 for more information.
16:15Slot Power Limit Scale. See the description in the previous row of this table. - Possible values: - 00b = 1.0x - 01b = 0.1x - 10b = 0.01x - 11b = 0.001x - This field must be implemented if the Slot Implemented bit is set to one in the PCI Express Capabilities Register (see "PCI Express Capabilities Register" on page 898). - A write to this field causes the Port to send the Set Slot Power Limit Message upstream to the port at the other end of the Link. - The default value prior to hardware/firmware initialization is 00b.

31:19Physical Slot Number. Indicates the physical slot number attached to this Port. Must be hardware initialized to a value that assigns a slot number that is globally unique within the chassis. Must be initialized to 0 for Ports connected to devices that are either integrated on the system board or integrated within the same silicon as the Switch downstream port or the Root Port. See "Chassis and Slot Number Assignment" on page 861 for more information.

Slot Control Register

Figure 24-11 on page 923 and Table 24 - 9 on page 924 provide a description of each bit field in this register.
Figure 24-11: Slot Control Register
Table 24 - 9: Slot Control Register (all fields are RW)
Bit(s)Description
0Attention Button Pressed Enable. When set to one, enables the generation of a Hot-Plug interrupt or a wakeup event when the attention button is pressed. Default value of this field is 0. See "Attention Button" on page 667 for more information.
1Power Fault Detected Enable. When set to one, enables the generation of a Hot-Plug interrupt or a wakeup event on a power fault event. Default value of this field is 0.
2MRL Sensor Changed Enable. When set to one, enables the generation of a Hot-Plug interrupt or a wakeup event on an MRL sensor changed event. Default value of this field is 0. See "Electromechanical Interlock (optional)" on page 667 for more information.
3Presence Detect Changed Enable. When set to one, enables the generation of a Hot-Plug interrupt or a wakeup event on a presence detect changed event. Default value of this field is 0. See “Slot Status and Events Management” on page 674 for more information.
4Command Completed Interrupt Enable. When set to one, enables the generation of a Hot-Plug interrupt when a command is completed by the Hot-Plug Controller. Default value of this field is 0.
5Hot-Plug Interrupt Enable. When set to one, enables the generation of a Hot-Plug interrupt on enabled Hot-Plug events. Default value of this field is 0.
7:6Attention Indicator Control. A read from this field returns the current state of the Attention Indicator, while a write sets the Attention Indicator to the state indicated below: - 00b = Reserved - 01b = On - 10b = Blink - 11b = Off Writes to this field also cause the Port to send the respective Attention Indicator message.
9:8Power Indicator Control. A read from this field returns the current state of the Power Indicator, while a write sets the Power Indicator to the state indicated below: - 00b = Reserved - 01b = On - 10b = Blink - 11b = Off Writes to this field also cause the Port to send the respective Power Indicator message.
10Power Controller Control. A read from this field returns the current state of the power applied to the slot, while a write sets the power state of the slot to the state indicated below: - 0b= Power On - 1b= Power Off
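Because the indicator and power-controller fields share the Slot Control register with the various enable bits, software normally updates them with a read-modify-write. A minimal sketch for the Power Indicator Control field (bits 9:8), reusing the hypothetical helpers from earlier examples:

#include <stdint.h>

/* Hypothetical configuration-space accessors (not a real API). */
extern uint16_t cfg_read16(uint16_t offset);
extern void     cfg_write16(uint16_t offset, uint16_t value);

#define PWR_IND_ON    0x1u   /* 01b */
#define PWR_IND_BLINK 0x2u   /* 10b */
#define PWR_IND_OFF   0x3u   /* 11b */

/* Read-modify-write of Slot Control bits 9:8 (Power Indicator Control).
 * The write also causes the Port to send the corresponding Power
 * Indicator message. */
static void set_power_indicator(uint16_t slotctl_off, uint16_t state)
{
    uint16_t ctl = cfg_read16(slotctl_off);
    ctl = (uint16_t)((ctl & ~(0x3u << 8)) | ((state & 0x3u) << 8));
    cfg_write16(slotctl_off, ctl);
}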

Slot Status Register

Figure 24-12 on page 925 and Table 24 - 10 on page 926 provide a description of each bit field in this register.
Figure 24-12: Slot Status Register
Table 24 - 10: Slot Status Register
Bit(s)TypeDescription
0RW1CAttention Button Pressed. 1= attention button pressed.
1RW1CPower Fault Detected. 1 = Power Controller detected a power fault at this slot.
2RW1CMRL Sensor Changed. 1 = MRL Sensor state change detected.
3RW1CPresence Detect Changed. 1= Presence Detect change detected.
4RW1CCommand Completed. 1 = Hot-Plug Controller completed a command.
5ROMRL Sensor State. MRL sensor status (if MRL implemented). - 0b = MRL Closed - 1b= MRL Open
6ROPresence Detect State. When set to one, a card is present in the slot (as indicated either by an in-band mechanism or via the Presence Detect pins as defined in the PCI Express Card Electromechanical Specification). - 0b = Slot Empty - 1b = Card Present in slot This field must be implemented on all Switch downstream ports and on Root Ports that are attached to an add-in connector. It is hardwired to one if the port is not connected to an add-in slot connector.

Root Port Registers

Introduction

All Root Ports must implement the Root Control and Root Status registers. The following two sections provide a detailed description of these two registers.

Root Control Register

Figure 24-13 on page 927 and Table 24 -11 on page 927 provide a description of each bit field in this register.

Table 24 - 11: Root Control Register (all fields are RW)
Bit(s)Description
0System Error on Correctable Error Enable. When set to one, a System Error is generated if a correctable error (ERR_COR) is reported by any of the child (i.e., downstream) devices associated with this Root Port, or by the Root Port itself. The mechanism for signaling a System Error to the system is system-specific (e.g., in an x86-based system, a Non-Maskable Interrupt—NMI—could be generated to the processor). Default value of this bit is 0. See "Reporting Errors to the Host System" on page 392 for more information.
1System Error on Non-Fatal Error Enable. When set to one, a System Error is generated if a non-fatal error (ERR_NONFATAL) is reported by any of the child (i.e., downstream) devices associated with this Root Port, or by the Root Port itself. The mechanism for signaling a System Error to the system is system-specific (e.g., in an x86-based system, a Non-Maskable Interrupt—NMI—could be generated to the processor). Default value of this bit is 0. See "Reporting Errors to the Host System" on page 392 for more information.

2System Error on Fatal Error Enable. When set to one, a System Error is generated if a fatal error (ERR_FATAL) is reported by any of the child (i.e., downstream) devices associated with this Root Port, or by the Root Port itself. The mechanism for signaling a System Error to the system is system-specific (e.g., in an x86-based system, a Non-Maskable Interrupt—NMI—could be generated to the processor). Default value of this field is 0. See "Reporting Errors to the Host System" on page 392 for more information.
3PME Interrupt Enable. When set to one, enables interrupt generation on receipt of a PME Message from a child (i.e., downstream) device (which sets the PME Status bit in the Root Status register—see "Root Status Register" on page 928—to one). A PME interrupt is also generated when software sets this bit to one (assuming it was originally cleared to zero) while the PME Status bit in the Root Status register is set to one. Default value of this field is 0. See "The PME Sequence" on page 640 for more information.

Root Status Register

Figure 24-14 on page 928 and Table 24 - 12 on page 929 provide a description of each bit field in this register.
Figure 24-14: Root Status Register
Table 24 - 12: Root Status Register
Bit(s)TypeDescription
15:0ROPME Requestor ID. Contains the Requester ID of the last child (i.e., downstream) device to issue a PME.
16RW1CPME Status. When set to one, indicates that PME was asserted by the Requester indicated in the PME Requestor ID field. Subsequent PMEs remain pending until this bit is cleared by software by writing a 1 to it.
17ROPME Pending. When set to one and the PME Status bit is set, indicates that another PME is pending. When the PME Status bit is cleared by software, the Root Port hardware indicates the delivery of the next PME by setting the PME Status bit again and updating the Requester ID field appropriately. The PME Pending bit is cleared by hardware when no more PMEs are pending.
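Putting the three fields together: a PME service routine reads the register, captures the Requester ID, and clears the RW1C PME Status bit so that any pending PME can be delivered. A minimal sketch with hypothetical cfg_read32()/cfg_write32() helpers; the dispatch logic is omitted.

#include <stdint.h>

/* Hypothetical configuration-space accessors (not a real API). */
extern uint32_t cfg_read32(uint16_t offset);
extern void     cfg_write32(uint16_t offset, uint32_t value);

/* Service one PME indication: capture the Requester ID (bits 15:0),
 * then clear the RW1C PME Status bit (bit 16) so the Root Port can
 * report the next pending PME (flagged by PME Pending, bit 17). */
static void handle_root_pme(uint16_t rootsta_off)
{
    uint32_t sta = cfg_read32(rootsta_off);
    if (sta & (1u << 16)) {                    /* PME Status set        */
        uint16_t requester_id = (uint16_t)(sta & 0xFFFF);
        (void)requester_id;                    /* dispatch to the owner */
        cfg_write32(rootsta_off, 1u << 16);    /* write 1 to clear      */
    }
}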

PCI Express Extended Capabilities

General

A PCI Express function may optionally implement any, all, or none of the following Extended Capability register sets:
  • Advanced Error Reporting Capability register set.
  • Virtual Channel (VC) Capability register set.
  • Device Serial Number Capability register set.
  • Power Budgeting Capability register set.
Refer to Figure 24-1 on page 895. The first extended capability register set must be implemented at offset 100h in a function’s 4KB configuration space and its Enhanced Capability Header register (see Figure 24-15 on page 930) contains a pointer (the Next Capability Offset field; this 12-bit field must contain either the dword-aligned start address of the next capability register set, or a value of zero if this is the last of the extended capability register sets) to the next extended capability register set in the list. The respective capability IDs of each register set are:
  • Advanced Error Reporting Capability register set. ID =0001h .
  • Virtual Channel (VC) Capability register set. ID =0002h .

  • Device Serial Number Capability register set. ID =0003h .
  • Power Budgeting Capability register set. ID =0004h .
The Capability Version field is assigned by the SIG and defines the layout of the register set. It must be 1h for all of the extended capabilities currently defined.
Figure 24-15: Enhanced Capability Header Register
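Software locates a particular extended capability by walking this list from offset 100h, following the Next Capability Offset field until it reads zero. A minimal sketch, assuming the header layout of Figure 24-15 (Capability ID in bits 15:0, Capability Version in bits 19:16, and the 12-bit Next Capability Offset in bits 31:20) and a hypothetical cfg_read32() helper:

#include <stdint.h>

/* Hypothetical configuration-space accessor (not a real API). */
extern uint32_t cfg_read32(uint16_t offset);

/* Walk the extended capability list starting at offset 100h.  Each
 * Enhanced Capability Header carries the Capability ID (bits 15:0),
 * the Capability Version (bits 19:16) and the 12-bit, dword-aligned
 * Next Capability Offset (bits 31:20); a next offset of 0 ends the list. */
static uint16_t find_ext_cap(uint16_t wanted_id)
{
    uint16_t offset = 0x100;
    while (offset != 0) {
        uint32_t header = cfg_read32(offset);
        if ((header & 0xFFFF) == wanted_id)
            return offset;
        offset = (uint16_t)((header >> 20) & 0xFFF);
    }
    return 0;   /* not found */
}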

Advanced Error Reporting Capability

General

Figure 24-16 on page 931 illustrates the optional Advanced Error Reporting capability register set. Note that the registers in the last three dwords of this register set may only be implemented for a Root Port function (one that has a value of 0100b in the Device/Port Type field of the PCI Express Capabilities register in the function's PCI-compatible configuration space). This capability register set consists of the registers pictured in Figure 24-16 on page 931 and described in Table 24 - 13 on page 932.

Detailed Description

For a detailed description of the Advanced Error Reporting capability register set, refer to "Advanced Error Reporting Mechanisms" on page 382.

Table 24 - 13: Advanced Error Reporting Capability Register Set
Register GroupRegisterDescription
NAEnhanced Capability HeaderCapability ID =0001h . The Capability Version field in this register is assigned by the SIG and defines the layout of the register set. It must be 1h for all of the extended capabilities currently defined. See Figure 24-17 on page 935.
NACapabilities and Control RegisterContains the following bit fields: - First Error Pointer. Read-only. Identifies the bit position of the first error reported in the Uncorrectable Error Status register (see Figure 24-23 on page 937). - ECRC Generation Capable. Read-only. 1 indicates that the function is capable of generating ECRC (End-to-End CRC; refer to "ECRC Generation and Checking" on page 361). - ECRC Generation Enable. Read/write sticky bit. When set to one, enables ECRC generation. Default = 0. - ECRC Check Capable. Read-only. 1 indicates that the function is capable of checking ECRC. - ECRC Check Enable. Read/write sticky bit. When set to one, enables ECRC checking. Default = 0. See Figure 24-18 on page 935.
Correctable Error RegistersCorrectable Error Mask RegisterControls the reporting of individual correctable errors by the function to the Root Complex via a PCI Express error message. A masked error (respective bit set to one) is not reported to the Root Complex by the function. This register contains a mask bit for each corresponding error bit in the Correctable Error Status register (see the next row in this table and Figure 24-19 on page 935).
Correctable Error Status RegisterReports the error status of the function's correctable error sources. Software clears a set bit by writing a 1 to the respective bit. See Figure 24-20 on page 936.

Uncorrectable Error RegistersUncorrectable Error Mask RegisterControls the function's reporting of errors to the Root Complex via a PCI Express error message. A masked error (respective bit set to 1b): - is not logged in the Header Log register (see Figure 24-16 on page 931), - does not update the First Error Pointer (see the description of the Capabilities and Control Register in this table), and - is not reported to the Root Complex. This register (see Figure 24-21 on page 936) contains a mask bit for each corresponding error bit in the Uncorrectable Error Status register.
Uncorrectable Error Severity RegisterEach respective bit controls whether an error is reported to the Root Complex via a non-fatal or fatal error message. An error is reported as fatal if the corresponding bit is set to one. See Figure 24-22 on page 937.
Uncorrectable Error Status RegisterReports the error status of the function's uncorrectable error sources. See Figure 24-23 on page 937.

Root Error RegistersRoot Error Command RegisterControls the Root Complex's ability to generate an interrupt to the processor upon receipt of: - a correctable error message, - a non-fatal error message, or - a fatal error message from a child function downstream of the Root Port. See Figure 24-24 on page 938.
Root Error Status RegisterTracks the Root Port's receipt of error messages received by the Root Complex from a child function downstream of the Root Port, and of errors detected by the Root Port itself. Non-fatal and fatal error messages are grouped together as uncorrectable. There is a first error bit and a next error bit associated with correctable and uncorrectable errors, respectively. When an error is received by a Root Port, the respective first error bit is set and the Requestor ID is logged in the Error Source Identification register. If software does not clear the first reported error before another error message is received of the same category (correctable or uncorrectable), the corresponding next error status bit will be set, but the Requestor ID of the subsequent error message is discarded. Updated regardless of the settings in the Root Control and the Root Error Command registers. See Figure 24-25 on page 938.
Uncorrectable Error Source ID RegisterIdentifies the source (Requestor ID) of the first uncorrectable (non-fatal/fatal) error reported in the Root Error Status register. Updated regardless of the settings in the Root Control and the Root Error Command registers. See Figure 24-26 on page 938.
Correctable Error Source ID RegisterIdentifies the source (Requestor ID) of the first correctable error reported in the Root Error Status register. Updated regardless of the settings in the Root Control and the Root Error Command registers. See Figure 24-26 on page 938.
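Software that wants end-to-end CRC protection checks the two "capable" bits before setting the corresponding enable bits in the Capabilities and Control register described above. The bit positions used below (5 through 8) follow the spec's layout of the Advanced Error Capabilities and Control register; they are not listed in the table above and are an assumption here, as are the configuration-access helpers.

#include <stdint.h>

/* Hypothetical configuration-space accessors (not a real API). */
extern uint32_t cfg_read32(uint16_t offset);
extern void     cfg_write32(uint16_t offset, uint32_t value);

/* Enable ECRC generation and checking only where the corresponding
 * "capable" bits are set.  Bit positions 5-8 are assumed from the
 * spec's layout of the Advanced Error Capabilities and Control
 * register and should be verified against it. */
static void enable_ecrc(uint16_t aer_capctl_off)
{
    uint32_t reg = cfg_read32(aer_capctl_off);
    if (reg & (1u << 5))    /* ECRC Generation Capable */
        reg |= (1u << 6);   /* ECRC Generation Enable  */
    if (reg & (1u << 7))    /* ECRC Check Capable      */
        reg |= (1u << 8);   /* ECRC Check Enable       */
    cfg_write32(aer_capctl_off, reg);
}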

Figure 24-17: Advanced Error Reporting Enhanced Capability Header
Offset to the next PCI Express capability structure or 000h if no other items exist in the linked list of capabilities. Offset is relative to the beginning of PCI-compatible configuration space and must be either 000h (for terminating list of capabilities) or greater than 0FFh.
Figure 24-18: Advanced Error Capabilities and Control Register
Figure 24-19: Advanced Error Correctable Error Mask Register

Figure 24-20: Advanced Error Correctable Error Status Register
Figure 24-21: Advanced Error Uncorrectable Error Mask Register

Figure 24-22: Advanced Error Uncorrectable Error Severity Register
Figure 24-23: Advanced Error Uncorrectable Error Status Register

Figure 24-25: Advanced Error Root Error Status Register
Figure 24-26: Advanced Error Uncorrectable and Correctable Error Source ID Registers

Virtual Channel Capability

The VC Register Set's Purpose

This register set serves several purposes:
  • In a port that implements multiple VC buffers, it permits the configuration of the TC-to-VC mapping.
  • See Figure 24-27 on page 939. In an egress port that implements multiple VC buffers, it permits the configuration of the arbitration scheme that defines the order in which each VC accepts packets from the various source ingress ports within the device. This is referred to as the VC's port arbitration scheme.
  • See Figure 24-27 on page 939. In an egress port that implements multiple VC buffers, it permits the configuration of the arbitration scheme that defines the order in which the egress port accepts packets from its VC buffers for transmit onto the link. This is referred to as the port's VC arbitration scheme.
  • In a port that only implements a single VC (VC0), the configuration software may specify that only packets with certain TCs be accepted into the VC0 buffer for transfer. This is referred to as TC filtering.
Figure 24-27: Port and VC Arbitration

Who Must Implement This Register Set?

The following functions must implement this optional register set:
  • A function (i.e., a port) that only implements VC0 but permits the configuration software to specify that only packets with certain TCs may be placed in the VC0 buffer for transfer.
  • A function that implements VCs in addition to VC0.
This applies to Endpoint devices, upstream and downstream Switch ports, Root Ports, and RCRBs.

Multifunction Upstream Port Restriction

The spec contains the following statement:
"The PCI Express Virtual Channel Capability structure can be present in the Extended Configuration Space of all devices or in RCRB with the restriction that it is only present in the Extended Configuration Space of Function 0 for multifunction devices at their Upstream Ports."
The authors take this to mean that if the upstream port of a device is implemented as a multifunction device (see Figure 21-11 on page 761) and that port meets the criteria specified in "Who Must Implement This Register Set?" on page 940, this capability register set is only implemented in the Extended Configuration Space of function 0 of that device.

The Register Set

Figure 24-28 on page 941 illustrates the VC Capability register set and Figure 24- 29 on page 941 illustrates the detail of its Enhanced Capability Header register.

Detailed Description of VCs

For a detailed description of VCs, refer to Chapter 6, entitled "QoS/TCs/VCs and Arbitration," on page 251.

Figure 24-28: Virtual Channel Capability Register Set
Figure 24-29: VC Enhanced Capability Header

Port VC Capability Register 1

The register is illustrated in Figure 24-30 on page 942 and each bit field is described in Table 24 - 14 on page 942.