PCI Express System Architecture

MINDSHARE
Ravi Budruk
Don Anderson
Tom Shanley
Edited by Joe Winkles
PC System Architecture Series
MindShare: world-class technical training

Are your company's technical training needs being addressed in the most effective manner?

MindShare has more than 25 years of experience conducting technical training on cutting-edge technologies. We understand the challenges companies face when searching for quality, effective training that reduces students' time away from work and provides cost-effective alternatives. MindShare offers many flexible solutions to meet those needs. Our courses are taught by highly skilled, enthusiastic, knowledgeable, and experienced instructors. We bring life to knowledge through a wide variety of learning methods and delivery options.

training that fits your needs

MindShare recognizes and addresses your company's technical training issues with:

  • Scalable-cost training - customizable training options
  • Just-in-time training - overview and advanced-topic courses
  • Training in a classroom, at your cubicle, or in your home office
  • Reduced time away from work
  • Training delivered effectively around the globe
  • Concurrent multiple-site training delivery
MindShare training courses expand your technical skillset
  • PCI Express 2.0®
  • Intel Core 2 Processor Architecture
  • AMD Opteron Processor Architecture
  • Intel 64 and IA-32 Software Architecture
  • Intel PC and Chipset Architecture
  • PC Virtualization
  • USB 2.0
  • Wireless USB
  • Serial ATA (SATA)
  • Serial Attached SCSI (SAS)
  • DDR2/DDR3 DRAM Technology
  • PC BIOS Firmware
  • High-Speed Design
  • Windows Internals and Drivers
  • Linux Fundamentals
... and many more.
All courses can be customized to meet your group's needs. Detailed course outlines can be found at www.mindshare.com.

Engage MindShare

Have knowledge that you want to bring to life? MindShare will work with you to “Bring Your Knowledge to Life.” Engage us to transform your knowledge and design courses that can be delivered in classroom or virtual classroom settings, create online eLearning modules, or publish a book that you author.

We are proud to be the preferred training provider for an extensive list of clients, including:

ADAPTEC • AMD • AGILENT TECHNOLOGIES • APPLE • BROADCOM • CADENCE • CRAY • CISCO • DELL • FREESCALE • GENERAL DYNAMICS • HP • IBM • KODAK • LSI LOGIC • MOTOROLA • MICROSOFT • NASA • NATIONAL SEMICONDUCTOR • NETAPP • NOKIA • NVIDIA • PLX TECHNOLOGY • QLOGIC • SIEMENS • SUN MICROSYSTEMS • SYNOPSYS • TI • UNISYS
PCI Express
System Architecture

The PC System Architecture Series


MindShare, Inc.
Please see our web site at www.awprofessional.com/series/ for more information on these titles.
AGP System Architecture: Second Edition
0-201-70069-7
CardBus System Architecture
0-201-40997-6
FireWire ® System Architecture: Second Edition
0-201-48535-4
HyperTransport TM System Architecture
0-321-16845-3
InfiniBand System Architecture
0-321-11765-4
ISA System Architecture: Third Edition
0-201-40996-8
PCI Express System Architecture
0-321-15630-7
PCI System Architecture: Fourth Edition
0-201-30974-2
PCI-X System Architecture
0-201-72682-3
PCMCIA System Architecture: Second Edition
0-201-40991-7
Pentium® Pro and Pentium® II System Architecture: Second Edition
0-201-30973-4
Pentium® Processor System Architecture: Second Edition
0-201-40992-5
Plug and Play System Architecture
0-201-41013-3
Protected Mode Software Architecture
0-201-55447-X
The Unabridged Pentium 4
0-321-24656-X
Universal Serial Bus System Architecture: Second Edition
0-201-30975-0

PCI Express System Architecture

MindShare, Inc.
Ravi Budruk
Don Anderson
Tom Shanley
Technical Edit by Joe Winkles
ADDISON-WESLEY DEVELOPER'S PRESS
Boston San Francisco New York Toronto
Montreal London Munich Paris Madrid Sydney
Cape Town Tokyo Singapore Mexico City
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of the trademark claim, the designations have been printed in initial capital letters or all capital letters.
The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.
Library of Congress Cataloging-in-Publication Data
ISBN: 0-321-15630-7
Copyright ©2003 by MindShare, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America. Published simultaneously in Canada.
Sponsoring Editor:
Project Manager:
Cover Design:
Set in 10 point Palatino by MindShare, Inc.
123456789-MA-999897
9th Printing, April, 2008
Addison-Wesley books are available for bulk purchases by corporations, institutions, and other organizations. For more information, please contact the Corporate, Government, and Special Sales Department at (800) 238-9682.
Find A-W Developer's Press on the World-Wide Web at:
http://www.awl.com/devpress/
To my parents Aruna and Shripal Budruk who started me on the path to Knowledge
Contents

About This Book

The MindShare Architecture Series ..1
Cautionary Note . .. 2
Intended Audience ..2
Prerequisite Knowledge . .. 3
Topics and Organization . . 3
Documentation Conventions. ..4
PCI Express™ ..4
Hexadecimal Notation ..4
Binary Notation. ..4
Decimal Notation. ..4
Bits Versus Bytes Notation ..5
Bit Fields. ..5
Active Signal States. ..5
Visit Our Web Site. . . 5
We Want Your Feedback. .. 6

Part One: The Big Picture

Chapter 1: Architectural Perspective

Introduction To PCI Express. ...9
The Role of the Original PCI Solution.. .10
Don't Throw Away What is Good! Keep It .10
Make Improvements for the Future.. ..10
Looking into the Future .. 11
Predecessor Buses Compared . .. 11
Author's Disclaimer.. ..12
Bus Performances and Number of Slots Compared . . 12
PCI Express Aggregate Throughput . .13
Performance Per Pin Compared ..14
I/O Bus Architecture Perspective . ..16
33 MHz PCI Bus Based System .16
Electrical Load Limit of a 33 MHz PCI Bus. .17
PCI Transaction Model - Programmed IO. .. 19
PCI Transaction Model - Peer-to-Peer. .22
PCI Bus Arbitration. . . 22
PCI Delayed Transaction Protocol ..23
PCI Retry Protocol: . .23
PCI Disconnect Protocol: . . 24
PCI Interrupt Handling.. ..25

PCI Error Handling . . . 26
PCI Address Space Map . ..27
PCI Configuration Cycle Generation. ..29
PCI Function Configuration Register Space .30
PCI Programming Model . .31
Limitations of a 33 MHz PCI System. ..31
Latest Generation of Intel PCI Chipsets .32
66 MHz PCI Bus Based System. .33
Limitations of 66MHz PCI bus . .34
Limitations of PCI Architecture ..34
66 MHz and 133 MHz PCI-X 1.0 Bus Based Platforms. .35
PCI-X Features.. . 36
PCI-X Requester/Completer Split Transaction Model . . . 37
DDR and QDR PCI-X 2.0 Bus Based Platforms . . 39
The PCI Express Way . ..41
The Link - A Point-to-Point Interconnect .41
Differential Signaling ..41
Switches Used to Interconnect Multiple Devices. . .42
Packet Based Protocol . . .42
Bandwidth and Clocking. .43
Address Space . ..43
PCI Express Transactions . .43
PCI Express Transaction Model. .43
Error Handling and Robustness of Data Transfer .44
Quality of Service (QoS), Traffic Classes (TCs) and Virtual Channels (VCs) . .44
Flow Control.. .45
MSI Style Interrupt Handling Similar to PCI-X . .45
Power Management ..45
Hot Plug Support.. . 46
PCI Compatible Software Model.. . .46
Mechanical Form Factors. ..47
PCI-like Peripheral Card and Connector . .47
Mini PCI Express Form Factor. ..47
Mechanical Form Factors Pending Release.. ..47
NEWCARD Form Factor. . .47
Server IO Module (SIOM) Form Factor. . .47
PCI Express Topology . ..48
Enumerating the System. ..50
PCI Express System Block Diagram. . .51
Low Cost PCI Express Chipset . ..51
High-End Server System. .53
PCI Express Specifications. .. 54
Chapter 2: Architecture Overview
Introduction to PCI Express Transactions. .. 55
PCI Express Transaction Protocol ..57
Non-Posted Read Transactions. . . 58
Non-Posted Read Transaction for Locked Requests . . 59
Non-Posted Write Transactions. ..61
Posted Memory Write Transactions. .. 62
Posted Message Transactions. ..63
Some Examples of Transactions ..64
Memory Read Originated by CPU, Targeting an Endpoint ..64
Memory Read Originated by Endpoint, Targeting System Memory ..66
IO Write Initiated by CPU, Targeting an Endpoint. . .67
Memory Write Transaction Originated by CPU and Targeting an Endpoint ..68
PCI Express Device Layers . .. 69
Overview. .69
Transmit Portion of Device Layers. ..71
Receive Portion of Device Layers. ..71
Device Layers and their Associated Packets ..71
Transaction Layer Packets (TLPs) . ..71
TLP Packet Assembly. ..72
TLP Packet Disassembly. ..73
Data Link Layer Packets (DLLPs) ..74
DLLP Assembly. ..75
DLLP Disassembly ..76
Physical Layer Packets (PLPs) . ..77
Function of Each PCI Express Device Layer. .78
Device Core / Software Layer . ..78
Transmit Side. ..78
Receive Side . .78
Transaction Layer . ..79
Transmit Side. ..80
Receiver Side . . . 81
Flow Control.. . . 81
Quality of Service (QoS) . . 82
Traffic Classes (TCs) and Virtual Channels (VCs). . . 84
Port Arbitration and VC Arbitration ..85
Transaction Ordering . ..87
Power Management . ..87
Configuration Registers. . . 87
Data Link Layer.. ..87
Transmit Side . . 88
Receive Side. ..89
Data Link Layer Contribution to TLPs and DLLPs ..89
Non-Posted Transaction Showing ACK-NAK Protocol .. 90
Posted Transaction Showing ACK-NAK Protocol . ..92
Other Functions of the Data Link Layer. ..92
Physical Layer . ..93
Transmit Side ..93
Receive Side. ..93
Link Training and Initialization ..94
Link Power Management . ..95
Reset.. ..95
Electrical Physical Layer. ..96
Example of a Non-Posted Memory Read Transaction . . 96
Memory Read Request Phase. ..97
Completion with Data Phase . ..99
Hot Plug . .101
PCI Express Performance and Data Transfer Efficiency .101

Part Two: Transaction Protocol

Chapter 3: Address Spaces & Transaction Routing
Introduction. .106
Receivers Check For Three Types of Link Traffic .107
Multi-port Devices Assume the Routing Burden. .107
Endpoints Have Limited Routing Responsibilities. .107
System Routing Strategy Is Programmed .108
Two Types of Local Link Traffic. .108
Ordered Sets . 108
Data Link Layer Packets (DLLPs).. ..111
Transaction Layer Packet Routing Basics. .113
TLPs Used to Access Four Address Spaces. ..113
Split Transaction Protocol Is Used. .114
Split Transactions: Better Performance, More Overhead ..114
Write Posting: Sometimes a Completion Isn’t Needed . ..115
Three Methods of TLP Routing. .117
PCI Express Routing Is Compatible with PCI ..117
PCI Express Adds Implicit Routing for Messages. . . 118
Why Were Messages Added to PCI Express Protocol? .118
How Implicit Routing Helps with Messages.. .118
Header Fields Define Packet Format and Routing .119
Using TLP Header Information: Overview. . . 120
General ..120
Header Type/Format Field Encodings .. 120
Applying Routing Mechanisms .. 121
Address Routing . .122
Memory and IO Address Maps. .122
Key TLP Header Fields in Address Routing .123
TLPs with 3DW, 32-Bit Address. .123
TLPs With 4DW, 64-Bit Address . .124
An Endpoint Checks an Address-Routed TLP. .125
A Switch Receives an Address Routed TLP: Two Checks. .125
General. . . 125
Other Notes About Switch Address-Routing. .127
ID Routing. .127
ID Bus Number, Device Number, Function Number Limits ..127
Key TLP Header Fields in ID Routing. . 128
3DW TLP, ID Routing. .128
4DW TLP, ID Routing. .129
An Endpoint Checks an ID-Routed TLP . .130
A Switch Receives an ID-Routed TLP: Two Checks. ..130
Other Notes About Switch ID Routing. .130
Implicit Routing . .131
Only Messages May Use Implicit Routing.. .132
Messages May Also Use Address or ID Routing. .132
Routing Sub-Field in Header Indicates Routing Method .132
Key TLP Header Fields in Implicit Routing .132
Message Type Field Summary. .133
An Endpoint Checks a TLP Routed Implicitly. .134
A Switch Receives a TLP Routed Implicitly .134
Plug-And-Play Configuration of Routing Options .. 135
Routing Configuration Is PCI-Compatible .135
Two Configuration Space Header Formats: Type 0, Type 1 .135
Routing Registers Are Located in Configuration Header .135
Base Address Registers (BARs): Type 0, 1 Headers. .136
General . ..136
BAR Setup Example One: 1MB, Prefetchable Memory Request. .138
BAR Setup Example Two: 64-Bit, 64MB Memory Request. ..140
BAR Setup Example Three: 256-Byte IO Request . .142
Base/Limit Registers, Type 1 Header Only ..144
General . ..144
Prefetchable Memory Base/Limit Registers ..144
Non-Prefetchable Memory Base/Limit Registers. .146

IO Base/Limit Registers. . . 148
Bus Number Registers, Type 1 Header Only. .150
Primary Bus Number. .151
Secondary Bus Number. .151
Subordinate Bus Number. .151
A Switch Is a Two-Level Bridge Structure. . . 151

Chapter 4: Packet-Based Transactions

Introduction to the Packet-Based Protocol. .154
Why Use A Packet-Based Transaction Protocol . .154
Packet Formats Are Well Defined. .154
Framing Symbols Indicate Packet Boundaries. .156
CRC Protects Entire Packet . .156
Transaction Layer Packets . .156
TLPs Are Assembled And Disassembled. .157
Device Core Requests Access to Four Spaces .159
TLP Transaction Variants Defined ..160
TLP Structure. ..161
Generic TLP Header Format . ..161
Generic Header Field Summary. .162
Header Type/Format Field Encodings .165
The Digest and ECRC Field. .166
ECRC Generation and Checking . .166
Who Can Check ECRC?.. ..167
Using Byte Enables . .167
Byte Enable Rules . ..167
Transaction Descriptor Fields .169
Transaction ID.. .169
Traffic Class. .169
Transaction Attributes . .169
Additional Rules For TLPs With Data Payloads. .170
Building Transactions: TLP Requests & Completions. ..171
IO Requests . .171
IO Request Header Format . 172
Definitions Of IO Request Header Fields .173
Memory Requests . .174
Description of 3DW And 4DW Memory Request Header Fields. ..176
Memory Request Notes . ..179
Configuration Requests . .179
Definitions Of Configuration Request Header Fields. .181
Configuration Request Notes .183
Completions. .183
Definitions Of Completion Header Fields . . 185
Summary of Completion Status Codes: ..187
Calculating The Lower Address Field (Byte 11, bits 7:0): ..187
Using The Byte Count Modified Bit. .188
Data Returned For Read Requests: .188
Receiver Completion Handling Rules:. .189
Message Requests. .190
Definitions Of Message Request Header Fields. .191
Message Notes: .193
INTx Interrupt Signaling. .193
Power Management Messages .194
Error Messages. . 195
Unlock Message. .196
Slot Power Limit Message. .196
Hot Plug Signaling Message ..197
Data Link Layer Packets .198
Types Of DLLPs . .199
DLLPs Are Local Traffic. .199
Receiver handling of DLLPs. .199
Sending A Data Link Layer Packet. .200
Fixed DLLP Packet Size: 8 Bytes. ..201
DLLP Packet Types. .. 201
Ack Or Nak DLLP Packet Format .202
Definitions Of Ack Or Nak DLLP Fields.. .203
Power Management DLLP Packet Format. .204
Definitions Of Power Management DLLP Fields .204
Flow Control Packet Format . .205
Definitions Of Flow Control DLLP Fields .206
Vendor Specific DLLP Format . .207
Definitions Of Vendor Specific DLLP Fields. .. 207

Chapter 5: ACK/NAK Protocol

Reliable Transport of TLPs Across Each Link. .210
Elements of the ACK/NAK Protocol . .. 212
Transmitter Elements of the ACK/NAK Protocol. ..213
Replay Buffer. .213
NEXT_TRANSMIT_SEQ Counter. .. 213
LCRC Generator.. . . 213
REPLAY_NUM Count . ..213
REPLAY_TIMER Count. .214
ACKD_SEQ Count. ..214
DLLP CRC Check . .. 214
Receiver Elements of the ACK/NAK Protocol. ..216
Receive Buffer. .216
LCRC Error Check. ..216
NEXT_RCV_SEQ Count. ..216
Sequence Number Check. ..216
NAK_SCHEDULED Flag ..217
ACKNAK_LATENCY_TIMER ..217
ACK/NAK DLLP Generator . ..217
ACK/NAK DLLP Format . ..219
ACK/NAK Protocol Details. . . 220
Transmitter Protocol Details . . 220
Sequence Number. ..220
32-Bit LCRC . . . 221
Replay (Retry) Buffer. .221
General. .. 221
Replay Buffer Sizing. .. 221
Transmitter's Response to an ACK DLLP. . . 222
General. . . 222
Purging the Replay Buffer. .222
Examples of Transmitter ACK DLLP Processing ..222
Example 1. .. 222
Example 2. . . 223
Transmitter's Response to a NAK DLLP. . . 224
TLP Replay. .225
Efficient TLP Replay. .225
Example of Transmitter NAK DLLP Processing. .. 225
Repeated Replay of TLPs. ..226
What Happens After the Replay Number Rollover? .227
Transmitter's Replay Timer. .227
REPLAY_TIMER Equation. .. 227
REPLAY_TIMER Summary Table .. 228
Transmitter DLLP Handling . ..229
Receiver Protocol Details . ..230
TLP Received at Physical Layer. .230
Received TLP Error Check .230
Next Received TLP's Sequence Number. .230
Receiver Schedules An ACK DLLP.. ..231
Example of Receiver ACK Scheduling . ..232
NAK Scheduled Flag. .233
Receiver Schedules a NAK. .233
Receiver Sequence Number Check .234
Receiver Preserves TLP Ordering . .235
Example of Receiver NAK Scheduling. .236
Receivers ACKNAK_LATENCY_TIMER ..237
ACKNAK_LATENCY_TIMER Equation. ..238
ACKNAK_LATENCY_TIMER Summary Table. .238
Error Situations Reliably Handled by ACK/NAK Protocol. .239
ACK/NAK Protocol Summary. . . 241
Transmitter Side . .. 241
Non-Error Case (ACK DLLP Management). . . 241
Error Case (NAK DLLP Management). .. 242
Receiver Side. ..242
Non-Error Case . .. 242
Error Case .243
Recommended Priority To Schedule Packets. .. 244
Some More Examples . .244
Lost TLP.. . . 244
Lost ACK DLLP or ACK DLLP with CRC Error. ..245
Lost ACK DLLP followed by NAK DLLP.. .246
Switch Cut-Through Mode . . . 248
Without Cut-Through Mode . 248
Background.. . 248
Possible Solution. .. 248
Switch Cut-Through Mode. . . 249
Background. ...249
Example That Demonstrates Switch Cut-Through Feature .. 249

Chapter 6: QoS/TCs/VCs and Arbitration

Quality of Service. .252
Isochronous Transaction Support.. ..253
Synchronous Versus Isochronous Transactions. .253
Isochronous Transaction Management .255
Differentiated Services . ..255
Perspective on QOS/TC/VC and Arbitration. .. 255
Traffic Classes and Virtual Channels . .256
VC Assignment and TC Mapping . . 258
Determining the Number of VCs to be Used . . 258
Assigning VC Numbers (IDs) . . 260
Assigning TCs to each VC - TC/VC Mapping .262
Arbitration. ..263
Virtual Channel Arbitration . . 264
Strict Priority VC Arbitration. ..265
Low- and High-Priority VC Arbitration. .267
Hardware Fixed Arbitration Scheme. ...269
Weighted Round Robin Arbitration Scheme. . 269
Round Robin Arbitration (Equal or Weighted) for All VCs. ..270
Loading the Virtual Channel Arbitration Table. ..270
VC Arbitration within Multiple Function Endpoints. .273
Port Arbitration. . . 274
The Port Arbitration Mechanisms. . . 277
Non-Configurable Hardware-Fixed Arbitration . . 278
Weighted Round Robin Arbitration. ..279
Time-Based, Weighted Round Robin Arbitration. ..279
Loading the Port Arbitration Tables. .280
Switch Arbitration Example. . . 282
Chapter 7: Flow Control
Flow Control Concept . . 286
Flow Control Buffers. .288
VC Flow Control Buffer Organization. . . 288
Flow Control Credits. .289
Maximum Flow Control Buffer Size . . . 290
Introduction to the Flow Control Mechanism . . 290
The Flow Control Elements . . 290
Transmitter Elements . ..291
Receiver Elements. ..291
Flow Control Packets. .293
Operation of the Flow Control Model - An Example . . 294
Stage 1 - Flow Control Following Initialization. .294
Stage 2 — Flow Control Buffer Fills Up. .298
Stage 3 — The Credit Limit count Rolls Over. ...299
Stage 4 — FC Buffer Overflow Error Check .300
Infinite Flow Control Advertisement . .301
Who Advertises Infinite Flow Control Credits?. .301
Special Use for Infinite Credit Advertisements. . . 302
Header and Data Advertisements May Conflict. ..302
The Minimum Flow Control Advertisement. .303
Flow Control Initialization. .304
The FC Initialization Sequence.. .305
FC Init1 Packets Advertise Flow Control Credits Available. .305
FC Init2 Packets Confirm Successful FC Initialization. ..307
Rate of FC_INIT1 and FC_INIT2 Transmission . . 308
Violations of the Flow Control Initialization Protocol .308
Flow Control Updates Following FC_INIT. . . 308
FC_Update DLLP Format and Content. ...309
Flow Control Update Frequency . .310
Immediate Notification of Credits Allocated ..311
Maximum Latency Between Update Flow Control DLLPs. ..311
Calculating Update Frequency Based on Payload Size and Link Width . . 311
Error Detection Timer - A Pseudo Requirement . .312
Chapter 8: Transaction Ordering
Introduction. . .316
Producer/Consumer Model . .317
Native PCI Express Ordering Rules .318
Producer/Consumer Model with Native Devices. ..318
Relaxed Ordering. .319
RO Effects on Memory Writes and Messages ..319
RO Effects on Memory Read Transactions. . . 320
Summary of Strong Ordering Rules. . . 321
Modified Ordering Rules Improve Performance .322
Strong Ordering Can Result in Transaction Blocking . . 322
The Problem. .323
The Weakly Ordered Solution . 324
Order Management Accomplished with VC Buffers ..324
Summary of Modified Ordering Rules. . . 325
Support for PCI Buses and Deadlock Avoidance. . . 326

Chapter 9: Interrupts

Two Methods of Interrupt Delivery. .330
Message Signaled Interrupts . .331
The MSI Capability Register Set . .332
Capability ID. .332
Pointer To Next New Capability .333
Message Control Register. .333
Message Address Register. .335
Message Data Register. ..335
Basics of MSI Configuration. .336
Basics of Generating an MSI Interrupt Request . .338
Memory Write Transaction (MSI) .338
Multiple Messages. .339
Memory Synchronization When Interrupt Handler Entered. . .340
The Problem. .340
Solving the Problem ..341
Interrupt Latency . .341
MSI Results In ECRC Error. . 341
Some Rules, Recommendations, etc. ..341
Legacy PCI Interrupt Delivery . . 342
Background - PCI Interrupt Signaling ..342
Device INTx# Pins ..342
Determining if a Function Uses INTx# Pins ..343
Interrupt Routing ..344
Associating the INTx# Line to an IRQ Number. . 345
INTx# Signaling ..345
Interrupt Disable. .346
Interrupt Status. . 346
Virtual INTx Signaling ..347
Virtual INTx Wire Delivery ..348
Collapsing INTx Signals within a Bridge. .349
INTx Message Format. . 351
Devices May Support Both MSI and Legacy Interrupts .352
Special Consideration for Base System Peripherals .353
Example System . .353

Chapter 10: Error Detection and Handling

Background. .356
Introduction to PCI Express Error Management. .356
PCI Express Error Checking Mechanisms. . .356
Transaction Layer Errors . 358
Data Link Layer Errors . .358
Physical Layer Errors . . 358
Error Reporting Mechanisms. . 359
Error Handling Mechanisms. . . 360
Sources of PCI Express Errors. .361
ECRC Generation and Checking ..361
Data Poisoning (Optional). .362
TC to VC Mapping Errors ..363
Link Flow Control-Related Errors. .363
Malformed Transaction Layer Packet (TLP). . 364
Split Transaction Errors . .365
Unsupported Request . .365
Completer Abort. .366
Unexpected Completion ..367
Completion Time-out. .367
Error Classifications . . . 368
Correctable Errors. . 369
Uncorrectable Non-Fatal Errors. .369
Uncorrectable Fatal Errors ..369
How Errors are Reported . . . 370

Error Messages . .370
Completion Status. . . 371
Baseline Error Detection and Handling. .372
PCI-Compatible Error Reporting Mechanisms . .372
Configuration Command and Status Registers. .373
PCI Express Baseline Error Handling. . . 375
Enabling/Disabling Error Reporting. . 376
Enabling Error Reporting - Device Control Register. ..377
Error Status - Device Status Register . 378
Link Errors ..379
Root's Response to Error Message . . . 381
Advanced Error Reporting Mechanisms .382
ECRC Generation and Checking .383
Handling Sticky Bits . .383
Advanced Correctable Error Handling . .384
Advanced Correctable Error Status .385
Advanced Correctable Error Reporting .385
Advanced Uncorrectable Error Handling . . 386
Advanced Uncorrectable Error Status. .387
Selecting the Severity of Each Uncorrectable Error. .388
Uncorrectable Error Reporting . . . 388
Error Logging . . . 389
Root Complex Error Tracking and Reporting .390
Root Complex Error Status Registers .390
Advanced Source ID Register . .391
Root Error Command Register . . 392
Reporting Errors to the Host System .392
Summary of Error Logging and Reporting . .392

Part Three: The Physical Layer

Chapter 11: Physical Layer Logic
Physical Layer Overview. .. 397
Disclaimer. .400
Transmit Logic Overview . .400
Receive Logic Overview . .. 402
Physical Layer Link Active State Power Management .. 403
Link Training and Initialization. ..403
Transmit Logic Details. .. 403
Tx Buffer. .404
Multiplexer (Mux) and Mux Control Logic . .. 404
General . ..404
Definition of Characters and Symbols. .. 405
Byte Striping (Optional). .408
Packet Format Rules. .411
General Packet Format Rules. .411
x1 Packet Format Example. . 412
x4 Packet Format Rules. .412
x4 Packet Format Example. .. 412
x8, x12, x16 or x32 Packet Format Rules. .413
x8 Packet Format Example. .415
Scrambler.. ..416
Purpose of Scrambling Outbound Transmission. .416
Scrambler Algorithm. .416
Some Scrambler implementation rules:. ..417
Disabling Scrambling . .418
8b/10b Encoding ..419
General . ..419
Purpose of Encoding a Character Stream ..419
Properties of 10-bit (10b) Symbols.. . 421
Preparing 8-bit Character Notation. . . 422
Disparity. ..423
Definition. .423
Two Categories of 8-bit Characters. . 423
CRD (Current Running Disparity). .423
8b/10b Encoding Procedure .424
Example Encodings. .424
Example Transmission. .425
The Lookup Tables . . . 427
Control Character Encoding .430
Ordered-Sets. .433
General. .433
TS1 and TS2 Ordered-Sets. .434
SKIP Ordered-Set. .434
Electrical Idle Ordered-Set . .434
FTS Ordered-Set.. .434
Parallel-to-Serial Converter (Serializer). .434
Differential Transmit Driver.. .435
Transmit (Tx) Clock . .435
Other Miscellaneous Transmit Logic Topics . . 436
Logical Idle Sequence. .436
Inserting Clock Compensation Zones. . .436
Background. .436
SKIP Ordered-Set Insertion Rules. .437
Receive Logic Details . .437
Differential Receiver. .439
Rx Clock Recovery. .440
General . . .440
Achieving Bit Lock .440
Losing Bit Lock. . 441
Regaining Bit Lock. .441
Serial-to-Parallel converter (Deserializer) . .441
Symbol Boundary Sensing (Symbol Lock). .441
Receiver Clock Compensation Logic ..442
Background.. . . 442
The Elastic Buffer's Role in the Receiver. . . 442
Lane-to-Lane De-Skew. . . 444
Not a Problem on a Single-Lane Link. ..444
Flight Time Varies from Lane-to-Lane .444
If Lane Data Is Not Aligned, Byte Unstriping Wouldn't Work .444
TS1/TS2 or FTS Ordered-Sets Used to De-Skew Link ..444
De-Skew During Link Training, Retraining and L0s Exit.. . 445
Lane-to-Lane De-Skew Capability of Receiver. . .445
8b/10b Decoder. ..446
General . . .446
Disparity Calculator . .446
Code Violation and Disparity Error Detection. .446
General. . . 446
Code Violations. .446
Disparity Errors.. ..447
De-Scrambler . . 448
Some De-Scrambler Implementation Rules: ..448
Disabling De-Scrambling. .449
Byte Un-Striping.. . 449
Filter and Packet Alignment Check. .450
Receive Buffer (Rx Buffer) . .450
Physical Layer Error Handling . .450
Response of Data Link Layer to 'Receiver Error' Indication. .451
Chapter 12: Electrical Physical Layer
Electrical Physical Layer Overview .453
High Speed Electrical Signaling .455
Clock Requirements. . 456
General . ..456
Spread Spectrum Clocking (SSC) . .456
Impedance and Termination . 456
Transmitter Impedance Requirements . .457
Receiver Impedance Requirements. ..457
DC Common Mode Voltages. .457
Transmitter DC Common Mode Voltage .457
Receiver DC Common Mode Voltage. .457
ESD and Short Circuit Requirements. .458
Receiver Detection . .459
General . . 459
With a Receiver Attached . .459
Without a Receiver Attached .459
Procedure To Detect Presence or Absence of Receiver . .459
Differential Drivers and Receivers .461
Advantages of Differential Signaling .461
Differential Voltages. . 461
Differential Voltage Notation. .462
General . . 462
Differential Peak Voltage. . . 462
Differential Peak-to-Peak Voltage. .462
Common Mode Voltage ..462
Electrical Idle . ..464
Transmitter Responsibility . .464
Receiver Responsibility. .465
Power Consumed When Link Is in Electrical Idle State . 465
Electrical Idle Exit . . 465
Transmission Line Loss on Link ..465
AC Coupling. ..466
De-Emphasis (or Pre-Emphasis) . . 466
What is De-Emphasis? .466
What is the Problem Addressed By De-emphasis? .467
Solution. . 468
Beacon Signaling . .469
General . ..469
Properties of the Beacon Signal .469
LVDS Eye Diagram ..470
Jitter, Noise, and Signal Attenuation .470
The Eye Test ..470
Optimal Eye .471
Jitter Widens or Narrows the Eye Sideways. . 471
Noise and Signal Attenuation Heighten the Eye ..472
Transmitter Driver Characteristics ..477
General.. .. 477
Transmit Driver Compliance Test and Measurement Load . . 479
Input Receiver Characteristics. . . 480
Electrical Physical Layer State in Power States. . . 481
Chapter 13: System Reset
Two Categories of System Reset. .487
Fundamental Reset. . 488
Methods of Signaling Fundamental Reset . .489
PERST# Type Fundamental Reset Generation. .489
Autonomous Method of Fundamental Reset Generation .489
In-Band Reset or Hot Reset. .491
Response to Receiving a Hot Reset Command . . 491
Switches Generate Hot Reset on Their Downstream Ports . . 492
Bridges Forward Hot Reset to the Secondary Bus . .492
How Does Software Tell a Device (e.g. Switch or Root Complex) to Generate Hot Reset? ..492
Reset Exit. .496
Link Wakeup from L2 Low Power State ..497
Device Signals Wakeup. . . 497
Power Management Software Generates Wakeup Event. .497

Chapter 14: Link Initialization & Training

Link Initialization and Training Overview. 500
General. .500
Ordered-Sets Used During Link Training and Initialization .504
TS1 and TS2 Ordered-Sets . .505
Electrical Idle Ordered-Set ..507
FTS Ordered-Set. .507
SKIP Ordered-Set. .508
Link Training and Status State Machine (LTSSM) ..508
General. . . 508
Overview of LTSSM States . .511
Detailed Description of LTSSM States. .513
Detect State.. .513
Detect.Quiet SubState. .513
Detect.Active SubState ..514
Polling State ..515
Introduction. ..515
Polling.Active SubState. ..516
Polling.Configuration SubState . .517
Polling.Compliance SubState ..518
Polling.Speed SubState. ..518
Configuration State ..519
General . ..519
Configuration.RcvrCfg SubState . . 521
Configuration.Idle SubState. .522
Designing Devices with Links that can be Merged .522
General. .522
Four-x2 Configuration .523
Two-x4 Configuration. .523
Examples That Demonstrate Configuration.RcvrCfg Function. .524
RcvrCfg Example 1 . . . 524
Link Number Negotiation. .525
Lane Number Negotiation . . 526
Confirmation of Link Number and Lane Number Negotiated . .526
RcvrCfg Example 2 . .527
Link Number Negotiation: .527
Lane Number Negotiation .528
Confirmation of Link Number and Lane Number Negotiated .529
RcvrCfg Example 3 . .530
Link Number Negotiation. .530
Lane Number Negotiation ..531
Confirmation of Link Number and Lane Number Negotiated .531
Recovery State . .532
Reasons that a Device Enters the Recovery State. .533
Initiating the Recovery Process ..533
Recovery.RcvrLock SubState .533
Recovery.RcvrCfg SubState. .534
Recovery.Idle SubState. .535
L0 State ..537
L0s State. .538
L0s Transmitter State Machine .538
Tx_L0s.Entry SubState .538
Tx_L0s.Idle SubState . .538
Tx_L0s.FTS SubState ..539
L0s Receiver State Machine . .540
Rx_L0s.Entry SubState . .540
Rx_L0s.Idle SubState ..540
Rx_L0s.FTS SubState ..540
L1 State . ..541
L1.Entry SubState. . .541
L1.Idle SubState.. .542
L2 State . ..543
L2.Idle SubState. ..543
L1.TransmitWake SubState . ..543
Hot Reset State.. .544
Disable State ..545
Loopback State ..547
Loopback.Entry SubState ..547
Loopback.Active SubState. . .548
Loopback.Exit SubState ..548
LTSSM Related Configuration Registers ..549
Link Capability Register ..549
Maximum Link Speed[3:0] ..549
Maximum Link Width[9:4]. .550
Link Status Register. .551
Link Speed[3:0]:. ..551
Negotiate Link Width[9:4]. ..551
Training Error[10] ..551
Link Training[11]. ..551
Link Control Register . .552
Link Disable. . 552
Retrain Link. . 552
Extended Synch. ..552

Part Four: Power-Related Topics

Chapter 15: Power Budgeting
Introduction to Power Budgeting ..557
The Power Budgeting Elements ..558
Slot Power Limit Control. .562
Expansion Port Delivers Slot Power Limit. .562
Expansion Device Limits Power Consumption ..564
The Power Budget Capabilities Register Set. .564

Chapter 16: Power Management

Introduction. .568
Primer on Configuration Software .569
Basics of PCI PM . .569
OnNow Design Initiative Scheme Defines Overall PM . ..571
Goals. .572
System PM States . .572
Device PM States. .573
Definition of Device Context ..574
General ..574
PM Event (PME) Context . .575
Device Class-Specific PM Specifications . . . 576
Default Device Class Specification. .576
Device Class-Specific PM Specifications. .576
Power Management Policy Owner . .577
General ..577
In Windows OS Environment ..577
PCI Express Power Management vs. ACPI. .577
PCI Express Bus Driver Accesses PCI Express Configuration and PM Registers ..577
ACPI Driver Controls Non-Standard Embedded Devices ..577
Some Example Scenarios ..579
Scenario-OS Wishes To Power Down PCI Express Devices. . . 580
Scenario-Restore All Functions To Powered Up State .582
Scenario-Setup a Function-Specific System WakeUp Event. .583
Function Power Management ..585
The PM Capability Register Set .585
Device PM States.. .586
D0 State—Full On . .586
Mandatory. . 586
D0 Uninitialized. .586
D0 Active ..587
D1 State-Light Sleep. .587
D2 State-Deep Sleep ..589
D3-Full Off . .590
D3Hot State.. ..591
D3Cold State. .592
Function PM State Transitions . .593
Detailed Description of PCI-PM Registers. .596
PM Capabilities (PMC) Register ..597
PM Control/Status (PMCSR) Register ..599
Data Register. .603
Determining Presence of the Data Register. . . 604
Operation of the Data Register . . 604
Multi-Function Devices . .. 604
Virtual PCI-to-PCI Bridge Power Data. . . 604
Introduction to Link Power Management ..606
Link Active State Power Management. .. 608
L0s State. .611
Entry into L0s. . .611
Entry into L0s Triggered by Link Idle Time . ..611
Flow Control Credits Must be Delivered. ..612
Transmitter Initiates Entry to L0s ..612
Exit from L0s State. ..613
Transmitter Initiates L0s Exit. .613
Actions Taken by Switches that Receive L0s Exit. .613
L1 ASPM State ..614
Downstream Component Decides to Enter L1 ASPM ..615
Negotiation Required to Enter L1 ASPM. .616
Scenario 1: Both Ports Ready to Enter L1 ASPM State . ..616
Downstream Component Issues Request to Enter L1 State. .616
Upstream Component Requirements to Enter L1 ASPM. ..617
Upstream Component Acknowledges Request to Enter L1 ..617
Downstream Component Detects Acknowledgement ..617
Upstream Component Receives Electrical Idle. ..617
Scenario 2: Upstream Component Transmits TLP Just Prior to Receiving L1 Request ..618
TLP Must Be Accepted by Downstream Component ..619
Upstream Component Receives Request to Enter L1 ..619
Exit from L1 ASPM State . .. 621
L1 ASPM Exit Signaling. .. 621
Switch Receives L1 Exit from Downstream Component. .622
Switch Receives L1 Exit from Upstream Component ..623
ASPM Exit Latency. ..624
Reporting a Valid ASPM Exit Latency . .625
L0s Exit Latency Update ..625
L1 Exit Latency Update . . .626
Calculating Latency Between Endpoint to Root Complex . ..626
Software Initiated Link Power Management .. 629
D1/D2/D3 Hot and the L1 State ..629
Entering the L1 State . .630
Exiting the L1 State. .632
Upstream Component Initiates L1 to L0 Transition. ..632
Downstream Component Initiates L1 to L0 Transition .633
The L1 Exit Protocol. ..633
L2/L3 Ready — Removing Power from the Link. ..633
L2/L3 Ready Handshake Sequence ..634
Exiting the L2/L3 Ready State - Clock and Power Removed. ..637
The L2 State.. ..637
The L3 State.. .637
Link Wake Protocol and PME Generation ..638
The PME Message. .639
The PME Sequence.. . . 640
PME Message Back Pressure Deadlock Avoidance . . . 640
Background.. .. 641
The Problem. . .641
The Solution. .. 641
The PME Context . 6.642
Waking Non-Communicating Links. .. 642
Beacon.. . .643
WAKE# (AUX Power). .. 643
Auxiliary Power . .. 645

Part Five: Optional Topics

Chapter 17: Hot Plug

Background ..650
Hot Plug in the PCI Express Environment. . .651
Surprise Removal Notification.. .652
Differences between PCI and PCI Express Hot Plug. ..652
Elements Required to Support Hot Plug ..655
Software Elements ..655
Hardware Elements. .656
Card Removal and Insertion Procedures . 6.658
On and Off States ..658
Definition of On and Off ..658
Turning Slot Off. .658
Turning Slot On ..659
Card Removal Procedure. . 659
Attention Button Used to Initiate Hot Plug Removal . .659
Hot Plug Removal Request Issued via User Interface. .660
Card Insertion Procedure.. . .661
Card Insertion Initiated by Pressing Attention Button ..661
Card Insertion Initiated by User Interface ..662
Standardized Usage Model . .663
Background. ..663
Standard User Interface .664
Attention Indicator ..664
Power Indicator.. .665
Manually Operated Retention Latch and Sensor ..666
Electromechanical Interlock (optional). ..667
Software User Interface ..667
Attention Button. .667
Slot Numbering Identification ..668
Standard Hot Plug Controller Signaling Interface ..668
The Hot-Plug Controller Programming Interface ..670
Slot Capabilities. ..670
Slot Power Limit Control. ..672
Slot Control ..672
Slot Status and Events Management ..674
Card Slot vs Server IO Module Implementations ..676
Detecting Module and Blade Capabilities ..678
Hot Plug Messages ..678
Attention and Power Indicator Control Messages . .678
Attention Button Pressed Message ..679
Limitations of the Hot Plug Messages ..679
Slot Numbering. .681
Physical Slot ID.. . . 681
Quiescing Card and Driver. .681
General. . . 681
Pausing a Driver (Optional) . .681
Quiescing a Driver That Controls Multiple Devices . . 682
Quiescing a Failed Card. . . 682
The Primitives ..682

Chapter 18: Add-in Cards and Connectors

Introduction. .686
Add-in Connector. .686
Auxiliary Signals. .693
General . ..693
Reference Clock ..694
PERST#. ...695
WAKE# ..696
SMBus. .698
JTAG ..699
PRSNT Pins. .699
Electrical Requirements. ..700
Power Supply Requirements .700
Power Dissipation Limits .701
Add-in Card Interoperability. ..702
Form Factors Under Development. .703
General.. ..703
Server IO Module (SIOM). ..703
Riser Card. ..704
Mini PCI Express Card. ..704
NEWCARD form factor. ..707

Part Six: PCI Express Configuration

Chapter 19: Configuration Overview
Definition of Device and Function. .. 712
Definition of Primary and Secondary Bus ..714
Topology Is Unknown At Startup . . .714
Each Function Implements a Set of Configuration Registers ..715
Introduction. ..715
Function Configuration Space. ..715
PCI-Compatible Space ..715
PCI Express Extended Configuration Space. ..716
Host/PCI Bridge's Configuration Registers. ..716
Configuration Transactions Are Originated by the Processor ..718
Only the Root Complex Can Originate Configuration Transactions ..718
Configuration Transactions Only Move DownStream.. ..718
No Peer-to-Peer Configuration Transactions.. ..718
Configuration Transactions Are Routed Via Bus, Device, and Function Number... 718
How a Function Is Discovered.. ..719
How To Differentiate a PCI-to-PCI Bridge From a Non-Bridge Function. . .719

Chapter 20: Configuration Mechanisms

Introduction. .722
PCI-Compatible Configuration Mechanism. . .723
Background. ...724
PCI-Compatible Configuration Mechanism Description ..724
General ..724
Configuration Address Port. ..725
Bus Compare and Data Port Usage. ..726
Target Bus = 0 ..726
Bus Number < Target Bus Subordinate Bus Number. .727
Single Host/PCI Bridge. .727
Multiple Host/PCI Bridges. ...729
PCI Express Enhanced Configuration Mechanism ..731
Description. . .731
Some Rules. .731
Type 0 Configuration Request .732
Type 1 Configuration Request .733
Example PCI-Compatible Configuration Access . 735
Example Enhanced Configuration Access. ..736
Initial Configuration Accesses ..738
What's Going On During Initialization Time? . .738
Definition of Initialization Period In PCI . .738
Definition of Initialization Period In PCI-X ..739
PCI Express and Initialization Time. .739
Initial Configuration Access Failure Timeout ...739
Delay Prior To Initial Configuration Access to Device ..739
A Device With a Lengthy Self-Initialization Period . ..740
RC Response To CRS Receipt During Run-Time ..740

Chapter 21: PCI Express Enumeration

Introduction. . . 741
Enumerating a System With a Single Root Complex. . 742
Enumerating a System With Multiple Root Complexes . 753
Operational Characteristics of the PCI-Compatible Mechanism. ..754
Operational Characteristics of the Enhanced Configuration Mechanism ..755
The Enumeration Process . ..755
A Multifunction Device Within a Root Complex or a Switch .. 758
A Multifunction Device Within a Root Complex. ..758
A Multifunction Device Within a Switch . .759
An Endpoint Embedded in a Switch or Root Complex. ..761
Memorize Your Identity ..763
General. ..763
Root Complex Bus Number/Device Number Assignment ..764
Initiating Requests Prior To ID Assignment .764
Initiating Completions Prior to ID Assignment .765
Root Complex Register Blocks (RCRBs) . .765
What Problem Does an RCRB Address? . ..765
Additional Information on RCRBs. . .766
Miscellaneous Rules. ..766
A Split Configuration Transaction Requires a Single Completion ..766
An Issue For PCI Express-to-PCI or -PCI-X Bridges. ..767
PCI Special Cycle Transactions. ..767
Chapter 22: PCI Compatible Configuration Registers
Header Type 0 . . . 770
General. ..770
Header Type 0 Registers Compatible With PCI .772
Header Type 0 Registers Incompatible With PCI ..772
Registers Used to Identify Device's Driver. ..773
Vendor ID Register. ..773
Device ID Register . ..773
Revision ID Register. ...773
Class Code Register. . . 774
General . ...774
The Programming Interface Byte . ..774
Detailed Class Code Description. . .775
Subsystem Vendor ID and Subsystem ID Registers. ..776
General . ..776
The Problem Solved by This Register Pair. ...776
Must Contain Valid Data When First Accessed. ..777
Header Type Register. ..777
BIST Register.. ..778
Capabilities Pointer Register. ..779
Configuration Header Space Not Large Enough ..779
Discovering That Capabilities Exist . ..779
What the Capabilities List Looks Like . . . 780
CardBus CIS Pointer Register ..782
Expansion ROM Base Address Register. ..783
Command Register. ..785
Status Register ..788
Cache Line Size Register. .790
Master Latency Timer Register .790
Interrupt Line Register. ..791
Usage In a PCI Function . ..791
Usage In a PCI Express Function. ..791
Interrupt Pin Register. . .792
Usage In a PCI Function ..792
Usage In a PCI Express Function. ..792
Base Address Registers. ..792
Introduction.. ..793
IO Space Usage. ..793
Memory Base Address Register. . . 794
Decoder Width Field. . . 794
Prefetchable Attribute Bit . ...795
Base Address Field . . . 796
IO Base Address Register ..797
Introduction. ..797
IO BAR Description. . . 797
PC-Compatible IO Decoder ...797
Legacy IO Decoders ...798
Finding Block Size and Assigning Address Range. . . 799
How It Works. ...799
A Memory Example . ..799
An IO Example. .. 800
Smallest/Largest Decoder Sizes. ... 800
Smallest/Largest Memory Decoders. . . 800
Smallest/Largest IO Decoders. . . 800
Byte Merging . .801
Bridge Must Discard Unconsumed Prefetched Data .801
Min_Gnt/Max_Lat Registers .802
Header Type 1. .802
General. . . 802
Header Type 1 Registers Compatible With PCI . . 803
Header Type 1 Registers Incompatible With PCI . ... 804
Terminology. . . 805
Bus Number Registers ..805
Introduction. ... 805
Primary Bus Number Register ..806
Secondary Bus Number Register.. .806
Subordinate Bus Number Register. ..807
Bridge Routes ID Addressed Packets Using Bus Number Registers ..807
Vendor ID Register . . . 808
Device ID Register . . . 808
Revision ID Register. . . 808
Class Code Register . ... 808
Header Type Register. . . 808
BIST Register. . . 809
Capabilities Pointer Register. ... 809
Basic Transaction Filtering Mechanism. . . 809
Bridge's Memory, Register Set and Device ROM . . 810
Introduction. ..810
Base Address Registers . . . 811
Expansion ROM Base Address Register. ..811
Bridge's IO Filter. . . 811
Introduction. . . 811
Bridge Doesn’t Support Any IO Space Behind Bridge. ..812
Bridge Supports 64KB IO Space Behind Bridge . . 813
Bridge Supports 4GB IO Space Behind Bridge. . . 817
Bridge's Prefetchable Memory Filter ..819
An Important Note From the Authors .819
In PCI. . 820
In PCI Express . . . 821
Spec References To Prefetchable Memory. . . 822
Characteristics of Prefetchable Memory Devices. ..823
Multiple Reads Yield the Same Data . . . 823
Byte Merging Permitted In the Posted Write Buffer .823
Characteristics of Memory-Mapped IO Devices. . . 823
Read Characteristics. .824
Write Characteristics. . . 824
Determining If Memory Is Prefetchable or Not . .824
Bridge Support For Downstream Prefetchable Memory Is Optional . . 825
Must Support > 4GB Prefetchable Memory On Secondary Side .825
Rules for Bridge Prefetchable Memory Accesses. . . 829
Bridge's Memory-Mapped IO Filter. . 830
Bridge Command Registers. ..832
Introduction. . 832
Bridge Command Register .832
Bridge Control Register .835
Bridge Status Registers. .837
Introduction. .837
Bridge Status Register (Primary Bus). .837
Bridge Secondary Status Register ..840
Bridge Cache Line Size Register ..843
Bridge Latency Timer Registers. . . 843
Bridge Latency Timer Register (Primary Bus). . . 843
Bridge Secondary Latency Timer Register. . . 843
Bridge Interrupt-Related Registers. . . 844
Interrupt Line Register. ..844
Interrupt Pin Register. . . 844
PCI-Compatible Capabilities. .. 845
AGP Capability . . 845
AGP Status Register . ..845
AGP Command Register . . . 846
Vital Product Data (VPD) Capability. . . 848
Introduction.. ..848
It's Not Really Vital ..849
What Is VPD? ..849
Where Is the VPD Really Stored? ..849
VPD On Cards vs. Embedded PCI Devices . . 849
How Is VPD Accessed?. . . 849
Reading VPD Data. .850
Writing VPD Data. . 850
Rules That Apply To Both Read and Writes . . . 850
VPD Data Structure Made Up of Descriptors and Keywords ..851
VPD Read-Only Descriptor (VPD-R) and Keywords. . 853
Is Read-Only Checksum Keyword Mandatory?. . .855
VPD Read/Write Descriptor (VPD-W) and Keywords . 856
Example VPD List. ..857
Introduction To Chassis/Slot Numbering Registers. . 859
Chassis and Slot Number Assignment . ..861
Problem: Adding/Removing Bridge Causes Buses to Be Renumbered. ..861
If Buses Added/Removed, Slot Labels Must Remain Correct.. ..861
Definition of a Chassis . . . 862
Chassis/Slot Numbering Registers. .863
PCI-Compatible Chassis/Slot Numbering Register Set . 863
Express-Specific Slot-Related Registers.. . . 863
Two Examples. . . 866
First Example.. . . 866
Second Example. ..867
Chapter 23: Expansion ROMs
ROM Purpose-Device Can Be Used In Boot Process. .872
ROM Detection. .872
ROM Shadowing Required .875
ROM Content. .875
Multiple Code Images . . . 875
Format of a Code Image. .878
General . ..878
ROM Header Format. ..879
ROM Data Structure Format . ..881
ROM Signature . . . 883
Vendor ID field in ROM data structure . . . 883
Device ID in ROM data structure. . 883
Pointer to Vital Product Data (VPD). . . 884
PCI Data Structure Length. . . 884
PCI Data Structure Revision . .884
Class Code . . . 884
Image Length ..884
Revision Level of Code/Data . . 885
Code Type. . . 885
Indicator Byte. . . 885
Execution of Initialization Code . . 885
Introduction to Open Firmware . . . 888
Introduction. . . 888
Universal Device Driver Format. . . 889
Passing Resource List To Plug-and-Play OS. . . 890
BIOS Calls Bus Enumerators For Different Bus Environments . . 890
BIOS Selects Boot Devices and Finds Drivers For Them . . 891
BIOS Boots Plug-and-Play OS and Passes Pointer To It . ..891
OS Locates and Loads Drivers and Calls Init Code In Each. .. 891

Chapter 24: Express-Specific Configuration Registers

Introduction. . . 894
PCI Express Capability Register Set. . . 896
Introduction. . . 896
Required Registers. .897
General . ..897
PCI Express Capability ID Register ...898
Next Capability Pointer Register.. .898
PCI Express Capabilities Register . . 898
Device Capabilities Register . . 900
Device Control Register ..905
Device Status Register. ... 909
Link Registers (Required). ..912
Link Capabilities Register .912
Link Control Register. ..915
Link Status Register.. ..918
Slot Registers.. . . 920
Introduction. . . 920
Slot Capabilities Register . . . 920
Slot Control Register . .923
Slot Status Register. . .925
Root Port Registers . .. 926
Introduction. . . 926
Root Control Register. . .926
Root Status Register.. . 928
PCI Express Extended Capabilities . . 929
General.. ..929
Advanced Error Reporting Capability. ..930
General . ..930
Detailed Description. . . 930
Virtual Channel Capability. . 939
The VC Register Set's Purpose. .939
Who Must Implement This Register Set?. ..940
Multifunction Upstream Port Restriction. . . 940
The Register Set.. . .940
Detailed Description of VCs. . . 940
Port VC Capability Register 1 ..941
Port VC Capability Register 2 . . 943
Port VC Control Register ..944
Port VC Status Register. ..945
VC Resource Registers. . 946
General. . .946
VC Resource Capability Register ..946
VC Resource Control Register. . 948
VC Resource Status Register. . 950
VC Arbitration Table. ..951
Port Arbitration Tables . ..952
Device Serial Number Capability . 952
Power Budgeting Capability . 954
General . . 954
How It Works. . 955
RCRB ..957
General. ..957
Firmware Gives OS Base Address of Each RCRB .957
Misaligned or Locked Accesses To an RCRB ..957
Extended Capabilities in an RCRB. .957
The RCRB Missing Link. . 958

Appendices

Appendix A: Test, Debug and Verification of PCI Express™ Designs ..961
Appendix B: Markets & Applications for the PCI Express™ Architecture ..989
Appendix C: Implementing Intelligent Adapters and Multi-Host Systems With PCI Express™ Technology ..999
Appendix D: Class Codes. .1019
Appendix E: Locked Transactions Series .1033

Index. .. 1043

Figures

1-1 Comparison of Performance Per Pin for Various Buses ..15
1-2 33 MHz PCI Bus Based Platform ..17
1-3 Typical PCI Burst Memory Read Bus Cycle ..18
1-4 33 MHz PCI Based System Showing Implementation of a PCI-to-PCI Bridge. ..19
1-5 PCI Transaction Model. ..20
1-6 PCI Bus Arbitration. ..22
1-7 PCI Transaction Retry Mechanism. ..23
1-8 PCI Transaction Disconnect Mechanism ...24
1-9 PCI Interrupt Handling.. ..26
1-10 PCI Error Handling Protocol. ..27
1-11 Address Space Mapping. ..28
1-12 PCI Configuration Cycle Generation. ..29
1-13 256 Byte PCI Function Configuration Register Space. ..30
1-14 Latest Generation of PCI Chipsets. ..32
1-15 66 MHz PCI Bus Based Platform. ..33
1-16 66 MHz/133 MHz PCI-X Bus Based Platform.. ..36
1-17 Example PCI-X Burst Memory Read Bus Cycle ..37
1-18 PCI-X Split Transaction Protocol .. ..38
1-19 Hypothetical PCI-X 2.0 Bus Based Platform ..40
1-20 PCI Express Link. ..41
1-21 PCI Express Differential Signal.. ..42
1-22 PCI Express Topology. ..48
1-23 Low Cost PCI Express System. ..52
1-24 Another Low Cost PCI Express System ..53
1-25 PCI Express High-End Server System .. ..54
2-1 Non-Posted Read Transaction Protocol. ..59
2-2 Non-Posted Locked Read Transaction Protocol. ...60
2-3 Non-Posted Write Transaction Protocol.. ...61
2-4 Posted Memory Write Transaction Protocol. ...63
2-5 Posted Message Transaction Protocol.. ...64
2-6 Non-Posted Memory Read Originated by CPU and Targeting an Endpoint . ..65
2-7 Non-Posted Memory Read Originated by Endpoint and Targeting Memory ...67
2-8 IO Write Transaction Originated by CPU, Targeting Legacy Endpoint. ...68
2-9 Memory Write Transaction Originated by CPU, Targeting Endpoint. ...69
2-10 PCI Express Device Layers . ..70
2-11 TLP Origin and Destination. ...72
2-12 TLP Assembly.. ...73
2-13 TLP Disassembly. ...74
2-14 DLLP Origin and Destination. ...75
2-15 DLLP Assembly ...76
2-16 DLLP Disassembly. ...76
2-17 PLP Origin and Destination. ...77
2-18 PLP or Ordered-Set Structure . ...78
2-19 Detailed Block Diagram of PCI Express Device's Layers . ...79
2-20 TLP Structure at the Transaction Layer. ..80
2-21 Flow Control Process.. ...82
2-22 Example Showing QoS Capability of PCI Express . ...83
2-23 TC Numbers and VC Buffers. ...85
2-24 Switch Implements Port Arbitration and VC Arbitration Logic. ...86
2-25 Data Link Layer Replay Mechanism ...88
2-26 TLP and DLLP Structure at the Data Link Layer ...90
2-27 Non-Posted Transaction on Link. ...91
2-28 Posted Transaction on Link. ...92
2-29 TLP and DLLP Structure at the Physical Layer. ...94
2-30 Electrical Physical Layer Showing Differential Transmitter and Receiver ...96
2-31 Memory Read Request Phase.. ...97
2-32 Completion with Data Phase.. ...99
3-1 Multi-Port PCI Express Devices Have Routing Responsibilities . ..106
3-2 PCI Express Link Local Traffic: Ordered Sets ..110
3-3 PCI Express Link Local Traffic: DLLPs.. ..112
3-4 PCI Express Transaction Request And Completion TLPs ..115
3-5 Transaction Layer Packet Generic 3DW And 4DW Headers ..119
3-6 Generic System Memory And IO Address Maps ..122
3-7 3DW TLP Header Address Routing Fields . ..123
3-8 4DW TLP Header Address Routing Fields . ...124
3-9 Endpoint Checks Routing Of An Inbound TLP Using Address Routing. ..125
3-10 Switch Checks Routing Of An Inbound TLP Using Address Routing. ..126
3-11 3DW TLP Header ID Routing Fields.. ..128
3-12 4DW TLP Header ID Routing Fields. ..129
3-13 Switch Checks Routing Of An Inbound TLP Using ID Routing. ..131
3-14 4DW Message TLP Header Implicit Routing Fields. ..133
3-15 PCI Express Devices And Type 0 And Type 1 Header Use. ..136
3-16 PCI Express Configuration Space Type 0 and Type 1 Headers ..137
3-17 32-Bit Prefetchable Memory BAR Set Up. ..139
3-18 64-Bit Prefetchable Memory BAR Set Up. ..141
3-19 IO BAR Set Up. ..143
3-20 6GB, 64-Bit Prefetchable Memory Base/Limit Register Set Up ..145
3-21 2MB, 32-Bit Non-Prefetchable Base/Limit Register Set Up ...147
3-22 IO Base/Limit Register Set Up.. ..149
3-23 Bus Number Registers In A Switch.. ..152
4-1 TLP And DLLP Packets ..155
4-2 PCI Express Layered Protocol And TLP Assembly/Disassembly ..158
4-3 Generic TLP Header Fields ..162
4-4 Using First DW and Last DW Byte Enable Fields ..168
4-5 Transaction Descriptor Fields ..169
4-6 System IO Map. ..171
4-7 3DW IO Request Header Format.. ..172
4-8 3DW And 4DW Memory Request Header Formats ..175
4-9 3DW Configuration Request And Header Format . ..180
4-10 3DW Completion Header Format . ..184
4-11 4DW Message Request Header Format. ..190
4-12 Data Link Layer Sends A DLLP.. ..198
4-13 Generic Data Link Layer Packet Format.. ...200
4-14 Ack Or Nak DLLP Packet Format. ...202
4-15 Power Management DLLP Packet Format. ...204
4-16 Flow Control DLLP Packet Format . ...205
4-17 Vendor Specific DLLP Packet Format.. ..207
5-1 Data Link Layer.. ..210
5-2 Overview of the ACK/NAK Protocol ..211
5-3 Elements of the ACK/NAK Protocol.. ...212
5-4 Transmitter Elements Associated with the ACK/NAK Protocol . ...215
5-5 Receiver Elements Associated with the ACK/NAK Protocol. ..218
5-6 Ack Or Nak DLLP Packet Format. ..219
5-7 Example 1 that Shows Transmitter Behavior with Receipt of an ACK DLLP. ..223
5-8 Example 2 that Shows Transmitter Behavior with Receipt of an ACK DLLP . ...224
5-9 Example that Shows Transmitter Behavior on Receipt of a NAK DLLP. ..226
5-10 Table and Equation to Calculate REPLAY_TIMER Load Value.. ...229
5-11 Example that Shows Receiver Behavior with Receipt of Good TLP ..233
5-12 Example that Shows Receiver Behavior When It Receives Bad TLPs. ..237
5-13 Table to Calculate ACKNAK_LATENCY_TIMER Load Value. ...239
5-14 Lost TLP Handling.. ...245
5-15 Lost ACK DLLP Handling. ...246
5-16 Lost ACK DLLP Handling.. ...247
5-17 Switch Cut-Through Mode Showing Error Handling. ..250
6-1 Example Application of Isochronous Transaction.. ..254
6-2 VC Configuration Registers Mapped in Extended Configuration Address Space. ..257
6-3 The Number of VCs Supported by Device Can Vary. ...259
6-4 Extended VCs Supported Field ...260
6-5 VC Resource Control Register.. ...261
6-6 TC to VC Mapping Example ... ...262
6-7 Conceptual VC Arbitration Example. ..265
6-8 Strict Arbitration Priority....... ..266
6-9 Low Priority Extended VC Count. ..267
6-10 Determining VC Arbitration Capabilities and Selecting the Scheme ...268
6-11 VC Arbitration with Low- and High-Priority Implementations . ...269
6-12 Weighted Round Robin Low-Priority VC Arbitration Table Example. ..270
6-13 VC Arbitration Table Offset and Load VC Arbitration Table Fields ..271
6-14 Loading the VC Arbitration Table Entries ..272
6-15 Example Multi-Function Endpoint Implementation with VC Arbitration ..274
6-16 Port Arbitration Concept ..275
6-17 Port Arbitration Tables Needed for Each VC ..276
6-18 Port Arbitration Buffering . ..277
6-19 Software checks Port Arbitration Capabilities and Selects the Scheme to be Used. ..278
6-20 Maximum Time Slots Register.. ..280
6-21 Format of Port Arbitration Table. ...281
6-22 Example of Port and VC Arbitration within A Switch. ..283
7-1 Location of Flow Control Logic. ...287
7-2 Flow Control Buffer Organization. ..289
7-3 Flow Control Elements. ...292
7-4 Types and Format of Flow Control Packets. ...293
7-5 Flow Control Elements Following Initialization ..295
7-6 Flow Control Elements Following Delivery of First Transaction. ..297
7-7 Flow Control Elements with Flow Control Buffer Filled. ..299
7-8 Flow Control Rollover Problem.. ..300
7-9 Initial State of Example FC Elements. ..304
7-10 INIT1 Flow Control Packet Format and Contents ..305
7-11 Devices Send and Initialize Flow Control Registers. ..306
7-12 Device Confirm that Flow Control Initialization is Completed for a Given Buffer. .307
7-13 Flow Control Update Example . ..309
7-14 Update Flow Control Packet Format and Contents. ..310
8-1 Example of Strongly Ordered Transactions that Results in Temporary Blocking . ...323
9-1 Native PCI Express and Legacy PCI Interrupt Delivery ..331
9-2 64-bit MSI Capability Register Format . ..332
9-3 32-bit MSI Capability Register Set Format ..332
9-4 Message Control Register. ..333
9-5 Device MSI Configuration Process ..337
9-6 Format of Memory Write Transaction for Native-Device MSI Delivery ..339
9-7 Interrupt Pin Register within PCI Configuration Header ..343
9-8 INTx Signal Routing is Platform Specific ..344
9-9 Configuration Command Register - Interrupt Disable Field. ..346
9-10 Configuration Status Register - Interrupt Status Field ..347
9-11 Legacy Devices Use INTx Messages to Virtualize INTA#-INTD# Signal Transitions ..348
9-12 Switch Collapses INTx Message to Achieve Wired-OR Characteristics. ..350
9-13 INTx Message Format and Types. ..351
9-14 PCI Express System with PCI-Based IO Controller Hub ..354
10-1 The Scope of PCI Express Error Checking and Reporting ..357
10-2 Location of PCI Express Error-Related Configuration Registers ..360
10-3 The Error/Poisoned Bit within Packet Headers ..362
10-4 Basic Format of the Error Messages ..370
10-5 Completion Status Field within the Completion Header ..371
10-6 PCI-Compatible Configuration Command Register ..373
10-7 PCI-Compatible Status Register (Error-Related Bits) ..374
10-8 PCI Express Capability Register Set ..376
10-9 Device Control Register Bit Fields Related to Error Handling ..378
10-10 Device Status Register Bit Fields Related to Error Handling ..379
10-11 Link Control Register Allows Retraining of Link ..380
10-12 Link Retraining Status Bits within the Link Status Register ..380
10-13 Root Control Register ..381
10-14 Advanced Error Capability Registers ..382
10-15 The Advanced Error Capability & Control Register ..383
10-16 Advanced Correctable Error Status Register ..385
10-17 Advanced Correctable Error Mask Register ..386
10-18 Advanced Uncorrectable Error Status Register.. ..387
10-19 Advanced Uncorrectable Error Severity Register. ..388
10-20 Advanced Uncorrectable Error Mask Register. ...389
10-21 Root Error Status Register . ..391
10-22 Advanced Source ID Register ...391
10-23 Advanced Root Error Command Register. ..392
10-24 Error Handling Flow Chart. ..393
11-1 Physical Layer. ..398
11-2 Logical and Electrical Sub-Blocks of the Physical Layer. ...399
11-3 Physical Layer Details . ..401
11-4 Physical Layer Transmit Logic Details ..406
11-5 Transmit Logic Multiplexer. ..407
11-6 TLP and DLLP Packet Framing with Start and End Control Characters ..408
11-7 x1 Byte Striping . ..409
11-8 x4 Byte Striping . ..410
11-9 x8, x12, x16, x32 Byte Striping. ..411
11-10 x1 Packet Format. ..413
11-11 x4 Packet Format. ..414
11-12 x8 Packet Format. ..415
11-13 Scrambler.. ..418
11-14 Example of 8-bit Character of 00h Encoded to 10-bit Symbol.. ..420
11-15 Preparing 8-bit Character for Encode . ..422
11-16 8-bit to 10-bit (8b/10b) Encoder. ..425
11-17 Example 8-bit/10-bit Encodings. ..426
11-18 Example 8-bit/10-bit Transmission ..427
11-19 SKIP Ordered-Set. ..437
11-20 Physical Layer Receive Logic Details. ..438
11-21 Receiver Logic's Front End Per Lane . ..439
11-22 Receiver's Link De-Skew Logic. ..445
11-23 8b/10b Decoder per Lane . ..447
11-24 Example of Delayed Disparity Error Detection. ..448
11-25 Example of x8 Byte Un-Striping . ..449
12-1 Electrical Sub-Block of the Physical Layer ..454
12-2 Differential Transmitter/Receiver... ..455
12-3 Receiver DC Common Mode Voltage Requirement. ..458
12-4 Receiver Detection Mechanism.. ..460
12-5 Pictorial Representation of Differential Peak-to-Peak and Differential Peak Voltages.. 463
12-6 Electrical Idle Ordered-Set.. ..464
12-7 Transmission with De-emphasis .. ..467
12-8 Problem of Inter-Symbol Interference . ..468
12-9 Solution is Pre-emphasis.. ..468
12-10 LVDS (Low-Voltage Differential Signal) Transmitter Eye Diagram. ..472
12-11 Transmitter Eye Diagram Jitter Indication. ..473
12-12 Transmitter Eye Diagram Noise/Attenuation Indication ..474
12-13 Screen Capture of a Normal Eye (With no De-emphasis Shown) ..475
12-14 Screen Capture of a Bad Eye Showing Effect of Jitter, Noise and Signal Attenuation (With no De-emphasis Shown) ..476
12-15 Compliance Test/Measurement Load ..479
12-16 Receiver Eye Diagram. ...481
12-17 L0 Full-On Link State . ..482
12-18 L0s Low Power Link State ..483
12-19 L1 Low Power Link State. ..484
12-20 L2 Low Power Link State. ..485
12-21 L3 Link Off State . ..486
13-1 PERST# Generation. ..490
13-2 TS1 Ordered-Set Showing the Hot Reset Bit. ..491
13-3 Secondary Bus Reset Register to Generate Hot Reset. ..493
13-4 Switch Generates Hot Reset on One Downstream Port. ..494
13-5 Switch Generates Hot Reset on All Downstream Ports ..495
14-1 Link Training and Status State Machine Location. ..501
14-2 Example Showing Lane Reversal . ...502
14-3 Example Showing Polarity Inversion ...503
14-4 Five Ordered-Sets Used in the Link Training and Initialization Process . ..504
14-5 Link Training and Status State Machine (LTSSM). ..510
14-6 Detect State Machine . ..515
14-7 Polling State Machine. ..519
14-8 Configuration State Machine . ..520
14-9 Combining Lanes to form Links. ..524
14-10 Example 1 Link Numbering and Lane Numbering. ..527
14-11 Example 2 Link Numbering and Lane Numbering. ..529
14-12 Example 3 Link Numbering and Lane Numbering. ..532
14-13 Recovery State Machine. ..537
14-14 L0s Transmitter State Machine ..539
14-15 L0s Receiver State Machine . ..541
14-16 L1 State Machine. ...542
14-17 L2 State Machine . ..544
14-18 Hot Reset State Machine. ..545
14-19 Disable State Machine . ..546
14-20 Loopback State Machine. ..549
14-21 Link Capabilities Register. ...550
14-22 Link Status Register. ..552
14-23 Link Control Register. ..553
15-1 System Allocated Bit. ..559
15-2 Elements Involved in Power Budget.. ..561
15-3 Slot Power Limit Sequence . ...563
15-4 Power Budget Capability Registers.. ..565
15-5 Power Budget Data Field Format and Definition ..566
16-1 Relationship of OS, Device Drivers, Bus Driver, PCI Express Registers, and ACPI ..578
16-2 Example of OS Powering Down All Functions On PCI Express Links and then the Links Themselves ..581
16-3 Example of OS Restoring a PCI Express Function To Full Power ..583
16-4 OS Prepares a Function To Cause System WakeUp On Device-Specific Event ..584
16-5 PCI Power Management Capability Register Set ..586
16-6 PCI Express Function Power Management State Transitions ..594
16-7 PCI Function's PM Registers. ...596
16-8 Power Management Capabilities (PMC) Register - Read Only ...597
16-9 Power Management Control/Status (PMCSR) Register - R/W ...600
16-10 PM Registers. ...605
16-11 ASPM Link State Transitions ...609
16-12 ASPM Support.. ..610
16-13 Active State PM Control Field. ...611
16-14 Ports that Initiate L1 ASPM Transitions ...615
16-15 Negotiation Sequence Required to Enter L1 Active State PM ..618
16-16 Negotiation Sequence Resulting in Rejection to Enter L1 ASPM State. ..621
16-17 Switch Behavior When Downstream Component Signals L1 Exit. ..623
16-18 Switch Behavior When Upstream Component Signals L1 Exit ..624
16-19 Example of Total L1 Latency. ...627
16-20 Config. Registers Used for ASPM Exit Latency Management and Reporting. ..628
16-21 Devices Transition to L1 When Software Changes their Power Level from D0. ..629
16-22 Software Placing a Device into a D2 State and Subsequent Transition to L1 ..630
16-23 Procedure Used to Transition a Link from the L0 to L1 State ..632
16-24 Link State Transitions Associated with Preparing Devices for Removal of the Reference Clock and Power ..634
16-25 Negotiation for Entering L2/L3 Ready State ..636
16-26 State Transitions from L2/L3 Ready When Power is Removed. ..637
16-27 PME Message Format.. ..639
16-28 WAKE# Signal Implementations. ..644
16-29 Auxiliary Current Enable for Devices Not Supporting PMEs ..645
17-1 PCI Hot Plug Elements. ..653
17-2 PCI Express Hot-Plug Hardware/Software Elements. ..654
17-3 Hot Plug Control Functions within a Switch.. ...669
17-4 PCI Express Configuration Registers Used for Hot-Plug. ...670
17-5 Attention Button and Hot Plug Indicators Present Bits ..671
17-6 Slot Control Register Fields ...673
17-7 Slot Status Register Fields. ...675
17-8 Location of Attention Button and Indicators. ...677
17-9 Hot-Plug Capability Bits for Server IO Modules ..678
17-10 Hot Plug Message Format ..680
18-1 PCI Express ×1 connector.. ...687
18-2 PCI Express Connectors on System Board. ...688
18-3 PERST# Timing During Power Up ..695
18-4 PERST# Timing During Power Management States. ..696
18-5 Example of WAKE# Circuit Protection. ..698
18-6 Presence Detect.. ..700
18-7 PCI Express Riser Card . ...704
18-8 Mini PCI Express Add-in Card Installed in a Mobile Platform. ...705
18-9 Mini PCI Express Add-in Card Photo 1 . ...706
18-10 Mini PCI Express Add-in Card Photo 2 . ..706
19-1 Example System. ...713
19-2 Topology View At Startup. ...714
19-3 4KB Configuration Space per PCI Express Function. ...717
19-4 Header Type Register.. ...719
20-1 A Function's Configuration Space.. ...723
20-2 Configuration Address Port at 0CF8h ...726
20-3 Example System. ..728
20-4 Peer Root Complexes.. ...730
20-5 Type 0 Configuration Read Request Packet Header ...733
20-6 Type 0 Configuration Write Request Packet Header ...733
20-7 Type 1 Configuration Read Request Packet Header . ...734
20-8 Type 1 Configuration Write Request Packet Header ...734
20-9 Example Configuration Access.. ...737
21-1 Topology View At Startup. ...742
21-2 Example System Before Bus Enumeration. ...748
21-3 Example System After Bus Enumeration ...749
21-4 Header Type Register. ..750
21-5 Capability Register . ...750
21-6 Header Type 0 . ...751
21-7 Header Type 1 . ...752
21-8 Peer Root Complexes.. ...757
21-9 Multifunction Bridges in Root Complex. ...759
21-10 First Example of a Multifunction Bridge In a Switch ...760
21-11 Second Example of a Multifunction Bridge In a Switch. ...761
21-12 Embedded Root Endpoint . ...762
21-13 Embedded Switch Endpoint . ...763
21-14 Type 0 Configuration Write Request Packet Header ...764
21-15 RCRB Example . ..766
22-1 Header Type 0 . ...771
22-2 Class Code Register . ...775
22-3 Header Type Register Bit Assignment. ...778
22-4 BIST Register Bit Assignment . ...778
22-5 Status Register . ..780
22-6 General Format of a New Capabilities List Entry. ...782
22-7 Expansion ROM Base Address Register Bit Assignment. ...785
22-8 Command Register. ...785
22-9 PCI Configuration Status Register.. ...788
22-10 32-Bit Memory Base Address Register Bit Assignment ..796
22-11 64-Bit Memory Base Address Register Bit Assignment . ...797
22-12 IO Base Address Register Bit Assignment . ...798
22-13 Header Type 1 . ...803
22-14 IO Base Register . ..815
22-15 IO Limit Register. ...815
22-16 Example of IO Filtering Actions . ..817
22-17 Prefetchable Memory Base Register. ...827
22-18 Prefetchable Memory Limit Register ...828
22-19 Memory-Mapped IO Base Register . ...831
22-20 Memory-Mapped IO Limit Register ..831
22-21 Command Register. ...832
22-22 Bridge Control Register.. ...835
22-23 Primary Interface Status Register. ...838
22-24 Secondary Status Register. ...841
22-25 Format of the AGP Capability Register Set. ..845
22-26 VPD Capability Registers ...851
22-27 Chassis and Slot Number Registers. ...859
22-28 Main Chassis.. ...862
22-29 Expansion Slot Register.. ...864
22-30 Slot Capability Register. ...865
22-31 PCI Express Capabilities Register. ...865
22-32 Chassis Example One. ..867
22-33 Chassis Example Two.. ...869
23-1 Expansion ROM Base Address Register Bit Assignment.. ..873
23-2 Header Type Zero Configuration Register Format.. ..874
23-3 Multiple Code Images Contained In One Device ROM. ..877
23-4 Code Image Format. ..879
23-5 AX Contents On Entry To Initialization Code. ..888
24-1 Function's Configuration Space Layout ...895
24-2 PCI Express Capability Register Set. ...897
24-3 PCI Express Capabilities Register. ...898
24-4 Device Capabilities Register. ...901
24-5 Device Control Register ...906
24-6 Device Status Register.. ..910
24-7 Link Capabilities Register. ...913
24-8 Link Control Register. ..916
24-9 Link Status Register.. ...918
24-10 Slot Capabilities Register ...921
24-11 Slot Control Register. ...923
24-12 Slot Status Register . ...925
24-13 Root Control Register. ...927
24-14 Root Status Register. ...928
24-15 Enhanced Capability Header Register.. ...930
24-16 Advanced Error Reporting Capability Register Set. ...931
24-17 Advanced Error Reporting Enhanced Capability Header ..935
24-18 Advanced Error Capabilities and Control Register ..935
24-19 Advanced Error Correctable Error Mask Register.. ...935
24-20 Advanced Error Correctable Error Status Register.. ...936
24-21 Advanced Error Uncorrectable Error Mask Register . ...936
24-22 Advanced Error Uncorrectable Error Severity Register. ..937
24-23 Advanced Error Uncorrectable Error Status Register ...937
24-24 Advanced Error Root Error Command Register. ...938
24-25 Advanced Error Root Error Status Register ..938
24-26 Advanced Error Correctable and Uncorrectable Error Source ID Registers ..938
24-27 Port and VC Arbitration.. ...939
24-28 Virtual Channel Capability Register Set. ...941
24-29 VC Enhanced Capability Header . ...941
24-30 Port VC Capability Register 1 (Read-Only) . ...942
24-31 Port VC Capability Register 2 (Read-Only) ...943
24-32 Port VC Control Register (Read-Write). ...944
24-33 Port VC Status Register (Read-Only). ...946
24-34 VC Resource Capability Register.. ...947
24-35 VC Resource Control Register (Read-Write). ...948
24-36 VC Resource Status Register (Read-Only) ...951
24-37 Device Serial Number Enhanced Capability Header ...953
24-38 Device Serial Number Register. ..953
24-39 EUI-64 Format . ...954
24-40 Power Budget Register Set . ..955
24-41 Power Budgeting Enhanced Capability Header ...955
24-42 Power Budgeting Data Register... ...956
24-43 Power Budgeting Capability Register.. ...956
24-44 RCRB Example . ..958
A-1 2.5-GT/s PCIe Compliance Pattern. ..963
A-2 5-GT/s PCIe Compliance Pattern. ...964
A-3 Typical Setup for Testing an Add-In Card ..965
A-4 Typical Setup for Testing a Motherboard . ...965
A-5 Oscilloscope Eye Diagram. ...966
A-6 Testing the LTSSM with the Agilent N5309A Exerciser for PCIe 2.0. ...967
A-7 Representation of Traffic on a PCIe Link Using an Agilent Protocol Analyzer...... ...970
A-8 Finding a Particular Condition or Sequence of Events Using a Trigger Sequencer. ..973
A-9 Typical PCIe Error Conditions Triggerable on Agilent Protocol Analyzer . ...974
A-10 Agilent PCIe 2.0 Exerciser Provides Powerful and Flexible Validation Platform ...975
A-11 The Agilent PCIe 2.0 N5309A Exerciser... ...976
A-12 Templates for Creating Traffic Using the Exerciser.. ...977
A-13 Adding Single or Multiple Error Scenarios to each Request ..978
A-14 Programming Completer Behaviors ..979
A-15 Inserting Request and Completion Errors Using the Exerciser ..980
A-16 Using Continuous Mode on Exerciser Running Loop of Memory and I/O Transactions ..981
A-17 The Exerciser Has Fully Programmable Memory and I/O Decoders ..982
A-18 Topology Tests Using the Agilent E2969A Protocol Test Card.. ...984
B-1 Migration from PCI to PCI Express. ...990
B-2 PCI Express in a Desktop System. ...992
B-3 PCI Express in a Server System. ...993
B-4 PCI Express in Embedded-Control Applications.. ...994
B-5 PCI Express in a Storage System . ...995
B-6 PCI Express in Communications Systems. ...997
C-1 Enumeration Using Transparent Bridges.. ..1002
C-2 Direct Address Translation.. ..1004
C-3 Look Up Table Translation Creates Multiple Windows . ..1005
C-4 Intelligent Adapters in PCI and PCI Express Systems. ..1006
C-5 Host Failover in PCI and PCI Express Systems. ..1008
C-6 Dual Host in a PCI and PCI Express System ..1010
C-7 Dual-Star Fabric . ...1012
C-8 Direct Address Translation... ..1014
C-9 Lookup Table Based Translation. ..1015
C-10 Use of Limit Register. ...1016
D-1 Class Code Register . ..1019
E-1 Lock Sequence Begins with Memory Read Lock Request. ...1037
E-2 Lock Completes with Memory Write Followed by Unlock Message ..1038
1 PC Architecture Book Series ..1
1-1 Bus Specifications and Release Dates .12
1-2 Comparison of Bus Frequency, Bandwidth and Number of Slots .. 13
1-3 PCI Express Aggregate Throughput for Various Link Widths. . . 14
2-1 PCI Express Non-Posted and Posted Transactions. .56
2-2 PCI Express TLP Packet Types . ..57
2-3 PCI Express Aggregate Throughput for Various Link Widths. .101
3-1 Ordered Set Types . .109
3-2 Data Link Layer Packet (DLLP) Types ..111
3-3 PCI Express Address Space And Transaction Types .113
3-4 PCI Express Posted and Non-Posted Transactions. .116
3-5 PCI Express TLP Variants And Routing Options ..117
3-6 TLP Header Type and Format Field Encodings. . . 120
3-7 Message Request Header Type Field Usage .134
3-8 Results Of Reading The BAR after Writing All "1s" To It. ..140
3-9 Results Of Reading The BAR Pair after Writing All "1s" To Both .142
3-10 Results Of Reading The IO BAR after Writing All "1s" To It . .144
3-11 6 GB, 64-Bit Prefetchable Base/Limit Register Setup. .146
3-12 2MB, 32-Bit Non-Prefetchable Base/Limit Register Setup .148
3-13 256 Byte IO Base/Limit Register Setup ..150
4-1 PCI Express Address Space And Transaction Types . .. 159
4-2 TLP Header Type Field Defines Transaction Variant .160
4-3 TLP Header Type Field Defines Transaction Variant .161
4-4 Generic Header Field Summary. ..163
4-5 TLP Header Type and Format Field Encodings. .165
4-6 IO Request Header Fields. .173
4-7 4DW Memory Request Header Fields. ..176
4-8 Configuration Request Header Fields ..181
4-9 Completion Header Fields .185
4-10 Message Request Header Fields ..191
4-11 INTx Interrupt Signaling Message Coding. . . 193
4-12 Power Management Message Coding ..194
4-13 Error Message Coding . ..195
4-14 Unlock Message Coding . . ..196
4-15 Slot Power Limit Message Coding. .196
4-16 Hot Plug Message Coding. .197
4-17 DLLP Packet Types. .. 201
4-18 Ack or Nak DLLP Fields. .. 203
4-19 Power Management DLLP Fields. . . 204
4-20 Flow Control DLLP Fields. . . 206
4-21 Vendor-Specific DLLP Fields. .. 207
5-1 Ack or Nak DLLP Fields. . 219
6-1 Example TC to VC Mappings . .263
7-1 Required Minimum Flow Control Advertisements ..303
8-1 Transactions That Can Be Reordered Due to Relaxed Ordering . . . 321
8-2 Fundamental Ordering Rules Based on Strong Ordering and RO Attribute .. 322
8-3 Weak Ordering Rules Enhance Performance . . 326
8-4 Ordering Rules with Deadlock Avoidance Rules .327
9-1 Format and Usage of Message Control Register. . 333
9-2 INTx Message Codes. ..351
10-1 Error Message Codes and Description ..371
10-2 Completion Code and Description . 372
10-3 Error-Related Command Register Bits . . 373
10-4 Description of PCI-Compatible Status Register Bits for Reporting Errors . 374
10-5 Default Classification of Errors. . 377
10-6 Transaction Layer Errors That are Logged .389
11-1 5-bit to 6-bit Encode Table for Data Characters ..427
11-2 5-bit to 6-bit Encode Table for Control Characters .429
11-3 3-bit to 4-bit Encode Table for Data Characters . . 429
11-4 3-bit to 4-bit Encode Table for Control Characters .430
11-5 Control Character Encoding and Definition. . . 432
12-1 Output Driver Characteristics. ..477
12-2 Input Receiver Characteristics . . . 480
14-1 Summary of TS1 and TS2 Ordered-Set Contents. .506
15-1 Maximum Power Consumption for System Board Expansion Slots .562
16-1 Major Software/Hardware Elements Involved In PC PM ..569
16-2 System PM States as Defined by the OnNow Design Initiative ..572
16-3 OnNow Definition of Device-Level PM States ..573
16-4 Concise Description of OnNow Device PM States ..574
16-5 Default Device Class PM States ..576
16-6 D0 Power Management Policies. .587
16-7 D1 Power Management Policies. .588
16-8 D2 Power Management Policies. .590
16-9 D3hot Power Management Policies ..592
16-10 D3cold Power Management Policies ..593
16-11 Description of Function State Transitions ..594
16-12 Function State Transition Delays ..596
16-13 The PMC Register Bit Assignments ..597
16-14 PM Control/Status Register (PMCSR) Bit Assignments ..600
16-15 Data Register Interpretation ..605
16-16 Relationship Between Device and Link Power States ..607
16-17 Link Power State Characteristics ..608
16-18 Active State Power Management Control Field Definition ..610
17-1 Introduction to Major Hot-Plug Software Elements ..655
17-2 Major Hot-Plug Hardware Elements ..656
17-3 Behavior and Meaning of the Slot Attention Indicator ..665
17-4 Behavior and Meaning of the Power Indicator ..666
17-5 Slot Capability Register Fields and Descriptions ..671
17-6 Slot Control Register Fields and Descriptions ..673
17-7 Slot Status Register Fields and Descriptions. ..675
17-8 The Primitives. .682
18-1 PCI Express Connector Pinout. . . 689
18-2 PCI Express Connector Auxiliary Signals. . 693
18-3 Power Supply Requirements .701
18-4 Add-in Card Power Dissipation. ..702
18-5 Card Interoperability. ..703
20-1 Enhanced Configuration Mechanism Memory-Mapped IO Address Range ..73
21-1 Capability Register's Device/Port Type Field Encoding ..753
22-1 Defined Class Codes ..775
22-2 BIST Register Bit Assignment . . .779
22-3 Currently-Assigned Capability IDs ..781
22-4 Command Register ..786
22-5 Status Register ..789
22-6 Bridge Command Register Bit Assignment ..833
22-7 Bridge Control Register Bit Assignment ..836
22-8 Bridge Primary Side Status Register ..838
22-9 Bridge Secondary Side Status Register ..841
22-10 AGP Status Register (Offset CAP_PTR + 4) ..846
22-11 AGP Command Register (Offset CAP_PTR + 8) ..847
22-12 Basic Format of VPD Data Structure. . 852
22-13 Format of the Identifier String Tag. . 853
22-14 Format of the VPD-R Descriptor . . 853
22-15 General Format of a Read or a Read/Write Keyword Entry . . 854
22-16 List of Read-Only VPD Keywords .854
22-17 Extended Capability (CP) Keyword Format. . 855
22-18 Format of Checksum Keyword. . . 855
22-19 Format of the VPD-W Descriptor. . 856
22-20 List of Read/Write VPD Keywords. . . 856
22-21 Example VPD List.. ..857
22-22 Slot Numbering Register Set. .859
22-23 Expansion Slot Register Bit Assignment. . . 864
23-1 PCI Expansion ROM Header Format . . 880
23-2 PC-Compatible Processor/Architecture Data Area In ROM Header. ..881
23-3 PCI Expansion ROM Data Structure Format. . . 882
24-1 PCI Express Capabilities Register ..899
24-2 Device Capabilities Register (read-only) ..901
24-3 Device Control Register (read/write) ..906
24-4 Device Status Register ..910
24-5 Link Capabilities Register ..913
24-6 Link Control Register ..916
24-7 Link Status Register ..918
24-8 Slot Capabilities Register (all fields are HWInit) ..921
24-9 Slot Control Register (all fields are RW) ..924
24-10 Slot Status Register ..926
24-11 Root Control Register (all fields are RW) ..927
24-12 Root Status Register ..929
24-13 Advanced Error Reporting Capability Register Set ..932
24-14 Port VC Capability Register 1 (Read-Only) ..942
24-15 Port VC Capability Register 2 (Read-Only) ..944
24-16 Port VC Control Register (Read-Write) ..945
24-17 Port VC Status Register (Read-Only) ..946
24-18 VC Resource Capability Register ..947
24-19 VC Resource Control Register (Read-Write) ..949
24-20 VC Resource Status Register (Read-Only) ..951
D-1 Defined Class Codes .1019
D-2 Class Code 0 (PCI rev 1.0) . .1020
D-3 Class Code 1: Mass Storage Controllers .1020
D-4 Class Code 2: Network Controllers .1021
D-5 Class Code 3: Display Controllers .1022
D-6 Class Code 4: Multimedia Devices. .1022
D-7 Class Code 5: Memory Controllers 1022
D-8 Class Code 6: Bridge Devices .1023
D-9 Class Code 7: Simple Communications Controllers .1024
D-10 Class Code 8: Base System Peripherals .1026
D-11 Class Code 9: Input Devices .1027
D-12 Class Code A: Docking Stations . .1027
D-13 Class Code B: Processors .1028
D-14 Class Code C: Serial Bus Controllers .1028
D-15 Class Code D: Wireless Controllers. .1029
D-16 Class Code E: Intelligent IO Controllers. .1030
D-17 Class Code F: Satellite Communications Controllers. .1030
D-18 Class Code 10h: Encryption/Decryption Controllers .1030
D-19 Class Code 11h: Data Acquisition and Signal Processing Controllers . .1031
D-20 Definition of IDE Programmer's Interface Byte Encoding ..1031

Acknowledgments

Thanks to those who made significant contributions to this book:
Joe Winkles - for his superb job of technical editing.
Jay Trodden - for his contribution in developing the chapter on Transaction Routing and Packet-Based Transactions.
Mike Jackson - for his contribution in preparing the Card Electromechanical chapter.
Dave Dzatko - for research and editing.
Special thanks to Agilent Technologies for supplying:
Appendix A: Test, Debug and Verification of PCI Express Designs by Gordon Getty, Agilent Technologies
Special thanks to PLX Technology for contributing two appendices:
Appendix B: Markets & Applications for the PCI Express TM Architecture
Appendix C: Implementing Intelligent Adapters and Multi-Host Systems With PCI Express TM Technology
Thanks also to the PCI SIG for giving permission to use some of the mechanical drawings from the specification.

The MindShare Architecture Series

The MindShare Architecture book series currently includes the books listed in Table 1 below. The entire book series is published by Addison-Wesley.
Table 1: PC Architecture Book Series
Category | Title | Edition | ISBN
Processor Architecture | 80486 System Architecture | 3rd | 0-201-40994-1
Processor Architecture | Pentium Processor System Architecture | 2nd | 0-201-40992-5
Processor Architecture | Pentium Pro and Pentium II System Architecture | 2nd | 0-201-30973-4
Processor Architecture | PowerPC System Architecture | 1st | 0-201-40990-9
Bus Architecture | PCI System Architecture | 4th | 0-201-30974-2
Bus Architecture | PCI-X System Architecture | 1st | 0-201-72682-3
Bus Architecture | EISA System Architecture | Out-of-print | 0-201-40995-X
Bus Architecture | Firewire System Architecture: IEEE 1394a | 2nd | 0-201-48535-4
Bus Architecture | ISA System Architecture | 3rd | 0-201-40996-8
Bus Architecture | Universal Serial Bus System Architecture 2.0 | 2nd | 0-201-46137-4
Bus Architecture | HyperTransport System Architecture | 1st | 0-321-16845-3
Bus Architecture | PCI Express System Architecture | 1st | 0-321-15630-7
Network Architecture | InfiniBand Network Architecture | 1st | 0-321-11765-4
Other Architectures | PCMCIA System Architecture: 16-Bit PC Cards | 2nd | 0-201-40991-7
Other Architectures | CardBus System Architecture | 1st | 0-201-40997-6
Other Architectures | Plug and Play System Architecture | 1st | 0-201-41013-3
Other Architectures | Protected Mode Software Architecture | 1st | 0-201-55447-X
Other Architectures | AGP System Architecture | 1st | 0-201-37964-3

Cautionary Note

The reader should keep in mind that MindShare's book series often details rapidly evolving technologies, as is the case with PCI Express. This being the case, it should be recognized that the book is a "snapshot" of the state of the technology at the time the book was completed. We make every attempt to produce our books on a timely basis, but a new revision of the specification may be released before the book can be updated to reflect it. This PCI Express book complies with revision 1.0a of the PCI Express TM Base Specification released and trademarked by the PCI Special Interest Group. Several expansion card form-factor specifications are planned for PCI Express, but only the Electromechanical specification, revision 1.0, had been released when this book was completed. However, the chapter covering the Card Electromechanical topic reviews several form-factors that were under development at the time of writing.

Intended Audience

This book is intended for use by hardware and software design and support personnel. The tutorial approach taken may also make it useful to technical personnel not directly involved in design, verification, and other support functions.

Prerequisite Knowledge

It is recommended that the reader have a reasonable background in PC architecture, including experience with or knowledge of an I/O bus and its related protocol. Because PCI Express maintains several levels of compatibility with the original PCI design, critical background information regarding PCI has been incorporated into this book. However, the reader may find it beneficial to read the MindShare publication entitled PCI System Architecture, which focuses on and details the PCI architecture.

Topics and Organization

Topics covered in this book and the flow of the book are as follows:
Part 1: Background and Comprehensive Overview. Provides an architectural perspective of the PCI Express technology by comparing and contrasting it with the PCI and PCI-X buses. It also introduces the major features of the PCI Express architecture.
Part 2: PCI Express Transaction Protocol. Includes packet format and field definition and use, along with transaction and link layer functions.
Part 3: Physical Layer Description. Describes the physical layer functions, link training and initialization, reset, and electrical signaling.
Part 4: Power-Related Topics. Discusses Power Budgeting and Power Management.
Part 5: Optional Topics. Discusses the major features of PCI Express that are optional, including Hot Plug and Expansion Card implementation details.
Part 6: PCI Express Configuration. Discusses the configuration process, accessing configuration space, and details the content and use of all configuration registers.
Appendices:
  • Test, Debug, and Verification
  • Markets & Applications for the PCI Express TM Architecture
  • Implementing Intelligent Adapters and Multi-Host Systems With PCI Express TM Technology
  • PCI Express Class Codes
  • Legacy Support for Locking

Documentation Conventions

This section defines the typographical conventions used throughout this book.

PCI Express TM

PCI Express TM is a trademark of the PCI SIG. This book takes the liberty of abbreviating PCI Express as "PCI-XP", primarily in illustrations where limited space is an issue.

Hexadecimal Notation

All hex numbers are followed by a lower case "h." For example:
89F2BD02h
0111h

Binary Notation

All binary numbers are followed by a lower case "b." For example:
1000100111110010b
01b

Decimal Notation

Numbers without any suffix are decimal. When required for clarity, decimal numbers are followed by a lower case "d." Examples:
9
15
512d

Bits Versus Bytes Notation

This book represents bits with a lower case "b" and bytes with an upper case "B." For example:
Megabits/second = Mb/s
Megabytes/second = MB/s

Bit Fields

Groups of bits are represented with the high-order bits first followed by the low-order bits and enclosed by brackets. For example:
[7:0] = bits 0 through 7

Active Signal States

Signals that are active low are followed by #, as in PERST# and WAKE#. Active high signals have no suffix, such as POWERGOOD.

Visit Our Web Site

Our web site lists all of our courses and the delivery options available for each course:
  • Information on MindShare courses:
  • Self-paced DVDs and CDs
  • Live web-delivered classes
  • Live on-site classes.
  • Free short courses on selected topics
  • Technical papers
  • Errata for a number of our books
All of our books are listed and can be ordered in bound or e-book versions.
www.mindshare.com


We Want Your Feedback

MindShare values your comments and suggestions. Contact us at:
Phone: (719) 487-1417 or within the U.S. (800) 633-1440
Fax: (719) 487-1434
Technical seminars: E-mail nancy@mindshare.com
Technical questions: E-mail don@mindshare.com or tom@mindshare.com
General information: E-mail info@mindshare.com
Mailing Address:
MindShare, Inc.
4285 Slash Pine Drive
Colorado Springs, CO 80908
Part One: The Big Picture

Chapter 1: Architectural Perspective

This Chapter

This chapter describes performance advantages and key features of the PCI Express (PCI-XP) Link. To highlight these advantages, this chapter describes performance characteristics and features of predecessor buses such as PCI and PCI-X buses with the goal of discussing the evolution of PCI Express from these predecessor buses. The reader will be able to compare and contrast features and performance points of PCI, PCI-X and PCI Express buses. The key features of a PCI Express system are described. In addition, the chapter describes some examples of PCI Express system topologies.

The Next Chapter

The next chapter describes in further detail the features of the PCI Express bus. It describes the layered architecture of a device design while providing a brief functional description of each layer. The chapter provides an overview of packet formation at a transmitter device, the transmission and reception of the packet over the PCI Express Link and packet decode at a receiver device.

Introduction To PCI Express

PCI Express is the third-generation, high-performance I/O bus used to interconnect peripheral devices in applications such as computing and communication platforms. The first generation buses include the ISA, EISA, VESA, and Micro Channel buses, while the second generation buses include PCI, AGP, and PCI-X. PCI Express is an all-encompassing I/O device interconnect bus that has applications in the mobile, desktop, workstation, server, embedded computing and communication platforms.

The Role of the Original PCI Solution

Don't Throw Away What is Good! Keep It

The PCI Express architects have carried forward the most beneficial features from previous generation bus architectures and have also taken advantage of new developments in computer architecture.
For example, PCI Express employs the same usage model and load-store communication model as PCI and PCI-X. PCI Express supports familiar transactions such as memory read/write, IO read/write and configuration read/write transactions. The memory, IO and configuration address space model is the same as PCI and PCI-X address spaces. By maintaining the address space model, existing OSs and driver software will run in a PCI Express system without any modifications. In other words, PCI Express is software backwards compatible with PCI and PCI-X systems. In fact, a PCI Express system will boot an existing OS with no changes to current drivers and application programs. Even PCI/ACPI power management software will still run.
Like predecessor buses, PCI Express supports chip-to-chip interconnect and board-to-board interconnect via cards and connectors. The connector and card structure are similar to PCI and PCI-X connectors and cards. A PCI Express motherboard has a form factor similar to existing FR4 ATX motherboards and is encased in the familiar PC package.

Make Improvements for the Future

To improve bus performance, reduce overall system cost and take advantage of new developments in computer design, the PCI Express architecture had to be significantly re-designed from its predecessor buses. PCI and PCI-X buses are multi-drop parallel interconnect buses in which many devices share one bus.
PCI Express on the other hand implements a serial, point-to-point type interconnect for communication between two devices. Multiple PCI Express devices are interconnected via the use of switches, which means one can practically connect a large number of devices together in a system. A point-to-point interconnect implies limited electrical load on the link, allowing transmission and reception frequencies to scale to much higher numbers. Currently, the PCI Express transmission and reception data rate is 2.5 Gbits/sec. A serial interconnect between two devices results in fewer pins per device package, which reduces PCI Express chip and board design cost and reduces board design complexity. PCI Express performance is also highly scalable. This is achieved by implementing a scalable number of pins and signal Lanes per interconnect based on communication performance requirements for that interconnect.
PCI Express implements switch-based technology to interconnect a large number of devices. Communication over the serial interconnect is accomplished using a packet-based communication protocol. Quality Of Service (QoS) features provide differentiated transmission performance for different applications. Hot Plug/Hot Swap support enables "always-on" systems. Advanced power management features allow one to design for low power mobile applications. RAS (Reliable, Available, Serviceable) error handling features make PCI Express suitable for robust high-end server applications. Hot plug, power management, error handling and interrupt signaling are accomplished in-band using packet based messaging rather than side-band signals. This keeps the device pin count low and reduces system cost.
The configuration address space available per function is extended to 4KB , allowing designers to define additional registers. However, new software is required to access this extended configuration register space.

Looking into the Future

In the future, PCI Express communication frequencies are expected to double and quadruple to 5 Gbits/sec and 10 Gbits/sec. Taking advantage of these frequencies will require Physical Layer re-design of a device with no changes necessary to the higher layers of the device design.
Additional mechanical form factors are expected. Support for a Server IO Module, Newcard (PC Card style), and Cable form factors is expected.

Predecessor Buses Compared

In an effort to compare and contrast features of predecessor buses, the next section of this chapter describes some of the key features of IO bus architectures defined by the PCI Special Interest Group (PCISIG). These buses, shown in Table 1-1 on page 12, include the PCI 33 MHz bus, PCI 66 MHz bus, PCI-X 66 MHz/133 MHz buses, PCI-X 266/533 MHz buses and finally PCI Express.
Table 1-1: Bus Specifications and Release Dates
Bus Type | Specification Release | Date of Release
PCI 33 MHz | 2.0 | 1993
PCI 66 MHz | 2.1 | 1995
PCI-X 66 MHz and 133 MHz | 1.0 | 1999
PCI-X 266 MHz and 533 MHz | 2.0 | Q1, 2002
PCI Express | 1.0 | Q2, 2002

Author's Disclaimer

In comparing these buses, it is not the authors' intention to suggest that any one bus is better than any other bus. Each bus architecture has its advantages and disadvantages. After evaluating the features of each bus architecture, a particular bus architecture may turn out to be more suitable for a specific application than another bus architecture. For example, it is the system designer's responsibility to determine whether to implement a PCI-X bus or PCI Express for the I/O interconnect in a high-end server design. Our goal in this chapter is to document the features of each bus architecture so that the designer can evaluate the various bus architectures.

Bus Performances and Number of Slots Compared

Table 1-2 on page 13 shows the various bus architectures defined by the PCISIG. The table shows the evolution of bus frequencies and bandwidths. As is obvious, increasing bus frequency results in increased bandwidth. However, increasing bus frequency compromises the number of electrical loads or number of connectors allowable on a bus at that frequency. At some point, for a given bus architecture, there is an upper limit beyond which one cannot further increase the bus frequency, hence requiring the definition of a new bus architecture.
Table 1-2: Comparison of Bus Frequency, Bandwidth and Number of Slots
Bus Type | Clock Frequency | Peak Bandwidth * | Number of Card Slots per Bus
PCI 32-bit | 33 MHz | 133 MBytes/sec | 4-5
PCI 32-bit | 66 MHz | 266 MBytes/sec | 1-2
PCI-X 32-bit | 66 MHz | 266 MBytes/sec | 4
PCI-X 32-bit | 133 MHz | 533 MBytes/sec | 1-2
PCI-X 32-bit | 266 MHz effective | 1066 MBytes/sec | 1
PCI-X 32-bit | 533 MHz effective | 2131 MBytes/sec | 1
* Double all these bandwidth numbers for 64-bit bus implementations.

PCI Express Aggregate Throughput

A PCI Express interconnect that connects two devices together is referred to as a Link. A Link consists of either x1, x2, x4, x8, x12, x16 or x32 signal pairs in each direction. These signals are referred to as Lanes. A designer determines how many Lanes to implement based on the targeted performance benchmark required on a given Link.
Table 1-3 shows aggregate bandwidth numbers for various Link width implementations. As is apparent from this table, the peak bandwidth achievable with PCI Express is significantly higher than any existing bus today.
Let us consider how these bandwidth numbers are calculated. The transmission/reception rate is 2.5 Gbits/sec per Lane per direction. To support a greater degree of robustness during data transmission and reception, each byte of data transmitted is converted into a 10-bit code (via an 8b/10b encoder in the transmitter device). In other words, for every byte of data to be transmitted, 10 bits of encoded data are actually transmitted, a 25% overhead in transmitted bits. Table 1-3 accounts for this encoding overhead.


PCI Express implements a dual-simplex Link which implies that data is transmitted and received simultaneously on a transmit and receive Lane. The aggregate bandwidth assumes simultaneous traffic in both directions.
To obtain the aggregate bandwidth numbers in Table 1-3, multiply 2.5 Gbits/sec by 2 (for each direction), then multiply by the number of Lanes, and finally divide by 10 bits per byte (to account for the 8-to-10 bit encoding).
Table 1-3: PCI Express Aggregate Throughput for Various Link Widths
PCI Express Link Width | x1 | x2 | x4 | x8 | x12 | x16 | x32
Aggregate Bandwidth (GBytes/sec) | 0.5 | 1 | 2 | 4 | 6 | 8 | 16
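To make the arithmetic above concrete, the short Python sketch below reproduces the values in Table 1-3 (the function name is ours, purely for illustration):

    # Aggregate throughput: 2.5 Gb/s per Lane per direction, two directions
    # (dual-simplex), and 10 transmitted bits per byte of data (8b/10b encoding).
    def aggregate_bandwidth_gbytes_per_sec(lanes, lane_rate_gbits=2.5):
        return lane_rate_gbits * 2 * lanes / 10

    for width in (1, 2, 4, 8, 12, 16, 32):
        print(f"x{width}: {aggregate_bandwidth_gbytes_per_sec(width):g} GBytes/sec")
    # Prints 0.5, 1, 2, 4, 6, 8 and 16 GBytes/sec, matching Table 1-3.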

Performance Per Pin Compared

As is apparent from Figure 1-1, PCI Express achieves the highest bandwidth per pin. This results in a device package with fewer pins and a motherboard implementation with fewer wires and hence overall reduced system cost per unit bandwidth.
In Figure 1-1, the first 7 bars are associated with PCI and PCI-X buses where we assume 84 pins per device. This includes 46 signal pins, interrupt and power management pins, error pins and the remainder are power and ground pins. The last bar associated with a x8 PCI Express Link assumes 40 pins per device which include 32 signal lines (8 differential pairs per direction) and the rest are power and ground pins.


Figure 1-1: Comparison of Performance Per Pin for Various Buses

I/O Bus Architecture Perspective

33 MHz PCI Bus Based System

Figure 1-2 on page 17 is a 33 MHz PCI bus based system. The PCI system consists of a Host (CPU) bus-to-PCI bus bridge, also referred to as the North bridge. Associated with the North bridge is the system memory bus, graphics (AGP) bus,and a 33MHz PCI bus. I/O devices share the PCI bus and are connected to it in a multi-drop fashion. These devices are either connected directly to the PCI bus on the motherboard or by way of a peripheral card plugged into a connector on the bus. Devices connected directly to the motherboard consume one electrical load while connectors are accounted for as 2 loads. A South bridge bridges the PCI bus to the ISA bus where slower, lower performance peripherals exist. Associated with the south bridge is a USB and IDE bus. A CD or hard disk is associated with the IDE bus. The South bridge contains an interrupt controller (not shown) to which interrupt signals from PCI devices are connected. The interrupt controller is connected to the CPU via an INTR signal or an APIC bus. The South bridge is the central resource that provides the source of reset, reference clock, and error reporting signals. Boot ROM exists on the ISA bus along with a Super IO chip, which includes keyboard, mouse, floppy disk controller and serial/parallel bus controllers. The PCI bus arbiter logic is included in the North bridge.
Figure 1-3 on page 18 represents a typical PCI bus cycle. The PCI bus clock is 33 MHz. The address bus width is 32-bits (4GB memory address space), although PCI optionally supports 64-bit address bus. The data bus width is implemented as either 32-bits or 64-bits depending on bus performance requirement. The address and data bus signals are multiplexed on the same pins (AD bus) to reduce pin count. Command signals (C/BE#) encode the transaction type of the bus cycle that master devices initiate. PCI supports 12 transaction types that include memory, IO, and configuration read/write bus cycles. Control signals such as FRAME#, DEVSEL#, TRDY#, IRDY#, STOP# are handshake signals used during bus cycles. Finally, the PCI bus consists of a few optional error related signals, interrupt signals and power management signals. A PCI master device implements a minimum of 49 signals.
Any PCI master device that wishes to initiate a bus cycle first arbitrates for use of the PCI bus by asserting a request (REQ#) to the arbiter in the North bridge. After receiving a grant (GNT#) from the arbiter and checking that the bus is idle, the master device can start a bus cycle.
Figure 1-2: 33 MHz PCI Bus Based Platform

Electrical Load Limit of a 33 MHz PCI Bus

The PCI specification theoretically supports 32 devices per PCI bus. This means that PCI enumeration software will detect and recognize up to 32 devices per bus. However, as a rule of thumb, a PCI bus can support a maximum of 10-12 electrical loads (devices) at 33MHz . PCI implements a static clocking protocol with a clock period of 30ns at 33MHz .
PCI implements reflected-wave switching signal drivers. The driver drives a half signal swing signal on the rising edge of PCI clock. The signal propagates down the PCI bus transmission line and is reflected at the end of the transmission line where there is no termination. The reflection causes the half swing signal to double. The doubled (full signal swing) signal must settle to a steady state value with sufficient setup time prior to the next rising edge of PCI clock where receiving devices sample the signal. The total time from when a driver drives a signal until the receiver detects a valid signal (including propagation time and reflection delay plus setup time) must be less than the clock period of 30ns.
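The constraint just described can be summarized as a simple timing budget; this is only a sketch, and the symbol names are ours rather than the PCI specification's:

    T_{clk} = \frac{1}{33.33\ \mathrm{MHz}} \approx 30\ \mathrm{ns}, \qquad
    t_{drive} + t_{prop+reflection} + t_{setup} \le T_{clk}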
Figure 1-3: Typical PCI Burst Memory Read Bus Cycle
The more electrical loads on a bus, the longer it takes for the signal to propagate and double with sufficient setup to the next rising edge of clock. As mentioned earlier, a 33 MHz PCI bus meets signal timing with no more than 10-12 loads. Connectors on the PCI bus are counted as 2 loads because the connector is accounted for as one load and the peripheral card with a PCI device is the second load. As indicated in Table 1-2 on page 13 a 33 MHz PCI bus can be designed with a maximum of 4-5 connectors.


To connect any more than 10-12 loads in a system requires the implementation of a PCI-to-PCI bridge as shown in Figure 1-4. This permits an additional 10-12 loads to be connected on the secondary PCI bus (bus 1). The PCI specification theoretically supports up to 256 buses in a system. This means that PCI enumeration software will detect and recognize up to 256 PCI buses per system.
Figure 1-4: 33 MHz PCI Based System Showing Implementation of a PCI-to-PCI Bridge
PCI Transaction Model - Programmed IO
Consider an example in which the CPU communicates with a PCI peripheral such as an Ethernet device shown in Figure 1-5. Transaction 1 shown in the figure, which is initiated by the CPU and targets a peripheral device, is referred to as a programmed IO transaction. Software commands the CPU to initiate a memory or IO read/write bus cycle on the host bus targeting an address mapped in a PCI device's address space. The North bridge arbitrates for use of the PCI bus and when it wins ownership of the bus generates a PCI memory or IO read/write bus cycle represented in Figure 1-3 on page 18. During the first clock of this bus cycle (known as the address phase), all target devices decode
the address. One target (the Ethernet device in this example) decodes the address and claims the transaction. The master (North bridge in this case) communicates with the claiming target (Ethernet controller). Data is transferred between master and target in subsequent clocks after the address phase of the bus cycle. Either 4 bytes or 8 bytes of data are transferred per clock tick depending on the PCI bus width. The bus cycle is referred to as a burst bus cycle if data is transferred back-to-back between master and target during multiple data phases of that bus cycle. Burst bus cycles result in the most efficient use of PCI bus bandwidth.
Figure 1-5: PCI Transaction Model
At 33 MHz and a bus width of 32 bits (4 Bytes), the peak bandwidth achievable is 4 Bytes × 33 MHz = 133 MBytes/sec. Peak bandwidth on a 64-bit bus is 266 MBytes/sec. See Table 1-2 on page 13.
Efficiency of the PCI bus for data payload transport is on the order of 50%. Efficiency is defined as the number of clocks during which data is transferred divided by the total number of clocks, times 100. The lost performance is due to bus idle time between bus cycles, arbitration time, time lost in the address phase of a bus cycle, wait states during data phases, delays during transaction retries (not discussed yet), as well as latencies through PCI bridges.
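A quick numeric sketch of the peak-versus-effective figures quoted above (the 50% efficiency value is the approximation from the text, not a measured number):

    # Peak PCI bandwidth = bus width in bytes x clock frequency.
    bus_width_bytes = 4                      # 32-bit bus; use 8 for a 64-bit bus
    clock_mhz = 33.33                        # nominal 33 MHz PCI clock
    peak_mbytes_per_sec = bus_width_bytes * clock_mhz        # ~133 MBytes/sec
    effective_mbytes_per_sec = 0.5 * peak_mbytes_per_sec     # ~50% efficiency
    print(f"peak ~{peak_mbytes_per_sec:.0f} MB/s, "
          f"effective ~{effective_mbytes_per_sec:.0f} MB/s")  # ~133 and ~67 MB/s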

PCI Transaction Model Direct Memory Access (DMA)

Data transfer between a PCI device and system memory is accomplished in two ways:
The first less efficient method uses programmed IO transfers as discussed in the previous section. The PCI device generates an interrupt to inform the CPU that it needs data transferred. The device interrupt service routine (ISR) causes the CPU to read from the PCI device into one of its own registers. The ISR then tells the CPU to write from its register to memory. Similarly, if data is to be moved from memory to the PCI device, the ISR tells the CPU to read from memory into its own register. The ISR then tells the CPU to write from its register to the PCI device. It is apparent that the process is very inefficient for two reasons. First, there are two bus cycles generated by the CPU for every data transfer, one to memory and one to the PCI device. Second, the CPU is busy transferring data rather than performing its primary function of executing application code.
The second more efficient method to transfer data is the DMA (direct memory access) method illustrated by Transaction 2 in Figure 1-5 on page 20, where the PCI device becomes a bus master. Upon command by a local application (software) which runs on a PCI peripheral or the PCI peripheral hardware itself, the PCI device may initiate a bus cycle to talk to memory. The PCI bus master device (SCSI device in this example) arbitrates for the PCI bus, wins ownership of the bus and initiates a PCI memory bus cycle. The North bridge which decodes the address acts as the target for the transaction. In the data phase of the bus cycle, data is transferred between the SCSI master and the North bridge target. The bridge in turn generates a DRAM bus cycle to communicate with system memory. The PCI peripheral generates an interrupt to inform the system software that the data transfer has completed. This bus master or DMA method of data transport is more efficient because the CPU is not involved in the data move and further only one burst bus cycle is generated to move a block of data.


PCI Transaction Model Peer-to-Peer

A Peer-to-peer transaction shown as Transaction 3 in Figure 1-5 on page 20 is the direct transfer of data between two PCI devices. A master that wishes to initiate a transaction, arbitrates, wins ownership of the bus and starts a transaction. A target PCI device that recognizes the address claims the bus cycle. For a write bus cycle, data is moved from master to target. For a read bus cycle, data is moved from target to master.

PCI Bus Arbitration

A PCI device that wishes to initiate a bus cycle arbitrates for use of the bus first. The arbiter implements an arbitration algorithm with which it decides who to grant the bus to next. The arbiter is able to grant the bus to the next requesting device while a bus cycle is in progress. This arbitration protocol is referred to as hidden bus arbitration. Hidden bus arbitration allows for more efficient hand over of the bus from one bus master device to another with only one idle clock between two bus cycles (referred to as back-to-back bus cycles). PCI protocol does not provide a standard mechanism by which system software or device drivers can configure the arbitration algorithm in order to provide for differentiated class of service for various applications.
Figure 1-6: PCI Bus Arbitration


PCI Delayed Transaction Protocol

PCI Retry Protocol: When a PCI master initiates a transaction to access a target device and the target device is not ready, the target signals a transaction retry. This scenario is illustrated in Figure 1-7.
Figure 1-7: PCI Transaction Retry Mechanism
Consider the following example in which the North bridge initiates a memory read transaction to read data from the Ethernet device. The Ethernet target claims the bus cycle. However, the Ethernet target does not immediately have the data to return to the North bridge master. The Ethernet device has two choices by which to delay the data transfer. The first is to insert wait-states in the data phase. If only a few wait-states are needed, then the data is still transferred efficiently. If however the target device requires more time (more than 16 clocks from the beginning of the transaction), then the second option the target has is to signal a retry with a signal called STOP#. A retry tells the master to end
the bus cycle prematurely without transferring data. Doing so prevents the bus from being held for a long time in wait-states, which compromises the bus efficiency. The bus master that is retried by the target waits a minimum of 2 clocks and must once again arbitrate for use of the bus to re-initiate the identical bus cycle. During the time that the bus master is retried, the arbiter can grant the bus to other requesting masters so that the PCI bus is more efficiently utilized. By the time the retried master is granted the bus and it re-initiates the bus cycle, hopefully the target will claim the cycle and will be ready to transfer data. The bus cycle goes to completion with data transfer. Otherwise, if the target is still not ready, it retries the master's bus cycle again and the process is repeated until the master successfully transfers data.
PCI Disconnect Protocol: When a PCI master initiates a transaction to access a target device and the target device is able to transfer at least one doubleword of data but cannot complete the entire data transfer, it disconnects the bus cycle at the point at which it cannot continue the data transfer. This scenario is illustrated in Figure 1-8.
Figure 1-8: PCI Transaction Disconnect Mechanism
Consider the following example in which the North bridge initiates a burst memory read transaction to read data from the Ethernet device. The Ethernet target device claims the bus cycle and transfers some data, but then runs out of data to transfer. The Ethernet device has two choices to delay the data transfer. The first option is to insert wait-states during the current data phase while waiting for additional data to arrive. If the target needs to insert only a few wait-states, then the data is still transferred efficiently. If however the target device requires more time (the PCI specification allows a maximum of 8 clocks in the data phase), then the target device must signal a disconnect. To do this the target asserts STOP# in the middle of the bus cycle to tell the master to end the bus cycle prematurely. A disconnect results in some data being transferred, while a retry does not result in any data transfer. Disconnect frees the bus from long periods of wait states. The disconnected master waits a minimum of 2 clocks before once again arbitrating for use of the bus and continuing the bus cycle at the disconnected address. During the time that the bus master is disconnected, the arbiter may grant the bus to other requesting masters so that the PCI bus is utilized more efficiently. By the time the disconnected master is granted the bus and continues the bus cycle, hopefully the target is ready to continue the data transfer until it is completed. Otherwise, the target once again retries or disconnects the master's bus cycle and the process is repeated until the master successfully transfers all its data.

PCI Interrupt Handling

Central to the PCI interrupt handling protocol is the interrupt controller shown in Figure 1-9. PCI devices use one of four interrupt signals (INTA#, INTB#, INTC#, INTD#) to trigger an interrupt request to the interrupt controller. In turn, the interrupt controller asserts INTR to the CPU. If the architecture supports an APIC (Advanced Programmable Interrupt Controller), then it sends an APIC message to the CPU as opposed to asserting the INTR signal. The interrupted CPU determines the source of the interrupt, saves its state and services the device that generated the interrupt. Interrupts on PCI INTx# signals are sharable, which allows multiple devices to generate their interrupts on the same interrupt signal. OS software incurs the overhead of determining which of the devices sharing the interrupt signal actually generated the interrupt. This is accomplished by polling the Interrupt Pending bit mapped in each device's memory space. Doing so incurs additional latency in servicing the interrupting device.
Figure 1-9: PCI Interrupt Handling

PCI Error Handling

PCI devices may optionally be designed to detect address and data phase parity errors during transactions. Even parity is generated on the PAR signal during each bus cycle's address and data phases. The device that receives the address or data during a bus cycle uses the parity signal to determine whether a parity error has occurred due to noise on the PCI bus. If a device detects an address phase parity error, it asserts SERR#. If a device detects a data phase parity error, it asserts PERR#. The PERR# and SERR# signals are connected to the error logic (in the South bridge) as shown in Figure 1-10 on page 27. In many systems, the error logic asserts the NMI signal (non-maskable interrupt) to the CPU upon detecting PERR# or SERR#. This interrupt results in notification of a parity error and the system shuts down (we all know the blue screen of death). Kind of draconian, don't you agree?

Figure 1-10: PCI Error Handling Protocol
Unfortunately, PCI error detection and reporting is not robust. PCI errors are fatal, uncorrectable errors that many times result in system shutdown. Further, parity errors are detectable only when an odd number of signals are affected by noise. Given the weak PCI error detection protocol and error handling policies, many system designs either disable or do not support error checking and reporting.

PCI Address Space Map

PCI architecture supports 3 address spaces shown in Figure 1-11. These are the memory, IO and configuration address spaces. The memory address space goes up to 4GB for systems that support 32-bit memory addressing and optionally up to 16EB (exabytes) for systems that support 64-bit memory addressing. PCI supports up to 4GB of IO address space; however, many platforms limit IO space to 64KB because x86 CPUs support only 64KB of IO address space. PCI devices are configured to map to a configurable region within either the memory or IO address space.
Figure 1-11: Address Space Mapping
PCI device configuration registers map to a third space called configuration address space. Each PCI function may have up to 256 Bytes of configuration address space. The total configuration address space is 16 MBytes. This is calculated by multiplying 256 Bytes by 8 functions per device, by 32 devices per bus, by 256 buses per system. An x86 CPU can access memory or IO address space but does not support configuration address space directly. Instead, the CPU accesses PCI configuration space indirectly by indexing through an IO-mapped Address Port and Data Port in the host bridge (North bridge or MCH). The Address Port is located at IO address CF8h-CFBh and the Data Port is mapped to location CFCh-CFFh.
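The arithmetic behind the 16 MByte figure can be checked with a few lines of C; the constants below simply restate the limits given in the text.

    #include <stdio.h>

    /* Limits stated in the text: 256 buses per system, 32 devices per bus,
     * 8 functions per device, 256 bytes of configuration space per function. */
    #define PCI_BUSES            256
    #define PCI_DEVS_PER_BUS     32
    #define PCI_FUNCS_PER_DEV    8
    #define PCI_CFG_BYTES_PER_FN 256

    int main(void)
    {
        unsigned long total = (unsigned long)PCI_BUSES * PCI_DEVS_PER_BUS *
                              PCI_FUNCS_PER_DEV * PCI_CFG_BYTES_PER_FN;
        printf("Total PCI configuration space: %lu bytes (%lu MB)\n",
               total, total >> 20);   /* 16777216 bytes, i.e. 16 MB */
        return 0;
    }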

PCI Configuration Cycle Generation

PCI configuration cycle generation involves two steps.
Step 1: The CPU generates an IO write to the Address Port at IO address CF8h in the North bridge. The data written to the Address Port is the configuration register address to be accessed.
Figure 1-12: PCI Configuration Cycle Generation
Step 2: The CPU either generates an IO read or IO write to the Data Port at location CFCh in the North bridge. The North bridge in turn then generates either a configuration read or configuration write transaction on the PCI bus.
The address for the configuration transaction address phase is obtained from the contents of the Address register. During the configuration bus cycle, one of the point-to-point IDSEL signals shown in Figure 1-12 on page 29 is asserted to select the device whose register is being accessed. That PCI target device claims the configuration cycle and fulfills the request.
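The two-step mechanism can be sketched in C as shown below. This is a minimal user-space sketch for an x86 Linux environment (it requires IO-port privileges, for example via iopl()); the CONFIG_ADDRESS bit layout used here — enable bit 31, bus in bits 23:16, device in bits 15:11, function in bits 10:8, dword-aligned register in bits 7:2 — is the standard PCI Configuration Mechanism #1 encoding.

    #include <stdint.h>
    #include <sys/io.h>           /* outl()/inl(); x86 Linux, needs iopl()/ioperm() */

    #define PCI_CONFIG_ADDRESS 0xCF8
    #define PCI_CONFIG_DATA    0xCFC

    /* Read a 32-bit configuration register using the CF8h/CFCh mechanism. */
    static uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev,
                                   uint8_t func, uint8_t reg)
    {
        uint32_t addr = (1u << 31)              /* enable bit             */
                      | ((uint32_t)bus  << 16)  /* bus number             */
                      | ((uint32_t)dev  << 11)  /* device number (0-31)   */
                      | ((uint32_t)func << 8)   /* function number (0-7)  */
                      | (reg & 0xFC);           /* dword-aligned register */

        outl(addr, PCI_CONFIG_ADDRESS);   /* Step 1: IO write to the Address Port */
        return inl(PCI_CONFIG_DATA);      /* Step 2: IO read from the Data Port   */
    }

Reading register 00h of bus 0, device 0, function 0 with this routine, for example, returns the Vendor ID and Device ID of the host bridge.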

PCI Function Configuration Register Space

Each PCI function contains up to 256 Bytes of configuration register space. The first 64 Bytes are configuration header registers and the remaining 192 Bytes are device-specific registers. The header registers are configured at boot time by the Boot ROM configuration firmware and by the OS. The device-specific registers are configured by the device's device driver that is loaded and executed by the OS at boot time.
Figure 1-13: 256 Byte PCI Function Configuration Register Space
Within the header space, the Base Address Registers (BARs) are among the most important registers configured by the 'Plug and Play' configuration software. It is via these registers that software assigns a device its memory and/or IO address space within the system's memory and IO address space. No two devices are assigned the same address range, thus ensuring the 'plug and play' nature of the PCI system.
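Before assigning an address range, configuration software must first discover how much address space each BAR requires. A common way to sketch this sizing step is shown below; pci_cfg_read32() is the helper shown earlier, pci_cfg_write32() is an assumed companion routine, and 10h is the offset of the first BAR in a Type 0 header.

    #include <stdint.h>

    /* Assumed helpers built on the CF8h/CFCh mechanism shown earlier. */
    uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev, uint8_t func, uint8_t reg);
    void     pci_cfg_write32(uint8_t bus, uint8_t dev, uint8_t func,
                             uint8_t reg, uint32_t val);

    /* Size a 32-bit memory BAR using the classic
     * "write all 1s, read back, invert" method.   */
    static uint32_t pci_bar_size(uint8_t bus, uint8_t dev, uint8_t func,
                                 uint8_t bar_offset /* 0x10, 0x14, ... */)
    {
        uint32_t orig = pci_cfg_read32(bus, dev, func, bar_offset);

        pci_cfg_write32(bus, dev, func, bar_offset, 0xFFFFFFFF);
        uint32_t readback = pci_cfg_read32(bus, dev, func, bar_offset);
        pci_cfg_write32(bus, dev, func, bar_offset, orig);     /* restore */

        if (readback == 0)
            return 0;                       /* BAR not implemented */

        /* For a memory BAR, bits 3:0 encode type/prefetchable, not address. */
        uint32_t mask = readback & ~0xFu;
        return ~mask + 1;                   /* e.g. 0xFFFF0000 -> 64 KB */
    }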

PCI Programming Model

Software instructions may cause the CPU to generate memory or IO read/write bus cycles. The North bridge decodes the address of the resulting CPU bus cycles, and if the address maps to PCI address space, the bridge in turn generates a PCI memory or IO read/write bus cycle. A target device on the PCI bus claims the cycle and completes the transfer. In summary, the CPU communicates with any PCI device via the North bridge, which generates PCI memory or IO bus cycles on behalf of the CPU.
An intelligent PCI device that includes a local processor or bus master state machine (typically intelligent IO cards) can also initiate PCI memory or IO transactions on the PCI bus. These masters can communicate directly with any other devices, including system memory associated with the North bridge.
A device driver executing on the CPU configures the device-specific configuration register space of an associated PCI device. A configured PCI device that is bus master capable can initiate its own transactions, which allows it to communicate with any other PCI target device including system memory associated with the North bridge.
The CPU can access configuration space as described in the previous section.
PCI Express architecture assumes the same programming model as the PCI programming model described above. In fact, current OSs written for PCI systems can boot a PCI Express system. Current PCI device drivers will initialize PCI Express devices without any driver changes. PCI configuration and enumeration firmware will function unmodified on a PCI Express system.

Limitations of a 33 MHz PCI System

As indicated in Table 1-2 on page 13, peak bandwidth achievable on a 64-bit 33 MHz PCI bus is 266 Mbytes/sec. Current high-end workstation and server applications require greater bandwidth.
Applications such as gigabit Ethernet and high-performance disk transfers in RAID and SCSI configurations require greater bandwidth capability than the 33 MHz PCI bus offers.

Latest Generation of Intel PCI Chipsets

Figure 1-14 shows an example of a later generation Intel PCI chipset. The two shaded devices are NOT the North bridge and South bridge shown in earlier diagrams. Instead, one device is the Memory Controller Hub (MCH) and the other is the IO Controller Hub (ICH). The two chips are connected by a proprietary Intel high throughput, low pin count bus called the Hub Link.
Figure 1-14: Latest Generation of PCI Chipsets
The ICH includes the South bridge functionality but does not support the ISA bus. Other buses associated with ICH include LPC (low pin count) bus, AC'97, Ethernet, Boot ROM, IDE, USB, SMbus and finally the PCI bus. The advantage of this architecture over previous architectures is that the IDE, USB, Ethernet and audio devices do not transfer their data through the PCI bus to memory as is the case with earlier chipsets. Instead they do so through the Hub Link. Hub Link is a higher performance bus compared to PCI. In other words, these devices bypass the PCI bus when communicating with memory. The result is improved performance.

66 MHz PCI Bus Based System

High-end systems that require better IO bandwidth implement 66 MHz, 64-bit PCI buses. Such a PCI bus supports a peak data transfer rate of 533 MBytes/sec.
The PCI 2.1 specification released in 1995 added 66MHz PCI support.
Figure 1-15 shows an example of a 66 MHz PCI bus based system. This system has similar features to those described in Figure 1-14 on page 32. However, the MCH chip in this example supports two additional Hub Link buses that connect to P64H (PCI 64-bit Hub) bridge chips, providing access to the 64-bit, 66 MHz buses. These buses each support 1 connector in which a high-end peripheral card may be installed.
Figure 1-15: 66 MHz PCI Bus Based Platform

Limitations of the 66 MHz PCI Bus

The PCI clock period at 66 MHz is 15 ns. Recall that PCI uses reflected-wave signaling, whose weaker drivers have slower rise and fall times compared to incident-wave signaling drivers. It is a challenge to design a 66 MHz device or system that satisfies the signal timing requirements.
A 66 MHz PCI based motherboard is routed with shorter signal traces to ensure shorter signal propagation delays. In addition, the bus carries fewer loads in order to ensure faster signal rise and fall times. Taking into account typical board impedances and minimum signal trace lengths, it is possible to interconnect a maximum of four to five 66 MHz PCI devices. Only one or two connectors may be placed on a 66 MHz PCI bus. This is a significant limitation for a system that requires multiple devices to be interconnected.
The solution requires the addition of PCI bridges and hence multiple buses to interconnect devices. This solution is expensive and consumes additional board real estate. In addition, transactions between devices on opposite sides of a bridge complete with greater latency because bridges implement delayed transactions. This requires bridges to retry all transactions that must cross to the other side (with the exception of memory writes which are posted).

Limitations of PCI Architecture

The maximum frequency achievable with the PCI architecture is 66 MHz. This is a result of the static clock method of driving and latching signals and of the use of reflected-wave signaling.
PCI bus efficiency is on the order of 50% to 60%. Some of the factors that contribute to this reduced efficiency are listed below.
The PCI specification allows master and target devices to insert wait-states during data phases of a bus cycle. Slow devices will add wait-states which reduces the efficiency of bus cycles.
PCI bus cycles do not indicate transfer size. This makes buffer management within master and target devices inefficient.
Delayed transactions on PCI are handled inefficiently. When a master is retried, it guesses when to try again. If the master tries too soon, the target may retry the transaction again. If the master waits too long to retry, the latency to complete a data transfer is increased. Similarly, if a target disconnects a transaction the master must guess when to resume the bus cycle at a later time.
All PCI bus master accesses to system memory result in a snoop access to the CPU cache. Doing so results in additional wait-states during PCI bus master accesses of system memory. The North bridge or MCH must assume all system memory address space is cacheable even though this may not be the case. PCI bus cycles provide no mechanism by which to indicate an access to non-cacheable memory address space.
PCI architecture observes strict ordering rules as defined by the specification. Even if a PCI application does not require observation of these strict ordering rules, PCI bus cycles provide no mechanism to allow relaxed ordering. Observing relaxed ordering rules allows bus cycles (especially those that cross a bridge) to complete with reduced latency.
PCI interrupt handling architecture is inefficient, especially because multiple devices share a PCI interrupt signal. Additional software latency is incurred while software discovers which of the devices sharing an interrupt signal actually generated the interrupt.
The processor's NMI interrupt input is asserted when a PCI parity or system error is detected. Ultimately the system shuts down when an error is detected. This is a severe response. A more appropriate response might be to detect the error and attempt error recovery. PCI does not require error recovery features, nor does it support an extensive register set for documenting a variety of detectable errors.
These limitations above have been resolved in the next generation bus architectures, namely PCI-X and PCI Express.

66 MHz and 133 MHz PCI-X 1.0 Bus Based Platforms

Figure 1-16 on page 36 is an example of an Intel 7500 server chipset based system. This chipset has similarities to the 8XX chipset described earlier. The MCH and ICH chips are connected via a Hub Link 1.0 bus. Associated with the ICH is a 32-bit, 33 MHz PCI bus. The 7500 MCH chip includes 3 additional high performance Hub Link 2.0 ports. These Hub Link ports are connected to 3 Hub Link-to-PCI-X Hub 2 bridges (P64H2). Each P64H2 bridge supports 2 PCI-X buses that can run at frequencies up to 133 MHz. Hub Link 2.0 Links can sustain the higher bandwidth requirements of PCI-X traffic that targets system memory.

PCI-X Features

The PCI-X bus is a higher frequency, higher performance, higher efficiency bus compared to the PCI bus.
PCI-X devices can be plugged into PCI slots and vice-versa. PCI-X and PCI slots employ the same connector format. Thus, PCI-X is 100% backward compatible with PCI from both a hardware and software standpoint. The device drivers, OS, and applications that run on a PCI system also run on a PCI-X system.
PCI-X signals are registered. A registered signal requires a smaller setup time to sample than the non-registered signals employed in PCI. Also, PCI-X devices employ PLLs that are used to pre-drive signals with a smaller clock-to-out time. The time gained from the reduced setup and clock-to-out times is used towards increased clock frequency capability and the ability to support more devices on the bus at a given frequency compared to PCI. PCI-X supports 8-10 loads or 4 connectors at 66 MHz and 3-4 loads or 1-2 connectors at 133 MHz.
The peak bandwidth achievable with 64-bit 133 MHz PCI-X is 1064 MBytes/sec.
Following the first data phase, the PCI-X bus does not allow wait states during subsequent data phases.
Most PCI-X bus cycles are burst cycles and data is generally transferred in blocks of no less than 128 Bytes. This results in higher bus utilization. Further, the transfer size is specified in the attribute phase of PCI-X transactions. This allows for more efficient device buffer management. Figure 1-17 is an example of a PCI-X burst memory read transaction.
Figure 1-17: Example PCI-X Burst Memory Read Bus Cycle
PCI-X Requester/Completer Split Transaction Model. Consider an example of the split transaction protocol supported by PCI-X for delaying transactions. This protocol is illustrated in Figure 1-18. A requester initiates a read transaction. The completer that claims the bus cycle may be unable to return the requested data immediately. Rather than signaling a retry as would be the case with the PCI protocol, the completer memorizes the transaction (address, transaction type, byte count and requester ID are memorized) and signals a split response. This prompts the requester to end the bus cycle, and the bus goes idle. The PCI-X bus is now available for other transactions, resulting in more efficient bus utilization. Meanwhile, the requester simply waits for the completer to supply the requested data at a later time. Once the completer has gathered the requested data, it arbitrates for and obtains bus ownership and initiates a split completion bus cycle during which it returns the requested data. The requester claims the split completion bus cycle and accepts the data from the completer.
The split completion bus cycle is very much like a write bus cycle. Exactly two bus transactions are needed to complete the entire data transfer. In between these two bus transactions (the read request and the split completion transaction) the bus is utilized for other transactions. The requester also receives the requested data in a very efficient manner.
PCI Express architecture employs a similar transaction protocol.
Figure 1-18: PCI-X Split Transaction Protocol
These performance enhancement features described so far contribute towards an increased transfer efficiency of 85% for PCI-X as compared to 50%-60% with PCI protocol.
PCI-X devices must support Message Signaled Interrupt (MSI) architecture, which is a more efficient architecture than the legacy interrupt architecture described in the PCI architecture section. To generate an interrupt request, a
PCI-X device initiates a memory write transaction targeting the Host (North) bridge. The data written is a unique interrupt vector associated with the device generating the interrupt. The Host bridge interrupts the CPU and the vector is delivered to the CPU in a platform-specific manner. With this vector, the CPU is immediately able to run an interrupt service routine to service the interrupting device. There is no software overhead in determining which device generated the interrupt. Also, unlike in the PCI architecture, no interrupt pins are required.
PCI Express architecture implements the MSI protocol, resulting in reduced interrupt servicing latency and elimination of interrupt signals.
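Conceptually, MSI turns an interrupt into an ordinary memory write: system software programs a message address and data value into the device's MSI capability registers, and the device later writes that data value to that address to signal an interrupt. The simplified C view below (32-bit address variant of the MSI capability, Capability ID 05h) is illustrative only; consult the specification for the exact register layout.

    #include <stdint.h>

    /* Simplified view of the 32-bit-address MSI capability structure
     * (Capability ID 05h) that system software programs in a device. */
    struct msi_capability {
        uint8_t  cap_id;       /* 05h identifies the MSI capability          */
        uint8_t  next_ptr;     /* offset of the next capability, 0 if last   */
        uint16_t msg_control;  /* MSI enable, multiple-message enable fields */
        uint32_t msg_address;  /* address the device writes to interrupt     */
        uint16_t msg_data;     /* interrupt vector value that is written     */
    };

    /* To signal an interrupt, the device performs an ordinary memory write
     * of msg_data to msg_address. On PCI Express this becomes a posted
     * memory write packet, so no interrupt pins are required.              */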
PCI Express architecture also supports the RO (Relaxed Ordering) and NS (No Snoop) bit features, with the result that transactions with either NS=1 or RO=1 complete with better performance than transactions with NS=0 or RO=0. PCI transactions by definition assume NS=0 and RO=0.
  • NS - No Snoop (NS) may be used when accessing system memory. PCI-X bus masters can use the NS bit to indicate whether the region of memory being accessed is cacheable (NS=0) or not (NS=1). For those transactions with NS=1, the Host bridge does not snoop the processor cache. The result is improved performance during accesses to non-cacheable memory.
  • RO - Relaxed Ordering (RO) allows transactions that do not have any order of completion requirements to complete more efficiently. We will not get into the details here. Suffice it to say that transactions with the RO bit set can complete on the bus in any order with respect to other transactions that are pending completion.
The PCI-X 2.0 specification released in Q1 2002 was designed to further increase the bandwidth capability of PCI-X bus. This bus is described next.

DDR and QDR PCI-X 2.0 Bus Based Platforms

Figure 1-19 shows a hypothetical PCI-X 2.0 system. This diagram is the author's best guess as to what a PCI-X 2.0 system will look like. PCI-X 2.0 devices and connectors are 100% hardware and software backwards compatible with PCI-X 1.0 as well as PCI devices and connectors. A PCI-X 2.0 bus supports either Dual Data Rate (DDR) or Quad Data Rate (QDR) data transport using a PCI-X 133 MHz clock and strobes that are phase shifted to provide the necessary clock edges.
A design requiring greater than 1 GByte/sec bus bandwidth can implement the DDR or QDR protocol. As indicated in Table 1-2 on page 13, PCI-X 2.0 peak bandwidth capability is 4256 MBytes/sec for a 64-bit, 533 MHz effective PCI-X bus. With the aid of a strobe clock, data is transferred two times or four times per 133 MHz clock.
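The peak-bandwidth figures quoted in this chapter for PCI, 66 MHz PCI, PCI-X 1.0 and PCI-X 2.0 all follow from the same expression — bus width in bytes, times clock frequency, times data transfers per clock. The short program below reproduces them; note that the chapter's figures round the 33/66/133 MHz clocks slightly differently, so treat the printed values as approximations.

    #include <stdio.h>

    /* Peak bandwidth in MBytes/sec for a parallel bus:
     * width (bytes) x clock (MHz) x data transfers per clock. */
    static double peak_mb_per_sec(int width_bytes, double clock_mhz, int xfers_per_clock)
    {
        return width_bytes * clock_mhz * xfers_per_clock;
    }

    int main(void)
    {
        printf("64-bit 33 MHz PCI      : ~%.0f MB/s\n", peak_mb_per_sec(8,  33.33, 1)); /* text quotes 266  */
        printf("64-bit 66 MHz PCI      : ~%.0f MB/s\n", peak_mb_per_sec(8,  66.66, 1)); /* text quotes 533  */
        printf("64-bit 133 MHz PCI-X   : ~%.0f MB/s\n", peak_mb_per_sec(8, 133.33, 1)); /* text quotes 1064 */
        printf("64-bit PCI-X 2.0 (QDR) : ~%.0f MB/s\n", peak_mb_per_sec(8, 133.33, 4)); /* text quotes 4256 */
        return 0;
    }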
PCI-X 2.0 devices also support ECC generation and checking. This allows auto-correction of single-bit errors and detection and reporting of multi-bit errors. Error handling is more robust than in PCI and PCI-X 1.0 systems, making this bus better suited for high-performance, robust, non-stop server applications.
A noteworthy point to remember is that, with such fast signal timing, it is only possible to support one connector on the PCI-X 2.0 bus. This implies that a PCI-X 2.0 bus essentially becomes a point-to-point connection with none of the multi-drop capability of its predecessor buses.
PCI-X 2.0 bridges are essentially switches with one primary bus and one or more downstream secondary buses as shown in Figure 1-19 on page 40.

The PCI Express Way

PCI Express provides a high-speed, high-performance, point-to-point, dual simplex, differential signaling Link for interconnecting devices. Data is transmitted from a device on one set of signals, and received on another set of signals.

The Link A Point-to-Point Interconnect

As shown in Figure 1-20, a PCI Express interconnect consists of either a x1, x2, x4, x8, x12, x16 or x32 point-to-point Link. A PCI Express Link is the physical connection between two devices. A Lane consists of one differential signal pair in each direction. A x1 Link therefore consists of 1 Lane, or 1 differential signal pair in each direction, for a total of 4 signals. A x32 Link consists of 32 Lanes, or 32 signal pairs in each direction, for a total of 128 signals. The Link supports a symmetric number of Lanes in each direction. During hardware initialization, the Link width and frequency of operation are initialized automatically by the devices on opposite ends of the Link. No OS or firmware is involved during Link-level initialization.
Figure 1-20: PCI Express Link

Differential Signaling

PCI Express devices employ differential drivers and receivers at each port. Figure 1-21 shows the electrical characteristics of a PCI Express signal. A positive voltage difference between the D+ and D- terminals implies Logical 1. A negative voltage difference between D+ and D- implies Logical 0. No voltage difference between D+ and D- means that the driver is in the high-impedance tristate condition, which is referred to as the electrical-idle and low-power state of the Link.

The PCI Express differential peak-to-peak signal voltage at the transmitter ranges from 800 mV to 1200 mV, while the differential peak voltage is one-half these values. The common mode voltage can be any voltage between 0 V and 3.6 V. The differential driver is DC isolated from the differential receiver at the opposite end of the Link by placing a capacitor at the driver side of the Link. Two devices at opposite ends of a Link may support different DC common mode voltages. The differential impedance at the receiver is matched with the board impedance to prevent reflections from occurring.
Figure 1-21: PCI Express Differential Signal

Switches Used to Interconnect Multiple Devices

Switches are implemented in systems requiring multiple devices to be interconnected. Switches can range from a 2-port device to an n-port device, where each port connects to a PCI Express Link. The specification does not indicate a maximum number of ports a switch can implement. A switch may be incorporated into a Root Complex device (Host bridge or North bridge equivalent), resulting in a multi-port root complex. Figure 1-23 on page 52 and Figure 1-25 on page 54 are examples of PCI Express systems showing multi-ported devices such as the root complex or switches.

Packet Based Protocol

Rather than the bus cycles we are familiar with from PCI and PCI-X architectures, PCI Express encodes transactions using a packet-based protocol. Packets are transmitted and received serially and byte striped across the available Lanes of the Link. The more Lanes implemented on a Link, the faster a packet is transmitted and the greater the bandwidth of the Link. The packets are used to support the split transaction protocol for non-posted transactions. Various types of packets are defined, such as memory read and write requests, IO read and write requests, configuration read and write requests, message requests and completions.

Bandwidth and Clocking

As is apparent from Table 1-3 on page 14, the aggregate bandwidth achievable with PCI Express is significantly higher than any bus available today. The PCI Express 1.0 specification supports 2.5 Gbits/sec/lane/direction transfer rate.
No clock signal exists on the Link. Each packet to be transmitted over the Link consists of bytes of information. Each byte is encoded into a 10-bit symbol (8b/10b encoding). All symbols are guaranteed to contain 0-to-1 and 1-to-0 transitions. The receiver uses a PLL to recover a clock from the 0-to-1 and 1-to-0 transitions of the incoming bit stream.
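Because each byte becomes a 10-bit symbol, only 8 of every 10 transmitted bits carry data. A short calculation, assuming the Generation 1 rate of 2.5 Gbits/s per Lane per direction, gives the effective per-direction and aggregate bandwidth of each Link width (these are the kinds of numbers summarized in Table 1-3).

    #include <stdio.h>

    int main(void)
    {
        const double gen1_gbps  = 2.5;   /* Gbits/s per Lane, per direction */
        const double efficiency = 0.8;   /* 8b/10b: 8 data bits of every 10 */
        const int widths[] = { 1, 2, 4, 8, 12, 16, 32 };

        double gb_per_lane_dir = gen1_gbps * efficiency / 8.0;   /* 0.25 GB/s */

        for (size_t i = 0; i < sizeof(widths) / sizeof(widths[0]); i++) {
            int w = widths[i];
            /* Aggregate counts both directions of the dual-simplex Link. */
            printf("x%-2d Link: %5.2f GB/s per direction, %5.2f GB/s aggregate\n",
                   w, gb_per_lane_dir * w, gb_per_lane_dir * w * 2.0);
        }
        return 0;
    }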

Address Space

PCI Express supports the same address spaces as PCI: memory, IO and configuration address spaces. In addition, the maximum configuration address space per device function is extended from 256 Bytes to 4 KBytes. New OS, drivers and applications are required to take advantage of this additional configuration address space. Also, a new messaging transaction and address space provides messaging capability between devices. Some messages are PCI Express standard messages used for error reporting, interrupt and power management messaging. Other messages are vendor defined messages.

PCI Express Transactions

PCI Express supports the same transaction types supported by PCI and PCI-X. These include memory read and memory write, I/O read and I/O write, configuration read and configuration write. In addition, PCI Express supports a new transaction type called Message transactions. These transactions are encoded using the packet-based PCI Express protocol described later.

PCI Express Transaction Model

PCI Express transactions can be divided into two categories. Those transactions that are non-posted and those that are posted. Non-posted transactions, such as memory reads, implement a split transaction communication model similar to the PCI-X split transaction protocol. For example, a requester device transmits a non-posted type memory read request packet to a completer. The completer returns a completion packet with the read data to the requester. Posted transactions, such as memory writes, consist of a memory write packet transmitted uni-directionally from requester to completer with no completion packet returned from completer to requester.

Error Handling and Robustness of Data Transfer

CRC fields are embedded within each packet transmitted. One of the CRC fields supports a Link-level error checking protocol whereby each receiver of a packet checks for Link-level CRC errors. Packets transmitted over the Link in error are recognized with a CRC error at the receiver. The transmitter of the packet is notified of the error by the receiver. The transmitter automatically retries sending the packet (with no software involvement), hopefully resulting in auto-correction of the error.
In addition, an optional CRC field within a packet allows for end-to-end data integrity checking required for high availability applications.
Error handling on PCI Express can be as rudimentary as PCI level error handling described earlier or can be robust enough for server-level requirements. A rich set of error logging registers and error reporting mechanisms provide for improved fault isolation and recovery solutions required by RAS (Reliable, Available, Serviceable) applications.

Quality of Service (QoS), Traffic Classes (TCs) and Virtual Channels (VCs)

The Quality of Service feature of PCI Express refers to the capability of routing packets from different applications through the fabric with differentiated priorities and deterministic latencies and bandwidth. For example, it may be desirable to ensure that Isochronous applications, such as video data packets, move through the fabric with higher priority and guaranteed bandwidth, while control data packets may not have specific bandwidth or latency requirements.
PCI Express packets contain a Traffic Class (TC) number between 0 and 7 that is assigned by the device's application or device driver. Packets with different TCs can move through the fabric with different priority, resulting in varying performances. These packets are routed through the fabric by utilizing virtual channel (VC) buffers implemented in switches, endpoints and root complex devices.
Each Traffic Class is individually mapped to a Virtual Channel (a VC can have several TCs mapped to it, but a TC cannot be mapped to multiple VCs). The TC in each packet is used by the transmitting and receiving ports to determine which VC buffer to drop the packet into. Switches and devices are configured to arbitrate and prioritize between packets from different VCs before forwarding. This arbitration is referred to as VC arbitration. In addition, packets arriving at different ingress ports are forwarded to their own VC buffers at the egress port. These transactions are prioritized based on the ingress port number when being merged into a common VC output buffer for delivery across the egress link. This arbitration is referred to as Port arbitration.
The result is that packets with different TC numbers could observe different performance when routed through the PCI Express fabric.
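The mapping rule described above — several TCs may share one VC, but each TC maps to exactly one VC — can be captured in a small lookup table. The mapping values below are purely illustrative, not a configuration required by the specification; only the TC0-to-VC0 mapping is fixed.

    #include <stdint.h>

    #define NUM_TCS 8   /* Traffic Classes TC0-TC7                         */
    #define NUM_VCS 2   /* number of VCs this hypothetical port implements */

    /* One entry per TC: several TCs may map to the same VC,
     * but each TC maps to exactly one VC. TC0 always maps to VC0. */
    static const uint8_t tc_to_vc[NUM_TCS] = {
        /* TC0..TC3 */ 0, 0, 0, 0,
        /* TC4..TC7 */ 1, 1, 1, 1
    };

    /* Select the VC buffer a packet should be dropped into, based on its TC. */
    static uint8_t vc_for_packet(uint8_t traffic_class)
    {
        return tc_to_vc[traffic_class & (NUM_TCS - 1)];
    }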

Flow Control

A packet transmitted by a device is received into a VC buffer in the receiver at the opposite end of the Link. The receiver periodically updates the transmitter with information regarding the amount of buffer space it has available. The transmitter device will only transmit a packet to the receiver if it knows that the receiving device has sufficient buffer space to hold the next transaction. The protocol by which the transmitter ensures that the receiving buffer has sufficient space available is referred to as flow control. The flow control mechanism guarantees that a transmitted packet will be accepted by the receiver, barring error conditions. As such, the PCI Express transaction protocol does not require support of packet retry (unless an error condition is detected in the receiver), thereby improving the efficiency with which packets are forwarded to a receiver via the Link.
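A minimal sketch of the transmitter side of this rule follows: before transmitting, the device checks that enough receiver buffer credits remain for the packet, and the receiver's periodic flow-control updates replenish the count. The structure and unit of a "credit" here are simplified for illustration; the actual protocol tracks header and data credits per VC with cumulative counters.

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified per-VC credit state kept by a transmitter. */
    struct fc_state {
        uint32_t credits_available;   /* receiver buffer space we may consume */
    };

    /* Transmit only if the receiver is known to have room (flow control). */
    static bool try_transmit(struct fc_state *fc, uint32_t packet_credits)
    {
        if (fc->credits_available < packet_credits)
            return false;             /* hold the packet; no bus-level retry needed */
        fc->credits_available -= packet_credits;
        /* ... hand the packet to the Link for transmission ... */
        return true;
    }

    /* Called when a flow-control update arrives from the receiver. */
    static void fc_update(struct fc_state *fc, uint32_t freed_credits)
    {
        fc->credits_available += freed_credits;
    }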

MSI Style Interrupt Handling Similar to PCI-X

Interrupt handling is accomplished in-band via a PCI-X-like MSI protocol. A PCI Express device uses a memory write packet to transmit an interrupt vector to the root complex host bridge device, which in turn interrupts the CPU. PCI Express devices are required to implement the MSI capability register block. PCI Express also supports legacy interrupt handling in-band by encoding interrupt signal transitions (for INTA#, INTB#, INTC# and INTD#) using Message transactions. Only endpoint devices that must support legacy functions and PCI Express-to-PCI bridges are allowed to support legacy interrupt generation.

Power Management

The PCI Express fabric consumes less power because the interconnect consists of fewer signals with smaller signal swings. Each device's power state is individually managed. PCI/PCI Express power management software determines the power management capability of each device and manages it individually in a manner similar to PCI. Devices can notify software of their current power state, and power management software can propagate a wake-up event through the fabric to power up a device or group of devices. Devices can also signal a wake-up event using an in-band mechanism or a side-band signal.
With no software involvement, devices place a Link into a power savings state after a time-out when they recognize that there are no packets to transmit over the Link. This capability is referred to as Active State power management.
PCI Express supports device power states: D0, D1, D2, D3-Hot and D3-Cold, where D0 is the full-on power state and D3-Cold is the lowest power state.
PCI Express also supports the following Link power states: L0, L0s, L1, L2 and L3, where L0 is the full-on Link state and L3 is the Link-off power state.

Hot Plug Support

PCI Express supports hot plug and surprise hot unplug without the use of side-band signals. Hot plug interrupt messages, communicated in-band to the root complex, trigger hot plug software to detect a hot plug or removal event. Rather than implementing a centralized hot plug controller as exists in PCI platforms, the hot plug controller function is distributed to the port logic associated with a hot-plug-capable port of a switch or root complex. Two colored LEDs, a Manually-operated Retention Latch (MRL), an MRL sensor, an attention button, a power control signal and a PRSNT2# signal are some of the elements of a hot-plug-capable port.

PCI Compatible Software Model

PCI Express employs the same programming model as the PCI and PCI-X systems described earlier in this chapter. The memory and IO address space remains the same as PCI/PCI-X. The first 256 Bytes of configuration space per PCI Express function is the same as PCI/PCI-X device configuration address space, thus ensuring that current OSs and device drivers will run on a PCI Express system. PCI Express architecture extends the configuration address space to 4KB per function. Updated OSs and device drivers are required to take advantage of and access this additional configuration address space.
PCI Express configuration model supports two mechanisms:
  1. PCI compatible configuration model which is 100% compatible with existing OSs and bus enumeration and configuration software for PCI/PCI-X systems.
  2. PCI Express enhanced configuration mechanism which provides access to additional configuration space beyond the first 256 Bytes and up to 4 KBytes per function.

Mechanical Form Factors

PCI Express architecture supports multiple platform interconnects such as chip-to-chip, board-to-peripheral card via PCI-like connectors and Mini PCI Express form factors for the mobile market. Specifications for these are fully defined. See "Add-in Cards and Connectors" on page 685 for details on PCI Express peripheral card and connector definition.
PCI-like Peripheral Card and Connector. Currently, x1, x4, x8 and x16 PCI-like connectors are defined along with associated peripheral cards. Desktop computers implementing PCI Express can have the same look and feel as current computers with no changes required to existing system form factors. PCI Express motherboards can have an ATX-like motherboard form factor.
Mini PCI Express Form Factor. The Mini PCI Express connector and add-in card implement a subset of the signals that exist on a standard PCI Express connector and add-in card. The form factor, as the name implies, is much smaller. This form factor targets the mobile computing market. The Mini PCI Express slot supports x1 PCI Express signals including power management signals. In addition, the slot supports LED control signals, a USB interface and an SMBus interface. The Mini PCI Express module is similar to, but smaller than, a PC Card.

Mechanical Form Factors Pending Release

As of May 2003, specifications for two new form factors have not been released. Below is a summary of publicly available information about these form factors.
NEWCARD Form Factor. Another new module form factor that will service both mobile and desktop markets is the NEWCARD form factor. This is a PCMCIA PC Card type form factor, but nearly half the size, that will support x1 PCI Express signals including power management signals. In addition, the slot supports USB and SMBus interfaces. Two sizes are defined, a narrower version and a wider version, though the thickness and depth remain the same. Although similar in appearance to the Mini PCI Express module, this is a different form factor.
Server IO Module (SIOM) Form Factor. This is a family of modules that targets the workstation and server market. They are designed with future support for larger PCI Express Lane widths and bit rates beyond the 2.5 Gbits/s Generation 1 transmission rate. Four form factors are under consideration: single- and double-width base modules, and single- and double-width full-height modules.

PCI Express Topology

Major components in the PCI Express system shown in Figure 1-22 include a root complex, switches, and endpoint devices.
Figure 1-22: PCI Express Topology
The Root Complex denotes the device that connects the CPU and memory subsystem to the PCI Express fabric. It may support one or more PCI Express ports. The root complex in this example supports 3 ports. Each port is connected to an endpoint device or to a switch which forms a sub-hierarchy. The root complex generates transaction requests on behalf of the CPU. It is capable of initiating configuration transaction requests on behalf of the CPU. It generates both memory and IO requests as well as locked transaction requests on behalf of the CPU. The root complex as a completer does not respond to locked requests. The root complex transmits packets out of its ports and receives packets on its ports, which it forwards to memory. A multi-port root complex may also route packets from one port to another port but is NOT required by the specification to do so.
The root complex implements central resources such as a hot plug controller, power management controller, interrupt controller, and error detection and reporting logic. The root complex initializes with a bus number, device number and function number which are used to form a requester ID or completer ID. The root complex bus, device and function numbers initialize to all 0s.
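The requester or completer ID mentioned above is simply the bus, device and function numbers packed into a 16-bit value (8-bit bus, 5-bit device, 3-bit function); a short sketch of the packing follows.

    #include <stdint.h>

    /* Pack bus/device/function into a 16-bit requester or completer ID:
     * bits 15:8 = bus, bits 7:3 = device, bits 2:0 = function.          */
    static uint16_t make_id(uint8_t bus, uint8_t dev, uint8_t func)
    {
        return (uint16_t)((bus << 8) | ((dev & 0x1F) << 3) | (func & 0x7));
    }

    /* The root complex described in the text initializes to bus 0,
     * device 0, function 0, so its ID is simply make_id(0, 0, 0) == 0. */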
A Hierarchy is a fabric of all the devices and Links associated with a root complex that are either directly connected to the root complex via its port(s) or indirectly connected via switches and bridges. In Figure 1-22 on page 48, the entire PCI Express fabric associated with the root is one hierarchy.
A Hierarchy Domain is a fabric of devices and Links that are associated with one port of the root complex. For example in Figure 1-22 on page 48, there are 3 hierarchy domains.
Endpoints are devices other than the root complex and switches that are requesters or completers of PCI Express transactions. They are peripheral devices such as Ethernet, USB or graphics devices. Endpoints initiate transactions as a requester or respond to transactions as a completer. Two types of endpoints exist, PCI Express endpoints and legacy endpoints. Legacy endpoints may support IO transactions. They may support locked transaction semantics as a completer but not as a requester. Interrupt-capable legacy devices may support legacy-style interrupt generation using message requests but must in addition support MSI generation using memory write transactions. Legacy devices are not required to support 64-bit memory addressing capability. PCI Express endpoints must not support IO or locked transaction semantics and must support MSI-style interrupt generation. PCI Express endpoints must support 64-bit memory addressing capability in prefetchable memory address space, though their non-prefetchable memory address space is permitted to be mapped below the 4 GByte boundary. Both types of endpoints implement Type 0 PCI configuration headers and respond to configuration transactions as completers. Each endpoint is initialized with a device ID (requester ID or completer ID) which consists of a bus number, device number, and function number. Endpoints are always device 0 on a bus.
Multi-Function Endpoints. Like PCI devices, PCI Express devices may support up to 8 functions per endpoint, with at least function number 0 implemented. However, a PCI Express Link supports only one endpoint, numbered device 0.
PCI Express-to-PCI(-X) Bridge is a bridge between PCI Express fabric and a PCI or PCI-X hierarchy. PCI Express System Architecture
A Requester is a device that originates a transaction in the PCI Express fabric. Root complex and endpoints are requester type devices.
A Completer is a device addressed or targeted by a requester. A requester reads data from a completer or writes data to a completer. Root complex and endpoints are completer type devices.
A Port is the interface between a PCI Express component and the Link. It consists of differential transmitters and receivers. An Upstream Port is a port that points in the direction of the root complex. A Downstream Port is a port that points away from the root complex. An endpoint port is an upstream port. A root complex port is a downstream port. An Ingress Port is a port that receives a packet. An Egress Port is a port that transmits a packet.
A Switch can be thought of as consisting of two or more logical PCI-to-PCI bridges, each bridge associated with a switch port. Each bridge implements configuration header 1 registers. Configuration and enumeration software detects and initializes each of the header 1 registers at boot time. The 4-port switch shown in Figure 1-22 on page 48 consists of 4 virtual bridges. These bridges are internally connected via a non-defined bus. The one port of a switch pointing in the direction of the root complex is the upstream port. All other ports pointing away from the root complex are downstream ports.
A switch forwards packets in a manner similar to PCI bridges using memory, IO or configuration address based routing. Switches must forward all types of transactions from any ingress port to any egress port. Switches forward these packets based on one of three routing mechanisms: address routing, ID routing, or implicit routing. The logical bridges within the switch implement PCI configuration header 1. The configuration header contains memory and IO base and limit address registers as well as primary bus number, secondary bus number and subordinate bus number registers. These registers are used by the switch to aid in packet routing and forwarding.
Switches implement two arbitration mechanisms, port arbitration and VC arbitration, by which they determine the priority with which to forward packets from ingress ports to egress ports. Switches support locked requests.

Enumerating the System

Standard PCI Plug and Play enumeration software can enumerate a PCI Express system. The Links are numbered in a manner similar to the PCI depth-first search enumeration algorithm. An example of the bus numbering is shown in Figure 1-22 on page 48. Each PCI Express Link is equivalent to a logical PCI bus. In other words, each Link is assigned a bus number by the bus enumerating software. A PCI Express endpoint is device 0 on a PCI Express Link of a given bus number. Only one device (device 0) exists per PCI Express Link. The internal bus within a switch that connects all the virtual bridges together is also numbered. The first Link associated with the root complex is numbered bus 1. Bus 0 is an internal virtual bus within the root complex. Buses downstream of a PCI Express-to-PCI(-X) bridge are enumerated the same way as in a PCI(-X) system.
Endpoints and PCI(-X) devices may implement up to 8 functions per device. Only 1 device is supported per PCI Express Link, though PCI(-X) buses may theoretically support up to 32 devices per bus. A system could theoretically include up to 256 buses, counting both PCI Express Links and PCI(-X) buses.
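A rough C sketch of the depth-first numbering just described is shown below. The config-access helpers are assumed (any CF8h/CFCh or enhanced mechanism would do), and real enumeration code also sizes BARs, programs primary-bus and bridge window registers and handles multi-function devices, all of which are omitted here.

    #include <stdint.h>

    /* Assumed config-access helpers; offsets 00h (Vendor ID), 0Eh (Header
     * Type), 19h (Secondary Bus) and 1Ah (Subordinate Bus) follow the
     * standard PCI configuration header layout.                           */
    uint16_t cfg_read16(uint8_t bus, uint8_t dev, uint8_t func, uint8_t off);
    void     cfg_write8(uint8_t bus, uint8_t dev, uint8_t func, uint8_t off, uint8_t val);

    static uint8_t next_bus;   /* next bus number to hand out */

    /* Depth-first scan: every Link (or PCI bus) gets a bus number, and each
     * bridge or switch port found is assigned a secondary bus that is then
     * scanned before the search moves on (classic depth-first order).       */
    static void scan_bus(uint8_t bus)
    {
        for (uint8_t dev = 0; dev < 32; dev++) {        /* 32 device slots; on a PCIe Link only device 0 responds */
            if (cfg_read16(bus, dev, 0, 0x00) == 0xFFFF)
                continue;                               /* no device responds here */

            uint8_t header_type = (uint8_t)cfg_read16(bus, dev, 0, 0x0E);
            if ((header_type & 0x7F) == 0x01) {         /* Type 1 header: bridge/switch port */
                uint8_t secondary = ++next_bus;
                cfg_write8(bus, dev, 0, 0x19, secondary);  /* secondary bus number  */
                cfg_write8(bus, dev, 0, 0x1A, 0xFF);       /* temporary subordinate */
                scan_bus(secondary);                       /* recurse: depth first  */
                cfg_write8(bus, dev, 0, 0x1A, next_bus);   /* final subordinate bus */
            }
        }
    }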

PCI Express System Block Diagram

Low Cost PCI Express Chipset

Figure 1-23 on page 52 is a block diagram of a low cost PCI Express based system. As of the writing of this book (April 2003) no real life PCI Express chipset architecture designs were publicly disclosed. The author describes here a practical low cost PCI Express chipset whose architecture is based on existing non-PCI Express chipset architectures. In this solution, AGP which connects MCH to a graphics controller in earlier MCH designs (see Figure 1-14 on page 32) is replaced with a PCI Express Link. The Hub Link that connects MCH to ICH is replaced with a PCI Express Link. And in addition to a PCI bus associated with ICH, the ICH chip supports 4 PCI Express Links. Some of these Links can connect directly to devices on the motherboard and some can be routed to connectors where peripheral cards are installed.
The CPU can communicate with PCI Express devices associated with the ICH as well as with the PCI Express graphics controller. PCI Express devices can communicate with system memory or the graphics controller associated with the MCH. PCI devices may also communicate with PCI Express devices and vice versa. In other words, the chipset supports peer-to-peer packet routing between PCI Express endpoints and PCI devices, memory and graphics. It is yet to be determined if first-generation PCI Express chipsets will support peer-to-peer packet routing between PCI Express endpoints. Remember that the specification does not require the root complex to support peer-to-peer packet routing between the multiple Links associated with the root complex.

This design does not require the use of switches if the number of PCI Express devices to be connected does not exceed the number of Links available in this design.
Figure 1-23: Low Cost PCI Express System

Another Low Cost PCI Express Chipset

Figure 1-24 on page 53 is a block diagram of another low cost PCI Express system. In this design, the Hub Link connects the root complex to an ICH device. The ICH device may be an existing design which has no PCI Express Link associated with it. Instead, all PCI Express Links are associated with the root complex. One of these Links connects to a graphics controller. The other Links directly connect to PCI Express endpoints on the motherboard or connect to PCI Express endpoints on peripheral cards inserted in slots.
Figure 1-24: Another Low Cost PCI Express System

High-End Server System

Figure 1-25 shows a more complex system requiring a large number of devices to be connected together. Multi-port switches are a necessary design feature to accomplish this. To support PCI or PCI-X buses, a PCI Express-to-PCI(-X) bridge is connected to one switch port. PCI Express packets can be routed from any device to any other device because switches support peer-to-peer packet routing (only multi-port root complex devices are not required to support peer-to-peer functionality).
Figure 1-25: PCI Express High-End Server System

PCI Express Specifications

As of the writing of this book (May 2003) the following are specifications released by the PCISIG.
  • PCI Express 1.0a Base Specification released Q2, 2003
  • PCI Express 1.0a Card Electromechanical Specification released Q2, 2002
  • PCI Express 1.0 Base Specification released Q2, 2002
  • PCI Express 1.0 Card Electromechanical Specification released Q2, 2002
  • Mini PCI Express 1.0 Specification released Q2, 2003
As of May 2003, the specifications pending release are: the PCI Express-to-PCI Bridge specification, Server IO Module specification, Cable specification, Backplane specification, updated Mini PCI Express specification, and NEWCARD specification.

Architecture Overview

Previous Chapter

The previous chapter described the performance advantages and key features of the PCI Express (PCI-XP) Link. To highlight these advantages, the chapter described the performance characteristics and features of predecessor buses such as the PCI and PCI-X buses, with the goal of discussing the evolution of PCI Express from these predecessor buses. It compared and contrasted features and performance points of the PCI, PCI-X and PCI Express buses. The key features of a PCI Express system were described. In addition, the chapter described some examples of PCI Express system topologies.

This Chapter

This chapter is an introduction to the PCI Express data transfer protocol. It describes the layered approach to PCI Express device design while describing the function of each device layer. Packet types employed in accomplishing data transfers are described without getting into packet content details. Finally, this chapter outlines the process of a requester initiating a transaction such as a memory read to read data from a completer across a Link.

The Next Chapter

The next chapter describes how packets are routed through a PCI Express fabric consisting of switches. Packets are routed based on a memory address, IO address, device ID or implicitly.

Introduction to PCI Express Transactions

PCI Express employs packets to accomplish data transfers between devices. A root complex can communicate with an endpoint. An endpoint can communicate with a root complex. An endpoint can communicate with another endpoint. Communication involves the transmission and reception of packets called Transaction Layer packets (TLPs).

PCI Express transactions can be grouped into four categories:
1) memory, 2) IO, 3) configuration, and 4) message transactions. Memory, IO and configuration transactions are supported in PCI and PCI-X architectures, but the message transaction is new to PCI Express. Transactions are defined as a series of one or more packet transmissions required to complete an information transfer between a requester and a completer. Table 2-1 is a more detailed list of transactions. These transactions can be categorized into non-posted transactions and posted transactions.
Table 2-1: PCI Express Non-Posted and Posted Transactions
Transaction TypeNon-Posted or Posted
Memory ReadNon-Posted
Memory WritePosted
Memory Read LockNon-Posted
IO ReadNon-Posted
IO WriteNon-Posted
Configuration Read (Type 0 and Type 1)Non-Posted
Configuration Write (Type 0 and Type 1)Non-Posted
MessagePosted
For Non-posted transactions, a requester transmits a TLP request packet to a completer. At a later time, the completer returns a TLP completion packet back to the requester. Non-posted transactions are handled as split transactions similar to the PCI-X split transaction model described on page 37 in Chapter 1. The purpose of the completion TLP is to confirm to the requester that the completer has received the request TLP. In addition, non-posted read transactions contain data in the completion TLP. Non-Posted write transactions contain data in the write request TLP.
For Posted transactions, a requester transmits a TLP request packet to a completer. The completer however does NOT return a completion TLP back to the requester. Posted transactions are optimized for best performance in completing the transaction at the expense of the requester not having knowledge of successful reception of the request by the completer. Posted transactions may or may not contain data in the request TLP.
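Table 2-1 can be expressed as a simple classification function; the enumerators below mirror the transaction types listed in the table.

    #include <stdbool.h>

    /* Transaction types from Table 2-1. */
    enum pcie_transaction {
        MEM_READ, MEM_WRITE, MEM_READ_LOCK,
        IO_READ, IO_WRITE,
        CFG_READ, CFG_WRITE,
        MESSAGE
    };

    /* Posted transactions receive no completion TLP; everything else is a
     * non-posted (split) transaction that expects a completion.            */
    static bool is_posted(enum pcie_transaction t)
    {
        return t == MEM_WRITE || t == MESSAGE;
    }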

PCI Express Transaction Protocol

Table 2-2 lists all of the TLP request and TLP completion packets. These packets are used in the transactions referenced in Table 2-1. Our goal in this section is to describe how these packets are used to complete transactions at a system level and not to describe the packet routing through the PCI Express fabric nor to describe packet contents in any detail.
Table 2-2: PCI Express TLP Packet Types
TLP Packet TypesAbbreviated Name
Memory Read RequestMRd
Memory Read Request - Locked accessMRdLk
Memory Write RequestMWr
IO ReadIORd
IO WriteIOWr
Configuration Read (Type 0 and Type 1)CfgRd0, CfgRd1
Configuration Write (Type 0 and Type 1)CfgWr0, CfgWr1
Message Request without DataMsg
Message Request with DataMsgD
Completion without DataCpl
Completion with DataCplD
Completion without Data - associated with Locked Memory Read RequestsCplLk
Completion with Data - associated with Locked Memory Read RequestsCplDLk

Non-Posted Read Transactions

Figure 2-1 shows the packets transmitted by a requester and completer to complete a non-posted read transaction. To complete this transfer, a requester transmits a non-posted read request TLP to a completer it intends to read data from. Non-posted read request TLPs include memory read request (MRd), IO read request (IORd), and configuration read request type 0 or type 1 (CfgRd0, CfgRd1) TLPs. Requesters may be root complex or endpoint devices (endpoints do not initiate configuration read/write requests however).
The request TLP is routed through the fabric of switches using information in the header portion of the TLP. The packet makes its way to the targeted completer. The completer can be a root complex, switch, bridge or endpoint.
When the completer receives the packet and decodes its contents, it gathers the amount of data specified in the request from the targeted address. The completer creates a single completion TLP or multiple completion TLPs with data (CplD) and sends it back to the requester. The completer can return up to 4 KBytes of data per CplD packet.
The completion packet contains routing information necessary to route the packet back to the requester. This completion packet travels through the same path and hierarchy of switches as the request packet.
The requester uses a tag field in the completion to associate it with a request TLP of the same tag value that it transmitted earlier. Use of a tag in the request and completion TLPs allows a requester to manage multiple outstanding transactions.
If a completer is unable to obtain requested data as a result of an error, it returns a completion packet without data (Cpl) and an error status indication. The requester determines how to handle the error at the software layer.
Figure 2-1: Non-Posted Read Transaction Protocol
Legend:
MRd = Memory Read Request
IORd = IO Read Request
CfgRd0 = Type 0 Configuration Read Request
CfgRd1 = Type 1 Configuration Read Request
CplD = Completion with data for normal completion of MRd, IORd, CfgRd0, CfgRd1
Cpl = Completion without data for error completion of MRd, IORd, CfgRd0, CfgRd1
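The tag mechanism described above amounts to a small table of outstanding requests indexed by tag: a requester allocates a free tag when it transmits a request and looks the tag up again when the matching completion arrives. A simplified sketch follows (the default tag field permits 32 outstanding requests).

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_TAGS 32   /* default tag field permits 32 outstanding requests */

    struct outstanding_request {
        bool     in_use;
        uint64_t address;     /* enough context to finish the transaction */
        uint32_t byte_count;
    };

    static struct outstanding_request pending[MAX_TAGS];

    /* Allocate a tag when transmitting a non-posted request TLP. */
    static int alloc_tag(uint64_t addr, uint32_t byte_count)
    {
        for (int tag = 0; tag < MAX_TAGS; tag++) {
            if (!pending[tag].in_use) {
                pending[tag] = (struct outstanding_request){ true, addr, byte_count };
                return tag;
            }
        }
        return -1;   /* no free tag: the requester must wait */
    }

    /* Match an arriving completion TLP back to its request by tag. */
    static struct outstanding_request *match_completion(int tag)
    {
        if (tag < 0 || tag >= MAX_TAGS || !pending[tag].in_use)
            return NULL;
        pending[tag].in_use = false;
        return &pending[tag];
    }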

Non-Posted Read Transaction for Locked Requests

Figure 2-2 on page 60 shows the packets transmitted by a requester and completer to complete a non-posted locked read transaction. To complete this transfer, a requester transmits a memory read locked request (MRdLk) TLP. The requester can only be a root complex, which initiates a locked request on behalf of the CPU. Endpoints are not allowed to initiate locked requests.
The locked memory read request TLP is routed downstream through the fabric of switches using information in the header portion of the TLP. The packet makes its way to a targeted completer. The completer can only be a legacy endpoint. The entire path from root complex to the endpoint (for TCs that map to VC0) is locked including the ingress and egress port of switches in the pathway.

Legend:
MRdLk = Memory Read Lock Request
CplDLk = Completion with data for normal completion of MRdLk
CplLk = Completion without data for error completion of MRdLk
When the completer receives the packet and decodes its contents, it gathers the amount of data specified in the request from the targeted address. The completer creates one or more locked completion TLPs with data (CplDLk) along with a completion status. The completion is sent back to the root complex requester via the same path and hierarchy of switches as the original request.
The CplDLk packet contains routing information necessary to route the packet back to the requester. The requester uses a tag field in the completion to associate it with a request TLP of the same tag value that it transmitted earlier. Use of a tag in the request and completion TLPs allows a requester to manage multiple outstanding transactions.
If the completer is unable to obtain the requested data as a result of an error, it returns a completion packet without data (CplLk) and an error status indication within the packet. The requester that receives the error notification via the CplLk TLP must assume that atomicity of the lock is no longer guaranteed and thus determine how to handle the error at the software layer.
The path from requester to completer remains locked until the requester at a later time transmits an unlock message to the completer. The path and ingress/egress ports of a switch that the unlock message passes through are unlocked.

Non-Posted Write Transactions

Figure 2-3 on page 61 shows the packets transmitted by a requester and completer to complete a non-posted write transaction. To complete this transfer, a requester transmits a non-posted write request TLP to a completer it intends to write data to. Non-posted write request TLPs include IO write request (IOWr), configuration write request type 0 or type 1 (CfgWr0, CfgWr1) TLPs. Memory write request and message requests are posted requests. Requesters may be a root complex or endpoint device (though not for configuration write requests).
Figure 2-3: Non-Posted Write Transaction Protocol
A request packet with data is routed through the fabric of switches using information in the header of the packet. The packet makes its way to a completer.
When the completer receives the packet and decodes its contents, it accepts the data. The completer creates a single completion packet without data (Cpl) to confirm reception of the write request. This is the purpose of the completion.
The completion packet contains routing information necessary to route the packet back to the requester. This completion packet will propagate through the same hierarchy of switches that the request packet went through before making its way back to the requester. The requester gets confirmation notification that the write request did make its way successfully to the completer.
If the completer is unable to successfully write the data in the request to the final destination or if the write request packet reaches the completer in error, then it returns a completion packet without data (Cpl) but with an error status indication. The requester who receives the error notification via the Cpl TLP determines how to handle the error at the software layer.

Posted Memory Write Transactions

Memory write requests shown in Figure 2-4 are posted transactions. This implies that the completer returns no completion notification to inform the requester that the memory write request packet has reached its destination successfully. No time is wasted in returning a completion, thus back-to-back posted writes complete with higher performance relative to non-posted transactions.
The write request packet which contains data is routed through the fabric of switches using information in the header portion of the packet. The packet makes its way to a completer. The completer accepts the specified amount of data within the packet. Transaction over.
If the write request is received by the completer in error, or the completer is unable to write the posted data to the final destination due to an internal error, the requester is not informed via the hardware protocol. The completer could log an error and generate an error message notification to the root complex. Error handling software manages the error.


Figure 2-4: Posted Memory Write Transaction Protocol
Legend:
MWr = Memory Write Request. No completion is returned for this transaction

Posted Message Transactions

Message requests are also posted transactions as pictured in Figure 2-5 on page 64. There are two categories of message request TLPs, Msg and MsgD. Some message requests propagate from requester to completer, some are broadcast requests from the root complex to all endpoints, some are transmitted by an endpoint to the root complex. Message packets may be routed to completer(s) based on the message's address, device ID or routed implicitly. Message request routing is covered in Chapter 3.
The completer accepts any data that may be contained in the packet (if the packet is MsgD) and/or performs the task specified by the message.
Message request support eliminates the need for side-band signals in a PCI Express system. Messages are used for PCI-style legacy interrupt signaling, power management protocol, error signaling, unlocking a path in the PCI Express fabric, slot power support, hot plug protocol, and vendor-defined purposes.


Some Examples of Transactions

This section describes a few transaction examples showing the packets transmitted between requester and completer to accomplish a transaction. The examples consist of a memory read, an IO write, and a memory write.

Memory Read Originated by CPU, Targeting an Endpoint

Figure 2-6 shows an example of the packet routing associated with completing a memory read transaction. The root complex, on behalf of the CPU, initiates a non-posted memory read from the completer endpoint shown. The root complex transmits an MRd packet which contains, amongst other fields, an address, TLP type, requester ID (of the root complex) and length of transfer (in doublewords). Switch A, a 3-port switch, receives the packet on its upstream port. The switch logically appears as three virtual bridge devices connected by an internal bus. The logical bridges within the switch contain memory and IO base and limit address registers within their configuration space, similar to PCI bridges. The switch decodes the address in the MRd packet and compares it with the base/limit address range registers of the two downstream logical bridges. The switch internally forwards the MRd packet from the upstream ingress port to the correct downstream port (the left port in this example). The MRd packet is forwarded to Switch B. Switch B decodes the address in a similar manner. Assume the MRd packet is forwarded to the right-hand port so that the completer endpoint receives the MRd packet.
Figure 2-6: Non-Posted Memory Read Originated by CPU and Targeting an Endpoint
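The address decode performed by the switch's virtual bridges can be pictured with the small C sketch below. The structure and function names are invented for illustration, and real bridges hold separate memory, prefetchable memory and IO windows; the sketch only shows the base/limit comparison idea.

    #include <stdint.h>

    /* Simplified view of one downstream virtual bridge's memory window. */
    struct virtual_bridge {
        uint64_t mem_base;
        uint64_t mem_limit;     /* inclusive upper end of the window   */
        int      egress_port;   /* downstream port behind this bridge  */
    };

    /* Return the egress port whose window claims the request address,
     * or -1 if no downstream bridge claims it. */
    int route_mem_request(const struct virtual_bridge *br, int n, uint64_t addr)
    {
        for (int i = 0; i < n; i++) {
            if (addr >= br[i].mem_base && addr <= br[i].mem_limit)
                return br[i].egress_port;
        }
        return -1;
    }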
The completer decodes the contents of the header within the MRd packet, gathers the requested data and returns a completion packet with data (CplD). The header portion of the completion TLP contains the requester ID copied from the original request TLP. The requester ID is used to route the completion packet back to the root complex.
The logical bridges within Switch B compare the bus number field of the requester ID in the CplD packet with their secondary and subordinate bus number configuration registers. The CplD packet is forwarded to the appropriate port (in this case the upstream port). The CplD packet moves to Switch A, which forwards the packet to the root complex. The requester ID field of the completion TLP matches the root complex's ID. The root complex checks the completion status (hopefully "successful completion") and accepts the data. This data is returned to the CPU in response to its pending memory read transaction.

Memory Read Originated by Endpoint, Targeting System Memory

In a similar manner, the endpoint device shown in Figure 2-7 on page 67 initiates a memory read request (MRd). This packet contains amongst other fields in the header, the endpoint's requester ID, targeted address and amount of data requested. It forwards the packet to Switch B which decodes the memory address in the packet and compares it with the memory base/limit address range registers within the virtual bridges of the switch. The packet is forwarded to Switch A which decodes the address in the packet and forwards the packet to the root complex completer.
The root complex obtains the requested data from system memory and creates a completion TLP with data (CplD). The bus number portion of the requester ID in the completion TLP is used to route the packet through the switches to the endpoint.
A requester endpoint can also communicate with another peer completer endpoint. For example, an endpoint attached to Switch B can talk to an endpoint connected to Switch C. The request TLP is routed using an address. The completion is routed using a bus number. Multi-port root complex devices are not required to support port-to-port packet routing, in which case peer-to-peer transactions between endpoints associated with two different ports of the root complex are not supported.
Figure 2-7: Non-Posted Memory Read Originated by Endpoint and Targeting Memory
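Completion routing by bus number can be sketched the same way. The requester ID format assumed here (bus number in the upper byte, device and function below it) follows PCI convention; the window structure and function names are again invented for illustration.

    #include <stdint.h>

    /* Simplified bus-number window of one downstream virtual bridge. */
    struct bridge_bus_window {
        uint8_t secondary;      /* first bus number behind this bridge */
        uint8_t subordinate;    /* last bus number behind this bridge  */
        int     egress_port;
    };

    #define TO_UPSTREAM_PORT (-1)

    /* Route a completion using the bus number portion of its requester ID. */
    int route_completion(const struct bridge_bus_window *br, int n, uint16_t requester_id)
    {
        uint8_t bus = (uint8_t)(requester_id >> 8);   /* bus[15:8], dev[7:3], func[2:0] */
        for (int i = 0; i < n; i++) {
            if (bus >= br[i].secondary && bus <= br[i].subordinate)
                return br[i].egress_port;
        }
        return TO_UPSTREAM_PORT;   /* not claimed below: forward toward the root complex */
    }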

IO Write Initiated by CPU, Targeting an Endpoint

IO requests can only be initiated by a root complex or a legacy endpoint. PCI Express endpoints do not initiate IO transactions. IO transactions are intended for legacy support. Native PCI Express devices are not prohibited from implementing IO space, but the specification states that a PCI Express Endpoint must not depend on the operating system allocating I/O resources that are requested.
IO requests are routed by switches in a similar manner to memory requests. Switches route IO request packets by comparing the IO address in the packet with the IO base and limit address range registers in the virtual bridge configuration space associated with the switch.
Figure 2-8 on page 68 shows the routing of packets associated with an IO write transaction. The CPU initiates an IO write on the Front Side Bus (FSB). The write contains a target IO address and up to 4 bytes of data. The root complex creates an IO Write request TLP (IOWr) using the address and data from the CPU transaction. It uses its own requester ID in the packet header. This packet is routed through Switches A and B. The completer endpoint returns a completion without data (Cpl) and a completion status of 'successful completion' to confirm reception of good data from the requester.
Figure 2-8: IO Write Transaction Originated by CPU, Targeting Legacy Endpoint

Memory Write Transaction Originated by CPU and Targeting an Endpoint

Memory write (MWr) requests (and message requests Msg or MsgD) are posted transactions. This implies that the completer does not return a completion. The MWr packet is routed through the PCI Express fabric of switches in the same manner as described for memory read requests. The requester root complex can write up to 4 KBytes of data with one MWr packet.
Figure 2-9 on page 69 shows a memory write transaction originated by the CPU. The root complex creates an MWr TLP on behalf of the CPU using the target address and data from the CPU FSB transaction. This packet is routed through Switches A and B. The packet reaches the endpoint and the transaction is complete.
Figure 2-9: Memory Write Transaction Originated by CPU, Targeting Endpoint

PCI Express Device Layers

Overview

The PCI Express specification defines a layered architecture for device design as shown in Figure 2-10 on page 70. The layers consist of a Transaction Layer, a Data Link Layer and a Physical Layer. Each layer can be further divided vertically into two parts: a transmit portion that processes outbound traffic and a receive portion that processes inbound traffic. However, a device design does not have to implement a layered architecture as long as the functionality required by the specification is supported.
Figure 2-10: PCI Express Device Layers
The goal of this section is to describe the function of each layer and to describe the flow of events to accomplish a data transfer. Packet creation at a transmitting device and packet reception and decoding at a receiving device are also explained.

Transmit Portion of Device Layers

Consider the transmit portion of a device. Packet contents are formed in the Transaction Layer with information obtained from the device core and application. The packet is stored in buffers ready for transmission to the lower layers. This packet is referred to as a Transaction Layer Packet (TLP), described earlier in this chapter. The Data Link Layer appends additional information to the packet that is required for error checking at the receiver device. The packet is then encoded in the Physical Layer and transmitted differentially on the Link by the analog portion of this layer. The packet is transmitted using the available Lanes of the Link to the neighboring receiver device.

Receive Portion of Device Layers

The receiver device decodes the incoming packet contents in the Physical Layer and forwards the resulting contents to the upper layers. The Data Link Layer checks for errors in the incoming packet and if there are no errors forwards the packet up to the Transaction Layer. The Transaction Layer buffers the incoming TLPs and converts the information in the packet to a representation that can be processed by the device core and application.

Device Layers and their Associated Packets

Three categories of packets are defined, each associated with one of the three device layers. Associated with the Transaction Layer is the Transaction Layer Packet (TLP). Associated with the Data Link Layer is the Data Link Layer Packet (DLLP). Associated with the Physical Layer is the Physical Layer Packet (PLP). These packets are introduced next.

Transaction Layer Packets (TLPs)

PCI Express transactions employ TLPs which originate at the Transaction Layer of a transmitter device and terminate at the Transaction Layer of a receiver device. This process is represented in Figure 2-11 on page 72. The Data Link Layer and Physical Layer also contribute to TLP assembly as the TLP moves through the layers of the transmitting device. At the other end of the Link where a neighbor receives the TLP, the Physical Layer, Data Link Layer and Transaction Layer disassemble the TLP.
TLP Packet Assembly. A TLP that is transmitted on the Link appears as shown in Figure 2-12 on page 73.
The software layer/device core sends to the Transaction Layer the information required to assemble the core section of the TLP which is the header and data portion of the packet. Some TLPs do not contain a data section. An optional End-to-End CRC (ECRC) field is calculated and appended to the packet. The ECRC field is used by the ultimate targeted device of this packet to check for CRC errors in the header and data portion of the TLP.
The core section of the TLP is forwarded to the Data Link Layer, which appends a sequence ID and an LCRC field. The LCRC field is used by the neighboring receiver device at the other end of the Link to check for CRC errors in the core section of the TLP plus the sequence ID. The resultant TLP is forwarded to the Physical Layer, which concatenates a Start and an End framing character of 1 byte each to the packet. The packet is encoded and differentially transmitted on the Link using the available number of Lanes.
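The layering can be pictured as the C sketch below, which builds the byte image of a TLP in the order the layers add their fields. The framing symbol values and the zeroed CRC placeholders are illustrative only; real LCRC/ECRC generation and the subsequent 8b/10b encoding are performed in hardware.

    #include <stdint.h>
    #include <string.h>

    static void append(uint8_t *buf, size_t *off, const void *src, size_t len)
    {
        if (len == 0)
            return;
        memcpy(buf + *off, src, len);
        *off += len;
    }

    /* Build the byte image of a TLP, layer by layer (illustrative only). */
    size_t assemble_tlp(uint8_t *wire,
                        const uint8_t *hdr, size_t hdr_len,     /* 3 or 4 DW header */
                        const uint8_t *data, size_t data_len,   /* optional payload */
                        int add_ecrc, uint16_t seq_id)
    {
        size_t  off = 0;
        uint8_t start_sym = 0xFB, end_sym = 0xFD;    /* assumed framing symbol values */
        uint8_t seq[2] = { (uint8_t)(seq_id >> 8), (uint8_t)seq_id };

        append(wire, &off, &start_sym, 1);      /* Physical Layer: Start framing       */
        append(wire, &off, seq, 2);             /* Data Link Layer: 12-bit sequence ID */
        append(wire, &off, hdr, hdr_len);       /* Transaction Layer: header           */
        append(wire, &off, data, data_len);     /* Transaction Layer: data payload     */
        if (add_ecrc) {
            uint32_t ecrc = 0;                  /* placeholder for ECRC over hdr+data  */
            append(wire, &off, &ecrc, 4);
        }
        uint32_t lcrc = 0;                      /* placeholder for LCRC over seq..ecrc */
        append(wire, &off, &lcrc, 4);           /* Data Link Layer: LCRC               */
        append(wire, &off, &end_sym, 1);        /* Physical Layer: End framing         */
        return off;
    }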


TLP Packet Disassembly. A neighboring receiver device receives the incoming TLP bit stream. As shown in Figure 2-13 on page 74, the received TLP is decoded by the Physical Layer and the Start and End frame fields are stripped. The resultant TLP is sent to the Data Link Layer. This layer checks for any errors in the TLP and strips the sequence ID and LCRC field. Assuming there are no LCRC errors, the TLP is forwarded up to the Transaction Layer. If the receiving device is a switch, the packet is routed from one port of the switch to an egress port based on address information contained in the header portion of the TLP. Switches are allowed to check for ECRC errors and even report any errors they find. However, a switch is not allowed to modify the ECRC, so that the device ultimately targeted by this TLP will still detect an ECRC error if one exists.


The ultimate targeted device of this TLP checks for ECRC errors in the header and data portion of the TLP. The ECRC field is stripped, leaving the header and data portion of the packet. It is this information that is finally forwarded to the Device Core/Software Layer.
Figure 2-13: TLP Disassembly

Data Link Layer Packets (DLLPs)

Another PCI Express packet called DLLP originates at the Data Link Layer of a transmitter device and terminates at the Data Link Layer of a receiver device. This process is represented in Figure 2-14 on page 75. The Physical Layer also contributes to DLLP assembly and disassembly as the DLLP moves from one device to another via the PCI Express Link.
DLLPs are used for Link Management functions including TLP acknowledgement associated with the ACK/NAK protocol, power management, and exchange of Flow Control information.
DLLPs are transferred between the Data Link Layers of the two directly connected components on a Link. Unlike TLPs, which travel through the PCI Express fabric, DLLPs do not pass through switches. DLLPs do not contain routing information. These packets are smaller than TLPs: 8 bytes, to be precise.
Figure 2-14: DLLP Origin and Destination
DLLP Assembly. The DLLP shown in Figure 2-15 on page 76 originates at the Data Link Layer. There are various types of DLLPs, including Flow Control DLLPs (FCx), acknowledge/no-acknowledge DLLPs which confirm reception of TLPs (ACK and NAK), and power management DLLPs (PMx). A DLLP type field identifies the various types of DLLPs. The Data Link Layer appends a 16-bit CRC used by the receiver of the DLLP to check for CRC errors in the DLLP.
The DLLP content along with a 16-bit CRC is forwarded to the Physical Layer which appends a Start and End frame character of 1 byte each to the packet. The packet is encoded and differentially transmitted on the Link using the available number of Lanes.
DLLP Disassembly. The DLLP is received by the Physical Layer of a receiving device. The received bit stream is decoded and the Start and End frame fields are stripped, as depicted in Figure 2-16. The resultant packet is sent to the Data Link Layer. This layer checks for CRC errors and strips the CRC field. The Data Link Layer is the destination layer for DLLPs, so the packet is not forwarded up to the Transaction Layer.

Physical Layer Packets (PLPs)

Another PCI Express packet called PLP originates at the Physical Layer of a transmitter device and terminates at the Physical Layer of a receiver device. This process is represented in Figure 2-17 on page 77. The PLP is a very simple packet that starts with a 1 byte COM character followed by 3 or more other characters that define the PLP type as well as contain other information. The PLP is a multiple of 4 bytes in size, an example of which is shown in Figure 2-18 on page 78. The specification refers to this packet as the Ordered-Set. PLPs do not contain any routing information. They are not routed through the fabric and do not propagate through a switch.
Some PLPs are used during the Link Training process described in "Ordered-Sets Used During Link Training and Initialization" on page 504. Another PLP is used for clock tolerance compensation. PLPs are used to place a Link into the electrical idle low power state or to wake up a link from this low power state.
Figure 2-17: PLP Origin and Destination


Figure 2-18: PLP or Ordered-Set Structure

Function of Each PCI Express Device Layer

Figure 2-19 on page 79 is a more detailed block diagram of a PCI Express device's layers. This block diagram is used to explain the key functions of each layer as they relate to the generation of outbound traffic and the handling of inbound traffic. The layers consist of the Device Core/Software Layer, Transaction Layer, Data Link Layer and Physical Layer.

Device Core / Software Layer

The Device Core consists of, for example, the root complex core logic or an endpoint core logic such as that of an Ethernet controller, SCSI controller, USB controller, etc. To design a PCI Express endpoint, a designer may reuse the Device Core logic from a PCI or PCI-X core logic design and wrap around it the PCI Express layered design described in this section.
Transmit Side. The Device Core logic, in conjunction with local software, provides the information required by the PCI Express device to generate TLPs. This information is sent via the Transmit interface to the Transaction Layer of the device. Examples of information transmitted to the Transaction Layer include: transaction type (to inform the Transaction Layer what type of TLP to generate), address, amount of data to transfer, data, traffic class, message index, etc.
Receive Side. The Device Core logic is also responsible for receiving information sent by the Transaction Layer via the Receive interface. This information includes: type of TLP received by the Transaction Layer, address, amount of data received, data, traffic class of the received TLP, message index, error conditions, etc.

Transaction Layer

The Transaction Layer shown in Figure 2-19 is responsible for the generation of outbound TLP traffic and the reception of inbound TLP traffic. The Transaction Layer supports the split transaction protocol for non-posted transactions. In other words, the Transaction Layer associates an inbound completion TLP of a given tag value with an outbound non-posted request TLP of the same tag value transmitted earlier.
Figure 2-19: Detailed Block Diagram of PCI Express Device's Layers


The transaction layer contains virtual channel buffers (VC Buffers) to store outbound TLPs that await transmission and also to store inbound TLPs received from the Link. The flow control protocol associated with these virtual channel buffers ensures that a remote transmitter does not transmit too many TLPs and cause the receiver virtual channel buffers to overflow. The Transaction Layer also orders TLPs according to ordering rules before transmission. It is this layer that supports the Quality of Service (QoS) protocol.
The Transaction Layer supports 4 address spaces: memory address, IO address, configuration address and message space. Message packets contain a message.
Transmit Side. The Transaction Layer receives information from the Device Core and generates outbound request and completion TLPs which it stores in virtual channel buffers. This layer assembles Transaction Layer Packets (TLPs). The major components of a TLP are: Header, Data Payload and an optional ECRC (specification also uses the term Digest) field as shown in Figure 2-20.
Figure 2-20: TLP Structure at the Transaction Layer
Transaction Layer Packet (TLP): Header | Data Payload | ECRC (optional)
The Header is 3 doublewords or 4 doublewords in size and may include information such as: address, TLP type, transfer size, requester ID/completer ID, tag, traffic class, byte enables, completion codes, and attributes (including the "no snoop" and "relaxed ordering" bits). The TLP types are defined in Table 2-2 on page 57.
The address is a 32-bit memory address or an extended 64-bit address for memory requests. It is a 32-bit address for IO requests. For configuration transactions the address is an ID consisting of Bus Number, Device Number and Function Number plus a configuration register address of the targeted register. For completion TLPs, the address is the requester ID of the device that originally made the request. For message transactions the address used for routing is the destination device's ID consisting of Bus Number, Device Number and Function Number of the device targeted by the message request. Message requests could also be broadcast or routed implicitly by targeting the root complex or an upstream port.
The transfer size or length field indicates the amount of data to transfer, calculated in doublewords (DWs). The data transfer length can be from 1 to 1024 DWs. Write request TLPs include a data payload in the amount indicated by the length field of the header. For a read request TLP, the length field indicates the amount of data requested from a completer. This data is returned in one or more completion packets. Read request TLPs do not include a data payload field. Byte enables specify byte-level address resolution.
Request packets contain a requester ID (bus#, device#, function #) of the device transmitting the request. The tag field in the request is memorized by the completer and the same tag is used in the completion.
A bit in the Header (TD = TLP Digest) indicates whether this packet contains an ECRC field also referred to as Digest. This field is 32-bits wide and contains an End-to-End CRC (ECRC). The ECRC field is generated by the Transaction Layer at time of creation of the outbound TLP. It is generated based on the entire TLP from first byte of header to last byte of data payload (with the exception of the EP bit, and bit 0 of the Type field. These two bits are always considered to be a 1 for the ECRC calculation). The TLP never changes as it traverses the fabric (with the exception of perhaps the two bits mentioned in the earlier sentence). The receiver device checks for an ECRC error that may occur as the packet moves through the fabric.
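The header fields listed above can be collected into a field-level C structure. This is not the on-the-wire bit layout (which packs these fields into 3 or 4 doublewords); it only names the information the text says a request header carries, using widths taken from that description.

    #include <stdbool.h>
    #include <stdint.h>

    struct tlp_request_header {
        uint8_t  type;            /* MRd, MWr, IORd, IOWr, CfgRd0, CfgWr1, ...    */
        uint8_t  traffic_class;   /* TC0..TC7 (3 bits on the wire)                */
        bool     td;              /* TLP Digest: ECRC appended at end of TLP      */
        bool     relaxed_order;   /* attribute bit                                */
        bool     no_snoop;        /* attribute bit                                */
        uint16_t length_dw;       /* transfer length, 1..1024 doublewords         */
        uint16_t requester_id;    /* bus[15:8], device[7:3], function[2:0]        */
        uint8_t  tag;             /* matches a completion to this request         */
        uint8_t  first_be;        /* byte enables for the first doubleword        */
        uint8_t  last_be;         /* byte enables for the last doubleword         */
        uint64_t address;         /* 32-bit (3 DW header) or 64-bit (4 DW header) */
    };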
Receiver Side. The receiver side of the Transaction Layer stores inbound TLPs in receiver virtual channel buffers. The receiver checks for CRC errors based on the ECRC field in the TLP. If there are no errors, the ECRC field is stripped and the resultant information in the TLP header as well as the data payload is sent to the Device Core.
Flow Control. The Transaction Layer ensures that it does not transmit a TLP over the Link to a remote receiver device unless the receiver device has virtual channel buffer space to accept TLPs (of a given traffic class). The protocol for guaranteeing this is referred to as the "flow control" protocol. If the transmitter device does not observe this protocol, a transmitted TLP will cause the receiver virtual channel buffer to overflow. Flow control is automatically managed at the hardware level and is transparent to software. Software is involved only in enabling additional buffers beyond the default set of virtual channel buffers (referred to as VC0 buffers). The default buffers are enabled automatically after Link training, thus allowing TLP traffic to flow through the fabric immediately after Link training. Configuration transactions use the default virtual channel buffers and can begin immediately after the Link training process. The Link training process is described in Chapter 14, entitled "Link Initialization & Training," on page 499.


Refer to Figure 2-21 on page 82 for an overview of the flow control process. A receiver device transmits DLLPs called Flow Control Packets (FCx DLLPs) to the transmitter device on a periodic basis. The FCx DLLPs contain flow control credit information that updates the transmitter regarding how much buffer space is available in the receiver virtual channel buffer. The transmitter keeps track of this information and will only transmit TLPs out of its Transaction Layer if it knows that the remote receiver has buffer space to accept the transmitted TLP.
Figure 2-21: Flow Control Process
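A minimal sketch of credit gating, assuming a single credit pool per virtual channel, is shown below. Real flow control tracks header and data credits separately for Posted, Non-Posted and Completion traffic; the structure and function names here are invented for the illustration.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative credit state a transmitter keeps for one virtual channel. */
    struct vc_credits {
        uint32_t hdr_credits;    /* header credits advertised by the receiver */
        uint32_t data_credits;   /* data credits (1 credit = 4 DW of payload) */
    };

    /* Apply an UpdateFC DLLP received from the remote receiver. */
    void credits_update(struct vc_credits *c, uint32_t hdr, uint32_t data)
    {
        c->hdr_credits  += hdr;
        c->data_credits += data;
    }

    /* Gate a TLP: transmit only if the receiver has buffer space for it. */
    bool credits_consume(struct vc_credits *c, uint32_t payload_dw)
    {
        uint32_t data_needed = (payload_dw + 3) / 4;   /* round up to whole credits  */
        if (c->hdr_credits < 1 || c->data_credits < data_needed)
            return false;                              /* TLP stays in the VC buffer */
        c->hdr_credits  -= 1;
        c->data_credits -= data_needed;
        return true;
    }

A TLP held back by credits_consume() simply waits in its virtual channel buffer until a later UpdateFC DLLP replenishes the credit counts.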
Quality of Service (QoS). Consider Figure 2-22 on page 83 in which the video camera and SCSI device shown need to transmit write request TLPs to system DRAM. The camera data is time critical isochronous data which must reach memory with guaranteed bandwidth otherwise the displayed image will appear choppy or unclear. The SCSI data is not as time sensitive and only needs to get to system memory correctly without errors. It is clear that the video data packet should have higher priority when routed through the PCI Express fabric, especially through switches. QoS refers to the capability of routing packets from different applications through the fabric with differentiated priorities and deterministic latencies and bandwidth. PCI and PCI-X systems do not support QoS capability.
Consider this example. Application driver software in conjunction with the OS assigns the video data packets a traffic class of 7 (TC7) and the SCSI data packet a traffic class of 0 (TC0). These TC numbers are embedded in the TLP header. Configuration software uses TC/VC mapping device configuration registers to map TC0 related TLPs to virtual channel 0 buffers (VC0) and TC7 related TLPs to virtual channel 7 buffers (VC7).
Figure 2-22: Example Showing QoS Capability of PCI Express
As TLPs from these two applications (video and SCSI applications) move through the fabric, the switches post incoming packets moving upstream into their respective VC buffers (VC0 and VC7). The switch uses a priority based arbitration mechanism to determine which of the two incoming packets to forward with greater priority to a common egress port. Assume VC7 buffer contents are configured with higher priority than VC0. Whenever two incoming packets are to be forwarded to one upstream port, the switch will always pick the VC7 packet, the video data, over the VC0 packet, the SCSI data. This guarantees greater bandwidth and reduced latency for video data compared to SCSI data.
A PCI Express device that implements more than one set of virtual channel buffers has the ability to arbitrate between TLPs from different VC buffers. VC buffers have configurable priorities. Thus traffic flowing through the system in different VC buffers will observe differentiated performance. The arbitration mechanism between TLP traffic flowing through different VC buffers is referred to as VC arbitration.
Also, multi-port switches have the ability to arbitrate between traffic coming in on two ingress ports but using the same VC buffer resource on a common egress port. This configurable arbitration mechanism between ports supported by switches is referred to as Port arbitration.
Traffic Classes (TCs) and Virtual Channels (VCs). TC is a TLP header field transmitted within the packet unmodified end-to-end through the fabric. Local application software and system software based on performance requirements decides what TC label a TLP uses. VCs are physical buffers that provide a means to support multiple independent logical data flows over the physical Link via the use of transmit and receiver virtual channel buffers.
PCI Express devices may implement up to 8 VC buffers (VC0-VC7). The TC field is a 3-bit field that allows differentiation of traffic into 8 traffic classes (TC0-TC7). Devices must implement VC0. Similarly, a device is required to support TC0 (best-effort general purpose service class). The other optional TCs may be used to provide differentiated service through the fabric. For each implemented VC ID, a transmit device implements a transmit buffer and a receive device implements a receive buffer.
Devices or switches implement TC-to-VC mapping logic by which a TLP of a given TC number is forwarded through the Link using a particular numbered VC buffer. PCI Express provides the capability of mapping multiple TCs onto a single VC, thus reducing device cost by requiring only a limited number of VC buffers. TC/VC mapping is configured by system software through configuration registers. It is up to the device application software to determine the TC label for TLPs and the TC/VC mapping that meets performance requirements. In its simplest form, the TC/VC mapping registers can be configured with a one-to-one mapping of TC to VC.
Consider the example illustrated in Figure 2-23 on page 85. The TC/VC mapping registers in Device A are configured to map TLPs with TC[2:0] to VC0 and TLPs with TC[7:3] to VC1. The TC/VC mapping registers in receiver Device B must be configured identically to those in Device A. The same numbered VC buffers are enabled in both transmitter Device A and receiver Device B.
If Device A needs to transmit a TLP with a TC label of 7 and another packet with a TC label of 0, the two packets will be placed in the VC1 and VC0 buffers, respectively. The arbitration logic arbitrates between the two VC buffers. Assume the VC1 buffer is configured with higher priority than the VC0 buffer. Thus, Device A will forward the TC7 TLPs in VC1 to the Link ahead of the TC0 TLPs in VC0.
When the TLPs arrive in Device B, the TC/VC mapping logic decodes the TC label in each TLP and places the TLPs in their associated VC buffers.
In this example, TLP traffic with TC[7:3] label will flow through the fabric with higher priority than TC[2:0] traffic. Within each TC group however, TLPs will flow with equal priority.
Figure 2-23: TC Numbers and VC Buffers
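The TC/VC mapping of this example reduces to a small lookup table written by configuration software; both ends of the Link must hold the same table. The array and function below are illustrative only.

    #include <stdint.h>

    /* TC-to-VC map for the example in the text: TC0..TC2 -> VC0, TC3..TC7 -> VC1. */
    static const uint8_t tc_to_vc[8] = { 0, 0, 0, 1, 1, 1, 1, 1 };

    /* Select the virtual channel buffer for a TLP of a given traffic class. */
    uint8_t map_tc_to_vc(uint8_t traffic_class)
    {
        return tc_to_vc[traffic_class & 0x7];
    }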
Port Arbitration and VC Arbitration. The goals of arbitration support in the Transaction Layer are:
  • To provide differentiated services between data flows within the fabric.
  • To provide guaranteed bandwidth with deterministic and minimal end-to-end transaction latency.
Packets of different TCs are routed through the fabric of switches with different priority based on arbitration policy implemented in switches. Packets coming in from ingress ports heading towards a particular egress port compete for use of that egress port.


Switches implement two types of arbitration for each egress port: Port Arbitration and VC Arbitration. Consider Figure 2-24 on page 86.
Port arbitration is arbitration between packets arriving on different ingress ports that map (after TC-to-VC mapping) to the same virtual channel of the common egress port. The port arbiter implements round-robin, weighted round-robin or programmable time-based round-robin arbitration schemes selectable through configuration registers.
VC arbitration takes place after port arbitration. For a given egress port, packets from all VCs compete to transmit on that egress port. VC arbitration resolves the order in which TLPs in different VC buffers are forwarded on to the Link. VC arbitration policies supported include strict priority, round-robin and weighted round-robin arbitration schemes selectable through configuration registers.
Independent of arbitration, each VC must observe transaction ordering and flow control rules before it can make pending TLP traffic visible to the arbitration mechanism.
Figure 2-24: Switch Implements Port Arbitration and VC Arbitration Logic
Endpoint devices and a root complex with only one port do not support port arbitration. They only support VC arbitration in the Transaction Layer.
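A strict-priority VC arbiter of the kind assumed in the earlier QoS example can be sketched as follows; real devices may instead be configured for round-robin or weighted round-robin schemes. The function is illustrative only.

    #include <stdbool.h>

    #define NUM_VCS 8

    /* pending[vc] is true when that VC buffer holds a TLP that ordering and
     * flow control rules already allow to be transmitted. */
    int vc_arbitrate_strict_priority(const bool pending[NUM_VCS])
    {
        /* Here the highest-numbered VC is configured with the highest priority. */
        for (int vc = NUM_VCS - 1; vc >= 0; vc--) {
            if (pending[vc])
                return vc;
        }
        return -1;   /* nothing eligible to send */
    }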
Transaction Ordering. The PCI Express protocol implements the PCI/PCI-X-compliant producer-consumer ordering model for transaction ordering, with provision to support relaxed ordering similar to PCI-X architecture. Transaction ordering rules guarantee that TLP traffic associated with a given traffic class is routed through the fabric in the correct order to prevent potential deadlock or livelock conditions from occurring. Traffic associated with different TC labels has no ordering relationship. Chapter 8, entitled "Transaction Ordering," on page 315 describes these ordering rules.
The Transaction Layer ensures that TLPs for a given TC are ordered correctly with respect to other TLPs of the same TC label before forwarding to the Data Link Layer and Physical Layer for transmission.
Power Management. The Transaction Layer supports ACPI/PCI power management, as dictated by system software. Hardware within the Transaction Layer autonomously power manages a device to minimize power during full-on power states. This automatic power management is referred to as Active State Power Management and does not involve software. Power management software associated with the OS manages a device's power states through power management configuration registers. Power management is described in Chapter 16.
Configuration Registers. A device's configuration registers are associated with the Transaction Layer. The registers are configured during initialization and bus enumeration. They are also configured by device drivers and accessed by runtime software/OS. Additionally, the registers store negotiated Link capabilities, such as Link width and frequency. Configuration registers are described in Part 6 of the book.

Data Link Layer

Refer to Figure 2-19 on page 79 for a block diagram of a device's Data Link Layer. The primary function of the Data Link Layer is to ensure data integrity during packet transmission and reception on each Link. If a transmitter device sends a TLP to a remote receiver device at the other end of a Link and a CRC error is detected, the transmitter device is notified with a NAK DLLP. The transmitter device automatically replays the TLP. This time hopefully no error occurs. With error checking and automatic replay of packets received in error, PCI Express ensures very high probability that a TLP transmitted by one device will make its way to the final destination with no errors. This makes PCI Express ideal for low error rate, high-availability systems such as servers.


Transmit Side. The Transaction Layer must observe the flow control mechanism before forwarding outbound TLPs to the Data Link Layer. If sufficient credits exist, a TLP stored within the virtual channel buffer is passed from the Transaction Layer to the Data Link Layer for transmission.
Consider Figure 2-25 on page 88 which shows the logic associated with the ACK-NAK mechanism of the Data Link Layer. The Data Link Layer is responsible for TLP CRC generation and TLP error checking. For outbound TLPs from transmit Device A, a Link CRC (LCRC) is generated and appended to the TLP. In addition, a sequence ID is appended to the TLP. Device A's Data Link Layer preserves a copy of the TLP in a replay buffer and transmits the TLP to Device B. The Data Link Layer of the remote Device B receives the TLP and checks for CRC errors.
Figure 2-25: Data Link Layer Replay Mechanism
If there is no error, the Data Link Layer of Device B returns an ACK DLLP with a sequence ID to Device A. Device A has confirmation that the TLP has reached Device B (not necessarily the final destination) successfully. Device A clears its replay buffer of the TLP associated with that sequence ID.
If on the other hand a CRC error is detected in the TLP received at the remote Device B, then a NAK DLLP with a sequence ID is returned to Device A. An error has occurred during TLP transmission. Device A's Data Link Layer replays associated TLPs from the replay buffer. The Data Link Layer generates error indications for error reporting and logging mechanisms.
In summary, the replay mechanism uses the sequence ID field within received ACK/NAK DLLPs to associate them with outbound TLPs stored in the replay buffer. Reception of an ACK DLLP causes the replay buffer to clear TLPs from the buffer. Reception of a NAK DLLP causes the replay buffer to replay the associated TLPs.
For a given TLP in the replay buffer, if the transmitter device receives a NAK 4 times and the TLP is replayed 3 additional times as a result, then the Data Link Layer logs the error, reports a correctable error, and re-trains the Link.
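The replay buffer behavior can be sketched as below. The storage layout and function names are invented for the illustration; a real device also purges all older TLPs when a given sequence ID is acknowledged and counts replays before forcing Link retraining.

    #include <stdbool.h>
    #include <stdint.h>

    #define REPLAY_SLOTS 16   /* illustrative depth */

    /* Copy of a transmitted TLP that has not yet been acknowledged. */
    struct replay_entry {
        bool     valid;
        uint16_t seq_id;       /* 12-bit sequence ID assigned at transmission */
        /* ... the TLP bytes themselves would be stored here ... */
    };

    static struct replay_entry replay[REPLAY_SLOTS];

    /* ACK DLLP received: discard the acknowledged TLP from the replay buffer. */
    void on_ack(uint16_t seq_id)
    {
        for (int i = 0; i < REPLAY_SLOTS; i++) {
            if (replay[i].valid && replay[i].seq_id == seq_id)
                replay[i].valid = false;
        }
    }

    /* NAK DLLP received: hand the stored TLPs back to the Link for replay.
     * retransmit() is a placeholder for the transmit path. */
    void on_nak(void (*retransmit)(const struct replay_entry *))
    {
        for (int i = 0; i < REPLAY_SLOTS; i++) {
            if (replay[i].valid)
                retransmit(&replay[i]);
        }
    }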
Receive Side. The receive side of the Data Link Layer is responsible for LCRC error checking on inbound TLPs. If no error is detected, the device schedules an ACK DLLP for transmission back to the remote transmitter device. The receiver strips the TLP of the LCRC field and sequence ID.
If a CRC error is detected, it schedules a NAK to return back to the remote transmitter. The TLP is eliminated.
The receive side of the Data Link Layer also receives ACKs and NAKs from a remote device. If an ACK is received the receive side of the Data Link layer informs the transmit side to clear an associated TLP from the replay buffer. If a NAK is received, the receive side causes the replay buffer of the transmit side to replay associated TLPs.
The receive side is also responsible for checking the sequence ID of received TLPs to check for dropped or out-of-order TLPs.
Data Link Layer Contribution to TLPs and DLLPs. The Data Link Layer appends a 12-bit sequence ID and a 32-bit LCRC field to an outbound TLP that arrives from the Transaction Layer. The resultant TLP is shown in Figure 2-26 on page 90. The sequence ID is used to associate a copy of the outbound TLP stored in the replay buffer with a received ACK/NAK DLLP inbound from a neighboring remote device. The ACK/NAK DLLP confirms arrival of the outbound TLP at the remote device.


The 32-bit LCRC is calculated based on all bytes in the TLP including the sequence ID.
A DLLP, shown in Figure 2-26 on page 90, is a 4-byte packet with a 16-bit CRC field. The 8-bit DLLP Type field indicates the various categories of DLLPs. These include: ACK, NAK, Power Management related DLLPs (PM_Enter_L1, PM_Enter_L23, PM_Active_State_Request_L1, PM_Request_Ack) and Flow Control related DLLPs (InitFC1-P, InitFC1-NP, InitFC1-Cpl, InitFC2-P, InitFC2-NP, InitFC2-Cpl, UpdateFC-P, UpdateFC-NP, UpdateFC-Cpl). The 16-bit CRC is calculated using all 4 bytes of the DLLP. Received DLLPs which fail the CRC check are discarded. The loss of information from a discarded DLLP is self-repairing; a successive DLLP supersedes the lost information. ACK and NAK DLLPs contain a sequence ID field (shown as the Misc. field in Figure 2-26) used by the device to associate an inbound ACK/NAK DLLP with a stored copy of a TLP in the replay buffer.
Figure 2-26: TLP and DLLP Structure at the Data Link Layer
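At the field level a DLLP can be pictured as the small structure below: one type byte, three bytes of type-specific content, and a 16-bit CRC over those four bytes. This is a descriptive view, not the wire bit layout; with the Physical Layer's Start and End symbols the packet totals 8 bytes on the Link.

    #include <stdint.h>

    struct dllp {
        uint8_t  type;       /* ACK, NAK, UpdateFC-P, PM_Enter_L1, ...          */
        uint8_t  misc[3];    /* e.g. the 12-bit sequence ID for ACK/NAK DLLPs,
                                or credit counts for flow control DLLPs         */
        uint16_t crc16;      /* 16-bit CRC calculated over the four bytes above */
    };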
Non-Posted Transaction Showing ACK-NAK Protocol. Next the steps required to complete a memory read request between a requester and a completer on the far end of a switch are described. Figure 2-27 on page 91 shows the activity on the Link to complete this transaction:
Step 1a: Requester transmits a memory read request TLP (MRd). Switch receives the MRd TLP and checks for CRC error using the LCRC field in the MRd TLP.
Step 1b: If no error then switch returns ACK DLLP to requester. Requester discards copy of the TLP from its replay buffer.
Step 2a: Switch forwards the MRd TLP to the correct egress port using memory address for routing. Completer receives MRd TLP. Completer checks for CRC errors in received MRd TLP using LCRC.
Step 2b: If no error then completer returns ACK DLLP to switch. Switch discards copy of the MRd TLP from its replay buffer.
Step 3a: Completer checks for CRC error using optional ECRC field in MRd TLP. Assume no End-to-End error. Completer returns Completion (CplD) with Data TLP whenever it has the requested data. Switch receives CplD TLP and checks for CRC error using LCRC.
Step 3b: If no error then switch returns ACK DLLP to completer. Completer discards copy of the CplD TLP from its replay buffer.
Step 4a: Switch decodes Requester ID field in CplD TLP and routes the packet to the correct egress port. Requester receives CplD TLP. Requester checks for CRC errors in received CplD TLP using LCRC.
Step 4b: If no error then requester returns ACK DLLP to switch. Switch discards copy of the CplD TLP from its replay buffer. Requester checks for End-to-End CRC error using the optional ECRC field in the CplD TLP. Assume no End-to-End error. Requester checks completion error code in CplD. Assume completion code of 'Successful Completion'. To associate the completion with the original request, requester matches tag in CplD with original tag of MRd request. Requester accepts data.
Figure 2-27: Non-Posted Transaction on Link


Posted Transaction Showing ACK-NAK Protocol. Below are the steps involved in completing a memory write request between a requester and a completer on the far end of a switch. Figure 2-28 on page 92 shows the activity on the Link to complete this transaction:
Step 1a: Requester transmits a memory write request TLP (MWr) with data. Switch receives MWr TLP and checks for CRC error with LCRC field in the TLP.
Step 1b: If no error then switch returns ACK DLLP to requester. Requester discards copy of the TLP from its replay buffer.
Step 2a: Switch forwards the MWr TLP to the correct egress port using memory address for routing. Completer receives MWr TLP. Completer checks for CRC errors in received MWr TLP using LCRC.
Step 2b: If no error then completer returns ACK DLLP to switch. Switch discards copy of the MWr TLP from its replay buffer. Completer checks for CRC error using optional digest field in MWr TLP. Assume no End-to-End error. Completer accepts data. There is no completion associated with this transaction.
Figure 2-28: Posted Transaction on Link
Other Functions of the Data Link Layer. Following power-up or Reset, the flow control mechanism described earlier is initialized by the Data Link Layer. This process is accomplished automatically at the hardware level and has no software involvement.
Flow control for the default virtual channel VC0 is initialized first. In addition, when additional VCs are enabled by software, the flow control initialization process is repeated for each newly enabled VC. Since VC0 is enabled before all other VCs, no TLP traffic will be active prior to initialization of VC0.

Physical Layer

Refer to Figure 2-19 on page 79 for a block diagram of a device's Physical Layer. Both TLP and DLLP type packets are sent from the Data Link Layer to the Physical Layer for transmission over the Link. Also, packets are received by the Physical Layer from the Link and sent to the Data Link Layer.
The Physical Layer is divided in two portions, the Logical Physical Layer and the Electrical Physical Layer. The Logical Physical Layer contains digital logic associated with processing packets before transmission on the Link, or processing packets inbound from the Link before sending to the Data Link Layer. The Electrical Physical Layer is the analog interface of the Physical Layer that connects to the Link. It consists of differential drivers and receivers for each Lane.
Transmit Side. TLPs and DLLPs from the Data Link Layer are clocked into a buffer in the Logical Physical Layer. The Physical Layer frames the TLP or DLLP with Start and End characters. These framing symbols are code bytes that a receiver device uses to detect the start and end of a packet. The Start and End characters are shown appended to a TLP and DLLP in Figure 2-29 on page 94. The diagram shows the size of each field in a TLP or DLLP.
The transmit logical sub-block conditions the received packet from the Data Link Layer into the correct format for transmission. Packets are byte striped across the available Lanes on the Link.
Each byte of a packet is then scrambled with the aid of a Linear Feedback Shift Register (LFSR) type scrambler. By scrambling the bytes, repeated bit patterns on the Link are eliminated, thus reducing the average EMI noise generated.
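The scrambler can be sketched as a 16-bit LFSR, shown below. The polynomial x^16 + x^5 + x^4 + x^3 + 1 and the seed FFFFh are those commonly cited for 2.5 Gbit/s PCI Express, but the exact per-Lane bit ordering and the symbols that bypass scrambling are omitted here; the sketch only illustrates how an LFSR breaks up repeated data patterns (scrambling and descrambling are the same XOR operation).

    #include <stddef.h>
    #include <stdint.h>

    static uint16_t lfsr = 0xFFFF;   /* assumed seed value */

    /* Advance the LFSR (polynomial x^16 + x^5 + x^4 + x^3 + 1) by one bit
     * and return the bit shifted out. */
    static int lfsr_step(void)
    {
        int msb = (lfsr >> 15) & 1;
        lfsr = (uint16_t)((lfsr << 1) ^ (msb ? 0x0039 : 0));
        return msb;
    }

    /* Produce the next 8 scrambling bits as a byte. */
    static uint8_t lfsr_next_byte(void)
    {
        uint8_t out = 0;
        for (int bit = 0; bit < 8; bit++)
            out = (uint8_t)((out << 1) | lfsr_step());
        return out;
    }

    /* Scramble (or descramble) a buffer in place by XORing with the LFSR output. */
    void scramble(uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            buf[i] ^= lfsr_next_byte();
    }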
The resultant bytes are encoded into a 10b code by the 8b/10b encoding logic. The primary purpose of encoding 8b characters to 10b symbols is to create sufficient 1-to-0 and 0-to-1 transition density in the bit stream to facilitate recreation of a receive clock with the aid of a PLL at the remote receiver device. Note that data is not transmitted along with a clock. Instead, the bit stream contains sufficient transitions to allow the receiver device to recreate a receive clock.
The parallel-to-serial converter generates a serial bit stream of the packet on each Lane and transmits it differentially at 2.5 Gbits/s.
Receive Side. The receive Electrical Physical Layer clocks in a packet arriving differentially on all Lanes. The serial bit stream of the packet is converted into a 10b parallel stream using the serial-to-parallel converter. The receiver logic also includes an elastic buffer which accommodates clock frequency variation between the transmit clock with which the packet bit stream is clocked into the receiver and the receiver's local clock. The 10b symbol stream is decoded back to the 8b representation of each symbol with the 8b/10b decoder. The 8b characters are de-scrambled. The byte un-striping logic re-creates the original packet stream transmitted by the remote device.
Figure 2-29: TLP and DLLP Structure at the Physical Layer
Link Training and Initialization. An additional function of the Physical Layer is Link initialization and training. Link initialization and training is a Physical Layer controlled process that configures and initializes each Link for normal operation. This process is automatic and does not involve software. The following are determined during the Link initialization and training process:
  • Link width
  • Link data rate
  • Lane reversal
  • Polarity inversion.
  • Bit lock per Lane
  • Symbol lock per Lane
  • Lane-to-Lane de-skew within a multi-Lane Link.
Link width. Two devices with a different number of Lanes per Link may be connected. For example, one device has a x2 port and is connected to a device with a x4 port. After initialization, the Physical Layer of both devices determines and sets the Link width to the smaller of the two widths, x2. Other negotiated Link behaviors include Lane reversal and splitting of ports into multiple Links.
Lane reversal is an optional feature. Lanes are numbered, and a board designer may not wire the Lanes of the two ports in matching order. In that case, training allows the Lane numbers to be reversed so that the Lane numbers of the ports on each end of the Link match up. The same process may also allow a multi-Lane Link to be split into multiple Links.
Polarity inversion. The D+ and D- terminals of a differential pair may not be connected correctly between the two devices. In that case, during training the receiver reverses the polarity on its differential receiver.
Link data rate. Training is completed at a data rate of 2.5 Gbits/s. In the future, higher data rates of 5 Gbits/s and 10 Gbits/s will be supported. During training, each node advertises its highest data rate capability. The Link is initialized with the highest common frequency that the devices at opposite ends of the Link support.
Lane-to-Lane De-skew. Due to Link wire length variations and different driver/receiver characteristics on a multi-Lane Link, the bit streams on each Lane will arrive at a receiver skewed with respect to the other Lanes. The receiver circuit must compensate for this skew by adding/removing delays on each Lane. Relaxed routing rules allow Link wire lengths on the order of 20-30 inches.
Link Power Management. The normal power-on operation of a Link is called the L0 state. Lower power states of the Link in which no packets are transmitted or received are L0s, L1, L2 and L3 power states. The L0s power state is automatically entered when a time-out occurs after a period of inactivity on the Link. Entering and exiting this state does not involve software and the exit latency is the shortest. L1 and L2 are lower power states than L0s, but exit latencies are greater. The L3 power state is the full off power state from which a device cannot generate a wake up event.

Reset. Two types of reset are supported:

  • Cold/warm reset, also called a Fundamental Reset, which occurs following a device being powered on (cold reset) or due to a reset being generated without cycling power (warm reset).
  • Hot reset sometimes referred to as protocol reset is an in-band method of propagating reset. Transmission of an ordered-set is used to signal a hot reset. Software initiates hot reset generation.
On exit from reset (cold, warm, or hot), all state machines and configuration registers (hot reset does not reset sticky configuration registers) are initialized.


Electrical Physical Layer. The transmitter of one device is AC coupled to the receiver of the device at the opposite end of the Link as shown in Figure 2-30. The AC coupling capacitor is between 75 nF and 200 nF. The transmitter DC common mode voltage is established during Link training and initialization. The DC common mode impedance is typically 50 ohms, while the differential impedance is typically 100 ohms. This impedance is matched with a standard FR4 board.
Figure 2-30: Electrical Physical Layer Showing Differential Transmitter and Receiver

Example of a Non-Posted Memory Read Transaction

Let us use what we have covered so far to describe the set of events that take place from the time a requester device initiates a memory read request until it obtains the requested data from a completer device. Given that such a transaction is a non-posted transaction, there are two phases to the read process. The first phase is the transmission of a memory read request TLP from requester to completer. The second phase is the reception of a completion with data from the completer.

Memory Read Request Phase

Refer to Figure 2-31. The requester Device Core or Software Layer sends the following information to the Transaction Layer:
32-bit or 64-bit memory address, transaction type of memory read request, amount of data to read calculated in doublewords, traffic class if other than TC0, byte enables, attributes to indicate if 'relaxed ordering' and 'no snoop' attribute bits should be set or clear.
Figure 2-31: Memory Read Request Phase
The Transaction Layer uses this information to build an MRd TLP. The exact TLP packet format is described in a later chapter. A 3 DW or 4 DW header is created depending on address size (32-bit or 64-bit). In addition, the Transaction Layer adds its requester ID (bus#, device#, function#) and an 8-bit tag to the header. It sets the TD (TLP Digest) bit in the TLP header if a 32-bit
End-to-End CRC is added to the tail portion of the TLP. The TLP does not have a data payload. The TLP is placed in the appropriate virtual channel buffer ready for transmission. The flow control logic confirms there are sufficient "credits" available (obtained from the completer device) for the virtual channel associated with the traffic class used.
Only then the memory read request TLP is sent to the Data Link Layer. The Data Link Layer adds a 12-bit sequence ID and a 32-bit LCRC which is calculated based on the entire packet. A copy of the TLP with sequence ID and LCRC is stored in the replay buffer.
This packet is forwarded to the Physical Layer which tags on a Start symbol and an End symbol to the packet. The packet is byte striped across the available Lanes, scrambled and 10 bit encoded. Finally the packet is converted to a serial bit stream on all Lanes and transmitted differentially across the Link to the neighbor completer device.
The completer converts the incoming serial bit stream back to 10b symbols while assembling the packet in an elastic buffer. The 10b symbols are converted back to bytes and the bytes from all Lanes are de-scrambled and un-striped. The Start and End symbols are detected and removed. The resultant TLP is sent to the Data Link Layer.
The completer Data Link Layer checks for LCRC errors in the received TLP and checks the Sequence ID for missing or out-of-sequence TLPs. Assume no error. The Data Link Layer creates an ACK DLLP which contains the same sequence ID as contained in the memory read request TLP received. A 16-bit CRC is added to the ACK DLLP. The DLLP is sent back to the Physical Layer which transmits the ACK DLLP to the requester.
The requester Physical Layer reformulates the ACK DLLP and sends it up to the Data Link Layer which evaluates the sequence ID and compares it with TLPs stored in the replay buffer. The stored memory read request TLP associated with the ACK received is discarded from the replay buffer. If a NAK DLLP was received by the requester instead, it would re-send a copy of the stored memory read request TLP.
In the mean time the Data Link Layer of the completer strips the sequence ID and LCRC field from the memory read request TLP and forwards it to the Transaction Layer.
The Transaction Layer receives the memory read request TLP in the appropriate virtual channel buffer associated with the TC of the TLP. The Transaction Layer checks for ECRC errors. It forwards the contents of the header (address, requester ID, memory read transaction type, amount of data requested, traffic class, etc.) to the completer Device Core/Software Layer.

Completion with Data Phase

Refer to Figure 2-32 on page 99 during the following discussion. To service the memory read request, the completer Device Core/Software Layer sends the following information to the Transaction Layer:
Requester ID and Tag copied from the original memory read request, transaction type of completion with data (CplD), requested amount of data with data length field, traffic class if other than TC0, attributes to indicate if 'relaxed ordering' and 'no snoop' bits should be set or clear (these bits are copied from the original memory read request). Finally, a completion status of successful completion (SC) is sent.
Figure 2-32: Completion with Data Phase
The Transaction Layer uses this information to build a CplD TLP. The exact TLP packet format is described in a later chapter. A 3 DW header is created. In addition, the Transaction Layer adds its own completer ID to the header. The TD (TLP Digest) bit in the TLP header is set if a 32-bit End-to-End CRC is added to the tail portion of the TLP. The TLP includes the data payload. The flow control logic confirms sufficient "credits" are available (obtained from the requester device) for the virtual channel associated with the traffic class used.
Only then the CplD TLP is sent to the Data Link Layer. The Data Link Layer adds a 12-bit sequence ID and a 32-bit LCRC which is calculated based on the entire packet. A copy of the TLP with sequence ID and LCRC is stored in the replay buffer.
This packet is forwarded to the Physical Layer which tags on a Start symbol and an End symbol to the packet. The packet is byte striped across the available Lanes, scrambled and 10 bit encoded. Finally the CplD packet is converted to a serial bit stream on all Lanes and transmitted differentially across the Link to the neighbor requester device.
The requester converts the incoming serial bit stream back to 10b symbols while assembling the packet in an elastic buffer. The 10b symbols are converted back to bytes and the bytes from all Lanes are de-scrambled and un-striped. The Start and End symbols are detected and removed. The resultant TLP is sent to the Data Link Layer.
The Data Link Layer checks for LCRC errors in the received CplD TLP and checks the Sequence ID for missing or out-of-sequence TLPs. Assume no error. The Data Link Layer creates an ACK DLLP which contains the same sequence ID as contained in the CplD TLP received. A 16-bit CRC is added to the ACK DLLP. The DLLP is sent back to the Physical Layer which transmits the ACK DLLP to the completer.
The completer Physical Layer receives and reassembles the ACK DLLP and sends it up to the Data Link Layer, which evaluates the sequence ID and compares it with TLPs stored in the replay buffer. The stored CplD TLP associated with the ACK received is discarded from the replay buffer. If a NAK DLLP had been received by the completer instead, it would resend a copy of the stored CplD TLP.
In the meantime, the requester Transaction Layer receives the CplD TLP in the appropriate virtual channel buffer mapped to the TLP's TC. The Transaction Layer uses the tag in the header of the CplD TLP to associate the completion with the original request. The Transaction Layer checks for ECRC errors. It forwards the header contents and data payload, including the Completion Status, to the requester Device Core/Software Layer. The memory read transaction is now complete.

Hot Plug

PCI Express supports native hot-plug, though hot-plug support in a device is not mandatory. Some of the elements found in a PCI Express hot-plug system are:
  • Indicators which show the power and attention state of the slot.
  • Manually-operated Retention Latch (MRL) that holds add-in cards in place.
  • MRL Sensor that allows the port and system software to detect the MRL being opened.
  • Electromechanical Interlock which prevents removal of add-in cards while the slot is powered.
  • Attention Button that allows the user to request hot-plug operations.
  • Software User Interface that allows the user to request hot-plug operations.
  • Slot Numbering for visual identification of slots.
When a port has no connection or a removal event occurs, the port transmitter moves to the electrical high impedance detect state. The receiver remains in the electrical low impedance state.

PCI Express Performance and Data Transfer Efficiency

As of May 2003, no realistic performance and efficiency numbers were available. However, Table 2-3 shows aggregate bandwidth numbers for various Link widths after factoring the overhead of 8b/10b encoding.
Table 2-3: PCI Express Aggregate Throughput for Various Link Widths
PCI Express Link Width | x1 | x2 | x4 | x8 | x12 | x16 | x32
Aggregate Bandwidth (GBytes/sec) | 0.5 | 1 | 2 | 4 | 6 | 8 | 16
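These figures follow directly from the 2.5 Gbits/s signaling rate per Lane per direction and the 10-bits-per-byte cost of 8b/10b encoding, counted in both directions. A small calculation sketch (the function name is ours) reproduces the table:

```c
#include <stdio.h>

/* Aggregate (both directions) bandwidth in GBytes/s for a first-generation
 * link: 2.5 Gbit/s per lane per direction, 10 bits per byte after 8b/10b. */
static double aggregate_gbytes_per_sec(int lanes)
{
    double per_lane_per_dir = 2.5 / 10.0;   /* 0.25 GByte/s per lane */
    return per_lane_per_dir * lanes * 2;    /* two directions        */
}

int main(void)
{
    int widths[] = {1, 2, 4, 8, 12, 16, 32};
    for (int i = 0; i < 7; i++)
        printf("x%-2d -> %.1f GBytes/s\n",
               widths[i], aggregate_gbytes_per_sec(widths[i]));
    return 0;
}
```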
DLLPs are 2 doublewords in size. The ACK/NAK and flow control protocol utilize DLLPs, but it is not expected that these DLLPs will use up a significant portion of the bandwidth.

The remainder of the bandwidth is available for TLPs. Between 6-7 doublewords of each TLP is overhead associated with the Start and End framing symbols, sequence ID, TLP header, ECRC and LCRC fields. The remainder of the TLP contains between 0-1024 doublewords of data payload. Bus efficiency is therefore quite low when small packets are transmitted, and very high when TLPs carry large data payloads.
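As a rough illustration, per-TLP efficiency can be estimated as payload divided by payload plus overhead. The sketch below assumes the full 7 DWs of overhead quoted above; it is an approximation, not a complete link-efficiency model (DLLP traffic and idle time are ignored):

```c
#include <stdio.h>

/* Rough per-TLP efficiency: payload DWs vs. payload plus overhead DWs.
 * Assumes 7 DWs of overhead (framing, sequence ID, 4DW header, ECRC,
 * LCRC); actual overhead is 6-7 DWs depending on header size and
 * whether an ECRC is present. */
static double tlp_efficiency(int payload_dw)
{
    const int overhead_dw = 7;
    return (double)payload_dw / (payload_dw + overhead_dw);
}

int main(void)
{
    int sizes[] = {1, 8, 64, 1024};
    for (int i = 0; i < 4; i++)
        printf("%4d DW payload -> %.1f%% efficient\n",
               sizes[i], 100.0 * tlp_efficiency(sizes[i]));
    return 0;
}
```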
Packets can be transmitted back-to-back without the Link going idle. Thus the Link can be 100% utilized.
The switch does not introduce any arbitration overhead when forwarding incoming packets from multiple ingress ports to one egress port. However, it remains to be seen what effect the Quality of Service protocol has on actual bandwidth numbers for given applications.
There is overhead associated with the split transaction protocol, especially for read transactions. For a read request TLP, the data payload is contained in the completion. This factor has to be accounted for when determining the effective performance of the bus. Posted write transactions improve the efficiency of the fabric.
Switches support cut-through mode. That is to say that an incoming packet can be immediately forwarded to an egress port for transmission without the switch having to buffer up the entire packet. The latency for packet forwarding through a switch can be very small, allowing packets to travel from one end of the PCI Express fabric to the other with very little latency.
Part Two

Transaction Protocol

Address Spaces & Transaction Routing

The Previous Chapter

The previous chapter introduced the PCI Express data transfer protocol. It described the layered approach to PCI Express device design while describing the function of each device layer. Packet types employed in accomplishing data transfers were described without getting into packet content details. Finally, it outlined the process of a requester initiating a transaction, such as a memory read, to read data from a completer across a Link.

This Chapter

This chapter describes the general concepts of PCI Express transaction routing and the mechanisms used by a device in deciding whether to accept, forward, or reject a packet arriving at an ingress port. Because Data Link Layer Packets (DLLPs) and Physical Layer ordered set link traffic are never forwarded, the emphasis here is on Transaction Layer Packet (TLP) types and the three routing methods associated with them: address routing, ID routing, and implicit routing. Included is a summary of configuration methods used in PCI Express to set up PCI-compatible plug-and-play addressing within system IO and memory maps, as well as key elements in the PCI Express packet protocol used in making routing decisions.

The Next Chapter

The next chapter details the two major classes of packets: Transaction Layer Packets (TLPs) and Data Link Layer Packets (DLLPs). The use and format of each TLP and DLLP packet type is covered, along with definitions of the fields within the packets.

Introduction

Unlike shared-bus architectures such as PCI and PCI-X, where traffic is visible to each device and routing is mainly a concern of bridges, PCI Express devices are dependent on each other to accept traffic or forward it in the direction of the ultimate recipient.
Figure 3-1: Multi-Port PCI Express Devices Have Routing Responsibilities

As illustrated in Figure 3-1 on page 106, a PCI Express topology consists of independent, point-to-point links connecting each device with one or more neighbors. As traffic arrives at the inbound side of a link interface (called the ingress port), the device checks for errors, then makes one of three decisions:
  1. Accept the traffic and use it internally.
  2. Forward the traffic to the appropriate outbound (egress) port.
  3. Reject the traffic because it is neither the intended target nor an interface to it (note that there are also other reasons why traffic may be rejected).

Receivers Check For Three Types of Link Traffic

Assuming a link is fully operational, the physical layer receiver interface of each device is prepared to monitor the logical idle condition and detect the arrival of the three types of link traffic: Ordered Sets, DLLPs, and TLPs. Using control (K) symbols which accompany the traffic to determine framing boundaries and traffic type, PCI Express devices then make a distinction between traffic which is local to the link vs. traffic which may require routing to other links (e.g. TLPs). Local link traffic, which includes Ordered Sets and Data Link Layer Packets (DLLPs), isn't forwarded and carries no routing information. Transaction Layer Packets (TLPs) can and do move from link to link, using routing information contained in the packet headers.

Multi-port Devices Assume the Routing Burden

It should be apparent in Figure 3-1 on page 106 that devices with multiple PCI Express ports are responsible for handling their own traffic as well as forwarding other traffic between ingress ports and any enabled egress ports. Also note that while peer-to-peer transaction support is required of switches, it is optional for a multi-port Root Complex. It is up to the system designer to account for peer-to-peer traffic when selecting devices and laying out a motherboard.

Endpoints Have Limited Routing Responsibilities

It should also be apparent in Figure 3-1 on page 106 that endpoint devices have a single link interface and lack the ability to route inbound traffic to other links. For this reason, and because they don't reside on shared busses, endpoints never expect to see ingress port traffic which is not intended for them (this is different than shared-bus PCI(X), where devices commonly decode addresses and commands not targeting them). Endpoint routing is limited to accepting or rejecting transactions presented to them.

System Routing Strategy Is Programmed

Before transactions can be generated by a requester, accepted by the completer, and forwarded by any devices in the path between the two, all devices must be configured to enforce the system transaction routing scheme. Routing is based on traffic type, system memory and IO address assignments, etc. In keeping with PCI plug-and-play configuration methods, each PCI Express device is discovered, memory and IO address resources are assigned to it, and switch/bridge devices are programmed to forward transactions on its behalf. Once routing is programmed, bus mastering and target address decoding are enabled. Thereafter, devices are prepared to generate, accept, forward, or reject transactions as necessary.

Two Types of Local Link Traffic

Local traffic occurs between the transmit interface of one device and the receive interface of its neighbor for the purpose of managing the link itself. This traffic is never forwarded or flow controlled; when sent, it must be accepted. Local traffic is further classified as Ordered Sets exchanged between the Physical Layers of two devices on a link or Data Link Layer packets (DLLPs) exchanged between the Data Link Layers of the two devices.

Ordered Sets

These are sent by each physical layer transmitter to the physical layer of the corresponding receiver to initiate link training, compensate for clock tolerance, or transition a link to and from the Electrical Idle state. As indicated in Table 3-1 on page 109, there are five types of Ordered Sets.
Each ordered set is constructed of 10-bit control (K) symbols that are created within the physical layer. These symbols have a common name as well as an alphanumeric code that defines the 10-bit pattern of 1s and 0s of which they are comprised. For example, the SKP (Skip) symbol has a 10-bit value represented as K28.0.

Figure 3-2 on page 110 illustrates the transmission of Ordered Sets. Note that each ordered set is fixed in size, consisting of 4 or 16 characters. Again, the receiver is required to consume them as they are sent. Note that the COM control symbol (K28.5) is used to indicate the start of any ordered set.
Refer to the "8b/10b Encoding" on page 419 for a thorough discussion of Ordered Sets.
Table 3-1: Ordered Set Types
Ordered Set Type | Symbols | Purpose
Fast Training Sequence (FTS) | COM, 3 FTS | Quick synchronization of bit stream when leaving L0s power state.
Training Sequence One (TS1) | COM, Lane ID, 14 more | Used in link training, to align and synchronize the incoming bit stream at startup, convey reset, and other functions.
Training Sequence Two (TS2) | COM, Lane ID, 14 more | See TS1.
Electrical Idle (IDLE) | COM, 3 IDL | Indicates that the link should be brought to a lower power state (L0s, L1, L2).
Skip (SKP) | COM, 3 SKP | Inserted periodically to compensate for clock tolerances.
Figure 3-2: PCI Express Link Local Traffic: Ordered Sets

Data Link Layer Packets (DLLPs)

The other type of local traffic, sent by a device's transmit interface to the corresponding receiver of the attached device, is the Data Link Layer Packet (DLLP). These are also used in link management, although they are sourced at the device's Data Link Layer rather than the Physical Layer. The main functions of DLLPs are to facilitate Link Power Management, TLP Flow Control, and the acknowledgement of successful TLP delivery across the link.
Table 3-2: Data Link Layer Packet (DLLP) Types
DLLP | Purpose
Acknowledge (Ack) | Receiver Data Link Layer sends Ack to indicate that no CRC or other errors have been encountered in received TLP(s). The transmitter retains a copy of TLPs until Ack'd.
No Acknowledge (Nak) | Receiver Data Link Layer sends Nak to indicate that a TLP was received with a CRC or other error. All TLPs remaining in the transmitter's Retry Buffer must be resent, in the original order.
PM_Enter_L1, PM_Enter_L23 | Following a software configuration space access to cause a device power management event, a downstream device requests entry to the link L1 or L2/L3 state.
PM_Active_State_Req_L1 | Downstream device autonomously requests L1 Active State.
PM_Request_Ack | Upstream device acknowledges transition to L1 State.
Vendor-Specific DLLP | Reserved for vendor-specific purposes.
InitFC1-P, InitFC1-NP, InitFC1-Cpl | Flow Control Initialization Type One DLLP awarding posted (P), non-posted (NP), or completion (Cpl) flow control credits.
InitFC2-P, InitFC2-NP, InitFC2-Cpl | Flow Control Initialization Type Two DLLP confirming award of InitFC1 posted (P), non-posted (NP), or completion (Cpl) flow control credits.
UpdateFC-P, UpdateFC-NP, UpdateFC-Cpl | Flow Control Credit Update DLLP awarding posted (P), non-posted (NP), or completion (Cpl) flow control credits.
As described in Table 3-2 on page 111 and shown in Figure 3-3 on page 112, there are three major types of DLLPs: Ack/Nak, Power Management (several variants), and Flow Control. In addition, a vendor-specific DLLP is permitted by the specification. Each DLLP is 8 bytes, including a Start Of DLLP (SDP) byte, a 2-byte CRC, and an End Of Packet (END) byte in addition to the 4-byte DLLP core (which includes the type field and any required attributes).
Figure 3-3: PCI Express Link Local Traffic: DLLPs
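The 8-byte framing just described can be pictured with a simple byte-for-byte layout. This is a sketch only; the field names are ours, and the CRC algorithm defined by the specification is not reproduced:

```c
#include <stdint.h>

/* Sketch of the 8 bytes that appear on the link for one DLLP:
 * SDP framing symbol, 4-byte DLLP core (type field plus attributes),
 * 16-bit CRC, and END framing symbol. All fields are bytes, so
 * sizeof(struct dllp_on_wire) == 8 with no padding. */
struct dllp_on_wire {
    uint8_t sdp;        /* Start Of DLLP (SDP) framing symbol          */
    uint8_t core[4];    /* DLLP type field plus required attributes    */
    uint8_t crc16[2];   /* 16-bit CRC covering the 4-byte core         */
    uint8_t end;        /* End Of Packet (END) framing symbol          */
};
```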

Note that unlike Ordered Sets, DLLPs always carry a 16-bit CRC which is verified by the receiver before carrying out the required operation. If an error is detected by the receiver of a DLLP, it is dropped. Even though DLLPs are not acknowledged, time-out mechanisms built into the specification permit recovery from dropped DLLPs due to CRC errors.
Refer to "Data Link Layer Packets" on page 198 for a thorough discussion of Data Link Layer packets.

Transaction Layer Packet Routing Basics

The third class of link traffic originates in the Transaction Layer of one device and targets the Transaction Layer of another device. These Transaction Layer Packets (TLPs) are forwarded from one link to another as necessary, subject to the routing mechanisms and rules described in the following sections. Note that other chapters in this book describe additional aspects of Transaction Layer Packet handling, including Flow Control, Quality Of Service, Error Handling, Ordering rules, etc. The term transaction is used here to describe the exchange of information using Transaction Layer Packets. Because Ordered Sets and DLLPs carry no routing information and are not forwarded, the routing rules described in the following sections apply only to TLPs.

TLPs Used to Access Four Address Spaces

As transactions are carried out between PCI Express requesters and completers, four separate address spaces are used: Memory, IO, Configuration, and Message. The basic use of each address space is described in Table 3-3 on page 113.
Table 3-3: PCI Express Address Space And Transaction Types
Address Space | Transaction Types | Purpose
Memory | Read, Write | Transfer data to or from a location in the system memory map.
IO | Read, Write | Transfer data to or from a location in the system IO map.
Configuration | Read, Write | Transfer data to or from a location in the configuration space of a PCI-compatible device.
Message | Baseline, Vendor-specific | General in-band messaging and event reporting (without consuming memory or IO address resources).

Split Transaction Protocol Is Used

Accesses to the four address spaces in PCI Express are accomplished using split-transaction requests and completions.

Split Transactions: Better Performance, More Overhead

The split transaction protocol is an improvement over earlier bus protocols (e.g. PCI) which made extensive use of bus wait-states or delayed transactions (retries) to deal with latencies in accessing targets. In PCI Express, the completion following a request is initiated by the completer only when it has data and/ or status ready for delivery. The fact that the completion is separated in time from the request which caused it also means that two separate TLPs are generated, with independent routing for the request TLP and the Completion TLP. Note that while a link is free for other activity in the time between a request and its subsequent completion, a split-transaction protocol involves some additional overhead as two complete TLPs must be generated to carry out a single transaction.
Figure 3-4 on page 115 illustrates the request-completion phases of a PCI Express split transaction. This example represents an endpoint read from system memory.
Figure 3-4: PCI Express Transaction Request And Completion TLPs
Write Posting: Sometimes a Completion Isn't Needed
To mitigate the penalty of the request-completion latency, messages and some write transactions in PCI Express are posted, meaning the write request (including data) is sent, and the transaction is over from the requester's perspective as soon as the request is sent out of the egress port; responsibility for delivery is now the problem of the next device. In a multi-level topology, this has the advantage of being much faster than waiting for the entire request-completion transit, but - as in all posting schemes - uncertainty exists concerning when (and if) the transaction completed successfully at the ultimate recipient.

In PCI Express, write posting to memory is considered acceptable in exchange for the higher performance. On the other hand, writes to IO and configuration space may change device behavior, and write posting is not permitted. A completion will always be sent to report status of the IO or configuration write operation.
Table 3-4 on page 116 lists PCI Express posted and non-posted transactions.
Table 3-4: PCI Express Posted and Non-Posted Transactions
Request | How Request Is Handled
Memory Write | All Memory Write requests are posted. No completion is expected or sent.
Memory Read, Memory Read Lock | All memory read requests are non-posted. A completion with data (CplD or CplDLk) will be returned by the completer with the requested data and to report status of the memory read.
IO Write | All IO Write requests are non-posted. A completion without data (Cpl) will be returned by the completer to report status of the IO write operation.
IO Read | All IO read requests are non-posted. A completion with data (CplD) will be returned by the completer with the requested data and to report status of the IO read operation.
Configuration Write Type 0 and Type 1 | All Configuration Write requests are non-posted. A completion without data (Cpl) will be returned by the completer to report status of the configuration space write operation.
Configuration Read Type 0 and Type 1 | All configuration read requests are non-posted. A completion with data (CplD) will be returned by the completer with the requested data and to report status of the read operation.
Message, Message With Data | While the routing method varies, all message transactions are handled in the same manner as memory writes in that they are considered posted requests.

Three Methods of TLP Routing

All of the TLP variants, targeting any of the four address spaces, are routed using one of the three possible schemes: Address Routing, ID Routing, and Implicit Routing. Table 3-5 on page 117 summarizes the PCI Express TLP header type variants and the routing method used for each. Each of these is described in the following sections.
Table 3-5: PCI Express TLP Variants And Routing Options
TLP Type | Routing Method Used
Memory Read (MRd), Memory Read Lock (MRdLk), Memory Write (MWr) | Address Routing
IO Read (IORd), IO Write (IOWr) | Address Routing
Configuration Read Type 0 (CfgRd0), Configuration Read Type 1 (CfgRd1), Configuration Write Type 0 (CfgWr0), Configuration Write Type 1 (CfgWr1) | ID Routing
Message (Msg), Message With Data (MsgD) | Address Routing, ID Routing, or Implicit Routing
Completion (Cpl), Completion With Data (CplD) | ID Routing

PCI Express Routing Is Compatible with PCI

As indicated in Table 3-5 on page 117, memory and IO transactions are routed through the PCI Express topology using address routing to reference system memory and IO maps, while configuration cycles use ID routing to reference the completer's (target's) logical position within the PCI-compatible bus topology (using Bus Number, Device Number, Function Number in place of a linear address). Both address routing and ID routing are completely compatible with routing methods used in the PCI and PCIX protocols when performing memory, IO, or configuration transactions. PCI Express completions also use the ID routing scheme.

PCI Express Adds Implicit Routing for Messages

PCI Express adds the third routing method, implicit routing, which is an option when sending messages. In implicit routing, neither address nor ID routing information applies; the packet is routed based on a code in the packet header indicating it is destined for device(s) with known, fixed locations (the Root Complex, the next receiver, etc.).
While limited in the cases it can support, implicit routing simplifies routing of messages. Note that messages may optionally use address or ID routing instead.
Why Were Messages Added to the PCI Express Protocol? PCI and PCI-X protocols support load and store memory and IO read-write transactions, which have the following features:
  1. The transaction initiator drives out a memory or IO start address selecting a location within the desired target.
  2. The target claims the transaction based on decoding and comparing the transaction start address with ranges it has been programmed to respond to in its configuration space Base Address Registers.
  3. If the transaction involves bursting, then addresses are indexed after each data transfer.
While PCI Express also supports load and store transactions with its memory and IO transactions, it adds in-band messages. The main reason for this is that the PCI Express protocol seeks to (and does) eliminate many of the sideband signals related to interrupts, error handling, and power management which are found in PCI(X) -based systems. Elimination of signals is very important in an architecture with the scalability possible with PCI Express. It would not be efficient to design a PCI Express device with a two lane link and then saddle it with numerous additional signals to handle auxiliary functions.
The PCI Express protocol replaces most sideband signals with a variety of in-band packet types; some of these are conveyed as Data Link Layer packets (DLLPs) and some as Transaction Layer packets (TLPs).
How Implicit Routing Helps with Messages. One side effect of using in-band messages in place of hard-wired sideband signals is the problem of delivering the message to the proper recipient in a topology consisting of numerous point-to-point links. The PCI Express protocol provides maximum flexibility in routing message TLPs; they may use address routing, ID routing, or the third method, implicit routing. Implicit routing takes advantage of the fact that, due to their architecture, switches and other multi-port devices have a fundamental sense of upstream and downstream, and where the Root Complex is to be found. Because of this, a message header can be routed implicitly with a simple code indicating that it is intended for the Root Complex, is a broadcast downstream message, should terminate at the next receiver, etc.
The advantage of implicit routing is that it eliminates the need to assign a set of memory mapped addresses for all of the possible message variants and program all of the devices to use them.

Header Fields Define Packet Format and Routing

As depicted in Figure 3-5 on page 119, each Transaction Layer Packet contains a three or four double word (12 or 16 byte) header. Included in the 3DW or 4DW header are two fields, Type and Format (Fmt), which define the format of the remainder of the header and the routing method to be used on the entire TLP as it moves between devices in the PCI Express topology.
Figure 3-5: Transaction Layer Packet Generic 3DW And 4DW Headers

Using TLP Header Information: Overview

General

As TLPs arrive at an ingress port, they are first checked for errors at both the physical and data link layers of the receiver. Assuming there are no errors, TLP routing is performed; basic steps include:
  1. The TLP header Type and Format fields in the first DWord are examined to determine the size and format of the remainder of the packet.
  2. Depending on the routing method associated with the packet, the device will determine if it is the intended recipient; if so, it will accept (consume) the TLP. If it is not the recipient, and it is a multi-port device, it will forward the TLP to the appropriate egress port, subject to the rules for ordering and flow control for that egress port.
  3. If it is neither the intended recipient nor a device in the path to it, it will generally reject the packet as an Unsupported Request (UR).

Header Type/Format Field Encodings

Table 3-6 on page 120 below summarizes the encodings used in TLP header Type and Format fields. These two fields, used together, indicate TLP format and routing to the receiver.
Table 3-6: TLP Header Type and Format Field Encodings
TLP | FMT[1:0] | TYPE[4:0]
Memory Read Request (MRd) | 00 = 3DW, no data; 01 = 4DW, no data | 0 0000
Memory Read Lock Request (MRdLk) | 00 = 3DW, no data; 01 = 4DW, no data | 0 0001
Memory Write Request (MWr) | 10 = 3DW, w/ data; 11 = 4DW, w/ data | 0 0000
IO Read Request (IORd) | 00 = 3DW, no data | 0 0010
IO Write Request (IOWr) | 10 = 3DW, w/ data | 0 0010
Config Type 0 Read Request (CfgRd0) | 00 = 3DW, no data | 0 0100
Config Type 0 Write Request (CfgWr0) | 10 = 3DW, w/ data | 0 0100
Config Type 1 Read Request (CfgRd1) | 00 = 3DW, no data | 0 0101
Config Type 1 Write Request (CfgWr1) | 10 = 3DW, w/ data | 0 0101
Message Request (Msg) | 01 = 4DW, no data | 1 0RRR (for RRR, see the routing sub-field)
Message Request W/Data (MsgD) | 11 = 4DW, w/ data | 1 0RRR (for RRR, see the routing sub-field)
Completion (Cpl) | 00 = 3DW, no data | 0 1010
Completion W/Data (CplD) | 10 = 3DW, w/ data | 0 1010
Completion-Locked (CplLk) | 00 = 3DW, no data | 0 1011
Completion-Locked W/Data (CplDLk) | 10 = 3DW, w/ data | 0 1011
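A receiver's first routing step, classifying a TLP from its Fmt and Type fields, might be sketched as follows. This is a simplified illustration based on Table 3-5 and Table 3-6; the fields are assumed to have already been extracted from the first header DWord, and the function and enum names are ours:

```c
#include <stdio.h>

/* Sketch: classify a TLP from its Fmt and Type header fields and report
 * the routing method, per Table 3-5 and Table 3-6. */
enum routing { ADDRESS_ROUTING, ID_ROUTING, IMPLICIT_OR_OTHER };

static enum routing tlp_routing(unsigned fmt, unsigned type)
{
    int four_dw   = fmt & 0x1;      /* 01b/11b: 4DW header            */
    int with_data = fmt & 0x2;      /* 10b/11b: data payload present  */
    (void)four_dw; (void)with_data; /* size/format info, unused below */

    if ((type & 0x18) == 0x10)      /* 1 0rrr: Message request        */
        return IMPLICIT_OR_OTHER;   /* routing given by sub-field rrr */
    switch (type) {
    case 0x00: case 0x01: case 0x02:        /* MRd/MWr, MRdLk, IORd/IOWr  */
        return ADDRESS_ROUTING;
    case 0x04: case 0x05:                   /* CfgRd0/Wr0, CfgRd1/Wr1     */
    case 0x0A: case 0x0B:                   /* Cpl/CplD, CplLk/CplDLk     */
        return ID_ROUTING;
    default:
        return IMPLICIT_OR_OTHER;
    }
}

int main(void)
{
    printf("MWr  (fmt=10b type=00000b): %d\n", tlp_routing(0x2, 0x00));
    printf("CplD (fmt=10b type=01010b): %d\n", tlp_routing(0x2, 0x0A));
    printf("MsgD (fmt=11b type=10000b): %d\n", tlp_routing(0x3, 0x10));
    return 0;
}
```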

Applying Routing Mechanisms

Once configuration of the system routing strategy is complete and transactions are enabled, PCI Express devices decode inbound TLP headers and use corresponding fields in configuration space Base Address Registers, Base/Limit registers, and Bus Number registers to apply address, ID, and implicit routing to the packet. Note that there are actually two levels of decision: the device first determines if the packet targets an internal location; if not, and the device is a switch, it will evaluate the packet to see if it should be forwarded out of an egress port. A third possibility is that the packet has been received in error or is malformed; in this case, it will be handled as a receive error. There are a number of cases when this may happen, and a number of ways it may be handled. Refer to "PCI Express Error Checking Mechanisms" on page 356 for a description of error checking and handling. The following sections describe the basic features of each routing mechanism; we will assume no errors are encountered.

Address Routing

PCI Express transactions using address routing reference the same system memory and IO maps that PCI and PCIX transactions do. Address routing is used to transfer data to or from memory, memory mapped IO, or IO locations. Memory transaction requests may carry either 32 bit addresses using the 3DW TLP header format, or 64 bit addresses using the 4DW TLP header format. IO transaction requests are restricted to 32 bits of address using the 3DW TLP header format, and should only target legacy devices.

Memory and IO Address Maps

Figure 3-6 on page 122 depicts generic system memory and IO maps. Note that the size of the system memory map is a function of the range of addresses that devices are capable of generating (often dictated by the CPU address bus). As in PCI and PCI-X, PCI Express permits either 32 bit or 64 bit memory addressing. The size of the system IO map is limited to 32 bits (4GB), although in many systems only the lower 16 bits (64KB) are used.
Figure 3-6: Generic System Memory And IO Address Maps

Key TLP Header Fields in Address Routing

If the Type field in a received TLP indicates address routing is to be used, then the Address Fields in the header are used to perform the routing check. There are two cases: 32-bit addresses and 64-bit addresses.
TLPs with 3DW, 32-Bit Address. For IO or 32-bit memory requests, only 32 bits of address are contained in the header. Devices targeted with these TLPs will reside below the 4GB memory or IO address boundary. Figure 3-7 on page 123 depicts this case.
Figure 3-7: 3DW TLP Header Address Routing Fields

TLPs With 4DW, 64-Bit Address. For 64-bit memory requests, 64 bits of address are contained in the header. Devices targeted with these TLPs will reside above the 4GB memory boundary. Figure 3-8 on page 124 shows this case.
Figure 3-8: 4DW TLP Header Address Routing Fields

An Endpoint Checks an Address-Routed TLP

If the Type field in a received TLP indicates address routing is to be used, then an endpoint device simply checks the address in the packet header against each of its implemented BARs in its Type 0 configuration space header. As it has only one link interface, it will either claim the packet or reject it. Figure 3-9 on page 125 illustrates this case.
Figure 3-9: Endpoint Checks Routing Of An Inbound TLP Using Address Routing

A Switch Receives an Address Routed TLP: Two Checks

General. If the Type field in a received TLP indicates address routing is to be used, then a switch first checks to see if it is the intended completer. It compares the header address against target addresses programmed in its two BARs. If the address falls within the range, it consumes the packet. This case is indicated by (1) in Figure 3-10 on page 126. If the header address field does not match a range programmed in a BAR, it then checks the Type 1 configuration space header for each downstream link. It checks the non-prefetchable memory (MMIO) and prefetchable Base/Limit registers if the transaction targets memory, or the I/O Base and Limit registers if the transaction targets I/O address space. This check is indicated by (2) in Figure 3-10 on page 126.
Figure 3-10: Switch Checks Routing Of An Inbound TLP Using Address Routing
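Conceptually, the two-step check can be sketched as below. The structure and field names are illustrative stand-ins, not the actual configuration register layout:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch of the two-step check a switch applies to an
 * address-routed TLP: (1) its own BARs, (2) the Base/Limit windows of
 * each downstream (secondary) interface. */
struct bar_window { bool implemented; uint64_t base, limit; };
struct downstream {
    uint64_t mem_base, mem_limit;   /* non-prefetchable memory (MMIO) */
    uint64_t pf_base,  pf_limit;    /* prefetchable memory            */
    uint64_t io_base,  io_limit;    /* IO                             */
};

enum verdict { CONSUME, FORWARD_DOWNSTREAM, UNSUPPORTED_REQUEST };

enum verdict route_address(uint64_t addr, bool is_io,
                           const struct bar_window bars[2],
                           const struct downstream *dn, int ndn,
                           int *egress)
{
    /* Check 1: does the TLP target the switch itself? */
    for (int i = 0; i < 2; i++)
        if (bars[i].implemented &&
            addr >= bars[i].base && addr <= bars[i].limit)
            return CONSUME;

    /* Check 2: does it fall inside a downstream Base/Limit window? */
    for (int p = 0; p < ndn; p++) {
        bool hit = is_io
            ? (addr >= dn[p].io_base  && addr <= dn[p].io_limit)
            : ((addr >= dn[p].mem_base && addr <= dn[p].mem_limit) ||
               (addr >= dn[p].pf_base  && addr <= dn[p].pf_limit));
        if (hit) { *egress = p; return FORWARD_DOWNSTREAM; }
    }
    return UNSUPPORTED_REQUEST;   /* downstream-moving TLP with no match */
}
```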
Other Notes About Switch Address-Routing. The following notes also apply to switch address routing:
  1. If the address-routed packet address falls in the range of one of its secondary bridge interface Base/Limit register sets, it will forward the packet downstream.
  2. If the address-routed packet was moving downstream (was received on the primary interface) and it does not map to any BAR or downstream link Base/Limit registers, it will be handled as an unsupported request on the primary link.
  3. Upstream address-routed packets are always forwarded to the upstream link if they do not target an internal location or another downstream link.

ID Routing

ID routing is based on the logical position (Bus Number, Device Number, Function Number) of a device function within the PCI bus topology. ID routing is compatible with routing methods used in the PCI and PCIX protocols when performing Type 0 or Type 1 configuration transactions. In PCI Express, it is also used for routing completions and may be used in message routing as well.

ID Bus Number, Device Number, Function Number Limits

PCI Express supports the same basic topology limits as PCI and PCI-X:
  1. A maximum of 256 busses/links in a system. This includes busses created by bridges to other PCI-compatible protocols such as PCI, PCI-X, AGP, etc.
  2. A maximum of 32 devices per bus/link. Of course, while a PCI(X) bus or the internal bus of a switch may host more than one downstream bridge interface, external PCI Express links are always point-to-point with only two devices per link. The downstream device on an external link is device 0.
  3. A maximum of 8 internal functions per device.
A significant difference between PCI Express and PCI is the provision for extending the amount of configuration space per function from 256 bytes to 4KB. Refer to the "Configuration Overview" on page 711 for a detailed description of the compatible and extended areas of PCI Express configuration space.

Key TLP Header Fields in ID Routing

If the Type field in a received TLP indicates ID routing is to be used, then the ID fields in the header are used to perform the routing check. There are two cases: ID routing with a 3DW header and ID routing with a 4DW header.
3DW TLP, ID Routing. Figure 3-11 on page 128 illustrates a TLP using ID routing and the 3DW header.
Figure 3-11: 3DW TLP Header ID Routing Fields
4DW TLP, ID Routing. Figure 3-12 on page 129 illustrates a TLP using ID routing and the 4DW header.

An Endpoint Checks an ID-Routed TLP

If the Type field in a received TLP indicates ID routing is to be used, then an endpoint device simply checks the ID field in the packet header against its own Bus Number, Device Number, and Function Number(s). In PCI Express, each device "captures" (and remembers) its own Bus Number and Device Number contained in TLP header bytes 8-9 each time a configuration write (Type 0) is detected on its primary link. At reset, all bus and device numbers in the system revert to 0, so a device will not respond to transactions other than configuration cycles until at least one configuration write cycle (Type 0) has been performed. Note that the PCI Express protocol does not define a configuration space location where the device function is required to store the captured Bus Number and Device Number information, only that it must do so.
Once again, as it has only one link interface, an endpoint will either claim an ID-routed packet or reject it. Figure 3-11 on page 128 illustrates this case.
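A minimal sketch of this behavior is shown below. The storage for the captured Bus and Device Numbers is implementation-specific, so the structure and function names here are purely illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch: an endpoint's handling of ID-routed TLPs. Bus and Device
 * Numbers are captured from each Type 0 configuration write; the spec
 * does not define where the captured values must be kept. */
struct endpoint {
    uint8_t bus;            /* captured Bus Number                */
    uint8_t dev;            /* captured Device Number             */
    uint8_t num_functions;  /* functions implemented (1..8)       */
};

/* Called when a Type 0 configuration write arrives on the link. */
void capture_id(struct endpoint *ep, uint8_t bus, uint8_t dev)
{
    ep->bus = bus;
    ep->dev = dev;
}

/* Returns true if an ID-routed TLP (completion, configuration request,
 * or ID-routed message) targets this endpoint and should be consumed. */
bool claim_id_routed(const struct endpoint *ep,
                     uint8_t bus, uint8_t dev, uint8_t func)
{
    return bus == ep->bus && dev == ep->dev && func < ep->num_functions;
}
```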

A Switch Receives an ID-Routed TLP: Two Checks

If the Type field in a received TLP indicates ID routing is to be used, then a switch first checks to see if it is the intended completer. It compares the header ID field against its own Bus Number, Device Number, and Function Number(s). This is indicated by (1) in Figure 3-13 on page 131. As in the case of an endpoint, a switch captures its own Bus Number and Device Number each time a configuration write (Type 0) is detected on its primary link interface. If the header ID agrees with the ID of the switch, it consumes the packet. If the ID field does not match its own, it then checks the Secondary-Subordinate Bus Number registers in the configuration space for each downstream link. This check is indicated by (2) in Figure 3-13 on page 131.

Other Notes About Switch ID Routing

  1. If the ID-routed packet matches the range of one of its secondary bridge interface Secondary-Subordinate registers, it will forward the packet downstream.
  2. If the ID-routed packet was moving downstream (was received on the primary interface) and it does not map to any downstream interface, it will be handled as an unsupported request on the primary link.
  3. Upstream ID-routed packets are always forwarded to the upstream link if they do not target an internal location or another downstream link.

Figure 3-13: Switch Checks Routing Of An Inbound TLP Using ID Routing (only one header is shown; each link interface has its own)
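The downstream forwarding decision reduces to a range check against each downstream interface's Secondary and Subordinate Bus Number registers, sketched below (names are illustrative; the switch's own ID check and upstream forwarding are handled separately):

```c
#include <stdint.h>

/* Sketch of a switch's ID-routing decision using the Secondary and
 * Subordinate Bus Number registers of each downstream interface. */
struct dn_bridge { uint8_t secondary, subordinate; };

/* Returns the downstream interface index that should forward a TLP
 * addressed to 'bus', or -1 if no downstream interface claims it. */
int id_route(const struct dn_bridge *dn, int n, uint8_t bus)
{
    for (int i = 0; i < n; i++)
        if (bus >= dn[i].secondary && bus <= dn[i].subordinate)
            return i;
    return -1;
}
```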

Implicit Routing

Implicit routing is based on the intrinsic knowledge PCI Express devices are required to have concerning upstream and downstream traffic and the existence of a single PCI Express Root Complex at the top of the PCI Express topology. This awareness allows limited routing of packets without the need to assign and include addresses with certain message packets. Because the Root Complex generally implements power management and interrupt controllers, as well as system error handling, it is either the source or recipient of most PCI Express messages.

Only Messages May Use Implicit Routing

With the elimination of many sideband signals in the PCI Express protocol, alternate methods are required to inform the host system when devices need service with respect to interrupts, errors, power management, etc. PCI Express addresses this by defining a number of special TLPs which may be used as virtual wires in conveying sideband events. Message groups currently defined include:
  • Power Management
  • INTx legacy interrupt signaling
  • Error signaling
  • Locked Transaction support
  • Hot Plug signaling
  • Vendor-specific messages
  • Slot Power Limit messages

Messages May Also Use Address or ID Routing

In systems where all or some of this event traffic should target the system memory map or a logical location in the PCI bus topology, address routing and ID routing may be used in place of implicit routing. If address or ID routing is chosen for a message, then the routing mechanisms just described are applied in the same way as they would for other posted write packets.

Routing Sub-Field in Header Indicates Routing Method

As a message TLP moves between PCI Express devices, packet header fields indicate both that it is a message, and whether it should be routed using address, ID, or implicitly.

Key TLP Header Fields in Implicit Routing

If the Type field in a received message TLP indicates implicit routing is to be used, then the routing sub-field in the header is also used to determine the message destination when the routing check is performed. Figure 3-14 on page 133 illustrates a message TLP using implicit routing.
Figure 3-14: 4DW Message TLP Header Implicit Routing Fields

Message Type Field Summary

Table 3-7 on page 134 summarizes the use of the TLP header Type field when a message is being sent. As shown, the upper two bits of the 5 bit Type field indicate the packet is a message, and the lower three bits are the routing sub-field which specify the routing method to apply. Note that the 4DW header is always used with message TLPs, regardless of the routing option selected.
Table 3-7: Message Request Header Type Field Usage
Type Field Bits | Description
Bits 4:3 | Defines the type of transaction: 10b = Message Transaction
Bits 2:0 | Message Routing Sub-field R[2:0], used to select message routing: 000b = Route to Root Complex; 001b = Use Address Routing; 010b = Use ID Routing; 011b = Route as a Broadcast Message from Root Complex; 100b = Local message, terminate at receiver (INTx messages); 101b = Gather & route to Root Complex (PME_TO_Ack message)
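For reference, the sub-field decoding in Table 3-7 can be expressed as a simple lookup (a sketch; the strings are informal descriptions, not specification terms):

```c
#include <stdio.h>

/* Sketch: interpret the message routing sub-field r[2:0] of the
 * Type field, per Table 3-7. Values not listed there are reserved. */
static const char *msg_routing(unsigned rrr)
{
    switch (rrr & 0x7) {
    case 0x0: return "Route to Root Complex (implicit)";
    case 0x1: return "Use address routing";
    case 0x2: return "Use ID routing";
    case 0x3: return "Broadcast from Root Complex (implicit)";
    case 0x4: return "Local - terminate at receiver (implicit)";
    case 0x5: return "Gather and route to Root Complex (implicit)";
    default:  return "Reserved";
    }
}

int main(void)
{
    for (unsigned r = 0; r < 8; r++)
        printf("%u: %s\n", r, msg_routing(r));
    return 0;
}
```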

An Endpoint Checks a TLP Routed Implicitly

If the Type field in a received message TLP indicates implicit routing is to be used, then an endpoint device simply checks that the routing sub-field is appropriate for it. For example, an endpoint may accept a broadcast message or a message which terminates at the receiver; it won't accept messages which implicitly target the Root Complex.

A Switch Receives a TLP Routed Implicitly

If the Type field in a received message TLP indicates implicit routing is to be used, then a switch device simply considers the ingress port it arrived on and whether the routing sub-field code is appropriate for it. Some examples:
  1. The upstream link interface of a switch may legitimately receive a broadcast message routed implicitly from the Root Complex. If it does, it will forward it intact onto all downstream links. It should not see an implicitly routed broadcast message arrive on a downstream ingress port, and will handle this as a malformed TLP.
  2. The switch may accept messages indicating implicit routing to the Root Complex on secondary links; it will forward all of these upstream because it "knows" the location of the Root Complex is on its primary side. It would not accept messages routed implicitly to the Root Complex if they arrived on the primary link receive interface.
  3. If an implicitly-routed message arrives on either an upstream or downstream ingress port, the switch may consume the packet if routing indicates it should terminate at the receiver.
  4. If messages are routed using address or ID methods, a switch will simply perform normal address checks in deciding whether to accept or forward them.

Plug-And-Play Configuration of Routing Options

PCI-compatible configuration space and PCI Express extended configuration space are covered in detail in Part 6. For reference, the programming of three sets of configuration space registers related to routing is summarized here.

Routing Configuration Is PCI-Compatible

PCI Express supports the basic 256 byte PCI configuration space common to all compatible devices, including the Type 0 and Type 1 PCI configuration space header formats used by non-bridge and switch/bridge devices, respectively. Devices may implement basic PCI-equivalent functionality with no change to drivers or Operating System software.

Two Configuration Space Header Formats: Type 0, Type 1

PCI Express endpoint devices support a single PCI Express link and use the Type 0 (non-bridge) format header. Switch/bridge devices support multiple links, and implement a Type 1 format header for each link interface. Figure 3-15 on page 136 illustrates a PCI Express topology and the use of configuration space Type 0 and Type 1 header formats.

Routing Registers Are Located in Configuration Header

As with PCI, registers associated with transaction routing are located in the first 64 bytes (16 DW) of configuration space (referred to in PCI Express as the PCI 2.3 compatible header area). The three sets of registers of principal interest are:
  1. Base Address Registers (BARs) found in Type 0 and Type 1 headers.
  2. Three sets of Base/Limit Register pairs supported in the Type 1 header of switch/bridge devices.
  3. Three Bus Number Registers, also found in the Type 1 headers of switch/bridge devices.
Figure 3-16 on page 137 illustrates the Type 0 and Type 1 PCI Express Configuration Space header formats. Key routing registers are indicated.
Figure 3-15: PCI Express Devices And Type 0 And Type 1 Header Use

Base Address Registers (BARs): Type 0, 1 Headers

General

The first of the configuration space registers related to routing are the Base Address Registers (BARs). These are marked "<1" in Figure 3-16 on page 137, and are implemented by all devices which require system memory, IO, or memory-mapped IO (MMIO) addresses allocated to them as targets. The location and use of BARs is compatible with PCI and PCI-X. As shown in Figure 3-16 on page 137, a Type 0 configuration space header has 6 BARs available to the device designer (at DW 4-9), while a Type 1 header has only two BARs (at DW 4-5).
After discovering device resource requirements, system software programs each BAR with the start address for the range of addresses the device may respond to as a completer (target). Setting up a BAR involves several steps:
  1. The device designer uses a BAR to hard-code a request for an allocation of one block of prefetchable or non-prefetchable memory, or of IO addresses, in the system memory or IO map. A pair of adjacent BARs are concatenated if a 64-bit memory request is being made.
  2. Hard-coded bits in the BAR include an indication of the request type, the size of the request, and whether the target device may be considered prefetchable (memory requests only).
Figure 3-16: PCI Express Configuration Space Type 0 and Type 1 Headers
During enumeration, all PCI-compatible devices are discovered and the BARs are examined by system software to decode the request. Once the system memory and IO maps are established, software programs upper bits in implemented BARs with the start address for the block allocated to the target.

BAR Setup Example One: 1MB, Prefetchable Memory Request

Figure 3-17 depicts the basic steps in setting up a BAR which is being used to track a 1MB block of prefetchable addresses for a device residing in the system memory map. In the diagram, the BAR is shown at three points in the configuration process:
  1. The uninitialized BAR in Figure 3-17 is as it looks after power-up or a reset. While the designer has tied lower bits to indicate the request type and size, there is no requirement about how the upper bits (which are read-write) must come up in a BAR, so these bits are indicated with XXXXX. System software will first write all 1s to the BAR to set all read-write bits = 1. Of course, the hard-coded lower bits are not affected by the configuration write.
  2. The second view of the BAR shown in Figure 3-17 is as it looks after configuration software has performed the write of all 1s to it. The next step in configuration is a read of the BAR to check the request. Table 3-8 on page 140 summarizes the results of this configuration read.
  3. The third view of the BAR shown in Figure 3-17 on page 139 is as it looks after configuration software has performed another configuration write (Type 0) to program the start address for the block. In this example, the device start address is 2GB, so bit 31 is written = 1 (2^31 = 2GB) and all other upper bits are written = 0.
At this point the configuration of the BAR is complete. Once software enables memory address decoding in the PCI command register, the device will claim memory transactions in the range 2GB to 2GB + 1MB.

Figure 3-17: 32-Bit Prefetchable Memory BAR Set Up

Table 3-8: Results Of Reading The BAR after Writing All "1s" To It
BAR Bits | Meaning
0 | Read back as "0", indicating a memory request.
2:1 | Read back as 00b, indicating the target only supports a 32-bit address decoder.
3 | Read back as "1", indicating the request is for prefetchable memory.
19:4 | All read back as "0", used to help indicate the size of the request (also see bit 20).
31:20 | All read back as "1" because software has not yet programmed the upper bits with a start address for the block. Note that because bit 20 was the first bit (above bit 3) to read back as written (= 1), this indicates the memory request size is 1MB (2^20 = 1MB).
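The sizing handshake walked through above can be sketched in C. The configuration-access functions here are stand-ins (no real configuration mechanism is shown), and the read-back value is hard-wired to the example of Figure 3-17 and Table 3-8:

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the sizing handshake software performs on a 32-bit memory
 * BAR. The value read back matches the example: 1MB, prefetchable,
 * 32-bit decoder (upper bits read back as written, lower bits wired). */
static uint32_t bar_after_all_ones = 0xFFF00008u;

static uint32_t cfg_read32(void)        { return bar_after_all_ones; }
static void     cfg_write32(uint32_t v) { (void)v; /* illustrative only */ }

int main(void)
{
    cfg_write32(0xFFFFFFFFu);                 /* step 1: write all 1s   */
    uint32_t val = cfg_read32();              /* step 2: read back      */

    if ((val & 0x1) == 0) {                   /* bit 0: 0 = memory BAR  */
        int prefetchable = (val >> 3) & 0x1;  /* bit 3                  */
        int is_64bit     = ((val >> 1) & 0x3) == 0x2;  /* bits 2:1      */
        uint32_t size    = ~(val & ~0xFu) + 1;/* clear attribute bits,  */
                                              /* invert, add 1 -> size  */
        printf("memory BAR: %s, %s decoder, size %u bytes\n",
               prefetchable ? "prefetchable" : "non-prefetchable",
               is_64bit ? "64-bit" : "32-bit", size);
    }
    /* step 3 (not shown): write the allocated start address, e.g. 2GB. */
    return 0;
}
```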

BAR Setup Example Two: 64-Bit, 64MB Memory Request

Figure 3-18 on page 141 depicts the basic steps in setting up a pair of BARs being used to track a 64MB block of prefetchable addresses for a device residing in the system memory map. In the diagram, the BARs are shown at three points in the configuration process:
  1. The uninitialized BARs are as they look after power-up or a reset. The designer has hard-coded lower bits of the lower BAR to indicate the request type and size; the upper BAR bits are all read-write. System software will first write all 1s to both BARs to set all read-write bits = 1. Of course, the hard-coded bits in the lower BAR are unaffected by the configuration write.
  2. The second view of the BARs in Figure 3-18 on page 141 shows them as they look after configuration software has performed the write of all 1s to both. The next step in configuration is a read of the BARs to check the request. Table 3-9 on page 142 summarizes the results of this configuration read.
  3. The third view of the BAR pair in Figure 3-18 on page 141 indicates conditions after configuration software has performed two configuration writes (Type 0) to program the two halves of the 64-bit start address for the block. In this example, the device start address is 16GB, so bit 2 of the Upper BAR (address bit 34 in the BAR pair) is written = 1 (2^34 = 16GB); all other read-write bits in both BARs are written = 0.

At this point the configuration of the BAR pair is complete. Once software enables memory address decoding in the PCI command register, the device will claim memory transactions in the range 16GB to 16GB + 64MB.

Table 3-9: Results Of Reading The BAR Pair after Writing All "1s" To Both
BAR | BAR Bits | Meaning
Lower | 0 | Read back as "0", indicating a memory request.
Lower | 2:1 | Read back as 10b, indicating the target supports a 64-bit address decoder and that this BAR is concatenated with the next.
Lower | 3 | Read back as "1", indicating the request is for prefetchable memory.
Lower | 25:4 | All read back as "0", used to help indicate the size of the request (also see bit 26).
Lower | 31:26 | All read back as "1" because software has not yet programmed the upper bits with a start address for the block. Note that because bit 26 was the first bit (above bit 3) to read back as written (= 1), this indicates the memory request size is 64MB (2^26 = 64MB).
Upper | 31:0 | All read back as "1". These bits will be used as the upper 32 bits of the 64-bit start address programmed by system software.

BAR Setup Example Three: 256-Byte IO Request

Figure 3-19 on page 143 depicts the basic steps in setting up a BAR which is being used to track a 256 byte block of IO addresses for a legacy PCI Express device residing in the system IO map. In the diagram, the BAR is shown at three points in the configuration process:
  1. The uninitialized BAR in Figure 3-19 is as it looks after power-up or a reset. System software first writes all 1s to the BAR to set all read-write bits = 1. Of course, the hard-coded bits are unaffected by the configuration write.
  2. The second view of the BAR shown in Figure 3-19 on page 143 is as it looks after configuration software has performed the write of all 1s to it. The next step in configuration is a read of the BAR to check the request. Table 3-10 on page 144 summarizes the results of this configuration read.
  3. The third view of the BAR shown in Figure 3-19 on page 143 is as it looks after configuration software has performed another configuration write (Type 0) to program the start address for the IO block. In this example, the device start address is 16KB, so bit 14 is written = 1 (2^14 = 16KB); all other upper bits are written = 0.

At this point the configuration of the IO BAR is complete. Once software enables IO address decoding in the PCI command register, the device will claim IO transactions in the range 16KB to 16KB + 256 bytes.
Figure 3-19: IO BAR Set Up

Table 3-10: Results Of Reading The IO BAR after Writing All "1s" To It
BAR Bits | Meaning
0 | Read back as "1", indicating an IO request.
1 | Reserved. Tied low and read back as "0".
7:2 | All read back as "0", used to help indicate the size of the request (also see bit 8).
31:8 | All read back as "1" because software has not yet programmed the upper bits with a start address for the block. Note that because bit 8 was the first bit (above bit 1) to read back as written (= 1), this indicates the IO request size is 256 bytes (2^8 = 256).

Base/Limit Registers, Type 1 Header Only

General

The second set of configuration registers related to routing are found only in Type 1 configuration headers and are used when forwarding address-routed TLPs. Marked "<2" in Figure 3-16 on page 137, these are the three sets of Base/Limit registers programmed in each bridge interface to enable a switch/bridge to claim and forward address-routed TLPs to a secondary bus. Three sets of Base/Limit Registers are needed because transactions are handled differently (e.g. prefetching, write-posting, etc.) in the prefetchable memory, non-prefetchable memory (MMIO), and IO address domains. The Base Register in each pair establishes the start address for the community of downstream devices and the Limit Register defines the upper address for that group of devices. The three sets of Base/Limit Registers include:
  • Prefetchable Memory Base and Limit Registers
  • Non-Prefetchable Memory Base and Limit Registers
  • I/O Base and Limit Registers

Prefetchable Memory Base/Limit Registers

The Prefetchable Memory Base/Limit registers are located at DW 9, and the Prefetchable Memory Base/Limit Upper registers at DW 10-11, within the Type 1 header. These registers track all downstream prefetchable memory devices. Either 32-bit or 64-bit addressing can be supported by these registers. If the Upper Registers are not implemented, only 32 bits of memory addressing is available, and the TLP headers mapping to this space will be the 3DW format. If the Upper registers are implemented and system software maps the device above the 4GB boundary, TLPs accessing the device will carry the 4DW header format. In the example shown in Figure 3-20 on page 145, a 6GB prefetchable address range is being set up for the secondary link of a switch.
Register programming in the example shown in Figure 3-20 on page 145 is summarized in Table 3-11.
Figure 3-20: 6GB, 64-Bit Prefetchable Memory Base/Limit Register Set Up
Table 3-11: 6 GB, 64-Bit Prefetchable Base/Limit Register Setup
Register | Value | Use
Prefetchable Memory Base | 8001h | The upper 3 nibbles (800h) provide the most significant 3 digits of the 32-bit Base Address for prefetchable memory behind this switch; the lower 5 digits of the address are assumed to be 00000h. The least significant nibble of this register value (1h) indicates that a 64-bit address decoder is supported and that the Upper Base/Limit registers are also used.
Prefetchable Memory Limit | FFF1h | The upper 3 nibbles (FFFh) provide the most significant 3 digits of the 32-bit Limit Address for prefetchable memory behind this switch; the lower 5 digits of the address are assumed to be FFFFFh. The least significant nibble of this register value (1h) indicates that a 64-bit address decoder is supported and that the Upper Base/Limit registers are also used.
Prefetchable Memory Base Upper 32 Bits | 00000001h | Upper 32 bits of the 64-bit Base address for prefetchable memory behind this switch.
Prefetchable Memory Limit Upper 32 Bits | 00000002h | Upper 32 bits of the 64-bit Limit address for prefetchable memory behind this switch.
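The way these four register values combine into a 64-bit window can be shown with a short calculation using the values from Table 3-11 (a sketch; the function names are ours):

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch: reconstruct the 64-bit prefetchable memory window from the
 * register values of Table 3-11. Bits 15:4 of the 16-bit Base/Limit
 * registers supply address bits 31:20; the remaining low address bits
 * are implied (00000h for Base, FFFFFh for Limit); the Upper 32 Bits
 * registers supply address bits 63:32. */
static uint64_t pf_base(uint16_t base_reg, uint32_t base_upper)
{
    return ((uint64_t)base_upper << 32)
         | ((uint64_t)(base_reg & 0xFFF0u) << 16);
}
static uint64_t pf_limit(uint16_t limit_reg, uint32_t limit_upper)
{
    return ((uint64_t)limit_upper << 32)
         | ((uint64_t)(limit_reg & 0xFFF0u) << 16) | 0xFFFFFu;
}

int main(void)
{
    uint64_t base  = pf_base (0x8001, 0x00000001);   /* Table 3-11 values */
    uint64_t limit = pf_limit(0xFFF1, 0x00000002);
    printf("window: 0x%llx .. 0x%llx (%llu GB)\n",
           (unsigned long long)base, (unsigned long long)limit,
           (unsigned long long)((limit - base + 1) >> 30));
    return 0;
}
```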

Non-Prefetchable Memory Base/Limit Registers

Non-Prefetchable Memory Base/Limit (at DW 8). These registers are used to track all downstream non-prefetchable memory (memory mapped IO) devices. Non-prefetchable memory devices are limited to 32 bit addressing; TLPs targeting them always use the 3DW header format.
Register programming in the example shown in Figure 3-21 on page 147 is summarized in Table 3-12.

Figure 3-21: 2MB, 32-Bit Non-Prefetchable Base/Limit Register Set Up
Table 3-12: 2MB, 32-Bit Non-Prefetchable Base/Limit Register Setup
Register | Value | Use
Memory Base (Non-Prefetchable) | 1210h | The upper 3 nibbles (121h) provide the most significant 3 digits of the 32-bit Base Address for non-prefetchable memory behind this switch; the lower 5 digits of the address are assumed to be 00000h. The least significant nibble of this register value (0h) is reserved and should be set = 0.
Memory Limit (Non-Prefetchable) | 1220h | The upper 3 nibbles (122h) provide the most significant 3 digits of the 32-bit Limit Address for non-prefetchable memory behind this switch; the lower 5 digits of the address are assumed to be FFFFFh. The least significant nibble of this register value (0h) is reserved and should be set = 0.

IO Base/Limit Registers

IO Base/Limit (at DW 7) and IO Base/Limit Upper registers (at DW 12). These registers are used to track all downstream IO target devices. If the Upper Registers are used, then IO address space may be extended to a full 32 bits (4GB). If they are not implemented, then IO address space is limited to 16 bits (64KB). In either case, TLPs targeting these IO devices always carry the 3DW header format.
Register programming in the example shown in Figure 3-22 on page 149 is summarized in Table 3-13 on page 150.
Figure 3-22: IO Base/Limit Register Set Up
Table 3-13: 256 Byte IO Base/Limit Register Setup
Register | Value | Use
IO Base | 21h | The upper nibble (2h) specifies the most significant hex digit of the 32-bit IO Base address (the lower digits are 000h). The lower nibble (1h) indicates that the device supports 32-bit IO behind the bridge interface. This also means the device implements the Upper IO Base/Limit register set, and those registers are concatenated with Base/Limit.
IO Limit | 41h | The upper nibble (4h) specifies the most significant hex digit of the 32-bit IO Limit address (the lower digits are FFFh). The lower nibble (1h) indicates that the device supports 32-bit IO behind the bridge interface. This also means the device implements the Upper IO Base/Limit register set, and those registers are concatenated with Base/Limit.
IO Base Upper 16 Bits | 0000h | The upper 16 bits of the 32-bit Base address for IO behind this switch.
IO Limit Upper 16 Bits | 0000h | The upper 16 bits of the 32-bit Limit address for IO behind this switch.
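For comparison, a similar sketch (illustrative C, helper names my own) decodes the IO window from the Table 3-13 register values: the 8-bit IO Base/Limit registers supply only address bits 15:12, and the Upper 16 Bits registers extend the decode to 32 bits.

/* A minimal sketch of combining the 8-bit IO Base/Limit registers with the
 * optional Upper 16 Bits registers into a 32-bit IO window.  Values are
 * those shown in Table 3-13; helper names are illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

static uint32_t io_base(uint8_t base_reg, uint16_t base_upper)
{
    /* Upper nibble supplies address bits 15:12; low 12 bits assumed 000h. */
    return ((uint32_t)base_upper << 16) | ((uint32_t)(base_reg & 0xF0) << 8);
}

static uint32_t io_limit(uint8_t limit_reg, uint16_t limit_upper)
{
    /* Upper nibble supplies address bits 15:12; low 12 bits assumed FFFh. */
    return ((uint32_t)limit_upper << 16) |
           ((uint32_t)(limit_reg & 0xF0) << 8) | 0xFFF;
}

int main(void)
{
    int supports_32bit_io = (0x21 & 0x0F) == 0x1;   /* low nibble 1h */

    printf("32-bit IO decode: %d\n", supports_32bit_io);
    printf("IO base  = 0x%08x\n", io_base(0x21, 0x0000));   /* 0x00002000 */
    printf("IO limit = 0x%08x\n", io_limit(0x41, 0x0000));  /* 0x00004FFF */
    return 0;
}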

Bus Number Registers, Type 1 Header Only

The third set of configuration registers related to routing is used when forwarding ID-routed TLPs, including configuration cycles, completions, and (optionally) messages. These are marked "<3" in Figure 3-16 on page 137. As in PCI, a switch/bridge interface requires three registers: Primary Bus Number, Secondary Bus Number, and Subordinate Bus Number. The function of these registers is summarized here.


Primary Bus Number

The Primary Bus Number register contains the bus (link) number to which the upstream side of a bridge (switch) is connected. In PCI Express, the primary bus is the one in the direction of the Root Complex and host processor.

Secondary Bus Number

The Secondary Bus Number register contains the bus (link) number to which the downstream side of a bridge (switch) is connected.

Subordinate Bus Number

The Subordinate Bus Number register contains the highest bus (link) number on the downstream side of a bridge (switch). The Subordinate and Secondary Bus Number registers will contain the same value unless there is another bridge (switch) on the secondary side.
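As a simple illustration of how a bridge interface applies these three registers when routing by ID, consider the sketch below (illustrative C; the structure and function names are my own, not defined by the specification).

/* A minimal sketch of the bus-number decode performed by a bridge interface
 * for ID-routed TLPs.  The names are illustrative only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct bridge_bus_regs {
    uint8_t primary;      /* bus (link) on the upstream side            */
    uint8_t secondary;    /* bus (link) directly on the downstream side */
    uint8_t subordinate;  /* highest bus number reachable downstream    */
};

/* An ID-routed TLP is forwarded downstream if its target bus number lies
 * in the range [secondary, subordinate]. */
static bool forward_downstream(const struct bridge_bus_regs *b, uint8_t bus)
{
    return bus >= b->secondary && bus <= b->subordinate;
}

int main(void)
{
    struct bridge_bus_regs br = { .primary = 0, .secondary = 1, .subordinate = 4 };
    printf("bus 3 -> forward: %d\n", forward_downstream(&br, 3));  /* 1 */
    printf("bus 7 -> forward: %d\n", forward_downstream(&br, 7));  /* 0 */
    return 0;
}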

A Switch Is a Two-Level Bridge Structure

Because PCI does not natively support bridges with multiple downstream ports, PCI Express switch devices appear logically as two-level PCI bridge structures, consisting of a single bridge to the primary link and an internal PCI bus which hosts one or more virtual bridges to the secondary interfaces. Each bridge interface has an independent Type 1 format configuration header with its own sets of Base/Limit registers and Bus Number registers. Figure 3-23 on page 152 illustrates the bus numbering associated with the external links and internal bus of a switch. Note that the secondary bus of the primary link interface is the internal virtual bus, and that the primary interfaces of all downstream link interfaces connect logically to the internal bus.
Figure 3-23: Bus Number Registers In A Switch

4 Packet-Based Transactions

The Previous Chapter

The previous chapter described the general concepts of PCI Express transaction routing and the mechanisms used by a device in deciding whether to accept, forward, or reject a packet arriving at an ingress port. Because Data Link Layer Packets (DLLPs) and Physical Layer ordered set link traffic are never forwarded, the emphasis here is on Transaction Layer Packet (TLP) types and the three routing methods associated with them: address routing, ID routing, and implicit routing. Included is a summary of configuration methods used in PCI Express to set up PCI-compatible plug-and-play addressing within system IO and memory maps, as well as key elements in the PCI Express packet protocol used in making routing decisions.

This Chapter

Information moves between PCI Express devices in packets, and the two major classes of packets are Transaction Layer Packets (TLPs), and Data Link Layer Packets (DLLPs). The use, format, and definition of all TLP and DLLP packet types and their related fields are detailed in this chapter.

The Next Chapter

The next chapter discusses the Ack/Nak protocol that verifies the delivery of TLPs at each port as they travel between the requester and completer devices. It also details the hardware retry mechanism that is automatically triggered when a TLP transmission error is detected on a given link.

Introduction to the Packet-Based Protocol

The PCI Express protocol improves upon methods used by earlier busses (e.g. PCI) to exchange data and to signal system events. In addition to supporting basic memory, IO, and configuration read/write transactions, the links eliminate many sideband signals and replace them with in-band messages.
With the exception of the logical idle indication and Physical Layer Ordered Sets, all information moves across an active PCI Express link in fundamental chunks called packets, which are comprised of 10-bit control (K) and data (D) symbols. The two major classes of packets exchanged between two PCI Express devices are high-level Transaction Layer Packets (TLPs) and low-level link maintenance packets called Data Link Layer Packets (DLLPs). Collectively, the various TLPs and DLLPs allow two devices to perform memory, IO, and Configuration Space transactions reliably, and to use messages to initiate power management events, generate interrupts, report errors, etc. Figure 4-1 on page 155 depicts TLPs and DLLPs on a PCI Express link.

Why Use A Packet-Based Transaction Protocol

There are some distinct advantages in using a packet-based protocol, especially when it comes to data integrity. Three important aspects of PCI Express packet protocol help promote data integrity during link transmission:

Packet Formats Are Well Defined

Some early bus protocols (e.g. PCI) allow transfers of indeterminate (and unlimited) size, making identification of payload boundaries impossible until the end of the transfer. In addition, an early transaction end might be signaled by either agent (e.g. target disconnect on a write or pre-emption of the initiator during a read), resulting in a partial transfer. In these cases, it is difficult for the sender of data to calculate and send a checksum or CRC covering an entire payload, when it may terminate unexpectedly. Instead, PCI uses a simple parity scheme which is applied and checked for each bus phase completed.
In contrast, each PCI Express packet has a known size and format, and the packet header (positioned at the beginning of each DLLP and TLP) indicates the packet type and the presence of any optional fields. The size of each packet field is either fixed or defined by the packet type. The size of any data payload is conveyed in the TLP header Length field. Once a transfer commences, there are no early transaction terminations by the recipient. This structured packet format makes it possible to insert additional information into the packet at prescribed locations, including framing symbols, CRC, and a packet sequence number (TLPs only).
Figure 4-1: TLP And DLLP Packets

Framing Symbols Indicate Packet Boundaries

Each TLP and DLLP packet sent is framed with a Start and End control symbol, clearly defining the packet boundaries to the receiver. Note that the Start and End control (K) symbols appended to packets by the transmitting device are 10 bits each. This is a big improvement over PCI and PCI-X, which use the assertion and de-assertion of a single FRAME# signal to indicate the beginning and end of a transaction. A glitch on the FRAME# signal (or any of the other PCI/PCI-X control signals) could cause a target to misconstrue bus events. In contrast, a PCI Express receiver must properly decode a complete 10-bit symbol before concluding that link activity is beginning or ending. Unexpected or unrecognized control symbols are handled as errors.

CRC Protects Entire Packet

Unlike the side-band parity signals used by PCI devices during the address and each data phase of a transaction, the in-band 16-bit or 32-bit PCI Express CRC value "protects" the entire packet (other than framing symbols). In addition to CRC, TLP packets also have a packet sequence number appended to them by the transmitter so that, if an error is detected at the receiver, the specific packet(s) which were received in error may be resent. The transmitter maintains a copy of each TLP sent in a Retry Buffer until it is checked and acknowledged by the receiver. This TLP acknowledgement mechanism (sometimes referred to as the Ack/Nak protocol) forms the basis of link-level TLP error correction, and is very important in deep topologies where devices may be many links away from the host and where CPU intervention would otherwise be needed when an error occurs.

Transaction Layer Packets

In PCI Express terminology, high-level transactions originate at the device core of the transmitting device and terminate at the core of the receiving device. The Transaction Layer is the starting point in the assembly of outbound Transaction Layer Packets (TLPs), and the end point for disassembly of inbound TLPs at the receiver. Along the way, the Data Link Layer and Physical Layer of each device contribute to the packet assembly and disassembly as described below.

TLPs Are Assembled And Disassembled

Figure 4-2 on page 158 depicts the general flow of TLP assembly at the transmit side of a link and disassembly at the receiver. The key stages in Transaction Layer Packet protocol are listed below. The numbers correspond to those in Figure 4-2.
  1. Device B's core passes a request for service to the PCI Express hardware interface. How this is done is not covered by the PCI Express Specification, and is device-specific. General information contained in the request would include:
  • The PCI Express command to be performed
  • Start address or ID of target (if address routing or ID routing are used)
  • Transaction type (memory read or write, configuration cycle, etc.)
  • Data payload size (and the data to send, if any)
  • Virtual Channel/Traffic class information
  • Attributes of the transfer: No Snoop bit set?, Relaxed Ordering set?, etc.
  2. The Transaction Layer builds the TLP header, data payload, and digest based on the request from the core. Before sending a TLP to the Data Link Layer, flow control credits and ordering rules must be applied.
  3. When the TLP is received at the Data Link Layer, a Sequence Number is assigned and a Link CRC is calculated for the TLP (including the Sequence Number). The TLP is then passed on to the Physical Layer.
  4. At the Physical Layer, byte striping, scrambling, encoding, and serialization are performed. STP and END control (K) characters are appended to the packet. The packet is sent out on the transmit side of the link.
  5. At the Physical Layer receiver of Device A, de-serialization, framing symbol check, decoding, and byte un-striping are performed. Note that the first level of error checking (on the control codes) is performed at the Physical Layer.
  6. The Data Link Layer of the receiver calculates the CRC and checks it against the received value. It also checks the Sequence Number of the TLP for violations. If there are no errors, it passes the TLP up to the Transaction Layer of the receiver. The information is decoded and passed to the core of Device A. The Data Link Layer of the receiver also notifies the transmitter of the success or failure in processing the TLP by sending an Ack or Nak DLLP. In the event of a Nak (No Acknowledge), the transmitter re-sends all TLPs in its Retry Buffer.

Device Core Requests Access to Four Spaces

Transactions are carried out between PCI Express requesters and completers, using four separate address spaces: Memory, IO, Configuration, and Message. (See Table 4-1.)
Table 4-1: PCI Express Address Space And Transaction Types
Address Space | Transaction Types | Purpose
Memory | Read, Write | Transfer data to or from a location in the system memory map. The protocol also supports a locked memory read transaction.
IO | Read, Write | Transfer data to or from a location in the system IO map. PCI Express permits IO address assignment to legacy devices; IO addressing is not permitted for native PCI Express devices.
Configuration | Read, Write | Transfer data to or from a location in the configuration space of a PCI Express device. As in PCI, configuration is used to discover device capabilities, program plug-and-play features, and check status using the 4KB PCI Express configuration space.
Message | Baseline, Vendor-specific | Provides in-band messaging and event reporting (without consuming memory or IO address resources). These are handled the same as posted write transactions.

TLP Transaction Variants Defined

In accessing the four address spaces, PCI Express Transaction Layer Packets (TLPs) carry a header field, called the Type field, which encodes the specific command variant to be used. Table 4-2 on page 160 summarizes the allowed transactions:
Table 4-2: TLP Header Type Field Defines Transaction Variant
TLP Type | Acronym
Memory Read Request | MRd
Memory Read Lock Request | MRdLk
Memory Write Request | MWr
IO Read Request | IORd
IO Write Request | IOWr
Config Type 0 Read Request | CfgRd0
Config Type 0 Write Request | CfgWr0
Config Type 1 Read Request | CfgRd1
Config Type 1 Write Request | CfgWr1
Message Request | Msg
Message Request W/Data | MsgD
Completion | Cpl
Completion W/Data | CplD
Completion-Locked | CplLk
Completion-Locked W/Data | CplDLk

TLP Structure

The basic usage of each component of a Transaction Layer Packet is defined in Table 4-3 on page 161.
Table 4-3: TLP Components
TLP Component | Protocol Layer | Component Use
Header | Transaction Layer | 3DW or 4DW (12 or 16 bytes) in size. Format varies with type, but the Header defines the transaction parameters: transaction type; intended recipient address, ID, etc.; transfer size (if any) and Byte Enables; ordering attribute; cache coherency attribute; Traffic Class.
Data | Transaction Layer | Optional field. 0-1024 DW payload, which may be further qualified with Byte Enables to get byte-level address and transfer size resolution.
Digest | Transaction Layer | Optional field. If present, always 1 DW in size. Used for end-to-end CRC (ECRC) and data poisoning.

Generic TLP Header Format

Figure 4-3 on page 162 illustrates the format and contents of a generic TLP 3DW header. In this section, fields common to nearly all transactions are summarized. In later sections, header format differences associated with the specific transaction types are covered.


Generic Header Field Summary

Table 4-4 on page 163 summarizes the size and use of each of the generic TLP header fields. Note that fields marked "R" in Figure 4-3 on page 162 are reserved and should be set = 0.
Table 4-4: Generic Header Field Summary
Header Field | Header Location | Field Use
Length [9:0] | Byte 3 Bits 7:0, Byte 2 Bits 1:0 | TLP data payload transfer size, in DW. The field is 10 bits wide, so the maximum transfer size is 2^10 = 1024 DW (4KB). Encoding: 00 0000 0001b = 1 DW; 00 0000 0010b = 2 DW; ... 11 1111 1111b = 1023 DW; 00 0000 0000b = 1024 DW.
Attr (Attributes) | Byte 2 Bits 5:4 | Bit 5 = Relaxed Ordering. When set = 1, PCI-X relaxed ordering is enabled for this TLP. If set = 0, strict PCI ordering is used. Bit 4 = No Snoop. When set = 1, the requester indicates that no host cache coherency issues exist with respect to this TLP, and system hardware is not required to cause a processor cache snoop for coherency. When set = 0, PCI-type cache snoop protection is required.
EP (Poisoned Data) | Byte 2 Bit 6 | If set = 1, the data accompanying this TLP should be considered invalid, although the transaction is allowed to complete normally.
TD (TLP Digest Field Present) | Byte 2 Bit 7 | If set = 1, the optional 1 DW TLP Digest field containing an ECRC value is included with this TLP. Some rules: presence of the Digest field must be checked by all receivers (using this bit); a TLP with TD = 1 but no Digest field is handled as a Malformed TLP; if a device supports checking ECRC and TD = 1, it must perform the ECRC check; if a device at the ultimate destination does not support checking ECRC (optional), it must ignore the Digest.
TC (Traffic Class) | Byte 1 Bits 6:4 | These three bits encode the traffic class to be applied to this TLP and to the completion associated with it (if any). 000b = Traffic Class 0 (default) ... 111b = Traffic Class 7. TC0 is the default class, and TC1-7 are used in providing differentiated services. See "Traffic Classes and Virtual Channels" on page 256 for additional information.
Type [4:0] | Byte 0 Bits 4:0 | These 5 bits encode the transaction variant used with this TLP. The Type field is used with the Fmt [1:0] field to specify transaction type, header size, and whether a data payload is present. See below for Type/Fmt encodings for each transaction type.
Fmt [1:0] (Format) | Byte 0 Bits 6:5 | These two bits encode header size and whether a data payload is part of the TLP: 00b = 3DW header, no data; 01b = 4DW header, no data; 10b = 3DW header, with data; 11b = 4DW header, with data. See below for Type/Fmt encodings for each transaction type.
First DW Byte Enables | Byte 7 Bits 3:0 | These four high-true bits map one-to-one to the bytes within the first double word of payload. Bit 3 = 1: Byte 3 in the first DW is valid; Bit 2 = 1: Byte 2 is valid; Bit 1 = 1: Byte 1 is valid; Bit 0 = 1: Byte 0 is valid. See below for details on Byte Enable use.
Last DW Byte Enables | Byte 7 Bits 7:4 | These four high-true bits map one-to-one to the bytes within the last double word of payload. Bit 7 = 1: Byte 3 in the last DW is valid; Bit 6 = 1: Byte 2 is valid; Bit 5 = 1: Byte 1 is valid; Bit 4 = 1: Byte 0 is valid. See below for details on Byte Enable use.
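To make the byte and bit positions concrete, the sketch below (illustrative C, not from the specification) packs the first DW of a generic TLP header from the fields above; the example values produce the first four header bytes of a 3DW Memory Write with a 1 DW payload.

/* A minimal sketch that packs the first DW of a generic TLP header as laid
 * out in Figure 4-3 (Byte 0 .. Byte 3).  The struct and function names are
 * illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

struct tlp_dw0 {
    uint8_t  fmt;     /* 2 bits : header size / data present      */
    uint8_t  type;    /* 5 bits : transaction variant             */
    uint8_t  tc;      /* 3 bits : traffic class                   */
    uint8_t  td;      /* 1 bit  : digest (ECRC) present           */
    uint8_t  ep;      /* 1 bit  : poisoned data                   */
    uint8_t  attr;    /* 2 bits : relaxed ordering / no snoop     */
    uint16_t length;  /* 10 bits: payload in DW, 0 encodes 1024   */
};

static void pack_dw0(const struct tlp_dw0 *f, uint8_t out[4])
{
    out[0] = (uint8_t)(((f->fmt & 0x3) << 5) | (f->type & 0x1F));
    out[1] = (uint8_t)((f->tc & 0x7) << 4);
    out[2] = (uint8_t)(((f->td & 1) << 7) | ((f->ep & 1) << 6) |
                       ((f->attr & 0x3) << 4) | ((f->length >> 8) & 0x3));
    out[3] = (uint8_t)(f->length & 0xFF);
}

int main(void)
{
    /* 3DW Memory Write with a 1 DW payload, TC0, no digest. */
    struct tlp_dw0 mwr = { .fmt = 0x2, .type = 0x00, .tc = 0,
                           .td = 0, .ep = 0, .attr = 0, .length = 1 };
    uint8_t b[4];
    pack_dw0(&mwr, b);
    printf("%02X %02X %02X %02X\n", b[0], b[1], b[2], b[3]);  /* 40 00 00 01 */
    return 0;
}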

Header Type/Format Field Encodings

Table 4-5 on page 165 summarizes the encodings used in TLP header Type and Format (Fmt) fields.
Table 4-5: TLP Header Type and Format Field Encodings
TLP | FMT [1:0] | TYPE [4:0]
Memory Read Request (MRd) | 00 = 3DW, no data; 01 = 4DW, no data | 0 0000
Memory Read Lock Request (MRdLk) | 00 = 3DW, no data; 01 = 4DW, no data | 0 0001
Memory Write Request (MWr) | 10 = 3DW, w/ data; 11 = 4DW, w/ data | 0 0000
IO Read Request (IORd) | 00 = 3DW, no data | 0 0010
IO Write Request (IOWr) | 10 = 3DW, w/ data | 0 0010
Config Type 0 Read Request (CfgRd0) | 00 = 3DW, no data | 0 0100
Config Type 0 Write Request (CfgWr0) | 10 = 3DW, w/ data | 0 0100
Config Type 1 Read Request (CfgRd1) | 00 = 3DW, no data | 0 0101
Config Type 1 Write Request (CfgWr1) | 10 = 3DW, w/ data | 0 0101
Message Request (Msg) | 01 = 4DW, no data | 1 0rrr (for rrr, see the message routing subfield)
Message Request W/Data (MsgD) | 11 = 4DW, w/ data | 1 0rrr (for rrr, see the message routing subfield)
Completion (Cpl) | 00 = 3DW, no data | 0 1010
Completion W/Data (CplD) | 10 = 3DW, w/ data | 0 1010
Completion-Locked (CplLk) | 00 = 3DW, no data | 0 1011
Completion-Locked W/Data (CplDLk) | 10 = 3DW, w/ data | 0 1011
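A small decode sketch (illustrative C; the function name is my own) makes the point that Fmt and Type must be interpreted together: MRd and MWr share Type 0 0000b and differ only in Fmt, while messages carry their routing subfield in Type[2:0].

/* A minimal sketch mapping the Fmt/Type encodings of Table 4-5 to the
 * transaction acronyms.  Illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

static const char *tlp_name(uint8_t fmt, uint8_t type)
{
    int with_data = fmt & 0x2;                /* Fmt bit 1: data present */
    if ((type >> 3) == 0x2)                   /* 1 0rrr b -> message     */
        return with_data ? "MsgD" : "Msg";
    switch (type) {
    case 0x00: return with_data ? "MWr"    : "MRd";
    case 0x01: return "MRdLk";
    case 0x02: return with_data ? "IOWr"   : "IORd";
    case 0x04: return with_data ? "CfgWr0" : "CfgRd0";
    case 0x05: return with_data ? "CfgWr1" : "CfgRd1";
    case 0x0A: return with_data ? "CplD"   : "Cpl";
    case 0x0B: return with_data ? "CplDLk" : "CplLk";
    default:   return "reserved/unknown";
    }
}

int main(void)
{
    printf("%s\n", tlp_name(0x0, 0x00));  /* MRd                            */
    printf("%s\n", tlp_name(0x3, 0x10));  /* MsgD, routed to Root Complex   */
    return 0;
}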

The Digest and ECRC Field

The Digest field and the End-to-End CRC (ECRC) it carries are optional, as is a device's ability to generate and check ECRC. If supported and enabled by software, a device must calculate and apply ECRC for all TLPs that it originates. Also, devices that support ECRC checking must support Advanced Error Reporting.
ECRC Generation and Checking. This book does not detail the algorithm and process of calculating ECRC; it is defined within the specification. ECRC covers all fields that do not change as the TLP is forwarded across the fabric: all invariant fields of the TLP header and the data payload, if present. The variant fields, which are set to 1 when calculating the ECRC, include:
  • Bit 0 of the Type field is variant - this bit changes when the transaction type is altered for a packet. For example, a configuration transaction being forwarded to a remote link (across one or more switches) begins as a type 1 configuration transaction. When the transaction reaches the destination link, it is converted to a type 0 configuration transaction by changing bit 0 of the type field.
  • Error/Poisoned (EP) bit - this bit can be set as a TLP traverses the fabric in the event that the data field associated with the packet has been corrupted. This is also referred to as error forwarding.
Who Can Check ECRC? The ECRC check is intended for the device that is the ultimate recipient of the TLP. Link CRC checking verifies that a TLP traverses a given link intact before being forwarded to the next link, but ECRC is intended to verify that the packet sent has not been altered in its journey between the Requester and Completer. Switches in the path must maintain the integrity of the TD bit, because corruption of TD will cause an error at the ultimate target device.
The specification makes two statements regarding a Switch's role in ECRC checking:
  • A switch that supports ECRC checking performs this check on TLPs destined to a location within the Switch itself. "On all other TLPs a Switch must preserve the ECRC (forward it untouched) as an integral part of the TLP."
  • "Note that a Switch may perform ECRC checking on TLPs passing through the Switch. ECRC Errors detected by the Switch are reported in the same way any other device would report them, but do not alter the TLPs passage through the Switch."
These statements may appear to contradict each other. However, the first statement does not explicitly state that an ECRC check cannot be made in the process of forwarding the TLP untouched. The second statement clarifies that it is possible for switches, as well as the ultimate target device, to check and report ECRC.

Using Byte Enables

As in the PCI protocol, PCI Express requires a mechanism for reconciling its DW addressing and data transfers with the need, at times, for byte resolution in transfer sizes and transaction start/end addresses. To achieve byte resolution, PCI Express makes use of the two Byte Enable fields introduced earlier in Figure 4-3 on page 162 and in Table 4-4 on page 163.
The First DW Byte Enable field and the Last DW Byte Enable fields allow the requester to qualify the bytes of interest within the first and last double words transferred; this has the effect of allowing smaller transfers than a full double word and offsetting the start and end addresses from DW boundaries.

Byte Enable Rules.

  1. Byte enable bits are high true. A value of "0" indicates the corresponding byte in the data payload should not be written by the completer; a value of "1" indicates it should.
  2. If the valid data transferred is all within a single aligned double word, the Last DW Byte Enable field must be = 0000b.
  3. If the header Length field indicates a transfer of more than 1DW, the First DW Byte Enable field must have at least one bit enabled.
  4. If the Length field indicates a transfer of 3DW or more, then neither the First DW Byte Enable field nor the Last DW Byte Enable field may have discontinuous byte enable bits set. In these cases, the Byte Enable fields are only being used to offset the effective start address of a burst transaction.
  5. Discontinuous byte enable bit patterns in the First DW Byte Enable field are allowed if the transfer is 1DW.
  6. Discontinuous byte enable bit patterns in both the First and Last DW Byte Enable fields are allowed only if the transfer is Quadword aligned (2DWs).
  7. A write request with a transfer length of 1DW and no byte enables set is legal, but has no effect on the completer.
  8. If a read request of 1DW is done with no byte enable bits set, the completer returns a 1DW data payload of undefined data. This may be used as a Flush mechanism. Because of ordering rules, a flush may be used to force all previously posted writes to memory before the completion is returned.
An example of byte enable use in this case is illustrated in Figure 4-4 on page 168. Note that the transfer length must extend from the first DW with any valid byte enabled to the last DW with any valid bytes enabled. Because the transfer is more than 2DW, the byte enables may only be used to specify the start address location (2d) and end address location (34d) of the transfer.
Figure 4-4: Using First DW and Last DW Byte Enable Fields
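The computation implied by Figure 4-4 can be sketched as follows (illustrative C, assuming the general multi-DW case; helper names are my own). With the example's byte range of 2d through 34d it produces a Length of 9 DW, a First DW BE of 1100b, and a Last DW BE of 0111b.

/* A minimal sketch deriving Length and the First/Last DW Byte Enables from
 * a byte-aligned start address and byte count.  Illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

struct be_fields { uint16_t length_dw; uint8_t first_be; uint8_t last_be; };

static struct be_fields compute_be(uint32_t start_byte, uint32_t byte_count)
{
    uint32_t end_byte = start_byte + byte_count - 1;          /* last valid byte */
    uint32_t first_dw = start_byte >> 2;
    uint32_t last_dw  = end_byte >> 2;
    struct be_fields f;

    f.length_dw = (uint16_t)(last_dw - first_dw + 1);
    f.first_be  = (uint8_t)((0xF << (start_byte & 0x3)) & 0xF);  /* drop leading bytes  */
    f.last_be   = (uint8_t)(0xF >> (3 - (end_byte & 0x3)));      /* drop trailing bytes */
    if (f.length_dw == 1) {                  /* single DW: Last DW BE must be 0000b */
        f.first_be &= f.last_be;
        f.last_be = 0;
    }
    return f;
}

int main(void)
{
    struct be_fields f = compute_be(2, 33);  /* bytes 2d..34d, as in Figure 4-4 */
    printf("Length=%u DW, 1st BE=%Xh, Last BE=%Xh\n",
           f.length_dw, f.first_be, f.last_be);   /* Length=9, 1st=C, Last=7 */
    return 0;
}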

Transaction Descriptor Fields

As transactions move between requester and completer, it is important to uniquely identify a transaction, since many split transactions may be pending at any instant. To this end, the specification defines several important header fields that, when used together, form a unique Transaction Descriptor, as illustrated in Figure 4-5.
Figure 4-5: Transaction Descriptor Fields
While the Transaction Descriptor fields are not in adjacent header locations, collectively they describe key transaction attributes, including:
Transaction ID. This is comprised of the Bus, Device, and Function Number of the TLP requester AND the Tag field of the TLP.
Traffic Class. Traffic Class (TC 0-7) is inserted in the TLP by the requester, and travels unmodified through the topology to the completer. At every link, Traffic Class is mapped to one of the available virtual channels.
Transaction Attributes. These consist of the Relaxed Ordering and No Snoop bits. These are also set by the requester and travel with the packet to the completer.
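A minimal sketch (illustrative C; the names are my own) of how the Requester ID and Tag combine into the Transaction ID:

/* The 16-bit Requester ID is Bus[7:0] : Device[4:0] : Function[2:0]; the
 * Transaction ID adds the 8-bit Tag.  Illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

static uint16_t requester_id(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((uint16_t)bus << 8 | (uint16_t)(dev & 0x1F) << 3 | (fn & 0x7));
}

int main(void)
{
    /* Hypothetical requester at bus 4, device 0, function 0, using tag 0x12. */
    uint16_t rid = requester_id(4, 0, 0);
    uint32_t transaction_id = (uint32_t)rid << 8 | 0x12;
    printf("Requester ID = 0x%04X, Transaction ID = 0x%06X\n", rid, transaction_id);
    return 0;
}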

Additional Rules For TLPs With Data Payloads

The following rules apply when a TLP includes a data payload.
  1. The Length field refers to the data payload only; the Digest field (if present) is not included in the Length.
  2. The first byte of data in the payload (immediately after the header) is always associated with the lowest (start) address.
  3. The Length field always represents an integral number of doublewords (DW) transferred. Partial doublewords are qualified using the First and Last DW Byte Enable fields.
  4. The PCI Express specification states that when multiple transactions are returned by a completer in response to a single memory request, each intermediate transaction must end on a naturally-aligned 64- or 128-byte address boundary for a root complex (this is termed the Read Completion Boundary, or RCB). All other devices must break such transactions at naturally-aligned 128-byte boundaries. This behavior promotes system performance related to cache lines.
  5. The Length field is reserved when sending message TLPs using the Msg transaction. The Length field is valid when sending a message with the data variant, MsgD.
  6. PCI Express supports load tuning of links. This means that the data payload of a TLP must not exceed the current value in the Max_Payload_Size field of the Device Control Register. Only write transactions have data payloads, so this restriction does not apply to reads. A receiver is required to check for violations of the Max_Payload_Size limit during writes; violations are handled as Malformed TLPs (see the sketch after this list).
  7. Receivers also must check for discrepancies between the value in the Length field and the actual amount of data transferred in a TLP with data. Violations are also handled as Malformed TLPs.
  8. Requests must not mix combinations of start address and transfer length which would cause a memory space access to cross a 4KB boundary. While checking is optional in this case, receivers checking for violations of this rule will report them as Malformed TLPs.
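The two payload checks called out above (items 6 and 8) reduce to simple arithmetic, sketched here in illustrative C (function names are my own):

/* A minimal sketch of the Max_Payload_Size and 4KB-crossing checks.
 * length_dw follows the TLP convention in which 0 encodes 1024 DW.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t payload_bytes(uint16_t length_dw)
{
    return (length_dw ? length_dw : 1024u) * 4u;
}

static bool exceeds_max_payload(uint16_t length_dw, uint32_t max_payload_bytes)
{
    return payload_bytes(length_dw) > max_payload_bytes;
}

static bool crosses_4kb(uint64_t start_addr, uint16_t length_dw)
{
    return ((start_addr & 0xFFFu) + payload_bytes(length_dw)) > 0x1000u;
}

int main(void)
{
    printf("%d\n", exceeds_max_payload(64, 128));   /* 256 bytes > 128-byte limit: 1 */
    printf("%d\n", crosses_4kb(0x1000FF0, 8));      /* 0xFF0 + 32 crosses 4KB:     1 */
    return 0;
}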

Building Transactions: TLP Requests & Completions

In this section, the 3DW and 4DW header formats used to accomplish specific transaction types are described. Many of the generic fields described previously apply, but an emphasis is placed on the fields which are handled differently between transaction types.

IO Requests

While the PCI Express specification discourages the use of IO transactions, an allowance is made for legacy devices and software which may rely on a compatible device residing in the system IO map rather than the memory map. While IO transactions can technically access a 32-bit IO range, in reality many systems (and CPUs) restrict IO access to the lower 16 bits (64KB) of this range. Figure 4-6 on page 171 depicts the system IO map and the 16/32-bit address boundaries. Non-legacy PCI Express devices are memory-mapped, and are not permitted to request IO address allocation in their configuration Base Address Registers.

Figure 4-6: System IO Map

The entire 32-bit (4GB) IO address space may be accessed using the 3DW request header format.


IO Request Header Format. Figure 4-7 on page 172 depicts the format of the 3DW IO request header. Each field in the header is described in the section that follows.
Figure 4-7: 3DW IO Request Header Format
Definitions Of IO Request Header Fields. Table 4-6 on page 173 describes the location and use of each field in an IO request header.
Table 4-6: IO Request Header Fields
Field Name | Header Byte/Bit | Function
Length 9:0 | Byte 3 Bits 7:0, Byte 2 Bits 1:0 | Indicates data payload size in DW. For IO requests, this field is always = 1. Byte Enables are used to qualify bytes within the DW.
Attr 1:0 (Attributes) | Byte 2 Bits 5:4 | Attribute 1: Relaxed Ordering bit. Attribute 0: No Snoop bit. Both of these bits are always = 0 in IO requests.
EP | Byte 2 Bit 6 | If = 1, indicates the data payload (if present) is poisoned.
TD | Byte 2 Bit 7 | If = 1, indicates the presence of a Digest field (1 DW) at the end of the TLP (preceding the LCRC and END).
TC 2:0 (Traffic Class) | Byte 1 Bits 6:4 | Indicates the traffic class for the packet. TC = 0 for all IO requests.
Type 4:0 | Byte 0 Bits 4:0 | TLP packet type field. Always set to 00010b for IO requests.
Fmt 1:0 (Format) | Byte 0 Bits 6:5 | Packet format. IO requests are: 00b = IO Read (3DW without data); 10b = IO Write (3DW with data).
1st DW BE 3:0 (First DW Byte Enables) | Byte 7 Bits 3:0 | These high-true bits map one-to-one to qualify bytes within the DW payload. For IO requests, any bit combination is valid (including none).
Last BE 3:0 (Last DW Byte Enables) | Byte 7 Bits 7:4 | These high-true bits map one-to-one to qualify bytes within the last DW transferred. For IO requests, these bits must be 0000b (single DW).
Tag 7:0 | Byte 6 Bits 7:0 | These bits are used to identify each outstanding request issued by the requester. As non-posted requests are sent, the next sequential tag is assigned. Default: only bits 4:0 are used (32 outstanding transactions at a time). If the Extended Tag bit in the PCI Express Control Register is set = 1, then all 8 bits may be used (256 tags).
Requester ID 15:0 | Byte 5 Bits 7:0, Byte 4 Bits 7:0 | Identifies the requester so a completion may be returned, etc. Byte 4, 7:0 = Bus Number; Byte 5, 7:3 = Device Number; Byte 5, 2:0 = Function Number.
Address 31:2 | Byte 11 Bits 7:2, Byte 10 Bits 7:0, Byte 9 Bits 7:0, Byte 8 Bits 7:0 | The upper 30 bits of the 32-bit start address for the IO transfer. Note that the lower two bits of the 32-bit address are reserved (00b), forcing the start address to be DW aligned.

Memory Requests

PCI Express memory transactions include two classes: Read Request/Completion and Write Request. Figure 4-8 on page 175 depicts the system memory map and the 3DW and 4DW memory request packet formats. When requesting memory data transfers, it is important to remember that memory transactions are never permitted to cross 4KB boundaries.
Figure 4-8: 3DW And 4DW Memory Request Header Formats


Description of 3DW And 4DW Memory Request Header Fields.
The location and use of each field in a 4DW memory request header is listed in Table 4-7 on page 176.
Note: The difference between a 3DW header and a 4DW header is the location and size of the starting Address field:
  • For a 3DW header (32-bit addressing): Address bits 31:2 are in Bytes 8-11 (the header has no Bytes 12-15).
  • For a 4DW header (64-bit addressing): Address bits 31:2 are in Bytes 12-15, and address bits 63:32 are in Bytes 8-11.
Otherwise the header fields are the same.
Table 4-7: 4DW Memory Request Header Fields
Field Name | Header Byte/Bit | Function
Length [9:0] | Byte 3 Bits 7:0, Byte 2 Bits 1:0 | TLP data payload transfer size, in DW. The field is 10 bits wide, so the maximum transfer size is 2^10 = 1024 DW (4KB). Encoding: 00 0000 0001b = 1 DW; 00 0000 0010b = 2 DW; ... 11 1111 1111b = 1023 DW; 00 0000 0000b = 1024 DW.
Attr (Attributes) | Byte 2 Bits 5:4 | Bit 5 = Relaxed Ordering. When set = 1, PCI-X relaxed ordering is enabled for this TLP. If set = 0, strict PCI ordering is used. Bit 4 = No Snoop. When set = 1, the requester indicates that no host cache coherency issues exist with respect to this TLP, and system hardware is not required to cause a processor cache snoop for coherency. When set = 0, PCI-type cache snoop protection is required.
EP (Poisoned Data) | Byte 2 Bit 6 | If set = 1, the data accompanying this TLP should be considered invalid, although the transaction is allowed to complete normally.
TD (TLP Digest Field Present) | Byte 2 Bit 7 | If set = 1, the optional 1 DW TLP Digest field is included with this TLP. Some rules: presence of the Digest field must be checked by all receivers (using this bit); a TLP with TD = 1 but no Digest field is handled as a Malformed TLP; if a device supports checking ECRC and TD = 1, it must perform the ECRC check; if a device at the ultimate destination does not support checking ECRC (optional), it must ignore the Digest field.
TC (Traffic Class) | Byte 1 Bits 6:4 | These three bits encode the traffic class to be applied to this TLP and to the completion associated with it (if any). 000b = Traffic Class 0 (default) ... 111b = Traffic Class 7. TC0 is the default class, and TC1-7 are used in providing differentiated services. See "Traffic Classes and Virtual Channels" on page 256 for additional information.
Type [4:0] | Byte 0 Bits 4:0 | TLP packet Type field: 00000b = Memory Read or Write; 00001b = Memory Read Locked. The Type field is used with the Fmt [1:0] field to specify transaction type, header size, and whether a data payload is present.
Fmt 1:0 (Format) | Byte 0 Bits 6:5 | Packet format: 00b = Memory Read (3DW, no data); 10b = Memory Write (3DW, w/ data); 01b = Memory Read (4DW, no data); 11b = Memory Write (4DW, w/ data).
1st DW BE 3:0 (First DW Byte Enables) | Byte 7 Bits 3:0 | These high-true bits map one-to-one to qualify bytes within the first DW of payload.
Last BE 3:0 (Last DW Byte Enables) | Byte 7 Bits 7:4 | These high-true bits map one-to-one to qualify bytes within the last DW transferred.
Tag 7:0 | Byte 6 Bits 7:0 | These bits are used to identify each outstanding request issued by the requester. As non-posted requests are sent, the next sequential tag is assigned. Default: only bits 4:0 are used (32 outstanding transactions at a time). If the Extended Tag bit in the PCI Express Control Register is set = 1, then all 8 bits may be used (256 tags).
Requester ID 15:0 | Byte 5 Bits 7:0, Byte 4 Bits 7:0 | Identifies the requester so a completion may be returned, etc. Byte 4, 7:0 = Bus Number; Byte 5, 7:3 = Device Number; Byte 5, 2:0 = Function Number.
Address 31:2 | Byte 15 Bits 7:2, Byte 14 Bits 7:0, Byte 13 Bits 7:0, Byte 12 Bits 7:0 | The lower 32 bits of the 64-bit start address for the memory transfer. Note that the lower two bits of the address are reserved (00b), forcing the start address to be DW aligned.
Address 63:32 | Byte 11 Bits 7:0, Byte 10 Bits 7:0, Byte 9 Bits 7:0, Byte 8 Bits 7:0 | The upper 32 bits of the 64-bit start address for the memory transfer.

Memory Request Notes. Features of memory requests include:

  1. Memory transfers are never permitted to cross a 4KB boundary.
  2. All memory mapped writes are posted, resulting in much higher performance.
  3. Either 32-bit or 64-bit addressing may be used. The 3DW header format supports 32-bit addresses and the 4DW header supports 64-bit addresses.
  4. The full capability of burst transfers is available, with a transfer length of 0-1024 DW (0-4KB).
  5. Advanced PCI Express Quality of Service features, including up to 8 traffic classes and virtual channels, may be implemented.
  6. The No Snoop attribute bit in the header may be set = 1, relieving the system hardware of the burden of snooping processor caches when PCI Express transactions target main memory. Optionally, the bit may be deasserted in the packet, providing PCI-like cache coherency protection.
  7. The Relaxed Ordering bit may also be set = 1, permitting devices in the path to the packet's destination to apply the relaxed ordering rules available in PCI-X. If deasserted, strong PCI producer-consumer ordering is enforced.

Configuration Requests

To maintain compatibility with PCI, PCI Express supports both Type 0 and Type 1 configuration cycles. A Type 1 cycle propagates downstream until it reaches the bridge interface hosting the bus (link) that the target device resides on. The configuration transaction is converted on the destination link from Type 1 to Type 0 by the bridge. The bridge forwards and converts configuration cycles using previously programmed Bus Number registers that specify its primary, secondary, and subordinate buses. Refer to "PCI-Compatible Configuration Mechanism" on page 723 for a discussion of routing these transactions.
Figure 4-9 on page 180 illustrates a Type 1 configuration cycle making its way downstream. At the destination link, it is converted to Type 0 and claimed by the endpoint device. Note that unlike PCI, only one device (other than the bridge) resides on a link. For this reason, no IDSEL or other hardware indication is required to instruct the device to claim the Type 0 cycle; any Type 0 configuration cycle a device sees on its primary link will be claimed.
Figure 4-9: 3DW Configuration Request And Header Format
Definitions Of Configuration Request Header Fields. Table 4-8 on page 181 describes the location and use of each field in the configuration request header illustrated in Figure 4-9 on page 180.
Table 4-8: Configuration Request Header Fields
Field Name | Header Byte/Bit | Function
Length 9:0 | Byte 3 Bits 7:0, Byte 2 Bits 1:0 | Indicates data payload size in DW. For configuration requests, this field is always = 1. Byte Enables are used to qualify bytes within the DW (any combination is legal).
Attr 1:0 (Attributes) | Byte 2 Bits 5:4 | Attribute 1: Relaxed Ordering bit. Attribute 0: No Snoop bit. Both of these bits are always = 0 in configuration requests.
EP | Byte 2 Bit 6 | If = 1, indicates the data payload (if present) is poisoned.
TD | Byte 2 Bit 7 | If = 1, indicates the presence of a Digest field (1 DW) at the end of the TLP (preceding the LCRC and END).
TC 2:0 (Traffic Class) | Byte 1 Bits 6:4 | Indicates the traffic class for the packet. TC = 0 for all configuration requests.
Type 4:0 | Byte 0 Bits 4:0 | TLP packet type field. Set to: 00100b = Type 0 configuration request; 00101b = Type 1 configuration request.
Fmt 1:0 (Format) | Byte 0 Bits 6:5 | Packet format. Always a 3DW header: 00b = configuration read (no data); 10b = configuration write (with data).
1st DW BE 3:0 (First DW Byte Enables) | Byte 7 Bits 3:0 | These high-true bits map one-to-one to qualify bytes within the DW payload. For configuration requests, any bit combination is valid (including none).
Last BE 3:0 (Last DW Byte Enables) | Byte 7 Bits 7:4 | These high-true bits map one-to-one to qualify bytes within the last DW transferred. For configuration requests, these bits must be 0000b (single DW).
Tag 7:0 | Byte 6 Bits 7:0 | These bits are used to identify each outstanding request issued by the requester. As non-posted requests are sent, the next sequential tag is assigned. Default: only bits 4:0 are used (32 outstanding transactions at a time). If the Extended Tag bit in the PCI Express Control Register is set = 1, then all 8 bits may be used (256 tags).
Requester ID 15:0 | Byte 5 Bits 7:0, Byte 4 Bits 7:0 | Identifies the requester so a completion may be returned, etc. Byte 4, 7:0 = Bus Number; Byte 5, 7:3 = Device Number; Byte 5, 2:0 = Function Number.
Register Number | Byte 11 Bits 7:2 | These bits provide the lower 6 bits of the DW configuration space offset. The Register Number is used in conjunction with the Ext Register Number to provide the full 10 bits of offset needed for the 1024 DW (4096-byte) PCI Express configuration space.
Ext Register Number (Extended Register Number) | Byte 10 Bits 3:0 | These bits provide the upper 4 bits of the DW configuration space offset. The Ext Register Number is used in conjunction with the Register Number to provide the full 10 bits of offset needed for the 1024 DW (4096-byte) PCI Express configuration space. For compatibility, this field can be set = 0, in which case only the lower 64 DW (256 bytes) are seen when indexing the Register Number.
Completer ID 15:0 | Byte 9 Bits 7:0, Byte 8 Bits 7:0 | Identifies the completer being accessed with this configuration cycle. The Bus and Device Numbers in this field are "captured" by the device on each Type 0 configuration write. Byte 8, 7:0 = Bus Number; Byte 9, 7:3 = Device Number; Byte 9, 2:0 = Function Number.
Configuration Request Notes. Configuration requests always use the 3DW header format and are routed by the contents of the ID field.
All devices "capture" the Bus Number and Device Number information provided by the upstream device during each Type 0 configuration write cycle. Information is contained in Byte 8-9 (Completer ID) of configuration request.

Completions

Completions are returned following each non-posted request:
  • Memory Read request may result in completion with data (CplD)
  • IO Read request may result in a completion with or without data (CplD)
  • IO Write request may result in a completion without data (Cpl)
  • Configuration Read request may result in a completion with data (CplD)
  • Configuration Write request may result in a completion without data (Cpl)
Many of the fields in the completion must have the same values as the associated request, including Traffic Class, Attribute bits, and the original Requester ID which is used to route the completion back to the original requester. Figure 4-10 on page 184 depicts a completion returning after a non-posted request, as well as the 3DW completion header format.


Figure 4-10: 3DW Completion Header Format
Definitions Of Completion Header Fields. Table 4-9 on page 185 describes the location and use of each field in a completion header.
Table 4-9: Completion Header Fields
Field Name | Header Byte/Bit | Function
Length 9:0 | Byte 3 Bits 7:0, Byte 2 Bits 1:0 | Indicates data payload size in DW. For completions, this field reflects the size of the data payload associated with this completion.
Attr 1:0 (Attributes) | Byte 2 Bits 5:4 | Attribute 1: Relaxed Ordering bit. Attribute 0: No Snoop bit. For a completion, both of these bits are set to the same state as in the request.
EP | Byte 2 Bit 6 | If = 1, indicates the data payload is poisoned.
TD | Byte 2 Bit 7 | If = 1, indicates the presence of a Digest field (1 DW) at the end of the TLP (preceding the LCRC and END).
TC 2:0 (Traffic Class) | Byte 1 Bits 6:4 | Indicates the traffic class for the packet. For a completion, TC is set to the same value as in the request.
Type 4:0 | Byte 0 Bits 4:0 | TLP packet type field. Always set to 01010b for a completion.
Fmt 1:0 (Format) | Byte 0 Bits 6:5 | Packet format. Always a 3DW header: 00b = completion without data (Cpl); 10b = completion with data (CplD).
Byte Count | Byte 7 Bits 7:0, Byte 6 Bits 3:0 | The remaining byte count until a read request is satisfied. Generally, it is derived from the original request Length field. See "Data Returned For Read Requests:" on page 188 for special cases caused by multiple completions.
BCM (Byte Count Modified) | Byte 6 Bit 4 | Set = 1 only by PCI-X completers. Indicates that the Byte Count field (see previous field) reflects the first transfer payload rather than the total payload remaining. See "Using The Byte Count Modified Bit" on page 188.
CS 2:0 (Completion Status Code) | Byte 6 Bits 7:5 | These bits are encoded by the completer to indicate success in fulfilling the request: 000b = Successful Completion (SC); 001b = Unsupported Request (UR); 010b = Configuration Request Retry Status (CRS); 100b = Completer Abort (CA); others: reserved. See "Summary of Completion Status Codes:" on page 187.
Completer ID 15:0 | Byte 5 Bits 7:0, Byte 4 Bits 7:0 | Identifies the completer. While not needed for routing a completion, this information may be useful when debugging bus traffic. Byte 4, 7:0 = Completer Bus Number; Byte 5, 7:3 = Completer Device Number; Byte 5, 2:0 = Completer Function Number.
Lower Address 6:0 | Byte 11 Bits 6:0 | The lower 7 bits of the address for the first enabled byte of data returned with a read. Calculated from the request Length and Byte Enables, it is used to determine the next legal Read Completion Boundary. See "Calculating The Lower Address Field" on page 187.
Tag 7:0 | Byte 10 Bits 7:0 | These bits are set to reflect the Tag received with the request. The requester uses them to associate an inbound completion with an outstanding request.
Requester ID 15:0 | Byte 9 Bits 7:0, Byte 8 Bits 7:0 | Copied from the request into this field, to be used in routing the completion back to the original requester. Byte 8, 7:0 = Requester Bus Number; Byte 9, 7:3 = Requester Device Number; Byte 9, 2:0 = Requester Function Number.
Summary of Completion Status Codes: (Refer to the Completion Status field in Table 4-9 on page 185.)
  • 000b (SC) Successful Completion indicates the original request completed properly at the target.
  • 001b (UR) Unsupported Request indicates the original request failed at the target because it targeted an unsupported address, carried an unsupported request type, etc. This is handled as an uncorrectable error. See "Unsupported Request" on page 365 for details.
  • 010b (CRS) Configuration Request Retry Status indicates the target was temporarily off-line and the attempt should be retried (e.g. due to an initialization delay after reset).
  • 100b (CA) Completer Abort indicates that the completer is off-line due to an error (much like a target abort in PCI). The error will be logged and handled as an uncorrectable error.
Calculating The Lower Address Field (Byte 11, bits 6:0): Refer to the Lower Address field in Table 4-9 on page 185. The Lower Address field is set up by the completer during completions with data (CplD) to reflect the address of the first enabled byte of data being returned in the completion payload. This must be calculated in hardware by considering both the DW start address and the byte enable pattern in the First DW Byte Enable field provided in the original request. Basically, the address is an offset from the DW start address:
  • If the First DW Byte Enable field is 1111b, all bytes are enabled in the first DW and the offset is 0. The byte start address = DW start address.
  • If the First DW Byte Enable field is 1110b, the upper three bytes are enabled in the first DW and the offset is 1. The byte start address = DW start address + 1.
  • If the First DW Byte Enable field is 1100b, the upper two bytes are enabled in the first DW and the offset is 2. The byte start address = DW start address + 2.
  • If the First DW Byte Enable field is 1000b, only the upper byte is enabled in the first DW and the offset is 3. The byte start address = DW start address + 3.
Once calculated, the lower 7 bits are placed in the Lower Address field of the completion header, in the event the start address was not aligned on a Read Completion Boundary (RCB) and the read completion must break off at the first RCB. Knowledge of the RCB is necessary because breaking a transaction must be done on RCBs, which are based on the start address, not the transfer size.
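A sketch of this calculation in illustrative C (helper names are my own): the offset within the first DW is simply the number of deasserted low-order byte enables.

/* A minimal sketch of the Lower Address calculation from the DW start
 * address and the First DW Byte Enable pattern.  Illustrative only; the
 * zero-byte-enable flush case is assumed to report offset 0.
 */
#include <stdint.h>
#include <stdio.h>

static uint8_t lower_address(uint32_t dw_start_addr, uint8_t first_dw_be)
{
    uint8_t offset = 0;
    if (first_dw_be != 0)
        while (((first_dw_be >> offset) & 1) == 0)   /* count trailing zero BEs */
            offset++;
    return (uint8_t)((dw_start_addr + offset) & 0x7F);   /* only low 7 bits reported */
}

int main(void)
{
    printf("%u\n", lower_address(0x100, 0xF));  /* 1111b -> offset 0 -> 0 */
    printf("%u\n", lower_address(0x100, 0xC));  /* 1100b -> offset 2 -> 2 */
    return 0;
}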
Using The Byte Count Modified Bit. Refer to the Byte Count Modified Bit in Table 4-9 on page 185. This bit is only set by a PCI-X completer (e.g. a bridge from PCI Express to PCI-X) in a particular circumstance. Rules for its assertion include:
  1. It is only set = 1 by a PCI-X completer if a read request is going to be broken into multiple completions.
  2. The BCM bit is only set for the first completion of the series. It is set to indicate that the first completion contains a Byte Count field that reflects the first completion payload rather than the total remaining (as it would in normal PCI Express protocol). The receiver then recognizes that the completion will be followed by others to satisfy the original request as required.
  3. For the second and any subsequent completions in the series, the BCM bit must be deasserted and the Byte Count field will reflect the total remaining count, just as in normal PCI Express protocol.
  4. PCI Express devices receiving completions with the BCM bit set must interpret this case properly.
  5. The Lower Address field is set up by the completer during completions with data (CplD) to reflect the address of the first enabled byte of data being returned.

Data Returned For Read Requests:

  1. Completions for read requests may be broken into multiple completions, but the total data transferred must equal the size of the original request.
  2. Completions for multiple requests may not be combined.
  3. IO and Configuration reads are always 1 DW, so they will always be satisfied with a single completion.
  4. A completion with a Status Code other than SC (Successful Completion) terminates a transaction.
  5. The Read Completion Boundary (RCB) must be observed when handling a read request with multiple completions. The RCB is 64 bytes or 128 bytes for the root complex; the value used should be visible in a configuration register (a sketch of RCB-based splitting follows this list).
  6. Bridges and endpoints may implement a bit for selecting the RCB size (64 or 128 bytes) under software control.
  7. Completions that do not cross an aligned RCB boundary must complete in one transfer.
  8. Multiple completions for a single read request must return data in increasing address order.
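The RCB splitting rule can be sketched as follows (illustrative C; the function name and example values are my own, and in practice the RCB value would come from the device's configuration register). The first completion runs from the start address up to the next RCB-aligned address, and later ones are full RCB-sized chunks until the byte count is exhausted.

/* A minimal sketch of breaking a read completion at Read Completion
 * Boundaries.  Illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

static void split_at_rcb(uint64_t start, uint32_t byte_count, uint32_t rcb)
{
    uint64_t addr      = start;
    uint32_t remaining = byte_count;

    while (remaining) {
        uint32_t to_boundary = rcb - (uint32_t)(addr % rcb);
        uint32_t chunk = remaining < to_boundary ? remaining : to_boundary;
        printf("CplD: addr=0x%llx, %u bytes, byte count remaining=%u\n",
               (unsigned long long)addr, chunk, remaining);
        addr      += chunk;
        remaining -= chunk;
    }
}

int main(void)
{
    /* Unaligned 256-byte read with a 64-byte RCB: chunks of 48, 64, 64, 64, 16. */
    split_at_rcb(0x10010, 256, 64);
    return 0;
}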

Receiver Completion Handling Rules:

  1. A completion received without a match to an outstanding request is an Unexpected Completion. It will be handled as an error.
  2. Completions with a completion status other than Successful Completion (SC) or Configuration Request Retry Status (CRS) will be handled as an error, and the buffer space associated with them will be released.
  3. When the Root Complex receives CRS status during a configuration cycle, its handling of the event is not defined, except after reset (when a period is defined during which it must allow it).
  4. If CRS is received for a request other than a configuration request, it is handled as a Malformed TLP.
  5. Completions received with a status equal to a reserved code are aliased to Unsupported Request.
  6. If a read completion is received with a status other than Successful Completion (SC), no data is returned with the completion, and a Cpl (or CplLk) is returned in place of a CplD (or CplDLk).
  7. In the event multiple completions are being returned for a read request, a completion status other than Successful Completion (SC) immediately ends the transaction. Device handling of data received prior to the error is implementation-specific.
  8. In maintaining compatibility with PCI, a Root Complex may be required to synthesize a read value of all 1's when a configuration cycle ends with a completion indicating an Unsupported Request. (This is analogous to the master abort which occurs when PCI enumeration probes devices which are not in the system.)


Message Requests

Message requests replace many of the interrupt, error, and power management sideband signals used on earlier bus protocols. All message requests use the 4DW header format, and are handled much the same as posted memory write transactions. Messages may be routed using address, ID, or implicit routing. The routing subfield in the packet header indicates the routing method to apply, and which additional header registers are in use (address registers, etc.). Figure 4-11 on page 190 depicts the message request header format.
Figure 4-11: 4DW Message Request Header Format
Definitions Of Message Request Header Fields. Table 4-10 on page 191 describes the location and use of each field in a message request header.
Table 4-10: Message Request Header Fields
Field Name | Header Byte/Bit | Function
Length 9:0 | Byte 3 Bits 7:0, Byte 2 Bits 1:0 | Indicates data payload size in DW. For message requests, this field is always 0 (no data) or 1 (one DW of data).
Attr 1:0 (Attributes) | Byte 2 Bits 5:4 | Attribute 1: Relaxed Ordering bit. Attribute 0: No Snoop bit. Both of these bits are always = 0 in message requests.
EP | Byte 2 Bit 6 | If = 1, indicates the data payload (if present) is poisoned.
TD | Byte 2 Bit 7 | If = 1, indicates the presence of a Digest field (1 DW) at the end of the TLP (preceding the LCRC and END).
TC 2:0 (Traffic Class) | Byte 1 Bits 6:4 | Indicates the traffic class for the packet. TC = 0 for all message requests.
Type 4:0 | Byte 0 Bits 4:0 | TLP packet type field. Bits 4:3 = 10b (Msg). Bits 2:0 (Message Routing Subfield): 000b = Routed to Root Complex; 001b = Routed by address; 010b = Routed by ID; 011b = Root Complex Broadcast Message; 100b = Local, terminate at receiver; 101b = Gather/route to Root Complex; others = reserved.
Fmt 1:0 (Format) | Byte 0 Bits 6:5 | Packet format. Always a 4DW header: 01b = message request without data; 11b = message request with data.
Message Code 7:0 | Byte 7 Bits 7:0 | This field contains the code indicating the type of message being sent: 0000 0000b = Unlock Message; 0001 xxxxb = Power Mgmt Message; 0010 0xxxb = INTx Message; 0011 00xxb = Error Message; 0100 xxxxb = Hot Plug Message; 0101 0000b = Slot Power Message; 0111 1110b = Vendor Type 0 Message; 0111 1111b = Vendor Type 1 Message.
Tag 7:0 | Byte 6 Bits 7:0 | As all message requests are posted, no tag is assigned to them. These bits should be = 0.
Requester ID 15:0 | Byte 5 Bits 7:0, Byte 4 Bits 7:0 | Identifies the requester sending the message. Byte 4, 7:0 = Requester Bus Number; Byte 5, 7:3 = Requester Device Number; Byte 5, 2:0 = Requester Function Number.
Address 31:2 | Byte 15 Bits 7:2, Byte 14 Bits 7:0, Byte 13 Bits 7:0, Byte 12 Bits 7:0 | If address routing was selected for the message (see the Type 4:0 field above), this field contains the lower part of the 64-bit starting address. Otherwise, this field is not used.
Address 63:32 | Byte 11 Bits 7:0, Byte 10 Bits 7:0, Byte 9 Bits 7:0, Byte 8 Bits 7:0 | If address routing was selected for the message (see the Type 4:0 field above), this field contains the upper 32 bits of the 64-bit starting address. Otherwise, this field is not used.
Message Notes: The following tables specify the message coding used for each of the seven message groups, based on the Message Code field listed in Table 4-10 on page 191. The defined groups include:
  1. INTx Interrupt Signaling
  2. Power Management
  3. Error Signaling
  4. Lock Transaction Support
  5. Slot Power Limit Support
  6. Vendor-Defined Messages
  7. Hot Plug Signaling
INTx Interrupt Signaling. While many devices are capable of using the PCI 2.3 Message Signaled Interrupt (MSI) method of delivering interrupts, some devices may not support it. PCI Express defines a virtual wire alternative in which devices simulate the assertion and deassertion of the INTx (INTA-INTD) interrupt signals seen in PCI-based systems. Basically, a message is sent to inform the upstream device that an interrupt has been asserted. After servicing, the device which sent the interrupt sends a second message indicating the virtual interrupt signal is being released. Refer to "Message Signaled Interrupts" on page 331 for details. Table 4-11 summarizes the INTx message coding at the packet level.
Table 4-11: INTx Interrupt Signaling Message Coding
INTx Message | Message Code 7:0 | Routing 2:0
Assert_INTA | 0010 0000b | 100b
Assert_INTB | 0010 0001b | 100b
Assert_INTC | 0010 0010b | 100b
Assert_INTD | 0010 0011b | 100b
Deassert_INTA | 0010 0100b | 100b
Deassert_INTB | 0010 0101b | 100b
Deassert_INTC | 0010 0110b | 100b
Deassert_INTD | 0010 0111b | 100b


Other INTx Rules:

  1. The INTx Message type does not include a data payload. The Length field is reserved.
  2. Assert_INTx and Deassert_INTx messages are only issued by upstream ports. Checking for violations of this rule is optional. If checked, a violation is handled as a Malformed TLP.
  3. These messages are required to use the default traffic class, TC0. Receivers must check for violation of this rule (handled as a Malformed TLP).
  4. Components at both ends of the link must track the current state of the four virtual interrupts. If the logical state of one of the interrupts changes at the upstream port, the port must send the appropriate INTx message to the downstream port on the same link.
  5. INTx signaling is disabled when the Interrupt Disable bit of the Command Register is set to 1 (just as it would be if physical interrupt lines were used).
  6. If any virtual INTx signals are active when the Interrupt Disable bit is set in the device, the device must transmit a corresponding Deassert_INTx message onto the link.
  7. Switches must track the state of the four INTx signals independently for each downstream port and combine the states for the upstream link (a sketch of this combining follows this list).
  8. The Root Complex must track the state of the four INTx lines independently and convert them into system interrupts in a system-specific way.
  9. Because switches may be in the path, the Requester ID in an INTx message may identify the last transmitter rather than the original requester.
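Rules 7 and 8 describe a simple wired-OR collapsing of virtual interrupt state. The sketch below is only an illustration of that behavior under assumed names (NUM_DOWNSTREAM_PORTS, the state arrays, and send_upstream_intx_message() are hypothetical, not specification-defined interfaces); it ORs the per-port INTA-INTD state and emits an Assert_INTx or Deassert_INTx message whenever the combined state of a wire changes.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_DOWNSTREAM_PORTS 4   /* hypothetical switch with 4 downstream ports */

    /* Per-port virtual INTA-INTD state, true = asserted. */
    static bool port_intx[NUM_DOWNSTREAM_PORTS][4];
    /* Last combined state reported on the upstream link. */
    static bool upstream_intx[4];

    /* Placeholder for transmitting an Assert_INTx/Deassert_INTx message upstream. */
    static void send_upstream_intx_message(int wire, bool assert)
    {
        printf("%s_INT%c sent upstream\n", assert ? "Assert" : "Deassert", 'A' + wire);
    }

    /* Called whenever a downstream port's virtual INTx state changes. The
     * upstream state for each wire is the logical OR of all downstream ports;
     * a message is sent only when that combined state actually changes.      */
    void update_upstream_intx(void)
    {
        for (int wire = 0; wire < 4; wire++) {
            bool combined = false;
            for (int port = 0; port < NUM_DOWNSTREAM_PORTS; port++)
                combined = combined || port_intx[port][wire];

            if (combined != upstream_intx[wire]) {
                upstream_intx[wire] = combined;
                send_upstream_intx_message(wire, combined);
            }
        }
    }

    int main(void)
    {
        port_intx[2][0] = true;   /* downstream port 2 asserts virtual INTA */
        update_upstream_intx();   /* Assert_INTA goes upstream              */
        port_intx[2][0] = false;  /* port 2 releases INTA                   */
        update_upstream_intx();   /* Deassert_INTA goes upstream            */
        return 0;
    }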
Power Management Messages. PCI Express is compatible with PCI power management, and adds the PCI Express active link management mechanism. Refer to Chapter 16, entitled "Power Management," on page 567 for a description of power management. Table 4-12 on page 194 summarizes the four power management message types.
Table 4-12: Power Management Message Coding
Power Management Message | Message Code 7:0 | Routing 2:0
PM_Active_State_Nak | 0001 0100b | 100b
PM_PME | 0001 1000b | 000b
PM_Turn_Off | 0001 1001b | 011b
PME_TO_Ack | 0001 1011b | 101b

Other Power Management Message Rules:

  1. The Power Management Message type does not include a data payload. The Length field is reserved.
  2. These messages are required to use the default traffic class, TC0. Receivers must check for violation of this rule (handled as a Malformed TLP).
  3. PM_PME is sent upstream by the component requesting the power management event.
  4. PM_Turn_Off is broadcast downstream.
  5. PME_TO_Ack is sent upstream by an endpoint. A switch with devices attached to multiple downstream ports does not send this message upstream until it has first been received from all of its downstream ports (a sketch of this aggregation follows this list).
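The aggregation behavior in rule 5 can be pictured with a minimal sketch. The port count, the flag array, and send_pme_to_ack_upstream() below are hypothetical; the point is simply that the upstream PME_TO_Ack is gated on every downstream port having reported.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_DOWNSTREAM_PORTS 3   /* hypothetical switch */

    static bool pme_to_ack_seen[NUM_DOWNSTREAM_PORTS];

    static void send_pme_to_ack_upstream(void)
    {
        printf("PME_TO_Ack forwarded upstream\n");
    }

    /* Called when a PME_TO_Ack message arrives on a downstream port. */
    void on_pme_to_ack(int port)
    {
        pme_to_ack_seen[port] = true;

        /* Forward upstream only after all downstream ports have reported. */
        for (int p = 0; p < NUM_DOWNSTREAM_PORTS; p++)
            if (!pme_to_ack_seen[p])
                return;

        send_pme_to_ack_upstream();
    }

    int main(void)
    {
        on_pme_to_ack(0);
        on_pme_to_ack(2);
        on_pme_to_ack(1);   /* last port reports: the message goes upstream */
        return 0;
    }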
Error Messages. Error messages are sent upstream by enabled devices that detect correctable, non-fatal uncorrectable, or fatal uncorrectable errors. The device detecting the error is identified by the Requester ID field in the message header. Table 4-13 on page 195 describes the three error message types.
Table 4-13: Error Message Coding
Error Message | Message Code 7:0 | Routing 2:0
ERR_COR | 0011 0000b | 000b
ERR_NONFATAL | 0011 0001b | 000b
ERR_FATAL | 0011 0011b | 000b

Other Error Signaling Message Rules:

  1. These messages are required to use the default traffic class, TC0. Receivers must check for violation of this rule (handled as a Malformed TLP).
  2. This message type does not include a data payload. The Length field is reserved.
  3. The Root Complex converts error messages into system-specific events.


Unlock Message. The Unlock message is sent to a completer to release it from lock as part of the PCI Express Locked Transaction sequence. Table 4-14 on page 196 summarizes the coding for this message.
Table 4-14: Unlock Message Coding
Unlock Message | Message Code 7:0 | Routing 2:0
Unlock | 0000 0000b | 011b

Other Unlock Message Rules:

  1. These messages are required to use the default traffic class, TC0. Receivers must check for violation of this rule (handled as a Malformed TLP).
  2. This message type does not include a data payload. The Length field is reserved.
Slot Power Limit Message. This message is sent from the downstream port of a switch or Root Complex to the upstream port of the device attached to it. It conveys a slot power limit, which the downstream device then copies into the Device Capabilities Register of its upstream port. Table 4-15 summarizes the coding for this message.
Table 4-15: Slot Power Limit Message Coding
Slot Power Limit Message | Message Code 7:0 | Routing 2:0
Set_Slot_Power_Limit | 0101 0000b | 100b

Other Set_Slot_Power_Limit Message Rules:

  1. These messages are required to use the default traffic class, TC0. Receivers must check for violation of this rule (handled as a Malformed TLP).
  2. This message type carries a data payload of 1 DW. The Length field is set to 1. Only the lower 10 bits of the 32-bit data payload are used for the slot power limit; the upper bits in the data payload must be set to 0.
  3. This message is sent automatically anytime the link transitions to DL_Up status, or if a configuration write to the Slot Capabilities Register occurs when the Data Link Layer reports DL_Up status.
  4. If a card in a slot consumes less power than the power limit specified for the card/form factor, it may ignore the message.
Hot Plug Signaling Message. These messages are passed between downstream ports of switches and Root Ports that support Hot Plug Event signaling. Table 4-16 summarizes the Hot Plug message types.
Table 4-16: Hot Plug Message Coding
Hot Plug Message | Message Code 7:0 | Routing 2:0
Attention_Indicator_On | 0100 0001b | 100b
Attention_Indicator_Blink | 0100 0011b | 100b
Attention_Indicator_Off | 0100 0000b | 100b
Power_Indicator_On | 0100 0101b | 100b
Power_Indicator_Blink | 0100 0111b | 100b
Power_Indicator_Off | 0100 0100b | 100b
Attention_Button_Pressed | 0100 1000b | 100b

Other Hot Plug Message Rules:

  • The Attention and Power Indicator messages are all driven by the switch/root complex port to the card.
  • The Attention Button message is driven upstream by a slot device that implements a switch.


Data Link Layer Packets

The primary responsibility of the PCI Express Data Link Layer is to assure that integrity is maintained when TLPs move between two devices. It also has link initialization and power management responsibilities, including tracking of the link state and passing messages and status between the Transaction Layer above and the Physical Layer below.
In performing its role, the Data Link Layer exchanges traffic with its neighbor using Data Link Layer Packets (DLLPs). DLLPs originate and terminate at the Data Link Layer of each device, without involvement of the Transaction Layer. DLLPs and TLPs are interleaved on the link. Figure 4-12 on page 198 depicts the transmission of a DLLP from one device to another.
Figure 4-12: Data Link Layer Sends A DLLP

Types Of DLLPs

There are three important groups of DLLPs used in managing a link:
  1. TLP Acknowledgement Ack/Nak DLLPs
  2. Power Management DLLPs
  3. Flow Control Packet DLLPs
In addition, the specification defines a vendor-specific DLLP.

DLLPs Are Local Traffic

DLLPs have a simple packet format. Unlike TLPs, they carry no target information because they are used for nearest-neighbor communications only.

Receiver Handling Of DLLPs

The following rules apply when a DLLP is sent from transmitter to receiver:
  1. As DLLPs arrive at the receiver, they are immediately processed. They cannot be flow controlled.
  2. All received DLLPs are checked for errors. This includes a control symbol check at the Physical Layer after deserialization, followed by a CRC check at the receiver Data Link Layer. A 16-bit CRC is calculated and sent with the packet by the transmitter; the receiver calculates its own DLLP CRC and compares it to the received value.
  3. Any DLLPs that fail the CRC check are discarded. There are several reportable errors associated with DLLPs.
  4. Unlike TLPs, there is no acknowledgement protocol for DLLPs. The PCI Express specification has time-out mechanisms which are intended to allow recovery from lost or discarded DLLPs.
  5. Assuming no errors occur, the DLLP type is determined and it is passed to the appropriate internal logic (a sketch of this dispatch follows this list):
  • Power Management DLLPs are passed to the device power management logic.
  • Flow Control DLLPs are passed to the Transaction Layer so credits may be updated.
  • Ack/Nak DLLPs are routed to the Data Link Layer transmit interface so TLPs in the retry buffer may be discarded or resent.
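As a minimal sketch of the receive-side handling just listed (the CRC stub and the handler names such as to_power_mgmt_logic() are placeholders, not specification-defined interfaces), a receiver checks the CRC, discards and reports a bad DLLP, and otherwise dispatches on the Type byte using the encodings of Table 4-17:

    #include <stdint.h>
    #include <stdio.h>

    enum dllp_group { DLLP_ACK_NAK, DLLP_POWER_MGMT, DLLP_FLOW_CONTROL, DLLP_OTHER };

    /* Assumed helper stubs standing in for the blocks named in the text. */
    static int  crc16_ok(const uint8_t dllp[6])                 { (void)dllp; return 1; }
    static void report_bad_dllp_error(void)                     { puts("Bad DLLP reported"); }
    static void to_power_mgmt_logic(const uint8_t *d)           { (void)d; puts("-> power management logic"); }
    static void to_transaction_layer_credits(const uint8_t *d)  { (void)d; puts("-> flow control credit update"); }
    static void to_replay_buffer_logic(const uint8_t *d)        { (void)d; puts("-> Ack/Nak handling"); }

    /* Classify a DLLP by its Type byte (Byte 0), per Table 4-17. */
    static enum dllp_group classify(uint8_t type)
    {
        uint8_t hi = type >> 4;
        if (type == 0x00 || type == 0x10) return DLLP_ACK_NAK;    /* Ack, Nak          */
        if (hi == 0x2)                    return DLLP_POWER_MGMT; /* PM_* DLLPs        */
        if (hi == 0x4 || hi == 0x5 || hi == 0x6 ||                /* InitFC1-P/NP/Cpl  */
            hi == 0xC || hi == 0xD || hi == 0xE ||                /* InitFC2-P/NP/Cpl  */
            hi == 0x8 || hi == 0x9 || hi == 0xA)                  /* UpdateFC-P/NP/Cpl */
            return DLLP_FLOW_CONTROL;
        return DLLP_OTHER;                                        /* vendor / reserved */
    }

    /* Receiver handling per the rules above: CRC check first; discard and
     * report on failure; otherwise dispatch the DLLP by type.             */
    void receive_dllp(const uint8_t dllp[6])
    {
        if (!crc16_ok(dllp)) {
            report_bad_dllp_error();     /* discard; no further action */
            return;
        }
        switch (classify(dllp[0])) {
        case DLLP_ACK_NAK:      to_replay_buffer_logic(dllp);       break;
        case DLLP_POWER_MGMT:   to_power_mgmt_logic(dllp);          break;
        case DLLP_FLOW_CONTROL: to_transaction_layer_credits(dllp); break;
        default:                /* vendor-specific or reserved */   break;
        }
    }

    int main(void)
    {
        uint8_t ack[6] = { 0x00, 0x00, 0x00, 0x05, 0x00, 0x00 };   /* illustrative Ack */
        receive_dllp(ack);
        return 0;
    }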

Sending A Data Link Layer Packet

DLLPs are assembled on the transmit side and disassembled on the receiver side of a link. These packets originate at the Data Link Layer and are passed to the Physical Layer. There, framing symbols are added before the packet is sent. Figure 4-13 on page 200 depicts a generic DLLP in transit from Device B to Device A.
Figure 4-13: Generic Data Link Layer Packet Format


Fixed DLLP Packet Size: 8 Bytes

All Data Link Layer Packets consist of the following components:
  1. A 1 DW core (4 bytes) consisting of the one-byte Type field and three additional bytes of attributes. The attributes vary with the DLLP type.
  2. A 16-bit CRC value, which is calculated based on the DW core contents and then appended to it.
  3. These 6 bytes are then passed to the Physical Layer, where a Start Of DLLP (SDP) control symbol and an End Of Packet (END) control symbol are added. Before transmission, the Physical Layer encodes the 8 bytes of information into eight 10-bit symbols for transmission to the receiver.
Note that there is never a data payload with a DLLP; all information of interest is carried in the Type and Attribute fields.
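The fixed framing just described can be sketched as a byte-assembly routine. This is illustrative only: the crc16() stub does not implement the real DLLP CRC, and SYM_SDP/SYM_END are placeholder constants standing in for control symbols that the Physical Layer actually supplies as 8b/10b K-codes.

    #include <stdint.h>
    #include <string.h>

    /* Placeholder values standing in for the SDP and END control symbols;
     * the real symbols are K-codes added by the Physical Layer.           */
    #define SYM_SDP 0xAA
    #define SYM_END 0x55

    /* Assumed CRC stub: the real DLLP CRC is a 16-bit CRC over the 4-byte core. */
    static uint16_t crc16(const uint8_t *data, int len) { (void)data; (void)len; return 0; }

    /* Build the 8-byte DLLP as seen before 8b/10b encoding:
     * SDP, Type byte, three attribute bytes, the two CRC bytes, END. */
    int build_dllp(uint8_t out[8], uint8_t type, const uint8_t attr[3])
    {
        uint8_t core[4] = { type, attr[0], attr[1], attr[2] };
        uint16_t crc = crc16(core, 4);

        out[0] = SYM_SDP;
        memcpy(&out[1], core, 4);
        out[5] = (uint8_t)(crc & 0xFF);   /* CRC byte ordering here is illustrative */
        out[6] = (uint8_t)(crc >> 8);
        out[7] = SYM_END;
        return 8;
    }

    int main(void)
    {
        uint8_t attr[3] = { 0, 0, 0 };
        uint8_t wire[8];
        return build_dllp(wire, 0x00, attr) == 8 ? 0 : 1;   /* build an Ack DLLP shell */
    }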

DLLP Packet Types

The three groups of DLLPs are defined with a number of variants. Table 4-17 summarizes each variant as well as its DLLP Type field coding.
Table 4-17: DLLP Packet Types
DLLP Type | Type Field Encoding | Purpose
Ack (TLP Acknowledge) | 0000 0000b | TLP transmission integrity
Nak (TLP No Acknowledge) | 0001 0000b | TLP transmission integrity
PM_Enter_L1 | 0010 0000b | Power Management
PM_Enter_L23 | 0010 0001b | Power Management
PM_Active_State_Request_L1 | 0010 0011b | Power Management
PM_Request_Ack | 0010 0100b | Power Management
Vendor Specific | 0011 0000b | Vendor
InitFC1-P (xxx = VC#) | 0100 0xxxb | TLP Flow Control
InitFC1-NP (xxx = VC#) | 0101 0xxxb | TLP Flow Control
InitFC1-Cpl (xxx = VC#) | 0110 0xxxb | TLP Flow Control
InitFC2-P (xxx = VC#) | 1100 0xxxb | TLP Flow Control
InitFC2-NP (xxx = VC#) | 1101 0xxxb | TLP Flow Control
InitFC2-Cpl (xxx = VC#) | 1110 0xxxb | TLP Flow Control
UpdateFC-P (xxx = VC#) | 1000 0xxxb | TLP Flow Control
UpdateFC-NP (xxx = VC#) | 1001 0xxxb | TLP Flow Control
UpdateFC-Cpl (xxx = VC#) | 1010 0xxxb | TLP Flow Control
Reserved | Others | Reserved

Ack Or Nak DLLP Packet Format

The format of the DLLP used by a receiver to Ack or Nak the delivery of a TLP is illustrated in Figure 4-14.
Figure 4-14: Ack Or Nak DLLP Packet Format
Definitions Of Ack Or Nak DLLP Fields. Table 4-18 describes the fields contained in an Ack or Nak DLLP.
Table 4-18: Ack or Nak DLLP Fields
Field Name | Header Byte/Bit | DLLP Function
AckNak_Seq_Num [11:0] | Byte 3 Bit 7:0, Byte 2 Bit 3:0 | For an ACK DLLP: for good TLPs received with Sequence Number = NEXT_RCV_SEQ count (count before incrementing), use NEXT_RCV_SEQ count - 1 (count after incrementing, minus 1); for a TLP received with a Sequence Number earlier than NEXT_RCV_SEQ count (duplicate TLP), use NEXT_RCV_SEQ count - 1. For a NAK DLLP: when associated with a TLP that failed the CRC check, use NEXT_RCV_SEQ count - 1; for a TLP received with a Sequence Number later than NEXT_RCV_SEQ count, use NEXT_RCV_SEQ count - 1. Upon receipt, the transmitter purges TLPs with equal or earlier Sequence Numbers and replays the remaining TLPs.
Type 7:0 | Byte 0 Bit 7:0 | Indicates the type of DLLP. For the Ack/Nak DLLPs: 0000 0000b = ACK DLLP; 0001 0000b = NAK DLLP.
16-bit CRC | Byte 5 Bit 7:0, Byte 4 Bit 7:0 | 16-bit CRC used to protect the contents of this DLLP. Calculation is made on Bytes 0-3 of the ACK/NAK.


Power Management DLLP Packet Format

PCI Express power management DLLPs and TLPs replace most signals associated with power management state changes. The format of the DLLP used for power management is illustrated in Figure 4-15.
Figure 4-15: Power Management DLLP Packet Format

Definitions Of Power Management DLLP Fields. Table 4-19 describes the fields contained in a Power Management DLLP.
Table 4-19: Power Management DLLP Fields
Field Name | Header Byte/Bit | DLLP Function
Type 7:0 | Byte 0 Bit 7:0 | This field indicates the type of DLLP. For the Power Management DLLPs: 0010 0000b = PM_Enter_L1; 0010 0001b = PM_Enter_L23; 0010 0011b = PM_Active_State_Request_L1; 0010 0100b = PM_Request_Ack.
Link CRC | Byte 5 Bit 7:0, Byte 4 Bit 7:0 | 16-bit CRC sent to protect the contents of this DLLP. Calculation is made on Bytes 0-3, regardless of whether fields are used.


Flow Control Packet Format

PCI Express eliminates many of the inefficiencies of earlier bus protocols through the use of a credit-based flow control scheme. This topic is covered in detail in Chapter 7, entitled "Flow Control," on page 285. Three slightly different DLLPs are used to initialize the credits and to update them as receiver buffer space becomes available. The two flow control initialization packets are referred to as InitFC1 and InitFC2. The Update DLLP is referred to as UpdateFC.
The generic DLLP format for all three flow control DLLP variants is illustrated in Figure 4-16 on page 205.
Figure 4-16: Flow Control DLLP Packet Format


Definitions Of Flow Control DLLP Fields. Table 4-20 on page 206 describes the fields contained in a flow control DLLP.
Table 4-20: Flow Control DLLP Fields
Field Name | Header Byte/Bit | DLLP Function
DataFC 11:0 | Byte 3 Bit 7:0, Byte 2 Bit 3:0 | This field contains the credits associated with data storage. Data credits are in units of 16 bytes per credit, and are applied to the flow control counter for the virtual channel indicated in VC[2:0] and for the traffic type indicated by the code in Byte 0, Bits 7:4.
HdrFC 7:0 | Byte 2 Bit 7:6, Byte 1 Bit 5:0 | This field contains the credits associated with header storage. Header credits are in units of one header (including digest) per credit, and are applied to the flow control counter for the virtual channel indicated in VC[2:0] and for the traffic type indicated by the code in Byte 0, Bits 7:4.
VC [2:0] | Byte 0 Bit 2:0 | This field indicates the virtual channel (VC 0-7) receiving the credits.
Type 3:0 | Byte 0 Bit 7:4 | This field contains a code indicating the type of FC DLLP: 0100b = InitFC1-P (Posted Requests); 0101b = InitFC1-NP (Non-Posted Requests); 0110b = InitFC1-Cpl (Completions); 1100b = InitFC2-P (Posted Requests); 1101b = InitFC2-NP (Non-Posted Requests); 1110b = InitFC2-Cpl (Completions); 1000b = UpdateFC-P (Posted Requests); 1001b = UpdateFC-NP (Non-Posted Requests); 1010b = UpdateFC-Cpl (Completions).
Link CRC | Byte 5 Bit 7:0, Byte 4 Bit 7:0 | 16-bit CRC sent to protect the contents of this DLLP. Calculation is made on Bytes 0-3, regardless of whether fields are used.
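A small decoder makes the byte/bit packing in Table 4-20 concrete. The structure and function names below are illustrative sketches; only the field positions are taken from the table.

    #include <stdint.h>
    #include <stdio.h>

    struct fc_dllp_fields {
        uint8_t  fc_type;   /* Byte 0, bits 7:4 - InitFC1/InitFC2/UpdateFC, P/NP/Cpl */
        uint8_t  vc;        /* Byte 0, bits 2:0 - virtual channel 0-7                */
        uint16_t hdr_fc;    /* Byte 2 bits 7:6 (upper) + Byte 1 bits 5:0 (lower)     */
        uint16_t data_fc;   /* Byte 2 bits 3:0 (upper) + Byte 3 bits 7:0 (lower)     */
    };

    /* Decode the 4-byte core of a flow control DLLP (CRC bytes not included). */
    struct fc_dllp_fields decode_fc_dllp(const uint8_t b[4])
    {
        struct fc_dllp_fields f;
        f.fc_type = b[0] >> 4;
        f.vc      = b[0] & 0x07;
        f.hdr_fc  = (uint16_t)((((b[2] >> 6) & 0x03) << 6) | (b[1] & 0x3F));
        f.data_fc = (uint16_t)(((b[2] & 0x0F) << 8) | b[3]);
        return f;
    }

    int main(void)
    {
        /* Example: UpdateFC-P (1000b) for VC0, 4 header credits, 64 data credits. */
        uint8_t core[4] = { 0x80, 0x04, 0x00, 0x40 };
        struct fc_dllp_fields f = decode_fc_dllp(core);
        printf("type=%X vc=%u hdrfc=%u datafc=%u (data = %u bytes of buffer space)\n",
               (unsigned)f.fc_type, (unsigned)f.vc, (unsigned)f.hdr_fc,
               (unsigned)f.data_fc, (unsigned)f.data_fc * 16u);
        return 0;
    }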


Vendor Specific DLLP Format

PCI Express reserves a DLLP type for vendor specific use. Only the Type code is defined. The Vendor DLLP is illustrated in Figure 4-17.
Figure 4-17: Vendor Specific DLLP Packet Format
Definitions Of Vendor Specific DLLP Fields. Table 4-21 on page 207 describes the fields contained in a Vendor-Specific DLLP.
Table 4-21: Vendor-Specific DLLP Fields
Field Name | Header Byte/Bit | DLLP Function
Type 7:0 | Byte 0 Bit 7:0 | This field contains the code indicating a Vendor-specific DLLP: 0011 0000b = Vendor Specific DLLP.
Link CRC | Byte 5 Bit 7:0, Byte 4 Bit 7:0 | 16-bit CRC sent to protect the contents of this DLLP. Calculation is made on Bytes 0-3, regardless of whether fields are used.

ACK/NAK Protocol

The Previous Chapter

Information moves between PCI Express devices in packets. The two major classes of packets are Transaction Layer Packets (TLPs), and Data Link Layer Packets (DLLPs). The use, format, and definition of all TLP and DLLP packet types and their related fields were detailed in that chapter.

This Chapter

This chapter describes a key feature of the Data Link Layer: 'reliable' transport of TLPs from one device to another across the Link. It explains the use of ACK DLLPs to confirm reception of TLPs and the use of NAK DLLPs to indicate that TLPs were received in error, and describes the rules for replaying TLPs when a NAK DLLP is received.

The Next Chapter

The next chapter discusses Traffic Classes, Virtual Channels, and Arbitration that support Quality of Service concepts in PCI Express implementations. The concept of Quality of Service in the context of PCI Express is an attempt to predict the bandwidth and latency associated with the flow of different transaction streams traversing the PCI Express fabric. The use of QoS is based on application-specific software assigning Traffic Class (TC) values to transactions, which define the priority of each transaction as it travels between the Requester and Completer devices. Each TC is mapped to a Virtual Channel (VC) that is used to manage transaction priority via two arbitration schemes called port and VC arbitration.

Reliable Transport of TLPs Across Each Link

The function of the Data Link Layer (shown in Figure 5-1 on page 210) is twofold:
  • 'Reliable' transport of TLPs from one device to another device across the Link.
  • The receiver's Transaction Layer should receive TLPs in the same order that the transmitter sent them. The Data Link Layer must preserve this order despite any occurrence of errors that require TLPs to be replayed (retried).
Figure 5-1: Data Link Layer
The ACK/NAK protocol associated with the Data Link Layer is described with the aid of Figure 5-2 on page 211, which shows the sub-blocks in greater detail. For every TLP that is sent from one device (Device A) to another (Device B) across one Link, the receiver checks for errors in the TLP (using the TLP's LCRC field). The receiver, Device B, notifies the transmitter, Device A, of good or bad reception of TLPs by returning an ACK or a NAK DLLP. Reception of an ACK DLLP by the transmitter indicates that the receiver has received one or more TLP(s) successfully. Reception of a NAK DLLP by the transmitter indicates that the receiver has received one or more TLP(s) in error. Device A, on receiving a NAK DLLP, re-sends the associated TLP(s), which should then arrive at the receiver without error.
The error checking capability in the receiver and the transmitter's ability to resend TLPs if a TLP is not received correctly is the core of the ACK/NAK protocol described in this chapter.
Definition: As used in this chapter, the term Transmitter refers to the device that sends TLPs.
Definition: As used in this chapter, the term Receiver refers to the device that receives TLPs.
Figure 5-2: Overview of the ACK/NAK Protocol

Elements of the ACK/NAK Protocol

Figure 5-3 is a block diagram of a transmitter and a remote receiver connected via a Link. The diagram shows all of the major Data Link Layer elements associated with reliable TLP transfer from the transmitter's Transaction Layer to the receiver's Transaction Layer. Packet order is maintained by the transmitter's and receiver's Transaction Layer.
Figure 5-3: Elements of the ACK/NAK Protocol

Transmitter Elements of the ACK/NAK Protocol

Figure 5-4 on page 215 illustrates the transmitter Data Link Layer elements associated with processing of outbound TLPs and inbound ACK/NAK DLLPs.

Replay Buffer

The replay buffer stores TLPs with all fields including the Data Link Layer-related Sequence Number and LCRC fields. The TLPs are saved in the order of arrival from the Transaction Layer before transmission. Each TLP in the Replay Buffer contains a Sequence Number which is incrementally greater than the sequence number of the previous TLP in the buffer.
When the transmitter receives acknowledgement via an ACK DLLP that TLPs have reached the receiver successfully, it purges the associated TLPs from the Replay Buffer. If, on the other hand, the transmitter receives a NAK DLLP, it replays (i.e., re-transmits) the contents of the buffer.

NEXT_TRANSMIT_SEQ Counter

This counter generates the Sequence Number assigned to each new transmitted TLP. The counter is a 12-bit counter that is initialized to 0 at reset, or when the Data Link Layer is in the inactive state. It increments until it reaches 4095 and then rolls over to 0 (i.e., it is a modulo 4096 counter).

LCRC Generator

The LCRC Generator provides a 32-bit LCRC for the TLP. The LCRC is calculated using all fields of the TLP including the Header, Data Payload, ECRC and Sequence Number. The receiver uses the TLP's LCRC field to check for a CRC error in the received TLP.

REPLAY_NUM Count

This 2-bit counter stores the number of replay attempts following either reception of a NAK DLLP, or a REPLAY_TIMER time-out. When the REPLAY_NUM count rolls over from 11b to 00b, the Data Link Layer triggers a Physical Layer Link re-train (see the description of the LTSSM Recovery state on page 532). It waits for completion of re-training before attempting to transmit TLPs once again. The REPLAY_NUM counter is initialized to 00b at reset, or when the Data Link Layer is inactive. It is also reset whenever an ACK is received, indicating that forward progress is being made in transmitting TLPs.

REPLAY_TIMER Count

The REPLAY_TIMER is used to measure the time from when a TLP is transmitted until an associated ACK or NAK DLLP is received. The REPLAY_TIMER is started (or restarted, if already running) when the last Symbol of any TLP is sent. It restarts from 0 each time that there are outstanding TLPs in the Replay Buffer and an ACK DLLP is received that references a TLP still in the Replay Buffer. It resets to 0 and holds when there are no outstanding TLPs in the Replay Buffer, or until restart conditions are met for each NAK received (except during a replay), or when the REPLAY_TIMER expires. It is not advanced (i.e., its value remains fixed) during Link re-training.

ACKD_SEQ Count

This 12-bit register tracks or stores the Sequence Number of the most recently received ACK or NAK DLLP. It is initialized to all 1s at reset, or when the Data Link Layer is inactive. This register is updated with the AckNak_Seq_Num [11:0] field of a received ACK or NAK DLLP. The ACKD_SEQ count is compared with the NEXT_TRANSMIT_SEQ count.
IF (NEXT_TRANSMIT_SEQ - ACKD_SEQ) mod 4096 ≥ 2048 THEN
New TLPs from the Transaction Layer are not accepted by the Data Link Layer until this condition is no longer true. In addition, a Data Link Layer protocol error, which is a fatal uncorrectable error, is reported. This error condition occurs when there is a separation greater than 2047 between NEXT_TRANSMIT_SEQ and ACKD_SEQ; that is, when the Sequence Number of the TLP about to be transmitted is more than 2047 ahead of the Sequence Number of the most recently acknowledged TLP in the Replay Buffer.
Also, the ACKD_SEQ count is used to check for forward progress made in transmitting TLPs. If no forward progress is made after three additional replay attempts, the Link is re-trained.
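A minimal sketch of the separation check described above, assuming 12-bit sequence numbers held in 16-bit variables:

    #include <stdbool.h>
    #include <stdint.h>

    #define SEQ_MOD 4096u

    /* True when (NEXT_TRANSMIT_SEQ - ACKD_SEQ) mod 4096 >= 2048, i.e. the
     * transmitter has more than 2047 unacknowledged sequence numbers
     * outstanding and must stop accepting new TLPs from the Transaction
     * Layer (and report a Data Link Layer protocol error).               */
    bool must_stall_new_tlps(uint16_t next_transmit_seq, uint16_t ackd_seq)
    {
        uint16_t separation = (uint16_t)((next_transmit_seq + SEQ_MOD - ackd_seq) % SEQ_MOD);
        return separation >= 2048u;
    }

    int main(void)
    {
        /* Example: next Sequence Number to transmit is 2050, last acknowledged is 1. */
        return must_stall_new_tlps(2050, 1) ? 0 : 1;
    }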

DLLP CRC Check

This block checks for CRC errors in DLLPs returned from the receiver. Good DLLPs are further processed. If a DLLP CRC error is detected, the DLLP is discarded and an error reported. No further action is taken.
Definition: The Data Link Layer is in the inactive state when the Physical Layer reports that the Link is non-operational or nothing is connected to the Port. The Physical Layer is in the non-operational state when the Link Training and Status State Machine (LTSSM) is in the Detect, Polling, Configuration, Disabled, Reset or Loopback states during which LinkUp =0 (see Chapter 14 on ’Link Initialization and Training'). While in the inactive state, the Data Link Layer state machines are initialized to their default values and the Replay Buffer is cleared. The Data Link Layer exits the inactive state when the Physical Layer reports LinkUp =1 and the Link Disable bit of the Link Control register =0 .
Figure 5-4: Transmitter Elements Associated with the ACK/NAK Protocol

Receiver Elements of the ACK/NAK Protocol

Figure 5-5 on page 218 illustrates the receiver Data Link Layer elements associated with processing of inbound TLPs and outbound ACK/NAK DLLPs.

Receive Buffer

The receive buffer temporarily stores received TLPs while TLP CRC and Sequence Number checks are performed. If there are no errors, the TLP is processed and transferred to the receiver's Transaction Layer. If there are errors associated with the TLP, it is discarded and a NAK DLLP may be scheduled (more on this later in this chapter). If the TLP is a duplicate TLP (more on this later in this chapter), it is discarded and an ACK DLLP is scheduled. If the TLP is a 'nullified' TLP, it is discarded and no further action is taken (see "Switch Cut-Through Mode" on page 248).

LCRC Error Check

This block checks for LCRC errors in the received TLP using the TLP's 32-bit LCRC field.

NEXT_RCV_SEQ Count

The 12-bit NEXT_RCV_SEQ counter keeps track of the next expected TLP's Sequence Number. This counter is initialized to 0 at reset, or when the Data Link Layer is inactive. This counter is incremented once for each good TLP received that is forwarded to the Transaction Layer. The counter rolls over to 0 after reaching a value of 4095 . The counter is not incremented for TLPs received with CRC error, nullified TLPs, or TLPs with an incorrect Sequence Number.

Sequence Number Check

After the CRC error check, this block verifies that a received TLP's Sequence Number matches the NEXT_RCV_SEQ count.
  • If the TLP's Sequence Number = NEXT_RCV_SEQ count,the TLP is accepted, processed and forwarded to the Transaction Layer. NEXT_RCV_SEQ count is incremented. The receiver continues to process inbound TLPs and does not have to return an ACK DLLP until the ACKNAK_LATENCY_TIMER expires or exceeds its set value.
  • If the TLP's Sequence Number is an earlier Sequence Number than NEXT_RCV_SEQ count and with a separation of no more than 2048 from NEXT_RCV_SEQ count, the TLP is a duplicate TLP. It is discarded and an ACK DLLP is scheduled for return to the transmitter.
  • If the TLP's Sequence Number is a later Sequence Number than NEXT_RCV_SEQ count, or for any other case other than the above two conditions, the TLP is discarded and a NAK DLLP may be scheduled (more on this later) for return to the transmitter.

NAK_SCHEDULED Flag

The NAK_SCHEDULED flag is set when the receiver schedules a NAK DLLP to return to the remote transmitter. It is cleared when the receiver sees the first TLP associated with the replay of a previously-Nak'd TLP. The specification is unclear about whether the receiver should schedule additional NAK DLLPs for bad TLPs received while the NAK_SCHEDULED flag is set. It is the authors' interpretation that the receiver must not schedule the return of additional NAK DLLPs for subsequently received TLPs while the NAK_SCHEDULED flag remains set.

ACKNAK_LATENCY_TIMER

The ACKNAK_LATENCY_TIMER monitors the elapsed time since the last ACK or NAK DLLP was scheduled to be returned to the remote transmitter. The receiver uses this timer to ensure that it processes TLPs promptly and returns an ACK or a NAK DLLP when the timer expires or exceeds its set value. The timer value is set based on a formula described in "Receivers ACKNAK_LATENCY_TIMER" on page 237.

ACK/NAK DLLP Generator

This block generates the ACK or NAK DLLP upon command from the LCRC or Sequence Number check block. The ACK or NAK DLLP carries an AckNak_Seq_Num[11:0] field obtained from the NEXT_RCV_SEQ counter: ACK and NAK DLLPs contain an AckNak_Seq_Num[11:0] value equal to NEXT_RCV_SEQ count - 1.
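A minimal sketch of that rule (AckNak_Seq_Num is always NEXT_RCV_SEQ - 1, with modulo-4096 wraparound):

    #include <stdint.h>

    /* The sequence number space is modulo 4096 (12 bits). */
    #define SEQ_MOD 4096u

    /* AckNak_Seq_Num carried in an ACK or NAK DLLP is always
     * NEXT_RCV_SEQ - 1, with wraparound handled modulo 4096. */
    static uint16_t acknak_seq_num(uint16_t next_rcv_seq)
    {
        return (uint16_t)((next_rcv_seq + SEQ_MOD - 1u) % SEQ_MOD);
    }

    int main(void)
    {
        /* With NEXT_RCV_SEQ = 0, the returned AckNak_Seq_Num wraps to 4095. */
        return acknak_seq_num(0) == 4095 ? 0 : 1;
    }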


Figure 5-5: Receiver Elements Associated with the ACK/NAK Protocol

ACK/NAK DLLP Format

The format of an ACK or NAK DLLP is illustrated in Figure 5-6 on page 219. Table 5-1 describes the ACK or NAK DLLP fields.
Table 5-1: Ack or Nak DLLP Fields
Field Name | Header Byte/Bit | DLLP Function
AckNak_Seq_Num [11:0] | Byte 3 Bit 7:0, Byte 2 Bit 3:0 | For an ACK DLLP: for good TLPs received with Sequence Number = NEXT_RCV_SEQ count (count before incrementing), use NEXT_RCV_SEQ count - 1 (count after incrementing, minus 1); for a TLP received with a Sequence Number earlier than NEXT_RCV_SEQ count (duplicate TLP), use NEXT_RCV_SEQ count - 1. For a NAK DLLP: when associated with a TLP that fails the CRC check, use NEXT_RCV_SEQ count - 1; for a TLP received with a Sequence Number later than NEXT_RCV_SEQ count, use NEXT_RCV_SEQ count - 1. Upon receipt, the transmitter purges TLPs with equal or earlier Sequence Numbers and replays the remaining TLPs.
Type 7:0 | Byte 0 Bit 7:0 | Indicates the type of DLLP. For the Ack/Nak DLLPs: 0000 0000b = ACK DLLP; 0001 0000b = NAK DLLP.
16-bit CRC | Byte 5 Bit 7:0, Byte 4 Bit 7:0 | 16-bit CRC used to protect the contents of this DLLP. Calculation is made on Bytes 0-3 of the ACK/NAK.

ACK/NAK Protocol Details

This section describes the detailed transmitter and receiver behavior in processing TLPs and ACK/NAK DLLPs. The examples demonstrate the flow of TLPs from the transmitter to the remote receiver in both the normal non-error case and the error cases.

Transmitter Protocol Details

This section delves deeper into the ACK/NAK protocol. Consider the transmit side of a device's Data Link Layer shown in Figure 5-4 on page 215.

Sequence Number

Before a transmitter sends TLPs delivered by the Transaction Layer, the Data Link Layer appends a 12-bit Sequence Number to each TLP. The Sequence Number is generated by the 12-bit NEXT_TRANSMIT_SEQ counter. The counter is initialized to 0 at reset, or when the Data Link Layer is in the inactive state. It increments after each new TLP is transmitted until it reaches its maximum value of 4095, and then rolls over to 0. For each new TLP sent, the transmitter appends the Sequence Number from the NEXT_TRANSMIT_SEQ counter.
Keep in mind that an incremented Sequence Number does not necessarily mean a numerically greater Sequence Number (since the counter rolls over after it reaches its maximum value of 4095).

32-Bit LCRC

The transmitter also appends a 32-bit LCRC (Link CRC) calculated based on TLP contents which include the Header, Data Payload, ECRC and Sequence Number.

Replay (Retry) Buffer

General. Before a device transmits a TLP, it stores a copy of the TLP in a buffer associated with the Data Link Layer referred to as the Replay Buffer (the specification uses the term Retry Buffer). Each buffer entry stores a complete TLP with all of its fields, including the Header (up to 16 bytes), an optional Data Payload (up to 4KB), an optional ECRC (up to four bytes), the Sequence Number (12 bits wide, but occupying two bytes) and the LCRC field (four bytes). The buffer size is unspecified. The buffer should be big enough to store transmitted TLPs that have not yet been acknowledged via ACK DLLPs.
When the transmitter receives an ACK DLLP, it purges from the Replay Buffer TLPs with equal to or earlier Sequence Numbers than the Sequence Number received with the ACK DLLPs.
When the transmitter receives a NAK DLLP, it purges the Replay Buffer of TLPs with Sequence Numbers that are equal to or earlier than the Sequence Number that arrives with the NAK, and replays (re-transmits) the TLPs with later Sequence Numbers (the remaining TLPs in the Replay Buffer). This implies that a NAK DLLP inherently acknowledges TLPs with Sequence Numbers equal to or earlier than its AckNak_Seq_Num[11:0], while the remaining TLPs in the Replay Buffer are replayed. Efficient replay strategies are discussed later.
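The purge-and-replay behavior can be sketched with a simple table of stored TLPs. The buffer depth, structure, and helper names below are illustrative; the essential piece is the modulo-4096 comparison that decides which entries an ACK or NAK DLLP acknowledges.

    #include <stdbool.h>
    #include <stdint.h>

    #define SEQ_MOD 4096u

    /* One Replay Buffer entry: only the Sequence Number matters for this sketch. */
    struct replay_entry {
        uint16_t seq;
        bool     valid;
    };

    #define REPLAY_DEPTH 64                 /* illustrative buffer depth */
    static struct replay_entry replay_buf[REPLAY_DEPTH];

    /* A stored TLP is acknowledged by AckNak_Seq_Num 'ack' if its Sequence
     * Number is equal to or logically earlier than 'ack' (modulo 4096).    */
    static bool is_acknowledged(uint16_t tlp_seq, uint16_t ack)
    {
        return ((uint16_t)((ack + SEQ_MOD - tlp_seq) % SEQ_MOD)) < 2048u;
    }

    /* On an ACK DLLP: purge every entry the ACK covers. */
    void on_ack_dllp(uint16_t acknak_seq_num)
    {
        for (int i = 0; i < REPLAY_DEPTH; i++)
            if (replay_buf[i].valid && is_acknowledged(replay_buf[i].seq, acknak_seq_num))
                replay_buf[i].valid = false;
    }

    /* On a NAK DLLP: purge the acknowledged entries, then replay the rest
     * (the entries with later Sequence Numbers), oldest first.            */
    void on_nak_dllp(uint16_t acknak_seq_num)
    {
        on_ack_dllp(acknak_seq_num);   /* a NAK implicitly acknowledges earlier TLPs */
        /* replay of the remaining valid entries would be triggered here */
    }

    int main(void)
    {
        replay_buf[0] = (struct replay_entry){ .seq = 4095, .valid = true };
        replay_buf[1] = (struct replay_entry){ .seq = 0,    .valid = true };
        on_ack_dllp(4095);   /* purges only the entry with Sequence Number 4095 */
        return replay_buf[0].valid || !replay_buf[1].valid;
    }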
Replay Buffer Sizing. The Replay Buffer should be large enough so that, under normal operating conditions, TLP transmissions are not throttled due to a Replay Buffer full condition. To determine what buffer size to implement, one must consider the following:
  • ACK DLLP delivery latency from the receiver.
  • Delays caused by the physical Link interconnect and the Physical Layer implementations.
  • Receiver L0s-to-L0 exit latency; i.e., ideally the buffer should be big enough to hold TLPs while a Link that is in L0s returns to L0.


Transmitter's Response to an ACK DLLP

General. If the transmitter receives an ACK DLLP, it has positive confirmation that its transmitted TLP(s) have reached the receiver successfully. The transmitter associates the Sequence Number contained in the ACK DLLP with TLP entries contained in the Replay Buffer.
A single ACK DLLP returned by the receiver Device B may be used to acknowledge multiple TLPs. It is not necessary for every transmitted TLP to have a corresponding ACK DLLP returned by the remote receiver. This is done to conserve bandwidth by reducing the ACK DLLP traffic on the bus. The receiver gathers multiple TLPs and then collectively acknowledges them with one ACK DLLP that corresponds to the last received good TLP. In InfiniBand, this is referred to as ACK coalescing.
The transmitter's response to reception of an ACK DLLP includes:
  • Load ACKD_SEQ register with AckNak_Seq_Num[11:0] of the ACK DLLP.
  • Reset the REPLAY_NUM counter and REPLAY_TIMER to 0.
  • Purge the Replay Buffer as described below.
Purging the Replay Buffer. An ACK DLLP of a given Sequence Number (contained in the AckNak_Seq_Num[11:0] field) acknowledges the receipt of a TLP with that Sequence Number in the transmitter Replay Buffer, PLUS all TLPs with earlier Sequence Numbers. In other words, an ACK DLLP with a given Sequence Number not only acknowledges a specific TLP in the Replay Buffer (the one with that Sequence Number), but it also acknowledges TLPs of earlier (logically lower) Sequence Numbers. The transmitter purges the Replay Buffer of all TLPs acknowledged by the ACK DLLP.

Examples of Transmitter ACK DLLP Processing

Example 1. Consider Figure 5-7 on page 223, with the emphasis on the transmitter Device A.
  1. Device A transmits TLPs with Sequence Numbers 3, 4, 5, 6, 7, where TLP 3 is the first TLP sent and TLP 7 is the last TLP sent.
  2. Device B receives TLPs with Sequence Numbers 3, 4, 5 in that order. TLPs 6 and 7 are still en route.
  3. Device B performs the error checks and collectively acknowledges good receipt of TLPs 3, 4, 5 with the return of an ACK DLLP with a Sequence Number of 5.
  4. Device A receives ACK 5.
  5. Device A purges TLPs 3, 4, 5 from the Replay Buffer.
  6. When Device B receives TLPs 6 and 7, steps 3 through 5 may be repeated for those packets as well.
Figure 5-7: Example 1 that Shows Transmitter Behavior with Receipt of an ACK DLLP
Example 2. Consider Figure 5-8, with the emphasis on the transmitter Device A.
  1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, 2, where TLP 4094 is the first TLP sent and TLP 2 is the last TLP sent.
  2. Device B receives TLPs with Sequence Numbers 4094, 4095, 0, 1 in that order. TLP 2 is still en route.
  3. Device B performs the error checks and collectively acknowledges good receipt of TLPs 4094, 4095, 0, 1 with the return of an ACK DLLP with a Sequence Number of 1.
  4. Device A receives ACK 1.
  5. Device A purges TLPs 4094, 4095, 0, 1 from the Replay Buffer.
  6. When Device B ultimately receives TLP 2, steps 3 through 5 may be repeated for TLP 2.


Figure 5-8: Example 2 that Shows Transmitter Behavior with Receipt of an ACK DLLP

Transmitter's Response to a NAK DLLP

A NAK DLLP received by the transmitter implies that a TLP transmitted at an earlier time was received by the receiver in error. The transmitter first purges from the Replay Buffer any TLP with a Sequence Number equal to or earlier than the NAK DLLP's AckNak_Seq_Num[11:0]. It then replays (retries) the remaining TLPs, starting with the TLP whose Sequence Number immediately follows the AckNak_Seq_Num[11:0] of the NAK DLLP and ending with the newest TLP. In addition, the transmitter's response to reception of a NAK DLLP includes:
  • Reset REPLAY_NUM and REPLAY_TIMER to 0 only if the NAK DLLP's AckNak_Seq_Num[11:0] is later than the current ACKD_SEQ value (forward progress is made in transmitting TLPs).
  • Load ACKD_SEQ register with AckNak_Seq_Num[11:0] of the NAK DLLP.

TLP Replay

When a Replay becomes necessary, the transmitter blocks the delivery of new TLPs by the Transaction Layer. It then replays (re-sends or retries) the contents of the Replay Buffer starting with the earliest TLP first (of Sequence Number = AckNak_Seq_Num[11:0] + 1) until the remainder of the Replay Buffer is replayed. After the replay event, the Data Link Layer unblocks acceptance of new TLPs from the Transaction Layer. The transmitter continues to save the TLPs just replayed until they are finally acknowledged at a later time.

Efficient TLP Replay

ACK DLLPs or NAK DLLPs received during replay must be processed. This means that the transmitter must process the DLLPs and, at the very least, store them until the replay is finished. After replay is complete, the transmitter evaluates the ACK or NAK DLLPs and performs the appropriate processing.
A more efficient design might begin processing the ACK/NAK DLLPs while the transmitter is still in the act of replaying. By doing so, newly received ACK DLLPs are used to purge the Replay Buffer even while replay is in progress. If another NAK DLLP is received in the meantime, at the very least, the TLPs that were acknowledged have been purged and would not be replayed.
During replay, if multiple ACK DLLPs are received, the last ACK DLLP received (the one with the latest Sequence Number) collapses the earlier ACK DLLPs with earlier Sequence Numbers. During the replay, the transmitter can concurrently purge TLPs with Sequence Numbers equal to or earlier than the AckNak_Seq_Num[11:0] of the last received ACK DLLP.

Example of Transmitter NAK DLLP Processing

Consider Figure 5-9 on page 226, with focus on transmitter Device A.
  1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, and 2, where TLP 4094 is the first TLP sent and TLP 2 is the last TLP sent.
  2. Device B receives TLPs 4094, 4095, and 0 in that order. TLPs 1 and 2 are still en route.
  3. Device B receives TLP 4094 with no error and hence NEXT_RCV_SEQ count increments to 4095.
  4. Device B receives TLP 4095 with a CRC error.
  5. Device B schedules the return of a NAK DLLP with Sequence Number 4094 (NEXT_RCV_SEQ count - 1).


  6. Device A receives NAK 4094 and blocks acceptance of new TLPs from its Transaction Layer until replay completes.
  7. Device A first purges TLP 4094 (and earlier TLPs; none in this example).
  8. Device A then replays TLPs 4095, 0, 1, and 2, but does not purge them.
Figure 5-9: Example that Shows Transmitter Behavior on Receipt of a NAK DLLP
Repeated Replay of TLPs. Each time the transmitter receives a NAK DLLP, it replays the Replay Buffer contents. The transmitter uses a 2-bit replay number counter, referred to as the REPLAY_NUM counter, to keep track of the number of replay events. Reception of a NAK DLLP increments REPLAY_NUM. This counter is initialized to 0 at reset, or when the Data Link Layer is inactive. It is also reset if an ACK or NAK DLLP is received with a later Sequence Number than that contained in the ACKD_SEQ register; as long as forward progress is made in transmitting TLPs, the REPLAY_NUM counter keeps being reset. When a fourth NAK is received, indicating that no forward progress has been made after several tries, the counter rolls over to zero. The transmitter does not replay the TLPs a fourth time; instead, it signals a Replay Number Rollover error. The device assumes that the Link is non-functional or that there is a Physical Layer problem at either the transmitter or receiver end.
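A minimal sketch of this REPLAY_NUM bookkeeping (the helper functions are placeholders for the replay and re-train actions described in the text):

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t replay_num;              /* 2-bit counter, values 0..3 */

    static void start_replay(void)          { puts("replay Replay Buffer contents"); }
    static void trigger_link_retrain(void)  { puts("LTSSM -> Recovery (re-train)"); }

    /* Called on a NAK DLLP or a REPLAY_TIMER expiration. */
    void replay_requested(void)
    {
        replay_num = (uint8_t)((replay_num + 1u) & 0x3u);   /* modulo-4 increment */
        if (replay_num == 0) {
            /* Rolled over from 11b to 00b: no forward progress after the
             * allowed attempts, so re-train the Link instead of replaying. */
            trigger_link_retrain();
        } else {
            start_replay();
        }
    }

    /* Called when an ACK (or a NAK showing forward progress) is received. */
    void forward_progress_made(void)
    {
        replay_num = 0;
    }

    int main(void)
    {
        replay_requested();   /* 1st NAK: replay                */
        replay_requested();   /* 2nd NAK: replay                */
        replay_requested();   /* 3rd NAK: replay                */
        replay_requested();   /* 4th NAK: rollover -> re-train  */
        return 0;
    }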
What Happens After the Replay Number Rollover? A transmitter's Data Link Layer triggers the Physical Layer to re-train the Link. The Physical Layer Link Training and Status State Machine (LTSSM) enters the Recovery State (see "Recovery State" on page 532). The Replay Number Rollover error bit is set ("Advanced Correctable Error Handling" on page 384) in the Advanced Error Reporting registers (if implemented). The Replay Buffer contents are preserved and the Data Link Layer is not initialized by the retraining process. Upon Physical Layer re-training exit, assuming that the problem has been cleared, the transmitter resumes the same replay process again. Hopefully, the TLPs can be re-sent successfully on this attempt.
The specification does not address a device's handling of repeated re-train attempts. The author recommends that a device track the number of re-train attempts. After a re-train rollover the device could signal a Data Link Layer protocol error indicating the severity as an Uncorrectable Fatal Error.

Transmitter's Replay Timer

The transmitter implements a REPLAY_TIMER to measure the time from when a TLP is transmitted until the transmitter receives an associated ACK or NAK DLLP from the remote receiver. A formula (described below) determines the timer's expiration period. Timer expiration triggers a replay event and the REPLAY_NUM count increments. A time-out may arise if an ACK or NAK DLLP is lost en route, or because of an error in the receiver that prevents it from returning an ACK or NAK DLLP. Timer-related rules are:
  • The timer starts (if not already running) when the last Symbol of any TLP is transmitted.
  • The timer is reset to 0 and restarted when:
  - A replay event occurs and the last Symbol of the first replayed TLP is sent.
  - An ACK DLLP is received while there are still unacknowledged TLPs in the Replay Buffer.
  • The timer is reset to 0 and held when:
  - There are no TLPs to transmit, or the Replay Buffer is empty.
  - A NAK DLLP is received (the timer restarts when the replay begins).
  - The timer expires.
  - The Data Link Layer is inactive.
  • The timer is held during Link training or re-training.
REPLAY_TIMER Equation. The timer is loaded with a value that reflects the worst-case latency for the return of an ACK or NAK DLLP. This time depends on the maximum data payload allowed for a TLP and the width of the Link.
The equation to calculate the REPLAY_TIMER value required is:
REPLAY_TIMER = ((Max_Payload_Size + TLP Overhead) × Ack Factor ÷ Link Width + Internal Delay) × 3 + Rx_L0s_Adjustment
The resulting value is expressed in Symbol Times (one Symbol Time = 4 ns).
The equation fields are defined as follows:
  • Max_Payload_Size is the value in the Max_Payload_Size field of the Device Control Register ("Device Capabilities Register" on page 900).
  • TLP Overhead includes the additional TLP fields beyond the data payload (header, digest, LCRC, and Start/End framing symbols). In the specification, the overhead value is treated as a constant of 28 symbols.
  • The Ack Factor is a fudge factor that represents the number of maximum-sized TLPs (based on Max_Payload) that can be received before an ACK DLLP must be sent. The AF value ranges from 1.0 to 3.0 and is used to balance Link bandwidth efficiency and Replay Buffer size. Figure 5-10 on page 229 summarizes the Ack Factor values for various Link widths and payloads. These Ack Factor values are chosen to allow implementations to achieve good performance without requiring a large uneconomical buffer.
  • Link Width is the number of lanes in the Link, ranging from 1 to 32 (x1 to x32).
  • Internal Delay is the receiver's internal delay between receiving a TLP, processing it at the Data Link Layer, and returning an ACK or NAK DLLP. It is treated as a constant of 19 symbol times in these calculations.
  • Rx_L0s_Adjustment is the time required by the receive circuits to exit from L0s to L0, expressed in symbol times.
REPLAY_TIMER Summary Table. Figure 5-10 on page 229 is a summary table that shows possible timer load values with various variables plugged into the REPLAY_TIMER equation.
Figure 5-10: Table and Equation to Calculate REPLAY_TIMER Load Value
Max_Payload Size | x1 Link | x2 Link | x4 Link | x8 Link | x12 Link | x16 Link | x32 Link
128 Bytes | 711 | 384 | 219 | 201 | 174 | 144 | 99
256 Bytes | 1248 | 651 | 354 | 321 | 270 | 216 | 135
512 Bytes | 1677 | 867 | 462 | 258 | 327 | 258 | 156
1024 Bytes | 3213 | 1635 | 846 | 450 | 582 | 450 | 252
2048 Bytes | 6285 | 3171 | 1614 | 834 | 1095 | 834 | 444
4096 Bytes | 12,429 | 6243 | 3150 | 1602 | 2118 | 1602 | 828
The table summarizes values (in Symbol Times) calculated using the equation, minus the Rx_L0s_Adjustment term.
Example: Assume a 2-lane link with a Max_Payload of 2048 bytes.
[(2048 + 28) × 1.0 ÷ 2 + 19] × 3 = 3171 Symbol Times (about a 12.7 µs timeout period)
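The worked example generalizes directly. The sketch below evaluates the REPLAY_TIMER formula for any payload size, Ack Factor, and Link width; the Ack Factor must be supplied by the caller from the table above, and the 28-symbol overhead and 19-symbol internal delay are the constants the text uses.

    #include <stdio.h>

    #define TLP_OVERHEAD_SYMBOLS   28.0  /* header, digest, LCRC, framing (per the text) */
    #define INTERNAL_DELAY_SYMBOLS 19.0  /* receiver internal delay (per the text)       */

    /* REPLAY_TIMER load value in Symbol Times (1 Symbol Time = 4 ns). */
    double replay_timer_symbols(double max_payload_bytes, double ack_factor,
                                double link_width_lanes, double rx_l0s_adjustment)
    {
        return ((max_payload_bytes + TLP_OVERHEAD_SYMBOLS) * ack_factor / link_width_lanes
                + INTERNAL_DELAY_SYMBOLS) * 3.0
               + rx_l0s_adjustment;
    }

    int main(void)
    {
        /* The example above: x2 Link, 2048-byte Max_Payload, Ack Factor 1.0, no L0s term. */
        double symbols = replay_timer_symbols(2048, 1.0, 2, 0);
        printf("%.0f Symbol Times = %.1f us\n", symbols, symbols * 4e-3);
        return 0;
    }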

Transmitter DLLP Handling

The DLLP CRC Error Checking block determines whether there is a CRC error in the received DLLP. The DLLP includes a 16-bit CRC for this purpose (see Table 5-1 on page 219). If there are no DLLP CRC errors, then the DLLPs are further processed. If a DLLP CRC error is detected, the DLLP is discarded, and the error is reported as a DLLP CRC error to the error handling logic which logs the error in the optional Advanced Error Reporting registers (see Bad DLLP in "Advanced Correctable Error Handling" on page 384). No further action is taken.
Discarding an ACK or NAK DLLP received in error is not a severe response because a subsequently received DLLP will accomplish the same goal as the discarded DLLP. The side effect of this action is that associated TLPs are purged a little later than they would have been or that a replay happens at a later time. If a subsequent DLLP is not received in time, the transmitter REPLAY_TIMER expires anyway, and the TLPs are replayed.

Receiver Protocol Details

Consider the receive side of a device's Data Link Layer shown in Figure 5-5 on page 218.

TLP Received at Physical Layer

TLPs received at the Physical Layer are checked for STP and END framing errors as well as other receiver errors such as disparity errors. If there are no errors, the TLPs are passed to the Data Link Layer. If there are any errors, the TLP is discarded and the allocated storage is freed up. The Data Link Layer is informed of this error so that it can schedule a NAK DLLP. (see "Receiver Schedules a NAK" on page 233).

Received TLP Error Check

The receiver accepts TLPs from the Link into a receiver buffer and checks for CRC errors. The receiver calculates an expected LCRC value based on the received TLP (excluding the LCRC field) and compares this value with the TLP's 32-bit LCRC. If the two match, the TLP is good. If the two LCRC values do not match, the received TLP is bad and the receiver schedules a NAK DLLP to be returned to the remote transmitter. The receiver also checks for other types of non-CRC related errors (such as that described in the next section).

Next Received TLP's Sequence Number

The receiver keeps track of the next expected TLP's Sequence Number via a 12- bit counter referred to as the NEXT_RCV_SEQ counter. This counter is initialized to 0 at reset, or when the Data Link Layer is inactive. This counter is incremented once for each good TLP that is received and forwarded to the Transaction Layer. The counter rolls over to 0 after reaching a value of 4095 .
The receiver uses the NEXT_RCV_SEQ counter to identify the Sequence Number that should be in the next received TLP. If a received TLP has no LCRC error, the device compares its Sequence Number with the NEXT_RCV_SEQ count. Under normal operational conditions, these two numbers should match. If this is the case, the receiver accepts the TLP, forwards the TLP to the Transaction Layer, increments the NEXT_RCV_SEQ counter and is ready for the next TLP. An ACK DLLP may be scheduled for return if the ACKNAK_LATENCY_TIMER expires or exceeds its set value. The receiver is ready to perform a comparison on the next received TLP's Sequence Number.
In some cases, a received TLP's Sequence Number may not match the NEXT_RCV_SEQ count. The received TLP's Sequence Number may be either logically greater than or logically less than NEXT_RCV_SEQ count (a logical number in this case accounts for the count rollover, so in fact a logically greater number may actually be a lower number if the count rolls over). See "Receiver Sequence Number Check" on page 234 for details on these two abnormal conditions.
For a TLP received with a CRC error, or a nullified TLP or a TLP for which the Sequence Number check described above fails, the NEXT_RCV_SEQ counter is not incremented.

Receiver Schedules An ACK DLLP

If the receiver does not detect an LCRC error (see "Received TLP Error Check" on page 230) or a Sequence Number related error (see "Next Received TLP's Sequence Number" on page 230) associated with a received TLP, it accepts the TLP and sends it to the Transaction Layer. The NEXT_RCV_SEQ counter is incremented and the receiver is ready for the next TLP. At this point, the receiver can schedule an ACK DLLP with the Sequence Number of the received TLP (see the AckNak_Seq_Num[11:0] field described in Table 5-1 on page 219). Alternatively, the receiver could also wait for additional TLPs and schedule an ACK DLLP with the Sequence Number of the last good TLP received.
The receiver is allowed to accumulate a number of good TLPs and then sends one aggregate ACK DLLP with a Sequence Number of the latest good TLP received. The coalesced ACK DLLP acknowledges the good receipt of a collection of TLPs starting with the oldest TLP in the transmitter's Replay Buffer and ending with the TLP being acknowledged by the current ACK DLLP. By doing so, the receiver optimizes the use of Link bandwidth due to reduced ACK DLLP traffic. The frequency with which ACK DLLPs are scheduled for return is described in "Receivers ACKNAK_LATENCY_TIMER" on page 237. When the ACKNAK_LATENCY_TIMER expires or exceeds its set value and TLPs are received, an ACK DLLP with a Sequence Number of the last good TLP is returned to the transmitter.
When the receiver schedules an ACK DLLP to be returned to the remote transmitter, the receiver might have other packets (TLPs, DLLPs or PLPs) enqueued that also have to be transmitted on the Link in the same direction as the ACK DLLP. This implies that the receiver may not immediately return the ACK DLLP to the transmitter, especially if a large TLP (with up to a 4KB data payload) is already being transmitted (see "Recommended Priority To Schedule Packets" on page 244).
The receiver continues to receive TLPs and as long as there are no detected errors (LCRC or Sequence Number errors), it forwards the TLPs to the Transaction Layer. When the receiver has the opportunity to return the ACK DLLP to the remote transmitter, it appends the Sequence Number of the latest good TLP received and returns the ACK DLLP. Upon receipt of the ACK DLLP, the remote transmitter purges its Replay Buffer of the TLPs with matching Sequence Numbers and all TLPs transmitted earlier than the acknowledged TLP.

Example of Receiver ACK Scheduling

Example: Consider Figure 5-11 on page 233, with focus on the receiver Device B.
  1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, and 2, where TLP 4094 is the first TLP sent and TLP 2 is the last TLP sent.
  2. Device B receives TLPs with Sequence Numbers 4094, 4095, 0, and 1, in that order. NEXT_RCV_SEQ count increments to 2. TLP 2 is still en route.
  3. Device B performs error checks and issues a coalesced ACK to collectively acknowledge receipt of TLPs 4094, 4095, 0, and 1, with the return of an ACK DLLP with Sequence Number of 1.
  4. Device B forwards TLPs 4094, 4095, 0, and 1 to its Transaction Layer.
  5. When Device B ultimately receives TLP 2, steps 3 and 4 may be repeated for TLP 2.
Figure 5-11: Example that Shows Receiver Behavior with Receipt of Good TLP

NAK Scheduled Flag

The receiver implements a Flag bit referred to as the NAK_SCHEDULED flag. When a receiver detects a TLP CRC error, or any other non-CRC related error that requires it to schedule a NAK DLLP to be returned, the receiver sets the NAK_SCHEDULED flag and clears it when the receiver detects replayed TLPs from the transmitter for which there are no CRC errors.

Receiver Schedules a NAK

Upon receipt of a TLP, the first type of error condition the receiver may detect is a TLP LCRC error (see "Received TLP Error Check" on page 230). The receiver discards the bad TLP. If the NAK_SCHEDULED flag is clear, it schedules a NAK DLLP to return to the transmitter. The NAK_SCHEDULED flag is then set. The receiver uses the NEXT_RCV_SEQ count - 1 count value as the AckNak_Seq_Num [11:0] field in the NAK DLLP (Table 5-1 on page 219). At the time the receiver schedules a NAK DLLP to return to the transmitter, the Link may be in use to transmit other queued TLPs, DLLPs or PLPs. In that case, the
receiver delays the transmission of the NAK DLLP (see "Recommended Priority To Schedule Packets" on page 244). When the Link becomes available, however, it sends the NAK DLLP to the remote transmitter. The transmitter replays the TLPs from the Replay Buffer (see "TLP Replay" on page 225).
In the meantime, TLPs currently en route continue to arrive at the receiver. These TLPs have later Sequence Numbers than the NEXT_RCV_SEQ count. The receiver discards them. The specification is unclear about whether the receiver should schedule a NAK DLLP for these TLPs. It is the authors' interpretation that the receiver must not schedule the return of additional NAK DLLPs for subsequently received TLPs while the NAK_SCHEDULED flag remains set.
The receiver detects a replayed TLP when it receives a TLP with a Sequence Number that matches the NEXT_RCV_SEQ count. If the replayed TLPs arrive with no errors, the receiver increments the NEXT_RCV_SEQ count and clears the NAK_SCHEDULED flag. The receiver may schedule an ACK DLLP for return to the transmitter if the ACKNAK_LATENCY_TIMER expires. The good replayed TLPs are forwarded to the Transaction Layer.
There is a second scenario under which the receiver schedules NAK DLLPs to return to the transmitter. If the receiver detects a TLP whose Sequence Number is later than the next expected Sequence Number indicated by the NEXT_RCV_SEQ count (i.e., a Sequence Number separated from the NEXT_RCV_SEQ count by more than 2048), the procedure described above is repeated. See "Receiver Sequence Number Check" below for the reasons why this could happen.
The two error conditions just described wherein a NAK DLLP is scheduled for return are reported as errors associated with the Data Link Layer. The error reported is a bad TLP error with a severity of correctable.

Receiver Sequence Number Check

Every received TLP that passes the CRC check goes through a Sequence Number check. The received TLPs Sequence Number is compared with the NEXT_RCV_SEQ count. Below are three possibilities:
  • TLP Sequence Number equal NEXT_RCV_SEQ count. This situation results when a good TLP is received. It also occurs when a replayed TLP is received. The TLP is accepted and forwarded to the Transaction Layer. NEXT_RCV_SEQ count is incremented and an ACK DLLP may be scheduled (according to the ACK DLLP scheduling rules described in "Receiver Schedules An ACK DLLP" on page 231).
  • TLP Sequence Number is logically less than NEXT_RCV_SEQ count (earlier Sequence Number). This situation results when a duplicate TLP is received as the result of a replay event. The duplicate TLP is discarded. NEXT_RCV_SEQ count is not incremented. An ACK DLLP is scheduled so that the transmitter can purge its Replay Buffer of the duplicate TLP(s). The receiver uses the NEXT_RCV_SEQ count - 1 in the ACK DLLP's AckNak_Seq_Num[11:0] field. What scenario results in a duplicate TLP being received? Consider this example. A receiver accepts a TLP and returns an associated ACK DLLP and increments the NEXT_RCV_SEQ count. The ACK DLLP is lost en route to the transmitter. As a result, this TLP remains in the remote transmitter's Replay Buffer. The transmitter's REPLAY_TIMER expires when no further ACK DLLPs are received. This causes the transmitter to replay the entire contents of the Replay Buffer. The receiver sees these TLPs with earlier Sequence Numbers than the NEXT_RCV_SEQ count and discards them because they are duplicate TLPs. More precisely, a TLP is a duplicate TLP if:
(NEXT_RCV_SEQ - TLP Sequence Number) mod 4096 ≤ 2048. An ACK DLLP is returned for every duplicate TLP received.
  • TLP Sequence Number is logically greater than NEXT_RCV_SEQ count (later Sequence Number). This situation results when one or more TLPs are lost en route. The receiver schedules a NAK DLLP for return to the transmitter if the NAK_SCHEDULED flag is clear (see the NAK DLLP scheduling rules described in "Receiver Schedules a NAK" on page 233). NEXT_RCV_SEQ count does not increment when TLPs with later Sequence Numbers are received. (A sketch collecting these three checks follows this list.)
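A minimal sketch collecting the three checks above into one routine. The helper functions are placeholders for the actions described in this chapter, and the duplicate test is the modulo-4096 comparison given in the second bullet.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SEQ_MOD 4096u

    static uint16_t next_rcv_seq;
    static bool     nak_scheduled;

    static void forward_to_transaction_layer(void) { puts("accept TLP"); }
    static void schedule_ack(unsigned seq)         { printf("schedule ACK %u\n", seq); }
    static void schedule_nak(unsigned seq)         { printf("schedule NAK %u\n", seq); }

    /* Sequence Number check for a TLP that has already passed the LCRC check. */
    void check_sequence_number(uint16_t tlp_seq)
    {
        unsigned ack_seq = (next_rcv_seq + SEQ_MOD - 1u) % SEQ_MOD;

        if (tlp_seq == next_rcv_seq) {
            /* Expected TLP (or a replayed TLP): accept and move on. */
            forward_to_transaction_layer();
            next_rcv_seq = (uint16_t)((next_rcv_seq + 1u) % SEQ_MOD);
            nak_scheduled = false;              /* the replay, if any, has arrived */
            /* an ACK may be scheduled here or when ACKNAK_LATENCY_TIMER expires  */
        } else if (((next_rcv_seq + SEQ_MOD - tlp_seq) % SEQ_MOD) <= 2048u) {
            /* Earlier Sequence Number: duplicate TLP. Discard it, but ACK so
             * the transmitter can purge it from its Replay Buffer.            */
            schedule_ack(ack_seq);
        } else {
            /* Later Sequence Number: one or more TLPs were lost. Discard and
             * NAK once (no further NAKs while NAK_SCHEDULED remains set).     */
            if (!nak_scheduled) {
                schedule_nak(ack_seq);
                nak_scheduled = true;
            }
        }
    }

    int main(void)
    {
        check_sequence_number(0);   /* expected TLP: accepted        */
        check_sequence_number(0);   /* duplicate: discarded, ACK 0   */
        check_sequence_number(5);   /* later than expected: NAK once */
        return 0;
    }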

Receiver Preserves TLP Ordering

In addition to guaranteeing reliable TLP transport, the ACK/NAK protocol preserves packet ordering. The receiver's Transaction Layer receives TLPs in the same order that the transmitter sent them.
A transmitter correctly orders TLPs according to the ordering rules before transmission in order to maintain correct program flow and to eliminate the occurrence of potential deadlock and livelock conditions (see Chapter 8, entitled "Transaction Ordering," on page 315). The Receiver is required to preserve TLP order (otherwise, application program flow is altered). To preserve this order, the receiver applies three rules:
  • When the receiver detects a bad TLP, it discards the TLP and all new TLPs that follow in the pipeline until the replayed TLPs are detected.
  • Also, duplicate TLPs are discarded.
  • TLPs that are received after one or more lost TLPs are discarded.
For TLPs that arrive after the first bad TLP, the motivation for discarding them, not forwarding them to the Transaction Layer, and scheduling a NAK DLLP is as follows. When the receiver detects a bad TLP, it discards it and any new TLPs in the pipeline. The receiver then waits for the TLP replay. After verifying that there are no errors in the replayed TLP(s), the receiver forwards them to the Transaction Layer and resumes acceptance of new TLPs in the pipeline. Doing so preserves TLP receive and acceptance order at the receiver's Transaction Layer.

Example of Receiver NAK Scheduling

Example: Consider Figure 5-12 on page 237 with emphasis on the receiver Device B.
  1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, and 2, where TLP 4094 is the first TLP sent and TLP 2 is the last TLP sent.
  1. Device B receives TLPs 4094, 4095, and 0, in that order. TLPs 1 and 2 are still in flight.
  1. Device B receives TLP 4094 with no errors and forwards it to the Transaction Layer. NEXT_RCV_SEQ count increments to 4095.
  1. Device B detects an LCRC error in TLP 4095 and hence returns a NAK DLLP with a Sequence Number of 4094 (NEXT_RCV_SEQ count - 1). The NAK_SCHEDULED flag is set. NEXT_RCV_SEQ count does not increment.
  1. Device B discards TLP 4095.
  1. Device B also discards TLP 0, even though it is a good TLP. TLPs 1 and 2 are also discarded when they arrive.
  1. Device B does not schedule a NAK DLLP for TLPs 0, 1, and 2 because the NAK_SCHEDULED flag is set.
  1. Device A receives NAK 4094.
  1. Device A does not accept any new TLPs from its Transaction Layer.
  1. Device A first purges TLP 4094.
  1. Device A then replays TLPs 4095, 0, 1, and 2, but continues to save these TLPs in the Replay Buffer. It then accepts TLPs from the Transaction Layer.
  1. Replayed TLPs 4095, 0, 1, and 2 arrive at Device B in that order.
  1. After verifying that there are no CRC errors in the received TLPs, Device B detects TLP 4095 as a replayed TLP because it has a Sequence Number equal to the NEXT_RCV_SEQ count. The NAK_SCHEDULED flag is cleared.
  1. Device B forwards these TLPs to the Transaction Layer in this order: 4095, 0, 1, and 2.
Figure 5-12: Example that Shows Receiver Behavior When It Receives Bad TLPs

Receiver's ACKNAK_LATENCY_TIMER

The ACKNAK_LATENCY_TIMER measures the duration since an ACK or NAK DLLP was scheduled for return to the remote transmitter. This timer has a value that is approximately 1/3 that of the transmitter REPLAY_TIMER. When the timer expires, the receiver schedules an ACK DLLP with a Sequence Number of the last good unacknowledged TLP received. The timer guarantees that the receiver schedules an ACK or NAK DLLP for a received TLP before the transmitter's REPLAY_TIMER expires causing it to replay.
The timer resets to 0 and restarts when an ACK or NAK DLLP is scheduled.
The timer resets to 0 and holds when:
  • All received TLPs have been acknowledged.
  • The Data Link Layer is in the inactive state.
ACKNAK_LATENCY_TIMER Equation. The receiver's ACKNAK_LATENCY_TIMER is loaded with a value that reflects the worst-case transmission latency in sending an ACK or NAK in response to a received TLP. This time depends on the anticipated maximum payload size and the width of the Link.
The equation to calculate the ACKNAK_LATENCY_TIMER value required is:
((Max_Payload_Size + TLP_Overhead) * Ack_Factor) / Link_Width + Internal_Delay + Tx_L0s_Adjustment
The value in the timer represents symbol times (a symbol time = 4 ns).
The fields above are defined as follows:
  • Max_Payload_Size is the value in the Max_Payload_Size field of the Device Control Register (see page 900).
  • TLP Overhead includes the additional TLP fields beyond the data payload (header, digest, LCRC, and Start/End framing symbols). In the specification, the overhead value is treated as a constant of 28 symbols.
  • The Ack Factor (AF) is the largest number of maximum-sized TLPs (based on Max_Payload_Size) that can be received before an ACK DLLP is sent. The AF value (essentially a fudge factor) ranges from 1.0 to 3.0 and is used to balance Link bandwidth efficiency against Replay Buffer size. Figure 5-10 on page 229 summarizes the Ack Factor values for the various Link widths and payload sizes. These Ack Factor values are chosen to allow implementations to achieve good performance without requiring a large, uneconomical buffer.
  • Link Width is the width of the Link, ranging from x1 to x32 lanes.
  • Internal Delay is the receiver's internal delay between receiving a TLP, processing it at the Data Link Layer, and returning an ACK or NAK DLLP. It is treated as a constant of 19 symbol times in these calculations.
  • Tx_L0s_Adjustment: If L0s is enabled, the time required for the transmitter to exit L0s, expressed in symbol times. Note that setting the Extended Sync bit of the Link Control register affects the exit time from L0s and must be taken into account in this adjustment.
It turns out that the timer values calculated with this equation are approximately one-third of the corresponding REPLAY_TIMER latency values in Figure 5-10 on page 229.
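As a quick worked example of the equation, the following sketch (an illustrative function, not part of any specification or standard API) plugs in the constants stated above of 28 symbols of TLP overhead and 19 symbol times of internal delay; the sample values correspond to the x8 Link, 128-byte entry of Figure 5-13:

```python
def acknak_latency_limit(max_payload_size: int, link_width: int,
                         ack_factor: float, tx_l0s_adjustment: int = 0) -> int:
    """Evaluate the ACKNAK_LATENCY_TIMER equation (result in symbol times)."""
    TLP_OVERHEAD = 28      # header, digest, LCRC, framing (symbols)
    INTERNAL_DELAY = 19    # receiver internal processing delay (symbol times)
    symbols = ((max_payload_size + TLP_OVERHEAD) * ack_factor) / link_width
    return int(symbols + INTERNAL_DELAY + tx_l0s_adjustment)

# Example: 128-byte Max_Payload_Size on a x8 Link with an Ack Factor of 2.5
print(acknak_latency_limit(128, 8, 2.5))   # -> 67 symbol times (~268 ns)
```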
ACKNAK_LATENCY_TIMER Summary Table. Figure 5-13 on page 239 is a summary table that shows possible timer load values with various variables plugged into the ACKNAK_LATENCY_TIMER equation.
Figure 5-13: Table to Calculate ACKNAK_LATENCY_TIMER Load Value
| Max_Payload_Size | x1 Link | x2 Link | x4 Link | x8 Link | x12 Link | x16 Link | x32 Link |
|---|---|---|---|---|---|---|---|
| 128 Bytes | 237 (AF=1.4) | 128 (AF=1.4) | 73 (AF=1.4) | 67 (AF=2.5) | 58 (AF=3.0) | 48 (AF=3.0) | 33 (AF=3.0) |
| 256 Bytes | 416 (AF=1.4) | 217 (AF=1.4) | 118 (AF=1.4) | 107 (AF=2.5) | 90 (AF=3.0) | 72 (AF=3.0) | 45 (AF=3.0) |
| 512 Bytes | 559 (AF=1.0) | 289 (AF=1.0) | 154 (AF=1.0) | 86 (AF=1.0) | 109 (AF=2.0) | 86 (AF=2.0) | 52 (AF=2.0) |
| 1024 Bytes | 1071 (AF=1.0) | 545 (AF=1.0) | 282 (AF=1.0) | 150 (AF=1.0) | 194 (AF=2.0) | 150 (AF=2.0) | 84 (AF=2.0) |
| 2048 Bytes | 2095 (AF=1.0) | 1057 (AF=1.0) | 538 (AF=1.0) | 278 (AF=1.0) | 365 (AF=2.0) | 278 (AF=2.0) | 148 (AF=2.0) |
| 4096 Bytes | 4143 (AF=1.0) | 2081 (AF=1.0) | 1050 (AF=1.0) | 534 (AF=1.0) | 706 (AF=2.0) | 534 (AF=2.0) | 276 (AF=2.0) |

Error Situations Reliably Handled by ACK/NAK Protocol

This section describes the possible sources of errors that may occur in the delivery of TLPs from a transmitter to a receiver across a Link. The ACK/NAK protocol guarantees reliable delivery of TLPs even in the unlikely event that these errors occur. Below is a bullet list of errors and the related error correction mechanism the protocol uses to resolve each one:
  • Problem: CRC error occurs in transmission of a TLP (see "Transmitter's Response to a NAK DLLP" on page 224 and "Receiver Schedules a NAK" on page 233.)
Solution: Receiver detects LCRC error and schedules a NAK DLLP with Sequence Number = NEXT_RCV_SEQ count - 1. Transmitter replays TLPs.
  • Problem: One or more TLPs are lost en route to the receiver.
Solution: The receiver performs a Sequence Number check on all received TLPs. The receiver expects each arriving TLP to carry a 12-bit Sequence Number that is one greater than that of the previous TLP. If one or more TLPs are lost en route, a TLP arrives with a Sequence Number later than the expected Sequence Number reflected in the NEXT_RCV_SEQ count. The receiver schedules a NAK DLLP with a Sequence Number = NEXT_RCV_SEQ count - 1. The transmitter replays the Replay Buffer contents.
  • Problem: Receiver returns an ACK DLLP, but it is corrupted en route to the transmitter. The remote Transmitter detects a CRC error in the DLLP (DLLP is covered by 16-bit CRC, see "ACK/NAK DLLP Format" on page 219). In fact, the transmitter does not know that the malformed DLLP just received is supposed to be an ACK DLLP. All it knows is that the packet is a DLLP. Solution:
  • Case 1: The Transmitter discards the DLLP. A subsequent ACK DLLP received with a later Sequence Number causes the transmitter's Replay Buffer to purge all TLPs with equal or earlier Sequence Numbers. The transmitter never knows that anything went wrong.
  • Case 2: The Transmitter discards the DLLP. A subsequent NAK DLLP received with a later Sequence Number causes the transmitter's Replay Buffer to purge all TLPs with equal or earlier Sequence Numbers. The transmitter then replays all TLPs with later Sequence Numbers, through the last TLP in the Replay Buffer. The transmitter never knows that anything went wrong.
  • Problem: ACK or NAK DLLPs for received TLPs are not returned by the receiver within the proper ACKNAK_LATENCY_TIMER time-out. The associated TLPs remain in the transmitter's Replay Buffer.
Solution: The REPLAY_TIMER times-out and the transmitter replays its Replay Buffer.
  • Problem: The Receiver returns a NAK DLLP but it is corrupted en route to the transmitter. The remote Transmitter detects a CRC error in the DLLP. In fact, the transmitter does not know that the DLLP received is supposed to be an NAK DLLP. All it knows is that the packet is a DLLP.
Solution: The Transmitter discards the DLLP. The receiver discards all subsequently received TLPs and awaits the replay. Given that the NAK was discarded by the transmitter, its REPLAY_TIMER expires and triggers the replay.
  • Problem: Due to an error in the receiver, it is unable to schedule an ACK or NAK DLLP for a received TLP.
Solution: The transmitter REPLAY_TIMER will expire and result in TLP replay.

ACK/NAK Protocol Summary

Refer to Figure 5-3 on page 212 and the following subsections for a summary of the elements of the Data Link Layer.

Transmitter Side

Non-Error Case (ACK DLLP Management)

  • Unless blocked by the Data Link Layer, the Transaction Layer passes down the Header, Data, and Digest information for each TLP to be sent.
  • Each TLP is assigned a 12-bit Sequence Number using current NEXT_TRANSMIT_SEQ count.
  • A check is made to see if the acceptance of new TLPs from the Transaction Layer should be blocked. The transmitter performs a modulo-4096 subtraction of the ACKD_SEQ count from the NEXT_TRANSMIT_SEQ count to see if the result is >= 2048d. If it is, further TLPs are blocked until incoming ACK/NAK DLLPs render the condition untrue (see the sketch after this list).
  • The NEXT_TRANSMIT_SEQ counter increments by one for each TLP processed. Note: if the transmitter wants to nullify a TLP being sent, it sends an inverted LCRC to the Physical Layer and indicates that the packet should end with an EDB (End Bad) symbol (NEXT_TRANSMIT_SEQ is not incremented). See "Switch Cut-Through Mode" on page 248 for details.
  • A 32-bit LCRC value is calculated for the TLP (the LCRC calculation includes the Sequence Number).
  • A copy of the TLP is placed in the Replay Buffer and the TLP is forwarded to the Physical Layer for transmission.
  • The Physical Layer adds STP and END framing symbols, then transmits the packet.
  • At a later time, assume the transmitter receives an ACK DLLP from the receiver. It performs a CRC check and, if the check fails, discards the ACK DLLP (the same holds true if a bad NAK DLLP is received). If the check passes, it purges the Replay Buffer of TLPs from the oldest TLP up to and including the TLP with the Sequence Number that matches the Sequence Number in the ACK DLLP.
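A minimal sketch of the transmit-side bookkeeping described above, assuming a simple in-order Replay Buffer (the class and method names are hypothetical, not from the specification):

```python
from collections import OrderedDict

class DataLinkTransmitter:
    """Minimal model of the transmit-side ACK/NAK bookkeeping."""

    def __init__(self):
        self.next_transmit_seq = 0         # 12-bit Sequence Number counter
        self.ackd_seq = 4095               # Sequence Number of last acknowledged TLP
        self.replay_buffer = OrderedDict() # seq -> TLP copy, oldest first

    def tx_blocked(self) -> bool:
        # Block when NEXT_TRANSMIT_SEQ has advanced half the sequence space past ACKD_SEQ
        return (self.next_transmit_seq - self.ackd_seq) % 4096 >= 2048

    def send_tlp(self, tlp) -> int:
        assert not self.tx_blocked(), "acceptance of new TLPs is blocked"
        seq = self.next_transmit_seq
        self.replay_buffer[seq] = tlp              # keep a copy for possible replay
        self.next_transmit_seq = (seq + 1) % 4096
        return seq                                 # Sequence Number sent with the TLP

    def handle_ack(self, acknak_seq_num: int):
        # Purge the acknowledged TLP and all older TLPs from the Replay Buffer
        if acknak_seq_num not in self.replay_buffer:
            return                                 # nothing new acknowledged
        for seq in list(self.replay_buffer):
            self.replay_buffer.pop(seq)
            self.ackd_seq = seq
            if seq == acknak_seq_num:
                break
```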

Error Case (NAK DLLP Management)

Repeat the process described in the previous section, but this time, assume that the transmitter receives a NAK DLLP:
  • Upon receipt of a NAK DLLP with no CRC error, the transmitter performs the Replay by following the sequence of steps below (a sketch follows this list). NOTE: the same sequence of events occurs if the REPLAY_TIMER expires instead.
  • The REPLAY_NUM is incremented. The maximum number of attempts to clear (ACK) all unacknowledged TLPs in the Replay Buffer is four.
  • If the REPLAY_NUM count rolls over from 11b to 00b, the transmitter instructs the Physical Layer to re-train the Link.
  • If REPLAY_NUM does not roll over, proceed.
  • Block acceptance of new TLPs from the Transaction Layer.
  • Complete transmission of any TLPs in progress.
  • Purge any TLPs with Sequence Numbers equal to or earlier than the NAK DLLP's AckNak_Seq_Num[11:0].
  • Re-transmit TLPs with Sequence Numbers later than the NAK DLLP's AckNak_Seq_Num[11:0].
  • ACK DLLPs or NAK DLLPs received during replay must be processed. The transmitter may disregard them until replay is complete, or use them during replay to skip transmission of newly acknowledged TLPs. ACK DLLPs with earlier Sequence Numbers are collapsed into (superseded by) an ACK DLLP received with a later Sequence Number. Also, an ACK DLLP with a later Sequence Number than an earlier-received NAK DLLP supersedes that NAK DLLP.
  • When the replay is complete, unblock TLPs and return to normal operation.
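Continuing the transmit-side sketch above, NAK handling and replay might look like the following (again with hypothetical names, and with the Physical Layer reduced to two assumed callbacks, transmit() and retrain_link()):

```python
def handle_nak(tx, acknak_seq_num: int, physical_layer) -> None:
    """Replay sketch for the DataLinkTransmitter model above (hypothetical).

    REPLAY_NUM is a 2-bit counter; a rollover from 11b to 00b (a fourth
    consecutive replay attempt) forces Link re-training instead of a replay.
    """
    tx.replay_num = (getattr(tx, "replay_num", 0) + 1) % 4
    if tx.replay_num == 0:
        physical_layer.retrain_link()            # hypothetical Physical Layer hook
        return
    # TLPs with equal or earlier Sequence Numbers are implicitly acknowledged
    tx.handle_ack(acknak_seq_num)
    # Re-transmit the remaining (later) TLPs in order; they stay buffered until ACKed
    for seq, tlp in tx.replay_buffer.items():
        physical_layer.transmit(seq, tlp)
```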

Receiver Side

Non-Error Case

TLPs are received at the Physical Layer where they are checked for framing errors and other receiver-related errors. Assume that there are no errors. If the Physical Layer reports the end symbol was EDB and the CRC value was inverted, this is not an error condition; discard the packet and free any allocated space (see "Switch Cut-Through Mode" on page 248). There will be no ACK or NAK DLLP returned for this case.
The sequence of steps performed are as follows:
  • Calculate the CRC for the incoming TLP and check it against the LCRC provided with the packet. If the CRC passes, go to the next step.
  • Compare the Sequence Number for the inbound packet against the current value in the NEXT_RCV_SEQ count.
  • If they are the same, this is the next expected TLP. Forward the TLP to the Transaction Layer. Also increment the NEXT_RCV_SEQ count.
  • Clear the NAK_SCHEDULED flag if set.
  • If the ACKNAK_LATENCY_TIMER expires, schedule an ACK DLLP with AckNak_Seq_Num[11:0] = NEXT_RCV_SEQ count - 1.

Error Case

TLPs are received at the Physical Layer where they are checked for framing errors and other receiver-related errors. In the event of an error, the Physical Layer discards the packet, reports the error, and frees any storage allocated for the TLP. If the packet ends with an EDB symbol but the LCRC is not inverted, this is a bad packet: discard the TLP and set the error flag. If the NAK_SCHEDULED flag is clear, set it and schedule a NAK DLLP with the NEXT_RCV_SEQ count - 1 value used as the Sequence Number.
  • If there are no Physical Layer errors detected, forward the TLP to the Data Link Layer.
  • Calculate the CRC for the incoming TLP and check it against the LCRC provided with the packet. If the CRC fails, set the NAK_SCHEDULED flag. Schedule a NAK DLLP with NEXT_RCV_SEQ count - 1 used as the Sequence Number. If LCRC error check passes, go to the next bullet.
  • If the LCRC check passes, then compare the Sequence Number for the inbound packet against the current value in the NEXT_RCV_SEQ count. If the TLP Sequence Number is not equal to NEXT_RCV_SEQ count and if (NEXT_RCV_SEQ - TLP Sequence Number) mod 4096 <= 2048, the TLP is a duplicate TLP. Discard the TLP, and schedule an ACK with the NEXT_RCV_SEQ count - 1 value used as AckNak_Seq_Num[11:0].
  • Discard TLPs received with a Sequence Number other than those described in the bullets above (i.e., TLPs with later Sequence Numbers); a sketch of this receive-side handling follows the list. If the NAK_SCHEDULED flag is clear, set it and schedule a NAK DLLP with NEXT_RCV_SEQ count - 1 used as AckNak_Seq_Num[11:0]. If the NAK_SCHEDULED flag is already set, keep it set and do not schedule a NAK DLLP.
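The receive-side checks summarized above can be sketched as follows (hypothetical class and callback names; LCRC checking and the Physical Layer are abstracted into the lcrc_ok flag and the forward/schedule_ack/schedule_nak callbacks):

```python
class DataLinkReceiver:
    """Minimal model of the receive-side ACK/NAK bookkeeping."""

    def __init__(self):
        self.next_rcv_seq = 0
        self.nak_scheduled = False

    def receive_tlp(self, seq: int, lcrc_ok: bool, forward, schedule_ack, schedule_nak):
        acknak = (self.next_rcv_seq - 1) % 4096      # value used in ACK/NAK DLLPs
        if not lcrc_ok:
            if not self.nak_scheduled:               # only one outstanding NAK
                self.nak_scheduled = True
                schedule_nak(acknak)
            return                                   # TLP is discarded
        if seq == self.next_rcv_seq:                 # expected (or replayed) TLP
            forward(seq)                             # hand off to Transaction Layer
            self.next_rcv_seq = (seq + 1) % 4096
            self.nak_scheduled = False
        elif (self.next_rcv_seq - seq) % 4096 <= 2048:
            schedule_ack(acknak)                     # duplicate: discard, re-ACK
        else:                                        # later seq: one or more TLPs lost
            if not self.nak_scheduled:
                self.nak_scheduled = True
                schedule_nak(acknak)
```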

Recommended Priority To Schedule Packets

A device may have many types of TLPs, DLLPs and PLPs to transmit on a given Link. The following is a recommended but not required set of priorities for scheduling packets:
  1. Completion of any TLP or DLLP currently in progress (highest priority).
  1. PLP transmissions.
  1. NAK DLLP.
  1. ACK DLLP.
  1. FC (Flow Control) DLLP.
  1. Replay Buffer re-transmissions.
  1. TLPs that are waiting in the Transaction Layer.
  1. All other DLLP transmissions (lowest priority); see the sketch below.
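One simple way to express this recommended ordering is as a priority table that is scanned from top to bottom; the sketch below is illustrative only, and the category names are invented for the example:

```python
from typing import Optional

# Recommended (not required) transmit scheduling priority, highest first.
SCHEDULING_PRIORITY = [
    "packet_in_progress",     # finish any TLP or DLLP already being transmitted
    "plp",                    # Physical Layer Packet transmissions
    "nak_dllp",
    "ack_dllp",
    "fc_dllp",                # Flow Control DLLPs
    "replay_tlp",             # Replay Buffer re-transmissions
    "transaction_layer_tlp",  # TLPs waiting in the Transaction Layer
    "other_dllp",             # all other DLLP transmissions
]

def next_packet_type(pending: set) -> Optional[str]:
    """Return the highest-priority packet type that currently has work pending."""
    return next((kind for kind in SCHEDULING_PRIORITY if kind in pending), None)

print(next_packet_type({"ack_dllp", "transaction_layer_tlp"}))  # -> ack_dllp
```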

Some More Examples

To demonstrate the reliable TLP delivery capability provided by the ACK/NAK Protocol, the following examples are provided.

Lost TLP

Consider Figure 5-14 on page 245 which shows the ACK/NAK protocol for handling lost TLPs.
  1. Device A transmits TLPs 4094, 4095, 0, 1, and 2.
  1. Device B receives TLPs 4094, 4095, and 0, for which it returns ACK 0 . These TLPs are forwarded to the Transaction Layer. NEXT_RCV_SEQ is incremented and the next value of NEXT_RCV_SEQ count is 1. Device B is ready to receive TLP 1.
  1. Seeing ACK 0, Device A purges TLPs 4094, 4095, and 0 from its replay buffer.
  1. TLP 1 is lost en route.
  1. TLP 2 arrives instead. Upon performing a Sequence Number check, Device B realizes that TLP 2's Sequence Number is greater than NEXT_RCV_SEQ count.
  1. Device B discards TLP 2 and schedules NAK 0 (NEXT_RCV_SEQ count - 1).
  1. Upon receipt of NAK 0, Device A replays TLPs 1 and 2.
  1. TLPs 1 and 2 arrive without error at Device B and are forwarded to the Transaction Layer.
Figure 5-14: Lost TLP Handling

Lost ACK DLLP or ACK DLLP with CRC Error

Consider Figure 5-15 on page 246 which shows the ACK/NAK protocol for handling a lost ACK DLLP.
  1. Device A transmits TLPs 4094, 4095, 0, 1, and 2.
  1. Device B receives TLPs 4094, 4095, and 0, for which it returns ACK 0 . These TLPs are forwarded to the Transaction Layer. NEXT_RCV_SEQ is incremented and the next value of NEXT_RCV_SEQ count is set to 1.
  1. ACK 0 is lost en route. TLPs 4094, 4095, and 0 remain in Device A's Replay Buffer.
  1. TLPs 1 and 2 arrive at Device B shortly thereafter. NEXT_RCV_SEQ count increments to 3 .
  1. Device B returns ACK 2 and sends TLPs 1 and 2 to the Transaction Layer.
  1. ACK 2 arrives at Device A.
  1. Device A purges its Replay Buffer of TLPs 4094, 4095, 0, 1, and 2.


The example would be the same if a CRC error existed in ACK packet 0 . Device A would detect the CRC error in ACK 0 and discard it. When received later, ACK 2 would cause the Replay Buffer to purge all TLPs (4094 through 2).
If ACK 2 is also lost or corrupted, and no further ACK or NAK DLLPs are returned to Device A, its REPLAY_TIMER will expire. This results in replay of its entire buffer. Device B receives TLPs 4094, 4095, 0, 1, and 2 and detects them as duplicate TLPs because their Sequence Numbers are earlier than the NEXT_RCV_SEQ count of 3. These TLPs are discarded and ACK DLLPs with AckNak_Seq_Num[11:0] = 2 are returned to Device A for each duplicate TLP.
Figure 5-15: Lost ACK DLLP Handling

Lost ACK DLLP followed by NAK DLLP

Consider Figure 5-16 on page 247 which shows the ACK/NAK protocol for handling a lost ACK DLLP followed by a valid NAK DLLP.
  1. Device A transmits TLPs 4094, 4095, 0, 1, and 2.
  1. Device B receives TLPs 4094, 4095, and 0, for which it returns ACK 0 . These TLPs are forwarded to the Transaction Layer. NEXT_RCV_SEQ is incremented and the next value of NEXT_RCV_SEQ count is 1.
  1. ACK 0 is lost en route. TLPs 4094, 4095, and 0 remain in Device A's Replay Buffer.
  1. TLPs 1 and 2 arrive at Device B shortly thereafter. TLP 1 is good and NEXT_RCV_SEQ count increments to 2. TLP 1 is forwarded to the Transaction Layer.
  1. TLP 2 is corrupt. NEXT_RCV_SEQ count remains at 2.
  1. Device B returns a NAK with a Sequence Number of 1 and discards TLP 2.
  1. NAK 1 arrives at Device A.
  1. Device A first purges TLPs 4094, 4095, 0, and 1.
  1. Device A then replays TLP 2.
  1. TLP 2 arrives at Device B. The NEXT_RCV_SEQ count is 2.
  1. Device B accepts good TLP 2 and forwards it to the Transaction Layer. NEXT_RCV_SEQ increments to 3.
  1. Device B may return an ACK with a Sequence Number of 2 if the ACKNAK_LATENCY_TIMER expires.
  1. Upon receipt of ACK 2, Device A purges TLP 2.
Figure 5-16: Lost ACK DLLP Handling

Switch Cut-Through Mode

PCI Express supports a switch-related feature that allows TLP transfer latency through a switch to be significantly reduced. This feature is referred to as the 'cut-through' mode. Without this feature, the propagation time through a switch could be significant.

Without Cut-Through Mode

Background

Consider an example where a large TLP needs to pass through a switch from one port to another. Until the tail end of the TLP is received by the switch's ingress port, the switch is unable to determine if there is a CRC error. Typically, the switch will not forward the packet through the egress port until it determines that there is no CRC error. This implies that the latency through the switch is at least the time to clock the packet into the switch. If the packet needs to pass through many switches to get to the final destination, the latencies would add up, increasing the time to get from source to destination.

Possible Solution

One option to reduce latency would be to start forwarding the TLP through the switch's egress port before the tail end of the TLP has been received by the switch's ingress port. This is fine as long as the packet is not corrupted. Consider what would happen if the TLP were corrupt. The packet would begin transmitting through the egress port before the switch realized that there is an error. After the switch detects the CRC error, it returns a NAK to the TLP source and discards the packet, but part of the packet has already been transmitted and its transmission cannot be cleanly aborted in mid-transmit. There is no point in keeping a copy of the bad TLP in the egress port Replay Buffer because it is bad; the TLP source port will replay the TLP at a later time, after receiving the NAK DLLP. Meanwhile, the corrupted TLP is already outbound and en route to the Endpoint destination. The Endpoint receives the packet, detects a CRC error, and returns a NAK to the switch. The switch is expected to replay the TLP, but the switch has already discarded the TLP due to the error detected on the inbound TLP. The switch is stuck between a rock and a hard place!

Switch Cut-Through Mode

Background

The PCI Express protocol permits the implementation of an optional feature referred to as cut-through mode. Cut-through is the ability to start streaming a packet through a switch without waiting for the receipt of the tail end of the packet. If, ultimately, a CRC error is detected when the CRC is received at the tail end of the packet, the packet that has already begun transmission from the switch egress port can be 'nullified'.
A nullified packet is a packet that terminates with an EDB symbol as opposed to an END. It also has an inverted 32-bit LCRC.

Example That Demonstrates Switch Cut-Through Feature

Consider the example in Figure 5-17 that illustrates the cut-through mode of a switch.
A TLP with large data payload passes from the left, through the switch, to the Endpoint on the right. The steps as the packet is routed through the switch are as follows:
  1. A TLP is inbound to a switch. While en route, the packet's contents are corrupted.
  1. The TLP header at the head of the TLP is decoded by the switch and the packet is forwarded to the egress port before the switch becomes aware of a CRC error. Finally, the tail end of the packet arrives in the switch ingress port and it is able to complete a CRC check.
  1. The switch detects a CRC error for which the switch returns a NAK DLLP to the TLP source.
  1. On the egress port, the switch replaces the END framing symbol at the tail end of the bad TLP with the EDB (End Bad) symbol. The CRC is also inverted from what it would normally be. The TLP is now 'nullified'. Once the TLP has exited the switch, the switch discards its copy from the Replay Buffer.
  1. The nullified packet arrives at the Endpoint. The Endpoint detects the EDB symbol and the inverted CRC and discards the packet.
  1. The Endpoint does not return a NAK DLLP (otherwise the switch would be obliged to replay).
When the TLP source device receives the NAK DLLP, it replays the packet. This time no error occurs on the switch's ingress port. As the packet arrives in the switch, the header is decoded and the TLP is forwarded to the egress port with very short latency. When the tail end of the TLP arrives at the switch, a CRC check is performed. There is no error, so an ACK is returned to the TLP source which then purges its replay buffer. The switch stores a copy of the TLP in its egress port Replay Buffer. When the TLP reaches the destination Endpoint, the Endpoint device performs a CRC check. The packet is a good packet terminated with the END framing symbol. There are no CRC errors and so the Endpoint returns an ACK DLLP to the switch. The switch purges the copy of the TLP from its Replay Buffer. The packet has been routed from source to destination with minimal latency.
Figure 5-17: Switch Cut-Through Mode Showing Error Handling

QoS/TCs/VCs and Arbitration

The Previous Chapter

The previous chapter detailed the ACK/NAK protocol that verifies the delivery of TLPs between the ports of each Link as they travel between the requester and completer devices, including the hardware retry mechanism that is automatically triggered when a TLP transmission error is detected on a given Link.

This Chapter

This chapter discusses Traffic Classes, Virtual Channels, and Arbitration that support Quality of Service concepts in PCI Express implementations. The concept of Quality of Service in the context of PCI Express is an attempt to predict the bandwidth and latency associated with the flow of different transaction streams traversing the PCI Express fabric. The use of QoS is based on application-specific software assigning Traffic Class (TC) values to transactions, which define the priority of each transaction as it travels between the Requester and Completer devices. Each TC is mapped to a Virtual Channel (VC) that is used to manage transaction priority via two arbitration schemes called port and VC arbitration.

The Next Chapter

The next chapter discusses the purposes and detailed operation of the Flow Control Protocol. This protocol requires each device to implement credit-based link flow control for each virtual channel on each port. Flow control guarantees that transmitters will never send Transaction Layer Packets (TLPs) that the receiver can't accept. This prevents receive buffer over-runs and eliminates the need for inefficient disconnects, retries, and wait-states on the link. Flow Control also helps enable compliance with PCI Express ordering rules by maintaining separate virtual channel Flow Control buffers for three types of transactions: Posted (P), Non-Posted (NP) and Completions (Cpl).

Quality of Service

Quality of Service (QoS) is a generic term that normally refers to the ability of a network or other entity (in our case, PCI Express) to provide predictable latency and bandwidth. QoS is of particular interest when applications require guaranteed bus bandwidth at regular intervals, such as audio data. To help deal with this type of requirement PCI Express defines isochronous transactions that require a high degree of QoS. However, QoS can apply to any transaction or series of transactions that must traverse the PCI Express fabric. Note that QoS can only be supported when the system and device-specific software is PCI Express aware.
QoS can involve many elements of performance including:
  • Transmission rate
  • Effective Bandwidth
  • Latency
  • Error rate
  • Other parameters that affect performance
Several features of PCI Express architecture provide the mechanisms that make QoS achievable. The PCI Express features that support QoS include:
  • Traffic Classes (TCs)
  • Virtual Channels (VCs)
  • Port Arbitration
  • Virtual Channel Arbitration
  • Link Flow Control
PCI Express uses these features to support two general classes of transactions that can benefit from the PCI Express implementation of QoS.
Isochronous Transactions - from Iso (same) + chronous (time), these transactions require a constant bus bandwidth at regular intervals along with guaranteed latency. Isochronous transactions are most often used when a synchronous connection is required between two devices. For example, a CD-ROM drive containing a music CD may be sourcing data to speakers. A synchronous connection exists when a headset is plugged directly into the drive. However, when the audio card is used to deliver the audio information to a set of external speakers, isochronous transactions may be used to simplify the delivery of the data.
Asynchronous Transactions - This class of transactions involves a wide variety of applications that have widely varying requirements for bandwidth and latency. QoS can provide the more demanding applications (those requiring higher bandwidth and shorter latencies) with higher priority than the less demanding applications. In this way, software can establish a hierarchy of traffic classes for transactions that permits differentiation of transaction priority based on their requirements. The specification refers to this capability as differentiated services.

Isochronous Transaction Support

PCI Express supports QoS and the associated TC, VC, and arbitration mechanisms so that isochronous transactions can be performed. A classic example of a device that benefits from isochronous transaction support is a video camera attached to a tape deck. This real-time application requires that image and audio data be transferred at a constant rate (e.g., 64 frames/second). This type of application is typically supported via a direct synchronous attachment between the two devices.

Synchronous Versus Isochronous Transactions

Two devices connected directly perform synchronous transfers. A synchronous source delivers data directly to the synchronous sink through use of a common reference clock. In our example, the video camera (synchronous source) sends audio and video data to the tape deck (synchronous sink), which immediately stores the data in real time with little or no data buffering, and with only a slight delay due to signal propagation.
When these devices are connected via PCI Express a synchronous connection is not possible. Instead, PCI Express emulates synchronous connections through the use of isochronous transactions and data buffering. In this scenario, isochronous transactions can be used to ensure that a constant amount of data is delivered at specified intervals (100 μs in this example), thus achieving the required transmission characteristics. Consider the following sequence (refer to Figure 6-1 on page 254):
  1. The synchronous source (video camera and PCI Express interface) accumulates data in Buffer A during service interval 1 (SI 1).
  1. The camera delivers the accumulated data to the synchronous sink (tape deck) sometime during the next service interval (SI 2). The camera also accumulates the next block of data in Buffer B as the contents of Buffer A is delivered.


  1. The tape deck buffers the incoming data (in its Buffer A), which can then be delivered synchronously for recording on tape during service interval 3. During SI 3 the camera once again accumulates data into Buffer A, and the cycle repeats.
Figure 6-1: Example Application of Isochronous Transaction

Isochronous Transaction Management

Management of an isochronous communications channel is based on a Traffic Class (TC) value and an associated Virtual Channel (VC) number that software assigns during initialization. Hardware components, including the Requester of a transaction and all devices in the path between the requester and completer, are configured to transport the isochronous transactions from link to link via a high-priority virtual channel.
The requester initiates isochronous transactions that include a TC value representing the desired QoS. The Requester injects isochronous packets into the fabric at the required rate (service interval), and all devices in the path between the Requester and Completer must be configured to support the transport of the isochronous transactions at the specified interval. Any intermediate device along the path must convert the TC to the associated VC used to control transaction arbitration. This arbitration results in the desired bandwidth and latency for transactions with the assigned TC. Note that the TC value remains constant for a given transaction while the VC number may change from link to link.

Differentiated Services

Various types of asynchronous traffic (all traffic other than isochronous) have different priorities from the system perspective. For example, Ethernet traffic requires higher priority (smaller latencies) than mass storage transactions. PCI Express software can establish different TC values and associated virtual channels and can set up the communications paths to ensure different delivery policies are established as required. Note that the specification does not define specific methods for identifying delivery requirements or the policies to be used when setting up differentiated services.

Perspective on QOS/TC/VC and Arbitration

PCI does not include any QoS-related features similar to those defined by PCI Express. Many questions arise regarding the need for such an elaborate scheme for managing traffic flow based on QoS and differentiated services. Even without these new features, the bandwidth available in a PCI Express system is far greater and latencies are much shorter than in PCI-based implementations, due primarily to the topology and higher delivery rates. Consequently, aside from the possible advantage of isochronous transactions, there appears to be little advantage to implementing systems that support multiple Traffic Classes and Virtual Channels.
While this may be true for most desktop PCs, other high-end applications may benefit significantly from these new features. The PCI Express specification also opens the door to applications that demand the ability to differentiate and manage system traffic based on Traffic Class prioritization.

Traffic Classes and Virtual Channels

During initialization a PCI Express device-driver communicates the levels of QoS that it desires for its transactions, and the operating system returns TC values that correspond to the QoS requested. The TC value ultimately determines the relative priority of a given transaction as it traverses the PCI Express fabric. Two hardware mechanisms provide guaranteed isochronous bandwidth and differentiated services:

  • Virtual Channel Arbitration
  • Port Arbitration

These arbitration mechanisms use VC numbers to manage transaction priority. System configuration software must assign VC IDs and set up the association between the traffic class assigned to a transaction and the virtual channel to be used when traversing each link. This is done via VC configuration registers mapped within the extended configuration address space. The list of these registers and their location within configuration space is illustrated in Figure 6-2.


Figure 6-2: VC Configuration Registers Mapped in Extended Configuration Address Space
The TC value is carried in the transaction packet header and can contain one of eight values (TC0-TC7). TC0 must be implemented by all PCI Express devices and the system makes a "best effort" when delivering transactions with the TC0 label. TC values of TC1-TC7 are optional and provide seven levels of arbitration for differentiating between packet streams that require varying amounts of bandwidth. Similarly, eight VC numbers (VC0-VC7) are specified, with VC0 required and VC1-VC7 optional. ("VC Assignment and TC Mapping" on page 258 discusses VC initialization).
Note that TC0 is hardwired to VC0 in all devices. If configuration software is not PCI Express aware all transactions will use the default TC0 and VC0; thereby eliminating the possibility of supporting differentiated services and isochronous transactions. Furthermore, the specification requires some transaction types to use TC0/VC0 exclusively:

  • Configuration
  • I/O


  • INTx Message
  • Power Management Message
  • Error Signaling Message
  • Unlock Message
  • Set_Slot_Power_Limit Message

VC Assignment and TC Mapping

Configuration software designed for PCI Express sets up virtual channels for each link in the fabric. Recall that the default TC and VC assignments following Cold Reset will be TC0 and VC0, which is used when the configuration software is not PCI Express aware. The number of virtual channels used depends on the greatest capability shared by the two devices attached to a given link. Software assigns an ID for each VC and maps one or more TCs to each.

Determining the Number of VCs to be Used

Software checks the number of VCs supported by the devices attached to a common link and assigns the greatest number of VCs that both devices have in common. For example, consider the three devices attached to the switch in Figure 6-3 on page 259. In this example, the switch supports all 8 VCs on each of its ports, while Device A supports only the default VC, Device B supports 4 VCs, and Device C supports 8 VCs. When configuring VCs for each link, software determines the maximum number of VCs supported by both devices at each end of the link and assigns that number to both devices. The VC assignment applies to transactions flowing across a link in both directions.
Figure 6-3: The Number of VCs Supported by Device Can Vary
Note that even though switch port A supports all 8 VCs, Device A supports a single VC, leaving 7 VCs unused within switch port A. Similarly, only 4 VCs are used by switch port B. Software, of course, configures and enables all 8 VCs within switch port C.
Configuration software determines the maximum number of VCs supported by each port interface by reading its Extended VC Count field contained within the "Virtual Channel Capability" registers. The smaller of the two values governs the maximum number of VCs supported by this link for both transmission and reception of transactions. Figure 6-4 on page 260 illustrates the location and format of the Extended VC Count field. Software may restrict the number of VCs configured and enabled to fewer than actually allowed. This may be done to achieve the QoS desired for a given platform or application.
Figure 6-4: Extended VCs Supported Field

Assigning VC Numbers (IDs)

Configuration software must assign VC numbers or IDs to each of the virtual channels, except VC0 which is always hardwired. As illustrated in Figure 6-5 on page 261, the VC Capabilities registers include 3 DWs used for configuring each VC. The first set of registers (starting at offset 10h) always applies to VC0. The Extended VCs Count field (described above) defines the number of additional VC register sets implemented by this port, each of which permits configuration of an additional VC. Note that these register sets are mapped in configuration space directly following the VC0 registers. The mapping is expressed as an offset from each of the three VC0 DW registers:

  • 10h + (n * 0Ch)
  • 14h + (n * 0Ch)
  • 18h + (n * 0Ch)

The value " n " represents the number of additional VCs implemented. For example,if the Extended VCs Count contains a value of 3,then n=1,2 ,and 3 for the three additional register sets. Note that these numbers simply identify the register sets for each VC supported and is not the VC ID.
Software assigns a VC ID for each of the additional VCs being used via the VC ID field within the VCn Resource Control Register. (See Figure 6-5) These IDs are not required to be assigned contiguous values, but the same VC value can be used only once.
Figure 6-5: VC Resource Control Register


Assigning TCs to Each VC (TC/VC Mapping)

The Traffic Class value assigned by a requester to each transaction must be associated with a VC as it traverses each link on its journey to the recipient. Also, the VC ID associated with a given TC may change from link to link. Configuration software establishes this association during initialization via the TC/VC Map field of the VC Resource Control Register. This 8-bit field permits any TC value to be mapped to the selected VC, where each bit position represents the corresponding TC value (i.e., bit 0 = TC0 :: bit 7 = TC7). Setting a bit assigns the corresponding TC value to the VC ID. Figure 6-6 shows a mapping example where TC0 and TC1 are mapped to VC0 and TC2::TC4 are mapped to VC3.
Figure 6-6: TC to VC Mapping Example
Software is permitted a great deal of flexibility in assigning VC IDs and mapping the associated TCs. However, the specification states several rules associated with the TC/VC mapping (a sketch of the mapping check follows this list):
  • TC/VC mapping must be identical for the two ports attached to the same link.
  • One TC must not be mapped to multiple VCs in any PCI Express Port.
  • One or multiple TCs can be mapped to a single VC.
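A small sketch of how the 8-bit TC/VC Map field might be interpreted, and of the rule that a given TC may be mapped to only one VC per port (helper names are hypothetical; the example values reproduce the Figure 6-6 mapping):

```python
def tcs_in_map(tc_vc_map: int) -> set:
    """Return the TC values selected by an 8-bit TC/VC Map field (bit n = TCn)."""
    return {tc for tc in range(8) if tc_vc_map & (1 << tc)}

def check_port_mapping(vc_maps: dict) -> None:
    """vc_maps: VC ID -> TC/VC Map field value for every enabled VC on a port."""
    seen = {}
    for vc_id, tc_map in vc_maps.items():
        for tc in tcs_in_map(tc_map):
            if tc in seen:
                raise ValueError(f"TC{tc} mapped to both VC{seen[tc]} and VC{vc_id}")
            seen[tc] = vc_id

# Figure 6-6 example: TC0/TC1 -> VC0 (map 0000_0011b), TC2..TC4 -> VC3 (0001_1100b)
check_port_mapping({0: 0b0000_0011, 3: 0b0001_1100})   # passes
# check_port_mapping({1: 0b0000_0110, 2: 0b0000_0100}) # would raise: TC2 on two VCs
```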
Table 6-1 on page 263 lists a variety of combinations that may be implemented. This is intended only to illustrate a few combinations, and many more are possible.
Table 6-1: Example TC to VC Mappings
| TC | VC Assignment | Comment |
|---|---|---|
| TC0 | VC0 | Default setting, used by all transactions. |
| TC0-TC1 | VC0 | VCs are not required to be assigned consecutively. |
| TC2-TC7 | VC7 | Multiple TCs can be assigned to a single VC. |
| TC0 | VC0 | Several transaction types must use TC0/VC0. (1) TCs are not required to be assigned consecutively. Some TC/VC combinations can be used to support an isochronous connection. |
| TC1 | VC1 | |
| TC6 | VC6 | |
| TC7 | VC7 | |
| TC0 | VC0 | All TCs can be assigned to the corresponding VC numbers. |
| TC1 | VC1 | |
| TC2 | VC2 | |
| TC3 | VC3 | |
| TC4 | VC4 | |
| TC5 | VC5 | |
| TC6 | VC6 | |
| TC7 | VC7 | |
| TC0 | VC0 | The VC number that is assigned need not match one of the corresponding TC numbers. |
| TC1-TC4 | VC6 | |
| TC0 | VC0 | Illegal. A TC number can be assigned to only one VC number. This example shows TC2 mapped to both VC1 and VC2, which is not allowed. |
| TC1-TC2 | VC1 | |
| TC2 | VC2 | |

Arbitration

Two types of transaction arbitration provide the method for managing isochronous transactions and differentiated services:
  • Virtual Channel (VC) Arbitration - determines the priority of transactions being transmitted from the same port, based on their VC ID.
  • Port Arbitration - determines the priority of transactions with the same VC assignment at the egress port, based on the priority of the port at which the transactions arrived. Port arbitration applies to transactions that have the same VC ID at the egress port; therefore, a port arbitration mechanism exists for each virtual channel supported by the egress port.
Arbitration is also affected by the requirements associated with transaction ordering and flow control. These additional requirements are discussed in subsequent chapters, but are mentioned in the context of arbitration as required in the following discussions.

Virtual Channel Arbitration

In addition to supporting QoS objectives, VC arbitration should also ensure that forward progress is made for all transactions. This prevents inadvertent split transaction time-outs. Any device that both initiates transactions and supports two or more VCs must implement VC arbitration. Furthermore, other device types that support more than one VC (e.g., switches) must also support VC arbitration.
VC arbitration allows a transmitting device to determine the priority of transactions based on their VC assignment. Key characteristics of VCs that are relevant to VC arbitration include:
  • Each VC supported and enabled provides its own buffers and flow control.
  • Transactions mapped to the same VC are issued in strict order (unless the "Relaxed Ordering" attribute bit is set).
  • No ordering relationship exists between transactions assigned to different VCs.
Figure 6-7 on page 265 illustrates the concept of VC arbitration. In this example two VCs are implemented (VC0 and VC1) and transmission priority is based on a 3:1 ratio, where 3 VC1 transactions are sent to each VC0 transaction. The device core issues transactions (that include a TC value) to the TC/VC Mapping logic. Based on the associated VC value, the transaction is routed to the appropriate VC buffer where it awaits transmission. The VC arbiter determines the VC buffer priority when sending transactions.
This example illustrates the flow of transactions in only one direction. The same logic exists for transmitting transactions simultaneously in the opposite direction. That is, the root port also contains transmit buffers and an arbiter, and the endpoint device contains receive buffers.
Figure 6-7: Conceptual VC Arbitration Example
A variety of VC arbitration mechanisms may be employed by a given design. The method chosen by the designer is specified within the VC capability registers. In general, there are three approaches that can be taken:
  • Strict Priority Arbitration for all VCs
  • Split Priority Arbitration - VCs are segmented into low- and high-priority groups. The low-priority group uses some form of round robin arbitration and the high-priority group uses strict priority.
  • Round robin priority (standard or weighted) arbitration for all VCs

Strict Priority VC Arbitration

The specification defines a default priority scheme based on the inherent priority of VC IDs (VC0 = lowest priority and VC7 = highest priority). The arbitration mechanism is hardware based and requires no configuration. Figure 6-8 illustrates a strict priority arbitration example that includes all VCs. The VC ID governs the order in which transactions are sent. The maximum number of VCs that use strict priority arbitration cannot be greater than the value in the Extended VC Count field. (See Figure 6-4 on page 260.) Furthermore, if the designer has chosen strict priority arbitration for all VCs supported, the Low Priority Extended VC Count field of Port VC Capability Register 1 is hardwired to zero. (See Figure 6-9 on page 267.)
Figure 6-8: Strict Arbitration Priority
| VC Resources | VC ID | Priority Order |
|---|---|---|
| 8th VC | VC7 | Highest |
| 7th VC | VC6 | |
| 6th VC | VC5 | |
| 5th VC | VC4 | |
| 4th VC | VC3 | |
| 3rd VC | VC2 | |
| 2nd VC | VC1 | |
| 1st VC | VC0 | Lowest |
Strict priority requires that VCs of higher priority get precedence over lower priority VCs based on the VC ID. For example, if all eight VCs are governed by strict priority, transactions with a VC ID of VC0 can only be sent when no transactions are pending transmission in VC1-VC7. In some circumstances strict priority can result in lower priority transactions being starved for bandwidth and experiencing extremely long latencies. Conversely, the highest priority transactions receive very high bandwidth with minimal latencies. The specification requires that high priority traffic be regulated to avoid starvation, and further defines two methods of regulation:
  • The originating port can manage the injection rate of high priority transactions, to permit greater bandwidth for lower priority transactions.
  • Switches can regulate multiple data flows at the egress port that are vying for link bandwidth. This method may limit the throughput from high bandwidth applications and devices that attempt to exceed the limitations of the available bandwidth.
The designer of a device may also limit the number of VCs that participate in strict priority by specifying a split between the low- and high-priority VCs as discussed in the next section.

Low- and High-Priority VC Arbitration

Figure 6-9 on page 267 illustrates the Low Priority Extended VC Count field within VC Capability Register 1. This read-only field specifies a VC ID value that identifies the upper limit of the low-priority arbitration group for the design. For example, if this count contains a value of 4, then VC0-VC4 are members of the low-priority group and VC5-VC7 use strict priority. Note that a Low Priority Extended VC Count of 7 means that no strict priority is used.
Figure 6-9: Low Priority Extended VC Count
As depicted in Figure 6-11 on page 269, the high-priority VCs continue to use strict priority arbitration, while the low-priority arbitration group uses one of the other prioritization methods supported by the device. VC Capability Register 2 reports which alternate arbitration methods are supported for the low-priority group, and the VC Control Register permits selection of the method to be used by this group. See Figure 6-10 on page 268. The low-priority arbitration schemes include:
  • Hardware Based Fixed Arbitration Scheme - the specification permits the vendor to define a hardware-based fixed arbitration scheme that provides all VCs with the same priority. (e.g. round robin).
  • Weighted Round Robin (WRR) - with WRR some VCs can be given higher priority than others because they have more positions within the round robin than others. The specification defines three WRR configurations, each with a different number of entries (or phases).
Figure 6-10: Determining VC Arbitration Capabilities and Selecting the Scheme
Figure 6-11: VC Arbitration with Low- and High-Priority Implementations
Hardware Fixed Arbitration Scheme. This selection defines a hardware-based VC arbitration scheme that requires no additional software setup. The specification mentions standard Round Robin arbitration as an example scheme that the designer may choose. In such a scheme, transactions pending transmission within each low-priority VC are sent during each pass through the round robin. The specification does not preclude other implementation-specific schemes.
Weighted Round Robin Arbitration Scheme. The weighted round robin (WRR) approach permits software to configure the VC Arbitration Table. The number of arbitration table entries supported by the design is reported in the VC Arbitration Capability field of Port VC Capability Register 2. The table size is selected by writing the corresponding value into the VC Arbitration Select field of the Port VC Control Register. See Figure 6-10 on page 268. Each entry in the table represents one phase that software loads with a low-priority VC ID value. The VC arbiter repeatedly scans all table entries in a sequential fashion and sends transactions from the VC buffer specified in the table entries. Once a transaction has been sent, the arbiter immediately proceeds to the next phase.
Software can set up the VC arbitration table such that some VCs are listed in more entries than others; thereby, allowing differentiation of QoS between the VCs. This gives software considerable flexibility in establishing the desired priority. Figure 6-12 on page 270 depicts the weighted round robin VC arbitration concept.
Figure 6-12: Weighted Round Robin Low-Priority VC Arbitration Table Example
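A minimal sketch of the weighted round robin scan just described (hypothetical names; the table contents are simply whatever VC IDs software loaded into the VC Arbitration Table):

```python
from collections import deque

def wrr_vc_arbiter(arbitration_table: list, vc_buffers: dict):
    """Yield TLPs from low-priority VC buffers in VC Arbitration Table order.

    arbitration_table: list of VC IDs, one per phase (as loaded by software).
    vc_buffers: VC ID -> queue of TLPs awaiting transmission.
    """
    while any(vc_buffers.values()):
        for vc_id in arbitration_table:          # scan phases sequentially
            queue = vc_buffers.get(vc_id)
            if queue:                            # empty phases are skipped immediately
                yield vc_id, queue.popleft()

# Example: VC1 appears in three phases for every one VC0 phase (3:1 weighting)
buffers = {0: deque(["A", "B"]), 1: deque(["w", "x", "y", "z"])}
for vc, tlp in wrr_vc_arbiter([1, 1, 1, 0], buffers):
    print(f"send {tlp} from VC{vc}")
```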

Round Robin Arbitration (Equal or Weighted) for All VCs

The hardware designer may choose to implement one of the round robin forms of VC arbitration for all VCs. This is accomplished by specifying the highest VC number supported by the device as a member of the low-priority group (via the Low Priority Extended VC Count field). In this case, all VC priorities are managed via the VC arbitration table. Note that the VC arbitration table is not used when the Hardware Fixed Round Robin scheme is selected. See page 269.

Loading the Virtual Channel Arbitration Table

The VC Arbitration Table (VAT) is located at an offset from the beginning of the extended configuration space as indicated by the VC Arbitration Table Offset field. This offset is contained within Port VC Capability Register 2. (See Figure 6-13 on page 271.)
Figure 6-13: VC Arbitration Table Offset and Load VC Arbitration Table Fields
Refer to Figure 6-14 on page 272 during the following discussion. Each entry within the VAT is a 4-bit field that identifies the VC ID of the virtual channel buffer that is scheduled to deliver data during this corresponding phase. The table length is a function of the hardware design and the arbitration scheme selected if choices are supported by the design as illustrated in Figure 6-10 on page 268.


The table is loaded by configuration software to achieve the priority order desired for the virtual channels. Hardware sets the VC Arbitration Table Status bit when software updates any entry within the table. Once the table is loaded, software sets the Load VC Arbitration Table bit within the Port VC Control register. This bit causes hardware to load the new values into the VC Arbiter. Hardware clears the VC Arbitration Table Status bit when table loading is complete; thereby, permitting software to verify successful loading.
Figure 6-14: Loading the VC Arbitration Table Entries

VC Arbitration within Multiple Function Endpoints

The specification does not state how an endpoint should manage the arbitration of data flows from different functions within an endpoint. However it does state that "Multi-function Endpoints... should support PCI Express VC-based arbitration control mechanisms if multiple VCs are implemented for the PCI Express Link." VC arbitration when there are multiple functions raises interesting questions about the approach to be taken. Of course when the device functions support only VC0, no VC arbitration is necessary. The specification leaves the approach open to the designer.
Figure 6-15 on page 274 shows a functional block diagram of an example implementation in which two functions are implemented within an endpoint device, each of which supports two VCs. The example approach is based upon the goal of using a standard PCI Express core to interface both functions to the link. The transaction layer within the link performs the TC/VC mapping and VC arbitration. The device-specific portion of the design is the function arbiter that determines the priority of data flows from the functions to the transaction layer of the core. Following are key considerations for such an approach:
  • Rather than duplicating the TC/VC mapping within each function, the standard device core performs the task. An important consideration for this decision is that all functions must use the same TC/VC mapping. The specification requires that the TC/VC mapping be the same for devices at each end of a link. This means that each function within the endpoint must have the same mappings.
  • The function arbiter uses TC values to determine the priority of transactions being delivered from the two functions, and selects the highest priority transaction from the functions when forwarding transactions to the transaction layer of the PCI Express core. The arbitration algorithm is hardwired based on the applications associated with each function.


Figure 6-15: Example Multi-Function Endpoint Implementation with VC Arbitration

Port Arbitration

When traffic from multiple ingress ports vies for the limited bandwidth of a common egress port, arbitration is required. The concept of port arbitration is pictured in Figure 6-16 on page 275. Note that port arbitration exists in three locations within a system:
  • Egress ports of switches
  • Root Complex ports when peer-to-peer transactions are supported
  • Root Complex egress ports that lead to resources such as main memory
Port arbitration requires software configuration, which is handled via PCI-to-PCI bridge (PPB) configuration in both switches and peer-to-peer transfers within the Root Complex and by the Root Complex Register Block when accessing shared root complex resources such as main memory. Port arbitration occurs independently for each virtual channel supported by the egress port. In the example below, root port 2 supports peer-to-peer transfers from root ports 1 and 2; however, peer-to-peer transfer support between root complex ports is not required.
Because port arbitration is managed independently for each VC of the egress port or RCRB, a port arbitration table is required for each VC that supports programmable port arbitration as illustrated in Figure 6-17 on page 276. Port arbitration tables are supported only by switches and RCRBs and are not allowed for endpoints, root ports and PCI Express bridges.
Figure 6-16: Port Arbitration Concept


The process of arbitrating between different packet streams also implies the use of additional buffers to accumulate traffic from each port in the egress port as illustrated in Figure 6-18 on page 277. This example illustrates two ingress ports (1 and 2) whose transactions are routed to an egress port (3). The actions taken by the switch include:
  1. Transactions arriving at the ingress ports are directed to the appropriate flow control buffers based on the TC/VC mapping.
  1. Transactions are forwarded from the flow control buffers and the routing logic is consulted to determine the egress port.
  1. Transactions are routed to the egress port (3) where TC/VC mapping determines into which VC buffer the transactions should be placed.
  1. A set of VC buffers is associated with each of the egress ports. Note that the ingress port number is tracked until transactions are placed in their VC buffer.
  1. Port arbitration logic determines the order in which transactions are sent from each group of VC buffers.
Figure 6-18: Port Arbitration Buffering

The Port Arbitration Mechanisms

The actual port arbitration mechanisms defined by the specification are similar to the models used for VC arbitration and include:
  • Non-configurable hardware-fixed arbitration scheme
  • Weighted Round Robin (WRR) arbitration with 32 phases
  • WRR arbitration with 64 phases
  • WRR arbitration with 128 phases
  • Time-based WRR arbitration with 128 phases
  • WRR arbitration with 256 phases
Configuration software must determine the port arbitration capability for a switch or RCRB and select the port arbitration scheme to be used for each enabled VC. Figure 6-19 on page 278 illustrates the registers and fields involved in determining port arbitration capabilities and selecting the port arbitration scheme to be used by each VC.


Figure 6-19: Software checks Port Arbitration Capabilities and Selects the Scheme to be Used.
Non-Configurable Hardware-Fixed Arbitration. This port arbitration mechanism does not require configuration of the port arbitration table. Once selected by software, the mechanism is managed solely by hardware. The actual arbitration scheme is based on a round-robin or similar approach in which each port has the same priority. This type of mechanism provides a degree of fairness and ensures that all transactions can make forward progress. However, it does not serve the goals of differentiated services and does not support isochronous transactions.
Weighted Round Robin Arbitration. Like the weighted round robin mechanism used for VC arbitration, software loads the port arbitration table such that some ports can receive higher priority than others based on the number of phases in the round robin that are allocated for each port. This approach allows software to facilitate differentiated services by assigning different weights to traffic coming from different ports.
As the table is scanned, each table phase specifies a port number that identifies the VC buffer from which the next transaction is sent. Once the transaction is delivered, arbitration control logic immediately proceeds to the next phase. If no transaction is pending transmission for the port specified in a given phase, the arbiter advances immediately to the next phase.
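As a rough illustration of this phase-scanning behavior, the C sketch below steps through a small WRR table. The table contents, pending counts, and function names are invented for the example and are not taken from the specification.

#include <stdint.h>
#include <stdio.h>

#define NUM_PHASES 8   /* real tables use 32/64/128/256 phases; 8 keeps the demo short */
#define NUM_PORTS  3

/* Each phase entry names the ingress port serviced during that phase.
 * Weighting: port 0 gets 4 of 8 phases, port 1 gets 3, port 2 gets 1. */
static const uint8_t phase_table[NUM_PHASES] = {0, 1, 0, 2, 0, 1, 0, 1};

/* Stand-in for the per-ingress-port VC buffers: number of TLPs waiting. */
static unsigned pending[NUM_PORTS] = {2, 5, 1};

/* One pass through the WRR table: service the port named by each phase and
 * advance immediately, whether or not a TLP was actually sent. A time-based
 * WRR arbiter would instead wait out the 100 ns virtual timeslot per phase. */
static void wrr_port_arbitration_pass(void)
{
    for (unsigned phase = 0; phase < NUM_PHASES; phase++) {
        uint8_t port = phase_table[phase];
        if (pending[port] > 0) {
            pending[port]--;
            printf("phase %u: sent one TLP from ingress port %u\n", phase, port);
        } else {
            printf("phase %u: port %u idle, advancing\n", phase, port);
        }
    }
}

int main(void)
{
    wrr_port_arbitration_pass();
    return 0;
}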
The specification defines four table lengths for WRR port arbitration, determined by the number of phases used by the table. The table length selections include:
  • 32 phases
  • 64 phases
  • 128 phases
  • 256 phases
Time-Based, Weighted Round Robin Arbitration. The time-based WRR mechanism is required for supporting isochronous transactions. Consequently, each switch egress port and RCRB that supports isochronous transactions must implement time-based WRR port arbitration.
Time-based weighted round robin adds the element of a virtual timeslot for each arbitration phase. Just as in WRR, the port arbiter delivers one transaction from the Ingress Port VC buffer indicated by the Port Number of the current phase. However, rather than immediately advancing to the next phase, the time-based arbiter waits until the current virtual timeslot elapses before advancing. This ensures that transactions are accepted from the ingress port buffer at regular intervals. Note that the timeslot does not govern the duration of the transfer, but rather the interval between transfers. The maximum duration of a transaction is the time it takes to complete the round robin and return to the original timeslot. Each timeslot is defined as 100ns.
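To see why the fixed 100ns timeslot matters for isochronous traffic, the short C sketch below estimates the bandwidth an ingress port is guaranteed when it owns a given number of phases in a 128-phase time-based table. The payload size and phase allocation are assumed example values, not requirements from the specification.

#include <stdio.h>

/* Time-based WRR: each phase is a 100 ns virtual timeslot, and one TLP may be
 * accepted per timeslot owned by a port. With a 128-phase table the round
 * repeats every 128 * 100 ns = 12.8 us, so guaranteed bandwidth for a port is
 * (owned_phases * bytes_per_tlp) / 12.8 us. Illustrative arithmetic only. */
int main(void)
{
    const double timeslot_ns    = 100.0;
    const int    table_phases   = 128;
    const int    owned_phases   = 16;    /* phases assigned to this ingress port */
    const int    payload_bytes  = 256;   /* assumed isochronous payload per TLP  */

    double round_us    = table_phases * timeslot_ns / 1000.0;          /* 12.8 us */
    double bytes_per_s = owned_phases * payload_bytes / (round_us * 1e-6);

    printf("guaranteed ~%.1f MB/s (%d of %d timeslots, %d-byte payloads)\n",
           bytes_per_s / 1e6, owned_phases, table_phases, payload_bytes);
    return 0;
}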


Also, it is possible that no transaction is delivered during a timeslot, resulting in an idle timeslot. This occurs when:
  • no transaction is pending for the selected ingress port during the current phase, or
  • the phase contains the port number of this egress port
Time-based WRR arbitration supports a maximum table length of 128 phases. The actual number of phases implemented is reported via the Maximum Time Slots field of each virtual channel that supports time-based WRR arbitration. See Figure 6-20 on page 280, which illustrates the Maximum Time Slots field within the VCn Resource Capability register. See MindShare's website for a white paper on example applications of Time-Based WRR.
Figure 6-20: Maximum Time Slots Register

Loading the Port Arbitration Tables

A port arbitration table is required for each VC supported by the egress port.
The actual size and format of the Port Arbitration Tables are a function of the number of phases and the number of ingress ports supported by the Switch, RCRB, or Root Port that supports peer-to-peer transfers. The maximum number of ingress ports supported by the Port Arbitration Table is 256 ports. The actual number of bits within each table entry is design dependent and governed by the number of ingress ports whose transactions can be delivered to the egress port. The size of each table entry is reported in the 2-bit Port Arbitration Table Entry Size field of Port VC Capability Register 1. The permissible values are:

  • 00b = 1 bit
  • 01b = 2 bits
  • 10b = 4 bits
  • 11b = 8 bits


Configuration software loads each table with port numbers to accomplish the desired port priority for each VC supported. As illustrated in Figure 6-21 on page 281, the port arbitration table format depends on the size of each entry and the number of time slots supported by the design.
Figure 6-21: Format of Port Arbitration Table
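The sketch below shows one way configuration software might extract phase entries from a packed Port Arbitration Table, given the entry size reported in Port VC Capability Register 1. The packing order assumed here (entry 0 in the least-significant bits of the first DWORD) is an illustration and should be confirmed against the device's actual table layout.

#include <stdint.h>
#include <stdio.h>

/* Extract one phase entry (an ingress port number) from a packed Port
 * Arbitration Table. entry_bits is 1, 2, 4, or 8, as encoded by the 2-bit
 * Port Arbitration Table Entry Size field. */
static uint8_t port_arb_entry(const uint32_t *table, unsigned phase,
                              unsigned entry_bits)
{
    unsigned entries_per_dw = 32 / entry_bits;
    unsigned dw    = phase / entries_per_dw;
    unsigned shift = (phase % entries_per_dw) * entry_bits;
    uint32_t mask  = (1u << entry_bits) - 1u;

    return (uint8_t)((table[dw] >> shift) & mask);
}

int main(void)
{
    /* 4-bit entries, 8 per DWORD: phases 0..7 name ports 0,1,0,2,0,1,0,1. */
    uint32_t table[1] = {0x10102010u};
    for (unsigned phase = 0; phase < 8; phase++)
        printf("phase %u -> ingress port %u\n",
               phase, port_arb_entry(table, phase, 4));
    return 0;
}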

Switch Arbitration Example

This section provides an example of a three-port switch with both Port and VC arbitration illustrated. The example presumes that packets arriving on ingress ports 0 and 1 are moving in the upstream direction and port 2 is the egress port facing the Root Complex. This example serves to summarize port and VC arbitration and illustrate their use within a PCI Express switch. Refer to Figure 6-22 on page 283 during the following discussion.
  1. Packets arrive at ingress port 0 and are placed in a receiver flow control buffer based on the TC/VC mapping associated with port 0. As indicated, TLPs carrying traffic class TC0 or TC1 are sent to the VC0 receiver flow control buffers. TLPs carrying traffic class TC3 or TC5 are sent to the VC1 receiver flow control buffers. No other TCs are permitted on this link.
  2. Packets arrive at ingress port 1 and are placed in a receiver flow control buffer based on port 1's TC/VC mapping. As indicated, TLPs carrying traffic class TC0 are sent to the VC0 receiver flow control buffers. TLPs carrying traffic class TC2-TC4 are sent to the VC3 receiver flow control buffers. No other TCs are permitted on this link.
  3. The target egress port is determined from routing information in each packet. Address routing is applied to memory or IO request TLPs, ID routing is applied to configuration or completion TLPs, etc.
  4. All packets destined for egress port 2 are subjected to the TC/VC mapping for that port. As shown, TLPs carrying traffic class TC0-TC2 are managed as virtual channel 0 (VC0) traffic, and TLPs carrying traffic class TC3-TC7 are managed as VC1 traffic.
  5. Independent Port Arbitration is applied to packets within each VC. This may be a fixed or weighted round robin arbitration used to select packets from all possible different ingress ports. Port arbitration ultimately results in all VCs of a given type being routed to the same VC buffer.
  6. Following Port Arbitration, VC arbitration determines the order in which transactions pending transmission within the individual VC buffers will be transferred across the link. The arbitration algorithm may be fixed or weighted round robin. The arbiter selects transactions from the head of each VC buffer based on the priority scheme implemented.
  7. Note that the VC arbiter selects packets for transmission only if sufficient flow control credits exist.
Figure 6-22: Example of Port and VC Arbitration within A Switch

Flow Control

The Previous Chapter

The previous chapter discussed Traffic Classes, Virtual Channels, and Arbitration, which support Quality of Service concepts in PCI Express implementations. The concept of Quality of Service in the context of PCI Express is an attempt to predict the bandwidth and latency associated with the flow of different transaction streams traversing the PCI Express fabric. The use of QoS is based on application-specific software assigning Traffic Class (TC) values to transactions, which define the priority of each transaction as it travels between the Requester and Completer devices. Each TC is mapped to a Virtual Channel (VC) that is used to manage transaction priority via two arbitration schemes called port and VC arbitration.

This Chapter

This chapter discusses the purposes and detailed operation of the Flow Control Protocol. This protocol requires each device to implement credit-based link flow control for each virtual channel on each port. Flow control guarantees that transmitters will never send Transaction Layer Packets (TLPs) that the receiver can't accept. This prevents receive buffer over-runs and eliminates the need for inefficient disconnects, retries, and wait-states on the link. Flow Control also helps enable compliance with PCI Express ordering rules by maintaining separate virtual channel Flow Control buffers for three types of transactions: Posted (P), Non-Posted (NP) and Completions (Cpl).

The Next Chapter

The next chapter discusses the ordering requirements for PCI Express devices, as well as PCI and PCI-X devices that may be attached to a PCI Express fabric. The discussion describes the Producer/Consumer programming model upon which the fundamental ordering rules are based. It also describes the potential performance problems that can emerge when strong ordering is employed, describes the weak ordering solution, and specifies the rules defined for deadlock avoidance.

Flow Control Concept

The ports at each end of every PCI Express link must implement Flow Control. Before a transaction packet can be sent across a link to the receiving port, the transmitting port must verify that the receiving port has sufficient buffer space to accept the transaction to be sent. In many other architectures including PCI and PCI-X, transactions are delivered to a target device without knowing if it can accept the transaction. If the transaction is rejected due to insufficient buffer space, the transaction is resent (retried) until the transaction completes. This procedure can severely reduce the efficiency of a bus, by wasting bus bandwidth when other transactions are ready to be sent.
Because PCI Express is a point-to-point implementation, the Flow Control mechanism would be ineffective if only one transaction stream were pending transmission across a link. That is, if the receive buffer were temporarily full, the transmitter would be prevented from sending a subsequent transaction due to transaction ordering requirements, thereby blocking any further transfers. PCI Express improves link efficiency by implementing multiple flow-control buffers for separate transaction streams (virtual channels). Because Flow Control is managed separately for each virtual channel implemented for a given link, if the Flow Control buffer for one VC is full, the transmitter can advance to another VC buffer and send transactions associated with it.
The link Flow Control mechanism uses a credit-based mechanism that allows the transmitting port to check buffer space availability at the receiving port. During initialization each receiver reports the size of its receive buffers (in Flow Control credits) to the port at the opposite end of the link. The receiving port continues to update the transmitting port regularly by transmitting the number of credits that have been freed up. This is accomplished via Flow Control DLLPs.
Flow control logic is located in the transaction layer of the transmitting and receiving devices. Both transmitter and receiver sides of each device are involved in flow control. Refer to Figure 7-1 on page 287 during the following descriptions.
  • Devices Report Buffer Space Available - The receiver of each node contains the Flow Control buffers. Each device must report the amount of flow control buffer space it has available to the device on the opposite end of the link. Buffer space is reported in units called Flow Control Credits (FCCs). The number of Flow Control Credits within each buffer is forwarded from the transaction layer to the transmit side of the link layer, as illustrated in Figure 7-1. The link layer creates a Flow Control DLLP that carries this credit information to the receiver at the opposite end of the link. This is done for each Flow Control buffer.
  • Receiving Credits - Notice that the receiver in Figure 7-1 also receives Flow Control DLLPs from the device at the opposite end of the link. This information is transferred to the transaction layer to update the Flow Control Counters that track the amount of Flow Control Buffer space in the other device.
  • Credit Checks Made - The transmitter consults the Flow Control Counters to check available credits before sending each transaction. If sufficient credits are available to receive the transaction pending delivery, the transaction is forwarded to the link layer and is ultimately sent to the opposite device. If enough credits are not available, the transaction is temporarily blocked until additional Flow Control credits are reported by the receiving device.
Figure 7-1: Location of Flow Control Logic

Flow Control Buffers

Flow control buffers are implemented for each VC resource supported by a PCI Express port. Recall that devices at each end of the link may not support the same number of VC resources; therefore, the maximum number of VCs configured and enabled by software is the greatest number of VCs in common between the two ports.

VC Flow Control Buffer Organization

Each VC Flow Control buffer at the receiver is managed for each category of transaction flowing through the virtual channel. These categories are:
  • Posted Transactions - Memory Writes and Messages
  • Non-Posted Transactions - Memory Reads, Configuration Reads and Writes, and I/O Reads and Writes
  • Completions - Read Completions and Write Completions
In addition, each of these categories is separated into header and data portions of each transaction. Flow control operates independently for each of the six buffers listed below (also see Figure 7-2 on page 289).
  • Posted Header
  • Posted Data
  • Non-Posted Header
  • Non-Posted Data
  • Completion Header
  • Completion Data
Some transactions consist of a header only (e.g., read requests) while others consist of a header and data (e.g., write requests). The transmitter must ensure that both header and data buffer space is available as required for each transaction before the transaction can be sent. Note that when transactions are received into a VC Flow Control buffer, ordering must be maintained when the transactions are forwarded to software or, in the case of a switch, to an egress port. The receiver must also track the order of header and data components within the Flow Control buffer.
Figure 7-2: Flow Control Buffer Organization

Flow Control Credits

Buffer space is reported by the receiver in units called Flow Control credits. The unit value of Flow Control credits (FCCs) may differ between header and data as listed below:
  • Header FCCs - maximum header size + digest
o 4 DWs for completions
o 5 DWs for requests
  • Data FCCs - 4 DWs (one aligned 16-byte block)
Flow control credits are passed within link layer Flow Control packets. Note that DLLPs themselves do not consume Flow Control credits because they originate and terminate at the link layer.
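A small sketch of how a transmitter might translate a pending TLP into required credits using these unit sizes (one credit per header, one data credit per aligned 16-byte block). The structure and function names are illustrative only.

#include <stdint.h>
#include <stdio.h>

/* Credits required for one TLP: headers always cost exactly 1 header credit;
 * data costs 1 credit per 16-byte (4 DW) chunk of payload, rounded up. */
typedef struct {
    uint32_t header_credits;
    uint32_t data_credits;
} fc_credits;

static fc_credits credits_for_tlp(uint32_t payload_bytes)
{
    fc_credits c;
    c.header_credits = 1;
    c.data_credits   = (payload_bytes + 15) / 16;  /* round up to 16-byte units */
    return c;
}

int main(void)
{
    fc_credits wr = credits_for_tlp(128);  /* 128-byte memory write */
    fc_credits rd = credits_for_tlp(0);    /* read request: header only */
    printf("128B write: %u header, %u data credits\n", wr.header_credits, wr.data_credits);
    printf("read request: %u header, %u data credits\n", rd.header_credits, rd.data_credits);
    return 0;
}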

Maximum Flow Control Buffer Size

The maximum buffer size that can be reported via the Flow Control Initialization and Update packets for the header and data portions of a transaction is as follows:

128 Credits for headers

  • 2,560 bytes of request headers @ 20 bytes/credit
  • 2,048 bytes of completion headers @ 16 bytes/credit

2048 Credits for data

  • 32KB @ 16 bytes/credit
The reason for these limits is discussed in the section entitled "Stage 1 - Flow Control Following Initialization" on page 296, step 2.

Introduction to the Flow Control Mechanism

The specification defines the requirements of the Flow Control mechanism by describing conceptual registers and counters along with procedures and mechanisms for reporting, tracking, and calculating whether a transaction can be sent. These elements define the functional requirements; however, the actual implementation may vary from the conceptual model. This section introduces the specified model that serves to explain the concept and define the requirements. The approach taken focuses on a single flow control example for a non-posted header. The concepts discussed apply to all Flow Control buffer types.

The Flow Control Elements

Figure 7-3 identifies and illustrates the elements used by the transmitter and receiver when managing flow control. This diagram illustrates transactions flowing in a single direction across a link, but of course another set of these elements is used to support transfers in the opposite direction. The primary function of each element within the transmitting and receiving devices is listed below. Note that for a single direction these Flow Control elements are duplicated for each Flow Control receive buffer, yielding six sets of elements. This example deals with non-posted header flow control.

Transmitter Elements

  • Pending Transaction Buffer - holds transactions that are pending transfer within the same virtual channel.
  • Credit Consumed Counter - tracks the size of all transactions sent from the VC buffer (of the specified type, e.g., non-posted headers) in Flow Control credits. This count is abbreviated "CC."
  • Credit Limit Register - this register is initialized by the receiving device when it sends Flow Control initialization packets to report the size of the corresponding Flow Control receive buffer. Following initialization, Flow Control update packets are sent periodically to add more Flow Control credits as they become available at the receiver. This value is abbreviated “CL.”
  • Flow Control Gating Logic - performs the calculations to determine if the receiver has sufficient Flow Control credits to receive the pending TLP (PTLP). In essence, this check ensures that the total CREDITS_CONSUMED (CC) plus the credits required for the next packet pending transmission (PTLP) does not exceed the CREDIT_LIMIT (CL). The specification defines the following equation for performing the check, with all values represented in credits:
CL - (CC + PTLP) mod 2^[FieldSize] <= 2^[FieldSize]/2
For an example application of this equation, See "Stage 1 - Flow Control Following Initialization" on page 294.
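A minimal C sketch of this gating check, assuming the 8-bit header and 12-bit data credit fields described later in this section; the function name and the demo values (taken from the Stage 1 and Stage 2 examples that follow) are purely illustrative.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Conceptual flow control gating check, a sketch of the formula above:
 *   (CL - (CC + PTLP)) mod 2^FieldSize  <=  2^FieldSize / 2
 * field_bits is 8 for header credits and 12 for data credits. All arithmetic
 * is unsigned and reduced modulo the counter width. */
static bool credits_available(uint32_t cl, uint32_t cc, uint32_t ptlp_credits,
                              unsigned field_bits)
{
    uint32_t mask = (1u << field_bits) - 1u;            /* modulo 2^FieldSize   */
    uint32_t diff = (cl - (cc + ptlp_credits)) & mask;  /* unsigned subtraction */
    return diff <= (1u << (field_bits - 1));            /* <= half the range    */
}

int main(void)
{
    /* Values from the Stage 1 example: CL = 66h, CC = 00h, one header credit. */
    printf("stage 1: %s\n", credits_available(0x66, 0x00, 1, 8) ? "send" : "block");

    /* Buffer-full case from Stage 2: CL = 66h, CC = 66h, one more credit needed. */
    printf("stage 2: %s\n", credits_available(0x66, 0x66, 1, 8) ? "send" : "block");
    return 0;
}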

Receiver Elements

  • Flow Control (Receive) Buffer - stores incoming header or data information.
  • Credit Allocated - This counter tracks the total Flow Control credits that have been allocated (made available) since initialization. It is initialized by hardware to reflect the size of the associated Flow Control buffer. As the buffer fills, the amount of available buffer space decreases until transactions are removed from the buffer. The number of Flow Control credits associated with each transaction removed from the buffer is added to the CREDIT_ALLOCATED counter, thereby keeping a running count of new credits made available.
  • Credits Received Counter (optional) - this counter keeps track of the total size of all data received from the transmitting device and placed into the Flow Control buffer (in Flow Control credits). When flow control is functioning properly, the CREDITS_RECEIVED count should be the same as the CREDITS_CONSUMED count at the transmitter and should be equal to or less than the CREDIT_ALLOCATED count. If this is not true, a flow control buffer overflow has occurred and an error is detected. Although optional, the specification recommends its use.
Flow control management is based on keeping track of Flow Control credits using modulo counters. Consequently, the counters are designed to roll over when the count saturates. The width of the counters depends on whether flow control is tracking transaction headers or data:
  • Header flow control uses modulo 256 counters (8-bits wide)
  • Data flow control uses modulo 4096 counters (12-bits wide)
In addition, all calculations are made using unsigned arithmetic. The operation of the counters and the calculations are explained by example on page 290 .
Figure 7-3: Flow Control Elements

Flow Control Packets

The transmit side of a device reports flow control credit information from its receive buffers to the opposite device. The specification defines three types of Flow Control packets:
  • Flow Control Init1 - used to report the size of the Flow Control buffers for a given virtual channel
  • Flow Control Init2 - same as Flow Control Init1 except it is used to verify completion of flow control initialization at each end of the link (receiving device ignores flow control credit information)
  • Flow Control Update - used to update Credit Limit periodically
Each Flow Control packet carries the header and data flow control credit information for one virtual channel and one credit type. The packet fields that carry the header and data Flow Control credits reflect the counter widths discussed in the previous section. Figure 7-4 pictures the format and content of these packets.
Figure 7-4: Types and Format of Flow Control Packets

Operation of the Flow Control Model An Example

The purpose of this example is to explain the operation of the Flow Control mechanism based on the conceptual model presented by the specification. The example uses the non-posted header Flow Control buffer type, and spans four stages to capture the nuances of the flow control implementation:
Stage One - Immediately following initialization, several transactions are tracked to explain the basic operation of the counters and registers as transactions are sent across the link. In this stage, data accumulates within the Flow Control buffer, but no transactions are being removed.
Stage Two - If the transmitter sends non-posted transactions at a rate such that the Flow Control buffer is filled faster than the receiver can forward transactions from the buffer, the buffer will fill. Stage two describes this circumstance.
Stage Three - The modulo counters are designed to roll over and continue counting from zero. This stage describes the flow control operation at the point of the CREDITS_ALLOCATED count rolling over to zero.
Stage Four - The specification describes the optional error check that can be made by the receiver in the event of a Flow Control buffer overflow. This error check is described in this section.

Stage 1 — Flow Control Following Initialization

The assumption made in this example is that flow control initialization has just completed and the devices are ready for normal operation. The Flow Control buffer is presumed to be 2KB in size, which represents 102d (66h) Flow Control units at 20 bytes/header. Figure 7-5 on page 295 illustrates the elements involved with the values that would be in each counter and register following flow control initialization.
Figure 7-5: Flow Control Elements Following Initialization
The transmitter must check Flow Control credits prior to sending a transaction. In the case of headers, the number of Flow Control units required is always one. The transmitter takes the following steps to determine if the transaction can be sent. For simplicity, this example ignores the possibility of data being included in the transaction.
The credit check is made using unsigned (2's complement) arithmetic in order to satisfy the following formula:
CL - (CC + PTLP) mod 2^[FieldSize] <= 2^[FieldSize]/2
Substituting values from Figure 7-5 yields:
66h - (00h + 01h) mod 2^8 <= 2^8/2
66h - 01h mod 256 <= 80h
  1. The current CREDITS_CONSUMED count (CC) is added to the PTLP credits required, to determine the CUMULATIVE_CREDITS_REQUIRED (CR), or 00h + 01h = 01h. Sufficient credits exist if this value is equal to or less than the credit limit.
  2. The CUMULATIVE_CREDITS_REQUIRED count is subtracted from the CREDIT_LIMIT count (CL) to determine if sufficient credits are available. The following description incorporates a brief review of 2's complement subtraction. When performing subtraction using 2's complement, the number to be subtracted is complemented (1's complement) and 1 is added (2's complement). This value is then added to the number being subtracted from. Any carry due to the addition is simply ignored.
The numbers to subtract are:
CL 01100110b (66h) - CR 00000001b (01h) = n
The number to be subtracted is converted to 2's complement:
00000001b -> 11111110b (1's complement)
11111110b + 1 = 11111111b (1's complement + 1 = 2's complement)
The 2's complement is then added:
  01100110b (CL = 66h)
+ 11111111b (2's complement of 01h)
= 01100101b = 65h (carry ignored)

Is the result <= 80h?

Yes, 65h <= 80h (send the transaction)
The result of the subtraction must be equal to or less than 1/2 the maximum value that can be tracked with a modulo 256 counter (128). This approach is taken to ensure unique results from the unsigned arithmetic. For example, unsigned 2's-complement subtraction yields the same results for both 0-128 and 255-127, as shown below.
00h (0) - 80h (128) = 80h (128):
00000000b - 10000000b = n
00000000b + 01111111b + 1b (add 2's complement of 80h)
00000000b + 10000000b = 10000000b (80h)
FFh (255) - 7Fh (127) = 80h (128):
11111111b - 01111111b = n
11111111b + 10000000b + 1b (add 2's complement of 7Fh)
11111111b + 10000001b = 10000000b (80h, carry ignored)
To ensure that conflicts such as the one above do not occur, the maximum number of unused credits that can be reported is limited to 2^8/2 (128) credits for headers and 2^12/2 (2048) credits for data. This means that the CREDITS_ALLOCATED count must never exceed the CREDITS_CONSUMED count by more than 128 for headers and 2048 for data. This ensures that any result less than or equal to 1/2 the maximum register count is a positive number and represents credits available, and results greater than 1/2 the maximum count are negative numbers that indicate credits not available.
  3. The CREDITS_CONSUMED count increments by one when the transaction is forwarded to the link layer.
  4. When the transaction arrives at the receiver, the transaction header is placed into the Flow Control buffer and the CREDITS_RECEIVED counter (optional) increments by one. Note that CREDIT_ALLOCATED does not change.
Figure 7-6 on page 297 illustrates the Flow Control elements following transfer of the first transaction.
Figure 7-6: Flow Control Elements Following Delivery of First Transaction

Stage 2 — Flow Control Buffer Fills Up

This example presumes that the receiving device has been unable to move transactions from the Flow Control buffer since initialization. This could be caused if the device core was temporarily busy and unable to process transactions. Consequently, the Flow Control buffer has completely filled. Figure 7-7 on page 299 illustrates this scenario.
Again, the transmitter checks Flow Control credits to determine if the next pending TLP can be sent. Unsigned arithmetic is performed to subtract the Credits Required from the CREDIT_LIMIT:
66h (CL) - 67h (CR) <= 80h?
01100110b - 01100111b <= 10000000b (if yes, send the transaction)
CL 01100110b (66h)
CR 10011001b (add 2's complement of 67h)
   11111111b = FFh <= 80h? (not true, don't send the packet)
Not until the receiver moves one or more transactions from the Flow Control buffer can the pending transaction be sent. When the first transaction is moved from the Flow Control buffer, the CREDIT_ALLOCATED count is increased to 67h. When the Update Flow Control packet is delivered to the transmitter, the new CREDIT_LIMIT will be loaded into the CL register. The resulting check will pass the test, thereby permitting the packet to be sent.
CL 01100111b (67h)
CR 10011001b (add 2's complement of 67h)
   00000000b = 00h <= 80h (send the transaction)
Figure 7-7: Flow Control Elements with Flow Control Buffer Filled

Stage 3 — The Credit Limit count Rolls Over

The receiver's CREDIT_LIMIT (CL) always runs ahead of (or is equal to) the CREDITS_CONSUMED (CC) count. Each time the transmitter performs a credit check, it adds the credits required (CR) for a TLP to the current CREDITS_CONSUMED count and subtracts the result from the current CREDIT_LIMIT to determine if enough credits are available to send the TLP.
Because both the CL count and the CC count only count upward, they are allowed to roll over from the maximum count back to 0. A problem appears to arise when the CL count (which, again, is running ahead) has rolled over and the CC has not. Figure 7-8 shows the CL and CR counts before and after CL rollover.
Figure 7-8: Flow Control Rollover Problem
If a simple signed subtraction were performed in the rollover case, the result would be negative, indicating that credits are not available. However, because unsigned arithmetic is used, the problem does not arise. See below:
CR = F8h: 11111000b -> 00000111b + 1 = 00001000b (2's complement)
CL 00001000b (08h)
CR 00001000b (add 2's complement of F8h)
   00010000b = 10h <= 80h (credits available)
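The same result falls out of ordinary unsigned 8-bit arithmetic, as the short C sketch below shows; the variable names are illustrative.

#include <stdint.h>
#include <stdio.h>

/* Rollover illustration: CL has wrapped to 08h while CC + PTLP credits = F8h.
 * With 8-bit unsigned (modulo-256) arithmetic the subtraction still yields the
 * true number of available credits, 10h, even though CL < CR numerically. */
int main(void)
{
    uint8_t cl = 0x08;                  /* Credit Limit after rollover         */
    uint8_t cr = 0xF8;                  /* CREDITS_CONSUMED + pending TLP (CR) */
    uint8_t diff = (uint8_t)(cl - cr);  /* wraps modulo 256                    */

    printf("CL - CR = %02Xh -> %s\n", diff,
           diff <= 0x80 ? "credits available, send" : "blocked");
    return 0;
}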

Stage 4 — FC Buffer Overflow Error Check

The specification recommends implementation of the optional FC buffer overflow error checking mechanism. These optional elements include:
  • CREDITS_RECEIVED counter
  • Error Check Logic
These elements permit the receiver to track Flow Control credits in the same manner as the transmitter. That is, the transmitter CREDIT_LIMIT count should
be the same as the receiver's CREDITS_ALLOCATED count (after an Update DLLP is sent) and the receiver's CREDITS_RECEIVED count should be the same as the transmitter's CREDITS_CONSUMED count. If flow control is working correctly the following will be true:
  • the transmitter's CREDITS_CONSUMED count should always be <= its CREDIT_LIMIT
  • the receiver's CREDITS_RECEIVED count (CR) should always be <= its CREDITS_ALLOCATED count (CA)
An overflow condition is detected when the following formula is satisfied. Note that the field size is either 8 (headers) or 12 (data):
(CA - CR) mod 2^[FieldSize] > 2^[FieldSize]/2
If the formula is true, then the result is negative; thus, more credits have been sent to the FC buffer than were available and an overflow has occurred. Note that the 1.0a version of the specification defines the equation with >= rather than > as shown above. This appears to be an error, because when CA = CR no overflow condition exists. For example, consider the case right after initialization where the receiver advertises 128 credits for the transmitter to use: CA = 128 and CR = 0, because nothing has been received yet. With >=, the equation evaluates true, which would indicate an overflow when all that has happened is that the maximum allowed number of credits has been advertised. If the equation uses only > and not >=, everything works.
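A sketch of this receiver-side overflow check using the corrected (>) comparison; the function name and test values are illustrative.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Optional receiver-side overflow check:
 *   overflow when (CA - CR) mod 2^FieldSize  >  2^FieldSize / 2
 * i.e. more credits appear to have been used than were ever allocated.
 * field_bits is 8 for header credits, 12 for data credits. */
static bool fc_buffer_overflow(uint32_t ca, uint32_t cr, unsigned field_bits)
{
    uint32_t mask = (1u << field_bits) - 1u;
    return ((ca - cr) & mask) > (1u << (field_bits - 1));
}

int main(void)
{
    /* Right after init: 128 credits advertised, nothing received - no overflow. */
    printf("CA=128, CR=0   : %s\n", fc_buffer_overflow(128, 0, 8) ? "overflow" : "ok");

    /* Receiver got 130 credits' worth against 128 allocated - overflow. */
    printf("CA=128, CR=130 : %s\n", fc_buffer_overflow(128, 130, 8) ? "overflow" : "ok");
    return 0;
}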

Infinite Flow Control Advertisement

PCI Express defines an infinite Flow Control credit value. A device that advertises infinite Flow Control credits need not send Flow Control Update packets following initialization and the transmitter will never be blocked from sending transactions. During flow control initialization, a device advertises "infinite" credits by delivering a zero in the credit field of the FC_INIT1 DLLP.

Who Advertises Infinite Flow Control Credits?

It's interesting to note that the minimum Flow Control credits that must be advertised includes infinite credits for completion transactions in certain situations. See Table 7-1 on page 303. These requirements involve devices that originate requests for which completions are expected to be returned (i.e., Endpoints
and root ports that do not support peer-to-peer transfers). It does not include devices that merely forward completions (switches and root ports that support peer-to-peer transfers). This implies a requirement that any device initiating a request must commit buffer space for the expected completion header and data (if applicable). This guarantees that no throttling would ever occur when completions cross the final link to the original requester. This type of rule is required of PCI-X devices that initiate split transactions. Multiple searches of the specification failed to reveal this requirement explicitly stated for PCI Express devices; however, it is implied by the requirement to advertise infinite Flow Control credits.
Note also that infinite flow control credits can only be advertised during initialization. This must be true, because the CA counter in the receiver could roll over to 00h and send an Update FC packet with the credit field set to 00h. If the Link is in the DL_Init state, this means infinite credits, but if the Link is in the DL_Active state, it does not.

Special Use for Infinite Credit Advertisements.

The specification points out a special consideration for devices that do not need to implement all the FC buffer types for all VCs. For example, the only Non-Posted writes are I/O Writes and Configuration Writes, both of which are permitted only on VC0. Thus, Non-Posted data buffers are not needed for VC1-VC7. Because no Flow Control tracking is needed, a device can simply advertise infinite Flow Control credits during initialization, thereby eliminating the need to send needless FC_Update packets.

Header and Data Advertisements May Conflict

An infinite Flow Control advertisement might be sent for either the data or header buffer (of the same FC type) but not both. In this case, Update DLLPs are required for one buffer but not the other. This simply means that the device requiring credits will send an Update DLLP with the corresponding field containing the CREDITS_ALLOCATED credit information, and the other field must be set to zero (consistent with its advertisement).

The Minimum Flow Control Advertisement

The minimum number of credits that can be reported for the different Flow Control buffer types is listed in Table 7-1 on page 303.
Table 7-1: Required Minimum Flow Control Advertisements
Credit Type: Minimum Advertisement
Posted Request Header (PH): 1 unit. Credit Value = one 4DW header + digest = 5DW.
Posted Request Data (PD): Largest possible setting of Max_Payload_Size for the component, divided by the FC unit size (4DW). Example: if the largest Max_Payload_Size value supported is 1024 bytes, the smallest permitted initial credit value is 040h.
Non-Posted Request Header (NPH): 1 unit. Credit Value = one 4DW header + digest = 5DW.
Non-Posted Request Data (NPD): 1 unit. Credit Value = 4DW.
Completion Header (CPLH): 1 unit (Credit Value = one 3DW header + digest = 4DW) for Root Complexes with peer-to-peer support and for Switches. Infinite units (initial credit value = all 0s) for Root Complexes with no peer-to-peer support and for Endpoints.
Completion Data (CPLD): n units, where n = the largest possible setting of Max_Payload_Size or the size of the largest Read Request (whichever is smaller), divided by the FC unit size (4DW), for Root Complexes with peer-to-peer support and for Switches. Infinite units (initial credit value = all 0s) for Root Complexes with no peer-to-peer support and for Endpoints.


Flow Control Initialization

Prior to sending any transactions, flow control initialization must be performed. Initialization occurs for each link in the system and involves a handshake between the devices attached to the same link. TLPs associated with the virtual channel being initialized cannot be forwarded across the link until Flow Control Initialization is performed successfully.
Once initiated, the flow control initialization procedure is fundamentally the same for all Virtual Channels. The small differences that exist are discussed later. Initialization of VC0 (default VC) must be done in hardware so that configuration transactions can traverse the PCI Express fabric. Other VCs initialize once configuration software has set up and enabled the VCs at both ends of the link. Enabling a VC triggers hardware to perform flow control initialization for this VC.
Figure 7-9 pictures the Flow Control counters within the devices at both ends of the link, along with the state of flag bits used during initialization.
Figure 7-9: Initial State of Example FC Elements

The FC Initialization Sequence

PCI Express defines two stages in flow control initialization: FC_INIT1 and FC_INIT2. Each stage of course involves the use of the Flow Control packets (FCPs).
  • Flow Control Init1 - reports the size of the Flow Control buffers for a given virtual channel
  • Flow Control Init2 - verifies that the device transmitting the Init2 packet has completed the flow control initialization for the specified VC and buffer type.

FC Init1 Packets Advertise Flow Control Credits Available

During the FC_INIT1 state, a device continuously outputs a sequence of 3 InitFC1 Flow Control packets advertising its posted, non-posted, and completion receiver buffer sizes. (See Figure 7-10.) Each device also waits to receive a similar sequence from its neighbor. Once a device has received the complete sequence and sent its own, it initializes its transmit counters, sets an internal flag (FI1), and exits FC_INIT1. This process is illustrated in Figure 7-11 on page 306 and described below. The example shows Device A reporting Non-Posted Buffer Credits and Device B reporting Posted Buffer Credits. This illustrates that the devices need not be in synchronization regarding what they are reporting. In fact, the two devices will typically not start the flow control initialization process at the same time.
Figure 7-10: INIT1 Flow Control Packet Format and Contents


Figure 7-11: Devices Send and Initialize Flow Control Registers
  1. Each device sends InitFC1 type Flow Control packets (FCPs) to advertise the size of its respective receive buffers. A separate FCP is required for posted request (P), non-posted request (NP), and completion (CPL) packet types. The order in which this sequence of three FCPs is sent is:
  • Header and Data buffer credit units for Posted Requests (P)
  • Header and Data buffer credit units for Non-Posted Requests (NP)
  • Header and Data buffer credit units for Completions (CPL)
The sequence of FCPs is repeated continuously until a device leaves the FC_INIT1 initialization state.
  2. In the meantime, devices take the credit information received and initialize their transmit Credit Limit registers. In this example, Device A loads its PH transmit Credit Limit register with a value of 4, which was reported by Device B for its posted request header FC buffer. It also loads its PD Credit Limit register with a value of 64d credits (1024 bytes worth of data) for accompanying posted data. Similarly, Device B loads its NPH transmit Credit Limit counter with a value of 2 for non-posted request headers and its NPD transmit counter with a value of 32d credits (512 bytes worth of data) for accompanying non-posted data.
  3. Note that when this process is complete, the Credits Allocated counters in the receivers and the corresponding Credit Limit counters in the transmitters will be equal.
  4. Once a device receives InitFC1 values for a given buffer type (e.g., Posted) and has recorded them, the FC_INIT1 state is complete for that Flow Control buffer. Once all FC buffers for a given VC have completed the FC_INIT1 state, Flag 1 (FI1) is set and the device ceases to send InitFC1 DLLPs and advances to the FC_INIT2 state. Note that receipt of an InitFC2 packet may also cause FI1 to be set. This can occur if the neighboring device has already advanced to the FC_INIT2 state.

FC Init2 Packets Confirm Successful FC Initialization

PCI Express defines the FC_INIT2 state, which is used as feedback to verify that Flow Control initialization has been successful for a given VC. During FC_INIT2, each device continuously outputs a sequence of 3 InitFC2 Flow Control packets; however, credit values are discarded during the FC_INIT2 state. Note that devices are permitted to send TLPs upon entering the FC_INIT2 state. Figure 7-12 illustrates InitFC2 behavior, which is described following the illustration.
Figure 7-12: Devices Confirm that Flow Control Initialization is Completed for a Given Buffer
  1. At the start of initialization state FC_INIT2, each device commences sending InitFC2 type Flow Control packets (FCPs) to indicate that it has completed the FC_INIT1 state. Devices use the same repetitive sequence when sending FCPs in this state as before:
  • Header and Data buffer credit allocation for Posted Requests (P)
  • Header and Data buffer credit allocation for Non-Posted Requests (NP)
  • Header and Data buffer credit allocation for Completions (CPL)
  2. All credits reported in InitFC2 FCPs may be discarded, because the transmit Credit Limit counters were already set up during FC_INIT1.
  3. Once a device receives an InitFC2 packet for any buffer type, it sets an internal flag (FI2). (It does not wait to receive an InitFC2 for each type.) Note that FI2 is also set upon receipt of an UpdateFC packet or a TLP. (A compact sketch of the complete FI1/FI2 handshake follows this list.)
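The following C sketch compresses the FI1/FI2 handshake described above into a small state machine. The packet classification, state names, and helper function are inventions for illustration; a real Data Link Layer implementation also transmits its own InitFC1/InitFC2 sequences concurrently with receiving its neighbor's.

#include <stdbool.h>
#include <stdio.h>

/* Received packet kinds that matter to the init handshake (illustrative). */
typedef enum { RX_INITFC1, RX_INITFC2, RX_UPDATEFC, RX_TLP } rx_kind;
typedef enum { FC_INIT1, FC_INIT2, FC_DONE } fc_state;

typedef struct {
    fc_state state;
    bool got_p, got_np, got_cpl;   /* InitFC1 credits recorded per type */
} fc_init;

static void fc_rx(fc_init *fc, rx_kind kind, int type /* 0=P, 1=NP, 2=CPL */)
{
    if (fc->state == FC_INIT1) {
        if (kind == RX_INITFC1) {          /* record advertised credits */
            if (type == 0) fc->got_p   = true;
            if (type == 1) fc->got_np  = true;
            if (type == 2) fc->got_cpl = true;
        }
        /* FI1: all three types recorded, or the neighbor is already in INIT2. */
        if ((fc->got_p && fc->got_np && fc->got_cpl) || kind == RX_INITFC2)
            fc->state = FC_INIT2;
    } else if (fc->state == FC_INIT2) {
        /* FI2: any InitFC2, UpdateFC, or TLP confirms the neighbor is done. */
        if (kind == RX_INITFC2 || kind == RX_UPDATEFC || kind == RX_TLP)
            fc->state = FC_DONE;
    }
}

int main(void)
{
    fc_init fc = { FC_INIT1, false, false, false };
    fc_rx(&fc, RX_INITFC1, 0);   /* P   */
    fc_rx(&fc, RX_INITFC1, 1);   /* NP  */
    fc_rx(&fc, RX_INITFC1, 2);   /* CPL -> FI1 set, enter FC_INIT2 */
    fc_rx(&fc, RX_INITFC2, 0);   /* -> FI2 set, initialization done */
    printf("final state: %s\n", fc.state == FC_DONE ? "FC_DONE" : "in progress");
    return 0;
}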

Rate of FC_INIT1 and FC_INIT2 Transmission

The specification defines the latency between sending FC_INIT DLLPs as follows:
  • VC0. Hardware initiated flow control of VC0 requires that FC_INIT1 and FC_INIT2 packets be transmitted "continuously at the maximum rate possible." That is, the resend timer is set to a value of zero.
  • VC1-VC7. When software initiates flow control initialization, the FC_INIT sequence is repeated "when no other TLPs or DLLPs are available for transmission." However, the latency between the beginning of one sequence to the next can be no greater than 17μs .

Violations of the Flow Control Initialization Protocol

A device may optionally check for violations of the flow control initialization protocol. A detected violation can be reported as a Data Link Layer protocol error. See "Link Flow Control-Related Errors" on page 363.

Flow Control Updates Following FC_INIT

The receiver must continually update its neighboring device to report additional Flow Control credits that have accumulated as a result of moving transactions from the Flow Control buffer. Figure 7-13 on page 309 illustrates an example where the transmitter was previously blocked from sending header transactions because the Flow Control buffer was full. In the example, the receiver has just removed three headers from the Flow Control buffer. More space is now available, but the neighboring device has no knowledge of this. As each header is removed from the Flow Control buffer, the
CREDITS_ALLOCATED count increments. The new count is delivered to the CREDIT_LIMIT register of the neighboring device via an update Flow Control packet. The updated credit limit allows transmission of additional transactions.
Figure 7-13: Flow Control Update Example

FC_Update DLLP Format and Content

Recall that update Flow Control packets, like the Flow Control initialization packets, contain two update fields for the selected credit type (P, NP, or Cpl): one for header credits and one for data credits. Figure 7-14 on page 310 depicts the content of the update packet. The receiver's CREDITS_ALLOCATED counts that are reported in the HdrFC and DataFC fields may have been updated many times or not at all since the last update packet was sent.
Figure 7-14: Update Flow Control Packet Format and Contents

Flow Control Update Frequency

The specification defines a variety of rules and suggested implementations that govern when and how often Flow Control Update DLLPs should be sent. The motivation includes:
  • Notifying the transmitting device as early as possible about new credits allocated, which allows previously blocked transactions to continue.
  • Establishing worst-case latency between FC Packets.
  • Balancing the requirements and variables associated with flow control operation. This involves:
o the need to report credits available often enough to prevent transaction blocking
o the desire to reduce the link bandwidth required to send FC_Update DLLPs
o selecting the optimum buffer size
o the maximum data payload size
  • Detecting violation of the maximum latency between Flow Control packets.
The update frequency limits specified assume that the link is in the active state (L0 or L0s, where the "s" denotes standby). All other link states represent more aggressive power management with longer recovery latencies that require link recovery prior to sending packets.

Immediate Notification of Credits Allocated

When a Flow Control buffer has filled to the extent that maximum-sized packets cannot be sent, the specification requires immediate delivery of an FC_Update DLLP once the deficit is eliminated. Specifically, when additional credits are allocated by a receiver such that sufficient space now exists to accept another maximum-sized packet, an Update packet must be sent. Two cases exist:
  • Maximum Packet Size = 1 Credit. When packet transmission is blocked due to a buffer-full condition for non-infinite NPH, NPD, PH, and CPLH buffer types, an UpdateFC packet must be scheduled for transmission when one or more credits are made available (allocated) for that buffer type.
  • Maximum Packet Size = Max_Payload_Size. Flow Control buffer space may decrease to the extent that a maximum-sized packet cannot be sent for non-infinite PD and CPLD credit types. In this case, when one or more additional credits are allocated, an Update FCP must be scheduled for transmission.

Maximum Latency Between Update Flow Control DLLPs

Update FCPs for each non-infinite FC credit type must be scheduled for transmission at least once every 30μs (0%/+50%). If the Extended Sync bit within the Link Control register is set, Updates must be scheduled at least once every 120μs (0%/+50%). Note that Update FCPs may be scheduled for transmission more frequently than required.

Calculating Update Frequency Based on Payload Size and Link Width

The specification offers a formula for calculating the frequency at which update packets need to be sent for maximum data payload sizes and link widths. The formula, shown below, defines FC Update delivery intervals in symbol times (4ns).
((MaxPayloadSize + TLPOverhead) x UpdateFactor) / LinkWidth + InternalDelay
where:
  • MaxPayloadSize = The value in the Max_Payload_Size field of the Device Control register
  • TLPOverhead = the constant value (28 symbols) representing the additional TLP components that consume Link bandwidth (header, LCRC, framing Symbols)
  • UpdateFactor = the number of maximum size TLPs sent during the interval between UpdateFC Packets received. This number balances link bandwidth efficiency and receive buffer sizes - the value varies with Max_Payload_Size and Link width
  • LinkWidth = The operating width of the Link negotiated during initialization
  • InternalDelay = a constant value of 19 symbol times that represents the internal processing delays for received TLPs and transmitted DLLPs
The simple relationship defined by the formula shows that, for a given data payload and buffer size, the frequency of update packet delivery becomes higher as the link width increases. This relatively simple approach suggests a timer implementation that triggers scheduling of update packets. Note that this formula does not account for delays associated with the receiver or transmitter being in the L0s power management state.
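The sketch below simply plugs example values into the formula. The UpdateFactor, payload size, and link width are assumed example values, and the 4ns symbol time is the value given above.

#include <stdio.h>

/* Sketch of the FC update interval formula, in symbol times:
 *   ((MaxPayloadSize + TLPOverhead) * UpdateFactor) / LinkWidth + InternalDelay
 * TLPOverhead and InternalDelay are the constants given above (28 and 19
 * symbols); the other inputs are illustrative assumptions. */
int main(void)
{
    const double max_payload    = 256;  /* bytes, from Device Control register       */
    const double tlp_overhead   = 28;   /* symbols: header, LCRC, framing            */
    const double update_factor  = 1.4;  /* example value; varies with payload/width  */
    const double link_width     = 8;    /* negotiated x8 link                        */
    const double internal_delay = 19;   /* symbols of internal processing delay      */

    double symbols = (max_payload + tlp_overhead) * update_factor / link_width
                     + internal_delay;
    printf("update interval ~ %.1f symbol times (~%.1f ns at 4 ns/symbol)\n",
           symbols, symbols * 4.0);
    return 0;
}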
The specification recognizes that the formula will be inadequate for many applications such as those that stream large blocks of data. These applications may require buffer sizes larger than the minimum specified, as well as a more sophisticated update policy in order to optimize performance and reduce power consumption. Because a given solution is dependent on the particular requirements of an application, no definition for such policies is provided.

Error Detection Timer — A Pseudo Requirement

The specification defines an optional time-out mechanism that is highly recommended; indeed, the specification points out that it is expected to become a requirement in future versions of the spec. This mechanism detects prolonged absences of Flow Control packets. The maximum latency between FC packets for a given Flow Control credit type is specified to be no greater than 120μs. This error detection timer has a maximum limit of 200μs, and it is reset any time a Flow Control packet of any type is received. If a time-out occurs, this suggests a serious problem with a device's ability to report Flow Control credits. Consequently, a time-out triggers the Physical Layer to enter its Recovery state, which retrains the link and hopefully clears the error condition. Characteristics of this timer include:
  • operational only when the link is in its active state (L0 or L0s)
  • maximum count limited to 200μs (0%/+50%)
  • timer is reset when any Init or Update FCP is received; optionally, the timer may be reset by the receipt of any type of DLLP
  • when the timer expires, the Physical Layer enters the Link Training and Status State Machine (LTSSM) Recovery state

Transaction Ordering

The Previous Chapter

The previous chapter discussed the purposes and detailed operation of the Flow Control Protocol. This protocol requires each device to implement credit-based link flow control for each virtual channel on each port. Flow control guarantees that transmitters will never send Transaction Layer Packets (TLPs) that the receiver can't accept. This prevents receive buffer over-runs and eliminates the need for inefficient disconnects, retries, and wait-states on the link. Flow Control also helps enable compliance with PCI Express ordering rules by maintaining separate Virtual Channel Flow Control buffers for three types of transactions: Posted (P), Non-Posted (NP) and Completions (Cpl).

This Chapter

This chapter discusses the ordering requirements for PCI Express devices as well as PCI and PCI-X devices that may be attached to a PCI Express fabric. The discussion describes the Producer/Consumer programming model upon which the fundamental ordering rules are based. It also describes the potential performance problems that can emerge when strong ordering is employed and specifies the rules defined for deadlock avoidance.

The Next Chapter

Native PCI Express devices that require interrupt support must use the Message Signaled Interrupt (MSI) mechanism defined originally in the PCI 2.2 specification. The next chapter details the MSI mechanism and also describes the legacy support that permits virtualization of the PCI INTx signals required by devices such as PCI Express-to-PCI Bridges.

Introduction

As with other protocols, PCI Express imposes ordering rules on transactions moving through the fabric at the same time. The reasons for the ordering rules include:
  • Ensuring that the completion of transactions is deterministic and in the sequence intended by the programmer.
  • Avoiding deadlock conditions.
  • Maintaining compatibility with ordering already used on legacy buses (e.g., PCI, PCI-X, and AGP).
  • Maximizing performance and throughput by minimizing read latencies and managing read/write ordering.
PCI Express ordering is based on the same Producer/Consumer model as PCI. The split transaction protocol and related ordering rules are fairly straightforward when restricting the discussion to transactions involving only native PCI Express devices. However, ordering becomes more complex when including support for the legacy buses mentioned in bullet three above.
Rather than presenting the ordering rules defined by the specification and attempting to explain the rationale for each rule, this chapter takes the building block approach. Each major ordering concern is introduced one at a time. The discussion begins with the most conservative (and safest) approach to ordering, progresses to a more aggressive approach (to improve performance), and culminates with the ordering rules presented in the specification. The discussion is segmented into the following sections:
  1. The Producer/Consumer programming model upon which the fundamental ordering rules are based.
  2. The fundamental PCI Express device ordering requirements that ensure the Producer/Consumer model functions correctly.
  3. The Relaxed Ordering feature that permits violation of the Producer/Consumer ordering when the device issuing a request knows that the transaction is not part of a Producer/Consumer programming sequence.
  4. Modification of the strong ordering rules to improve performance.
  5. Avoiding deadlock conditions and support for PCI legacy implementations.

Producer/Consumer Model

Readers familiar with the Producer/Consumer programming model may choose to skip this section and proceed directly to "Native PCI Express Ordering Rules" on page 318.
The Producer/Consumer model is a common methodology that two requester-capable devices might use to communicate with each other. Consider the following example scenario:
  1. A network adapter begins to receive a stream of compressed video data over the network and performs a series of memory write transactions to deliver the stream of compressed video data into a Data buffer in memory (in other words, the network adapter is the Producer of the data).
  2. After the Producer moves the data to memory, it performs a memory write transaction to set an indicator (or Flag) in a memory location (or a register) to indicate that the data is ready for processing.
  3. Another requester (referred to as the Consumer) periodically performs a memory read from the Flag location to see if there's any data to be processed. In this example, this requester is a video decompressor that will decompress and display the data.
  4. When it sees that the Flag has been set by the Producer, it performs a memory write to clear the Flag, followed by a burst memory read transaction to read the compressed data (it consumes the data; hence the name Consumer) from the Data buffer in memory.
  5. When it is done consuming the Data, the Consumer writes the completion status into the Status location. It then resumes periodically reading the Flag location to determine when more data needs to be processed.
  6. In the meantime, the Producer has been reading periodically from the Status location to see if data processing has been completed by the other requester (the Consumer). This location typically contains zero until the other requester completes the data processing and writes the completion status into it. When the Producer reads the Status and sees that the Consumer has completed processing the Data, the Producer then performs a memory write to clear the Status location.
  7. The process then repeats whenever the Producer has more data to be processed.
Ordering rules are required to ensure that the Producer/Consumer model works correctly no matter where the Producer, the Consumer, the Data buffer, the Flag location, and the Status location are located in the system (in other words, no matter how they are distributed on various links in the system).
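For readers who prefer code to prose, here is a minimal sketch of the same sequence with ordinary memory standing in for the Data buffer, Flag, and Status locations. On real hardware each step is a memory read or write crossing the fabric, and the ordering rules exist precisely so that the Flag write is never observed before the Data writes it announces; the names and sizes below are invented for the example.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t data_buffer[64];          /* Data buffer in memory               */
static volatile bool    flag;            /* "data ready" indicator (Producer)   */
static volatile uint8_t status;          /* completion status (Consumer)        */

static void producer(const uint8_t *payload, size_t len)
{
    memcpy(data_buffer, payload, len);   /* 1. memory writes deliver the data   */
    flag = true;                         /* 2. write the Flag: data is ready    */
}

static void consumer(void)
{
    if (!flag)                           /* 3. poll the Flag                    */
        return;
    flag = false;                        /* 4. clear the Flag ...               */
    printf("consuming: %s\n", (const char *)data_buffer); /* ... read the Data  */
    status = 1;                          /* 5. report completion status         */
}

int main(void)
{
    producer((const uint8_t *)"compressed frame", 17);
    consumer();
    printf("status = %u\n", status);     /* 6. Producer would poll this value   */
    return 0;
}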

Native PCI Express Ordering Rules

PCI Express transaction ordering for native devices can be summarized with four simple rules:
  1. PCI Express requires strong ordering of transactions (i.e., performing transactions in the order issued by software) flowing through the fabric that have the same TC assignment (see item 4 for the exception to this rule). Because all transactions that have the same TC value assigned to them are mapped to a given VC, the same rules apply to transactions within each VC.
  2. No ordering relationship exists between transactions with different TC assignments.
  3. The ordering rules apply in the same way to all types of transactions: memory, IO, configuration, and messages.
  4. Under limited circumstances, transactions with the Relaxed Ordering attribute bit set can be ordered ahead of other transactions with the same TC.
These fundamental rules ensure that transactions always complete in the order intended by software. However, these rules are extremely conservative and do not necessarily result in optimum performance. For example, when transactions from many devices merge within switches, there may be no ordering relationship between transactions from these different devices. In such cases, more aggressive rules can be applied to improve performance as discussed in "Modified Ordering Rules Improve Performance" on page 322.

Producer/Consumer Model with Native Devices

Because the Producer/Consumer model depends on strong ordering, when the following conditions are met native PCI Express devices support this model without additional ordering rules:
  1. All elements associated with the Producer/Consumer model reside within native PCI Express devices.
  1. All transactions associated with the operation of the Producer/Consumer model transverse only PCI Express links within the same fabric.
  1. All associated transactions have the same TC values. If different TC values are used, then the strong ordering relationship between the transactions is no longer guaranteed.
  1. The Relaxed Ordering (RO) attribute bit of the transactions must be cleared to avoid reordering the transactions that are part of the Producer/Consumer transaction series.
When legacy PCI devices reside within a PCI Express system, the ordering rules become more involved. Additional ordering rules apply because of PCI's delayed transaction protocol; without them, this protocol could permit Producer/Consumer transactions to complete out of order and cause the programming model to break.

Relaxed Ordering

PCI Express supports the Relaxed Ordering mechanism introduced by PCI-X; however, PCI Express introduces some changes (discussed later in this chapter). The concept of Relaxed Ordering in the PCI Express environment allows switches in the path between the Requester and Completer to reorder some transactions just received before others that were previously enqueued.
The ordering rules that exist to support the Producer/Consumer model may result in transactions being blocked, when in fact the blocked transactions are completely unrelated to any Producer/Consumer transaction sequence. Consequently, in certain circumstances, a transaction with its Relaxed Ordering (RO) attribute bit set can be re-ordered ahead of other transactions.
The Relaxed Ordering bit may be set by a device if its device driver has enabled it to do so (by setting the Enable Relaxed Ordering bit in the Device Control register; see Table 24-3 on page 906). Relaxed ordering gives switches and the Root Complex permission to move such a transaction ahead of others, an action that is normally prohibited.
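The following hedged C sketch shows how a driver might set the Enable Relaxed Ordering bit in the Device Control register. The pci_cfg_read16()/pci_cfg_write16() helpers and the pcie_cap_base parameter are hypothetical stand-ins for the platform's configuration-access mechanism; the bit position shown (bit 4 of Device Control) follows the PCI Express Device Control register definition.

```c
/* Hedged sketch: a driver enabling Relaxed Ordering for its device. */
#include <stdint.h>

#define PCIE_DEV_CTL_OFFSET   0x08          /* Device Control register offset */
#define DEV_CTL_ENABLE_RO     (1u << 4)     /* Enable Relaxed Ordering bit    */

extern uint16_t pci_cfg_read16(uint16_t offset);             /* hypothetical */
extern void     pci_cfg_write16(uint16_t offset, uint16_t value);

void enable_relaxed_ordering(uint16_t pcie_cap_base)
{
    uint16_t ctl = pci_cfg_read16(pcie_cap_base + PCIE_DEV_CTL_OFFSET);
    ctl |= DEV_CTL_ENABLE_RO;         /* device may now set RO in its TLPs */
    pci_cfg_write16(pcie_cap_base + PCIE_DEV_CTL_OFFSET, ctl);
}
```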

RO Effects on Memory Writes and Messages

PCI Express Switches and the Root Complex are affected by memory write and message transactions that have their RO bit set. Memory write and Message transactions are treated the same in most respects: both are handled as posted operations, both are received into the same Posted buffer, and both are subject to the same ordering requirements. When the RO bit is set, switches handle these transactions as follows:
  • Switches are permitted to reorder memory write transactions just posted ahead of previously posted memory write transactions or message transactions. Similarly, message transactions just posted may be ordered ahead of previously posted memory write or message transactions. Switches must also forward the RO bit unmodified. The ability to reorder these transactions within switches is not supported by PCI-X bridges; in PCI-X, all posted writes must be forwarded in the exact order received. Another difference between the PCI-X and PCI Express implementations is that message transactions are not defined for PCI-X.
  • The Root Complex is permitted to order a just-posted write transaction ahead of another write transaction that was received earlier in time. Also, when receiving write requests (with RO set), the Root Complex is required to write the data payload to the specified address location within system memory, but is permitted to write each byte to memory in any address order.

RO Effects on Memory Read Transactions

All read transactions in PCI Express are handled as split transactions. When a device issues a memory read request with the RO bit set, the request may traverse one or more switches on its journey to the Completer. The Completer returns the requested read data in a series of one or more split completion transactions, and uses the same RO setting as in the request. Switch behavior for this example is as follows:
  1. A switch that receives a memory read request with the RO bit set must forward the request in the order received, and must not reorder it ahead of memory write transactions that were previously posted. This action guarantees that all write transactions moving in the direction of the read request are pushed ahead of the read. Such actions are not necessarily part of the Producer/Consumer programming sequence, but software may depend on this flushing action taking place. Also, the RO bit must not be modified by the switch.
  2. When the Completer receives the memory read request, it fetches the requested read data and delivers a series of one or more memory read Completion transactions with the RO bit set (because it was set in the request).
  3. A switch receiving the memory read Completion(s) detects the RO bit set and knows that it is allowed to order the read Completion(s) ahead of previously posted memory writes moving in the direction of the Completion. If a memory write transaction were blocked (due to flow control), then the memory read Completion would also be blocked if the RO bit were not set. Relaxed ordering in this case improves read performance.
Table 8-1 summarizes the relaxed ordering behavior allowed by switches.
Table 8-1: Transactions That Can Be Reordered Due to Relaxed Ordering
These Transactions with RO=1          Can Pass These Transactions
Memory Write Request                  Memory Write Request
Message Request                       Memory Write Request
Memory Write Request                  Message Request
Message Request                       Message Request
Read Completion                       Memory Write Request
Read Completion                       Message Request

Summary of Strong Ordering Rules

The PCI Express specification defines strong ordering rules associated with transactions that are assigned the same TC value, and further defines a Relaxed Ordering attribute that can be used when a device knows that a transaction has no ordering relationship to other transactions with the same TC value. Table 8-2 on page 322 summarizes the PCI Express ordering rules that satisfy the Producer/Consumer model and also provides for Relaxed Ordering. The table represents a draconian approach to ordering and does not consider issues such as performance and deadlock prevention.
The table applies to transactions with the same TC assignment that are moving in the same direction. These rules ensure that transactions complete in the intended program order and eliminate the possibility of deadlocks in a pure PCI Express implementation (i.e., systems with no PCI bridges). Columns 2 - 6 represent transactions that have been previously latched by a PCI Express device, while column 1 represents subsequently latched transactions. The ordering relationship between the transaction in column 1 and the transactions previously enqueued is expressed in the table on a row-by-row basis. Note that these rules apply uniformly to all transaction types (Memory, Messages, IO, and Configuration). The table entries are defined as follows:
No - The transaction in column 1 must not be permitted to proceed ahead of the previously enqueued transaction in the corresponding columns (2-6).
Y/N (Yes/No) - The transaction in column 1 is allowed to proceed ahead of the previously enqueued transaction because its Relaxed Ordering bit is set (1), but it is not required to do so.


Table 8-2: Fundamental Ordering Rules Based on Strong Ordering and RO Attribute
Row Pass Column?  The columns represent previously enqueued transactions:
  Col 2 = Memory Write or Message Request (Posted); Col 3 = Read Request (Non-Posted);
  Col 4 = I/O or Configuration Write Request (Non-Posted); Col 5 = Read Completion;
  Col 6 = I/O or Configuration Write Completion
Row A (Posted) - Memory Write or Message Request:           Col 2: a) No, b) Y/N   Col 3: No   Col 4: No   Col 5: No   Col 6: No
Row B (Non-Posted) - Read Request:                          Col 2: No              Col 3: No   Col 4: No   Col 5: No   Col 6: No
Row C (Non-Posted) - I/O or Configuration Write Request:    Col 2: No              Col 3: No   Col 4: No   Col 5: No   Col 6: No
Row D (Completion) - Read Completion:                       Col 2: a) No, b) Y/N   Col 3: No   Col 4: No   Col 5: No   Col 6: No
Row E (Completion) - I/O or Configuration Write Completion: Col 2: No              Col 3: No   Col 4: No   Col 5: No   Col 6: No
Note that the column 2 entries represent the ordering requirements that ensure the Producer/Consumer model functions correctly and are consistent with the basic rules associated with strong ordering. The transaction ordering associated with columns 3 - 6 plays no role in the Producer/Consumer model.

Modified Ordering Rules Improve Performance

This section describes how temporary transaction blocking can occur when the strong ordering rules listed in Table 8-2 are rigorously enforced. Modifying strong ordering in ways that do not violate the Producer/Consumer programming model can eliminate many blocking conditions and improve link efficiency.

Strong Ordering Can Result in Transaction Blocking

Maintaining a strong ordering relationship between all transactions can result in instances where every transaction is blocked because a single receive buffer is full. The strong ordering requirements that support the Producer/Consumer model cannot be modified (except in the case of relaxed ordering described previously). However, transaction sequences that do not occur within the Producer/Consumer programming model can be handled with a weakly ordered scheme that leads to improved performance.

The Problem

Consider the following example, illustrated in Figure 8-1 on page 323, in which strong ordering is maintained for all transaction sequences. The example depicts transmitter and receiver buffers associated with the delivery of transactions in a single direction (from left to right) for a single Virtual Channel (VC); the transmit and receive buffers are organized in the same way. Also, recall that each of the transaction types (Posted, Non-Posted, and Completions) has independent flow control within the same VC. The numbers within the transmit buffers show the order in which these transactions were issued to the transmitter. In addition, the non-posted receive buffer is currently full. Consider the following sequence.
  1. Transaction 1 (a memory read, which is a non-posted operation) is the next transaction that must be sent (based on strong ordering). The flow control mechanism detects that insufficient credits are available, so Transaction 1 cannot be sent.
  2. Transaction 2 (a posted memory write) is the next transaction pending. When consulting Table 8-2 (based on strong ordering), entry A3 specifies that a memory write must not pass a previously enqueued read request.
  3. Because all entries in Table 8-2 are "No", all transactions are blocked as long as the non-posted receive buffer remains full.
Figure 8-1: Example of Strongly Ordered Transactions that Results in Temporary Blocking

The Weakly Ordered Solution

As discussed previously, strong ordering is required to support the Producer/Consumer model. This requirement is satisfied entirely by the column 2 entries in Table 8-2. The remaining columns deal with transaction sequences that do not occur in the Producer/Consumer programming model and therefore can be modified. Table 8-3 on page 326 lists these entries as weakly ordered. The modified entries are defined as:
Y/N (Yes/No) - The transaction in column 1 is allowed to proceed ahead of the previously enqueued transaction because the entry is not related to the Producer/Consumer strong ordering requirements and can be weakly ordered to improve performance.
Consider the scenario in Figure 8-1 with weak ordering employed:
  1. Transaction 1 (a memory read, which is a non-posted operation) is the next transaction that must be sent. The flow control mechanism detects that insufficient credits are available, so Transaction 1 cannot be sent.
  2. The next transaction pending (2) is a posted memory write operation. When consulting Table 8-3 on page 326, entry A3 (Y/N) allows the transmitter to reorder transaction 2 ahead of transaction 1. No blocking occurs!
  3. The remaining transactions pending will also complete ahead of transaction 1 if the non-posted buffer remains full. When flow control credits are returned for the non-posted operations, transaction 1 will be the next transaction sent.
In summary, these examples illustrate how strong ordering can temporarily block all transactions pending delivery, and that weak ordering rules can be used to improve link efficiency without violating the Producer/Consumer model.
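A minimal sketch of the transmit-side decision just described follows: the arbiter walks the pending queue in issue order and sends the first TLP that both has flow-control credits and is allowed, per the weak ordering table, to pass everything still queued ahead of it. The queue structure and the may_pass()/credits_available() helpers are hypothetical.

```c
/* Illustrative sketch only: a transmitter applying weak ordering
 * (Table 8-3) so a posted write can pass a credit-starved read. */
typedef enum { POSTED, NON_POSTED, COMPLETION } tlp_class_t;

typedef struct tlp {
    tlp_class_t cls;
    struct tlp *next;
} tlp_t;

extern int  credits_available(tlp_class_t cls);              /* flow-control check  */
extern int  may_pass(tlp_class_t later, tlp_class_t earlier); /* Table 8-3 lookup   */
extern void transmit(tlp_t *t);

void arbitrate(tlp_t *pending)
{
    for (tlp_t *t = pending; t != NULL; t = t->next) {
        if (!credits_available(t->cls))
            continue;                       /* this TLP is itself blocked */
        int ok = 1;                         /* may t pass every earlier TLP? */
        for (tlp_t *e = pending; e != t; e = e->next)
            if (!may_pass(t->cls, e->cls)) { ok = 0; break; }
        if (ok) {
            transmit(t);                    /* e.g., posted write passes a
                                               stalled non-posted read (A3) */
            return;
        }
    }
}
```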

Order Management Accomplished with VC Buffers

As the previous example illustrated, transaction ordering is managed within the Virtual Channel buffers. These buffers are grouped into Posted, Non-Posted, and Completion transactions and flow control is managed independently for each group. This makes it much easier to implement the modified (weak) ordering described in the previous example. See Chapter 7, entitled "Flow Control," on page 285 for details.
Recall that transactions are mapped to Virtual Channels using the transaction's TC. If each TC is mapped to a separate VC, then each VC buffer will contain transactions with a single TC assignment. In this situation, VC flow control permits optimum flow of transactions.

Summary of Modified Ordering Rules

Table 8-3 on page 326 lists and highlights the modified ordering rules that allow switches to move some transactions ahead of others that may be stalled due to a receive buffer full condition. The definitions of the entries are the same as in the previous table:
No-The transaction in column 1 must not be permitted to proceed ahead of the previously enqueued transaction in the corresponding columns (2-6).
Y/N (Yes/No)-The transaction in column 1 is allowed to proceed ahead of the previously enqueued transaction because:
  • its Relaxed Ordering bit is set (1), but it is not required to do so.
  • the entry is not subject to the Producer/Consumer strong ordering requirements and is weakly ordered to improve performance.
Note: The "No" entry in Row D, Column 5 (D5) applies to cases where a Completer returns multiple Completions in response to a single read request. These Completions must return in order (i.e., Completions with the same transaction ID).
Upon examination of the table, some readers may question whether programs will operate correctly when weak ordering is employed. For example, note that a write transaction is permitted to be reordered ahead of a previously latched read request (entry A3). A programmer who reads from a location and then writes to the same location must not expect these operations to complete in program order. Note that the ordering rules only guarantee proper operation of the Producer/Consumer programming model. If a programmer requires a read operation to complete ahead of a write transaction, then the write must not be issued until the read transaction completes.


Table 8-3: Weak Ordering Rules Enhance Performance
Row Pass Column?  The columns represent previously enqueued transactions:
  Col 2 = Memory Write or Message Request (Posted); Col 3 = Read Request (Non-Posted);
  Col 4 = I/O or Configuration Write Request (Non-Posted); Col 5 = Read Completion;
  Col 6 = I/O or Configuration Write Completion
Row A (Posted) - Memory Write or Message Request:           Col 2: a) No, b) Y/N   Col 3: Y/N   Col 4: Y/N   Col 5: Y/N             Col 6: Y/N
Row B (Non-Posted) - Read Request:                          Col 2: No              Col 3: Y/N   Col 4: Y/N   Col 5: Y/N             Col 6: Y/N
Row C (Non-Posted) - I/O or Configuration Write Request:    Col 2: No              Col 3: Y/N   Col 4: Y/N   Col 5: Y/N             Col 6: Y/N
Row D (Completion) - Read Completion:                       Col 2: a) No, b) Y/N   Col 3: Y/N   Col 4: Y/N   Col 5: a) Y/N, b) No   Col 6: Y/N
Row E (Completion) - I/O or Configuration Write Completion: Col 2: No              Col 3: Y/N   Col 4: Y/N   Col 5: Y/N             Col 6: Y/N

Support for PCI Buses and Deadlock Avoidance

Because the PCI bus employs delayed transactions, several deadlock scenarios can develop. Deadlock avoidance rules are therefore included in PCI Express ordering to ensure that no deadlocks occur regardless of topology. Adhering to the ordering rules prevents problems when boundary conditions develop due to unanticipated topologies (e.g., two PCI Express-to-PCI bridges connected across the PCI Express fabric). Refer to the MindShare book entitled PCI System Architecture, Fourth Edition (published by Addison-Wesley) for a detailed explanation of the scenarios that are the basis for the PCI ordering rules related to deadlock avoidance. Table 8-4 on page 327 lists and highlights the deadlock avoidance ordering rules. Note that avoiding the deadlocks involves the "Yes" entries in each case: if blocking occurs, the transaction in column 1 must be moved ahead of the transaction specified in the column where the "Yes" entry exists. Note also that the "Yes" entries in A5b and A6b apply only to PCI Express to PCI Bridges and PCI Express to PCI-X Bridges.
Table 8-4: Ordering Rules with Deadlock Avoidance Rules
Row Pass Column?  The columns represent previously enqueued transactions:
  Col 2 = Memory Write or Message Request (Posted); Col 3 = Read Request (Non-Posted);
  Col 4 = I/O or Configuration Write Request (Non-Posted); Col 5 = Read Completion;
  Col 6 = I/O or Configuration Write Completion
Row A (Posted) - Memory Write or Message Request:           Col 2: a) No, b) Y/N   Col 3: Yes   Col 4: Yes   Col 5: a) Y/N, b) Yes   Col 6: a) Y/N, b) Yes
Row B (Non-Posted) - Read Request:                          Col 2: No              Col 3: Y/N   Col 4: Y/N   Col 5: Y/N              Col 6: Y/N
Row C (Non-Posted) - I/O or Configuration Write Request:    Col 2: No              Col 3: Y/N   Col 4: Y/N   Col 5: Y/N              Col 6: Y/N
Row D (Completion) - Read Completion:                       Col 2: a) No, b) Y/N   Col 3: Yes   Col 4: Yes   Col 5: a) Y/N, b) No    Col 6: Y/N
Row E (Completion) - I/O or Configuration Write Completion: Col 2: Y/N             Col 3: Yes   Col 4: Yes   Col 5: Y/N              Col 6: Y/N
The specification provides the following explanation of the table entries:
  • A2a - A Memory Write or Message Request with the Relaxed Ordering Attribute bit clear (0b) must not pass any other Memory Write or Message Request.
  • A2b - A Memory Write or Message Request with the Relaxed Ordering Attribute bit set (1b) is permitted to pass any other Memory Write or Message Request.
  • A3, A4 - A Memory Write or Message Request must be allowed to pass Read Requests and I/O or Configuration Write Requests to avoid deadlocks.
  • A5a, A6a - Endpoints, Switches, and Root Complexes may either allow Memory Write and Message Requests to pass Completions or to be blocked by Completions.
  • A5b, A6b - PCI Express to PCI Bridges and PCI Express to PCI-X Bridges (when operating in PCI mode) must allow Memory Write and Message Requests to pass Completions traveling in the PCI Express to PCI direction (Primary side of Bridge to Secondary side of Bridge) to avoid deadlock.
  • B2, C2 - These Requests cannot pass a Memory Write or Message Request. This preserves the strong write ordering required to support the Producer/Consumer model.
  • B3, B4, C3, C4 - Read Requests and I/O or Configuration Write Requests are permitted to be blocked by or to pass other Read Requests and I/O or Configuration Write Requests.
  • B5, B6, C5, C6 - The Requests specified are permitted to be blocked by or to pass Completions.
  • D2a - If the Relaxed Ordering attribute bit is not set, then a Read Completion cannot pass a previously enqueued Memory Write or Message Request.
  • D2b - If the Relaxed Ordering attribute bit is set, then a Read Completion is permitted to pass a previously enqueued Memory Write or Message Request.
  • D3, D4, E3, E4 - Completions must be allowed to pass Read and I/O or Configuration Write Requests to avoid deadlocks.
  • D5a - Read Completions associated with different Read Requests are allowed to be blocked by or to pass each other.
  • D5b - When multiple completions are returned in response to a single Read Request, the completions must return the requested read data in the proper address order. Note that the data returned in each completion is delivered in ascending address order. Switches can recognize this condition because each completion will have the same Transaction ID. Completions with different transaction IDs can be reordered without concern.
  • D6 - Read Completions are permitted to be blocked by or to pass I/O or Configuration Write Completions.
  • E2 - I/O or Configuration Write Completions are permitted to be blocked by or to pass Memory Write and Message Requests. Such transactions are actually moving in the opposite direction, and have no ordering relationship.
  • E5, E6 - I/O or Configuration Write Completions are permitted to be blocked by or to pass Read Completions and other I/O or Configuration Write Completions.
The specification also states the following additional rules:
  • For Root Complex and Switch, Memory Write combining (as defined in the PCI Specification) is prohibited. Note: This is required so that devices can be permitted to optimize their receive buffer and control logic for Memory Write sizes matching their natural expected sizes, rather than being required to support the maximum possible Memory Write payload size.
  • Combining of Memory Read Requests, and/or Completions for different Requests is prohibited.
  • The No Snoop bit does not affect the required ordering behavior.
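For readers who prefer code to tables, the following sketch encodes the Table 8-4 entries as a lookup array that ordering logic could consult. The a)/b) sub-cases (Relaxed Ordering attribute set, or same Transaction ID for D5) are assumed to be resolved by the caller; each cell holds the default, attribute-clear value. The names are illustrative only.

```c
/* Sketch: Table 8-4 reduced to a compact lookup. Purely illustrative. */
typedef enum { PASS_NO, PASS_YES, PASS_YN } pass_t;

enum { ROW_PW = 0, ROW_RD, ROW_IOW, ROW_RDCPL, ROW_IOWCPL, NROWS };
enum { COL_PW = 0, COL_RD, COL_IOW, COL_RDCPL, COL_IOWCPL, NCOLS };

static const pass_t ordering[NROWS][NCOLS] = {
    /*              Col2:PW   Col3:Rd   Col4:IOWr Col5:RdCpl Col6:IOCpl */
    /* A: PW    */ { PASS_NO,  PASS_YES, PASS_YES, PASS_YN,  PASS_YN  },
    /* B: Rd    */ { PASS_NO,  PASS_YN,  PASS_YN,  PASS_YN,  PASS_YN  },
    /* C: IOWr  */ { PASS_NO,  PASS_YN,  PASS_YN,  PASS_YN,  PASS_YN  },
    /* D: RdCpl */ { PASS_NO,  PASS_YES, PASS_YES, PASS_YN,  PASS_YN  },
    /* E: IOCpl */ { PASS_YN,  PASS_YES, PASS_YES, PASS_YN,  PASS_YN  },
};

/* May a later transaction of class 'row' pass an earlier one of class 'col'? */
static pass_t may_pass(int row, int col)
{
    return ordering[row][col];
}
```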

9 Interrupts

The Previous Chapter

This chapter discusses the ordering requirements for PCI Express devices as well as PCI and PCI-X devices that may be attached to a PCI Express fabric. The discussion describes the Producer/Consumer programming model upon which the fundamental ordering rules are based. It also describes the potential performance problems that can emerge when strong ordering is employed and specifies the rules defined for deadlock avoidance.

This Chapter

Native PCI Express devices that require interrupt support must use the Message Signaled Interrupt (MSI) mechanism defined originally in the PCI 2.2 version of the specification. This chapter details the MSI mechanism and also describes the legacy support that permits virtualization of the PCI INTx signals required by devices such as PCI Express-to-PCI Bridges.

The Next Chapter

To this point it has been presumed that transactions traversing the fabric have not encountered any errors that cannot be corrected by hardware. The next chapter discusses both correctable and non-correctable errors and discusses the mechanisms used to report them. The PCI Express architecture provides a rich set of error detection, reporting, and logging capabilities. PCI Express error reporting classifies errors into three classes: correctable, non-fatal, and fatal. Prior to discussing the PCI Express error reporting capabilities, including PCI-compatible mechanisms, a brief review of the PCI error handling is included as background information.

Two Methods of Interrupt Delivery

Interrupt delivery is conditionally optional for PCI Express devices: when a native PCI Express function depends upon interrupts to invoke its device driver, Message Signaled Interrupts (MSI) must be used. However, in the event that a device connecting to a PCI Express link cannot use MSIs (i.e., legacy devices), an alternate mechanism is defined. Both mechanisms are summarized below:
Native PCI Express Interrupt Delivery - PCI Express eliminates the need for sideband interrupt signals by using the Message Signaled Interrupt (MSI), first defined by the 2.2 version of the PCI Specification (as an optional mechanism) and later required by PCI-X devices. The term "Message Signaled Interrupt" can be misleading in the context of PCI Express because of possible confusion with PCI Express's "Message" transactions. A Message Signaled Interrupt is not a PCI Express Message; instead, it is simply a Memory Write transaction. Memory writes associated with MSIs can be distinguished from other memory writes only by the address locations they target, which are reserved by the system for interrupt delivery.
Legacy PCI Interrupt Delivery - This mechanism supports devices that must use PCI-compatible interrupt signaling (i.e., INTA#, INTB#, INTC#, and INTD#) defined for the PCI bus. Legacy functions use one of the interrupt lines to signal an interrupt. An INTx# signal is asserted to request interrupt service and deasserted when the interrupt service routine accesses a device-specific register, thereby indicating the interrupt is being serviced. PCI Express defines in-band messages that act as virtual INTx# wires, which target the interrupt controller, typically located within the Root Complex.
Figure 9-1 illustrates the delivery of interrupts from three types of devices:
  • Native PCI Express device - must use MSI delivery
  • Legacy endpoint device - must support MSI and may optionally support INTx messages. Such devices may be boot devices that must use legacy interrupts during boot, but once their drivers load, MSIs are used.
  • PCI Express-to-PCI (X) Bridge - must support INTx messages
Figure 9-1: Native PCI Express and Legacy PCI Interrupt Delivery

Message Signaled Interrupts

Message Signaled Interrupts (MSIs) are delivered to the Root Complex via memory write transactions. The MSI Capability register provides all the information that the device requires to signal MSIs. This register is set up by configuration software and includes the following information:
  • Target memory address
  • Data Value to be written to the specified address location
  • The number of messages that can be encoded into the data


See "Description of 3DW And 4DW Memory Request Header Fields" on page 176 for a review of the Memory Write Transaction Header. Note that MSIs always have a data payload of 1DW.

The MSI Capability Register Set

A PCI Express function indicates its support for MSI via the MSI Capability registers. Each native PCI Express function must implement a single MSI register set within its own configuration space. Note that the PCI Express specification defines two register formats:
  1. 64-bit memory addressing format (Figure 9-2 on page 332) - required by all native PCI Express devices and optionally implemented by Legacy endpoints.
  2. 32-bit memory addressing format (Figure 9-3 on page 332) - optionally supported by Legacy endpoints.
Figure 9-2: 64-bit MSI Capability Register Format
Dword 0: bits 31:16 = Message Control Register; bits 15:8 = Pointer to Next ID; bits 7:0 = Capability ID = 05h
Dword 1: Least-Significant 32 bits of Message Address Register (bits 1:0 hardwired to 00)
Dword 2: Most-Significant 32 bits of Message Address Register
Dword 3: bits 15:0 = Message Data Register
Figure 9-3: 32-bit MSI Capability Register Set Format
The following sections describe each field within the MSI registers.
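Before walking through the individual fields, here is a hedged C rendering of the 64-bit register set from Figure 9-2 as a packed structure. The type and field names are illustrative assumptions; the offsets follow the little-endian layout shown in the figure.

```c
/* Hedged sketch of the 64-bit MSI Capability register set layout
 * (offsets are relative to the capability's base in configuration space). */
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t  cap_id;        /* 05h identifies the MSI capability              */
    uint8_t  next_ptr;      /* pointer to next capability, 00h if last        */
    uint16_t msg_control;   /* Message Control register                       */
    uint32_t msg_addr_lo;   /* lower 32 bits of Message Address (bits 1:0 = 0)*/
    uint32_t msg_addr_hi;   /* upper 32 bits (64-bit format only)             */
    uint16_t msg_data;      /* Message Data register                          */
} msi_cap64_t;
#pragma pack(pop)
```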

Capability ID

The Capability ID that identifies the MSI register set is 05h. This is a hardwired, read-only value.

Pointer To Next New Capability

The second byte of the register set either points to the next New Capability's register set or contains 00h if this is the end of the New Capabilities list. This is a hardwired, read-only value. If non-zero, it must be a dword-aligned value.

Message Control Register

Figure 9-4 on page 333 and Table 9-1 on page 333 illustrate the layout and usage of the Message Control register.
Figure 9-4: Message Control Register
Table 9-1: Format and Usage of Message Control Register
Bits 15:8 - Reserved. Read-only; always zero.
Bit 7 - 64-bit Address Capable. Read-only.
  0 = Function does not implement the upper 32 bits of the Message Address register and is incapable of generating a 64-bit memory address.
  1 = Function implements the upper 32 bits of the Message Address register and is capable of generating a 64-bit memory address.
Bits 6:4 - Multiple Message Enable. Read/Write. After system software reads the Multiple Message Capable field (see the next entry) to determine how many messages are requested by the device, it programs a 3-bit value into this field indicating the actual number of messages allocated to the device. The number allocated can be equal to or less than the number actually requested. The state of this field after reset is 000b. The field is encoded as follows:
  000b = 1, 001b = 2, 010b = 4, 011b = 8, 100b = 16, 101b = 32, 110b = Reserved, 111b = Reserved
Bits 3:1 - Multiple Message Capable. Read-only. System software reads this field to determine how many messages the device would like allocated to it. The requested number of messages is a power of two; therefore, a device that would like three messages must request that four messages be allocated to it. The field uses the same encoding as Multiple Message Enable:
  000b = 1, 001b = 2, 010b = 4, 011b = 8, 100b = 16, 101b = 32, 110b = Reserved, 111b = Reserved
Bit 0 - MSI Enable. Read/Write. The state after reset is 0, indicating that the device's MSI capability is disabled.
  0 = Function is disabled from using MSI. It must use INTx messages to deliver interrupts (legacy endpoint or bridge).
  1 = Function is enabled to use MSI to request service and is forbidden to use its interrupt pin.

Message Address Register

The lower two bits of the 32-bit Message Address register are hardwired to zero and cannot be changed. In other words, the address assigned by system software is always aligned on a dword address boundary.
The upper 32 bits of the Message Address register are required for native PCI Express devices and optional for legacy endpoints. This upper half is present if bit 7 of the Message Control register is set. If present, it is a read/write register and is used in conjunction with the lower half of the Message Address register to assign either a 32-bit or a 64-bit memory address to the device:
  • If the upper 32-bits of the Message Address register are set to a non-zero value by the system software, then a 64-bit message address has been assigned to the device using both the upper and lower halves of the register.
  • If the upper 32-bits of the Message Address register are set to zero by the system software, then a 32-bit message address has been assigned to the device using only the lower half of the register.

Message Data Register

The system software assigns the device a base message data pattern by writing it into this 16-bit, read/write register. When the device must generate an interrupt request, it writes a 32-bit value to the memory address specified in the Message Address register. The data written has the following format:
  • The upper 16 bits are always set to zero.
  • The lower 16 bits are supplied from the Message Data register. If more than one message has been assigned to the device, the device modifies the lower bits (the number of modifiable bits depends on how many messages have been assigned to the device by the configuration software) of the data from the Message Data register to form the appropriate message for the event it wishes to report to its driver. For an example, refer to "Basics of Generating an MSI Interrupt Request" on page 338.

Basics of MSI Configuration

The following list specifies the steps taken by software to configure MSI interrupts for a PCI Express device. Refer to Figure 9-5 on page 337.
  1. At startup time, the configuration software scans the PCI bus(es) (referred to as bus enumeration) and discovers devices (i.e., it performs configuration reads for valid Vendor IDs). When a PCI Express function is discovered, the configuration software reads the Capabilities List Pointer to obtain the location of the first Capability register within the chain of registers.
  2. The software then searches the capability register sets until it discovers the MSI Capability register set (Capability ID of 05h).
  3. Software assigns a dword-aligned memory address to the device's Message Address register. This is the destination address of the memory write used when delivering an interrupt request.
  4. Software checks the Multiple Message Capable field in the device's Message Control register to determine how many event-specific messages the device would like assigned to it.
  5. The software then allocates a number of messages equal to or less than what the device requested. At a minimum, one message will be allocated to the device.
  6. The software writes the base message data pattern into the device's Message Data register.
  7. Finally, the software sets the MSI Enable bit in the device's Message Control register, thereby enabling it to generate interrupts using MSI memory writes.
Figure 9-5: Device MSI Configuration Process
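A compressed C sketch of steps 1 through 7 follows. The find_capability() and cfg_read/cfg_write helpers are hypothetical stand-ins for the platform's configuration-access mechanism, and the register offsets assume the 64-bit MSI capability format.

```c
/* Illustrative sketch of the MSI configuration steps above. */
#include <stdint.h>

#define MSI_CAP_ID         0x05
#define MSI_CTL_ENABLE     (1u << 0)
#define MSI_CTL_MME_SHIFT  4          /* Multiple Message Enable field */

extern uint8_t  find_capability(uint8_t cap_id);      /* hypothetical helpers */
extern uint16_t cfg_read16(uint16_t off);
extern void     cfg_write16(uint16_t off, uint16_t v);
extern void     cfg_write32(uint16_t off, uint32_t v);

void configure_msi(uint64_t msg_addr, uint16_t msg_data, uint8_t alloc_log2)
{
    uint8_t  cap = find_capability(MSI_CAP_ID);        /* steps 1-2 */
    uint16_t ctl = cfg_read16(cap + 2);

    cfg_write32(cap + 4, (uint32_t)msg_addr);           /* step 3: address lo */
    cfg_write32(cap + 8, (uint32_t)(msg_addr >> 32));   /* address hi (64-bit) */
    cfg_write16(cap + 12, msg_data);                    /* step 6: base data  */

    ctl &= (uint16_t)~(7u << MSI_CTL_MME_SHIFT);        /* steps 4-5: allocate */
    ctl |= (uint16_t)(alloc_log2 << MSI_CTL_MME_SHIFT);
    ctl |= MSI_CTL_ENABLE;                              /* step 7: enable MSI  */
    cfg_write16(cap + 2, ctl);
}
```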

Basics of Generating an MSI Interrupt Request

When a PCI Express function generates an interrupt request to the processor, it performs a memory write transaction. The associated data is platform specific, is always 1DW in size, and is written to a pre-defined memory address location. As described earlier, the configuration software is responsible for priming the function's MSI Address and Data registers with the appropriate memory address and the data to be written to that address when generating a request. It also primes a field in the Message Control register with the number of messages that have been allocated to the device.

Memory Write Transaction (MSI)

When the device must generate an interrupt request, it writes the Message Data register contents to the memory address specified in its Message Address register. Figure 9-6 on page 339 illustrates the contents of the Memory Write Transaction Header and Data field. Key points include:
  • Format field must be 11b for native functions, indicating a 4DW header with data; Legacy Endpoints may use 10b (a 3DW header with data).
  • Header Attribute bits (No Snoop and Relaxed Ordering) must be zero.
  • Length field must be 01h, indicating a maximum data payload of 1DW.
  • First BE field must be 0011b, indicating valid data in the lower 16 bits.
  • Last BE field must be 0000b, indicating a single-DW transaction.
  • Address fields within the header come directly from the address fields within the MSI Capability registers.
  • Lower 16 bits of the Data payload come directly from the data field within the MSI Capability registers.
Figure 9-6: Format of Memory Write Transaction for Native-Device MSI Delivery
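The header settings in the list above can be captured in a small template, sketched here with a simplified, hypothetical tlp_hdr_t layout; a real implementation would pack these fields into the DW0-DW3 header format.

```c
/* Sketch of the MSI memory write header fields for a native function. */
#include <stdint.h>

typedef struct {
    uint8_t  fmt;        /* 0b11: 4DW header with data (native function)  */
    uint8_t  attr;       /* Relaxed Ordering and No Snoop must be 0       */
    uint16_t length;     /* 1 DW of data payload                          */
    uint8_t  first_be;   /* 0b0011: lower 16 bits of the DW are valid     */
    uint8_t  last_be;    /* 0b0000: single-DW transaction                 */
    uint64_t address;    /* from the MSI Message Address registers        */
} tlp_hdr_t;

static const tlp_hdr_t msi_hdr_template = {
    .fmt = 0x3, .attr = 0x0, .length = 1,
    .first_be = 0x3, .last_be = 0x0, .address = 0 /* filled at runtime */
};
```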

Multiple Messages

If the system software allocated more than one message to the device, it is permitted to modify the lower bits of the assigned Message Data value to send a different message for each device-specific event type that requires servicing by the device driver.


As an example, assume the following:
  • Four messages have been allocated to a device.
  • A data value of 0500h has been assigned to the device's Message Data register.
  • Memory address 0A000000h has been written into the device's Message Address register.
When any one of four different device-specific events occurs, the device generates a request by performing a dword write to memory address 0A000000h with a data value of 00000500h, 00000501h, 00000502h, or 00000503h. In other words, the device automatically appends the value 0000h to the upper part of its assigned message data value (to make a 32-bit value) and modifies the lower two bits of the value to indicate the specific message type.
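The arithmetic in this example can be expressed as a one-line helper; the function name is illustrative, and it assumes the allocated message count is a power of two, as the specification requires.

```c
/* Worked example of the message-data arithmetic above: base value 0500h,
 * four messages allocated, event index 0-3 selects the message. */
#include <stdint.h>

static uint32_t msi_data(uint16_t base, unsigned event, unsigned num_alloc)
{
    unsigned mask = num_alloc - 1;               /* num_alloc is a power of 2 */
    return (uint32_t)((base & ~mask) | (event & mask)); /* upper 16 bits = 0  */
}

/* msi_data(0x0500, 2, 4) == 0x00000502, written to address 0x0A000000 */
```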

Memory Synchronization When Interrupt Handler Entered

The Problem

Assume that a PCI Express device performs one or more memory write transactions to deliver data (application data) into main memory, followed by an MSI (which notifies software that new application data has been moved to memory). Also assume the following:
  • Application data transactions have a Traffic Class of Zero (TC0) and will always flow through VC0 buffers.
  • The MSI transaction uses TC1, and it flows through the VC1 buffers.
  • These transactions traverse one or more switches on their way to the Root Complex and memory.
  • VC arbitration is set up so that VC1 transactions have a much higher priority than VC0 transactions.
Flow Control and VC arbitration associated with the delivery of the data and the MSI may result in the MSI transaction being moved ahead of the application data transactions based on the goals of differentiated services. This is possible because there is no ordering relationship maintained between transactions that have different TC values and VC assignments. Consequently, the MSI may arrive at the Root Complex well ahead of the corresponding application data.
When the CPU is interrupted by the MSI, the currently-executing program is suspended and the processor executes the interrupt handler within the
Requester's device driver. The driver may immediately read data from the target memory buffer in main memory. If some of the application data transactions are still making their way upstream, the driver will fetch and process old data.

Solving the Problem

The problem can be solved in two ways:
  1. Ensure that the TC numbers of the Memory Write data and the MSI are the same. The MSI must also have its relaxed ordering bit cleared.
  2. The driver can solve this problem by performing a dummy read (Memory Read Dword with all Byte Enables deasserted) from a location within its device before processing the data. The read must also have the same TC number as the Memory Write data. The read completion returned to the Root Complex will travel in the same VC as the Memory Write data, thereby ensuring that the write data will be pushed ahead of the read completion and into memory prior to the completion being received by the driver. Recall that the ordering rules require that all transactions with the same TC must be performed in order. The only exception is a transaction with the relaxed ordering bit set.
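A hedged sketch of option 2 as it might appear in an interrupt handler follows; dev_regs and process_buffer() are hypothetical, and the read simply targets any register in the device so that its completion flows through the same TC/VC as the posted data writes.

```c
/* Sketch of the driver-side dummy read described in option 2. */
#include <stdint.h>

extern volatile uint32_t *dev_regs;   /* device's memory-mapped registers  */
extern void process_buffer(void *buf);

void msi_interrupt_handler(void *dma_buf)
{
    /* Read any device register over the same TC/VC as the data writes.
     * The completion cannot pass the posted writes, so by the time it
     * arrives the application data is guaranteed to be in memory.       */
    (void)dev_regs[0];
    process_buffer(dma_buf);
}
```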

Interrupt Latency

The time from the signaling of an interrupt request until software services the device is referred to as interrupt latency. As with the other interrupt request delivery mechanisms, the MSI capability does not provide interrupt latency guarantees.

MSI Results In ECRC Error

Because MSIs are delivered as Memory Write transactions, an error associated with delivery of an MSI is treated the same as any other Memory Write error condition. See "ECRC Generation and Checking" on page 361 for treatment of ECRC errors.

Some Rules, Recommendations, etc.

  1. It is the specification's intention that mutually-exclusive messages will be assigned to devices by the system software and that each message will be converted to an exclusive interrupt level upon delivery to the processor.


  2. More than one MSI capability register set per function is prohibited.
  3. A read from the Message Address register produces undefined results.
  4. Reserved registers and bits are read-only and always return zero when read.
  5. System software can modify Message Control register bits, but the device is prohibited from doing so. In other words, the device is not permitted to modify the bits via the "back door."
  6. At a minimum, a single message will be assigned to each device.
  7. System software must not write to the upper half of the dword that contains the Message Data register.
  8. If the device writes the same message multiple times, only one of those messages is guaranteed to be serviced. If all of them must be serviced, the device must not generate the same message again until the driver services the earlier one.
  9. If a device has more than one message assigned, and it writes a series of different messages, it is guaranteed that all of them will be serviced.

Legacy PCI Interrupt Delivery

This section provides background information regarding the standard PCI interrupt delivery using INTx signals. This is followed by a detailed discussion of how PCI Express supports virtual INTx signaling. Readers familiar with PCI interrupt handling may wish to proceed to "Virtual INTx Signaling" on page 347.

Background PCI Interrupt Signaling

PCI devices that use interrupts have two options:
  1. INTx# active-low, level-sensitive signals that can be shared. These signals were defined in the original specification.
  2. Message Signaled Interrupts, introduced with the 2.2 version of the specification, which are optional for PCI devices. These MSIs are compatible with PCI Express and require no modification by devices or by PCI Express-to-PCI Bridges.

Device INTx# Pins

Each physical PCI component can implement up to four INTx# signals (INTA#, INTB#, INTC#, and INTD#). Although a PCI device (like a PCI Express device) can contain up to eight functions, only four interrupt pins are available; if all eight functions are implemented and all require interrupts, the INTx# signals must be shared. Also, no single function is permitted to use more than one INTx# signal.

Determining if a Function Uses INTx# Pins

Each PCI function indicates support for an INTx# signal via the standard configuration header. The read-only Interrupt Pin register, illustrated in Figure 9-7 on page 343, contains the information needed by configuration software to determine whether INTx# signals are supported and, if so, which INTx# signal is used by this function.
Figure 9-7: Interrupt Pin Register within PCI Configuration Header


Interrupt Routing

The system designer determines the routing of the INTx# pins from devices. The INTx# signals used by each device can be routed in a variety of ways so that ultimately each INTx# pin goes to an input of the interrupt controller. Figure 9-8 on page 344 illustrates a variety of PCI devices using INTx# pins to signal interrupts. As is typical, all PCI INTx# signals are routed to one of four inputs. All INTx# signals routed to a given input are directed to a specific input of the interrupt controller; thus, each INTx# line routed to a common interrupt input is also assigned the same Interrupt Line number by platform software. For example, IRQ15 has three PCI INTx# inputs from different devices (INTB#, INTA#, and INTA#). Consequently, the functions using these INTx# lines will share IRQ15 and its associated interrupt vector.
Figure 9-8: INTx Signal Routing is Platform Specific

Associating the INTx# Line to an IRQ Number

Based on the routing of the INTx# pin associated with each function, configuration software writes the appropriate Interrupt Line number into the function's Interrupt Line register (also pictured in Figure 9-7 on page 343). This value ultimately tells the function's device driver which interrupt vector will be reported when an interrupt occurs from this function. When this function generates an interrupt, the CPU receives the vector number that corresponds to the IRQ specified in the Interrupt Line register. The CPU uses this vector to index into the interrupt service table to fetch the entry point of the interrupt service routine associated with the function's device driver. The method used to communicate this information is operating-environment specific (e.g., Windows XP or Linux).
Note that because the INTx# lines can be wire-ORed from different devices, the Interrupt Line number assignment will be the same for those devices whose INTx# lines are wired together. In these cases, an interrupt signaled by any of the devices sharing the same IRQ causes the same vector to be sent to the CPU. Software must ensure that all service routines sharing the same IRQ input are chained together so that all devices can be checked to determine which one(s) caused the interrupt request. Once again, the mechanism used for chaining the service routines is operating-environment specific.

INTx# Signaling

The INTx# lines are active-low signals implemented as open-drain outputs, with a pull-up resistor provided on each line by the system. Multiple devices connected to the same PCI interrupt request signal line can assert it simultaneously without damage.
When a device signals an interrupt, it also sets a bit within a device-specific register to indicate that an interrupt is pending. This register can be mapped into memory or I/O address space and is read by device-specific software to verify that an interrupt is pending. When this bit is cleared, the INTx# signal is deasserted.
The device must also set the Interrupt Status bit located in the configuration Status register. This bit can be read by system software to see if an interrupt is currently pending. (See Figure 9-10 on page 347.)


Interrupt Disable. The 2.3 version of the PCI specification added an Interrupt Disable bit (bit 10) to the configuration Command register. See Figure 9-9 on page 346. The bit is cleared at reset, permitting INTx# signal generation. Software may set this bit, thereby inhibiting INTx# signaling. Note that the Interrupt Disable bit has no effect on Message Signaled Interrupts (MSI). MSIs are enabled via the MSI Capability register set's Message Control register.
Figure 9-9: Configuration Command Register — Interrupt Disable Field
Interrupt Status. The PCI 2.3 specification added an Interrupt Status bit to the configuration Status register (pictured in Figure 9-10 on page 347). A function must set this status bit when an interrupt is pending. In addition, if the Interrupt Disable bit in the configuration Command register is cleared (i.e., interrupts enabled), the function's INTx# signal is asserted, but only after the Interrupt Status bit is set. The Interrupt Status bit is unaffected by the state of the Interrupt Disable bit and has no effect on the MSI mechanism. Note also that the bit is read-only.
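The two configuration bits just described can be manipulated with ordinary configuration-space accesses, sketched below with hypothetical cfg_read16()/cfg_write16() helpers; the Command and Status register offsets and bit positions are the standard PCI header values.

```c
/* Sketch: inspecting the Interrupt Status bit and masking legacy INTx
 * signaling via the Interrupt Disable bit. */
#include <stdint.h>

#define CMD_REG            0x04           /* Command register offset  */
#define STATUS_REG         0x06           /* Status register offset   */
#define CMD_INTX_DISABLE   (1u << 10)     /* Interrupt Disable bit    */
#define STS_INTERRUPT      (1u << 3)      /* Interrupt Status bit     */

extern uint16_t cfg_read16(uint16_t off); /* hypothetical accessors   */
extern void     cfg_write16(uint16_t off, uint16_t v);

int intx_pending(void)
{
    return (cfg_read16(STATUS_REG) & STS_INTERRUPT) != 0;
}

void intx_disable(void)
{
    cfg_write16(CMD_REG, cfg_read16(CMD_REG) | CMD_INTX_DISABLE);
}
```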

Virtual INTx Signaling

When circumstances make it impossible to use MSIs, standard PCI-compatible INTx signaling may be used. Following are two examples of devices that cannot use MSI:
PCI Express-to-PCI(X) bridges - PCI devices will likely use the INTx signals to deliver an interrupt request (MSI is optional). Because PCI Express does not support sideband interrupt signaling, an INTx virtual wire message is used to signal the interrupt controller (located in the Root Complex). The interrupt controller in turn delivers an interrupt request to the CPU, including the vector number that identifies the entry point of the interrupt service routine.
Boot Devices - Standard PC systems typically use the legacy interrupt subsystem (8259 interrupt controller and related signals) during the boot sequence. The MSI subsystem cannot be used at that point because it is typically initialized only after the Operating System (OS) loads and device drivers initialize. PCI Express devices involved in initializing the system and loading the OS (e.g., video, hard drive, and keyboard) are called "boot devices." Boot devices must use legacy interrupt support until the OS and their device drivers are installed, after which they use MSI.


Virtual INTx Wire Delivery

Figure 9-11 on page 348 illustrates an example PCI Express system that implements a legacy boot device and a PCI Express-to-PCI Bridge. The bridge can never issue MSIs because it does not know the source of the INTx# signals, whereas the boot device can signal interrupts via MSI following the boot sequence. Figure 9-11 depicts the bridge using INTB messages to signal the assertion and deassertion of INTB# from the PCI bus. The legacy device is shown signaling INTA from its function. Note that INTx signaling involves two messages:
  • The Assert_INTx message, which indicates a high-to-low transition of the virtual INTx# signal.
  • The Deassert_INTx message, which indicates a low-to-high transition of the virtual INTx# signal.
When a Legacy device delivers an Assert_INTx message, it also sets its Interrupt Pending bit located in memory or I/O space and also sets the Interrupt Pending bit located within the Configuration Status register (Figure 9-10).
Figure 9-11: Legacy Devices use INTx Messages to Virtualize INTA#-INTD# Signal Transitions

Collapsing INTx Signals within a Bridge

Switches that have multiple downstream ports to which legacy devices attach must ensure that INTx transactions are delivered upstream in the correct fashion. The specific requirement is to ensure that the interrupt controller receives INTx messages that represent the wire-ORed behavior of legacy PCI implementations. As illustrated in Figure 9-8 on page 344, INTx lines may be shared when one or more INTx lines are tied together (wire-ORed). Consequently, when more than one device signals an interrupt at roughly the same time, only the first assertion is seen by the interrupt controller. Similarly, when one of these devices deasserts its INTx line, the line remains asserted and only the last deassertion will be seen by the interrupt controller.
Two or more legacy PCI Express devices sending the same INTx message on different ports of the same switch must be treated as wire-ORed messages. This ensures that the interrupt controller observes the correct transitions. Figure 9-12 on page 350 illustrates two legacy devices issuing INTA messages to the Switch. Note that because the INTA messages overlap, the second Assert_INTA is blocked because an Assert_INTA message has already been registered and no deassertion has yet occurred. Similarly, the first Deassert_INTA message is blocked because two Assert_INTAs are outstanding and this is only the first deassert message, so another will follow. This ensures that the interrupt controller will never receive two Assert_INTx messages of the same type nor two Deassert_INTx messages of the same type.
As described above, switches must track the state of each of the INTx messages at each port and transfer only those that represent a valid change in the virtual signaling.
Figure 9-12: Switch Collapses INTx Message to Achieve Wired-OR Characteristics
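The collapsing behavior can be modeled with a per-wire assertion counter, as in the following sketch: an Assert_INTx is forwarded upstream only on the zero-to-one transition, and a Deassert_INTx only when the count returns to zero. The counter array and send_upstream() function are illustrative.

```c
/* Sketch of the wire-OR collapse a switch performs per virtual INTx wire.
 * asserted[] counts downstream ports currently asserting that wire. */
#include <stdbool.h>

enum { INTA, INTB, INTC, INTD, NUM_INTX };
static unsigned asserted[NUM_INTX];

extern void send_upstream(int intx, bool assert);   /* emits the message */

void port_received_intx(int intx, bool assert)
{
    if (assert) {
        if (asserted[intx]++ == 0)        /* 0 -> 1: first assertion     */
            send_upstream(intx, true);    /* forward Assert_INTx         */
    } else if (asserted[intx] > 0) {
        if (--asserted[intx] == 0)        /* last deassertion            */
            send_upstream(intx, false);   /* forward Deassert_INTx       */
    }
}
```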

INTx Message Format

Figure 9-13 on page 351 depicts the format of the INTx message header and defines the message types supported. INTx messages are always delivered from the upstream ports of Endpoints, Bridges, and Switches. The routing employed is "Local-Terminate at Receiver," with the interrupt controller as the ultimate destination. The message code field identifies the message type, and eight codes are used by the INTx messages, as listed in Table 9-2.
Figure 9-13: INTx Message Format and Types
Table 9-2: INTx Message Codes
INTx Message        Message Code
Assert_INTA         0010 0000b
Assert_INTB         0010 0001b
Assert_INTC         0010 0010b
Assert_INTD         0010 0011b
Deassert_INTA       0010 0100b
Deassert_INTB       0010 0101b
Deassert_INTC       0010 0110b
Deassert_INTD       0010 0111b
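The codes in Table 9-2 follow a simple pattern that the following sketch makes explicit: bit 2 of the low nibble distinguishes Deassert from Assert, and bits 1:0 select the wire (A through D). The helper function is illustrative only.

```c
/* Sketch of the Table 9-2 message-code encoding. */
#include <stdint.h>

enum { INTX_A = 0, INTX_B, INTX_C, INTX_D };

static uint8_t intx_message_code(int intx, int assert)
{
    return (uint8_t)(0x20 | (assert ? 0 : 0x04) | (intx & 0x3));
}
/* intx_message_code(INTX_B, 1) == 0x21 (Assert_INTB)   */
/* intx_message_code(INTX_C, 0) == 0x26 (Deassert_INTC) */
```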
The rules associated with the delivery of INTx messages are consistent with other message types, but have some unique characteristics. The INTx message rules are summarized below:
  • Assert_INTx and Deassert_INTx messages are issued only in the upstream direction by Legacy Endpoint or Bridge devices. Note that an otherwise native PCI Express endpoint is allowed to send INTx messages prior to its device driver being loaded if it is a boot device.
  • Switches must issue INTx messages upstream when there is a change of the "collapsed" interrupt due to one of the downstream ports receiving an assert or deassert message. (See "Collapsing INTx Signals within a Bridge" on page 349.)
  • Devices on either side of a link must track the current state of INTA-INTD assertion.
  • A Switch tracks the state of the four virtual wires for each of its downstream ports, and presents a collapsed set of virtual wires on its upstream port.
  • The Root Complex must track the state of the four virtual wires (A-D) for each downstream port.
  • INTx signaling may be disabled with the Interrupt Disable bit in the Command Register.
  • If any INTx virtual wires are active and device interrupts are then disabled, a corresponding Deassert_INTx message must be sent.
  • If a switch downstream port goes to DL_Down status, any active INTx virtual wires must be deasserted, and the upstream port updated accordingly (Deassert_INTx message required if that INTx was in active state).

Devices May Support Both MSI and Legacy Interrupts

When a PCI Express device supports both INTx messages and MSI, only one of the mechanisms will be enabled at any given time. The most likely type of device to support both capabilities is a boot device. A system in which a boot device resides may not support MSI during the boot sequence. Consequently, configuration software will initialize interrupts by loading an Interrupt Line register value and enabling the device for legacy operation, just as is done for PCI devices. Once the OS loads, the device's MSI register set is programmed and the MSI Enable bit is set. Setting this bit disables the device's ability to use INTx messages and enables the delivery of MSIs.
Note that setting the Interrupt Disable bit in the configuration Command register also inhibits the generation of INTx messages.

Special Consideration for Base System Peripherals

Interrupts may also originate in embedded legacy hardware, such as an I/O Controller Hub or Super I/O device. Some of the typical legacy devices required in such systems include:
  • Serial ports
  • Parallel ports
  • Keyboard and Mouse Controller
  • System Timer
  • IDE controllers
These devices typically require a very specific IRQ line, which allows legacy software to interact with them correctly.
Using the INTx messages does not guarantee that the devices will receive the IRQ assignment that they require. Many different approaches and strategies may be employed to ensure they get the IRQs required. Following is an example system that supports legacy interrupt assignment.

Example System

Figure 9-14 on page 354 illustrates a PCI Express system that includes an existing I/O Controller Hub (ICH) that attaches to the Root Complex via a proprietary link. The interrupt controller that is embedded within the ICH is an IOAPIC that can generate an MSI when it receives an interrupt request at its inputs. In such an implementation, software can assign the legacy vector number to each input, to ensure the correct legacy software is called.
The advantage to this approach is that existing hardware can be used to support the legacy requirements within a PCI Express platform. This system also requires that the MSI subsystem be configured for use during the boot sequence. The example illustrated eliminates the need for INTx messages, unless a PCI Express expansion device incorporates a PCI Express-to-PCI Bridge.
Figure 9-14: PCI Express System with PCI-Based IO Controller Hub

10 Error Detection and Handling

The Previous Chapter

Native PCI Express devices that require interrupt support must use the Message Signaled Interrupt (MSI) mechanism defined originally in the PCI 2.2 version of the specification. The previous chapter detailed the MSI mechanism and also described the legacy support that permits virtualization of the PCI INTx signals required by devices such as PCI Express-to-PCI Bridges.

This Chapter

To this point it has been presumed that transactions traversing the fabric have not encountered any errors that cannot be corrected by hardware. This chapter discusses both correctable and non-correctable errors and discusses the mechanisms used to report them. The PCI Express architecture provides a rich set of error detection, reporting, and logging capabilities. PCI Express error reporting classifies errors into three classes: correctable, non-fatal, and fatal. PCI Express error reporting capabilities include PCI-compatible mechanisms, thus a brief review of the PCI error handling is included as background information.

The Next Chapter

The next chapter describes the Logical Physical Layer core logic. It describes how an outbound packet is processed before clocking the packet out differentially. The chapter also describes how an inbound packet arriving from the Link is processed and sent to the Data Link Layer. Sub-block functions of the Physical Layer, such as the Byte Striping and Un-Striping logic, Scrambler and De-Scrambler, 8b/10b Encoder and Decoder, and Elastic Buffers, are discussed, and more.

Background

The original PCI bus implementation provides for basic parity checks on each transaction as it passes between two devices residing on the same bus. When a transaction crosses a bridge, the bridge is involved in the parity checks at both the originating and destination busses. Any error detected is registered by the device that has detected the error and optionally reported. The PCI architecture provides a method for reporting the following types of errors:
  • data parity errors - reported via the PERR# (Parity Error) signal
  • data parity errors during multicast transactions (special cycles) - reported via the SERR# (System Error) signal
  • address and command parity errors - reported via the SERR# signal
  • other types of errors (e.g. device specific) - reported via SERR#
Errors reported via PERR# are considered potentially recoverable, whereas errors reported via SERR# are considered unrecoverable. How the errors reported via PERR# are handled is left up to the implementer. Error handling may involve only hardware, device-specific software, or system software. Errors signaled via SERR# are reported to the system and handled by system software. (See MindShare's PCI System Architecture book for details.)
PCI-X uses the same error reporting signals as PCI, but defines specific error handling requirements depending on whether device-specific error handling software is present. If a device-specific error handler is not present, then all parity errors are reported via SERR#.
PCI-X 2.0 adds limited support for Error Correction Codes (ECC) designed to automatically detect and correct single-bit errors within the address or data. (See MindShare's PCI-X System Architecture book for details.)

Introduction to PCI Express Error Management

PCI Express defines a variety of mechanisms used for checking errors, reporting those errors and identifying the appropriate hardware and software elements for handling these errors.

PCI Express Error Checking Mechanisms

PCI Express error checking focuses on errors associated with the PCI Express interface and the delivery of transactions between the requester and completer functions. Figure 10-1 on page 357 illustrates the scope of the error checking that
is the focus of this chapter. Errors within a function that do not pertain to a given transaction are not reported through the error handling procedures defined by the PCI Express specification, and it is recommended that such errors be handled using proprietary methods that are reported via device-specific interrupts. Each layer of the PCI Express interface includes error checking capability as described in the following sections.
Figure 10-1: The Scope of PCI Express Error Checking and Reporting


Transaction Layer Errors

Transaction layer checks are performed only by the Requester and the Completer; switches do not perform transaction layer checks on packets that merely pass through them. Checks performed at the transaction layer include:
  • ECRC check failure (optional check based on end-to-end CRC)
  • Malformed TLP (error in packet format)
  • Completion Time-outs during split transactions
  • Flow Control Protocol errors (optional)
  • Unsupported Requests
  • Data Corruption (reported as a poisoned packet)
  • Completer Abort (optional)
  • Unexpected Completion (completion does not match any Request pending completion)
  • Receiver Overflow (optional check)

Data Link Layer Errors

Link layer error checks occur within a device involved in delivering the transaction between the requester and completer functions. This includes the requesting device, intermediate switches, and the completing device. Checks performed at the link layer include:
  • LCRC check failure for TLPs
  • Sequence Number check for TLPs
  • LCRC check failure for DLLPs
  • Replay Time-out
  • Replay Number Rollover
  • Data Link Layer Protocol errors

Physical Layer Errors

Physical layer error checks are also performed by all devices involved in delivering the transaction, including the requesting device, intermediate switches, and the completing device. Checks performed at the physical layer include:
  • Receiver errors (optional)
  • Training errors (optional)

Error Reporting Mechanisms

PCI Express provides three mechanisms for establishing the error reporting policy. These mechanisms are controlled and reported through configuration registers mapped into three distinct regions of configuration space. (See Figure 10-2 on page 360.) The various error reporting features are enabled as follows:
  • PCI-compatible Registers (required) - this error reporting mechanism provides backward compatibility with existing PCI compatible software and is enabled via the PCI configuration Command Register. This approach requires that PCI Express errors be mapped to PCI compatible error registers.
  • PCI Express Capability Registers (required) - this mechanism is available only to software that has knowledge of PCI Express. This required error reporting is enabled via the PCI Express Device Control Register mapped within PCI-compatible configuration space.
  • PCI Express Advanced Error Reporting Registers (optional) - this mechanism involves registers mapped into the extended configuration address space. PCI Express compatible software enables error reporting for individual errors via the Error Mask Register.
The specification refers to baseline (required) error reporting capabilities and advanced (optional) error reporting capabilities. The baseline error reporting mechanisms require access to the PCI-compatible registers and PCI Express Capability registers (the first two items above), while advanced error reporting (the third item) requires access to the Advanced Error Reporting registers that are mapped into extended configuration address space as illustrated in Figure 10-2. This chapter details all error reporting mechanisms.


Figure 10-2: Location of PCI Express Error-Related Configuration Registers

Error Handling Mechanisms

Errors are categorized into three classes that specify the severity of an error, as listed below. Note also that the specification defines the entity that should handle the error based on its severity:
  • Correctable errors - handled by hardware
  • Uncorrectable errors-nonfatal - handled by device-specific software
  • Uncorrectable errors-fatal - handled by system software
By dividing errors into these classes, error handling software can be partitioned into separate handlers to perform the actions required for a given platform. The actions taken based on the severity of an error might range from monitoring the effects of correctable errors on system performance to simply resetting the system or PCI Express sub-system in the event of a fatal error.
Note that regardless of the severity of a given error, software can establish a policy whereby any error can be reported to the system (via the Root Complex) for the purpose of tracking and logging. (See page 390 for details.)

Sources of PCI Express Errors

This section provides a more detailed description of the error checks made by PCI Express interfaces when handling transactions.

ECRC Generation and Checking

ECRC generation and checking is optional and only supported by devices and systems that implement Advanced Error Reporting. Devices that support ECRC must implement the Advanced Error Capabilities and Control Register. Configuration software checks this register to determine if ECRC is supported, and to enable ECRC support.
A PCI Express device that originates a transaction (Request or Completion) can create and append a 32-bit CRC (within the digest field) that covers the header and data portions of the transaction. This CRC is termed end-to-end (ECRC) and is typically checked and reported by the ultimate recipient of the transaction. Switches in the path between the originating and receiving devices may optionally check and report ECRC errors, or merely forward the packet without checking the ECRC. If a Switch detects an ECRC error it must still forward the packet unaltered as it would any other packet. Switches may also be the originator or recipient of a transaction in which case they can participate in ECRC generation and checking in this role.
The actions taken when an ECRC error is detected are beyond the scope of the specification. However, the actions taken will likely depend on whether the ECRC error occurs in a Request or a Completion:
  • ECRC in Request - completers that detect an ECRC error may simply drop the transaction without forwarding it to the receiving function and as a result not return a completion. This ultimately will result in a completion time-out within the requesting device. The requester could then reschedule the transaction under software control.


  • ECRC in Completion - requesters that detect an ECRC error may drop the packet and report the error to the function's device driver via a function-specific interrupt. The driver would check the status bits in the Uncorrectable Error Status Register to detect the nature of the error and potentially reschedule the transaction in the event that a prefetchable address location was being accessed.
Note that ECRC errors may also result in error messages being sent to the host for handling or logging.
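For illustration only, the following is a minimal sketch of a bit-serial 32-bit CRC calculation of the kind used for the ECRC digest. It assumes the CRC-32 polynomial 04C11DB7h; the exact seed handling, bit ordering, and the rule that certain header bits are treated as 1 during the calculation are simplified, and the function and variable names are not taken from the specification.

#include <stdint.h>
#include <stddef.h>

/* Bit-serial CRC-32 over a TLP image (header plus data payload), processed
 * MSB first with polynomial 0x04C11DB7 and an all-ones seed. A real ECRC
 * implementation must follow the bit-ordering rules in the specification. */
static uint32_t ecrc32(const uint8_t *tlp, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            uint32_t in = (tlp[i] >> bit) & 1u;
            uint32_t fb = ((crc >> 31) ^ in) & 1u;

            crc <<= 1;
            if (fb)
                crc ^= 0x04C11DB7u;
        }
    }
    return crc;  /* placed in the digest field by the transmitter and
                    recomputed by the ultimate recipient */
}

The transmitter appends the result in the digest field; the recipient recomputes the CRC over the received header and data and compares it against the received digest.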

Data Poisoning (Optional)

Data poisoning provides a way of indicating that data associated with a transaction is corrupted. When a TLP is received that contains a data field, the recipient will know that data is corrupted if the poisoned bit is set. Figure 10-3 illustrates the Error/Poisoned bit (EP) located within the first Dword of a packet.
Figure 10-3: The Error/Poisoned Bit within Packet Headers
The specification includes three examples of data corruption that could result in data poisoning being used. In each of these cases, the error can be forwarded to the recipient via the data poisoning bit in the transaction header.
  • When a Requester wishes to perform a Memory Write transaction it must first fetch the data it wishes to send to the completer from local memory. In the event of a parity error when reading the data, the data can be marked as poisoned.
  • When a Completer must return data in response to a Memory Read request, the data it fetches from memory may incur a parity error.
  • Data that is stored in a cache or buffer that has error checking may also result in data corruption being detected. The specification does not indicate where these caches or buffers may be located, but it is conceivable that any device that originates a transaction or forwards it may indicate that the data has been poisoned.
The specification states that data poisoning applies to data associated with both posted and non-posted writes and read completions. Therefore data poisoning can be used in conjunction with Memory, I/O, and Configuration transactions that have a data payload.
Data poisoning can only be done at the transaction layer of a device. The link layer does not process or in any way affect the contents of the TLP header. The transaction layer indicates that data is corrupted by setting the Error/Poisoned bit (EP) in the Write Request or Read Completion transaction header.
Data poisoning errors are enabled and reported via the Uncorrectable Error Registers.
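As a simple illustration of where the poisoned indication lives, the sketch below extracts the TD and EP bits from the first DWord of a TLP header, assuming the header bytes have been packed into the DWord with byte 0 in the most significant byte position; the helper names are hypothetical.

#include <stdint.h>
#include <stdbool.h>

/* With header byte 0 occupying bits 31:24 of DWord 0, the TD (digest present)
 * bit sits at bit 15 and the EP (error/poisoned) bit at bit 14. */
static bool tlp_has_digest(uint32_t hdr_dw0)
{
    return (hdr_dw0 >> 15) & 1u;
}

static bool tlp_is_poisoned(uint32_t hdr_dw0)
{
    return (hdr_dw0 >> 14) & 1u;
}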

TC to VC Mapping Errors

All PCI Express ports that implement the extended VC capability (endpoints, switches, and root ports) must check to verify that the TC of each inbound packet is mapped to an active VC. Packets with TCs that fail this check are treated as malformed TLPs. Similarly, when switches forward packets, they must also verify that the target outbound port of the switch supports the packet's TC. If not, the packet is treated as a malformed TLP.
A requester or completer may implement more than one Virtual Channel (VC) and enable more than one Traffic Class (TC). If the device core issues a request to send a transaction with a TC number that is not enabled (pointing to an active VC) within the TC to VC mapping tables, the transaction is rejected and a malformed TLP is indicated. If the device has implemented the PCI Express Virtual Channel Capability structure (supports multiple VCs and/or TC filtering), then the malformed TLP error is detected at the transmitting device. However, if the device only supports the default TC0/VC0 configuration, then this error would be detected at the first receiving device along the packet's path that implements the extended VC capability and applies TC filtering.
Note also that the TC to VC mapping is a transaction layer function. (See "Assigning TCs to each VC - TC/VC Mapping" on page 262 for details).
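Conceptually, the check amounts to a small table lookup performed by the transaction layer on each packet's TC field. The sketch below is only meant to convey the idea; the table layout and names are illustrative and not taken from the specification.

#include <stdint.h>

#define NUM_TCS 8

/* Illustrative per-port mapping: each TC maps to an enabled VC number,
 * or -1 if the TC is not mapped to any active VC on this port. */
typedef struct {
    int tc_to_vc[NUM_TCS];
} port_vc_map;

/* Returns the VC to use, or -1 to indicate the TLP must be treated
 * as a malformed TLP. */
static int map_tc_to_vc(const port_vc_map *map, unsigned tc)
{
    if (tc >= NUM_TCS)
        return -1;
    return map->tc_to_vc[tc];
}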

Link Flow Control-Related Errors

Prior to forwarding the packet to the link layer for transmission across the link, the transaction layer must check flow control credits to ensure that the receive buffers in the adjacent node (switch or completer) have sufficient room to hold
the transaction. Flow control protocol errors may occur that will likely prevent transactions from being sent. These errors can be reported to the Root Complex and are considered uncorrectable. The uncorrectable error registers are used to enable and check status for Flow Control (FC) errors. The specification does not require that these errors be checked and reported.
All five conditions that cause flow control related errors (flow control protocol errors and receiver overflow errors) are detected by and associated with the port receiving the flow control information. These error conditions are:
  • The specification defines the minimum credit size that can be initially reported for each Flow Control type. During FC initialization for any Virtual Channel, if a receiver fails to advertise VC credit values equal to or greater than those minimums, it is considered an FC protocol error.
  • The maximum number of data payload credits that can be reported is restricted to 2048 unused credits and 128 unused credits for headers. Exceeding these limits is considered an FC protocol error.
  • During FC initialization receivers are allowed to report infinite FC credits. In that case FC updates are not required following initialization, but are allowed provided that the credit value fields are set to zero (the recipient ignores them). If a credit field contains any value other than zero, it is considered an FC protocol error.
  • During FC initialization either the Data or header FC advertisement (but not both) for a given FC type may be infinite. FC update packets are required to report credits for the buffer that advertised limited FC credits. However the update credit value for the buffer advertised as infinite must be set to zero and ignored by the receiver. A non-zero credit value could cause an FC protocol error.
  • A specific check can be made at the receiving port to determine whether a flow control receive buffer has overflowed, resulting in lost data. This check is optional, and a failure is reported as a Receiver Overflow error.
Flow Control Protocol errors are reported as uncorrectable errors, if supported and enabled.
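The credit check described at the start of this section can be pictured roughly as below. This is only a sketch of the modulo-arithmetic gating test a transmitter might apply to its credit counters before releasing a TLP; the counter widths, names, and exact formula are simplified relative to the specification.

#include <stdint.h>
#include <stdbool.h>

/* Credit counters wrap modulo their field width, so the comparison is done
 * with modulo arithmetic: a "negative" margin wraps into the upper half of
 * the counter range. */
static bool credits_available(uint16_t credit_limit,
                              uint16_t credits_consumed,
                              uint16_t credits_required,
                              unsigned field_bits)
{
    uint16_t mask   = (uint16_t)((1u << field_bits) - 1u);
    uint16_t margin = (uint16_t)((credit_limit - credits_consumed -
                                  credits_required) & mask);

    return margin <= (uint16_t)(1u << (field_bits - 1));
}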

Malformed Transaction Layer Packet (TLP)

When the ultimate recipient of a transaction receives a request or completion packet into the transaction layer, the packet format is checked for violations of the TLP formatting rules. The specification defines the following items that cause a malformed packet:
  • Data payload exceeds Max payload size.
  • The actual data length does not match data length specified in the header.
  • The start address and length fields specify a transfer that crosses a 4KB boundary (a check sketched following this list).
  • TD field =1 (indicating Digest included) but no digest field is present.
  • Byte Enable violation detected.
  • Packets that use undefined Type field values.
  • Multiple completion packets are used to send read data back to the requester and the data size returned in any of the completion packets violates the Read Completion Boundary (RCB) value.
  • Completions with a Configuration Request Retry Status in response to a Request other than a Configuration Request.
  • Traffic Class field (TC) contains a value not assigned to an enabled Virtual Channel (VC) within the TC - VC mapping for the receiving device.
  • Transaction type requiring use of TC0 has TC value other than zero:
o I/O Read or Write requests and completions
o Configuration Read or Write requests and completions
o Error messages
o INTx messages
o Power Management messages
o Unlock messages
o Slot Power messages
  • Routing is incorrect for transaction type (e.g., transactions requiring routing to Root Complex detected moving away from Root Complex).
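Two of these checks, the maximum payload check and the 4KB boundary check mentioned above, can be sketched as follows. The way the length and address are extracted from the header is omitted, and the names are illustrative.

#include <stdint.h>
#include <stdbool.h>

/* The Length field counts DWords; Max_Payload_Size is expressed in bytes. */
static bool payload_exceeds_max(uint32_t length_dw, uint32_t max_payload_bytes)
{
    return ((uint64_t)length_dw * 4u) > max_payload_bytes;
}

/* A memory request must not cross a 4KB boundary: the first and last bytes
 * of the transfer must fall within the same 4KB-aligned region. */
static bool crosses_4kb_boundary(uint64_t start_addr, uint32_t length_dw)
{
    uint64_t last_byte = start_addr + (uint64_t)length_dw * 4u - 1u;
    return (start_addr >> 12) != (last_byte >> 12);
}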

Split Transaction Errors

A variety of failures can occur during a split transaction. When a transaction request is sent to a destination device a completion transaction is expected in response (except for Memory Write and Message transactions which are posted transactions). The various failure modes that can occur are discussed in the following sections.

Unsupported Request

When a recipient of a transaction request detects that it does not support this transaction, it returns a completion transaction with unsupported request (UR) specified in the completion status field. The specification defines a number of specific conditions that cause UR to be returned in the completion status field:
  • Request type not supported.
  • Message request received with unsupported or undefined message code.
  • Request does not reference address space mapped within device.
  • Request contains an address that cannot be routed to any egress port of a bridge or a switch (i.e., address is not mapped to any port's base and limit registers).
  • Poisoned Write Request addresses an I/O or Memory mapped control space in the Completer. Such transactions must be discarded by the Completer and reported as a UR.
  • PCI Express endpoint device receives a MRdLk (lock) transaction. (Recall that lock is not supported by endpoint devices.)
  • The downstream port of the Root Complex or Switch receives a configuration request with a Device number of 1-31. The port must terminate the transaction and not initiate a configuration transaction on the downstream link. Instead a completion transaction is returned with UR status.
  • A configuration access that targets an un-implemented Bus, Dev, or Function results in termination of the transaction, and a completion transaction is returned with UR status.
  • Type 1 configuration request received at endpoint.
  • A completion is received at the requester with a reserved completion code. This must be interpreted as a UR.
  • A function is in the D1 or D2 power management states and a request other than a configuration request is received.
  • A configuration access that targets a bus that is not in the range of buses downstream of that port according to the secondary and subordinate register values.
  • A configuration request passing through a PCI Express - PCI Bridge for which the Extended Register Address field is non-zero that is directed toward a PCI bus that does not support Extended Configuration Space.
  • A transaction headed for a port with the Reject Snoop Transactions field set in the VC Resource Capability register that does not have the No Snoop bit set in the TLP header.
  • A transaction is targeting a device on a PCI bus but is Master Aborted after the request was accepted by the bridge.

Completer Abort

These types of error checks are optional. Three circumstances could result in a Completer returning an abort to the Requester:
  1. A Completer receives a request that it cannot complete because the request violates the programming rules for the device. As an example, some devices may be designed to permit access to a single location within a specific DWord, while any attempt to access the other locations within the same DWord would fail. Such restrictions are not violations of the specification, but rather legal restrictions associated with an implementation-specific programming interface for this function. Accesses to such devices are based upon the expectation that only the device driver for this device understands how to access its function.
  2. A Completer receives a request that it cannot process because of some permanent error condition associated with the device. For example, a PCI Express wireless LAN card might not accept transactions because it will not transmit or receive data over its radio unless an approved antenna is attached.
  3. A PCI Express-to-PCI Bridge may receive a request that targets a PCI or PCI-X bus. These buses support a signaling convention that allows the target device to indicate that it has aborted the request (target abort) because it cannot complete the request due to some permanent condition or violation of the function's programming rules. The bridge in turn would return a PCI Express completion transaction indicating CA status.
A Completer that aborts a request may report the error to the Root Complex as a Non-Fatal Error message. Further, if the aborted request requires a completion, the completion status would be reported as CA.

Unexpected Completion

A completion transaction that arrives at a Requester uses the transaction descriptor (Requester ID and Tag) to match the request to which this completion applies. In rare circumstances the transaction descriptor may not match a request that is pending completion. The typical reason for this unexpected completion is that the completion was mis-routed on its journey back to the intended requester. Consequently, two requesters will be surprised:
  1. the requester that has received the unexpected completion, and
  2. the requester that fails to receive the completion (thereby causing a completion time-out)
A Non-Fatal Error message can be sent by the device that receives the unexpected completion.

Completion Time-out

The previous discussion points out that a completion transaction can be routed to the wrong device. Consequently, the pending request will never receive its completion. The specification defines the completion time-out mechanism to identify this situation and report the error to requester software for possible recovery. The specification clearly defines that the intent of the completion time-out is to detect when a completion has no reasonable chance of returning, and is not related to the expected latencies associated with split transactions.
The completion time-out mechanism must be implemented by any device that initiates requests that require completions to be returned. An exception is allowed for devices that only initiate configuration transactions. The specification defines the permissible range of the time-out value as follows:
  • It is strongly recommended that a device not time out earlier than 10 ms after sending a request; however, if the device requires greater granularity, a time-out can occur as early as 50 μs.
  • Devices must time out no later than 50 ms.
Note that for Memory Read requests, a single request may require two or more completions to return all of the requested data. All of the data must be returned prior to the time-out. If some but not all of the data has been returned when the time-out occurs, the requester may either discard or keep the data.
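A requester implementation might track each outstanding non-posted request with a timestamp and compare it against its chosen time-out value, which (per the recommendations above) would normally lie between 50 μs and 50 ms. The sketch below is illustrative only; the structure and names are not from the specification.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     outstanding;    /* completion(s) still expected          */
    uint64_t issue_time_us;  /* time at which the request was issued  */
} pending_request;

/* Returns true if the request has been outstanding longer than the
 * requester's programmed completion time-out value. */
static bool completion_timed_out(const pending_request *req,
                                 uint64_t now_us, uint64_t timeout_us)
{
    if (!req->outstanding)
        return false;
    return (now_us - req->issue_time_us) >= timeout_us;
}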

Error Classifications

Errors are categorized into three classes that specify the severity of an error, as listed below. Note also that the specification defines the entity that should handle the error based on its severity:
  • Correctable errors - handled by hardware
  • Uncorrectable errors-nonfatal - handled by device-specific software
  • Uncorrectable errors-fatal - handled by system software
By dividing errors into these classes, error handling software can be partitioned into separate handlers to perform the actions required for a given platform. The actions taken based on the severity of an error might range from monitoring the effects of correctable errors on system performance to simply resetting the system in the event of a fatal error.
Note that regardless of the severity of a given error, software can establish a policy whereby the system is notified of all errors for the purpose of tracking and logging each category.

Correctable Errors

The specification defines correctable errors as those errors that are corrected solely by hardware. Such errors may have an impact on performance (i.e., latency and bandwidth), but no information is lost as a result of the error. These types of errors can be reported to software, which can take a variety of actions, including:
  • log the error
  • update calculations of PCI Express performance
  • track errors to project possible weaknesses within the fabric. This can suggest areas where greater potential exists for fatal errors in the future.

Uncorrectable Non-Fatal Errors

Non-fatal errors mean that the integrity of the PCI Express fabric is not affected, but that information has been lost. Non-fatal errors typically mean that a transaction has been corrupted, and the PCI Express hardware cannot correct the error. However, the PCI Express fabric continues to function correctly and other transactions are unaffected. Recovery from a non-fatal error may or may not be possible. The possibility of recovery rests in the hands of the device-specific software associated with the requester that initiated the transaction.

Uncorrectable Fatal Errors

Fatal errors indicate that a link in the PCI Express fabric is no longer reliable. Data has been lost and every attempt to recover data under software control will likely fail also. Such conditions affect all transactions that traverse a given link. In some cases, the error condition leading to the fatal errors may be resolved by resetting the link. Alternatively, the specification invites implementation-specific approaches, in which software may attempt to limit the effects of the failure. The specification does not define any particular actions that should or could be taken by software.


How Errors are Reported

PCI Express includes two methods of reporting errors:
  • error message transactions - used to report errors to the host
  • completion status - used by the completer to report errors to the requester
Each reporting mechanism is described below.

Error Messages

As discussed previously, PCI reports errors via the PERR# and SERR# signals. Because PCI Express eliminates these error-related signals, error messages have been defined to replace them by acting as virtual wires. Furthermore, these messages provide additional information that could not be conveyed directly via the PERR# and SERR# signals. This includes identification of the device that detected the error and an indication of the severity of each error.
Figure 10-4 illustrates the format of the error messages. Note that these packets are routed to the Root Complex for handling by system software. The message code defines the type of message being signaled. As one might guess, the specification defines three basic types of error messages shown in Table 10-1.
Table 10-1: Error Message Codes and Description
  Message Code   Name           Description
  30h            ERR_COR        Used when a PCI Express device detects a correctable error
  31h            ERR_NONFATAL   Used when a device detects a non-fatal, uncorrectable error
  33h            ERR_FATAL      Used when a device detects a fatal, uncorrectable error
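The message codes in Table 10-1 lend themselves to a simple software encoding, for example:

/* Error Message codes from Table 10-1. */
enum pcie_err_msg_code {
    ERR_COR      = 0x30,  /* correctable error              */
    ERR_NONFATAL = 0x31,  /* non-fatal, uncorrectable error */
    ERR_FATAL    = 0x33   /* fatal, uncorrectable error     */
};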

Completion Status

PCI Express defines a completion status field within the completion header that enables the transaction completer to report errors back to the requester. Figure 10-5 illustrates the location of the completion field and Table 10-2 defines the completion values. Note that the shaded status entries represent error conditions that can be reported via error messages.
Figure 10-5: Completion Status Field within the Completion Header
Table 10-2: Completion Code and Description
  Status Code   Completion Status Definition
  000b          Successful Completion (SC)
  001b          Unsupported Request (UR)
  010b          Configuration Request Retry Status (CRS)
  011b          Completer Abort (CA)
  100b - 111b   Reserved
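Similarly, the completion status values of Table 10-2 can be captured as:

/* Completion status codes from Table 10-2 (values 100b-111b are reserved). */
enum pcie_cpl_status {
    CPL_SC  = 0x0,  /* Successful Completion              */
    CPL_UR  = 0x1,  /* Unsupported Request                */
    CPL_CRS = 0x2,  /* Configuration Request Retry Status */
    CPL_CA  = 0x3   /* Completer Abort                    */
};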

Baseline Error Detection and Handling

This section defines the required support for detecting and reporting PCI Express errors. Each PCI Express-compliant device must include:
  • PCI-Compatible support - required to support operating environments that have no knowledge of PCI Express.
  • PCI Express Error reporting - available to operating environments that do have knowledge of PCI Express.

PCI-Compatible Error Reporting Mechanisms

Each PCI Express device must map the required PCI Express error support to the PCI-related error registers. This involves enabling error reporting and setting status bits that can be read by PCI-compliant software. To understand the features available from the PCI-compatible point of view, consider the error-related bits of the Command and Status registers located within the Configuration header. While the Command and Status register bits retain their PCI names, some of the field definitions have been modified to reflect the related PCI Express error conditions and reporting mechanisms. The PCI Express errors tracked by the PCI-compatible registers are:
  • Transaction Poisoning/Error Forwarding (optional)
  • Completer Abort (CA) detected by a completer
  • Unsupported Request (UR) detected by a completer


The PCI mechanism for reporting errors is the assertion of PERR# (data parity errors) and SERR# (unrecoverable errors). The PCI Express mechanisms for reporting these events are via the split transaction mechanism (transaction completions) and virtual SERR# signaling via error messages.

Configuration Command and Status Registers

Figure 10-6 illustrates the command register and the location of the error-related fields. These bits are set to enable baseline error reporting under control of PCI-compatible software. Table 10-3 defines the specific effects of each bit.
Figure 10-6: PCI-Compatible Configuration Command Register
Table 10-3: Error-Related Command Register Bits
  Name                    Description
  SERR# Enable            Setting this bit (1) enables the generation of the appropriate PCI Express error messages to the Root Complex. Error messages are sent by the device that has detected either a fatal or non-fatal error.
  Parity Error Response   This bit enables poisoned TLP reporting. This error is typically reported as an Unsupported Request (UR) and may also result in a non-fatal error message if SERR# Enable = 1b. Note that reporting in some cases is device-specific.
Figure 10-7 illustrates the configuration status register and the location of the error-related bit fields. Table 10-4 on page 374 defines the circumstances under which each bit is set and the actions taken by the device when error reporting is enabled.
Figure 10-7: PCI-Compatible Status Register (Error-Related Bits)
Table 10-4: Description of PCI-Compatible Status Register Bits for Reporting Errors
  Error-Related Bit          Description
  Detected Parity Error      Set by the interface that receives a Write Request or Read Completion transaction with the poisoned bit set. This pertains to requesters, completers, and switches. (This bit is updated regardless of the state of the Parity Error Response bit.)
  Signalled System Error     Set by a device that has detected an uncorrectable error and reported it via an error message (requires the SERR# Enable bit in the Command register to be set in order to send the error message).
  Received Master Abort      Set by a requester that receives a completion transaction with Unsupported Request (UR) in the completion status field.
  Received Target Abort      Set by a requester that receives a completion transaction with Completer Abort (CA) in the completion status field.
  Signalled Target Abort     Set by a completer when aborting a request that violates the device's programming rules.
  Master Data Parity Error   Set by a device that either receives a completion packet that has been poisoned or transmits a write request packet that has been poisoned.
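A PCI-compatible error handler typically reads the Status register, examines these bits, and clears them by writing the value back (the error bits are write-1-to-clear). A minimal sketch follows; the bit positions are the standard PCI Status register positions, and the configuration-space accessors are hypothetical.

#include <stdint.h>

/* Error-related bits in the PCI-compatible Status register (offset 06h). */
#define STS_DETECTED_PARITY_ERR     (1u << 15)
#define STS_SIGNALLED_SYSTEM_ERR    (1u << 14)
#define STS_RECEIVED_MASTER_ABORT   (1u << 13)
#define STS_RECEIVED_TARGET_ABORT   (1u << 12)
#define STS_SIGNALLED_TARGET_ABORT  (1u << 11)
#define STS_MASTER_DATA_PARITY_ERR  (1u << 8)

#define PCI_STATUS_OFFSET 0x06

/* Hypothetical configuration-space accessors. */
extern uint16_t cfg_read16(unsigned bdf, unsigned offset);
extern void     cfg_write16(unsigned bdf, unsigned offset, uint16_t value);

static void clear_reported_pci_errors(unsigned bdf)
{
    uint16_t status = cfg_read16(bdf, PCI_STATUS_OFFSET);

    /* Writing the value back clears whichever error bits were set,
     * since these bits are write-1-to-clear. */
    cfg_write16(bdf, PCI_STATUS_OFFSET, status);
}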

PCI Express Baseline Error Handling

The baseline capability also requires use of the PCI Express Capability registers. These registers include error detection and handling bit fields that provide finer granularity regarding the nature of an error than is supplied by standard PCI error handling.
Figure 10-8 on page 376 illustrates the PCI Express capability register set. These registers provide support for:
  • Enabling/disabling error reporting (Error Message Generation)
  • Providing error status
  • Providing status for link training errors
  • Initiating link re-training


Figure 10-8: PCI Express Capability Register Set

Enabling/Disabling Error Reporting

The Device Control and Device Status registers permit software to enable generation of Error Messages for four error-related events and to check status information to determine which type of error has been detected:
  • Correctable Errors
  • Non-Fatal Errors
  • Fatal Errors
  • Unsupported Request Errors
Note that the only specific type of error condition identified is the unsupported request (UR). No granularity is provided for determining other types of error conditions that occur. Only the classification of the error is reported within the device status register. Table 10-5 on page 377 lists each error type and its associated error classification.
Table 10-5: Default Classification of Errors
  Classification              Name of Error                 Layer Detected
  Correctable                 Receiver Error                Physical
  Correctable                 Bad TLP                       Link
  Correctable                 Bad DLLP                      Link
  Correctable                 Replay Time-out               Link
  Correctable                 Replay Number Rollover        Link
  Uncorrectable - Non-Fatal   Poisoned TLP Received         Transaction
  Uncorrectable - Non-Fatal   ECRC Check Failed             Transaction
  Uncorrectable - Non-Fatal   Unsupported Request           Transaction
  Uncorrectable - Non-Fatal   Completion Time-out           Transaction
  Uncorrectable - Non-Fatal   Completion Abort              Transaction
  Uncorrectable - Non-Fatal   Unexpected Completion         Transaction
  Uncorrectable - Fatal       Training Error                Physical
  Uncorrectable - Fatal       DLL Protocol Error            Link
  Uncorrectable - Fatal       Receiver Overflow             Transaction
  Uncorrectable - Fatal       Flow Control Protocol Error   Transaction
  Uncorrectable - Fatal       Malformed TLP                 Transaction
Enabling Error Reporting - Device Control Register. Setting the corresponding bit in the Device Control Register enables the generation of the corresponding Error Message which reports errors associated with each classification. (Refer to Figure 10-9 on page 378.) Unsupported Request errors are specified as Non-Fatal errors and are reported via a Non-Fatal Error Message, but only when the UR Reporting Enable bit is set.
Figure 10-9: Device Control Register Bit Fields Related to Error Handling
Error Status — Device Status Register. See Figure 10-10 on page 379. An error status bit is set any time an error associated with its classification is detected. These bits are set irrespective of the setting of the Error Reporting Enable bits within the Device Control Register. Because Unsupported Request errors are by default considered Non-Fatal Errors, when these errors occur both the Non-Fatal Error status bit and the Unsupported Request status bit will be set. Note that these bits are cleared by software when writing a one (1) to the bit field.


Figure 10-10: Device Status Register Bit Fields Related to Error Handling
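As a rough sketch of how configuration software might use these registers, the code below sets all four reporting enables in the Device Control register and later clears the detected-error status bits in the Device Status register. The register offsets, the accessor functions, and the exact bit positions are assumptions made for this sketch, not values taken from this chapter.

#include <stdint.h>

/* Assumed bit positions of the four reporting enables in the Device Control
 * register; the matching detected-status bits occupy the same positions in
 * the Device Status register. */
#define DEVCTL_CORR_ERR_REPORT_EN      (1u << 0)
#define DEVCTL_NONFATAL_ERR_REPORT_EN  (1u << 1)
#define DEVCTL_FATAL_ERR_REPORT_EN     (1u << 2)
#define DEVCTL_UR_REPORT_EN            (1u << 3)

/* Assumed offsets of the registers within the PCI Express Capability. */
#define DEV_CONTROL_OFFSET 0x08
#define DEV_STATUS_OFFSET  0x0A

/* Hypothetical accessors into the PCI Express Capability structure. */
extern uint16_t pcie_cap_read16(unsigned bdf, unsigned offset);
extern void     pcie_cap_write16(unsigned bdf, unsigned offset, uint16_t value);

static void enable_baseline_error_reporting(unsigned bdf)
{
    uint16_t ctl = pcie_cap_read16(bdf, DEV_CONTROL_OFFSET);

    ctl |= DEVCTL_CORR_ERR_REPORT_EN | DEVCTL_NONFATAL_ERR_REPORT_EN |
           DEVCTL_FATAL_ERR_REPORT_EN | DEVCTL_UR_REPORT_EN;
    pcie_cap_write16(bdf, DEV_CONTROL_OFFSET, ctl);
}

static void clear_device_error_status(unsigned bdf)
{
    /* Status bits are cleared by writing a one (1) to each set bit. */
    pcie_cap_write16(bdf, DEV_STATUS_OFFSET,
                     pcie_cap_read16(bdf, DEV_STATUS_OFFSET));
}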

Link Errors

The physical link connecting two devices may fail, causing a variety of errors. Link failures are typically detected within the Physical Layer and communicated to the Data Link Layer. Because the link has incurred errors, the error cannot be reported to the host via the failed link. Therefore, link errors must be reported via the upstream port of switches or by the Root Port itself. Also, the related fields in the PCI Express Link Control and Status registers are only valid in Switch and Root downstream ports (never within endpoint devices or switch upstream ports). This permits system software to access link-related error registers on the port that is closest to the host.
If software can isolate one or more errors to a given link, one method of attempting to clear a non-correctable error is to retrain the link. The Link Control Register includes a bit that, when set, forces the Root or Switch port to retrain the link. If transactions (upon completion of the retraining) can once again traverse the link without errors, the problem has been solved. Figure 10-11 illustrates the Link Control Register and highlights the Retrain Link field that software sets to initiate retraining.
Figure 10-11: Link Control Register Allows Retraining of Link
Software can monitor the Link Training bit in the Link Status Register to determine when retraining has completed. Software can also check the Link Training Error bit to verify that link training was successful. Link Training errors are reported via the Fatal Error Message. Figure 10-12 highlights these status bits.
Figure 10-12: Link Retraining Status Bits within the Link Status Register
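Software-initiated retraining might look roughly like the sketch below: set the Retrain Link bit, wait for the Link Training bit to clear, then check the Link Training Error bit. The register offsets and bit positions used here are placeholders, and a real handler would bound the polling loop.

#include <stdint.h>
#include <stdbool.h>

/* Placeholder offsets and bit positions for this sketch. */
#define LINK_CONTROL_OFFSET    0x10
#define LINK_STATUS_OFFSET     0x12
#define LNKCTL_RETRAIN_LINK    (1u << 5)
#define LNKSTA_LINK_TRAINING   (1u << 11)
#define LNKSTA_TRAINING_ERROR  (1u << 10)

extern uint16_t pcie_cap_read16(unsigned bdf, unsigned offset);
extern void     pcie_cap_write16(unsigned bdf, unsigned offset, uint16_t value);

/* Returns true if retraining completed without a training error. */
static bool retrain_link(unsigned downstream_port_bdf)
{
    uint16_t ctl = pcie_cap_read16(downstream_port_bdf, LINK_CONTROL_OFFSET);

    pcie_cap_write16(downstream_port_bdf, LINK_CONTROL_OFFSET,
                     ctl | LNKCTL_RETRAIN_LINK);

    /* Wait for the Link Training bit to clear (unbounded here for brevity). */
    while (pcie_cap_read16(downstream_port_bdf, LINK_STATUS_OFFSET) &
           LNKSTA_LINK_TRAINING)
        ;

    return (pcie_cap_read16(downstream_port_bdf, LINK_STATUS_OFFSET) &
            LNKSTA_TRAINING_ERROR) == 0;
}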


Root's Response to Error Message

When an error message is received by the Root Complex, the action it takes when reporting the error to the host system is determined in part by the Root Control Register settings. Figure 10-13 depicts this register and highlights the three bit fields that specify whether each class of error message should be reported as a System Error (setting a bit enables generation of a system error for that class). In x86 systems it is likely that a Non-Maskable Interrupt (NMI) will be signaled if the error is to be reported as a SERR.
The PME Interrupt Enable bit (3) allows software to enable and disable interrupt generation upon the Root Complex detecting a PME Message transaction.
Other options for reporting error messages are not configurable via standard registers. The most likely scenario is that a system interrupt will be signaled to the processor that will call an Error Handler, which may attempt to clear the error condition and/or simply log the error.
Figure 10-13: Root Control Register

Advanced Error Reporting Mechanisms

Advanced Error Reporting requires implementation of the Advanced Error Reporting registers illustrated in Figure 10-14 on page 382. (Note that the lighter fields at the bottom of the Capability register diagram are used only for root ports, discussed later.) These registers provide several additional error reporting features:
  • finer granularity in defining the actual type of error that has occurred within each classification.
  • ability to specify the severity of each uncorrectable error type to determine whether it should be reported as a fatal or non-fatal error.
  • support for logging errors.
  • enable/disable Root Complex to report errors to the system.
  • identify source of the error.
  • ability to mask reporting individual types of errors.
Figure 10-14: Advanced Error Capability Registers

ECRC Generation and Checking

End-to-End CRC (ECRC) generation and checking can be enabled only if the Advanced Error Reporting Capability registers are implemented. Specifically, the Advanced Error Capability and Control register provides control over ECRC generation and checking as illustrated in Figure 10-15 on page 383.
Figure 10-15: The Advanced Error Capability & Control Register
This register reports whether this device supports ECRC generation and checking. If so, configuration software can enable one or both features.
In some cases, multiple uncorrectable errors may be detected prior to software reading and clearing the register. The First Error Pointer field identifies the bit position within the Advanced Uncorrectable Status register corresponding to the error that occurred first. (See Figure 10-18 on page 387.)
The First Error Pointer and the ECRC Check and Generation Enable bits must be implemented as sticky bits.

Handling Sticky Bits

Several of the Advanced Configuration Error Registers employ sticky fields. Many of these fields are single bits. The designations of sticky fields are as follows:
  • ROS - Read Only/Sticky
  • RWS - Read/Write/Sticky
  • RW1CS - Read/Write 1 to Clear/Sticky
Sticky error register fields behave differently in that a Hot Reset has no effect on the contents of these fields. For all other register fields, Hot Reset forces default values into the fields. Sticky bits are important in error handling to ensure that error-related control and status information is not lost due to a Hot Reset. Software may initiate a Hot Reset in an attempt to clear errors.

Advanced Correctable Error Handling

Advanced error reporting provides the ability to pinpoint specific correctable errors. These errors can selectively cause a Correctable Error Message to be sent to the host system:
  • Receiver Errors (optional) - caused when the Physical Layer detects an error in the incoming packet (TLP or DLLP). The Physical Layer discards the packet, frees buffer space allocated to the packet, and signals the Link Layer that a receive error has occurred. This error is reported by setting a status bit and must not result in the Link Layer also reporting an error for this packet (e.g., the Link Layer must not report a Bad TLP or Bad DLLP).
  • Bad TLPs - caused when the Link Layer detects a packet with a bad CRC check, an incorrectly nullified packet, or an incorrect Packet Sequence Number (not a duplicate). In each case, the Link Layer discards the packet and reports a NAK DLLP to the transmitter, which triggers a transaction retry.
  • Bad DLLPs - caused by a CRC check failure. This type of error may be corrected by a subsequent DLLP or a time-out that results ultimately in DLLP retry. The exact corrective action depends on the type of DLLP that has failed and the circumstances associated with packet transmission.
  • REPLAY_NUM Rollover - the REPLAY_NUM is a count maintained within the transmitting side of the link layer that keeps track of the number of times that a transaction had been retransmitted without successful delivery. When the count rolls over, hardware automatically retrains the link in an attempt to clear the fault condition.
  • Replay Timer Time-out - this timer is maintained within the transmitting side of the Link Layer and is intended to trigger a retry when forward progress in sending TLPs has stopped. A time-out occurs when unacknowledged TLPs have not received an acknowledgement within the time-out period. A time-out results in a retry of all unacknowledged TLPs.
Knowledge of which error has occurred can help system software to make better predictions of components that are likely to fail completely in the future. Software may also choose to mask recognition of some correctable errors while reporting others.

Advanced Correctable Error Status

When a correctable error occurs, the corresponding bit within the Advanced Correctable Error Status register is set. (See Figure 10-16.) These bits are automatically set by hardware and are cleared by software by writing a "1" to the bit position. These bits are set whether or not the error is reported via an Error Message. Each status bit in this register is designated RW1CS.
Figure 10-16: Advanced Correctable Error Status Register

Advanced Correctable Error Reporting

Whether a particular correctable error is reported to the host is specified by the Correctable Error Mask register illustrated in Figure 10-17. The default state of the mask bits is cleared (0), thereby causing a Correctable Error Message to be delivered when any of the correctable errors is detected. Software may choose to set one or more bits to prevent a Correctable Error Message from being sent when the selected error is detected. Each bit in this register is designated RWS.


Figure 10-17: Advanced Correctable Error Mask Register

Advanced Uncorrectable Error Handling

Advanced error reporting provides the ability to pinpoint which uncorrectable error has occurred. Furthermore software can specify the severity of each error and select which errors will result in an Error Message being sent to the host system (Root Complex).
  • Training Errors (optional) - caused by failure in the link training sequence at the Physical Layer.
  • Data Link Protocol Errors - caused by Link Layer protocol errors including the ACK/NAK retry mechanism. [Note: The specification states that "Violations of Flow Control initialization protocol are Data Link Layer Protocol Errors" and that checking these errors is optional. It seems likely that these errors should be treated as Flow Control Protocol errors (which are optional) and not Data Link Layer Protocol errors (which are required).]
  • Poisoned TLP Errors - caused by a data corruption (storage) error within memory or data buffers.
  • Flow Control Protocol Errors (optional) - errors associated with failures of the flow control mechanism within the transaction layer.
  • Completion Time-out Errors - caused by excessive delays in the return of the expected completion. This error is detected by the Requester.
  • Completer Abort Errors (optional) - occurs when the Completer cannot fulfill the transaction request due to a variety of possible problems with the request or failure of the completer device.
  • Unexpected Completion Errors - occur when a requester receives a completion transaction that does not match any request pending completion.
  • Receiver Overflow Errors (optional) - caused by a flow-control buffer overflow condition.
  • Malformed TLPs - caused by processing errors associated with the transaction header. This error is detected within the Transaction Layer of the device receiving the TLP.
  • ECRC Errors (optional) - caused by an End-to-End CRC (ECRC) check failure within the transaction layer of the receiving device.
  • Unsupported Request Errors - occur when the Completer does not support the request. The request is correctly formed and has not incurred any detected error during transport; however, the transaction cannot be fulfilled by the completer for a variety of reasons, including an illegal access or an invalid command for this device.
The errors noted as optional in the list above may not be implemented in the Advanced Uncorrectable register set.

Advanced Uncorrectable Error Status

When an uncorrectable error occurs, the corresponding bit within the Advanced Uncorrectable Error Status register is set. (See Figure 10-18 on page 387.) These bits are automatically set by hardware and are cleared by software by writing a "1" to the bit position. These bits are set whether or not the error is reported via an Error Message. Each status bit in this register is designated RW1CS.
Figure 10-18: Advanced Uncorrectable Error Status Register


Selecting the Severity of Each Uncorrectable Error

Advanced error handling permits software to select the severity of each error within the Advanced Uncorrectable Error Severity register. This gives software the opportunity to treat errors according to the severity associated with a given application. For example, a Poisoned TLP carrying audio data being sent to a speaker, while not correctable, has no serious side effects on the integrity of the system. However, if real-time status information is being retrieved that will help make critical decisions, any error in this data can be very serious. Figure 10-19 illustrates the Error Severity register. The default values are illustrated in the individual bit fields. These represent the default severity levels for each type of error (1 = Fatal, 0 = Non-Fatal).
Figure 10-19: Advanced Uncorrectable Error Severity Register
Those uncorrectable errors that are selected to be non-fatal will result in a Non-Fatal Error Message being delivered, and those selected as fatal will result in a Fatal Error Message being delivered. However, whether or not an Error Message is generated for a given error is specified in the Advanced Uncorrectable Error Mask register.

Uncorrectable Error Reporting

Software can mask out specific errors so that they never cause an Error Message to be generated. The default condition is to generate Error Messages for each type of error (all bits are cleared). Figure 10-20 on page 389 depicts the Advanced Uncorrectable Error Mask register.


Figure 10-20: Advanced Uncorrectable Error Mask Register

Error Logging

A four DWord portion of the Advanced Error Registers block is reserved for storing the header of the transaction that has incurred a failure. Only a select group of Transaction Layer errors result in the transaction header being logged. Table 10-6 lists the transactions that are logged.
Table 10-6: Transaction Layer Errors That are Logged
  Name of Error           Default Classification
  Poisoned TLP Received   Uncorrectable - Non-Fatal
  ECRC Check Failed       Uncorrectable - Non-Fatal
  Unsupported Request     Uncorrectable - Non-Fatal
  Completion Abort        Uncorrectable - Non-Fatal
  Unexpected Completion   Uncorrectable - Non-Fatal
  Malformed TLP           Uncorrectable - Fatal
The format of the header is preserved when captured and placed into the register. That is, the illustration of header format in this book is exactly how the headers will appear within the Error Logging register. Note also that the contents of this register are designated ROS.

Root Complex Error Tracking and Reporting

The Root Complex is the target of all Error Messages issued by devices within the PCI Express fabric. Errors received by the Root Complex result in status registers being updated and the error being conditionally reported to the appropriate software handler or handlers.

Root Complex Error Status Registers

When the Root Complex receives an Error Message, it sets status bits within the Root Error Status register (Figure 10-21 on page 391). This register indicates the types of errors received and also indicates when multiple errors of the same type have been received. Note that an error detected at the root port is treated as if the port sent itself an Error Message.
The Advanced Root Error Status register tracks the occurrence of errors as follows:

Correctable Errors

  • Sets the "Received Correctable Error" bit upon receipt of the first ERR_COR Message, or detection of a root port correctable error.
  • Sets the "Multiple Correctable Error Message Received" bit upon receipt of an ERR_COR Message, or detection of a root port correctable error when the "Received Correctable Error" bit is already set.

Uncorrectable Errors

  • Sets the "Received Uncorrectable Error" bit upon receipt of the first ERR_FATAL or ERR_NONFATAL Error Message, or detection of a root port uncorrectable error.
  • Set the "Multiple Uncorrectable Error Message Received" bit upon receipt of an ERR_FATAL or ERR_NONFATAL Message, or detection of a root port correctable error when the "Received Uncorrectable Error" bit is already set.

Detecting and Reporting First Uncorrectable Fatal versus Non-Fatal Errors

A system may wish to implement separate software error handlers for Correctable, Non-Fatal, and Fatal errors. The Root Error Status register includes bits to differentiate Correctable from Uncorrectable errors, but additional bits are needed to determine whether an Uncorrectable error is fatal or non-fatal:
  • If the first Uncorrectable Error Message received is FATAL the "First Uncorrectable Fatal" bit is also set along with the "Fatal Error Message Received" bit.
  • If the first Uncorrectable Error Message received is NONFATAL the "NonFatal Error Message Received" bit is set. (If a subsequent Uncorrectable Error is Fatal, the "Fatal Error Message Received" bit will be set, but because the "First Uncorrectable Fatal" remains cleared, software knows that the first Uncorrectable Error received was Non-Fatal.)
Figure 10-21: Root Error Status Register
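A root-level error handler might decode the Root Error Status register along the lines of the sketch below. The bit positions are placeholders for this illustration; the logic simply mirrors the rules described above for distinguishing the first fatal versus non-fatal uncorrectable error.

#include <stdint.h>
#include <stdio.h>

/* Placeholder bit positions for this sketch. */
#define RES_CORR_ERR_RCVD       (1u << 0)
#define RES_UNCORR_ERR_RCVD     (1u << 2)
#define RES_FIRST_UNCORR_FATAL  (1u << 4)
#define RES_NONFATAL_MSG_RCVD   (1u << 5)
#define RES_FATAL_MSG_RCVD      (1u << 6)

static void dispatch_root_error(uint32_t root_err_status)
{
    if (root_err_status & RES_CORR_ERR_RCVD)
        printf("correctable error message received\n");

    if (root_err_status & RES_FATAL_MSG_RCVD) {
        if (root_err_status & RES_FIRST_UNCORR_FATAL)
            printf("first uncorrectable error received was fatal\n");
        else
            printf("fatal error received after an earlier non-fatal error\n");
    } else if (root_err_status & RES_NONFATAL_MSG_RCVD) {
        printf("non-fatal uncorrectable error message received\n");
    }
}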

Advanced Source ID Register

Software error handlers may need to read and clear error status registers within the device that detected and reported the error. The Error Messages contain the ID of the device reporting the error. The Source ID register captures the Error Message ID associated with the first Fatal and first Non-Fatal Error Message received by the Root Complex. The format of this register is shown in Figure 10-22 on page 391.
Figure 10-22: Advanced Source ID Register
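The captured ID has the standard Requester ID layout (Bus number, Device number, Function number), so an error handler can recover the reporting device's location as in this small sketch:

#include <stdint.h>

/* Requester ID layout: Bus[15:8], Device[7:3], Function[2:0]. */
static unsigned id_bus(uint16_t requester_id)      { return (requester_id >> 8) & 0xFFu; }
static unsigned id_device(uint16_t requester_id)   { return (requester_id >> 3) & 0x1Fu; }
static unsigned id_function(uint16_t requester_id) { return requester_id & 0x07u; }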


Root Error Command Register

The Root Error Status register contains status bits that indicate whether a Correctable, Fatal, or Non-Fatal error has occurred. In conjunction with these status bits, the Root Complex can also generate separate interrupts that call handlers for each of the error categories. The Root Error Command register enables interrupt generation for all three categories, as pictured in Figure 10-23 on page 392.
Figure 10-23: Advanced Root Error Command Register

Reporting Errors to the Host System

Software error handlers will initially read Root Complex status registers to determine the nature of the error, and may also need to read device-specific error registers of the device that reported the error.

Summary of Error Logging and Reporting

The actions taken by a function when an error is detected are governed by the type of error and the settings of the error-related configuration registers. The specification includes the flow chart in Figure 10-24 on page 393, which specifies the actions taken by a device upon detecting an error. This flow chart presumes that PCI Express-compatible software is being used and does not cover the case of error handling when only legacy PCI software is used.
Figure 10-24: Error Handling Flow Chart
Part Three
The Physical Layer

11 Physical Layer Logic

The Previous Chapter

The previous chapter discussed both correctable and uncorrectable errors and the mechanisms used to log and report them. Prior to discussing the PCI Express error reporting capabilities, a brief review of PCI error handling was included as background information.

This Chapter

This chapter describes the Logical characteristics of the Physical Layer core logic. It describes how an outbound packet is processed before clocking the packet out differentially. The chapter also describes how an inbound packet arriving from the Link is processed and sent to the Data Link Layer. The chapter describes sub-block functions of the Physical Layer such as Byte Striping and Un-Striping logic, Scrambler and De-Scrambler, 8b/10b Encoder and Decoder, Elastic Buffers and more.

The Next Chapter

The next chapter describes the electrical characteristics of the Physical Layer. It describes the analog characteristics of the differential drivers and receivers that connect a PCI Express device to the Link.

Physical Layer Overview

The Physical Layer shown in Figure 11-1 on page 398 connects to the Link on one side and interfaces to the Data Link Layer on the other side. The Physical Layer processes outbound packets before transmission to the Link and processes inbound packets received from the Link. The two sections of the Physical Layer associated with transmission and reception of packets are referred to as the transmit logic and the receive logic throughout this chapter.


The transmit logic of the Physical Layer essentially processes packets arriving from the Data Link Layer, then converts them into a serial bit stream. The bit stream is clocked out at 2.5 Gbits/s/Lane onto the Link.
The receive logic clocks in a serial bit stream arriving on the Lanes of the Link with a clock that is recovered from the incoming bit stream. The receive logic converts the serial bit stream into a parallel symbol stream, processes the incoming symbols, assembles packets, and sends them to the Data Link Layer.
Figure 11-1: Physical Layer
In the future, data rates per Lane are expected to go to 5 Gbits/s, 10 Gbits/s and beyond. When this happens, an existing design can be adapted to the higher data rates by redesigning the Physical Layer while maximizing reuse of the Data Link Layer, Transaction Layer and Device Core/Software Layer. The Physical Layer may be designed as a standalone entity separate from the Data Link Layer and Transaction Layer. This allows a design to be migrated to higher data rates or even to an optical implementation if such a Physical Layer is supported in the future.
Two sub-blocks make up the Physical Layer. These are the logical Physical Layer and the electrical Physical Layer as shown in Figure 11-2. This chapter describes the logical sub-block, and the next chapter describes the electrical subblock. Both sub-blocks are split into transmit logic and receive logic (independent of each other) which allow dual simplex communication.
Figure 11-2: Logical and Electrical Sub-Blocks of the Physical Layer

Disclaimer

To facilitate description of the Physical Layer functionality, an example implementation is described that is not necessarily the implementation assumed by the specification nor is a designer compelled to implement a Physical Layer in such a manner. A designer may implement the Physical Layer in any manner that is compliant with the functionality expected by the PCI Express specification.

Transmit Logic Overview

Figure 11-3 on page 401 shows the elements that make up the transmit logic:
  • a multiplexer (mux),
  • byte striping logic (only necessary if the link implements more than one data lane),
  • scramblers,
  • 8b/10b encoders,
  • and parallel-to-serial converters.
TLPs and DLLPs from the Data Link layer are clocked into a Tx (transmit) Buffer. With the aid of a multiplexer, the Physical Layer frames the TLPs or DLLPs with Start and End characters. These characters are framing symbols which the receiver device uses to detect start and end of packet.
The framed packet is sent to the Byte Striping logic which multiplexes the bytes of the packet onto the Lanes. One byte of the packet is transferred on one Lane, the next byte on the next Lane and so on for the available Lanes.
The Scrambler uses an algorithm to pseudo-randomly scramble each byte of the packet. The Start and End framing bytes are not scrambled. Scrambling eliminates repetitive patterns in the bit stream. Repetitive patterns result in large amounts of energy concentrated in discrete frequencies which leads to significant EMI noise generation. Scrambling spreads energy over a frequency range, hence minimizing average EMI noise generated.
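The generator polynomial commonly cited for PCI Express scrambling is G(X) = X^16 + X^5 + X^4 + X^3 + 1, with the LFSR seeded to FFFFh. The sketch below only conveys the general idea of a data byte being XORed with bits taken from such an LFSR; details such as which characters bypass scrambling, when the LFSR is reset or advanced, and the exact bit ordering are simplified.

#include <stdint.h>

static uint16_t lfsr = 0xFFFF;  /* seed value */

/* XOR each data bit with the LFSR output, then advance the LFSR one step
 * using a Galois-style implementation of G(X) = X^16 + X^5 + X^4 + X^3 + 1. */
static uint8_t scramble_byte(uint8_t data)
{
    uint8_t out = 0;

    for (int i = 0; i < 8; i++) {
        uint16_t msb = (lfsr >> 15) & 1u;

        out  |= (uint8_t)((((data >> i) & 1u) ^ msb) << i);
        lfsr  = (uint16_t)(lfsr << 1);
        if (msb)
            lfsr ^= 0x0039;  /* feed back into the X^5, X^4, X^3 and 1 taps */
    }
    return out;
}

Because scrambling is a simple XOR against the LFSR output, running the same routine at the receiver (with its LFSR kept in step with the transmitter's) recovers the original byte.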
The scrambled 8-bit characters (8b characters) are encoded into 10-bit symbols (10b symbols) by the 8b/10b Encoder logic. And yes, there is a 20% loss in transmission performance (a 25% overhead) due to the expansion of each byte into a 10-bit symbol. A Character is defined as the 8-bit un-encoded byte of a packet. A Symbol is defined as the 10-bit encoded equivalent of the 8-bit character. The primary purpose of 8b/10b encoding the packet characters is to create sufficient 1-to-0 and 0-to-1 transition density in the bit stream so that the receiver can re-create a receive clock with the aid of a receiver Phase-Locked Loop (PLL). Note that the clock used to clock the serial data bit stream out of the transmitter is not itself transmitted onto the wire. Rather, the recovered receive clock is used to clock in an inbound packet.
The 10b symbols are converted to a serial bit stream by the Parallel-to-Serial converter. This logic uses a 2.5GHz clock to serially clock the packets out on each Lane. The serial bit stream is sent to the electrical sub-block which differentially transmits the packet onto each Lane of the Link.
Figure 11-3: Physical Layer Details


Receive Logic Overview

Figure 11-3 shows the elements that make up the receiver logic:
  • receive PLL,
  • serial-to-parallel converter,
  • elastic buffer,
  • 8b/10b decoder,
  • de-scrambler,
  • byte un-striping logic (only necessary if the link implements more than one data lane),
  • control character removal circuit,
  • and a packet receive buffer.
As the data bit stream is received, the receiver PLL is synchronized to the clock frequency with which the packet was clocked out of the remote transmitter device. The transitions in the incoming serial bit stream are used to re-synchronize the PLL circuitry and maintain bit and symbol lock while generating a clock recovered from the data bit stream. The serial-to-parallel converter is clocked by the recovered clock and outputs 10b symbols.
The 10b symbols are clocked into the Elastic Buffer using the recovered clock associated with the receiver PLL. The Elastic Buffer is used for clock tolerance compensation; i.e., the Elastic Buffer adjusts for minor clock frequency variation between the recovered clock used to clock the incoming bit stream into the Elastic Buffer and the locally-generated clock that is used to clock data out of the Elastic Buffer.
The 10b symbols are converted back to 8b characters by the 8b/10b Decoder. The Start and End characters that frame a packet are eliminated. The 8b/10b Decoder also looks for errors in the incoming 10b symbols. For example, error detection logic can check for invalid 10b symbols or detect a missing Start or End character.
The De-Scrambler reproduces the de-scrambled packet stream from the incoming scrambled packet stream. The De-Scrambler implements the inverse of the algorithm implemented in the transmitter Scrambler.
The bytes from each Lane are un-striped to form a serial byte stream that is loaded into the receive buffer to feed to the Data Link layer.

Physical Layer Link Active State Power Management

The full-on power state of the Physical Layer and Link is called the L0 state. Devices support two lower power states, L0s (L0 suspend) and L1 Active, that are actively and automatically managed in hardware. The L1 Active power state is a lower power state than L0s and is optionally supported. The L0s power state is managed by the Physical Layer. The L1 Active power state is managed by a combination of the Data Link Layer and the Physical Layer.
A Link can be placed in the L0s power state in one direction independent of the other direction while a Link in the L1 Active power state is in this state in both directions.
Software enables support of the L0s and L1 Active power states via configuration registers. After reset, these registers are in a state that disables lower power state functionality. The Physical Layer automatically manages entering these lower power states upon detection of a period of inactivity on the Link. Once a device is in L0s or L1 Active and it intends to transmit packets, it can transition its Link power state back to L0. The exit latency from L1 Active is greater than the exit latency from L0s.
Additional details on Link Active State Power Management are covered in "Link Training and Status State Machine (LTSSM)" on page 508 and in "Link Active State Power Management" on page 608.

Link Training and Initialization

The Physical Layer is responsible for Link Initialization and Training. The process is described in "Link Initialization and Training Overview" on page 500.

Transmit Logic Details

Figure 11-4 on page 406 shows the transmit logic of the Logical Physical Layer. This section describes packet processing from the time packets are received from the Data Link Layer until the packet is clocked out of the Physical Layer onto the Link.

Tx Buffer

The Tx Buffer receives TLPs and DLLPs from the Data Link Layer. Along with the packets, the Data Link Layer indicates the start and end of the packet using a 'Control' signal so that the Physical Layer can append Start and End framing characters to the packet. The Tx Buffer uses a 'throttle' signal to throttle the flow of packets from the Data Link Layer in case the Tx Buffer fills up.

Multiplexer (Mux) and Mux Control Logic

General

The Mux shown in Figure 11-5 on page 407 primarily gates packet characters from the Tx Buffer to the Byte Striping logic (only necessary if the link implements more than one data lane). However, under certain circumstances, the Mux may gate other inputs to the Byte Striping logic. Here is a summary of the four Mux inputs and when they are gated:
  • Transmit Data Buffer. When the Data Link Layer supplies a packet to be transmitted, the Mux gates the packet's character stream through to the Byte Striping logic. Characters within the Tx Buffer are Data or 'D' characters. Hence the D/K# signal is driven High when Tx Buffer contents are gated. See "Definition of Characters and Symbols" on page 405.
  • Start and End characters. These Control characters are appended to the start and end of every TLP and DLLP as shown in Figure 11-6 on page 408. These framing characters allow a receiver to easily detect the start and end of a packet. There are two types of Start characters: the start TLP character (STP) and the start DLLP character (SDP). There are two types of End characters: the End Good TLP or DLLP character (END) and the End Bad TLP character (EDB). See Table 11-5 on page 432 for a list of Control characters. A control signal from the Data Link Layer, in combination with the packet type, determines which framing character to gate. Start and End characters are Control or 'K' characters; hence, the D/K# signal is driven Low when the Start and End characters are gated out at the start and end of a packet, respectively.
  • Ordered-Sets. Ordered-Sets are sequences of four characters (or multiples of four characters) that start with a comma (COM) control character followed by other characters. They are transmitted during special events as described below:
  • During Link training, Training Sequence 1 and 2 (TS1 and TS2) Ordered-Sets are transmitted over the Link. Link training occurs after fundamental reset, hot reset, or after certain error conditions occur. Refer to "Ordered-Sets Used During Link Training and Initialization" on page 504 for detailed usage of TS1 and TS2 Ordered-Sets.
  • At periodic intervals, the Mux gates the SKIP Ordered-Set pattern through to the Byte Striping logic to facilitate clock tolerance compensation in the receiver circuit of the port at the other end of the Link. For a detailed description, refer to "Inserting Clock Compensation Zones" on page 436 and "Receiver Clock Compensation Logic" on page 442.
  • When a device wants to place its transmitter in the electrical Idle state, it must inform the remote receiver at the other end of the Link. The device gates an electrical Idle Ordered-Set to do so.
  • When a device wants to change the Link power state from L0s low power state to the L0 full-on power state, it transmits Fast Training Sequence (FTS) Ordered-Sets to the receiver. The receiver uses this Ordered-Set to re-synchronize its PLL to the transmitter clock.
  • Ordered-Sets begin with a K character and, depending on the type of set, may contain D or K characters. Hence, during transmission of an Ordered-Set, the D/K# signal is driven Low for a clock and then may be driven High or Low thereafter.
  • Logical Idle Sequence. When there are no packets to transmit on the Link (referred to as a Logically Idle Link), rather than leave the Link in a floating state or drive nothing, Logical Idle characters are gated. Doing so guarantees signal transitions on the Link, thus allowing the receiver's PLL to maintain clock synchronization with the transmit clock. In addition, the receiver is able to maintain bit and symbol lock. The Logical Idle sequence consists of transmitting 00h characters. It therefore consists of D type characters; hence, the D/K# signal is High while the Mux is gating Logical Idle sequences, as the sketch following this list illustrates.
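The D/K# behavior described in the list above can be summarized in a few lines of code. The following is a minimal sketch (the enumeration names and the function itself are illustrative assumptions, not interfaces defined by the specification) of how a transmit-side Mux might report the D/K# indication for whichever source it is currently gating:

    from enum import Enum, auto

    class MuxSource(Enum):
        TX_BUFFER = auto()     # TLP/DLLP characters from the Data Link Layer
        FRAMING = auto()       # STP/SDP/END/EDB framing characters
        ORDERED_SET = auto()   # TS1/TS2, SKIP, FTS, Electrical Idle Ordered-Sets
        LOGICAL_IDLE = auto()  # 00h characters when there is nothing to transmit

    def dk_level(source, char_is_control=False):
        """Model the D/K# indication that accompanies each gated character.

        Returns True for 'D' (Data) characters and False for 'K' (Control)
        characters. Ordered-Sets begin with a K character (COM) and may then
        contain D or K characters, so the caller indicates which one this is.
        """
        if source in (MuxSource.TX_BUFFER, MuxSource.LOGICAL_IDLE):
            return True                 # packet bytes and Logical Idle 00h are D characters
        if source is MuxSource.FRAMING:
            return False                # Start and End framing characters are K characters
        return not char_is_control      # Ordered-Set: COM is K, later characters vary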

Definition of Characters and Symbols

Each character is 8-bits in size. Characters are grouped into two categories: Control or 'K' characters, and Data or 'D' characters. From the standpoint of 8b/10b Encoding, a D character is encoded into a different 10-bit symbol than a K character of the same 8-bit value. Each 10-bit encoded character is referred to as a symbol.
Figure 11-4: Physical Layer Transmit Logic Details


Figure 11-5: Transmit Logic Multiplexer
Figure 11-6: TLP and DLLP Packet Framing with Start and End Control Characters

Byte Striping (Optional)

When a port implements more than one data Lane (i.e., more than one serial data path on the external Link), the packet data is striped across the 2, 4, 8, 12, 16, or 32 Lanes by the Byte Striping logic. Striping means that each consecutive outbound character in the character stream is multiplexed onto a consecutive Lane. Examples of Byte Striping are illustrated in Figure 11-7 on page 409, Figure 11-8 on page 410, and Figure 11-9 on page 411. The number of Lanes used is configured during the Link training process.
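The striping operation itself is a simple round-robin distribution of the framed character stream. The sketch below (the function name is an illustrative assumption, not a term from the specification) shows the idea:

    def byte_stripe(characters, num_lanes):
        """Distribute a framed character stream across the Lanes, round-robin.

        characters: the framed packet stream (Start character, packet bytes,
                    End character, ...), already a multiple of 4 in length.
        num_lanes:  1, 2, 4, 8, 12, 16 or 32, as negotiated during Link training.
        Returns a list of per-Lane character lists: character 0 goes to Lane 0,
        character 1 to Lane 1, and so on, wrapping back to Lane 0.
        """
        lanes = [[] for _ in range(num_lanes)]
        for index, char in enumerate(characters):
            lanes[index % num_lanes].append(char)
        return lanes

    # Example: an 8-character DLLP (SDP + 6 bytes + END) striped across a x4 Link.
    framed_dllp = ["SDP", "B0", "B1", "B2", "B3", "B4", "B5", "END"]
    print(byte_stripe(framed_dllp, 4))
    # Lane 0: [SDP, B3]  Lane 1: [B0, B4]  Lane 2: [B1, B5]  Lane 3: [B2, END]

Note that the END character lands on Lane 3, which is consistent with the x4 packet format rules described later in this section.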
Disclaimer: This example assumes that the Byte Striping logic is implemented before the Scrambler and 8b/10b Encoder. Every Lane implements a Scrambler and an 8b/10b Encoder. This permits a receiver Physical Layer to detect errors on any Lane independent of the other Lanes. For example, an error that may have occurred in the transmitter Scrambler or 8b/10b Encoder is detectable if a receiver detects an invalid 10b symbol on a given Lane. When an error is detected on a Lane and cannot be cleared, the Lane could be disabled and the Link re-trained and re-initialized with fewer Lanes. This error recovery feature is suggested, not required, by the specification.
On the other hand, to simplify and reduce the size of the Physical Layer logic, a designer may choose to place the Byte Striping logic after the Scrambler and 8b/10b Encoder but before the Parallel-to-Serial converter. This reduces the number of Scramblers and 8b/10b Encoders to one. If the receiver detects an error in the incoming bit stream however, it cannot isolate the error to a particular Lane.
Figure 11-7: x1 Byte Striping


Figure 11-8: x4 Byte Striping


Figure 11-9: x8,x12,x16,x32 Byte Striping

Packet Format Rules

After passing through the Byte Striping logic, a TLP or DLLP character stream is striped across the Lanes. This section describes the rules used to byte stripe packets so that the packets are correctly striped across the Lanes of the Link.
General Packet Format Rules. The following are the general packet format rules:
  • The total packet length (including Start and End characters) of each packet must be a multiple of four characters (the framing sketch after this list checks this rule).
  • TLPs always start with the STP character.
  • DLLPs always start with SDP and are 8 characters long (6 characters + SDP + END)
  • All TLPs terminate with either an END or EDB character.
  • DLLPs terminate with the END character.
  • STP and SDP characters must be placed on Lane 0 when starting the transmission of a packet after the transmission of Logical Idles. If not starting a packet transmission from Logical Idle (i.e. back-to-back transmission of packets), then STP and SDP must start on a Lane number divisible by 4.
  • Any violation of these rules may be reported as a Receiver Error to the Data Link Layer.
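As a concrete illustration of the framing and length rules, the following sketch builds a framed packet and checks the multiple-of-four rule (the helper name is an illustrative assumption; the character mnemonics stand in for the K characters listed in Table 11-5 on page 432):

    def frame_packet(payload, is_tlp, good=True):
        """Frame a TLP or DLLP with its Start and End control characters.

        payload: the packet bytes handed down by the Data Link Layer.
        is_tlp:  True for a TLP (STP ... END/EDB), False for a DLLP (SDP ... END).
        good:    when False, a TLP is ended with EDB (nullified) instead of END.
        """
        start = "STP" if is_tlp else "SDP"
        end = "END" if (good or not is_tlp) else "EDB"
        framed = [start] + list(payload) + [end]
        # Total packet length, including the Start and End characters, must be
        # a multiple of four characters; a DLLP is always SDP + 6 bytes + END = 8.
        if len(framed) % 4 != 0:
            raise ValueError("framed packet length must be a multiple of 4")
        return framed

    print(frame_packet(["B%d" % i for i in range(6)], is_tlp=False))
    # ['SDP', 'B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'END']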
x1 Packet Format Example. Figure 11-10 on page 413 illustrates the format of packets transmitted over a x1 Link (i.e., a Link with only one Lane operational). The illustration shows the following sequence of packets:
  1. One TLP.
  2. One 8-byte DLLP.
  3. One clock compensation packet consisting of a SKIP Ordered-Set (i.e., a COM followed by three SKP characters).
  4. Two TLPs.
  5. One 8-byte DLLP.
  6. One TLP.
  7. A Flow Control Packet.
  8. Logical Idles transmitted because there are no more packets to transmit.
x4 Packet Format Rules. The following rules apply when a packet is transmitted over a x4 Link (i.e., a Link with four Lanes):
  • STP and SDP characters are always transmitted on Lane 0.
  • END and EDB characters are always transmitted on Lane 3.
  • When an Ordered-Set such as the SKIP Ordered-Set is transmitted (for clock compensation in the receiver), it must be sent on all four Lanes simultaneously.
  • When Logical Idle sequences are transmitted, they must be transmitted on all Lanes.
  • Any violation of these rules may be reported as a Receiver Error to the Data Link Layer.
x4 Packet Format Example. Figure 11-11 on page 414 illustrates the format of packets transmitted over a x4 Link (i.e., a Link with four data Lanes operational). The illustration shows the following sequence of packets:
  1. One TLP.
  2. A SKIP Ordered-Set transmitted on all Lanes for periodic receiver clock compensation.
  3. A DLLP.
  4. Logical Idles on all Lanes because there are no more packets to transmit.
Figure 11-10: x1 Packet Format
x8, x12, x16 or x32 Packet Format Rules. The following rules apply when a packet is transmitted over a x8, x12, x16, or x32 Link:
  • STP/SDP characters are always transmitted on Lane 0 when transmission starts after a period during which Logical Idles are transmitted.
  • STP/SDP characters may only be transmitted on Lane numbers divisible by 4 when transmitting back-to-back packets.
  • END/EDB characters are transmitted on the Lane just below a Lane number divisible by 4 (i.e., on Lane 3, 7, 11, and so on).
  • If a packet doesn't end on the last Lane and there are no more packet transmissions, PAD symbols are transmitted on the Lanes above the Lane on which the END/EDB character is transmitted. This keeps the Link aligned so that transmission of the Logical Idle sequence can start on all Lanes at the same time.


  • When an Ordered-Set such as the SKIP Ordered-Set is transmitted (for clock compensation in the receiver), it must be sent on all Lanes simultaneously.
  • When Logical Idle sequences are transmitted, they must be transmitted on all Lanes.
  • Any violation of these rules may be reported as a Receiver Error to the Data Link Layer.
Figure 11-11: x4 Packet Format
Note: There are no PAD characters to transmit on a x4 Link because all packets are multiples of 4 bytes.
x8 Packet Format Example. Figure 11-12 on page 415 illustrates the format of packets transmitted over a x8 Link (i.e., a Link with 8 Lanes operational). The illustration shows the following sequence of packets:
  1. A TLP.
  2. A SKIP Ordered-Set transmitted on all Lanes for periodic receiver clock compensation.
  3. A DLLP.
  4. A TLP that ends on Lane 3. The remaining Lanes are filled with PADs so that the Link is aligned for the next transmission.
  5. Logical Idles on all Lanes because there are no more packets to transmit.
Figure 11-12: x8 Packet Format
PAD characters are transmitted to maintain packet framing alignment

Scrambler

After byte striping, the outbound packets are transmitted across the Lanes. As shown in Figure 11-4 on page 406, each Lane in the Physical Layer design incorporates a Scrambler.

Purpose of Scrambling Outbound Transmission

The Scrambler eliminates the generation of repetitive patterns in the transmitted data stream. As an example, when scrambled, a stream of 0s results in a pseudo-random bit pattern.
Repetitive patterns result in large amounts of energy concentrated in discrete frequencies, which results in significant EMI noise generation. By scrambling the transmitted data, repetitive patterns (such as 10101010b) are eliminated. As a result, no single frequency component of the signal is transmitted for significant periods of time. Thus the radiated EMI energy of a transmission is spread over a range in the frequency spectrum. This technique, referred to as 'spread spectrum', effectively 'whitens' the frequency content of a signal and reduces the radiated power at any particular frequency.
On a bare system board with the wires of the Link un-shielded and high-frequency transmission at 2.5 Gbits/s, EMI noise generation is significant. Scrambling makes the radiated power from the Link effectively look like white noise. This helps meet FCC requirements.
Also, on a multi-Lane Link with wires routed in close proximity, a scrambled transmission on one Lane generates white noise which does not interfere or correlate with another Lane's data transmission. This 'spatial frequency de-correlation' or reduction of crosstalk noise assists the receiver on each Lane to distinguish the desired signal from the background white noise.

Scrambler Algorithm

The Scrambler in Figure 11-13 on page 418 is implemented with a 16-bit Linear Feedback Shift Register (LFSR) that implements the polynomial:
G(X) = X^16 + X^5 + X^4 + X^3 + 1
The LFSR is clocked at the bit transfer rate. The LFSR output is serially clocked into an 8-bit register that is XORed with the 8-bit characters to form the scrambled data.
Implementation Note: The LFSR bit rate clock is 8 times the frequency (2GHz) of the byte clock (250MHz) that feeds the Scrambler output.
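The following is a minimal behavioral sketch of the Scrambler data path, written in a Galois-style formulation. Because the specification's reference implementation defines a particular bit-serial arrangement, treat the exact scrambled values produced here as illustrative rather than normative; the choice of which LFSR bits are XORed with the character is an assumption of this sketch.

    def lfsr_advance(lfsr):
        """Advance the 16-bit LFSR one bit for G(X) = X^16 + X^5 + X^4 + X^3 + 1.

        Galois form: when the bit falling off the top (the X^16 term) is 1, the
        feedback is XORed into the taps for X^5, X^4, X^3 and X^0 (mask 0x0039).
        """
        feedback = (lfsr >> 15) & 1
        lfsr = (lfsr << 1) & 0xFFFF
        if feedback:
            lfsr ^= 0x0039
        return lfsr

    def scramble_byte(byte, lfsr):
        """Scramble one D character, then advance the LFSR eight times (once
        per bit time) to prepare it for the next character. K characters and
        SKP characters bypass scrambling and are not modeled here."""
        scrambled = byte ^ (lfsr >> 8)      # illustrative choice of LFSR output bits
        for _ in range(8):
            lfsr = lfsr_advance(lfsr)
        return scrambled, lfsr

    lfsr = 0xFFFF                           # value loaded whenever a COM is transmitted
    out = []
    for ch in [0x00, 0x00, 0x00, 0x00]:     # a run of Logical Idle characters
        s, lfsr = scramble_byte(ch, lfsr)
        out.append(s)
    print([hex(b) for b in out])            # repetitive 00h input, non-repetitive output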

Some Scrambler implementation rules:

  • On a multi-Lane Link implementation, Scramblers associated with each Lane must operate in concert, maintaining the same simultaneous value in each LFSR.
  • Scrambling is applied only to 'D' characters associated with TLPs and DLLPs, including the Logical Idle (00h) sequence. 'D' characters within the TS1 and TS2 Ordered-Sets are not scrambled.
  • 'K' characters and characters within Ordered-Sets (such as the TS1, TS2, SKIP, FTS and Electrical Idle Ordered-Sets) are not scrambled. These characters bypass the Scrambler logic.
  • Compliance Pattern related characters are not scrambled.
  • When a COM character exits the Scrambler (the COM itself is not scrambled), it initializes the LFSR. The initialized value of the 16-bit LFSR is FFFFh. Similarly, on the receiver side, when a COM character enters the De-Scrambler, the De-Scrambler's LFSR is initialized.
  • With one exception, the LFSR serially advances eight times for every character (D or K character) transmission. The LFSR does NOT advance on SKP characters associated with the SKIP Ordered-Set. The reason the LFSR is not advanced for SKPs is because a receiver of inbound packets may add or delete SKP symbols to perform clock tolerance compensation. Changing the number of characters in the receiver from the number of characters transmitted will cause the value in the receiver LFSR to lose synchronization with the transmitter LFSR value. For a detailed description, refer to "Inserting Clock Compensation Zones" on page 436 and "Receiver Clock Compensation Logic" on page 442.
  • By default, Scrambling is always enabled. Although the specification does allow the Scrambler to be disabled for test and debug purposes, it does not provide a standard software or configuration register-related method to disable the Scrambler.


Figure 11-13: Scrambler

Disabling Scrambling

As stated in the previous section, the Scrambler can be disabled to help facilitate test and debug. Software or test equipment may tell a device to disable scrambling. However, the specification does not indicate the mechanism by which a device's Physical Layer is instructed to disable scrambling.
Scrambling is disabled during the Configuration state of Link Training, described on page 519. The device receiving the software request to disable scrambling does so during the Link Training Configuration state, and transmits at least two TS1/TS2 Ordered-Sets with the disable scrambling bit set on all of its configured Lanes to the remote device it is connected to. The remote receiver device then disables its Scrambler/De-Scrambler. The Port that sends the Disable Scrambling request is also required to disable its own scrambling.

8b/10b Encoding

General

Each Lane of a device's transmitter implements an 8-bit to 10-bit Encoder that encodes 8-bit data or control characters into 10-bit symbols. The coding scheme was invented by IBM in 1982 and is documented in the ANSI X3.230-1994 document, clause 11 (and also IEEE 802.3z, 36.2.4) and US Patent Number 4,486,739 entitled "Byte Oriented DC Balanced 8b/10b Partitioned Block Transmission Code". 8b/10b coding is now widely used in architectures such as Gigabit Ethernet, Fibre Channel, ServerNet, FICON, IEEE1394b, InfiniBand, etc.

Purpose of Encoding a Character Stream

The primary purpose of this scheme is to embed a clock into the serial bit stream transmitted on all Lanes. No clock is therefore transmitted along with the serial data bit stream. This eliminates the need for a high frequency 2.5GHz clock signal on the Link which would generate significant EMI noise and would be a challenge to route on a standard FR4 board. Link wire routing between two ports is much easier given that there is no clock to route, removing the need to match clock length to Lane signal trace lengths. Two devices are connected by simply wiring their Lanes together.
Below is a summary of the advantages of 8b/10b encoding scheme:
  • Embedded Clock. Creates sufficient 0-to-1 and 1-to-0 transition density (i.e., signal changes) to facilitate re-creation of the receive clock on the receiver end using a PLL (by guaranteeing a limited run length of consecutive ones or zeros). The recovered receive clock is used to clock inbound 10-bit symbols into an elastic buffer. Figure 11-14 on page 420 illustrates the example case wherein 00h is converted to 1101000110b, where an 8-bit character with no transitions has 5 transitions when converted to a 10b symbol. These transitions keep the receiver PLL synchronized to the transmit circuit clock:
  • Limited 'run length' means that the encoding scheme ensures the signal line will not remain in a high or low state for an extended period of time. The run length does not exceed five consecutive 1s or 0s.
  • 1s and 0s are clocked out on the rising-edge of the transmit clock. At the receiver, a PLL can recreate the clock by sync'ing to the leading edges of 1s and 0s.
  • Limited run length ensures minimum frequency drift in the receiver's PLL relative to the local clock in the transmit circuit.
Figure 11-14: Example of 8-bit Character of 00h Encoded to 10-bit Symbol
  • DC Balance. Keeps the number of 1s and 0s transmitted as close to equal as possible, thus maintaining DC balance on the transmitted bit stream to an average of half the signal threshold voltage. This is very important in capacitive- and transformer-coupled circuits.
  • Maintains a balance between the number of 1s and 0s on the signal line, thereby ensuring that the received signal is free of any DC component. This reduces the possibility of inter-bit interference. Inter-bit interference results from the inability of a signal to switch properly from one logic level to the other because the Lane coupling capacitor or intrinsic wire capacitance is over-charged.
  • Encoding of Special Control Characters. Permits the encoding of special control (K) characters such as the Start and End framing characters at the start and end of TLPs and DLLPs.
  • Error Detection. A secondary benefit of the encoding scheme is that it facilitates the detection of most transmission errors. A receiver can check for 'running disparity' errors or the reception of invalid symbols. Via the running disparity mechanism (see "Disparity" on page 423), the transmitted data bit stream maintains a balance of 1s and 0s. The receiver checks the difference between the total number of 1s and 0s transmitted since link initialization and ensures that it is as close to zero as possible. If it isn't, a disparity error is detected and reported, implying that a transmission error occurred.
The disadvantage of the 8b/10b encoding scheme is that, due to the expansion of each 8-bit character into a 10-bit symbol prior to transmission, the actual transmission performance is degraded by 20%; said another way, the transmission overhead is increased by 25% (everything good has a price tag).


Properties of 10-bit (10b) Symbols

  • For 10-bit symbol transmissions, the average number of 1s transmitted over time is equal to the number of 0s transmitted, no matter which 8-bit characters are transmitted; i.e., the symbol transmission is DC balanced.
  • The bit stream never contains more than five continuous 1s or 0s (limited-run length).
  • Each 10-bit symbol contains:
  • Four 0s and six 1s (not necessarily contiguous), or
  • Six 0s and four 1s (not necessarily contiguous), or
  • Five 0s and five 1s (not necessarily contiguous).
  • Each 10-bit symbol is subdivided into two sub-blocks: the first is six bits wide and the second is four bits wide.
  • The 6-bit sub-block contains no more than four 1s or four 0s.
  • The 4-bit sub-block contains no more than three 1s or three 0s.
  • Any symbol with properties other than those listed above is considered invalid, and a receiver considers its reception an error (a validity-check sketch follows this list).
  • An 8-bit character is submitted to the 8b/10b encoder along with a signal indicating whether the character is a Data (D) or Control (K) character. The encoder outputs the equivalent 10-bit symbol along with a current running disparity (CRD) that represents the running balance of 1s and 0s transmitted on this Link since link initialization. See "Disparity" on page 423 for more information.
  • The PCI Express specification defines Control characters that encode into the following Control symbols: STP, SDP, END, EDB, COM, PAD, SKP, FTS, and IDL (see "Control Character Encoding" on page 430).
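A hypothetical validity check capturing the structural properties above (symbols expressed as strings of '0' and '1' characters) might look like this sketch; a complete receiver would additionally require the symbol to appear in the 8b/10b code tables:

    def symbol_is_valid(symbol):
        """Check the structural properties of a 10-bit symbol.

        Valid symbols have 4, 5 or 6 ones, a 6-bit sub-block (abcdei) with no
        more than four of either bit value, and a 4-bit sub-block (fghj) with
        no more than three of either bit value.
        """
        assert len(symbol) == 10
        ones = symbol.count("1")
        if ones not in (4, 5, 6):
            return False
        six, four = symbol[:6], symbol[6:]
        if max(six.count("1"), six.count("0")) > 4:
            return False
        if max(four.count("1"), four.count("0")) > 3:
            return False
        return True

    print(symbol_is_valid("0101011100"))  # D10.3 (CRD-): True
    print(symbol_is_valid("1111110000"))  # six 1s in the 6-bit sub-block: False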


Preparing 8-bit Character Notation

8b/10b conversion lookup tables refer to all 8-bit characters using a special notation (represented by Dxx.y for Data characters and Kxx.y for Control characters). Figure 11-15 on page 422 illustrates the notation equivalent for any 8-bit D or K character. Below are the steps to convert an 8-bit character to its notation equivalent (a small conversion sketch follows the list).
In Figure 11-15 on page 422, the example character is the Data character 6Ah.
  1. The bits in the character are identified by the capitalized alpha designators A through H.
  2. The character is partitioned into two sub-blocks: one 3-bits wide and the other 5-bits wide.
  3. The two sub-blocks are flipped.
  4. The character takes the written form Zxx.y, where:
     • Z = D or K for Data or Control,
     • xx = the decimal value of the 5-bit field,
     • y = the decimal value of the 3-bit field.
  5. The example character is represented as D10.3 in the 8b/10b lookup tables.
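The conversion is easy to express in code. The sketch below (the function name is an illustrative assumption) produces the lookup-table name for any 8-bit character:

    def character_name(byte, is_control=False):
        """Convert an 8-bit character to its Dxx.y / Kxx.y lookup-table name.

        xx is the decimal value of the low five bits (EDCBA) and y is the
        decimal value of the high three bits (HGF).
        """
        five_bit = byte & 0x1F           # bits EDCBA
        three_bit = (byte >> 5) & 0x07   # bits HGF
        prefix = "K" if is_control else "D"
        return "%s%d.%d" % (prefix, five_bit, three_bit)

    print(character_name(0x6A))                    # D10.3
    print(character_name(0xBC, is_control=True))   # K28.5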
Figure 11-15: Preparing 8-bit Character for Encode

Disparity

Definition. Character disparity refers to the difference between the number of 1s and 0s in a 10-bit symbol:
  • When a symbol has more 0s than 1s, the symbol has negative (-) disparity (e.g., 0101000101b).
  • When a symbol has more 1s than 0s, the symbol has positive (+) disparity (e.g., 1001101110b).
  • When a symbol has an equal number of 1s and 0s, the symbol has neutral disparity (e.g., 0110100101b).
  • Each 10-bit symbol contains one of the following numbers of ones and zeros (not necessarily contiguous):
  • Four 0s and six 1s (+ disparity).
  • Six 0s and four 1s (-disparity).
  • Five 0s and five 1s (neutral disparity).
Two Categories of 8-bit Characters. There are two categories of 8-bit characters:
  • Those that encode into 10-bit symbols with + or - disparity.
  • Those that encode into 10-bit symbols with neutral disparity.
CRD (Current Running Disparity). The CRD reflects the total number of 1s and 0s transmitted over the link since link initialization and has the following characteristics:
  • Its current state indicates the balance of 1s and 0s transmitted since link initialization.
  • The CRD's initial state (before any characters are transmitted) can be + or -.
  • The CRD's current state can be either positive (if more 1s than 0s have been transmitted) or negative (if more 0s than 1s).
  • Each character is converted via a table lookup with the current state of the CRD factored in.
  • As each new character is encoded, the CRD either remains the same (if the newly generated 10-bit symbol has neutral disparity) or it flips to the opposite polarity (if the newly generated symbol has + or - disparity), as the sketch below illustrates.
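A small sketch of this disparity bookkeeping (symbols again as '0'/'1' strings; the initial CRD is chosen arbitrarily, since the text notes it may start in either state):

    def symbol_disparity(symbol):
        """Return +1, -1 or 0 for a 10-bit symbol's own disparity."""
        ones = symbol.count("1")
        zeros = symbol.count("0")
        if ones > zeros:
            return +1
        if zeros > ones:
            return -1
        return 0

    def next_crd(crd, symbol):
        """Update the Current Running Disparity after a symbol is transmitted:
        neutral symbols leave the CRD unchanged; non-neutral symbols flip it."""
        return crd if symbol_disparity(symbol) == 0 else -crd

    crd = -1                                   # assume the CRD starts negative
    for sym in ["0011111010", "1100000101", "0101011100"]:   # K28.5, K28.5, D10.3
        crd = next_crd(crd, sym)
        print(sym, "+" if crd > 0 else "-")
    # Prints +, -, -, matching the example transmission in Figure 11-18.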

8b/10b Encoding Procedure

Refer to Figure 11-16 on page 425. The encode is accomplished by performing two table lookups in parallel (not shown separately in the illustration):
  • First Table Lookup: Three elements are submitted to a 5-bit to 6-bit table for a lookup (see Table 11-1 on page 427 and Table 11-2 on page 429):
  • The 5-bit portion of the 8-bit character (bits A through E).
  • The Data/Control (D/K#) indicator.
  • The current state of the CRD (positive or negative).
  • The table lookup yields the upper 6-bits of the 10-bit symbol (bits abcdei).
  • Second Table Lookup: Three elements are submitted to a 3-bit to 4-bit table for a lookup (see Table 11-3 on page 429 and Table 11-4 on page 430):
  • The 3-bit portion of the 8-bit character (bits F through H).
  • The same Data/Control (D/K#) indicator.
  • The current state of the CRD (positive or negative).
  • The table lookup yields the lower 4-bits of the 10-bit symbol (bits fghj).
The 8b/10b encoder computes a new CRD based on the resultant 10-bit symbol and supplies this CRD for the 8b/10b encode of the next character. If the resultant 10-bit symbol is neutral (i.e., it has an equal number of 1s and 0s), the polarity of the CRD remains unchanged. If the resultant 10-bit symbol is + or -, the CRD flips to its opposite state. It is an error if the CRD is currently + or - and the next 10-bit symbol produced has the same polarity as the CRD (unless the next symbol has neutral disparity, in which case the CRD remains the same).
The 8b/10b encoder feeds a Parallel-to-Serial converter which clocks 10-bit symbols out in the bit order 'abcdeifghj' (shown in Figure 11-16 on page 425).
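As a concrete sketch of the two parallel lookups and the CRD update, the fragment below hard-codes only the table entries needed for the examples that follow (D10.3 and K28.5); a full encoder would carry all of Tables 11-1 through 11-4, plus the D.x.7 primary/alternate rule, which is not modeled here:

    # Partial lookup tables, keyed by (sub-block name, current CRD is negative).
    FIVE_SIX = {
        ("D10", True): "010101", ("D10", False): "010101",   # neutral sub-block
        ("K28", True): "001111", ("K28", False): "110000",
    }
    THREE_FOUR = {
        (".3", True): "1100", (".3", False): "0011",
        (".5", True): "0101", (".5", False): "1010",
    }

    def encode(five_name, three_name, crd_negative):
        """Encode one character from its Dxx/Kxx and .y names and the CRD state.

        Returns (10-bit symbol, new CRD state). The 3b/4b lookup uses the
        running disparity as updated by the 6-bit sub-block.
        """
        six = FIVE_SIX[(five_name, crd_negative)]
        if six.count("1") != six.count("0"):      # non-neutral sub-block flips the RD
            crd_negative = not crd_negative
        four = THREE_FOUR[(three_name, crd_negative)]
        if four.count("1") != four.count("0"):
            crd_negative = not crd_negative
        return six + four, crd_negative

    crd_neg = True                                # start with CRD negative
    sym, crd_neg = encode("K28", ".5", crd_neg)
    print(sym, "CRD -" if crd_neg else "CRD +")   # 0011111010 CRD +
    sym, crd_neg = encode("D10", ".3", crd_neg)
    print(sym, "CRD -" if crd_neg else "CRD +")   # 0101010011 CRD +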
Example Encodings. Figure 11-17 on page 426 illustrates some example 8-bit to 10-bit encodings. The following is an explanation of the conversion of the 8-bit Data character 6Ah:
  • The 8-bit character is broken down into its two sub-blocks: 011b and 01010b.
  • The two sub-blocks are flipped and represented as the D10.3 character. The binary-weighted value of the 5-bit block is 10d and the value of the 3-bit field is 3d.
  • The two blocks are submitted to the data character lookup tables (Table 11-1 on page 427 and Table 11-3 on page 429 are for D lookups) along with the current state of the CRD.
  • The last two columns show the 10-bit symbol produced by the two parallel table lookups (Table 11-1 on page 427 and Table 11-3 on page 429) when the CRD is negative or positive.
Figure 11-16: 8-bit to 10-bit (8b/10b) Encoder
Example Transmission. Figure 11-18 on page 427 illustrates the encode and transmission of three characters: the first is the Control character BCh (K28.5), the second is also BCh (K28.5), and the third is the Data character 6Ah (D10.3):
  • If the initial CRD is negative at the time of the encode, the K28.5 is encoded into 0011111010b (positive disparity), flipping the CRD from negative to positive.
  • If the CRD is positive at the time of the encode, the K28.5 is encoded into 1100000101b (negative disparity), flipping the CRD from positive to negative.


  • The D10.3 is encoded into 0101011100b (neutral disparity). The CRD therefore remains unchanged (negative) for the next encoding (not shown).
  • Notice that the resultant symbol stream is DC balanced.
Figure 11-17: Example 8-bit/10-bit Encodings
D or K Character   Hex Byte   Binary Bits HGF EDCBA   Byte Name   CRD- abcdei fghj   CRD+ abcdei fghj
Data (D)      6A   011 01010   D10.3   010101 1100   010101 0011
Data (D)      1B   000 11011   D27.0   110110 0100   001001 1011
Data (D)      F7   111 10111   D23.7   111010 0001   000101 1110
Control (K)   F7   111 10111   K23.7   111010 1000   000101 0111
Control (K)   BC   101 11100   K28.5   001111 1010   110000 0101
If character encode yields neutral disparity, then CRD remains unchanged, else it flips


Figure 11-18: Example 8-bit/10-bit Transmission
Use these two characters in the example below:
D/K#          Hex Byte   Binary Bits HGF EDCBA   Byte Name   CRD- abcdei fghj   CRD+ abcdei fghj
Control (K)   BC   101 11100   K28.5   001111 1010   110000 0101
Data (D)      6A   011 01010   D10.3   010101 1100   010101 0011
Example Transmission
Character to be transmitted:   (CRD -)   K28.5 (BCh)   (CRD +)   K28.5 (BCh)   (CRD -)   D10.3 (6Ah)   (CRD -)
Bit stream transmitted:        yields 001111 1010, CRD is +   yields 110000 0101, CRD is -   yields 010101 1100, CRD is neutral (remains -)
The initialized value of the CRD is a don't care; the receiver can determine it from the incoming bit stream.

The Lookup Tables

The following four tables define the table lookup for the two sub-blocks of 8-bit Data and Control characters.
Table 11-1: 5-bit to 6-bit Encode Table for Data Characters
Data Byte Name   Unencoded Bits EDCBA   Current RD- abcdei   Current RD+ abcdei
D0    00000   100111   011000
D1    00001   011101   100010
D2    00010   101101   010010
D3    00011   110001   110001
D4    00100   110101   001010
D5    00101   101001   101001
D6    00110   011001   011001
D7    00111   111000   000111
D8    01000   111001   000110
D9    01001   100101   100101
D10   01010   010101   010101
D11   01011   110100   110100
D12   01100   001101   001101
D13   01101   101100   101100
D14   01110   011100   011100
D15   01111   010111   101000
D16   10000   011011   100100
D17   10001   100011   100011
D18   10010   010011   010011
D19   10011   110010   110010
D20   10100   001011   001011
D21   10101   101010   101010
D22   10110   011010   011010
D23   10111   111010   000101
D24   11000   110011   001100
D25   11001   100110   100110
D26   11010   010110   010110
D27   11011   110110   001001
D28   11100   001110   001110
D29   11101   101110   010001
D30   11110   011110   100001
D31   11111   101011   010100
Table 11-2: 5-bit to 6-bit Encode Table for Control Characters
Control Byte Name   Unencoded Bits EDCBA   Current RD- abcdei   Current RD+ abcdei
K28   11100   001111   110000
K23   10111   111010   000101
K27   11011   110110   001001
K29   11101   101110   010001
K30   11110   011110   100001
Table 11-3: 3-bit to 4-bit Encode Table for Data Characters
Data Byte Name   Unencoded Bits HGF   Current RD- fghj   Current RD+ fghj
.0   000   1011        0100
.1   001   1001        1001
.2   010   0101        0101
.3   011   1100        0011
.4   100   1101        0010
.5   101   1010        1010
.6   110   0110        0110
.7   111   1110/0111   0001/1000
Table 11-4: 3-bit to 4-bit Encode Table for Control Characters
Control Byte Name   Unencoded Bits HGF   Current RD- fghj   Current RD+ fghj
.0   000   1011   0100
.1   001   0110   1001
.2   010   1010   0101
.3   011   1100   0011
.4   100   1101   0010
.5   101   0101   1010
.6   110   1001   0110
.7   111   0111   1000

Control Character Encoding

Table 11-5 on page 432 shows the encoding of the PCI Express-defined Control characters. These characters are not scrambled by the transmitter logic, but are encoded into 10-bit symbols. Because these Control characters are not scrambled, the receiver logic can easily detect these symbols in an incoming symbol stream.
These Control characters have the following properties:
  • COM (comma) character. The COM character is used as the first character of any Ordered-Set. Ordered-Sets are collections of characters, in multiples of four, that are used for specialized purposes (see "Ordered-Sets" on page 433). The 10-bit encoding of the COM (K28.5) character contains two bits of one polarity followed by five bits of the opposite polarity (001111 1010 or 110000 0101). The COM (and FTS) symbols are the only two symbols that have this property, thereby making the COM easy to detect at the receiver's Physical Layer. A receiver detects the COM pattern to detect the start of an Ordered-Set. In particular, the COM character associated with the TS1, TS2, or FTS Ordered-Sets is used by a receiver to achieve bit and symbol lock on the incoming symbol stream. See "Link Training and Initialization" on page 403 for more details.
  • PAD character. On a multi-Lane Link, assume the transmitter transmits the END character associated with a packet end on an intermediate Lane, such as Lane 3 of a x8 Link. If the Link goes to the Logical Idle state after the transmission of the packet's END character, then the PAD character is used to fill in the remaining Lanes. This is done so that packets as well as Logical Idle sequences always begin on Lane 0. For more information, see "x8, x12, x16 or x32 Packet Format Rules" on page 413 and "x8 Packet Format Example" on page 415.
  • SKP (skip) character. The SKP character is used as part of the SKIP Ordered-Set. The SKIP Ordered-Set is transmitted for clock tolerance compensation. For a detailed description, refer to "Inserting Clock Compensation Zones" on page 436 and "Receiver Clock Compensation Logic" on page 442.
  • STP (Start TLP) character. This character is inserted to identify the start of a TLP.
  • SDP (Start DLLP) character. This character is inserted to identify the start of a DLLP.
  • END character. This character is inserted to identify the end of a TLP or DLLP that has not experienced any CRC errors on previously-traversed links.
  • EDB (EnD Bad packet) character. This character is inserted to identify the end of a TLP that a forwarding device (such as a switch) wishes to 'nullify'. Cut-through mode is a mode in which the switch forwards a packet from its ingress port to an egress port with minimal latency, without having to buffer the incoming packet first. A switch may have started forwarding a packet in cut-through mode and then discovered that the packet is corrupted. It therefore must instruct the receiver of this packet to discard it. To nullify a TLP, the switch ends the packet with the EDB character and inverts the LCRC from its calculated value. A receiver that receives such a nullified packet discards it and does not return an ACK or NAK. Also see the chapter on Ack/Nak for a detailed description of the switch cut-through mode.
  • FTS (Fast Training Sequence) character. This character is used as part of the FTS Ordered-Set. FTS Ordered-Sets are transmitted by a device in order to transition a Link from the L0s low power state back to the full-on L0 state.
  • IDL (Idle) character. This character is used as part of the Electrical Idle Ordered-Set. The Ordered-Set is transmitted to inform the receiver that the Link is about to transition to the L0s low power state (also referred to as the Electrical Idle state of the Link).
Table 11-5: Control Character Encoding and Definition
Character Name   8b Name       10b (CRD-)    10b (CRD+)    Description
COM   K28.5 (BCh)   001111 1010   110000 0101   First character in any Ordered-Set. Detected by the receiver and used to achieve symbol lock during TS1/TS2 Ordered-Set reception at the receiver.
PAD   K23.7 (F7h)   111010 1000   000101 0111   Packet Padding character.
SKP   K28.0 (1Ch)   001111 0100   110000 1011   Used in the SKIP Ordered-Set. This Ordered-Set is used for Clock Tolerance Compensation.
STP   K27.7 (FBh)   110110 1000   001001 0111   Start of TLP character.
SDP   K28.2 (5Ch)   001111 0101   110000 1010   Start of DLLP character.
END   K29.7 (FDh)   101110 1000   010001 0111   End of Good Packet character.
EDB   K30.7 (FEh)   011110 1000   100001 0111   Character used to mark the end of a 'nullified' TLP.
FTS   K28.1 (3Ch)   001111 1001   110000 0110   Used in the FTS Ordered-Set. This Ordered-Set is used to exit from the L0s low power state to L0.
IDL   K28.3 (7Ch)   001111 0011   110000 1100   Used in the Electrical Idle Ordered-Set. This Ordered-Set is used to place the Link in the Electrical Idle state.

Ordered-Sets

General. Ordered-Sets are Physical Layer Packets (PLPs) consisting of a series of characters that starts with the COM character; the total length of an Ordered-Set is a multiple of four characters. When transmitted, they are transmitted on all Lanes. Ordered-Sets are used for special functions such as:
  • Link Training. See "Link Training and Initialization" on page 403 for a detailed description.
  • Clock Tolerance Compensation. See "Inserting Clock Compensation Zones" on page 436 and "Receiver Clock Compensation Logic" on page 442.
  • Placing the Link into the low power L0s state (also referred to as the Electrical Idle Link state).
  • Changing the Link state from the low power L0s state (also referred to as Electrical Idle state) to the full-on L0 state.
The PCI Express specification defines five Ordered-Sets:
  • Training Sequence 1 (TS1),
  • Training Sequence 2 (TS2),
  • SKIP,
  • Fast Training Sequence (FTS)
  • and Electrical IDLE Ordered-Sets.
A brief description of each Ordered-Set follows.
TS1 and TS2 Ordered-Sets. These two Ordered-Sets are used during Link training. They are transmitted by a port's transmitter to the other port's receiver, where they are used to achieve bit and symbol lock. They are also used by the ports at opposite ends of a Link to number their Links and Lanes. These Ordered-Sets are used during Link speed and width negotiation.
SKIP Ordered-Set. In a multi-lane implementation, the SKIP Ordered-Set is periodically transmitted on all Lanes to allow the receiver clock tolerance compensation logic to compensate for clock frequency variations between the clock used by the transmitting device to clock out the serial bit stream and the receiver device's local clock. The receiver adds a SKP symbol to a SKIP Ordered-Set in the receiver elastic buffer to prevent a potential buffer underflow condition from occurring due to the transmitter clock being slower than the local receiver clock. Alternately, the receiver deletes a SKP symbol from the SKIP Ordered-Set in the receiver elastic buffer to prevent a potential buffer overflow condition from occurring due to the transmitter clock being faster than the local receiver clock. For a detailed description, refer to "Inserting Clock Compensation Zones" on page 436 and "Receiver Clock Compensation Logic" on page 442.
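A behavioral sketch of that add/delete decision is shown below. The buffer depth and watermark values are illustrative assumptions; real designs use hardware FIFOs with implementation-specific thresholds.

    from collections import deque

    class ElasticBuffer:
        """Toy elastic buffer: a SKP symbol within a SKIP Ordered-Set may be
        added (transmitter slower than the local clock, buffer running empty)
        or deleted (transmitter faster, buffer running full)."""

        def __init__(self, low_water=4, high_water=12):
            self.fifo = deque()
            self.low_water = low_water
            self.high_water = high_water

        def write(self, symbol):
            """Called with the recovered (Rx) clock for each inbound symbol."""
            if symbol == "SKP" and len(self.fifo) >= self.high_water:
                return                      # delete a SKP: simply do not enqueue it
            self.fifo.append(symbol)
            if symbol == "SKP" and len(self.fifo) <= self.low_water:
                self.fifo.append("SKP")     # add a SKP to keep the buffer from underflowing

        def read(self):
            """Called with the local clock to feed the 10b/8b decoder."""
            return self.fifo.popleft() if self.fifo else None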
Electrical Idle Ordered-Set. A transmitter device that wishes to place the Link in the Electrical Idle state (aka the L0s low power state) transmits this Ordered-Set to a receiver. Upon receipt, the differential receivers prepare for this low power state during which the transmitter driver can be in the low- or high-impedance state and packet transmission stops. The differential receiver remains in the low-impedance state while in this state.
FTS Ordered-Set. FTS Ordered-Sets are transmitted by a device to transition a Link from the low power L0s state back to the full-on L0 state. The receiver detects the FTS Ordered-Set and uses it to achieve bit and symbol lock as well as to re-synchronize its receiver PLL to the transmitter clock used to transmit the serial bit stream. See the Link Training and Power Management chapters for more details on FTS Ordered-Set usage.

Parallel-to-Serial Converter (Serializer)

The 8b/10b Encoder on each Lane feeds the Parallel-to-Serial converter associated with that Lane. The Parallel-to-Serial converter clocks 10-bit symbols out in the bit order 'abcdeifghj', with the least significant bit (a) shifted out first and the most significant bit (j) shifted out last (as shown in Figure 11-16 on page 425). The symbols supplied by the 8b/10b Encoder are clocked into the converter at 250MHz. The serial bit stream is clocked out of the Parallel-to-Serial converter at 2.5GHz.
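A sketch of that bit ordering (symbols written as 'abcdeifghj' strings, with 'a' transmitted first):

    def serialize(symbols):
        """Yield the serial bit stream for a list of 10-bit symbols.

        Each symbol string is ordered 'abcdeifghj'; bit 'a' is shifted out
        first and bit 'j' last, so the string is simply emitted left to right.
        """
        for symbol in symbols:
            for bit in symbol:
                yield bit

    # K28.5 (CRD-) followed by D10.3 (CRD-) on one Lane:
    print("".join(serialize(["0011111010", "0101011100"])))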

Differential Transmit Driver

The differential driver that actually drives the serialized bit stream onto the wire (or fiber) uses NRZ encoding and drives the serial bit stream at the 2.5 Gbit/s transfer rate. The differential driver output per Lane consists of two signals (D+ and D-). A logical one is signaled by driving the D+ signal high and the D- signal low, thus creating a positive voltage difference between the D+ and D- signals. A logical zero is signaled by driving the D+ signal low and the D- signal high, thus creating a negative voltage difference between the D+ and D- signals.
Differential peak-to-peak voltage driven by the transmitter is between 800mV (min.) and 1200mV (max).
  • Logical 1 is signalled with a positive differential voltage.
  • Logical 0 is signalled with a negative differential voltage.
During the Link's electrical Idle state, the transmitter drives a differential peak voltage between 0mV and 20mV (the transmitter may be in the low- or high-impedance state).
Details regarding the electrical characteristics of the driver are discussed in "Transmitter Driver Characteristics" on page 477.

Transmit (Tx) Clock

The serial output of the Parallel-to-Serial converter on each Lane is clocked out to the differential driver by the Tx Clock signal (see Figure 11-16 on page 425). The Tx Clock frequency is 2.5GHz, and it must be accurate to within +/- 300 ppm of the 2.5GHz center frequency (or 600 ppm total between the two ends of a Link). The two clocks can therefore slip by one clock every 1666 clock cycles. Note that this Tx Clock is different from the local clock of the Physical Layer, which is a much slower clock. The Physical Layer receives a clock from an external source. PCI Express devices on peripheral cards as well as system boards may use a 100MHz clock supplied by the system board. This clock is multiplied up with the aid of a PLL internal to the Physical Layer. The resultant local clock, which runs at a much slower frequency than 2.5GHz, clocks Physical Layer logic such as the Byte Striping logic, the Scrambler, the 8b/10b Encoder, the buffers, etc. The PLL also produces the 2.5GHz Tx Clock used to feed the Parallel-to-Serial converters.
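The 1666-clock figure follows directly from the tolerance: with each end allowed to be off by 300 ppm in opposite directions, the worst-case mismatch is 600 ppm, and 1,000,000 / 600 is roughly 1667. A two-line check:

    ppm_total = 600                        # +/-300 ppm at each end, worst case
    print(round(1_000_000 / ppm_total))    # ~1667 -> about one clock of slip every 1666 clocks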

Other Miscellaneous Transmit Logic Topics

Logical Idle Sequence

In order to keep the receiver's PLL sync'd up (i.e., to keep it from drifting), something must be transmitted during periods when there are no TLPs, DLLPs or PLPs to transmit. The logical Idle sequence is transmitted during these times. The Idle sequence is gated to the Mux as described in the section "Multiplexer (Mux) and Mux Control Logic" on page 404. Some properties of the Logical Idle sequence are:
  • The logical Idle sequence consists of the 8-bit Data character with a value of 00h.
  • When transmitted, it is simultaneously transmitted on all Lanes. The Link is said to be in the logical Idle state (not to be confused with electrical Idle-the state when the Link is not driven and there are no packet transmissions and the receiver PLL loses synchronization).
  • The logical Idle sequence is scrambled. This implies that, on the Link, the logical Idle sequence has a pseudo-random value. A receiver can distinguish the logical Idle sequence from other packet transmissions because it occurs outside the packet framing context (i.e., the logical Idle sequence occurs after an END or EDB Control symbol, but before an STP or SDP Control symbol).
  • The logical Idle Sequence is 8b/10b encoded.
  • During Logical Idle sequence transmission, SKIP Ordered-Sets are also transmitted periodically.

Inserting Clock Compensation Zones

Background. When the receiver logic receives a symbol stream, it sometimes needs to add or remove a symbol from the received symbol stream to compensate for transmitter versus receiver clock frequency variations (for background, refer to "Receiver Clock Compensation Logic" on page 442).
It should be obvious that the receiver logic can't arbitrarily pick a symbol to add or delete. This means that, on a periodic basis, the transmit logic must transmit a special Control character sequence that can be used for this purpose. This sequence is referred to as the SKIP Ordered-Set (see Figure 11-19) which consists of a COM character followed by three SKP characters.
SKIP Ordered-Set Insertion Rules. A transmitter is required to transmit SKIP Ordered-Sets on a periodic basis. The following rules apply:
  • The set must be scheduled for insertion at most once every 1180 symbol clocks (i.e., symbol times) and at least once every 1538 symbol clocks (the scheduling sketch following this list models this interval).
  • When it's time to insert a SKIP Ordered-Set, it is inserted at the next packet boundary (not in the middle of a packet). SKIP Ordered-Sets are inserted between packets simultaneously on all Lanes. If a long packet transmission is already in progress, the SKIP Ordered-Sets are accumulated and then inserted consecutively at the next packet boundary.
  • In a multi-Lane environment, the SKIP Ordered-Set must be transmitted on all Lanes simultaneously (see Figure 11-11 on page 414 and Figure 11-12 on page 415). When necessary, the Link is padded so as to allow the transmission of the SKIP Ordered-Sets to start on all Lanes on the same clock (see Figure 11-12 on page 415).
  • During all lower power Link states, any counter(s) used to schedule SKIP Ordered-Sets must be reset.
  • SKIP Ordered-Sets must not be transmitted while the Compliance Pattern is in progress.
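A toy model of this scheduling, capturing the interval and the accumulate-during-a-packet behavior (the interval constant is an illustrative choice within the allowed 1180 to 1538 symbol-time window):

    class SkipScheduler:
        """Schedule SKIP Ordered-Sets on the transmit side.

        A set is scheduled no more often than every 1180 symbol times and no
        less often than every 1538; sets that come due while a packet is in
        flight are accumulated and sent back to back at the next boundary.
        """

        SCHEDULE_INTERVAL = 1354           # illustrative value inside 1180..1538

        def __init__(self):
            self.counter = 0
            self.pending = 0

        def symbol_time(self, packet_in_progress):
            """Call once per symbol time; returns how many SKIP Ordered-Sets
            to insert right now (0 if none are due or a packet is busy)."""
            self.counter += 1
            if self.counter >= self.SCHEDULE_INTERVAL:
                self.counter = 0
                self.pending += 1          # one more set becomes due
            if packet_in_progress or self.pending == 0:
                return 0
            due, self.pending = self.pending, 0
            return due                     # insert the accumulated sets consecutively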
Figure 11-19: SKIP Ordered-Set Encoding
COM (K28.5) followed by three SKP (K28.0) characters

Receive Logic Details

Figure 11-20 shows the receiver logic of the Logical Physical Layer. This section describes packet processing from the time the data is received serially on each Lane until the packet byte stream is clocked to the Data Link Layer.
Figure 11-20: Physical Layer Receive Logic Details
Figure 11-21 illustrates the receiver logic's front end on each Lane. This is comprised of:
  • The differential receiver.
  • The Rx Clock recovery logic.
  • The COM symbol and Ordered-Set detector.
  • The Serial-to-Parallel converter (Deserializer).
  • The Lane-to-Lane De-Skew logic (delay circuit).
  • The Elastic Buffer and Clock Tolerance Compensation logic.
Figure 11-21: Receiver Logic's Front End Per Lane

Differential Receiver

Refer to Figure 11-21. The differential receiver on each Lane senses differential peak-to-peak voltage differences >175mV but <1200mV:
  • A positive difference = Logical 1.
  • A negative difference = Logical 0.
A signal peak-to-peak difference <65mV is considered a signal-absent condition, and the Link is in the electrical Idle state. During this time, the receiver de-gates its input to prevent the error circuit from detecting an error. A signal peak-to-peak differential voltage between 65mV and 175mV serves as a noise guard band.

Rx Clock Recovery

General

Using a PLL (Phase-Locked Loop), the receiver circuit generates the Rx Clock from the data bit transitions in the input data stream. This recovered clock has the same frequency (2.5GHz) as that of the Tx Clock used by the transmitting device to clock the data bit stream onto the wire (or fiber). The Rx Clock is used to clock the inbound serial symbol stream into the Serial-to-Parallel converter (Deserializer). The 10-bit symbol stream produced by the Deserializer is clocked into the Elastic Buffer with a divide-by-10 version of the Rx Clock. The Rx Clock is different from the Local Clock that is used to clock symbols out of the Elastic Buffer to the 10b/8b decoder. The Local Clock must be accurate to within +/- 300 ppm of the center frequency.

Achieving Bit Lock

Recollect that the inbound serial symbol stream is guaranteed to have frequent 1-to-0 and 0-to-1 transitions due to the 8b/10b encoding scheme. A transition is guaranteed at least every 5 bit-times. The receiver PLL uses the transitions in the received bit-stream to synchronize the Rx Clock with the Tx Clock that was used at the transmitter to clock out the serialized bit stream. When the receiver PLL locks on to the Tx Clock frequency, the receiver is said to have achieved "Bit Lock".
During Link training, the transmitter device sends a long series of back-to-back TS1 and TS2 Ordered-Sets to the receiver and the receiver uses the bit transitions in these Ordered-Sets to achieve Bit Lock. Once the Link is in the full-on L0 state, transitions on the Link occur on a regular basis and the receiver PLL is able to maintain Bit Lock.

Losing Bit Lock

If the Link is put in a low power state (such as L0s) where packet transmission ceases, the receiver's PLL gradually loses synchronization. The transmitter sends an electrical Idle Ordered-Set to tell the receiver to de-gate its input to prevent the error circuit from detecting an error.

Regaining Bit Lock

When the Link is in the L0s state, the transmitter sends a few FTS Ordered-sets (on the order of four FTSs) to the receiver and the receiver uses these to regain Bit Lock. Only a few FTSs are needed by the receiver in order to achieve Bit Lock (thus the wake up latency is of short duration). Because the Link is in the L0s state for a short time, the receiver PLL does not completely lose synchronization with the Tx Clock before it receives the FTSs.

Serial-to-Parallel converter (Deserializer)

The incoming serial data on each Lane is clocked into that Lane's Deserializer (the serial-to-parallel converter) by the Rx clock (see Figure 11-21 on page 439). The 10-bit symbols produced are clocked into an Elastic Buffer using a divide-by-10 version of the Rx Clock.

Symbol Boundary Sensing (Symbol Lock)

When the receive logic starts receiving a bit stream, it is JABOB (just a bunch of bits) with no markers to differentiate one symbol from another. The receive logic must have some way to determine the start and end of a 10-bit symbol. The Comma (COM) symbol serves this purpose.
The 10-bit encoding of the COM (K28.5) symbol contains two bits of one polarity followed by five bits of the opposite polarity (0011111010b or 1100000101b). Unless an error occurs, no other character has this property, thereby making it easily detectable. Recollect that the COM Control character, like all other Control characters, is not scrambled by the transmitter. This makes the COM easily detectable by the COM Detector, which looks for two consecutive 0s or two consecutive 1s followed by a string of five 1s or five 0s, respectively. Upon detection of the COM symbol, the COM Detector knows that the next bit received after the COM symbol is the first bit of a valid 10-bit symbol. The Deserializer is then initialized so that it can henceforth generate valid 10-bit symbols. The Deserializer is said to have achieved 'Symbol Lock'.
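A sketch of that comma detection on a raw bit string follows (the scanning window is illustrative; real hardware performs the equivalent operation on the recovered serial stream):

    COMMA_PATTERNS = ("0011111", "1100000")   # first 7 bits of K28.5, either disparity

    def find_symbol_boundary(bits):
        """Scan an unaligned bit string for a COM symbol and return the index
        of its first bit, i.e., the symbol boundary; -1 if no comma is found."""
        for i in range(len(bits) - 6):
            if bits[i:i + 7] in COMMA_PATTERNS:
                return i
        return -1

    # Three stray bits followed by K28.5 (CRD-) and D10.3 (CRD-):
    stream = "101" + "0011111010" + "0101011100"
    boundary = find_symbol_boundary(stream)
    print(boundary)                           # 3: symbols start every 10 bits from here
    print(stream[boundary:boundary + 10])     # 0011111010 -> the COM symbol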