Virtual I/O Device (VIRTIO) Version 1.0

Committee Specification 04

03 March 2016

Specification URIs

This version:

Previous version:

Latest version:

Technical Committee:

Chairs:

Editors:

Additional artifacts:

This prose specification is one component of a Work Product that also includes:

Example Driver Listing:

http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/listings/

Related work:

This specification replaces or supersedes:

Virtio PCI Card Specification Version 0.9.5:

http://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf

Abstract:

This document describes the specifications of the “virtio” family of devices. These devices are found in virtual environments, yet by design they look like physical devices to the guest within the virtual machine - and this document treats them as such. This similarity allows the guest to use standard drivers and discovery mechanisms. The purpose of virtio and this specification is that virtual environments and guests should have a straightforward, efficient, standard and extensible mechanism for virtual devices, rather than boutique per-environment or per-OS mechanisms.

Status:

This document was last revised or approved by the Virtual I/O Device (VIRTIO) TC on the above date. The level of approval is also listed above. Check the “Latest version” location noted above for possible later revisions of this document. Any other numbered Versions and other technical work produced by the Technical Committee (TC) are listed at https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=virtio#technical Technical Committee members should send comments on this specification to the Technical Committee’s email list. Others should send comments to the Technical Committee by using the “Send A Comment” button on the Technical Committee’s web page at https://www.oasis-open.org/committees/virtio/. For information on whether any patents have been disclosed that may be essential to implementing this specification, and any offers of patent licensing terms, please refer to the Intellectual Property Rights section of the Technical Committee web page (https://www.oasis-open.org/committees/virtio/ipr.php).

Citation format:



When referencing this specification the following citation format should be used: [VIRTIO-v1.0]

Virtual I/O Device (VIRTIO) Version 1.0. Edited by Rusty Russell, Michael S. Tsirkin, Cornelia Huck, and Pawel Moll. 03 March 2016. OASIS Committee Specification 04. http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/virtio-v1.0-cs04.html. Latest version: http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html.

__________________________________________________________________

Notices

Copyright © OASIS Open 2015. All Rights Reserved.

All capitalized terms in the following text have the meanings assigned to them in the OASIS Intellectual Property Rights Policy (the "OASIS IPR Policy"). The full Policy may be found at the OASIS website.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published, and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this section are included on all such copies and derivative works. However, this document itself may not be modified in any way, including by removing the copyright notice or references to OASIS, except as needed for the purpose of developing any document or deliverable produced by an OASIS Technical Committee (in which case the rules applicable to copyrights, as set forth in the OASIS IPR Policy, must be followed) or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by OASIS or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and OASIS DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY OWNERSHIP RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

OASIS requests that any OASIS Party or any other party that believes it has patent claims that would necessarily be infringed by implementations of this OASIS Committee Specification or OASIS Standard, to notify OASIS TC Administrator and provide an indication of its willingness to grant patent licenses to such patent claims in a manner consistent with the IPR Mode of the OASIS Technical Committee that produced this specification.

OASIS invites any party to contact the OASIS TC Administrator if it is aware of a claim of ownership of any patent claims that would necessarily be infringed by implementations of this specification by a patent holder that is not willing to provide a license to such patent claims in a manner consistent with the IPR Mode of the OASIS Technical Committee that produced this specification. OASIS may include such claims on its website, but disclaims any obligation to do so.

OASIS takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on OASIS’ procedures with respect to rights in any document or deliverable produced by an OASIS Technical Committee can be found on the OASIS website. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this OASIS Committee Specification or OASIS Standard, can be obtained from the OASIS TC Administrator. OASIS makes no representation that any information or list of intellectual property rights will at any time be complete, or that any claims in such list are, in fact, Essential Claims.

The name "OASIS" is a trademark of OASIS, the owner and developer of this specification, and should be used only to refer to the organization and its official outputs. OASIS welcomes reference to, and implementation and use of, specifications, while reserving the right to enforce its marks against misleading uses. Please see https://www.oasis-open.org/policies-guidelines/trademark for above guidance.



__________________________________________________________________


1 Introduction

The purpose of virtio and this specification is that virtual environments and guests should have a straightforward, efficient, standard and extensible mechanism for virtual devices, rather than boutique per-environment or per-OS mechanisms.

Straightforward: Virtio devices use normal bus mechanisms of interrupts and DMA which should be familiar to any device driver author. There is no exotic page-flipping or COW mechanism: it’s just a normal device.

Efficient: Virtio devices consist of rings of descriptors for both input and output, which are neatly laid out to avoid cache effects from both driver and device writing to the same cache lines.

Standard: Virtio makes no assumptions about the environment in which it operates, beyond supporting the bus to which the device is attached. In this specification, virtio devices are implemented over MMIO, Channel I/O and PCI bus transports; earlier drafts have been implemented on other buses not included here.

Extensible: Virtio devices contain feature bits which are acknowledged by the guest operating system during device setup. This allows forwards and backwards compatibility: the device offers all the features it knows about, and the driver acknowledges those it understands and wishes to use.

1.1 Normative References

[RFC2119] Bradner S., “Key words for use in RFCs to Indicate Requirement Levels”, BCP 14, RFC 2119, March 1997. http://www.ietf.org/rfc/rfc2119.txt

[S390 PoP] z/Architecture Principles of Operation, IBM Publication SA22-7832, http://publibfi.boulder.ibm.com/epubs/pdf/dz9zr009.pdf, and any future revisions

[S390 Common I/O] ESA/390 Common I/O-Device and Self-Description, IBM Publication SA22-7204, http://publibfp.dhe.ibm.com/cgi-bin/bookmgr/BOOKS/dz9ar501/CCONTENTS, and any future revisions

[PCI] Conventional PCI Specifications, http://www.pcisig.com/specifications/conventional/, PCI-SIG

[PCIe] PCI Express Specifications, http://www.pcisig.com/specifications/pciexpress/, PCI-SIG

[IEEE 802] IEEE Standard for Local and Metropolitan Area Networks: Overview and Architecture, http://standards.ieee.org/about/get/802/802.html, IEEE

[SAM] SCSI Architectural Model, http://www.t10.org/cgi-bin/ac.pl?t=f&f=sam4r05.pdf

[SCSI MMC] SCSI Multimedia Commands, http://www.t10.org/cgi-bin/ac.pl?t=f&f=mmc6r00.pdf

1.2 Non-Normative References

[Virtio PCI Draft] Virtio PCI Draft Specification

http://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf

1.3 Terminology

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in [RFC2119].

1.3.1 Legacy Interface: Terminology

Earlier drafts of this specification (i.e. revisions before 1.0, see e.g. [Virtio PCI Draft]) defined a similar, but different interface between the driver and the device. Since these are widely deployed, this specification accommodates OPTIONAL features to simplify transition from these earlier draft interfaces.

Specifically devices and drivers MAY support:

Legacy Interface is an interface specified by an earlier draft of this specification (before 1.0).

Legacy Device is a device implemented before this specification was released, and implementing a legacy interface on the host side.

Legacy Driver is a driver implemented before this specification was released, and implementing a legacy interface on the guest side.

Legacy devices and legacy drivers are not compliant with this specification.

To simplify transition from these earlier draft interfaces, a device MAY implement:

Transitional Device a device supporting both drivers conforming to this specification, and allowing legacy drivers.

Similarly, a driver MAY implement:

Transitional Driver a driver supporting both devices conforming to this specification, and legacy devices.

Note:

Devices or drivers with no legacy compatibility are referred to as non-transitional devices and drivers, respectively.

1.3.2 Transition from earlier specification drafts

For devices and drivers already implementing the legacy interface, some changes will have to be made to support this specification.

In this case, it might be beneficial for the reader to focus on sections tagged "Legacy Interface" in the section title. These highlight the changes made since the earlier drafts.

1.4 Structure Specifications

Many device and driver in-memory structure layouts are documented using the C struct syntax. All structures are assumed to be without additional padding. To stress this, cases where common C compilers are known to insert extra padding within structures are tagged using the GNU C __attribute__((packed)) syntax.

For the integer data types used in the structure definitions, the following conventions are used:

u8, u16, u32, u64: An unsigned integer of the specified length in bits.

le16, le32, le64: An unsigned integer of the specified length in bits, in little-endian byte order.

be16, be32, be64: An unsigned integer of the specified length in bits, in big-endian byte order.
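As an illustration of the le16 and le32 conventions, the following is a minimal sketch of byte-order-independent accessors. The helper names (put_le16, get_le16, put_le32, get_le32) are hypothetical and are not defined by this specification; they only show one way a driver on a host of either endianness could read and write such fields.

```c
#include <stdint.h>

/* Hypothetical helpers: store/load little-endian fields byte by byte,
 * so the result does not depend on the host byte order. */
static inline void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);   /* least significant byte first */
    p[1] = (uint8_t)(v >> 8);
}

static inline uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

static inline void put_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

static inline uint32_t get_le32(const uint8_t *p)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v |= (uint32_t)p[i] << (8 * i);
    return v;
}
```

On a little-endian host these reduce to plain loads and stores; the byte-wise form is used here only so the sketch is correct on any host.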

2 Basic Facilities of a Virtio Device

Device status field

Feature bits

Device Configuration space

One or more virtqueues

2.1 Device Status Field

During device initialization by a driver, the driver follows the sequence of steps specified in 3.1.

The device status field provides a simple low-level indication of the completed steps of this sequence. It’s most useful to imagine it hooked up to traffic lights on the console indicating the status of each device. The following bits are defined (listed below in the order in which they would be typically set):

ACKNOWLEDGE (1) Indicates that the guest OS has found the device and recognized it as a valid virtio device.

DRIVER (2) Indicates that the guest OS knows how to drive the device. Note: There could be a significant (or infinite) delay before setting this bit. For example, under Linux, drivers can be loadable modules.

FAILED (128) Indicates that something went wrong in the guest, and it has given up on the device. This could be an internal error, or the driver didn’t like the device for some reason, or even a fatal error during device operation.

FEATURES_OK (8) Indicates that the driver has acknowledged all the features it understands, and feature negotiation is complete.

DRIVER_OK (4) Indicates that the driver is set up and ready to drive the device.

DEVICE_NEEDS_RESET (64) Indicates that the device has experienced an error from which it can’t recover.
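The status bits above can be expressed as constants, as in the following sketch. The numeric values come from the list above; the VIRTIO_STATUS_ prefix and the two helper functions are assumptions made for this example and are not names defined by this specification.

```c
#include <stdint.h>
#include <stdbool.h>

/* Values as listed above; the VIRTIO_STATUS_ prefix is hypothetical. */
enum {
    VIRTIO_STATUS_ACKNOWLEDGE        = 1,
    VIRTIO_STATUS_DRIVER             = 2,
    VIRTIO_STATUS_DRIVER_OK          = 4,
    VIRTIO_STATUS_FEATURES_OK        = 8,
    VIRTIO_STATUS_DEVICE_NEEDS_RESET = 64,
    VIRTIO_STATUS_FAILED             = 128,
};

/* Bits are only ever added to the status value, never cleared. */
static inline uint8_t status_set(uint8_t status, uint8_t bit)
{
    return (uint8_t)(status | bit);
}

/* Hypothetical check: the driver is live and no error bit is set. */
static inline bool status_device_usable(uint8_t status)
{
    return (status & VIRTIO_STATUS_DRIVER_OK) != 0 &&
           (status & (VIRTIO_STATUS_FAILED |
                      VIRTIO_STATUS_DEVICE_NEEDS_RESET)) == 0;
}
```

Setting ACKNOWLEDGE, DRIVER, FEATURES_OK and DRIVER_OK in turn yields the values 1, 3, 11 and 15.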

2.1.1 Driver Requirements: Device Status Field

The driver MUST update device status, setting bits to indicate the completed steps of the driver initialization sequence specified in 3.1. The driver MUST NOT clear a device status bit. If the driver sets the FAILED bit, the driver MUST later reset the device before attempting to re-initialize.

The driver SHOULD NOT rely on completion of operations of a device if DEVICE_NEEDS_RESET is set. Note: For example, the driver can’t assume requests in flight will be completed if DEVICE_NEEDS_RESET is set, nor can it assume that they have not been completed. A good implementation will try to recover by issuing a reset.

2.1.2 Device Requirements: Device Status Field

The device MUST initialize device status to 0 upon reset.

The device MUST NOT consume buffers or notify the driver before DRIVER_OK.

The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state in which a reset is needed. If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device MUST send a device configuration change notification to the driver.

2.2 Feature Bits

Each virtio device offers all the features it understands. During device initialization, the driver reads this and tells the device the subset that it accepts. The only way to renegotiate is to reset the device.

This allows for forwards and backwards compatibility: if the device is enhanced with a new feature bit, older drivers will not write that feature bit back to the device. Similarly, if a driver is enhanced with a feature that the device doesn’t support, it sees that the new feature is not offered.

Feature bits are allocated as follows:

0 to 23: Feature bits for the specific device type

24 to 32: Feature bits reserved for extensions to the queue and feature negotiation mechanisms

33 and above: Feature bits reserved for future extensions.

Note:

In particular, new fields in the device configuration space are indicated by offering a new feature bit.

2.2.1 Driver Requirements: Feature Bits

The driver MUST NOT accept a feature which the device did not offer, and MUST NOT accept a feature which requires another feature which was not accepted.

The driver SHOULD go into backwards compatibility mode if the device does not offer a feature it understands, otherwise MUST set the FAILED device status bit and cease initialization.

2.2.2 Device Requirements: Feature Bits

The device MUST NOT offer a feature which requires another feature which was not offered. The device SHOULD accept any valid subset of features the driver accepts, otherwise it MUST fail to set the FEATURES_OK device status bit when the driver writes it.
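The negotiation rule can be sketched with bit masks. This is only an illustration under stated assumptions: it treats the feature set as a single 64-bit word, and the function names are hypothetical. Feature dependencies (a feature that requires another feature) are device-specific and are not modelled here.

```c
#include <stdint.h>
#include <stdbool.h>

/* Mask for a feature bit number (0..63 in this sketch). */
static inline uint64_t feature_mask(unsigned bit)
{
    return (uint64_t)1 << bit;
}

/* The driver accepts only features that were offered by the device
 * and that the driver itself understands. */
static inline uint64_t negotiate_features(uint64_t offered,
                                          uint64_t understood)
{
    return offered & understood;
}

/* A set of accepted features is acceptable only if it contains
 * nothing the device did not offer. */
static inline bool is_valid_acceptance(uint64_t offered, uint64_t accepted)
{
    return (accepted & ~offered) == 0;
}
```

For example, if the device offers bits 0, 5 and 32 and the driver understands bits 5, 7 and 32, the accepted set is bits 5 and 32; bit 7 is dropped because it was never offered.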

2.2.3 Legacy Interface: A Note on Feature Bits

Transitional Drivers MUST detect Legacy Devices by detecting that the feature bit VIRTIO_F_VERSION_1 is not offered. Transitional devices MUST detect Legacy drivers by detecting that VIRTIO_F_VERSION_1 has not been acknowledged by the driver.

In this case the device is used through the legacy interface.

Legacy interface support is OPTIONAL. Thus, both transitional and non-transitional devices and drivers are compliant with this specification.

Requirements pertaining to transitional devices and drivers are contained in sections named ’Legacy Interface’ like this one.

When the device is used through the legacy interface, transitional devices and transitional drivers MUST operate according to the requirements documented within these legacy interface sections. Specification text within these sections generally does not apply to non-transitional devices.

2.3 Device Configuration Space

Device configuration space is generally used for rarely-changing or initialization-time parameters. Where configuration fields are optional, their existence is indicated by feature bits. Future versions of this specification will likely extend the device configuration space by adding extra fields at the tail.

Note: The device configuration space uses the little-endian format for multi-byte fields.

Each transport also provides a generation count for the device configuration space, which will change whenever there is a possibility that two accesses to the device configuration space can see different versions of that space.

2.3.1 Driver Requirements: Device Configuration Space

Drivers MUST NOT assume reads from fields greater than 32 bits wide are atomic, nor are reads from multiple fields: drivers SHOULD read device configuration space fields like so:

u32 before, after;
do {
        before = get_config_generation(device);
        // read config entry/entries.
        after = get_config_generation(device);
} while (after != before);

For optional configuration space fields, the driver MUST check that the corresponding feature is offered before accessing that part of the configuration space. Note: See section 3.1 for details on feature negotiation.

Drivers MUST NOT limit structure size and device configuration space size. Instead, drivers SHOULD only check that device configuration space is large enough to contain the fields necessary for device operation. Note: For example, if the specification states that device configuration space ’includes a single 8-bit field’ drivers should understand this to mean that the device configuration space might also include an arbitrary amount of tail padding, and accept any device configuration space size equal to or greater than the specified 8-bit size.
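The generation-count loop can be made concrete for a 64-bit field read as two 32-bit halves. Everything below other than the loop shape is an assumption for illustration: struct cfg_ops, read_config_u64 and the fake device are hypothetical, and real transports expose the generation count and configuration reads in their own way. The fake device changes the field (and bumps its generation) in the middle of the first attempt, so a naive read would be torn.

```c
#include <stdint.h>

/* Hypothetical transport hooks. */
struct cfg_ops {
    uint32_t (*generation)(void *dev);
    uint32_t (*read32)(void *dev, unsigned offset);
};

/* Read a 64-bit config field; retry if the generation changed. */
static uint64_t read_config_u64(const struct cfg_ops *ops, void *dev,
                                unsigned offset)
{
    uint32_t before, after, lo, hi;

    do {
        before = ops->generation(dev);
        lo = ops->read32(dev, offset);
        hi = ops->read32(dev, offset + 4);
        after = ops->generation(dev);
    } while (after != before);

    return ((uint64_t)hi << 32) | lo;
}

/* Fake device used only to exercise the loop. */
struct fake_dev {
    uint32_t gen;
    uint32_t regs[2];   /* low word, high word */
    int reads;
};

static uint32_t fake_generation(void *d)
{
    return ((struct fake_dev *)d)->gen;
}

static uint32_t fake_read32(void *d, unsigned offset)
{
    struct fake_dev *f = d;

    if (++f->reads == 1) {
        /* Return the old low word, then update the field to
         * 0x100000000 and bump the generation. */
        uint32_t old = f->regs[offset / 4];
        f->regs[0] = 0;
        f->regs[1] = 1;
        f->gen++;
        return old;
    }
    return f->regs[offset / 4];
}
```

Starting from the value 0xFFFFFFFF, a torn read would combine the old low word with the new high word; the retry instead returns the updated value after four 32-bit reads in total.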

2.3.2 Device Requirements: Device Configuration Space

The device MUST allow reading of any device-specific configuration field before FEATURES_OK is set by the driver. This includes fields which are conditional on feature bits, as long as those feature bits are offered by the device.

2.3.3 Legacy Interface: A Note on Device Configuration Space endian-ness

Note that for legacy interfaces, device configuration space is generally the guest’s native endian, rather than PCI’s little-endian. The correct endian-ness is documented for each device.

2.3.4 Legacy Interface: Device Configuration Space

Legacy devices did not have a configuration generation field, thus are susceptible to race conditions if configuration is updated. This affects the block capacity (see 5.2.4) and network mac (see 5.1.4) fields; when using the legacy interface, drivers SHOULD read these fields multiple times until two reads generate a consistent result.

2.4 Virtqueues

The mechanism for bulk data transport on virtio devices is pretentiously called a virtqueue. Each device can have zero or more virtqueues. Each queue has a 16-bit queue size parameter, which sets the number of entries and implies the total size of the queue.

Each virtqueue consists of three parts:

Descriptor Table

Available Ring

Used Ring

where each part is physically-contiguous in guest memory, and has different alignment requirements.

The memory alignment and size requirements, in bytes, of each part of the virtqueue are summarized in the following table:

Virtqueue Part      Alignment    Size
Descriptor Table    16           16 * (Queue Size)
Available Ring      2            6 + 2 * (Queue Size)
Used Ring           4            6 + 8 * (Queue Size)

The Alignment column gives the minimum alignment for each part of the virtqueue.

The Size column gives the total number of bytes for each part of the virtqueue.

Queue Size corresponds to the maximum number of buffers in the virtqueue. Queue Size value is always a power of 2. The maximum Queue Size value is 32768. This value is specified in a bus-specific way.
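The table and the Queue Size constraints translate directly into arithmetic. The following sketch uses hypothetical function names; the constants are taken from the table and the paragraph above.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Queue Size must be a power of 2 and at most 32768. */
static inline bool queue_size_valid(uint32_t qsz)
{
    return qsz != 0 && qsz <= 32768 && (qsz & (qsz - 1)) == 0;
}

/* Sizes in bytes, from the table above. */
static inline size_t desc_table_bytes(uint32_t qsz)
{
    return 16u * (size_t)qsz;
}

static inline size_t avail_ring_bytes(uint32_t qsz)
{
    return 6u + 2u * (size_t)qsz;
}

static inline size_t used_ring_bytes(uint32_t qsz)
{
    return 6u + 8u * (size_t)qsz;
}

/* Alignment is 16, 2 or 4 depending on the virtqueue part. */
static inline bool part_aligned(uint64_t phys_addr, unsigned alignment)
{
    return phys_addr % alignment == 0;
}
```

For a Queue Size of 256 this gives 4096 bytes for the descriptor table, 518 bytes for the available ring and 2054 bytes for the used ring.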

When the driver wants to send a buffer to the device, it fills in a slot in the descriptor table (or chains several together), and writes the descriptor index into the available ring. It then notifies the device. When the device has finished a buffer, it writes the descriptor index into the used ring, and sends an interrupt.

2.4.1 Driver Requirements: Virtqueues

The driver MUST ensure that the physical address of the first byte of each virtqueue part is a multiple of the specified alignment value in the above table.

2.4.2 Legacy Interfaces: A Note on Virtqueue Layout

For Legacy Interfaces, several additional restrictions are placed on the virtqueue layout:

Each virtqueue occupies two or more physically-contiguous pages (usually defined as 4096 bytes, but depending on the transport; henceforth referred to as Queue Align) and consists of three parts:

Descriptor Table Available Ring (…padding…) Used Ring

The bus-specific Queue Size field controls the total number of bytes for the virtqueue. When using the legacy interface, the transitional driver MUST retrieve the Queue Size field from the device and MUST allocate the total number of bytes for the virtqueue according to the following formula (Queue Align given in qalign and Queue Size given in qsz):

/* Round x up to the next multiple of qalign (a power of 2). */
#define ALIGN(x) (((x) + qalign - 1) & ~(qalign - 1))

static inline unsigned virtq_size(unsigned int qsz)
{
        return ALIGN(sizeof(struct virtq_desc) * qsz + sizeof(u16) * (3 + qsz))
             + ALIGN(sizeof(u16) * 3 + sizeof(struct virtq_used_elem) * qsz);
}

This wastes some space with padding. When using the legacy interface, both transitional devices and drivers MUST use the following virtqueue layout structure to locate elements of the virtqueue:

struct virtq {
        // The actual descriptors (16 bytes each)
        struct virtq_desc desc[ Queue Size ];

        // A ring of available descriptor heads with free-running index.
        struct virtq_avail avail;

        // Padding to the next Queue Align boundary.
        u8 pad[ Padding ];

        // A ring of used descriptor heads with free-running index.
        struct virtq_used used;
};
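As a worked example of the legacy layout, the sketch below computes the byte offset of each part and the total allocation. It assumes that ALIGN is meant to round up to the next multiple of Queue Align and that Queue Align is a power of 2; struct legacy_layout and both function names are hypothetical.

```c
#include <stddef.h>

/* Round x up to a multiple of align (align must be a power of 2). */
static inline size_t align_up(size_t x, size_t align)
{
    return (x + align - 1) & ~(align - 1);
}

struct legacy_layout {
    size_t desc_off;    /* descriptor table */
    size_t avail_off;   /* available ring */
    size_t used_off;    /* used ring, after padding */
    size_t total;       /* bytes to allocate */
};

static struct legacy_layout legacy_virtq_layout(size_t qsz, size_t qalign)
{
    struct legacy_layout l;

    l.desc_off = 0;
    l.avail_off = 16 * qsz;   /* 16 bytes per descriptor */
    /* Available ring: three u16 fields plus qsz u16 entries. */
    l.used_off = align_up(l.avail_off + 2 * (3 + qsz), qalign);
    /* Used ring: three u16 fields plus qsz 8-byte elements. */
    l.total = l.used_off + align_up(2 * 3 + 8 * qsz, qalign);
    return l;
}
```

With a Queue Size of 256 and a Queue Align of 4096, the available ring starts at offset 4096, the used ring at offset 8192, and the total is 12288 bytes (three pages).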

2.4.3 Legacy Interfaces: A Note on Virtqueue Endianness

Note that when using the legacy interface, transitional devices and drivers MUST use the native endian of the guest as the endian of fields in the virtqueue. This is opposed to little-endian for the non-legacy interface as specified by this standard. It is assumed that the host is already aware of the guest endian.

2.4.4 Message Framing

The framing of messages with descriptors is independent of the contents of the buffers. For example, a network transmit buffer consists of a 12 byte header followed by the network packet. This could be most simply placed in the descriptor table as a 12 byte output descriptor followed by a 1514 byte output descriptor, but it could also consist of a single 1526 byte output descriptor in the case where the header and packet are adjacent, or even three or more descriptors (possibly with loss of efficiency in that case).

Note that some device implementations have large-but-reasonable restrictions on total descriptor size (such as based on IOV_MAX in the host OS). This has not been a problem in practice: little sympathy will be given to drivers which create unreasonably-sized descriptors such as by dividing a network packet into 1500 single-byte descriptors!

2.4.4.1 Device Requirements: Message Framing

The device MUST NOT make assumptions about the particular arrangement of descriptors. The device MAY have a reasonable limit of descriptors it will allow in a chain.

2.4.4.2 Driver Requirements: Message Framing

The driver MUST place any device-writable descriptor elements after any device-readable descriptor elements.

The driver SHOULD NOT use an excessive number of descriptors to describe a buffer.

2.4.4.3 Legacy Interface: Message Framing

Regrettably, initial driver implementations used simple layouts, and devices came to rely on them, despite this specification’s wording. In addition, the specification for virtio_blk SCSI commands required intuiting field lengths from frame boundaries (see 5.2.6.3 Legacy Interface: Device Operation).

Thus when using the legacy interface, the VIRTIO_F_ANY_LAYOUT feature indicates to both the device and the driver that no assumptions were made about framing. Requirements for transitional drivers when this is not negotiated are included in each device section.

2.4.5 The Virtqueue Descriptor Table

The descriptor table refers to the buffers the driver is using for the device. addr is a physical address, and the buffers can be chained via next. Each descriptor describes a buffer which is read-only for the device (“device-readable”) or write-only for the device (“device-writable”), but a chain of descriptors can contain both device-readable and device-writable buffers.

The actual contents of the memory offered to the device depends on the device type. Most common is to begin the data with a header (containing little-endian fields) for the device to read, and postfix it with a status tailer for the device to write.

struct virtq_desc {
        /* Address (guest-physical). */
        le64 addr;
        /* Length. */
        le32 len;

/* This marks a buffer as continuing via the next field. */
#define VIRTQ_DESC_F_NEXT       1
/* This marks a buffer as device write-only (otherwise device read-only). */
#define VIRTQ_DESC_F_WRITE      2
/* This means the buffer contains a list of buffer descriptors. */
#define VIRTQ_DESC_F_INDIRECT   4
        /* The flags as indicated above. */
        le16 flags;
        /* Next field if flags & NEXT */
        le16 next;
};

The number of descriptors in the table is defined by the queue size for this virtqueue: this is the maximum possible descriptor chain length. Note: The legacy [Virtio PCI Draft] referred to this structure as vring_desc, and the constants as VRING_DESC_F_NEXT, etc, but the layout and values were identical.
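To illustrate how descriptors are chained, the following sketch links a device-readable request buffer to a device-writable status buffer. It assumes a little-endian host so that the le* fields can be declared as plain integers; build_request_chain and its parameters are hypothetical names for this example.

```c
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT  1
#define VIRTQ_DESC_F_WRITE 2

/* Assumes a little-endian host, so le* fields are plain integers. */
struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Fill two slots of the descriptor table: a device-readable buffer
 * chained to a device-writable one. Returns the head index, which
 * is what would later be placed in the available ring. */
static uint16_t build_request_chain(struct virtq_desc *table,
                                    uint16_t req_slot, uint16_t status_slot,
                                    uint64_t req_addr, uint32_t req_len,
                                    uint64_t status_addr, uint32_t status_len)
{
    table[req_slot] = (struct virtq_desc){
        .addr = req_addr, .len = req_len,
        .flags = VIRTQ_DESC_F_NEXT, .next = status_slot,
    };
    table[status_slot] = (struct virtq_desc){
        .addr = status_addr, .len = status_len,
        .flags = VIRTQ_DESC_F_WRITE, .next = 0,
    };
    return req_slot;
}
```

The device-writable descriptor comes last in the chain, and the slots used need not be adjacent or in increasing order.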

2.4.5.1 Device Requirements: The Virtqueue Descriptor Table

A device MUST NOT write to a device-readable buffer, and a device SHOULD NOT read a device-writable buffer (it MAY do so for debugging or diagnostic purposes).

2.4.5.2 Driver Requirements: The Virtqueue Descriptor Table

Drivers MUST NOT add a descriptor chain longer than 2^32 bytes in total; this implies that loops in the descriptor chain are forbidden!
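One way to check this requirement is to walk the chain while bounding the number of steps by the queue size, since a chain that visits more descriptors than the table holds must contain a loop. The sketch below assumes a little-endian host and non-indirect descriptors; chain_total_len is a hypothetical helper.

```c
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT 1

/* Assumes a little-endian host, so le* fields are plain integers. */
struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Walk a chain starting at head. On success returns 0 and stores the
 * summed length in *total. Returns -1 if an index is out of range,
 * the chain loops, or the total exceeds 2^32 bytes. */
static int chain_total_len(const struct virtq_desc *table, uint16_t qsz,
                           uint16_t head, uint64_t *total)
{
    uint64_t sum = 0;
    uint16_t i = head;

    for (uint16_t seen = 0; seen < qsz; seen++) {
        if (i >= qsz)
            return -1;
        sum += table[i].len;
        if (!(table[i].flags & VIRTQ_DESC_F_NEXT)) {
            if (sum > ((uint64_t)1 << 32))
                return -1;
            *total = sum;
            return 0;
        }
        i = table[i].next;
    }
    /* qsz descriptors visited without reaching the end: a loop. */
    return -1;
}
```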

2.4.5.3 Indirect Descriptors

Some devices benefit by concurrently dispatching a large number of large requests. The VIRTIO_F_INDIRECT_DESC feature allows this (see A virtio_queue.h). To increase ring capacity the driver can store a table of indirect descriptors anywhere in memory, and insert a descriptor in the main virtqueue (with flags&VIRTQ_DESC_F_INDIRECT on) that refers to a memory buffer containing this indirect descriptor table; addr and len refer to the indirect table address and length in bytes, respectively.

The indirect table layout structure looks like this (len is the length of the descriptor that refers to this table, which is a variable, so this code won’t compile):

struct indirect_descriptor_table {
        /* The actual descriptors (16 bytes each) */
        struct virtq_desc desc[len / 16];
};

The first indirect descriptor is located at the start of the indirect descriptor table (index 0); additional indirect descriptors are chained by next. An indirect descriptor without a valid next (with flags&VIRTQ_DESC_F_NEXT off) signals the end of the descriptor. A single indirect descriptor table can include both device-readable and device-writable descriptors.

2.4.5.3.1 Driver Requirements: Indirect Descriptors

A driver MUST NOT create a descriptor chain longer than the Queue Size of the device.

A driver MUST NOT set both VIRTQ_DESC_F_INDIRECT and VIRTQ_DESC_F_NEXT in flags.
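The relationship between the referring descriptor and the indirect table can be sketched as follows. The helper names are hypothetical, and a little-endian host is assumed so that the le* fields can be plain integers.

```c
#include <stdint.h>

#define VIRTQ_DESC_F_INDIRECT 4

/* Assumes a little-endian host, so le* fields are plain integers. */
struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Make a main-virtqueue descriptor refer to an indirect table that
 * holds n descriptors at guest-physical address table_addr. */
static void make_indirect_ref(struct virtq_desc *slot,
                              uint64_t table_addr, uint32_t n)
{
    slot->addr = table_addr;
    slot->len = n * (uint32_t)sizeof(struct virtq_desc); /* in bytes */
    slot->flags = VIRTQ_DESC_F_INDIRECT;  /* not combined with NEXT */
    slot->next = 0;
}

/* Number of descriptors in the table a descriptor refers to. */
static uint32_t indirect_count(const struct virtq_desc *slot)
{
    return slot->len / 16;
}
```

An indirect table of three descriptors is therefore referenced with a len of 48 bytes.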

2.4.5.3.2 Device Requirements: Indirect Descriptors

flags

The device MUST handle the case of zero or more normal chained descriptors followed by a single descriptor with flags&VIRTQ_DESC_F_INDIRECT. Note: While unusual (most implementations either create a chain solely using non-indirect descriptors, or use a single indirect element), such a layout is valid.

2.4.6 The Virtqueue Available Ring

struct virtq_avail {
#define VIRTQ_AVAIL_F_NO_INTERRUPT      1
        le16 flags;
        le16 idx;
        le16 ring[ /* Queue Size */ ];
        le16 used_event; /* Only if VIRTIO_F_EVENT_IDX */
};

The driver uses the available ring to offer buffers to the device: each ring entry refers to the head of a descriptor chain. It is only written by the driver and read by the device.

idx field indicates where the driver would put the next descriptor entry in the ring (modulo the queue size). This starts at 0, and increases. Note: The legacy [Virtio PCI Draft] referred to this structure as vring_avail, and the constant as VRING_AVAIL_F_NO_INTERRUPT, but the layout and value were identical.
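The free-running idx and its modulo use can be sketched as follows. This is an illustration only: EXAMPLE_QSZ and avail_publish are hypothetical, a little-endian host is assumed, and the release fence stands in for whatever memory barrier a real driver uses to make the ring entry visible before the index update.

```c
#include <stdint.h>
#include <stdatomic.h>

#define EXAMPLE_QSZ 8   /* hypothetical queue size for this example */

/* Assumes a little-endian host, so le* fields are plain integers. */
struct virtq_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[EXAMPLE_QSZ];
    uint16_t used_event;
};

/* Offer the descriptor chain starting at head to the device. */
static void avail_publish(struct virtq_avail *avail, uint16_t head)
{
    /* idx is free-running; only its value modulo the queue size
     * selects the ring slot. */
    avail->ring[avail->idx % EXAMPLE_QSZ] = head;

    /* Order the ring write before the idx update. */
    atomic_thread_fence(memory_order_release);

    /* 16-bit arithmetic: 65535 + 1 wraps to 0. */
    avail->idx = (uint16_t)(avail->idx + 1);
}
```

Because the queue size is a power of 2, the wrap of the 16-bit idx from 65535 to 0 does not disturb the slot sequence: slot 7 is followed by slot 0.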

2.4.7 Virtqueue Interrupt Suppression

If the VIRTIO_F_EVENT_IDX feature bit is not negotiated, the flags field in the available ring offers a crude mechanism for the driver to inform the device that it doesn’t want interrupts when buffers are used. Otherwise used_event is a more performant alternative where the driver specifies how far the device can progress before interrupting.

Neither of these interrupt suppression methods is reliable, as they are not synchronized with the device, but they serve as useful optimizations.

2.4.7.1 Driver Requirements: Virtqueue Interrupt Suppression

If the VIRTIO_F_EVENT_IDX feature bit is not negotiated:

The driver MUST set flags to 0 or 1.

The driver MAY set flags to 1 to advise the device that interrupts are not needed.

Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated:

The driver MUST set flags to 0.

The driver MAY use used_event to advise the device that interrupts are unnecessary until the device writes an entry with an index specified by used_event into the used ring (equivalently, until idx in the used ring will reach the value used_event + 1).

The driver MUST handle spurious interrupts from the device.

2.4.7.2 Device Requirements: Virtqueue Interrupt Suppression

If the VIRTIO_F_EVENT_IDX feature bit is not negotiated:

The device MUST ignore the used_event value.

After the device writes a descriptor index into the used ring:

If flags is 1, the device SHOULD NOT send an interrupt.

If flags is 0, the device MUST send an interrupt.



Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated:

The device MUST ignore the lower bit of flags.

After the device writes a descriptor index into the used ring:

If the idx field in the used ring (which determined where that descriptor index was placed) was equal to used_event, the device MUST send an interrupt.

Otherwise the device SHOULD NOT send an interrupt.
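The two sets of device rules above can be summarised in a small decision function. This is a sketch, not a normative definition: device_should_interrupt and need_event are hypothetical names. The second helper is a common wrap-safe generalisation for a device that writes several used entries before deciding whether to interrupt; it is included as an assumption about how batching is typically handled.

```c
#include <stdint.h>
#include <stdbool.h>

#define VIRTQ_AVAIL_F_NO_INTERRUPT 1

/* Decide after writing one used entry. old_used_idx is the used ring
 * idx that determined where that entry was placed. */
static bool device_should_interrupt(bool event_idx_negotiated,
                                    uint16_t avail_flags,
                                    uint16_t used_event,
                                    uint16_t old_used_idx)
{
    if (!event_idx_negotiated)
        return !(avail_flags & VIRTQ_AVAIL_F_NO_INTERRUPT);

    /* With VIRTIO_F_EVENT_IDX, flags is ignored. */
    return old_used_idx == used_event;
}

/* Batched form: true if event_idx lies among the entries written
 * while idx advanced from old_idx to new_idx (modulo 2^16). */
static bool need_event(uint16_t event_idx, uint16_t new_idx,
                       uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) <
           (uint16_t)(new_idx - old_idx);
}
```

The unsigned 16-bit subtraction makes the batched comparison work across the wrap from 65535 to 0.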



Note:

used_event

2.4.8 The Virtqueue Used Ring

struct virtq_used {
#define VIRTQ_USED_F_NO_NOTIFY  1
        le16 flags;
        le16 idx;
        struct virtq_used_elem ring[ /* Queue Size */ ];
        le16 avail_event; /* Only if VIRTIO_F_EVENT_IDX */
};

/* le32 is used here for ids for padding reasons. */
struct virtq_used_elem {
        /* Index of start of used descriptor chain. */
        le32 id;
        /* Total length of the descriptor chain which was used (written to) */
        le32 len;
};

The used ring is where the device returns buffers once it is done with them: it is only written to by the device, and read by the driver.

Each entry in the ring is a pair: id indicates the head entry of the descriptor chain describing the buffer (this matches an entry placed in the available ring by the guest earlier), and len is the total number of bytes written into the buffer. Note: len is particularly useful for drivers using untrusted buffers: if a driver does not know exactly how much has been written by the device, the driver would have to zero the buffer in advance to ensure no data leakage occurs.

For example, a network driver may hand a received buffer directly to an unprivileged userspace application. If the network device has not overwritten the bytes which were in that buffer, this could leak the contents of freed memory from other processes to the application.

The idx field indicates where the device would put the next descriptor entry in the ring (modulo the queue size). This starts at 0, and increases.

Note: The legacy [Virtio PCI Draft] referred to these structures as vring_used and vring_used_elem, and the constant as VRING_USED_F_NO_NOTIFY, but the layout and value were identical.

2.4.8.1 Legacy Interface: The Virtqueue Used Ring

Historically, many drivers ignored the len value; as a result, many devices set len incorrectly. Thus, when using the legacy interface, it is generally a good idea to ignore the len value in used ring entries if possible. Specific known issues are listed per device type.

2.4.8.2 Device Requirements: The Virtqueue Used Ring

The device MUST set len prior to updating the used idx.

The device MUST write at least len bytes to the descriptor, beginning at the first device-writable buffer, prior to updating the used idx.

The device MAY write more than len bytes to the descriptor.

Note: There are potential error cases where a device might not know what parts of the buffers have been written. This is why len is permitted to be an underestimate: that’s preferable to the driver believing that uninitialized memory has been overwritten when it has not.

2.4.8.3 Driver Requirements: The Virtqueue Used Ring

The driver MUST NOT make assumptions about data in device-writable buffers beyond the first len bytes, and SHOULD ignore this data.
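As an illustration of this requirement, a driver that forwards device-written data to a less trusted consumer might copy only the first len bytes. The helper below is a hypothetical sketch; the clamp against the posted buffer size is an extra defensive assumption, not something this section mandates.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy only the bytes the device reported as written.
 * used_len comes from virtq_used_elem.len; buf_size is the size of the
 * device-writable buffer the driver originally posted. */
size_t copy_used_bytes(void *dst, const void *buf, size_t buf_size,
                       uint32_t used_len)
{
    /* Clamp defensively in case the reported length exceeds the buffer. */
    size_t n = used_len < buf_size ? (size_t)used_len : buf_size;

    memcpy(dst, buf, n);
    return n;   /* bytes beyond n are never exposed to the consumer */
}
```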

2.4.9 Virtqueue Notification Suppression

The device can suppress notifications in a manner analogous to the way drivers can suppress interrupts as detailed in section 2.4.7. The device manipulates flags or avail_event in the used ring the same way the driver manipulates flags or used_event in the available ring.

2.4.9.1 Driver Requirements: Virtqueue Notification Suppression

The driver MUST initialize flags in the used ring to 0 when allocating the used ring.

If the VIRTIO_F_EVENT_IDX feature bit is not negotiated:

The driver MUST ignore the avail_event value.

After the driver writes a descriptor index into the available ring:

    If flags is 1, the driver SHOULD NOT send a notification.

    If flags is 0, the driver MUST send a notification.



Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated:

The driver MUST ignore the lower bit of flags.

After the driver writes a descriptor index into the available ring:

    If the idx field in the available ring (which determined where that descriptor index was placed) was equal to avail_event, the driver MUST send a notification.

    Otherwise the driver SHOULD NOT send a notification.



2.4.9.2 Device Requirements: Virtqueue Notification Suppression

If the VIRTIO_F_EVENT_IDX feature bit is not negotiated:

The device MUST set flags to 0 or 1.

The device MAY set flags to 1 to advise the driver that notifications are not needed.

Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated:

The device MUST set flags to 0.

The device MAY use avail_event to advise the driver that notifications are unnecessary until the driver writes an entry with an index specified by avail_event into the available ring (equivalently, until idx in the available ring reaches the value avail_event + 1).

The device MUST handle spurious notifications from the driver.

2.4.10 Helpers for Operating Virtqueues

The Linux kernel source code contains the definitions above and helper routines in a more usable form, in include/uapi/linux/virtio_ring.h. This was explicitly licensed by IBM and Red Hat under the (3-clause) BSD license so that it can be freely used by all other projects, and is reproduced (with slight variation) in Appendix A, virtio_queue.h.

3 General Initialization And Device Operation

3.1 Device Initialization

3.1.1 Driver Requirements: Device Initialization

The driver MUST follow this sequence to initialize a device:

1. Reset the device.

2. Set the ACKNOWLEDGE status bit: the guest OS has noticed the device.

3. Set the DRIVER status bit: the guest OS knows how to drive the device.

4. Read device feature bits, and write the subset of feature bits understood by the OS and driver to the device. During this step the driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device before accepting it.

5. Set the FEATURES_OK status bit. The driver MUST NOT accept new feature bits after this step.

6. Re-read device status to ensure the FEATURES_OK bit is still set: otherwise, the device does not support our subset of features and the device is unusable.

7. Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup, reading and possibly writing the device’s virtio configuration space, and population of virtqueues.

8. Set the DRIVER_OK status bit. At this point the device is “live”.

If any of these steps go irrecoverably wrong, the driver SHOULD set the FAILED status bit to indicate that it has given up on the device (it can reset the device later to restart if desired). The driver MUST NOT continue initialization in that case.
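The sequence can be sketched against an in-memory stand-in for a device. Everything named mock_* below, the required_features field, and driver_init itself are hypothetical; the status bit values are those listed in section 2.1. A real driver would use transport-specific accessors and perform actual device-specific setup in step 7.

```c
#include <stdbool.h>
#include <stdint.h>

/* Status bit values from section 2.1 of this specification. */
#define VIRTIO_STATUS_ACKNOWLEDGE 1u
#define VIRTIO_STATUS_DRIVER      2u
#define VIRTIO_STATUS_DRIVER_OK   4u
#define VIRTIO_STATUS_FEATURES_OK 8u
#define VIRTIO_STATUS_FAILED      128u

/* Hypothetical mock device. */
struct mock_device {
    uint8_t  status;
    uint64_t device_features;    /* bits offered by the device */
    uint64_t required_features;  /* bits the mock insists on */
    uint64_t driver_features;    /* bits accepted by the driver */
};

void mock_write_status(struct mock_device *d, uint8_t s)
{
    if (s == 0) {                        /* writing 0 resets the device */
        d->status = 0;
        d->driver_features = 0;
        return;
    }
    /* The mock refuses FEATURES_OK when a required bit was not accepted. */
    if ((s & VIRTIO_STATUS_FEATURES_OK) &&
        (d->driver_features & d->required_features) != d->required_features)
        s &= (uint8_t)~VIRTIO_STATUS_FEATURES_OK;
    d->status = s;
}

static void add_status(struct mock_device *d, uint8_t bit)
{
    mock_write_status(d, (uint8_t)(d->status | bit));
}

/* Returns true once the device is live, false after setting FAILED. */
bool driver_init(struct mock_device *d, uint64_t supported)
{
    mock_write_status(d, 0);                              /* step 1 */
    add_status(d, VIRTIO_STATUS_ACKNOWLEDGE);             /* step 2 */
    add_status(d, VIRTIO_STATUS_DRIVER);                  /* step 3 */
    d->driver_features = d->device_features & supported;  /* step 4 */
    add_status(d, VIRTIO_STATUS_FEATURES_OK);             /* step 5 */
    if (!(d->status & VIRTIO_STATUS_FEATURES_OK)) {       /* step 6 */
        add_status(d, VIRTIO_STATUS_FAILED);
        return false;                    /* do not continue initialization */
    }
    /* step 7: device-specific setup would go here */
    add_status(d, VIRTIO_STATUS_DRIVER_OK);               /* step 8 */
    return true;
}
```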

The driver MUST NOT notify the device before setting DRIVER_OK.

3.1.2 Legacy Interface: Device Initialization

Legacy devices did not support the FEATURES_OK status bit, and thus did not have a graceful way for the device to indicate unsupported feature combinations. They also did not provide a clear mechanism to end feature negotiation, which meant that devices finalized features on first-use, and no features could be introduced which radically changed the initial operation of the device.

Legacy driver implementations often used the device before setting the DRIVER_OK bit, and sometimes even before writing the feature bits to the device.

The result was that steps 5 and 6 were omitted, and steps 4, 7 and 8 were conflated.

Therefore, when using the legacy interface:

The transitional driver MUST execute the initialization sequence as described in 3.1 but omitting steps 5 and 6.

The transitional device MUST support the driver writing device configuration fields before step 4.

The transitional device MUST support the driver using the device before step 8.

3.2 Device Operation

There are two parts to device operation: supplying new buffers to the device, and processing used buffers from the device.

Note: As an example, the simplest virtio network device has two virtqueues: the transmit virtqueue and the receive virtqueue. The driver adds outgoing (device-readable) packets to the transmit virtqueue, and then frees them after they are used. Similarly, incoming (device-writable) buffers are added to the receive virtqueue, and processed after they are used.

3.2.1 Supplying Buffers to The Device

The driver offers buffers to one of the device’s virtqueues as follows:

1. The driver places the buffer into free descriptor(s) in the descriptor table, chaining as necessary (see 2.4.5 The Virtqueue Descriptor Table).

2. The driver places the index of the head of the descriptor chain into the next ring entry of the available ring.

3. Steps 1 and 2 MAY be performed repeatedly if batching is possible.

4. The driver performs a suitable memory barrier to ensure the device sees the updated descriptor table and available ring before the next step.

5. The available idx is increased by the number of descriptor chain heads added to the available ring.

6. The driver performs a suitable memory barrier to ensure that it updates the idx field before checking for notification suppression.

7. If notifications are not suppressed, the driver notifies the device of the new available buffers.

Note that the above code does not take precautions against the available ring buffer wrapping around: this is not possible since the ring buffer is the same size as the descriptor table, so step (1) will prevent such a condition.

In addition, the maximum queue size is 32768 (the highest power of 2 which fits in 16 bits), so the 16-bit idx value can always distinguish between a full and empty buffer.
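The full-versus-empty argument follows from unsigned 16-bit subtraction, which wraps modulo 65536. A small sketch, with a hypothetical helper name:

```c
#include <stdint.h>

/* Number of entries published but not yet consumed, given two
 * free-running 16-bit indices. The subtraction wraps modulo 65536, so the
 * result stays correct across wrap-around as long as the true distance
 * never exceeds the queue size (at most 32768). */
uint16_t virtq_pending(uint16_t idx, uint16_t last_seen)
{
    return (uint16_t)(idx - last_seen);
}
```

With the largest permitted queue, a full ring yields 32768 and an empty ring yields 0, so the two states never collide.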

What follows are the requirements of each stage in more detail.

3.2.1.1 Placing Buffers Into The Descriptor Table

A buffer consists of zero or more device-readable physically-contiguous elements followed by zero or more physically-contiguous device-writable elements (each has at least one element). This algorithm maps it into the descriptor table to form a descriptor chain:

for each buffer element, b:

1. Get the next free descriptor table entry, d.

2. Set d.addr to the physical address of the start of b.

3. Set d.len to the length of b.

4. If b is device-writable, set d.flags to VIRTQ_DESC_F_WRITE, otherwise 0.

5. If there is a buffer element after this:

    a. Set d.next to the index of the next free descriptor element.

    b. Set the VIRTQ_DESC_F_NEXT bit in d.flags.

In practice, d.next is usually used to chain free descriptors, and a separate count kept to check there are enough free descriptors before beginning the mappings.
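A minimal sketch of this mapping, assuming free descriptors are already chained through their next fields as described above. The descriptor layout and flag values follow section 2.4.5; struct buf_elem and virtq_map_buffer are hypothetical names, and little-endian conversion is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT  1
#define VIRTQ_DESC_F_WRITE 2

/* Layout from section 2.4.5; endianness conversion is omitted here. */
struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Hypothetical description of one physically-contiguous buffer element. */
struct buf_elem {
    uint64_t phys_addr;
    uint32_t len;
    int      device_writable;
};

/* Maps n elements into a descriptor chain. *free_head is the first free
 * descriptor; the caller has already checked that at least n descriptors
 * are free. Returns the chain head and advances *free_head. */
uint16_t virtq_map_buffer(struct virtq_desc *table, uint16_t *free_head,
                          const struct buf_elem *elems, size_t n)
{
    uint16_t head = *free_head;
    uint16_t i = head;

    for (size_t k = 0; k < n; k++) {
        struct virtq_desc *d = &table[i];
        uint16_t next_free = d->next;   /* free-list link */

        d->addr  = elems[k].phys_addr;
        d->len   = elems[k].len;
        d->flags = elems[k].device_writable ? VIRTQ_DESC_F_WRITE : 0;
        if (k + 1 < n)
            d->flags |= VIRTQ_DESC_F_NEXT;  /* d->next already links onward */
        i = next_free;
    }
    *free_head = i;
    return head;
}
```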

3.2.1.2 Updating The Available Ring

The descriptor chain head is the first d in the algorithm above, ie. the index of the descriptor table entry referring to the first part of the buffer. A naive driver implementation MAY do the following (with the appropriate conversion to-and-from little-endian assumed):

avail->ring[avail->idx % qsz] = head;

However, in general the driver MAY add many descriptor chains before it updates idx (at which point they become visible to the device), so it is common to keep a counter of how many the driver has added:

avail->ring[(avail->idx + added++) % qsz] = head;

3.2.1.3 Updating idx

idx always increments, and wraps naturally at 65536:

avail->idx += added;

Once available idx is updated by the driver, this exposes the descriptor and its contents. The device MAY access the descriptor chains the driver created and the memory they refer to immediately.
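Putting the ring update, the barrier and the idx update together, a batched publish might look like the sketch below. QSZ, the fixed-size struct and virtq_publish are illustrative assumptions; a C11 release fence stands in for whatever "suitable memory barrier" the platform requires, and little-endian conversion is omitted.

```c
#include <stdatomic.h>
#include <stdint.h>

#define QSZ 8   /* illustrative queue size; must be a power of 2 */

/* Available ring layout from section 2.4.6 with a fixed illustrative size. */
struct virtq_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[QSZ];
};

/* Hypothetical helper: expose `count` descriptor chain heads in one batch. */
void virtq_publish(struct virtq_avail *avail, const uint16_t *heads,
                   uint16_t count)
{
    uint16_t added = 0;

    while (added < count) {
        avail->ring[(uint16_t)(avail->idx + added) % QSZ] = heads[added];
        added++;
    }
    /* Ring entries must be visible before idx moves. */
    atomic_thread_fence(memory_order_release);
    avail->idx = (uint16_t)(avail->idx + added);   /* wraps at 65536 */
}
```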

3.2.1.3.1 Driver Requirements: Updating idx

idx

3.2.1.4 Notifying The Device

The actual method of device notification is bus-specific, but generally it can be expensive. So the device MAY suppress such notifications if it doesn’t need them, as detailed in section 2.4.9.

The driver has to be careful to expose the new idx value before checking if notifications are suppressed.
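The ordering concern can be sketched as follows: the idx store is followed by a full fence, and only then are the device-written suppression fields read. The function name and parameter layout are hypothetical, C11 fences stand in for platform barriers, and the flag value is the VIRTQ_USED_F_NO_NOTIFY constant from section 2.4.8.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_USED_F_NO_NOTIFY 1

/* Hypothetical helper, called after one descriptor chain head has been
 * written into the available ring slot selected by *avail_idx.
 * Returns true if the driver should notify the device. */
bool driver_expose_and_check(volatile uint16_t *avail_idx,
                             const volatile uint16_t *used_flags,
                             const volatile uint16_t *avail_event,
                             bool event_idx_negotiated)
{
    uint16_t slot_idx = *avail_idx;           /* slot just filled */

    atomic_thread_fence(memory_order_release);
    *avail_idx = (uint16_t)(slot_idx + 1);    /* expose the new entry */
    /* Full fence: idx must be visible before the reads below. */
    atomic_thread_fence(memory_order_seq_cst);

    if (event_idx_negotiated)
        return slot_idx == *avail_event;
    return (*used_flags & VIRTQ_USED_F_NO_NOTIFY) == 0;
}
```

Reading the suppression fields before exposing idx would open a window in which the device re-enables notifications, finds no new buffers, and then waits for a notification that never arrives.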

3.2.1.4.1 Driver Requirements: Notifying The Device

flags

avail_event

3.2.2 Receiving Used Buffers From The Device

Once the device has used buffers referred to by a descriptor (read from or written to them, or parts of both, depending on the nature of the virtqueue and the device), it interrupts the driver as detailed in section 2.4.7.

Note: For optimal performance, a driver MAY disable interrupts while processing the used ring, but beware the problem of missing interrupts between emptying the ring and reenabling interrupts. This is usually handled by re-checking for more used buffers after interrupts are re-enabled:

virtq_disable_interrupts(vq);

for (;;) {
        if (vq->last_seen_used == le16_to_cpu(vq->used->idx)) {
                /* Ring looks empty: re-enable interrupts, then re-check. */
                virtq_enable_interrupts(vq);
                mb();

                if (vq->last_seen_used == le16_to_cpu(vq->used->idx))
                        break;

                /* A buffer arrived in the window: keep processing. */
                virtq_disable_interrupts(vq);
        }

        struct virtq_used_elem *e = &vq->used->ring[vq->last_seen_used % qsz];
        process_buffer(e);
        vq->last_seen_used++;
}

3.2.3 Notification of Device Configuration Changes

For devices where the device-specific configuration information can be changed, an interrupt is delivered when a device-specific configuration change occurs.

In addition, this interrupt is triggered by the device setting DEVICE_NEEDS_RESET (see 2.1.2).

3.3 Device Cleanup

Once the driver has set the DRIVER_OK status bit, all the configured virtqueues of the device are considered live. None of the virtqueues of a device are live once the device has been reset.

3.3.1 Driver Requirements: Device Cleanup

A driver MUST NOT alter descriptor table entries which have been exposed in the available ring (and not marked consumed by the device in the used ring) of a live virtqueue.

A driver MUST NOT decrement the available idx on a live virtqueue (ie. there is no way to “unexpose” buffers).

Thus a driver MUST ensure a virtqueue isn’t live (by device reset) before removing exposed buffers.

4 Virtio Transport Options

4.1 Virtio Over PCI Bus

Virtio devices are commonly implemented as PCI devices.

A Virtio device can be implemented as any kind of PCI device: a Conventional PCI device or a PCI Express device. To assure designs meet the latest level requirements, see the PCI-SIG home page at http://www.pcisig.com for any approved changes.

4.1.1 Device Requirements: Virtio Over PCI Bus

A Virtio device using Virtio Over PCI Bus MUST expose to the guest an interface that meets the specification requirements of the appropriate PCI specification: [PCI] and [PCIe] respectively.

4.1.2 PCI Device Discovery

Any PCI device with PCI Vendor ID 0x1AF4, and PCI Device ID 0x1000 through 0x107F inclusive is a virtio device. The actual value within this range indicates which virtio device is supported by the device. The PCI Device ID is calculated by adding 0x1040 to the Virtio Device ID, as indicated in section 5. Additionally, devices MAY utilize a Transitional PCI Device ID range, 0x1000 to 0x103F depending on the device type.

4.1.2.1 Device Requirements: PCI Device Discovery

Devices MUST have the PCI Vendor ID 0x1AF4. Devices MUST either have the PCI Device ID calculated by adding 0x1040 to the Virtio Device ID, as indicated in section 5 or have the Transitional PCI Device ID depending on the device type, as follows:

Transitional PCI Device ID    Virtio Device
0x1000                        network card
0x1001                        block device
0x1002                        memory ballooning (traditional)
0x1003                        console
0x1004                        SCSI host
0x1005                        entropy source
0x1009                        9P transport

For example, the network card device with the Virtio Device ID 1 has the PCI Device ID 0x1041 or the Transitional PCI Device ID 0x1000.

The PCI Subsystem Vendor ID and the PCI Subsystem Device ID MAY reflect the PCI Vendor and Device ID of the environment (for informational purposes by the driver).

Non-transitional devices SHOULD have a PCI Device ID in the range 0x1040 to 0x107f. Non-transitional devices SHOULD have a PCI Revision ID of 1 or higher. Non-transitional devices SHOULD have a PCI Subsystem Device ID of 0x40 or higher.

This is to reduce the chance of a legacy driver attempting to drive the device.

4.1.2.2 Driver Requirements: PCI Device Discovery

Drivers MUST match devices with the PCI Vendor ID 0x1AF4 and the PCI Device ID in the range 0x1040 to 0x107f, calculated by adding 0x1040 to the Virtio Device ID, as indicated in section 5. Drivers for device types listed in section 4.1.2 MUST match devices with the PCI Vendor ID 0x1AF4 and the Transitional PCI Device ID indicated in section 4.1.2.

Drivers MUST match any PCI Revision ID value. Drivers MAY match any PCI Subsystem Vendor ID and any PCI Subsystem Device ID value.

4.1.2.3 Legacy Interfaces: A Note on PCI Device Discovery

Transitional devices MUST have a PCI Revision ID of 0. Transitional devices MUST have the PCI Subsystem Device ID matching the Virtio Device ID, as indicated in section 5. Transitional devices MUST have the Transitional PCI Device ID in the range 0x1000 to 0x103f.

This is to match legacy drivers.
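The ID ranges in this section can be sketched as one matching function. The helper name and signature are hypothetical; for the transitional range the sketch relies on the rule above that the PCI Subsystem Device ID carries the Virtio Device ID.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_PCI_VENDOR_ID 0x1AF4

/* Hypothetical helper: returns true if the PCI IDs identify a virtio
 * device and stores its Virtio Device ID in *virtio_id. */
bool virtio_pci_match(uint16_t vendor, uint16_t device,
                      uint16_t subsystem_device, uint16_t *virtio_id)
{
    if (vendor != VIRTIO_PCI_VENDOR_ID)
        return false;
    if (device >= 0x1040 && device <= 0x107F) {
        /* PCI Device ID = 0x1040 + Virtio Device ID */
        *virtio_id = (uint16_t)(device - 0x1040);
        return true;
    }
    if (device >= 0x1000 && device <= 0x103F) {
        /* Transitional range: ID taken from the Subsystem Device ID. */
        *virtio_id = subsystem_device;
        return true;
    }
    return false;
}
```

The PCI Revision ID is deliberately not examined, since drivers are required to match any value.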

4.1.3 PCI Device Layout

The device is configured via I/O and/or memory regions (though see 4.1.4.7 for access via the PCI configuration space), as specified by Virtio Structure PCI Capabilities.

Fields of different sizes are present in the device configuration regions. All 64-bit, 32-bit and 16-bit fields are little-endian. 64-bit fields are to be treated as two 32-bit fields, with the low 32-bit part followed by the high 32-bit part.

4.1.3.1 Driver Requirements: PCI Device Layout

For device configuration access, the driver MUST use 8-bit wide accesses for 8-bit wide fields, 16-bit wide and aligned accesses for 16-bit wide fields and 32-bit wide and aligned accesses for 32-bit and 64-bit wide fields. For 64-bit fields, the driver MAY access each of the high and low 32-bit parts of the field independently.

4.1.3.2 Device Requirements: PCI Device Layout

For 64-bit device configuration fields, the device MUST allow the driver independent access to the high and low 32-bit parts of the field.
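A sketch of a 64-bit field accessed as two aligned 32-bit halves, low half at the lower address. The accessor names are hypothetical, and the code assumes a little-endian host so that no byte swapping is shown; a real driver would use its platform's MMIO or port I/O primitives.

```c
#include <stdint.h>

/* Hypothetical 32-bit accessor standing in for a platform MMIO write. */
void reg_write32(volatile uint32_t *reg, uint32_t v)
{
    *reg = v;
}

/* Write a 64-bit configuration field as two 32-bit accesses. */
void cfg_write64(volatile uint32_t *field, uint64_t v)
{
    reg_write32(&field[0], (uint32_t)(v & 0xFFFFFFFFu));  /* low part */
    reg_write32(&field[1], (uint32_t)(v >> 32));          /* high part */
}

/* Read it back the same way. */
uint64_t cfg_read64(const volatile uint32_t *field)
{
    uint64_t lo = field[0];
    uint64_t hi = field[1];

    return lo | (hi << 32);
}
```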

4.1.4 Virtio Structure PCI Capabilities

The virtio device configuration layout includes several structures:

Common configuration

Notifications

ISR Status

Device-specific configuration (optional)

PCI configuration access

Each structure can be mapped by a Base Address register (BAR) belonging to the function, or accessed via the special VIRTIO_PCI_CAP_PCI_CFG field in the PCI configuration space.

The location of each structure is specified using a vendor-specific PCI capability located on the capability list in PCI configuration space of the device. This virtio structure capability uses little-endian format; all fields are read-only for the driver unless stated otherwise:

struct virtio_pci_cap {
        u8 cap_vndr;    /* Generic PCI field: PCI_CAP_ID_VNDR */
        u8 cap_next;    /* Generic PCI field: next ptr. */
        u8 cap_len;     /* Generic PCI field: capability length */
        u8 cfg_type;    /* Identifies the structure. */
        u8 bar;         /* Where to find it. */
        u8 padding[3];  /* Pad to full dword. */
        le32 offset;    /* Offset within bar. */
        le32 length;    /* Length of the structure, in bytes. */
};

This structure can be followed by extra data, depending on cfg_type, as documented below.

The fields are interpreted as follows:

cap_vndr
        0x09; identifies a vendor-specific capability.

cap_next
        Link to the next capability in the capability list in the PCI configuration space.

cap_len
        Length of this capability structure, including the whole of struct virtio_pci_cap, and extra data if any. This length MAY include padding, or fields unused by the driver.

cfg_type
        Identifies the structure, according to the following table:

        /* Common configuration */
        #define VIRTIO_PCI_CAP_COMMON_CFG 1
        /* Notifications */
        #define VIRTIO_PCI_CAP_NOTIFY_CFG 2
        /* ISR Status */
        #define VIRTIO_PCI_CAP_ISR_CFG 3
        /* Device specific configuration */
        #define VIRTIO_PCI_CAP_DEVICE_CFG 4
        /* PCI configuration access */
        #define VIRTIO_PCI_CAP_PCI_CFG 5

        Any other value is reserved for future use. Each structure is detailed individually below.

        The device MAY offer more than one structure of any type - this makes it possible for the device to expose multiple interfaces to drivers. The order of the capabilities in the capability list specifies the order of preference suggested by the device.

        Note: For example, on some hypervisors, notifications using I/O accesses are faster than memory accesses. In this case, the device would expose two capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG: the first one addressing an I/O BAR, the second one addressing a memory BAR. In this example, the driver would use the I/O BAR if I/O resources are available, and fall back on the memory BAR when I/O resources are unavailable.

bar
        Values 0x0 to 0x5 specify a Base Address register (BAR) belonging to the function located beginning at 10h in PCI Configuration Space and used to map the structure into Memory or I/O Space. The BAR is permitted to be either 32-bit or 64-bit, and it can map Memory Space or I/O Space. Any other value is reserved for future use.

offset
        Indicates where the structure begins relative to the base address associated with the BAR. The alignment requirements of offset are indicated in each structure-specific section below.

length
        Indicates the length of the structure. length MAY include padding, or fields unused by the driver, or future extensions.

        Note: For example, a future device might present a large structure size of several MBytes. As current devices never utilize structures larger than 4KBytes in size, a driver MAY limit the mapped structure size to e.g. 4KBytes (thus ignoring parts of the structure after the first 4KBytes) to allow forward compatibility with such devices without loss of functionality and without wasting resources.

4.1.4.1 Driver Requirements: Virtio Structure PCI Capabilities

The driver MUST ignore any vendor-specific capability structure which has a reserved cfg_type value.

The driver SHOULD use the first instance of each virtio structure type it can support.

The driver MUST accept a cap_len value which is larger than specified here.

The driver MUST ignore any vendor-specific capability structure which has a reserved bar value.

The driver SHOULD map only the part of the configuration structure that is large enough for device operation. The driver MUST handle an unexpectedly large length, but MAY check that length is large enough for device operation.

The driver MUST NOT write into any field of the capability structure, with the exception of those with cfg_type VIRTIO_PCI_CAP_PCI_CFG as detailed in 4.1.4.7.2.
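The driver requirements above can be sketched as a scan over a snapshot of PCI configuration space. The capabilities pointer at offset 0x34 is the standard PCI location; struct found_cap and virtio_pci_scan_caps are hypothetical names, and checking the Capabilities List bit in the PCI Status register is omitted for brevity.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_CAPABILITY_LIST       0x34  /* standard PCI capabilities pointer */
#define PCI_CAP_ID_VNDR           0x09
#define VIRTIO_PCI_CAP_COMMON_CFG 1
#define VIRTIO_PCI_CAP_PCI_CFG    5

/* Decoded form of struct virtio_pci_cap (hypothetical name). */
struct found_cap {
    uint8_t  present;
    uint8_t  bar;
    uint32_t offset;
    uint32_t length;
};

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Scans a 256-byte configuration space snapshot. out[] is indexed by
 * cfg_type (entries 1..5 are used); only the first usable capability of
 * each type is recorded, matching the device's order of preference. */
void virtio_pci_scan_caps(const uint8_t cfg[256], struct found_cap out[6])
{
    uint8_t pos = cfg[PCI_CAPABILITY_LIST];
    int guard = 48;   /* bound the walk in case the list loops */

    for (size_t i = 0; i < 6; i++)
        out[i].present = 0;

    while (pos >= 0x40 && pos <= 256 - 16 && guard-- > 0) {
        const uint8_t *c = &cfg[pos];
        uint8_t type = c[3], bar = c[4];

        /* Skip reserved cfg_type or bar values; accept larger cap_len. */
        if (c[0] == PCI_CAP_ID_VNDR && c[2] >= 16 &&
            type >= VIRTIO_PCI_CAP_COMMON_CFG &&
            type <= VIRTIO_PCI_CAP_PCI_CFG &&
            bar <= 5 && !out[type].present) {
            out[type].present = 1;
            out[type].bar = bar;
            out[type].offset = get_le32(c + 8);
            out[type].length = get_le32(c + 12);
        }
        pos = c[1];   /* cap_next */
    }
}
```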

4.1.4.2 Device Requirements: Virtio Structure PCI Capabilities

The device MUST include any extra data (from the beginning of the cap_vndr field through end of the extra data fields if any) in cap_len. The device MAY append extra data or padding to any structure beyond that.

If the device presents multiple structures of the same type, it SHOULD order them from optimal (first) to least-optimal (last).

4.1.4.3 Common configuration structure layout

The common configuration structure is found at the bar and offset within the VIRTIO_PCI_CAP_COMMON_CFG capability; its layout is below.

struct virtio_pci_common_cfg {
        /* About the whole device. */
        le32 device_feature_select;     /* read-write */
        le32 device_feature;            /* read-only for driver */
        le32 driver_feature_select;     /* read-write */
        le32 driver_feature;            /* read-write */
        le16 msix_config;               /* read-write */
        le16 num_queues;                /* read-only for driver */
        u8 device_status;               /* read-write */
        u8 config_generation;           /* read-only for driver */

        /* About a specific virtqueue. */
        le16 queue_select;              /* read-write */
        le16 queue_size;                /* read-write, power of 2, or 0. */
        le16 queue_msix_vector;         /* read-write */
        le16 queue_enable;              /* read-write */
        le16 queue_notify_off;          /* read-only for driver */
        le64 queue_desc;                /* read-write */
        le64 queue_avail;               /* read-write */
        le64 queue_used;                /* read-write */
};

device_feature_select
        The driver uses this to select which feature bits device_feature shows. Value 0x0 selects Feature Bits 0 to 31, 0x1 selects Feature Bits 32 to 63, etc.

device_feature
        The device uses this to report which feature bits it is offering to the driver: the driver writes to device_feature_select to select which feature bits are presented.

driver_feature_select
        The driver uses this to select which feature bits driver_feature shows. Value 0x0 selects Feature Bits 0 to 31, 0x1 selects Feature Bits 32 to 63, etc.

driver_feature
        The driver writes this to accept feature bits offered by the device. Driver Feature Bits selected by driver_feature_select.

msix_config
        The driver sets the Configuration Vector for MSI-X.

num_queues
        The device specifies the maximum number of virtqueues supported here.

device_status
        The driver writes the device status here (see 2.1). Writing 0 into this field resets the device.

config_generation
        Configuration atomicity value. The device changes this every time the configuration noticeably changes.

queue_select
        Queue Select. The driver selects which virtqueue the following fields refer to.

queue_size
        Queue Size. On reset, specifies the maximum queue size supported by the hypervisor. This can be modified by the driver to reduce memory requirements. A 0 means the queue is unavailable.

queue_msix_vector
        The driver uses this to specify the queue vector for MSI-X.

queue_enable
        The driver uses this to selectively prevent the device from executing requests from this virtqueue. 1 - enabled; 0 - disabled.

queue_notify_off
        The driver reads this to calculate the offset from the start of the Notification structure at which this virtqueue is located. Note: this is not an offset in bytes. See 4.1.4.4 below.

queue_desc
        The driver writes the physical address of the Descriptor Table here. See section 2.4.

queue_avail
        The driver writes the physical address of the Available Ring here. See section 2.4.

queue_used
        The driver writes the physical address of the Used Ring here. See section 2.4.

4.1.4.3.1 Device Requirements: Common configuration structure layout

offset

The device MUST present at least one common configuration capability.

The device MUST present the feature bits it is offering in device_feature, starting at bit device_feature_select ∗ 32 for any device_feature_select written by the driver.

Note: This means that it will present 0 for any device_feature_select other than 0 or 1, since no feature defined here exceeds 63.
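The feature window can be sketched with a mock device side and a driver-side read of all 64 offered bits. The struct and function names are hypothetical stand-ins for real register accesses.

```c
#include <stdint.h>

/* Hypothetical stand-in for the device side of the feature window. */
struct mock_common_cfg {
    uint32_t device_feature_select;
    uint64_t offered;   /* all feature bits the device offers */
};

/* Device side: present the 32 bits starting at select * 32. */
uint32_t mock_read_device_feature(const struct mock_common_cfg *c)
{
    /* Selectors above 1 read as 0: no feature bit here exceeds 63. */
    if (c->device_feature_select > 1)
        return 0;
    return (uint32_t)(c->offered >> (32 * c->device_feature_select));
}

/* Driver side: assemble 64 feature bits from two 32-bit reads. */
uint64_t driver_read_features(struct mock_common_cfg *c)
{
    uint64_t lo, hi;

    c->device_feature_select = 0;
    lo = mock_read_device_feature(c);
    c->device_feature_select = 1;
    hi = mock_read_device_feature(c);
    return lo | (hi << 32);
}
```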

The device MUST present any valid feature bits the driver has written in driver_feature, starting at bit driver_feature_select ∗ 32 for any driver_feature_select written by the driver. Valid feature bits are those which are a subset of the corresponding device_feature bits. The device MAY present invalid bits written by the driver.

Note: This means that a device can ignore writes for feature bits it never offers, and simply present 0 on reads. Or it can just mirror what the driver wrote (but it will still have to check them when the driver sets FEATURES_OK).

Note: A driver shouldn’t write invalid bits anyway, as per 3.1.1, but this attempts to handle it.

The device MUST present a changed config_generation after the driver has read a device-specific configuration value which has changed since any part of the device-specific configuration was last read.

Note: As config_generation is an 8-bit value, simply incrementing it on every configuration change could violate this requirement due to wrap. Better would be to set an internal flag when it has changed, and if that flag is set when the driver reads from the device-specific configuration, increment config_generation and clear the flag.
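The flag-based scheme suggested in the note might be sketched on the device side as follows; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device-side state for the flag-based scheme. */
struct cfg_gen_state {
    uint8_t config_generation;
    bool    changed_since_read;
};

/* Called whenever the device-specific configuration changes. */
void cfg_gen_on_change(struct cfg_gen_state *s)
{
    s->changed_since_read = true;
}

/* Called when the driver reads any device-specific configuration field. */
void cfg_gen_on_driver_read(struct cfg_gen_state *s)
{
    if (s->changed_since_read) {
        s->config_generation++;          /* at most one bump per read */
        s->changed_since_read = false;
    }
}
```

Because the counter advances at most once per driver read, 256 back-to-back configuration changes cannot wrap it to the value the driver last observed.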

The device MUST reset when 0 is written to device_status, and present a 0 in device_status once that is done.

The device MUST present a 0 in queue_enable on reset.

The device MUST present a 0 in queue_size if the virtqueue corresponding to the current queue_select is unavailable.

4.1.4.3.2 Driver Requirements: Common configuration structure layout

device_feature

num_queues

config_generation

queue_notify_off

The driver MUST NOT write a value which is not a power of 2 to queue_size.

The driver MUST configure the other virtqueue fields before enabling the virtqueue with queue_enable.

After writing 0 to device_status, the driver MUST wait for a read of device_status to return 0 before reinitializing the device.

The driver MUST NOT write a 0 to queue_enable.

4.1.4.4 Notification structure layout

The notification location is found using the VIRTIO_PCI_CAP_NOTIFY_CFG capability. This capability is immediately followed by an additional field, like so:

struct virtio_pci_notify_cap {
        struct virtio_pci_cap cap;
        le32 notify_off_multiplier;     /* Multiplier for queue_notify_off. */
};

notify_off_multiplier is combined with the queue_notify_off to derive the Queue Notify address within a BAR for a virtqueue:

cap.offset + queue_notify_off * notify_off_multiplier

The cap.offset and notify_off_multiplier are taken from the notification capability structure above, and the queue_notify_off is taken from the common configuration structure.

Note: For example, if notify_off_multiplier is 0, the device uses the same Queue Notify address for all queues.

4.1.4.4.1 Device Requirements: Notification capability

The cap.offset MUST be 2-byte aligned.

The device MUST either present notify_off_multiplier as an even power of 2, or present notify_off_multiplier as 0.

The value cap.length presented by the device MUST be at least 2 and MUST be large enough to support queue notification offsets for all supported queues in all possible configurations.

For all queues, the value cap.length presented by the device MUST satisfy:

cap.length >= queue_notify_off * notify_off_multiplier + 2
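Both the address derivation and the length constraint are simple arithmetic; a sketch with hypothetical helper names, using 64-bit intermediates so the product cannot overflow:

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte offset within the BAR of the Queue Notify location for a queue. */
uint64_t virtio_notify_offset(uint32_t cap_offset, uint16_t queue_notify_off,
                              uint32_t notify_off_multiplier)
{
    return (uint64_t)cap_offset +
           (uint64_t)queue_notify_off * notify_off_multiplier;
}

/* Checks the cap.length constraint for one queue. */
bool virtio_notify_len_ok(uint32_t cap_length, uint16_t queue_notify_off,
                          uint32_t notify_off_multiplier)
{
    return (uint64_t)cap_length >=
           (uint64_t)queue_notify_off * notify_off_multiplier + 2;
}
```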

4.1.4.5 ISR status capability

The VIRTIO_PCI_CAP_ISR_CFG capability refers to at least a single byte, which contains the 8-bit ISR status field to be used for INT#x interrupt handling.

The offset for the ISR status has no alignment requirements.

The ISR bits allow the driver to distinguish between device-specific configuration change interrupts and normal virtqueue interrupts:

Bits       0                  1                                 2 to 31
Purpose    Queue Interrupt    Device Configuration Interrupt    Reserved

To avoid an extra access, simply reading this register resets it to 0 and causes the device to de-assert the interrupt.

See sections 4.1.5.3 and 4.1.5.4 for how this is used.

4.1.4.5.1 Device Requirements: ISR status capability

The device MUST set the Device Configuration Interrupt bit in ISR status before sending a device configuration change notification to the driver.

If MSI-X capability is disabled, the device MUST set the Queue Interrupt bit in ISR status before sending a virtqueue notification to the driver.

If MSI-X capability is disabled, the device MUST set the Interrupt Status bit in the PCI Status register in the PCI Configuration Header of the device to the logical OR of all bits in ISR status of the device. The device then asserts/deasserts INT#x interrupts unless masked according to standard PCI rules [PCI].

The device MUST reset ISR status to 0 on driver read.
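The read-to-clear behaviour can be sketched with a mock ISR byte and a driver-side INT#x handler. The bit names and all function names are hypothetical; only the bit positions come from the table above.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_ISR_QUEUE  0x1   /* bit 0: Queue Interrupt */
#define VIRTIO_ISR_CONFIG 0x2   /* bit 1: Device Configuration Interrupt */

/* Hypothetical mock of the device's ISR status byte. */
struct mock_isr {
    uint8_t status;
};

/* Device-side read handler: return the value and reset it to 0. */
uint8_t mock_isr_read(struct mock_isr *isr)
{
    uint8_t v = isr->status;

    isr->status = 0;
    return v;
}

/* Driver-side handler sketch. Returns false when no bit was set, for
 * example on a shared interrupt line or a spurious interrupt. */
bool driver_handle_intx(struct mock_isr *isr, bool *queue, bool *config)
{
    uint8_t v = mock_isr_read(isr);   /* single read; also de-asserts */

    *queue  = (v & VIRTIO_ISR_QUEUE) != 0;
    *config = (v & VIRTIO_ISR_CONFIG) != 0;
    return v != 0;
}
```

The handler reads the register exactly once, since a second read would already observe 0.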

4.1.4.5.2 Driver Requirements: ISR status capability

ISR status

4.1.4.6 Device-specific configuration

The device MUST present at least one VIRTIO_PCI_CAP_DEVICE_CFG capability for any device type which has a device-specific configuration.

4.1.4.6.1 Device Requirements: Device-specific configuration

offset

4.1.4.7 PCI configuration access capability

The VIRTIO_PCI_CAP_PCI_CFG capability creates an alternative (and likely suboptimal) access method to the common configuration, notification, ISR and device-specific configuration regions.

The capability is immediately followed by an additional field like so:

struct virtio_pci_cfg_cap {
        struct virtio_pci_cap cap;
        u8 pci_cfg_data[4];     /* Data for BAR access. */
};

The fields cap.bar, cap.length, cap.offset and pci_cfg_data are read-write (RW) for the driver.

To access a device region, the driver writes into the capability structure (ie. within the PCI configuration space) as follows:

The driver sets the BAR to access by writing to cap.bar.

The driver sets the size of the access by writing 1, 2 or 4 to cap.length.

The driver sets the offset within the BAR by writing to cap.offset.

At that point, pci_cfg_data will provide a window of size cap.length into the given cap.bar at offset cap.offset.

4.1.4.7.1 Device Requirements: PCI configuration access capability

Upon detecting driver write access to pci_cfg_data, the device MUST execute a write access at offset cap.offset at BAR selected by cap.bar using the first cap.length bytes from pci_cfg_data.

Upon detecting driver read access to pci_cfg_data, the device MUST execute a read access of length cap.length at offset cap.offset at BAR selected by cap.bar and store the first cap.length bytes in pci_cfg_data.
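A device-side sketch of the window, with one BAR backed by ordinary memory. The mock struct, its size and the function names are hypothetical; the bounds check is a defensive assumption of the mock rather than a requirement stated here.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical mock: one BAR plus the driver-writable capability fields. */
struct mock_cfg_window {
    uint8_t  bar_mem[64];      /* contents of the BAR named by cap.bar */
    uint32_t cap_offset;
    uint32_t cap_length;       /* 1, 2 or 4 */
    uint8_t  pci_cfg_data[4];
};

static int window_valid(const struct mock_cfg_window *w)
{
    uint32_t len = w->cap_length;

    if (len != 1 && len != 2 && len != 4)
        return 0;
    return w->cap_offset <= sizeof w->bar_mem - len;
}

/* Driver wrote pci_cfg_data: forward the first cap.length bytes. */
void mock_on_data_write(struct mock_cfg_window *w)
{
    if (window_valid(w))
        memcpy(&w->bar_mem[w->cap_offset], w->pci_cfg_data, w->cap_length);
}

/* Driver reads pci_cfg_data: fetch cap.length bytes from the BAR first. */
void mock_on_data_read(struct mock_cfg_window *w)
{
    if (window_valid(w))
        memcpy(w->pci_cfg_data, &w->bar_mem[w->cap_offset], w->cap_length);
}
```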

4.1.4.7.2 Driver Requirements: PCI configuration access capability

cap.offset

cap.length

The driver MUST NOT read or write pci_cfg_data unless cap.bar, cap.length and cap.offset address cap.length bytes within a BAR range specified by some other Virtio Structure PCI Capability of type other than VIRTIO_PCI_CAP_PCI_CFG.

4.1.4.8 Legacy Interfaces: A Note on PCI Device Layout

Transitional devices MUST present part of configuration registers in a legacy configuration structure in BAR0 in the first I/O region of the PCI device, as documented below. When using the legacy interface, transitional drivers MUST use the legacy configuration structure in BAR0 in the first I/O region of the PCI device, as documented below.

When using the legacy interface the driver MAY access the device-specific configuration region using any width accesses, and a transitional device MUST present the driver with the same results as when accessed using the “natural” access method (i.e. 32-bit accesses for 32-bit fields, etc.).

Note that this is possible because while the virtio common configuration structure is PCI (i.e. little) endian, when using the legacy interface the device-specific configuration region is encoded in the native endian of the guest (where such distinction is applicable).

When used through the legacy interface, the virtio common configuration structure looks as follows:

Bits    Read / Write    Purpose
32      R               Device Features bits 0:31
32      R+W             Driver Features bits 0:31
32      R+W             Queue Address
16      R               queue_size
16      R+W             queue_select
16      R+W             Queue Notify
8       R+W             Device Status
8       R               ISR Status

If MSI-X is enabled for the device, two additional fields immediately follow this header:

Bits    Read / Write    Purpose (MSI-X)
16      R+W             config_msix_vector
16      R+W             queue_msix_vector

Note: When MSI-X capability is enabled, device-specific configuration starts at byte offset 24 in the virtio common configuration structure. When MSI-X capability is not enabled, device-specific configuration starts at byte offset 20 in the virtio header, i.e. once you enable MSI-X on the device, the other fields move. If you turn it off again, they move back!
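To make the moving offsets concrete, the following sketch lists byte offsets for the legacy header. The offsets are derived here by summing the field widths of the tables above; they agree with the offsets 18, 20 and 24 that this section states explicitly. The enumerator and function names are illustrative assumptions, not identifiers defined by this specification.

```c
#include <stdbool.h>
#include <stddef.h>

/* Byte offsets within BAR0, derived from the field widths listed above. */
enum legacy_virtio_pci_offset {
    LEGACY_DEVICE_FEATURES    = 0,  /* 32 bits, R   */
    LEGACY_DRIVER_FEATURES    = 4,  /* 32 bits, R+W */
    LEGACY_QUEUE_ADDRESS      = 8,  /* 32 bits, R+W */
    LEGACY_QUEUE_SIZE         = 12, /* 16 bits, R   */
    LEGACY_QUEUE_SELECT       = 14, /* 16 bits, R+W */
    LEGACY_QUEUE_NOTIFY       = 16, /* 16 bits, R+W */
    LEGACY_DEVICE_STATUS      = 18, /*  8 bits, R+W */
    LEGACY_ISR_STATUS         = 19, /*  8 bits, R   */
    LEGACY_CONFIG_MSIX_VECTOR = 20, /* 16 bits, R+W, only with MSI-X enabled */
    LEGACY_QUEUE_MSIX_VECTOR  = 22  /* 16 bits, R+W, only with MSI-X enabled */
};

/* Start of the device-specific configuration, which depends on MSI-X. */
static size_t legacy_device_config_offset(bool msix_enabled)
{
    return msix_enabled ? 24 : 20;
}
```

Computing the offset at access time, rather than caching it, keeps a driver correct if MSI-X is enabled or disabled after initialization.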

Any device-specific configuration space immediately follows these general headers:

Bits               Read / Write       Purpose
Device Specific    Device Specific    Device Specific
…

When accessing the device-specific configuration space using the legacy interface, transitional drivers MUST access the device-specific configuration space at an offset immediately following the general headers.

When using the legacy interface, transitional devices MUST present the device-specific configuration space if any at an offset immediately following the general headers.

Note that only Feature Bits 0 to 31 are accessible through the Legacy Interface. When used through the Legacy Interface, Transitional Devices MUST assume that Feature Bits 32 to 63 are not acknowledged by Driver.

As legacy devices had no config_generation field, see 2.3.4 Legacy Interface: Device Configuration Space for workarounds.

4.1.4.9 Non-transitional Device With Legacy Driver: A Note on PCI Device Layout

All known legacy drivers check either the PCI Revision or the Device and Vendor IDs, and thus won’t attempt to drive a non-transitional device.

A buggy legacy driver might mistakenly attempt to drive a non-transitional device. If support for such drivers is required (as opposed to fixing the bug), the following would be the recommended way to detect and handle them. Note: Such buggy drivers are not currently known to be used in production.

4.1.4.9.0.1 Device Requirements: Non-transitional Device With Legacy Driver

Present an I/O BAR in BAR0, and respond to a single-byte zero write to offset 18 (corresponding to Device Status register in the legacy layout) of BAR0 by presenting zeroes on every BAR and ignoring writes.

4.1.5 PCI-specific Initialization And Device Operation

4.1.5.1 Device Initialization

This documents PCI-specific steps executed during Device Initialization.

4.1.5.1.1 Virtio Device Configuration Layout Detection

4.1.5.1.1.1 Legacy Interface: A Note on Device Layout Detection

Legacy devices did not have the Virtio PCI Capability in their capability list.

Therefore:

Transitional devices MUST expose the Legacy Interface in I/O space in BAR0.

Transitional drivers MUST look for the Virtio PCI Capabilities on the capability list. If these are not present, driver MUST assume a legacy device, and use it through the legacy interface.

Non-transitional drivers MUST look for the Virtio PCI Capabilities on the capability list. If these are not present, driver MUST assume a legacy device, and fail gracefully.

4.1.5.1.2 MSI-X Vector Configuration

config_msix_vector

queue_msix_vector

Writing a valid MSI-X Table entry number, 0 to 0x7FF, to config_msix_vector/queue_msix_vector maps interrupts triggered by the configuration change/selected queue events respectively to the corresponding MSI-X vector. To disable interrupts for an event type, the driver unmaps this event by writing a special NO_VECTOR value:

/* Vector value used to disable MSI for queue */
#define VIRTIO_MSI_NO_VECTOR 0xffff

Note that mapping an event to a vector might require the device to allocate internal device resources, and thus could fail.

4.1.5.1.2.1 Device Requirements: MSI-X Vector Configuration

Table

Size

Note:

Device MUST support mapping any event type to any valid vector 0 to MSI-X Table Size. Device MUST support unmapping any event type.

The device MUST return the vector mapped to a given event (NO_VECTOR if unmapped) on read of config_msix_vector/queue_msix_vector. The device MUST have all queue and configuration change events unmapped upon reset.

Devices SHOULD NOT cause mapping an event to vector to fail unless it is impossible for the device to satisfy the mapping request. Devices MUST report mapping failures by returning the NO_VECTOR value when the relevant config_msix_vector/queue_msix_vector field is read.

4.1.5.1.2.2 Driver Requirements: MSI-X Vector Configuration

Driver MAY interpret the Table Size as a hint from the device for the suggested number of MSI-X vectors to use.

Driver MUST NOT attempt to map an event to a vector outside the MSI-X Table supported by the device, as reported by Table Size in the MSI-X Capability.

After mapping an event to vector, the driver MUST verify success by reading the Vector field value: on success, the previously written value is returned, and on failure, NO_VECTOR is returned. If a mapping failure is detected, the driver MAY retry mapping with fewer vectors, disable MSI-X or report device failure.
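The write-then-verify step can be sketched as below. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, vector_reg is assumed to point at an already mapped config_msix_vector or queue_msix_vector field, and the sketch assumes a little-endian host (or an accessor layer that handles byte order), which real drivers cannot take for granted.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_MSI_NO_VECTOR 0xffff

/* Mapping succeeded only if the device returns the value that was written. */
static bool virtio_msix_map_ok(uint16_t requested, uint16_t read_back)
{
    return requested != VIRTIO_MSI_NO_VECTOR && read_back == requested;
}

/*
 * Hypothetical helper: writes the vector, then reads the field back to
 * detect a mapping failure reported by the device as NO_VECTOR.
 */
static bool virtio_msix_map_event(volatile uint16_t *vector_reg, uint16_t vector)
{
    *vector_reg = vector;
    return virtio_msix_map_ok(vector, *vector_reg);
}
```

On a false return the driver can, as described above, retry with fewer vectors, disable MSI-X or report device failure.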

4.1.5.1.3 Virtqueue Configuration

The driver typically does this as follows, for each virtqueue a device has:

1. Write the virtqueue index (first queue is 0) to queue_select.

2. Read the virtqueue size from queue_size. This controls how big the virtqueue is (see 2.4 Virtqueues). If this field is 0, the virtqueue does not exist.

3. Optionally, select a smaller virtqueue size and write it to queue_size.

4. Allocate and zero Descriptor Table, Available and Used rings for the virtqueue in contiguous physical memory.

5. Optionally, if MSI-X capability is present and enabled on the device, select a vector to use to request interrupts triggered by virtqueue events. Write the MSI-X Table entry number corresponding to this vector into queue_msix_vector. Read queue_msix_vector: on success, previously written value is returned; on failure, NO_VECTOR value is returned.

4.1.5.1.3.1 Legacy Interface: A Note on Virtqueue Configuration

4.1.5.2 Notifying The Device

The driver notifies the device by writing the 16-bit virtqueue index of this virtqueue to the Queue Notify address. See 4.1.4.4 for how to calculate this address.

4.1.5.3 Virtqueue Interrupts From The Device

If an interrupt is necessary for a virtqueue, the device would typically act as follows:

1. If MSI-X capability is disabled:
   (a) Set the lower bit of the ISR Status field for the device.
   (b) Send the appropriate PCI interrupt for the device.

2. If MSI-X capability is enabled:
   (a) If queue_msix_vector is not NO_VECTOR, request the appropriate MSI-X interrupt message for the device; queue_msix_vector sets the MSI-X Table entry number.



4.1.5.3.1 Device Requirements: Virtqueue Interrupts From The Device

queue_msix_vector

4.1.5.4 Notification of Device Configuration Changes

Some virtio PCI devices can change the device configuration state, as reflected in the device-specific configuration region of the device. In this case:

1. If MSI-X capability is disabled:
   (a) Set the second lower bit of the ISR Status field for the device.
   (b) Send the appropriate PCI interrupt for the device.

2. If MSI-X capability is enabled:
   (a) If config_msix_vector is not NO_VECTOR, request the appropriate MSI-X interrupt message for the device; config_msix_vector sets the MSI-X Table entry number.



A single interrupt MAY indicate both that one or more virtqueue has been used and that the configuration space has changed.

4.1.5.4.1 Device Requirements: Notification of Device Configuration Changes

config_msix_vector

4.1.5.4.2 Driver Requirements: Notification of Device Configuration Changes

4.1.5.5 Driver Handling Interrupts

The driver interrupt handler would typically:

1. If MSI-X capability is disabled:
   (a) Read the ISR Status field, which will reset it to zero.
   (b) If the lower bit is set: look through the used rings of all virtqueues for the device, to see if any progress has been made by the device which requires servicing.
   (c) If the second lower bit is set: re-examine the configuration space to see what changed.

2. If MSI-X capability is enabled:
   (a) Look through the used rings of all virtqueues mapped to that MSI-X vector for the device, to see if any progress has been made by the device which requires servicing.
   (b) If the MSI-X vector is equal to config_msix_vector, re-examine the configuration space to see what changed.
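For the case where MSI-X is disabled, the decoding of the ISR Status value can be sketched as follows. The macro, type and function names are illustrative assumptions; only the meaning of the two low bits is taken from the text above. The caller is assumed to have read the ISR Status field exactly once, since that read resets it to zero.

```c
#include <stdbool.h>
#include <stdint.h>

#define ISR_QUEUE_INTERRUPT  0x1u /* the "lower bit": virtqueue progress */
#define ISR_CONFIG_INTERRUPT 0x2u /* the "second lower bit": config change */

/* What the interrupt handler should do for a given ISR Status value. */
struct isr_actions {
    bool scan_used_rings; /* look through the used rings of all virtqueues */
    bool reread_config;   /* re-examine the configuration space */
};

static struct isr_actions decode_isr_status(uint8_t isr)
{
    struct isr_actions actions;

    actions.scan_used_rings = (isr & ISR_QUEUE_INTERRUPT) != 0;
    actions.reread_config = (isr & ISR_CONFIG_INTERRUPT) != 0;
    return actions;
}
```

Both flags can be set at once, matching the statement that a single interrupt may indicate both kinds of event.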



4.2 Virtio Over MMIO

Virtual environments without PCI support (a common situation in embedded device models) might use a simple memory mapped device (“virtio-mmio”) instead of the PCI device.

The memory mapped virtio device behaviour is based on the PCI device specification. Therefore most operations including device initialization, queue configuration and buffer transfers are nearly identical. Existing differences are described in the following sections.

4.2.1 MMIO Device Discovery

Unlike PCI, MMIO provides no generic device discovery mechanism. For each device, the guest OS will need to know the location of the registers and interrupt(s) used. The suggested binding for systems using flattened device trees is shown in this example:

// EXAMPLE: virtio_block device taking 512 bytes at 0x1e000, interrupt 42.
virtio_block@1e000 {
        compatible = "virtio,mmio";
        reg = <0x1e000 0x200>;
        interrupts = <42>;
}

4.2.2 MMIO Device Register Layout

MMIO virtio devices provide a set of memory mapped control registers followed by a device-specific configuration space, described in the table 4.1.

All register values are organized as Little Endian.

Name (Offset from base, Direction): Function
        Description

MagicValue (0x000, R): Magic value
        0x74726976 (a Little Endian equivalent of the “virt” string).

Version (0x004, R): Device version number
        0x2. Note: Legacy devices (see 4.2.4 Legacy interface) used 0x1.

DeviceID (0x008, R): Virtio Subsystem Device ID
        See 5 Device Types for possible values. Value zero (0x0) is used to define a system memory map with placeholder devices at static, well known addresses, assigning functions to them depending on user’s needs.

VendorID (0x00c, R): Virtio Subsystem Vendor ID

DeviceFeatures (0x010, R): Flags representing features the device supports
        Reading from this register returns 32 consecutive flag bits, the least significant bit depending on the last value written to DeviceFeaturesSel. Access to this register returns bits DeviceFeaturesSel ∗ 32 to (DeviceFeaturesSel ∗ 32) + 31, e.g. feature bits 0 to 31 if DeviceFeaturesSel is set to 0 and features bits 32 to 63 if DeviceFeaturesSel is set to 1. Also see 2.2 Feature Bits.

DeviceFeaturesSel (0x014, W): Device (host) features word selection.
        Writing to this register selects a set of 32 device feature bits accessible by reading from DeviceFeatures.

DriverFeatures (0x020, W): Flags representing device features understood and activated by the driver
        Writing to this register sets 32 consecutive flag bits, the least significant bit depending on the last value written to DriverFeaturesSel. Access to this register sets bits DriverFeaturesSel ∗ 32 to (DriverFeaturesSel ∗ 32) + 31, e.g. feature bits 0 to 31 if DriverFeaturesSel is set to 0 and features bits 32 to 63 if DriverFeaturesSel is set to 1. Also see 2.2 Feature Bits.

DriverFeaturesSel (0x024, W): Activated (guest) features word selection
        Writing to this register selects a set of 32 activated feature bits accessible by writing to DriverFeatures.

QueueSel (0x030, W): Virtual queue index
        Writing to this register selects the virtual queue that the following operations on QueueNumMax, QueueNum, QueueReady, QueueDescLow, QueueDescHigh, QueueAvailLow, QueueAvailHigh, QueueUsedLow and QueueUsedHigh apply to. The index number of the first queue is zero (0x0).

QueueNumMax (0x034, R): Maximum virtual queue size
        Reading from the register returns the maximum size (number of elements) of the queue the device is ready to process or zero (0x0) if the queue is not available. This applies to the queue selected by writing to QueueSel.

QueueNum (0x038, W): Virtual queue size
        Queue size is the number of elements in the queue, therefore in each of the Descriptor Table, the Available Ring and the Used Ring. Writing to this register notifies the device what size of the queue the driver will use. This applies to the queue selected by writing to QueueSel.

QueueReady (0x044, RW): Virtual queue ready bit
        Writing one (0x1) to this register notifies the device that it can execute requests from this virtual queue. Reading from this register returns the last value written to it. Both read and write accesses apply to the queue selected by writing to QueueSel.

QueueNotify (0x050, W): Queue notifier
        Writing a queue index to this register notifies the device that there are new buffers to process in the queue.

InterruptStatus (0x060, R): Interrupt status
        Reading from this register returns a bit mask of events that caused the device interrupt to be asserted. The following events are possible:
        Used Ring Update - bit 0 - the interrupt was asserted because the device has updated the Used Ring in at least one of the active virtual queues.
        Configuration Change - bit 1 - the interrupt was asserted because the configuration of the device has changed.

InterruptACK (0x064, W): Interrupt acknowledge
        Writing a value with bits set as defined in InterruptStatus to this register notifies the device that events causing the interrupt have been handled.

Status (0x070, RW): Device status
        Reading from this register returns the current device status flags. Writing non-zero values to this register sets the status flags, indicating the driver progress. Writing zero (0x0) to this register triggers a device reset. See also p. 4.2.3.1 Device Initialization.

QueueDescLow (0x080), QueueDescHigh (0x084), W: Virtual queue’s Descriptor Table 64 bit long physical address
        Writing to these two registers (lower 32 bits of the address to QueueDescLow, higher 32 bits to QueueDescHigh) notifies the device about location of the Descriptor Table of the queue selected by writing to QueueSel register.

QueueAvailLow (0x090), QueueAvailHigh (0x094), W: Virtual queue’s Available Ring 64 bit long physical address
        Writing to these two registers (lower 32 bits of the address to QueueAvailLow, higher 32 bits to QueueAvailHigh) notifies the device about location of the Available Ring of the queue selected by writing to QueueSel.

QueueUsedLow (0x0a0), QueueUsedHigh (0x0a4), W: Virtual queue’s Used Ring 64 bit long physical address
        Writing to these two registers (lower 32 bits of the address to QueueUsedLow, higher 32 bits to QueueUsedHigh) notifies the device about location of the Used Ring of the queue selected by writing to QueueSel.

ConfigGeneration (0x0fc, R): Configuration atomicity value
        Reading from this register returns a value describing a version of the device-specific configuration space (see Config). The driver can then access the configuration space and, when finished, read ConfigGeneration again. If no part of the configuration space has changed between these two ConfigGeneration reads, the returned values are identical. If the values are different, the configuration space accesses were not atomic and the driver has to perform the operations again. See also 2.3.

Config (0x100+, RW): Configuration space
        Device-specific configuration space starts at the offset 0x100 and is accessed with byte alignment. Its meaning and size depend on the device and the driver.
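The register offsets from the table above, together with the select-then-read pattern for the feature words, can be sketched as follows. The enumerator and function names are illustrative assumptions and not defined by this specification; base is assumed to point at the already mapped register block, and the sketch assumes a little-endian host, whereas a portable driver would convert from the Little Endian register format.

```c
#include <stdint.h>

/* Offsets from the device base address, as listed in the table above. */
enum virtio_mmio_reg {
    VIRTIO_MMIO_MAGIC_VALUE         = 0x000,
    VIRTIO_MMIO_VERSION             = 0x004,
    VIRTIO_MMIO_DEVICE_ID           = 0x008,
    VIRTIO_MMIO_VENDOR_ID           = 0x00c,
    VIRTIO_MMIO_DEVICE_FEATURES     = 0x010,
    VIRTIO_MMIO_DEVICE_FEATURES_SEL = 0x014,
    VIRTIO_MMIO_DRIVER_FEATURES     = 0x020,
    VIRTIO_MMIO_DRIVER_FEATURES_SEL = 0x024,
    VIRTIO_MMIO_INTERRUPT_STATUS    = 0x060,
    VIRTIO_MMIO_INTERRUPT_ACK       = 0x064,
    VIRTIO_MMIO_STATUS              = 0x070,
    VIRTIO_MMIO_CONFIG_GENERATION   = 0x0fc,
    VIRTIO_MMIO_CONFIG              = 0x100
};

/*
 * Hypothetical helper: reads feature bits 0 to 63 by selecting each 32-bit
 * word through DeviceFeaturesSel before reading DeviceFeatures.
 */
static uint64_t virtio_mmio_read_device_features(volatile uint32_t *base)
{
    uint64_t features;

    base[VIRTIO_MMIO_DEVICE_FEATURES_SEL / 4] = 0;            /* bits 0..31  */
    features = base[VIRTIO_MMIO_DEVICE_FEATURES / 4];

    base[VIRTIO_MMIO_DEVICE_FEATURES_SEL / 4] = 1;            /* bits 32..63 */
    features |= (uint64_t)base[VIRTIO_MMIO_DEVICE_FEATURES / 4] << 32;

    return features;
}
```

All accesses here are 32 bits wide and aligned, in line with the access rule for control registers in the driver requirements of this section.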

4.2.2.1 Device Requirements: MMIO Device Register Layout

The device MUST return 0x74726976 in MagicValue.

The device MUST return value 0x2 in Version.

The device MUST present each event by setting the corresponding bit in InterruptStatus from the moment it takes place, until the driver acknowledges the interrupt by writing a corresponding bit mask to the InterruptACK register. Bits which do not represent events which took place MUST be zero.

Upon reset, the device MUST clear all bits in InterruptStatus and ready bits in the QueueReady register for all queues in the device.

The device MUST change value returned in ConfigGeneration if there is any risk of a driver seeing an inconsistent configuration state.

The device MUST NOT access virtual queue contents when QueueReady is zero (0x0).

4.2.2.2 Driver Requirements: MMIO Device Register Layout

The driver MUST NOT access memory locations not described in the table 4.1 (or, in case of the configuration space, described in the device specification), MUST NOT write to the read-only registers (direction R) and MUST NOT read from the write-only registers (direction W).

The driver MUST only use 32 bit wide and aligned reads and writes to access the control registers described in table 4.1. For the device-specific configuration space, the driver MUST use 8 bit wide accesses for 8 bit wide fields, 16 bit wide and aligned accesses for 16 bit wide fields and 32 bit wide and aligned accesses for 32 and 64 bit wide fields.

The driver MUST ignore a device with MagicValue which is not 0x74726976, although it MAY report an error.

The driver MUST ignore a device with Version which is not 0x2, although it MAY report an error.

The driver MUST ignore a device with DeviceID 0x0, but MUST NOT report any error.

Before reading from DeviceFeatures, the driver MUST write a value to DeviceFeaturesSel.

Before writing to the DriverFeatures register, the driver MUST write a value to the DriverFeaturesSel register.

The driver MUST write a value to QueueNum which is less than or equal to the value presented by the device in QueueNumMax.

When QueueReady is not zero, the driver MUST NOT access QueueNum, QueueDescLow, QueueDescHigh, QueueAvailLow, QueueAvailHigh, QueueUsedLow, QueueUsedHigh.

To stop using the queue the driver MUST write zero (0x0) to QueueReady and MUST read the value back to ensure synchronization.

The driver MUST ignore undefined bits in InterruptStatus.

The driver MUST write a value with a bit mask describing events it handled into InterruptACK when it finishes handling an interrupt and MUST NOT set any of the undefined bits in the value.
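The identification rules above (MagicValue, Version and DeviceID) can be sketched as a small decision function. The enum and function names are hypothetical; the three arguments are assumed to be the values the driver has already read from the corresponding registers.

```c
#include <stdint.h>

enum virtio_mmio_probe_result {
    VIRTIO_MMIO_PROBE_OK,             /* continue with initialization */
    VIRTIO_MMIO_PROBE_IGNORE,         /* ignore; an error MAY be reported */
    VIRTIO_MMIO_PROBE_IGNORE_SILENTLY /* ignore; no error must be reported */
};

/* Hypothetical helper applying the driver requirements listed above. */
static enum virtio_mmio_probe_result
virtio_mmio_probe(uint32_t magic, uint32_t version, uint32_t device_id)
{
    if (magic != 0x74726976u)   /* "virt" in Little Endian */
        return VIRTIO_MMIO_PROBE_IGNORE;
    if (version != 0x2)         /* 0x1 would indicate a legacy device */
        return VIRTIO_MMIO_PROBE_IGNORE;
    if (device_id == 0x0)       /* placeholder device */
        return VIRTIO_MMIO_PROBE_IGNORE_SILENTLY;
    return VIRTIO_MMIO_PROBE_OK;
}
```

A transitional driver would presumably treat Version 0x1 differently (see 4.2.4 Legacy interface); this sketch covers only the non-legacy rules.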

4.2.3 MMIO-specific Initialization And Device Operation

4.2.3.1 Device Initialization

4.2.3.1.1 Driver Requirements: Device Initialization

MagicValue

Version

DeviceID

Further initialization MUST follow the procedure described in 3.1 Device Initialization.

4.2.3.2 Virtqueue Configuration

The driver will typically initialize the virtual queue in the following way:

1. Select the queue writing its index (first queue is 0) to QueueSel.

2. Check if the queue is not already in use: read QueueReady, and expect a returned value of zero (0x0).

3. Read maximum queue size (number of elements) from QueueNumMax. If the returned value is zero (0x0) the queue is not available.

4. Allocate and zero the queue pages, making sure the memory is physically contiguous. It is recommended to align the Used Ring to an optimal boundary (usually the page size).

5. Notify the device about the queue size by writing the size to QueueNum.

6. Write physical addresses of the queue’s Descriptor Table, Available Ring and Used Ring to (respectively) the QueueDescLow / QueueDescHigh, QueueAvailLow / QueueAvailHigh and QueueUsedLow / QueueUsedHigh register pairs.

7. Write 0x1 to QueueReady.
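The sequence above can be sketched in C roughly as follows. This is an illustration under several assumptions: the macro and function names are hypothetical, base points at the mapped register block, memory allocation is left to the caller (who passes the three physical addresses), and the host is little-endian. Clamping the requested size to QueueNumMax is one possible policy, not something the specification prescribes.

```c
#include <stdint.h>

/* Register offsets taken from the MMIO register table. */
#define MMIO_QUEUE_SEL        0x030
#define MMIO_QUEUE_NUM_MAX    0x034
#define MMIO_QUEUE_NUM        0x038
#define MMIO_QUEUE_READY      0x044
#define MMIO_QUEUE_DESC_LOW   0x080
#define MMIO_QUEUE_DESC_HIGH  0x084
#define MMIO_QUEUE_AVAIL_LOW  0x090
#define MMIO_QUEUE_AVAIL_HIGH 0x094
#define MMIO_QUEUE_USED_LOW   0x0a0
#define MMIO_QUEUE_USED_HIGH  0x0a4

/*
 * Hypothetical helper. Returns the queue size in use, -1 if the queue is
 * already in use, or -2 if the queue is not available.
 */
static int virtio_mmio_setup_queue(volatile uint32_t *base, uint32_t index,
                                   uint32_t wanted_size, uint64_t desc,
                                   uint64_t avail, uint64_t used)
{
    uint32_t max_size;

    base[MMIO_QUEUE_SEL / 4] = index;                 /* step 1 */

    if (base[MMIO_QUEUE_READY / 4] != 0)              /* step 2 */
        return -1;

    max_size = base[MMIO_QUEUE_NUM_MAX / 4];          /* step 3 */
    if (max_size == 0)
        return -2;
    if (wanted_size > max_size)
        wanted_size = max_size;

    /* Step 4 (allocating and zeroing the rings) is done by the caller. */

    base[MMIO_QUEUE_NUM / 4] = wanted_size;           /* step 5 */

    base[MMIO_QUEUE_DESC_LOW / 4] = (uint32_t)desc;   /* step 6 */
    base[MMIO_QUEUE_DESC_HIGH / 4] = (uint32_t)(desc >> 32);
    base[MMIO_QUEUE_AVAIL_LOW / 4] = (uint32_t)avail;
    base[MMIO_QUEUE_AVAIL_HIGH / 4] = (uint32_t)(avail >> 32);
    base[MMIO_QUEUE_USED_LOW / 4] = (uint32_t)used;
    base[MMIO_QUEUE_USED_HIGH / 4] = (uint32_t)(used >> 32);

    base[MMIO_QUEUE_READY / 4] = 1;                   /* step 7 */
    return (int)wanted_size;
}
```

Writing QueueReady last matters because the driver must not access the size and address registers once QueueReady is non-zero.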

4.2.3.3 Notifying The Device

The driver notifies the device about new buffers being available in a queue by writing the index of the updated queue to QueueNotify.

4.2.3.4 Notifications From The Device

The memory mapped virtio device uses a single, dedicated interrupt signal, which is asserted when at least one of the bits described in the description of InterruptStatus is set. This is how the device notifies the driver about a new used buffer being available in the queue or about a change in the device configuration.

4.2.3.4.1 Driver Requirements: Notifications From The Device

InterruptStatus

4.2.4 Legacy interface

The legacy MMIO transport used page-based addressing, resulting in a slightly different control register layout, device initialization and virtual queue configuration procedure.

Table 4.2 presents the control register layout, omitting descriptions of registers which did not change their function or behaviour:

Name (Offset from base, Direction): Function
        Description

MagicValue (0x000, R): Magic value

Version (0x004, R): Device version number
        Legacy device returns value 0x1.

DeviceID (0x008, R): Virtio Subsystem Device ID

VendorID (0x00c, R): Virtio Subsystem Vendor ID

HostFeatures (0x010, R): Flags representing features the device supports

HostFeaturesSel (0x014, W): Device (host) features word selection.

GuestFeatures (0x020, W): Flags representing device features understood and activated by the driver

GuestFeaturesSel (0x024, W): Activated (guest) features word selection

GuestPageSize (0x028, W): Guest page size
        The driver writes the guest page size in bytes to the register during initialization, before any queues are used. This value should be a power of 2 and is used by the device to calculate the Guest address of the first queue page (see QueuePFN).

QueueSel (0x030, W): Virtual queue index
        Writing to this register selects the virtual queue that the following operations on the QueueNumMax, QueueNum, QueueAlign and QueuePFN registers apply to. The index number of the first queue is zero (0x0).

QueueNumMax (0x034, R): Maximum virtual queue size
        Reading from the register returns the maximum size of the queue the device is ready to process or zero (0x0) if the queue is not available. This applies to the queue selected by writing to QueueSel and is allowed only when QueuePFN is set to zero (0x0), so when the queue is not actively used.

QueueNum (0x038, W): Virtual queue size
        Queue size is the number of elements in the queue, therefore size of the descriptor table and both available and used rings. Writing to this register notifies the device what size of the queue the driver will use. This applies to the queue selected by writing to QueueSel.

QueueAlign (0x03c, W): Used Ring alignment in the virtual queue
        Writing to this register notifies the device about alignment boundary of the Used Ring in bytes. This value should be a power of 2 and applies to the queue selected by writing to QueueSel.

QueuePFN (0x040, RW): Guest physical page number of the virtual queue
        Writing to this register notifies the device about location of the virtual queue in the Guest’s physical address space. This value is the index number of a page starting with the queue Descriptor Table. Value zero (0x0) means physical address zero (0x00000000) and is illegal. When the driver stops using the queue it writes zero (0x0) to this register. Reading from this register returns the currently used page number of the queue, therefore a value other than zero (0x0) means that the queue is in use. Both read and write accesses apply to the queue selected by writing to QueueSel.

QueueNotify (0x050, W): Queue notifier

InterruptStatus (0x060, R): Interrupt status

InterruptACK (0x064, W): Interrupt acknowledge

Status (0x070, RW): Device status
        Reading from this register returns the current device status flags. Writing non-zero values to this register sets the status flags, indicating the OS/driver progress. Writing zero (0x0) to this register triggers a device reset. The device sets QueuePFN to zero (0x0) for all queues in the device. Also see 3.1 Device Initialization.

Config (0x100+, RW): Configuration space



The virtual queue page size is defined by the value the guest writes to GuestPageSize. The driver does this before the virtual queues are configured.

The virtual queue layout follows p. 2.4.2 Legacy Interfaces: A Note on Virtqueue Layout, with the alignment defined in QueueAlign.

The virtual queue is configured as follows:

1. Select the queue writing its index (first queue is 0) to QueueSel.

2. Check if the queue is not already in use: read QueuePFN, expecting a returned value of zero (0x0).

3. Read maximum queue size (number of elements) from QueueNumMax. If the returned value is zero (0x0) the queue is not available.

4. Allocate and zero the queue pages in contiguous virtual memory, aligning the Used Ring to an optimal boundary (usually page size). The driver should choose a queue size smaller than or equal to QueueNumMax.

5. Notify the device about the queue size by writing the size to QueueNum.

6. Notify the device about the used alignment by writing its value in bytes to QueueAlign.

7. Write the physical number of the first page of the queue to the QueuePFN register.
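The relation between the queue's guest physical address, GuestPageSize and the value written to QueuePFN can be sketched as follows. The function names are hypothetical, and treating a non-power-of-2 page size or an unaligned queue address as an error is an assumption of this sketch; it returns zero in those cases because zero is described above as an illegal QueuePFN value.

```c
#include <stdint.h>

/*
 * Hypothetical helper: page number to write to QueuePFN for a queue that
 * starts at queue_phys_addr. Returns 0 (an illegal PFN) on bad input.
 */
static uint32_t legacy_queue_pfn(uint64_t queue_phys_addr, uint32_t guest_page_size)
{
    uint64_t pfn;

    /* GuestPageSize should be a non-zero power of 2. */
    if (guest_page_size == 0 || (guest_page_size & (guest_page_size - 1)) != 0)
        return 0;

    /* The queue is assumed to start on a page boundary. */
    if (queue_phys_addr % guest_page_size != 0)
        return 0;

    pfn = queue_phys_addr / guest_page_size;
    if (pfn > UINT32_MAX)   /* QueuePFN is a 32-bit register */
        return 0;
    return (uint32_t)pfn;
}

/* Inverse calculation, as performed on the device side. */
static uint64_t legacy_queue_addr(uint32_t pfn, uint32_t guest_page_size)
{
    return (uint64_t)pfn * guest_page_size;
}
```

With a 4096-byte page size, the 32-bit page number limits the queue to the first 16 TiB of guest physical memory, which is one consequence of page-based addressing.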

Notification mechanisms did not change.

4.3 Virtio Over Channel I/O

S/390 based virtual machines support neither PCI nor MMIO, so a different transport is needed there.

virtio-ccw uses the standard channel I/O based mechanism used for the majority of devices on S/390. A virtual channel device with a special control unit type acts as proxy to the virtio device (similar to the way virtio-pci uses a PCI device) and configuration and operation of the virtio device is accomplished (mostly) via channel commands. This means virtio devices are discoverable via standard operating system algorithms, and adding virtio support is mainly a question of supporting a new control unit type.

As the S/390 is a big endian machine, the data structures transmitted via channel commands are big-endian: this is made clear by use of the types be16, be32 and be64.

4.3.1 Basic Concepts

As a proxy device, virtio-ccw uses a channel-attached I/O control unit with a special control unit type (0x3832) and a control unit model corresponding to the attached virtio device’s subsystem device ID, accessed via a virtual I/O subchannel and a virtual channel path of type 0x32. This proxy device is discoverable via normal channel subsystem device discovery (usually a STORE SUBCHANNEL loop) and answers to the basic channel commands:

NO-OPERATION (0x03)

BASIC SENSE (0x04)

TRANSFER IN CHANNEL (0x08)

SENSE ID (0xe4)

For a virtio-ccw proxy device, SENSE ID will return the following information:

Bytes    Description              Contents
0        reserved                 0xff
1-2      control unit type        0x3832
3        control unit model
4-5      device type              zeroes (unset)
6        device model             zeroes (unset)
7-255    extended SenseId data    zeroes (unset)
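A driver-side check of the SENSE ID response might look like the sketch below. The macro and function names are hypothetical; the byte positions come from the table above, and the control unit type is read as a big-endian 16-bit value. The sketch only inspects the control unit type and model and deliberately ignores the device type and device model bytes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_CCW_CU_TYPE 0x3832

/*
 * Hypothetical helper: returns true if the SENSE ID data identifies a
 * virtio-ccw proxy device and stores the control unit model in *cu_model.
 */
static bool virtio_ccw_match_sense_id(const uint8_t *data, size_t len,
                                      uint8_t *cu_model)
{
    uint16_t cu_type;

    if (len < 4)
        return false;

    /* Bytes 1-2: control unit type, big-endian. */
    cu_type = (uint16_t)((data[1] << 8) | data[2]);
    if (cu_type != VIRTIO_CCW_CU_TYPE)
        return false;

    /* Byte 3: control unit model. */
    *cu_model = data[3];
    return true;
}
```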

In addition to the basic channel commands, virtio-ccw defines a set of channel commands related to configuration and operation of virtio:

#define CCW_CMD_SET_VQ 0x13
#define CCW_CMD_VDEV_RESET 0x33
#define CCW_CMD_SET_IND 0x43
#define CCW_CMD_SET_CONF_IND 0x53
#define CCW_CMD_SET_IND_ADAPTER 0x73
#define CCW_CMD_READ_FEAT 0x12
#define CCW_CMD_WRITE_FEAT 0x11
#define CCW_CMD_READ_CONF 0x22
#define CCW_CMD_WRITE_CONF 0x21
#define CCW_CMD_WRITE_STATUS 0x31
#define CCW_CMD_READ_VQ_CONF 0x32
#define CCW_CMD_SET_VIRTIO_REV 0x83

4.3.1.1 Device Requirements: Basic Concepts

The virtio-ccw device acts like a normal channel device, as specified in [S390 PoP] and [S390 Common I/O]. In particular:

A device MUST post a unit check with command reject for any command it does not support.

If a driver did not suppress length checks for a channel command, the device MUST present a subchannel status as detailed in the architecture when the actual length did not match the expected length.

If a driver did suppress length checks for a channel command, the device MUST present a check condition if the transmitted data does not contain enough data to process the command. If the driver submitted a buffer that was too long, the device SHOULD accept the command.

4.3.1.2 Driver Requirements: Basic Concepts

A driver for virtio-ccw devices MUST check for a control unit type of 0x3832 and MUST ignore the device type and model.

A driver SHOULD attempt to provide the correct length in a channel command even if it suppresses length checks for that command.

4.3.2 Device Initialization

virtio-ccw uses several channel commands to set up a device.

4.3.2.1 Setting the Virtio Revision

CCW_CMD_SET_VIRTIO_REV is issued by the driver to set the revision of the virtio-ccw transport it intends to drive the device with. It uses the following communication structure:

struct virtio_rev_info {
        be16 revision;
        be16 length;
        u8 data[];
};

revision contains the desired revision id, length the length of the data portion and data revision-dependent additional desired options.

The following values are supported:

revision    length    data    remarks
0           0                 legacy interface; transitional devices only
1           0                 Virtio 1.0
2-n                           reserved for later revisions

Note that a change in the virtio standard does not necessarily correspond to a change in the virtio-ccw revision.

4.3.2.1.1 Device Requirements: Setting the Virtio Revision

revision

revision

length

data

A device MUST answer with command reject to any virtio-ccw specific channel command that is not contained in the revision selected by the driver.

A device MUST answer with command reject to any attempt to select a different revision after a revision has been successfully selected by the driver.

A device MUST treat the revision as unset from the time the associated subchannel has been enabled until a revision has been successfully set by the driver. This implies that revisions are not persistent across disabling and enabling of the associated subchannel.

4.3.2.1.2 Driver Requirements: Setting the Virtio Revision

A driver MUST NOT issue any other virtio-ccw specific channel commands prior to setting the revision.

After a revision has been successfully selected by the driver, it MUST NOT attempt to select a different revision.

4.3.2.1.3 Legacy Interfaces: A Note on Setting the Virtio Revision

A legacy driver will not issue the CCW_CMD_SET_VIRTIO_REV prior to issuing other virtio-ccw specific channel commands. A non-transitional device therefore MUST answer any such attempts with a command reject. A transitional device MUST assume in this case that the driver is a legacy driver and continue as if the driver selected revision 0. This implies that the device MUST reject any command not valid for revision 0, including a subsequent CCW_CMD_SET_VIRTIO_REV.

4.3.2.2 Configuring a Virtqueue

CCW_CMD_READ_VQ_CONF is issued by the driver to obtain information about a queue. It uses the following structure for communicating:

struct vq_config_block {
        be16 index;
        be16 max_num;
};

The requested number of buffers for queue index is returned in max_num.

Afterwards, CCW_CMD_SET_VQ is issued by the driver to inform the device about the location used for its queue. The transmitted structure is

struct vq_info_block {
        be64 desc;
        be32 res0;
        be16 index;
        be16 num;
        be64 avail;
        be64 used;
};

desc, avail and used contain the guest addresses for the descriptor table, available ring and used ring for queue index, respectively. The actual virtqueue size (number of allocated buffers) is transmitted in num.
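Because the fields are big-endian, a driver that may run on a host of any byte order can build the block byte by byte. The sketch below does this for struct vq_info_block. The helper names and the size macro are illustrative assumptions; the field offsets follow from the member order and widths shown above, assuming no padding between members (all members are naturally aligned). Setting res0 to zero is likewise an assumption of this sketch.

```c
#include <stdint.h>

#define VQ_INFO_BLOCK_SIZE 32 /* 8 + 4 + 2 + 2 + 8 + 8 bytes */

/* Store values in big-endian byte order, independent of host order. */
static void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

static void put_be32(uint8_t *p, uint32_t v)
{
    put_be16(p, (uint16_t)(v >> 16));
    put_be16(p + 2, (uint16_t)v);
}

static void put_be64(uint8_t *p, uint64_t v)
{
    put_be32(p, (uint32_t)(v >> 32));
    put_be32(p + 4, (uint32_t)v);
}

/* Hypothetical helper: fills the buffer transmitted with CCW_CMD_SET_VQ. */
static void encode_vq_info_block(uint8_t out[VQ_INFO_BLOCK_SIZE], uint64_t desc,
                                 uint16_t index, uint16_t num,
                                 uint64_t avail, uint64_t used)
{
    put_be64(out + 0, desc);   /* descriptor table address */
    put_be32(out + 8, 0);      /* res0, zero in this sketch */
    put_be16(out + 12, index); /* queue index */
    put_be16(out + 14, num);   /* actual virtqueue size */
    put_be64(out + 16, avail); /* available ring address */
    put_be64(out + 24, used);  /* used ring address */
}
```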

4.3.2.2.1 Device Requirements: Configuring a Virtqueue

res0

4.3.2.2.2 Legacy Interface: A Note on Configuring a Virtqueue

struct vq_info_block_legacy {
        be64 queue;
        be32 align;
        be16 index;
        be16 num;
};

queue contains the guest address for queue index, num the number of buffers and align the alignment. The queue layout follows 2.4.2 Legacy Interfaces: A Note on Virtqueue Layout.

4.3.2.3 Communicating Status Information

The driver changes the status of a device via the CCW_CMD_WRITE_STATUS command, which transmits an 8 bit status value.

As described in 2.2.2, a device sometimes fails to set the status field: for example, it might fail to accept the FEATURES_OK status bit during device initialization.

4.3.2.3.1 Driver Requirements: Communicating Status Information

status

4.3.2.3.2 Device Requirements: Communicating Status Information

status

status

4.3.2.4 Handling Device Features

Feature bits are arranged in an array of 32 bit values (up to 256 of them, selected by an 8 bit index), making for a total of 8192 feature bits. Feature bits are in little-endian byte order.

The CCW commands dealing with features use the following communication block:

struct virtio_feature_desc {
        le32 features;
        u8 index;
};

features are the 32 bits of features currently accessed, while index describes which of the feature bit values is to be accessed. No padding is added at the end of the structure; it is exactly 5 bytes in length.

The driver obtains the device’s feature set via the CCW_CMD_READ_FEAT command. The device stores the feature bits selected by index in features.

For communicating its supported features to the device, the driver uses the CCW_CMD_WRITE_FEAT command, denoting a features/index combination.
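The following non-normative sketch shows how a feature bit number might be mapped to a features/index combination and how the 5 byte block might be built. The helper names are hypothetical, and the mapping (feature bit n lives in value n / 32 at bit position n % 32) is an assumption of this sketch that is not spelled out in the text above.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_FEATURE_DESC_LEN 5

/* 256 possible index values of 32 bits each give the 8192 feature bits. */
static_assert(256 * 32 == 8192, "feature bit count");

/* Which 32 bit value holds feature bit n (assumed mapping). */
static uint8_t feature_index(unsigned bit)
{
    return (uint8_t)(bit / 32);
}

/* Mask of feature bit n within its 32 bit value (assumed mapping). */
static uint32_t feature_mask(unsigned bit)
{
    return UINT32_C(1) << (bit % 32);
}

/* Build the wire image: features in little-endian order, then index. */
static void encode_feature_desc(uint8_t out[VIRTIO_FEATURE_DESC_LEN],
                                uint32_t features, uint8_t index)
{
    out[0] = (uint8_t)(features & 0xff);
    out[1] = (uint8_t)((features >> 8) & 0xff);
    out[2] = (uint8_t)((features >> 16) & 0xff);
    out[3] = (uint8_t)((features >> 24) & 0xff);
    out[4] = index;
}

/* Read the features value back out of a wire image. */
static uint32_t decode_features(const uint8_t in[VIRTIO_FEATURE_DESC_LEN])
{
    return (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
           ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}
```

Working on a byte buffer rather than a C struct sidesteps compiler padding, since a struct holding a 32 bit member followed by a single byte would normally be padded to 8 bytes.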

4.3.2.5 Device Configuration

The device’s configuration space is located in host memory.

To obtain information from the configuration space, the driver uses CCW_CMD_READ_CONF, specifying the guest memory for the device to write to.

For changing configuration information, the driver uses CCW_CMD_WRITE_CONF, specifying the guest memory for the device to read from.

In both cases, the complete configuration space is transmitted. This allows the driver to compare the new configuration space with the old version and to maintain an internal generation count that it increments whenever the configuration changes.
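One possible way for a driver to keep such a generation count is sketched below. It is non-normative: the structure, the function name and the `CONFIG_MAX` bound are hypothetical, and the buffer passed in is assumed to hold the complete configuration space as returned by CCW_CMD_READ_CONF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CONFIG_MAX 256   /* hypothetical upper bound for this sketch */

struct config_cache {
    uint8_t data[CONFIG_MAX];  /* last configuration space seen */
    size_t len;
    unsigned generation;       /* incremented on every detected change */
    bool valid;                /* false until the first read */
};

/*
 * Record a freshly read configuration space. Returns true and increments
 * the generation count if it differs from the previously cached copy.
 */
static bool config_cache_update(struct config_cache *c,
                                const uint8_t *fresh, size_t len)
{
    assert(c != NULL && fresh != NULL && len <= CONFIG_MAX);
    bool changed = c->valid &&
                   (c->len != len || memcmp(c->data, fresh, len) != 0);
    if (changed)
        c->generation++;
    memcpy(c->data, fresh, len);
    c->len = len;
    c->valid = true;
    return changed;
}
```

The first read only populates the cache, so it does not count as a change.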

4.3.2.6 Setting Up Indicators

In order to set up the indicator bits for host->guest notification, the driver uses different channel commands depending on whether it wishes to use traditional I/O interrupts tied to a subchannel or adapter I/O interrupts for virtqueue notifications. For any given device, the two mechanisms are mutually exclusive.

For the configuration change indicators, only a mechanism using traditional I/O interrupts is provided, regardless of whether traditional or adapter I/O interrupts are used for virtqueue notifications.

4.3.2.6.1 Setting Up Classic Queue Indicators

To communicate the location of the indicator bits for host->guest notification, the driver uses the CCW_CMD_SET_IND command, pointing to a location containing the guest address of the indicators in a 64 bit value.

If the driver has already set up two-staged queue indicators via the CCW_CMD_SET_IND_ADAPTER command, the device MUST post a unit check with command reject to any subsequent CCW_CMD_SET_IND command.
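The mutual exclusion between the two queue indicator mechanisms can be sketched on the device side as follows. The names are hypothetical. The rejection of CCW_CMD_SET_IND after CCW_CMD_SET_IND_ADAPTER follows the requirement above; the opposite direction is an assumption of this sketch, derived from the statement that the two mechanisms are mutually exclusive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; not part of the specification. */
enum ind_mode { IND_NONE, IND_CLASSIC, IND_ADAPTER };
enum ind_result { IND_OK, IND_COMMAND_REJECT };

/* Device reaction to CCW_CMD_SET_IND. */
static enum ind_result on_set_ind(enum ind_mode *mode)
{
    assert(mode != NULL);
    if (*mode == IND_ADAPTER)
        return IND_COMMAND_REJECT;   /* unit check with command reject */
    *mode = IND_CLASSIC;
    return IND_OK;
}

/* Device reaction to CCW_CMD_SET_IND_ADAPTER. */
static enum ind_result on_set_ind_adapter(enum ind_mode *mode)
{
    assert(mode != NULL);
    if (*mode == IND_CLASSIC)
        return IND_COMMAND_REJECT;   /* assumed symmetric behaviour */
    *mode = IND_ADAPTER;
    return IND_OK;
}
```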

4.3.2.6.2 Setting Up Configuration Change Indicators

To communicate the location of the indicator bits used in the configuration change host->guest notification, the driver issues the CCW_CMD_SET_CONF_IND command, pointing to a location containing the guest address of the indicators in a 64 bit value.

4.3.2.6.3 Setting Up Two-Stage Queue Indicators

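Two-stage queue indicators, used with adapter I/O interrupts for virtqueue notifications, consist of:

a summary indicator byte covering the virtqueues for one or more virtio-ccw proxy devices

a set of contiguous indicator bits for the virtqueues for a virtio-ccw proxy device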

To communicate the location of the summary and queue