Linux 4.0 has been released on Sun, 12 Apr 2015.

Summary: This release adds support for live patching the kernel code, aimed primarily at fixing security updates without rebooting; DAX, a way to avoid using the kernel cache when filesystems run on systems with persistent memory storage; kasan, a dynamic memory error detector that allows to find use-after-free and out-of-bounds bugs; lazytime, an alternative to relatime, which causes access, modified and changed time updates to only be made in the cache and written to the disk opportunistically; allow overlayfs to have multiple lower layers, support of Parallel NFS server architecture; and dm-crypt CPU scalability improvements. There are also new drivers and many other small improvements.

1. Prominent features

1.1. Arbitrary version change

This release increases the version to 4.0. This switch from 3.x to 4.0 version numbers is, however, entirely meaningless and it should not be associated to any important changes in the kernel. This release could have been 3.20, but Linus Torvalds just got tired of the old number, made a poll, and changed it. Yes, it is frivolous. The less you think about it, the better.

1.2. Live patching

This release introduces "livepatch", a feature for live patching the kernel code, aimed primarily at systems who want to get security updates without needing to reboot. This feature has been born as result of merging kgraft and kpatch, two attempts by SuSE and Red Hat that where started to replace the now propietary ksplice. It's relatively simple and minimalistic, as it's making use of existing kernel infrastructure (namely ftrace) as much as possible. It's also self-contained and it doesn't hook itself in any other kernel subsystems.

In this release livepatch is not feature complete, yet it provides a basic infrastructure for function "live patching" (i.e. code redirection), including API for kernel modules containing the actual patches, and API/ABI for userspace to be able to operate on the patches (look up what patches are applied, enable/disable them, etc). Most CVEs should be safe to apply this way. Only the x86 architecture is supported in this release, others will follow.

For more details see the merge commit

Sample live patching module: commit

Code commit

1.3. DAX - Direct Access, for persistent memory storage

Before being read by programs, files are usually first copied from the disk to the kernel caches, kept in RAM. But the possible advent of persistent non-volatile memory that would be also be used as disk changes radically the way the kernel deals with this process: the kernel cache would become unnecesary overhead.

Linux has had, in fact, support for this kind of setups since 2.6.13. But the code wasn't maintaned and only supported ext2. In this release, Linux adds DAX (Direct Access, the X is for eXciting). DAX removes the extra copy incurred by the buffer by performing reads and writes directly to the persistent-memory storage device. For file mappings, the storage device is mapped directly into userspace. Support for ext4 has been added.

Recommended LWN article: Supporting filesystems in persistent memory

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

1.4. kasan, kernel address sanitizer

Kernel Address sanitizer (KASan) is a dynamic memory error detector. It provides fast and comprehensive solution for finding use-after-free and out-of-bounds bugs. Linux already has the kmemcheck feature, but unlike kmemcheck, KASan uses compile-time instrumentation, which makes it significantly faster than kmemcheck.

The main idea of KASAN is to use shadow memory to record whether each byte of memory is safe to access or not, and use compiler's instrumentation to check the shadow memory on each memory access. Address sanitizer uses 1/8 of the memory addressable in kernel for shadow memory and uses direct mapping with a scale and offset to translate a memory address to its corresponding shadow address.

Code: commit, commit, commit, commit, commit

Unix filesystems keep track of information about files, such as the last time a file was accessed or modified. Keeping track of this information is very expensive, specially the time when a file was accessed ("atime"), which encourages many people to disable it with the mount option "noatime". To alleviate this problem, the "relatime" mount option was added, the atime is only updated if the previous value is earlier than the modification time, or if the file was last accessed more than 24 hours ago. This behaviour, however, breaks some programs that rely on accurate access time tracking to work, and it's also against the POSIX standard.

In this release, Linux adds another alternative: "lazytime". Lazytime causes access, modified and changed time updates to only be made in the cache. The times will only be written to the disk if the inode needs to be updated anyway for some non-time related change, if fsync(), syncfs() or sync() are called, or just before an undeleted inode is evicted from memory. This is POSIX compliant, while at the same time improving the performance.

Recommended LWN article: Introducing lazytime

Code: commit, commit, commit

1.6. Multiple lower layers in overlayfs

In overlayfs, multiple lower layers can now be given using the the colon (":") as a separator character between the directory names. For example:

mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged

The specified lower directories will be stacked beginning from the rightmost one and going left. In the above example lower1 will be the top, lower2 the middle and lower3 the bottom layer. "upperdir=" and "workdir=" may be omitted, in that case the overlay will be read-only.

Code: commit, commit

1.7. Support Parallel NFS server, default to NFS v4.2

Parallel NFS (pNFS) is a part of the NFS v4.1 standard that allows compute clients to access storage devices directly and in parallel. The pNFS architecture eliminates the scalability and performance issues associated with NFS servers deployed today. This is achieved by the separation of data and metadata, and moving the metadata server out of the data path.

This release adds support for pNFS server, and drivers for the block layout with XFS support to use XFS filesystems as a block layout target, and the flexfiles layout.

Also, in this release the NFS server defaults to NFS v4.2.

Code: commit, commit, commit, commit, commit, commit

1.8. dm-crypt scalability improvements

This release significantly increases the dm-crypt CPU scalability performance thanks to changes that enable effective use of an unbound workqueue across all available CPUs. A large battery of tests were performed to validate these changes, summary of results is available here

Merge: commit

2. Drivers and architectures

All the driver and architecture-specific changes can be found in the Linux_4.0-DriversArch page

3. File systems

XFS Adds support for sys_renameat2() commit Remove deprecated sysctls xfsbufd_centisecs and age_buffer_centisecs commit

EXT4 Support "readonly" filesystem flag to mark a FS image as read-only, tunable with tune2fs. It prevents the kernel and e2fsprogs from changing the image commit

Btrfs Add code to support file creation time commit

NFSv4.1 Allow parallel LOCK/LOCKU calls commit Allow parallel OPEN/OPEN_DOWNGRADE/CLOSE commit

UBIFS Add security.* XATTR support for the UBIFS commit Add xattr support for symlinks commit

OCFS2 Add a mount option journal_async_commit on ocfs2 filesystem. When this feature is opened, journal commit block can be written to disk without waiting for descriptor blocks, which can improve journal commit performance. Using the fs_mark benchmark, using journal_async_commit shows a 50% improvement commit Currently in case of append O_DIRECT write (block not allocated yet), ocfs2 will fall back to buffered I/O. This has some disadvantages. In this version, the direct I/O write doesn't fallback to buffer I/O write any more because the allocate blocks are enabled in direct I/O now commit, commit, commit

F2FS Introduce a batched trim commit Support "norecovery" mount option, which is mostly same as "disable_roll_forward". The only difference is that "norecovery" should be activated with read-only mount option. This can be used when user wants to check whether f2fs is mountable or not without any recovery process commit Add F2FS_IOC_GETVERSION ioctl for getting i_generation from inode, after that, users can list file's generation number by using "lsattr -v commit



4. Block

Ported to blk-multiqueue loop: Add blk-mq support, which greatly improves performance for sequential and random reads commit Device-mapper commit rbd commit UBI commit

blk-multiqueue: Add support for tag allocation policies and make libata use this blk-mq tagging, instead of rolling their own commit, commit

UBI: Implement UBI_METAONLY, a new open mode for UBI volumes, it indicates that only meta data is being changed commit

5. Core (various)

pstore: Add pmsg - user-space accessible pstore object commit

rcu: Optionally run grace-period kthreads at real-time priority. Recent testing has shown that under heavy load, running RCU's grace-period kthreads at real-time priority can improve performance and reduce the incidence of RCU CPU stall warnings commit

GDB scripts for debugging the kernel. If you load vmlinux into gdb with the option enabled, the helper scripts will be automatically imported by gdb as well, and additional functions are available to analyze a Linux kernel instance. See Documentation/gdb-kernel-debugging.txt for further details commit

Remove CONFIG_INIT_FALLBACK commit

6. Memory management

cgroups: Per memory cgroup slab shrinkers commit

slub: optimize memory alloc/free fastpath by removing preemption on/off commit

Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can detect zero_page in /proc/kpageflags, and then do memory analysis more accurately commit

Make /dev/mem an optional device commit

Add support for resetting peak RSS, which can be retrieved from the VmHWM field in /proc/pid/status, by writing "5" to /proc/pid/clear_refs commit

Show page size in /proc/<pid>/numa_maps as "kernelpagesize_kB" field to help identifying the size of pages that are backing memory areas mapped by a given task. This is specially useful to help differentiating between HUGE and GIGANTIC page backed VMAs commit

geneve: Add Geneve GRO support commit

zsmalloc: add statistics support commit

Incorporate read-only pages into transparent huge pages commit

memcontrol cgroup: Introduce the basic control files to account, partition, and limit memory using cgroups in default hierarchy mode. The old interface will be maintained, but a clearer model and improved workload performance should encourage existing users to switch over to the new one eventually commit

Replace remap_file_pages() syscall with emulation commit

7. Virtualization

KVM: Add generic support for page modification logging, a new feature in Intel "Broadwell" Xeon CPUs that speeds up dirty page tracking commit

vfio: Add device request interface indicating that the device should be released commit

vmxnet3: Make Rx ring 2 size configurable by adjusting rx-jumbo parameter of ethtool -G commit

virtio_net: add software timestamp support commit

virtio_pci: modern driver commit, add an options to disable legacy driver commit, commit

8. Cryptography

aesni: Add support for 192 & 256 bit keys to AES-NI RFC4106 commit

algif_rng: add random number generator support commit

octeon: add MD5 module commit

qat: add support for CBC(AES) ablkcipher commit

9. Security

SELinux : Add security hooks to the Android Binder that enable security modules such as SELinux to implement controls over Binder IPC. The security hooks include support for controlling what process can become the Binder context manager, invoke a binder transaction/IPC to another process, transfer a binder reference to another process , transfer an open file to another process. These hooks have been included in the Android kernel trees since Android 4.3 (commit).

SMACK: secmark support for netfilter (commit).

TPM 2.0 support (commits: 1, 2, 3).

Device class for TPM, sysfs files are moved from /sys/class/misc/tpmX/device/ to /sys/class/tpm/tpmX/device/ (commit).

10. Tracing & perf

perf mem: Enable sampling loads and stores simultaneously, it could only do one or the other before yet there was no hardware restriction preventing simultaneous collection commit

perf tools: Support parameterized and symbolic events. See links for documentation commit, commit

AMD range breakpoints support: breakpoints are extended to support address range through perf event with initial backend support for AMD extended breakpoints. For example set write breakpoint from 0x1000 to 0x1200 (0x1000 + 512): perf record -e mem:0x1000/512:w commit, commit

11. Networking

TCP: Add the possibility to define a per route/destination congestion control algorithm. This opens up the possibility for a machine with different links to enforce specific congestion control algorithms with optimal strategies for each of them based on their network characteristics commit

Mitigate TCP "ACK loop" DoS scenarios by rate-limiting outgoing duplicate ACKs sent in response to incoming "out of window" segments. For more details, see merge. Code: commit, commit, commit, commit

udpv6: Add lockless sendmsg() support, thus allowing multiple threads to send to a single socket more efficiently commit

ipv4: Automatically bring up DSA master network devices, which allows DSA slave network devices to be used as valid interfaces for e.g: NFS root booting by allowing kernel IP auto-configuration to succeed on these interfaces commit

ipv6: Add sysctl entry(accept_ra_mtu) to disable MTU updates from router advertisements commit

vxlan: Implement supports for the Group Policy VXLAN extension to provide a lightweight and simple security label mechanism across network peers based on VXLAN. It allows further mapping to a SELinux context using SECMARK, to implement ACLs directly with nftables, iptables, OVS, tc, etc commit

vxlan: Add support for remote checksum offload in VXLAN. It is described here. commit

net: openvswitch: Support masked set actions. commit

Infiniband: Add support for extensible query device capabilities verb to allow adding new features commit

Layer 2 Tunneling Protocol (l2tp): multicast notification to the registered listeners when the tunnels/sessions are created/modified/deleted commit

SUNRPC: Set SO_REUSEPORT socket option for TCP connections to bind multiple TCP connections to the same source address+port combination commit

tipc: involve namespace infrastructure commit

802.15.4: introduce support for cca settings commit

Wireless Add new GCMP, GCMP-256, CCMP-256, BIP-GMAC-128, BIP-GMAC-256, and BIP-CMAC-256 cipher suites. These new cipher suites were defined in IEEE Std 802.11ac-2013 commit, commit, commit, commit, commit New NL80211_ATTR_NETNS_FD which allows to set namespace via nl80211 by fd commit Support per-TID station statistics commit Allow including station info in delete event commit, commit Allow usermode to query wiphy specific regdom commit

bridge offload bridge port attributes to switch ASIC if feature flag set commit Support for allowing userspace to pack multiple vlans and VLAN ranges in setlink and dellink requests for improved performance commit Add ability to enable TSO commit

Near Field Communication (NFC) HCI over NCI protocol support (Some secure elements only understand HCI and thus we need to send them HCI frames) commit NCI NFCEE (NFC Execution Environment, typically an embedded or external secure element) discovery and enabling/disabling support commit, commit, commit, commit, commit, commit, commit NFC_EVT_TRANSACTION userspace API addition, it is sent through netlink in order for a specific application running on a secure element to notify userspace of an event commit Tx timestamps are looped onto the error queue on top of an skb. This mechanism leaks packet headers to processes unless the no-payload options SOF_TIMESTAMPING_OPT_TSONLY is set. A new sysctl (tstamp_allow_data) optionally drops looped timestamp with data. This only affects processes without CAP_NET_RAW commit, commit, commit

Bluetooth Enable LE Data Length Extension feature from Bluetooth 4.2 specification commit Expose information in debugfs: Secure Simple Pairing commit, debug keys usage setting commit, hardware error code commit, remote OOB information commit HCI Read Stored Link Keys commit, commit HCI Delete Stored Link Key commit, commit Support static address when BR/EDR has been disabled commit

tc: add BPF-based action. This action provides a possibility to execute custom BPF code commit

net: sched: Introduce connmark action commit

Add Transparent Ethernet Bridging GRO support commit

netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature flag for switch device offloads commit

netfilter: nft_compat: add ebtables support commit network namespace: Add rtnl cmd to add and get peer netns ids. A user can define an id for a peer netns by providing a FD or a PID. These ids are local to the netns where it is added (i.e. valid only into this netns) commit, commit

openvswitch: Add support for checksums on UDP tunnels. commit

openvswitch: Support VXLAN Group Policy extension commit

12. List of merges

13. Other news sites