"Dissecting linux kernel code" or "That syscall shouldn't give that error code!"

2019-07-30T03:54+01:00

While testing koios I came across an error triggering when I asked koios to create an initialization file in the temporary mount point



The following invocation

$ koios init .

was causing koios to trigger the error

Unable to generate config file: Setting extended attribute failed Result not representable

Well, hmm. 'Setting extended attribute failed' is the error code KERR_SETXAT. There are only two places that triggers KERR_SETXAT, and only one place in config generation that can trigger it. Spamming the surrounding with printfs (because unfortunately gdb will not let us watch errno) and running `mk do_test` again, we can see:

errno: 0; line: 329 errno: 0; line: 336 errno: 0; line: 281 errno: 61; line: 286 errno: 34; line: 288 errno: 34; line: 343 Unable to generate config file: Setting extended attribute failed Result not representable

A cursory investigation shows us that of the two errno values we get, only one is serious (The other is expected and a natural part of the functioning of the program):

$ less /usr/include/asm-generic/errno.h #define ENODATA 61 /* No data available */ $ less /usr/include/asm-generic/errno-base.h #define ERANGE 34 /* Math result not representable */

Hold on. Math result? But we don't do any math in the call that triggers that. Do we?



Let's look at the line that gives us that error:

if ((setxat(KOIOS_TESTFILE, KOIOS_TESTATTR, bp, n, 0)) < 0) {

Seems normal to me. setxat is a wrapper I wrote so that we can handle both the Linux and the OSX syscalls the same, it doesn't do any dark magic. A cursory scan and grep of the manual page for setxattr(2) shows no mention of the ERANGE error, however it does mention that it shares error codes with stat(2). stat(2), unfortunately, is not listed to produce the ERANGE error code.



So what's going on?



Well, I *do* have the linux kernel code on my computer, so let's investigate. After a short poke around using grep, it seems that the most likely candidate for the syscall is 'setxattr' at line 413 in fs/xattr.c. (I'll list the line numbers if you want to read the full code along with me, it's linux kernel version 5.1.11)



Well, hold on, immediately we notice that this _does_ mention ERANGE:

error = strncpy_from_user(kname, name, sizeof(kname)); if (error == 0 || error == sizeof(kname)) error = -ERANGE; if (error < 0) return error;

Ok, well kname is defined as:

char kname[XATTR_NAME_MAX + 1];

and 'XATTR_NAME_MAX' (after a quick grep) is defined as:

#define XATTR_NAME_MAX 255

So let's recompile and check what the length of 'name' that we're passing, to make sure we are not hitting that limit:

strlen(name): 15

Hmm. Ok. At this point in time I decided to back-pedal and check my assumptions. Given that I'm compiling under musl, I wonder if it does anything weird. Luckily enough, I have a copy of the musl source on my computer from that time I dug into the DNS name resolution routines. A grep reveals that setxattr is defined in 'src/linux/xattr.c', so let's take a look:

int setxattr(const char *path, const char *name, const void *value, size_t size, int flags) { return syscall(SYS_setxattr, path, name, value, size, flags); }

Oh. Ok so my assumption was right, that it goes straight to a system call. Next let's figure out what function gets called less. Another grep finds us:

fs/xattr.c:480:SYSCALL_DEFINE5(setxattr, const char __user *, pathname,

Let's take a look:

SYSCALL_DEFINE5(setxattr, const char __user *, pathname, const char __user *, name, const void __user *, value, size_t, size, int, flags) { return path_setxattr(pathname, name, value, size, flags, LOOKUP_FOLLOW); }

At this point I moved to cscope to make it easier to track down definitions. For the sake of prosperity, I was invoking it like this:

$ cscope */*.c */*.h */*/*.c */*/*.h */*/*/*.c */*/*/*.h

I also decided to go through it function by function and figure out which parts called by path_setxattr could possibly return ERANGE. I won't make you bored by lazily reusing the above sentence structure ("I grepped for X, which showed Y and made me grep for Z, which showed ..."). Instead, here's a nice handwritten tree of what calls what and if it can return ERANGE (and some notes):

path_setxattr: \ user_path_at: NO |\ user_path_at_empty: NO | \ filename_lookup: NO | \ path_lookupat: NO | \ handle_lookup_down: NO | |\ follow_managed: NO | | \ d_manage -> autofs_d_manage | | |\ autofs_mount_wait | | | \ autofs_wait | | | \ validate_request: NO | | \ follow_automount | | \ d_automount | | \ finish_automount: NO | | \ do_add_mount | | \ lock_mount | | |\ get_mountpoint | | | \ d_set_mounted: NO | | \ graft_tree: NO | | \ attach_recursive_mnt: NO | | \ count_mounts: NO | | \ invent_group_ids: NO | | |\ mnt_alloc_group_id: NO | | | \ ida_alloc_min | | | \ ida_alloc_range: NO | | | \ xas_error | | | \ xa_err / xa_is_err: doesn't seem to | | \ propagate_mnt | | \ propagate_one | | \ count_mounts: NO | \ link_path_walk | |\ may_lookup | | \ inode_permission: NO | | |\ sb_permission: NO | | |\ do_inode_permission: | | | \ permission (fuse_permission) | | | | \ fuse_do_getattr | | | | \ fuse_simple_request: NO | | | | \ fuse_get_req | | | | \ __fuse_get_req: NO | | | | \ fuse_request_alloc | | | | \ __fuse_request_alloc: NO | | | | \ kmem_cache_zalloc | | | | \ kmem_cache_alloc: docs state returns only NULL or a | | | | pointer. At this point I decided | | | | to stop in this tree | | | |\ generic_permission: NO | | | | \ acl_permission_check | | | | \ check_acl | | | | \ posix_acl_permission: NO | | | |\ fuse_access | | | | \ fuse_simple_request: NO | | | | \ (... explored elsewhere) | | | |\ fuse_perm_getattr: NO | | | | \ (... explored elsewhere) | | | \ generic_permission: NO | | | \ (... explored elsewhere) | | |\ devcgroup_inode_permission | | | \ devcgroup_check_permission | | | \ __devcgroup_check_permission: NO | | |\ security_inode_permission | | | \ call_int_hook: NO | |\ d_hash: NO (I couldn't find a fuse handler. the one on line 99 | | of fs/dcache.c didn't return any errors whatsoever, | | though) | |\ walk_component | | \ handle_dots: NO | | |\ follow_dotdot_rcu: NO | | || follow_dotdot | | | \ path_parent_directory: NO | | |\ lookup_fast | | | \ d_revalidate | | | |\ d_revalidate (fuse_dentry_revalidate) | | | | \ fuse_simple_request: NO | | | | \ (... explored elsewhere) | | | | follow_managed: NO | | | \ (... explored elsewhere) | | | step_into | | \ pick_link | | \ nd_alloc_stack | | \ __nd_alloc_stack: NO | \ lookup_last | |\ list_entry: PROBABLY NO (returns a pointer, so an invalid value is | | going to be NULL) | \ complete_walk | \ d_weak_revalidate: PROBABLY NO (I can't find a valid fuse value for | it) \ mnt_want_write: NO |\ __mnt_want_write: NO \ setxattr: YES \ strncpy_from_user: NO |\ __do_strncpy_from_user: NO | cap_convert_nscap: NO | vfs_setxattr \ xattr_permission |\ inode_permission: NO | \ (... explored elsewhere) \ security_inode_setxattr (v1 -- this seems to have two versions?) |\ call_int_hook (inode_setxattr) | \ selinux_inode_setxattr: NO | |\ cap_inode_setxattr: NO | || dentry_has_perm: NO | ||\ inode_has_perm: NO | || \ avc_has_perm: NO | || \ avc_has_perm_noaudit | || |\ avc_denied: NO | || | avc_audit: NO | || \ slow_avc_audit: NO | || avc_has_perm: NO | ||\ (... explored elsewhere) | || security_context_to_sid: NO | ||\ security_context_to_sid_core: NO | || \ string_to_context_struct: NO | || |\ mls_context_to_sid: NO | || | \ mls_context_cpy: NO | || | |\ ebitmap_cpy: NO | || | | ebitmap_set_bit: NO | || \ sidtab_context_to_sid: NO | || \ sidtab_reverse_lookup: NO | || \ sidtab_rcache_search: NO | || | sidtab_find_context: NO | || |\ sidtab_find_context: NO | || | context_cpy: NO | || |\ mls_context_cpy: NO | || | \ ebitmap_cpy: NO | || \ 'func' (convert->func): ? I couldn't find where this is assigned | || security_context_to_sid_force: NO | ||\ security_context_to_sid_core: NO | || \ (... explored elsewhere) | |\ security_validate_transition: NO | | \ security_compute_validatetrans: NO | | \ security_validtrans_handle_fail: NO | \ smack_inode_setxattr: NO | \ smk_curacc: NO | |\ smk_tskacc: NO | | \ smk_access: NO | | smk_bu_inode: NO |\ ima_inode_setxattr: NO | \ ima_protect_xattr: NO |\ evm_inode_setxattr: NO | \ evm_protect_xattr: NO \ security_inode_setxattr (v2) |\ cap_inode_setxattr: NO | \ (... explored elsewhere) \ __vfs_setxattr_noperm: NO | __vfs_setxattr (returns the result of the handler) |\ xattr_resolve_name (returns fuse_xattr_handlers or another handler) | \ fuse_xattr_handlers: NO | \ get = fuse_xattr_get | |\ fuse_getxattr: NO | | \ fuse_simple_request: NO | | |\ (... explored elsewhere) | | | min_t: NO | | set = fuse_xattr_set | |\ fuse_removexattr: NO | ||\ fuse_simple_request: NO | || \ (... explored elsewhere) | |\ fuse_setxattr: NO | | \ fuse_simple_request: NO | | \ (... explored elsewhere) | security_inode_setsecurity \ hook.inode_setsecurity \ selinux_inode_setsecurity \ security_context_to_sid: NO \ (... explored elsewhere)

Unfortunately, as you can see from the completed call graph, it looks like there are only two places that can return ERANGE, and I'm sure that I'm not meeting the condition for either of them. However, while re-reading through fuse_simple_request, I noticed that there was the possibility that it returned req->out.h.error. I figured that, perhaps somewhere along the lines, h.error is getting set to ERANGE. However, doing a rough search for that came up negative.



Next approach was what I originally was trying to avoid: Adding debugging statements to the kernel and recompiling it. I was trying to avoid this because on my old X200 it takes over an hour to compile the kernel from scratch. On the bright side, I managed to use that time to organize my sheet music, so it wasn't a complete loss. After adding some debug statements we can see:

[ 86.303425] Entered path_setxattr [ 86.303428] user_path_at returned error: 0 [ 86.303428] mnt_want_write returned error: 0 [ 86.303452] setxattr returned error: -34 [ 86.303453] returning error: -34

So, setxattr can only return ERANGE in one condition, that I was pretty sure I was _not_ hitting. I should have added debugging statements there as well.

[ 76.889753] Entered path_setxattr [ 76.889757] user_path_at returned error: 0 [ 76.889758] mnt_want_write returned error: 0 [ 76.889759] Entered setxattr [ 76.889760] Not returning EINVAL [ 76.889762] strncpy_from_user returned error: 15 [ 76.889763] We didn't trigger the other 'return error' condition either [ 76.889784] Entered vfs_setxattr [ 76.889785] xattr_permission returned error: 0 [ 76.889786] security_inode_setxattr returned error: 0 [ 76.889788] __vfs_setxattr_noperm returned error: -34 [ 76.889789] vfs_setxattr returned error: -34 [ 76.889791] setxattr returned error: -34 [ 76.889792] returning error: -34

Hmm, we have some more information but it's still not great. As you can see, we're not hitting the ERANGE conditions in setxattr, and by an excerpt of the the next log (after adding more printks) we can see that whatever handler it's choosing is returning the error:

[ 154.153282] security_inode_setxattr returned error: 0 [ 154.153283] Entered __vfs_setxattr_noperm [ 154.153284] Entered __vfs_setxattr [ 154.153286] handler->set returned error: -34 [ 154.153287] __vfs_setxattr returned error: -34

Given that fuse_xattr_handlers does not return ERANGE, we can assume that it my assumption about the handler that it returned was wrong, let's verify this by adding "%pf" which the elinux "Debugging by printing" guide tells will print a symbol name for a pointer value.

[ 51.653248] Handler chosen is: ext2_xattr_user_handler [ext2]

Ok! Now that we have this information, let's do another deep dive into the ext2 handler.



ext2_xattr_user_handler is defined as a structure in xattr_user.c:

const struct xattr_handler ext2_xattr_user_handler = { .prefix = XATTR_USER_PREFIX, .list = ext2_xattr_user_list, .get = ext2_xattr_user_get, .set = ext2_xattr_user_set, };

ext2_xattr_user_set is is a simple passthrough to ext2_xattr_set which _does_ contain the value of ERANGE. Huzzah! ext2_xattr_set is defined in fs/ext2/xattr.c at line 364, the ERANGE mention comes pretty early on:

394 name_len = strlen(name); 395 if (name_len > 255 || value_len > sb->s_blocksize) 396 return -ERANGE;

Hmmm. According to the documentation (man 2 setxattr) we should be returning ENOSPC if we can't store the xattr:

ENOSPC There is insufficient space remaining to store the extended attribute.

It would also probably make more sense to return something like ENAMETOOLONG (An error setxattr is listed as capable of returning because it shares error codes with stat(2)). On a more general case, should we propose a patch to bring the kernel in line with the documentation, or instead bring the docs in line with the kernel's actual behavior? Well, let's base it on example. Looking at the ext4 handlers, we can see that ext4 also returns ERANGE if the name is too long (in ext4_xattr_set_handle):

2317 if (strlen(name) > 255) 2318 return -ERANGE;

Additionally, ext4 appears to be in-line with the documentation and returns ENOSPC when value_len is too big (the exact code path is kind of convoluted, so I won't represent it entirely here, and I could be mistaken about what this section of code does):

2338 if (!ext4_handle_has_enough_credits(handle, credits)) { 2339 error = -ENOSPC; 2340 goto cleanup; 2341 }

Given that clearly other handlers return ERANGE, paired with the fact that we should keep things stable for userspace, it's probably better to submit documentation fixing this rather than attempting to patch ERANGE out.



So, this (unfortunately) three-day journey culminated in the following message to the man-pages mailing list:



linux-man: "[patch] setxattr.2: Add ERANGE to 'ERRORS' section"



And this simple fix in libkoios:

- if (errno == ENOSPC) { + if (errno == ENOSPC || errno == ERANGE) {