Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] Futex Asynchronous Interface
Original-Message-ID: <Pine.LNX.4.44.0206060930240.5920-100000@home.transmeta.com>
Date: Thu, 6 Jun 2002 16:36:58 GMT
Message-ID: <fa.m7v2dav.160k90u@ifi.uio.no>

On Thu, 6 Jun 2002, Rusty Russell wrote:
>
> The method is: open /dev/futex

STOP!

What madness is this?

You have a damn mutex system call, don't introduce more crap in /dev.

Do we create pipes by opening /dev/pipe? No. Do we have major and minor
numbers for sockets and populate /dev with them? No. And as a result,
there have _never_ been any sysadmin problems with either.

You already have to have a system call to bind the particular fd to the
futex _anyway_, so do the only sane thing, and allocate the fd _there_,
and get rid of that stupid and horrible /dev/futex which only buys you
pain, system administration, extra code, and a black star for being
stupid.

		Linus
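
The pipe comparison above can be made concrete with a minimal sketch. It only illustrates that pipe() hands back descriptors directly from the system call, with no device node or filename involved; the helper name pipe_roundtrip is an assumption made up for this illustration.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create a pipe, write msg into it and read it back.
 * No name in /dev (or anywhere else) is involved: the descriptors come
 * straight from the pipe() system call.
 * Returns the number of bytes read, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t outlen)
{
    int fds[2];
    ssize_t n = -1;
    size_t len = strlen(msg);

    if (pipe(fds) == -1)
        return -1;

    if (write(fds[1], msg, len) == (ssize_t)len)
        n = read(fds[0], out, outlen);

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Calling pipe_roundtrip("hello", buf, sizeof buf) should hand the five bytes back through buf, assuming the message fits in the pipe buffer.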

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] Futex Asynchronous Interface
Original-Message-ID: <Pine.LNX.4.44.0206081523410.11630-100000@home.transmeta.com>
Date: Sat, 8 Jun 2002 22:29:26 GMT
Message-ID: <fa.l2cpo2v.1mh639m@ifi.uio.no>

On Fri, 7 Jun 2002, Peter Wächtler wrote:
>
> What about /proc/futex then?

Why?

Tell me _one_ advantage from having the thing exposed as a filename?

The whole point with "everything is a file" is not that you have some
random filename (indeed, sockets and pipes show that "file" and "filename"
have nothing to do with each other), but the fact that you can use common
tools to operate on different things.

But there's absolutely no point in opening /dev/futex from a shell script
or similar, because you don't get anything from it. You still have to bind
the fd to its real object.

In short, the name "/dev/futex" (or "/proc/futex") is _meaningless_.
There's no point to it. It has no life outside the FUTEX system call, and
the only thing that you can do by exposing it as a name is to cause
problems for people who don't want to mount /proc, or who do not happen to
have that device node in their /dev.

> Give it an entry in the namespace, why not with sockets (unix and ip) also?

Perhaps because you cannot enumerate sockets and pipes? They don't _have_
names before they are created. Same as futexes, btw.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] Futex Asynchronous Interface
Original-Message-ID: <Pine.LNX.4.44.0206091029001.13459-100000@home.transmeta.com>
Date: Sun, 9 Jun 2002 17:51:15 GMT
Message-ID: <fa.l4shnqv.1g1a3hr@ifi.uio.no>

On Sun, 9 Jun 2002, Peter Wächtler wrote:
>
> Still you can open a file in the namespace and write some commands to it.
> Then it turns out to be a socket on port 25:
>
> fd=open("/dev/socket",O_RDWR);
> write(fd,"connect stream 25\n",sizeof(..));
> write(fd,"helo mail.my.com\n",..);

Yes, obviously you can avoid system calls entirely, and replace all of
them with read/write of commands.

This is not even a very uncommon idea: the above is basically message
passing, and is largely how many microkernels work. Except they don't call
it read/write, they tend to call it send/recv, and they aren't "file
descriptors", they are "ports".

It has advantages: because you only have one set of primitives, it's more
easily abstracted at that level, meaning that you can (and people do) make
it distributed etc without having to worry about local semantics.

It has disadvantages too: performance tends to be bad (you have to copy
around and parse the commands that are no longer implicit in the system
call number), and while there is a high level of abstraction on one level
("everything is a 'port' that can receive or send messages"), at some point
the proverbial shit hits the fan and you've moved the details behind the
abstraction down (and now the data stream is no longer just bytes, but has
a meaning in itself).

But yes, the sequences

	open("/dev/socket")			-> socket()
	write(fd,"connect stream 25")		-> connect()

are obviously "equivalent". It's not my personal favourite equivalence,
though. I'd much rather add the information at _open_ time, and make it a
name-space issue, so that you'd do something like

	open("//sockfs/dst=123.45.67.89:25", O_RDWR);

instead. Which is _also_ entirely equivalent, of course (the "namespace"
approach does require that you be able to do "fd-relative" lookups, so
that you could also do

	sk = open("//sockfs", O_RDWR);
	sk2 = fd_open(sk, "dst=123.45.67.89:25", O_RDWR);

which is actually useful even in regular files too, just as a way of doing
directory-relative file opens without having to do a "chdir()").

HOWEVER, the fact is that exactly because they are equivalent, there is no
real difference between them. So you might as well just use the old UNIX
behaviour, and if you want to open sockets from a script, you use any of
the already _existing_ socket script helpers.

For port 25, you have one called "sendmail". For port 80, you have things
like "lynx -source". And you have tons of things like "netpipes", for
doing generic scripting of sockets.

The fact is, trying to come up with new ways to do the same old thing is
_not_ a good idea. It may look cool to expose sockets in the namespace,
but what's the actual added advantage over existing standard practices?

Unless that can be shown, there's just no point. Do a google search for
"netpipes", I'm sure you'll find it can do what you wanted.

Sorry to rain on the "cool feature" parade, but I want to see some
_advantage_ from exposing new names in the namespace.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] Futex Asynchronous Interface
Original-Message-ID: <Pine.LNX.4.44.0206091056550.13459-100000@home.transmeta.com>
Date: Sun, 9 Jun 2002 18:10:36 GMT
Message-ID: <fa.l7croav.1ih021k@ifi.uio.no>

On 9 Jun 2002, Kai Henningsen wrote:
>
> However, I don't think that's all that important. What I'd rather see is
> making the network devices into namespace nodes. The situation of eth0 and
> friends, from a Unix perspective, is utterly unnatural.

But what would you _do_ with them? What would be the advantage as compared
to the current situation?

Now, to configure a device, you get a fd to the device the same way you
get a fd _anyway_ - with "socket()".

And anybody who says that "socket()" is utterly unnatural to the UNIX way
is quite far out to lunch. It may be unnatural to the Plan-9 way of
"everything is a namespace", but that was never the UNIX way. The UNIX way
is "everything is a file descriptor or a process", but that was never
about namespaces.

Yes, some old-timers could argue that original UNIX didn't have sockets,
and that the BSD interface is ugly and an abomination and that it _should_
have been a namespace thing, but that argument falls flat on its face when
you realize that the "pipe()" system call _was_ in original UNIX, and has
all the same issues.

Don't get hung up about names.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: of ethernet names (was [PATCH] Futex Asynchronous
Original-Message-ID: <Pine.LNX.4.44.0206091130490.13751-100000@home.transmeta.com>
Date: Sun, 9 Jun 2002 18:35:26 GMT
Message-ID: <fa.l2t5oiv.1m1q39i@ifi.uio.no>

On Sun, 9 Jun 2002, Dr. David Alan Gilbert wrote:
>
> Personally I would do away with ifconfig and replace it with
> cat in and out of device nodes; ifconfig seems to suffer about having to
> know about every protocol on every device type and the kernel has to
> provide interfaces for it that only it uses.

Well, the kernel would have to provide the same interfaces for "cat" if
you did it that way, and it would probably take up more space and cause
more kernel bloat.

And we'd still have to have the old interfaces for backwards compatibility
for ifconfig.

Is the "magic ioctl" approach ugly? Sure. But it's fairly well contained
to just one program (ifconfig), and everybody else just uses that. I think
it's less horrible than the alternatives right now.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2/5] signalfd v2 - signalfd core ...
Date: Thu, 08 Mar 2007 16:24:03 UTC
Message-ID: <fa.yLetPnUEmfUYainqduUPKkNd06w@ifi.uio.no>

On Thu, 8 Mar 2007, Davide Libenzi wrote:
>
> With this patch, if you get a POLLIN, you have a signal to read for sure
> (well, unless another thread/task reads it before you - but that's just
> something you have to take care of). There is no explicit check for
> O_NONBLOCK now, but a zero timeout would do exactly the same thing.

You missed David's worry, I think.

Not only is POLLIN potentially an edge event (depending on the interface
you use to fetch it), but even as a level-triggered one you generally want
to read as much as possible per POLLIN event, and go back to the event
loop only when you get EAGAIN.

So that's in addition to the read/signal race with other
threads/processes.

You solved it by having a separate system call, but since it's a regular
file descriptor, why have a new system call at all, and not just make it
be a "read()"? In which case you definitely want O_NONBLOCK support.

The UNIX philosophy is often quoted as "everything is a file", but that
really means "everything is a stream of bytes".

In Windows, you have 15 different versions of "read()" with sockets and
files and pipes all having strange special cases and special system calls.
That's not the UNIX way. It should be just a "read()", and then people can
use general libraries and treat all sources the same.

For example, the main select/poll/epoll loop may be the one doing all the
reading, and then pass off "full buffers only" to the individual per-fd
"action routines". And that kind of model really very fundamentally wants
an fd to be an fd to be an fd - not "some file descriptors need
'read_from_sigfd()', and some file descriptors need 'read()', and some
file descriptors need 'recvmsg()'" etc.

So I think you should get rid of signalfd_dequeue(), and just replace it
with a "read()" function.

		Linus
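
The "read as much as possible per POLLIN event, and go back to the event loop only when you get EAGAIN" pattern from the message above can be sketched as follows. This is a generic illustration that works on any descriptor supporting read() and O_NONBLOCK; the helper names set_nonblocking and drain_fd are assumptions, and the sketch simply discards the data it reads.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switch fd to non-blocking mode.
 * Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL);
    if (fl == -1)
        return -1;
    return fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}

/* Hypothetical helper: keep calling read() until the descriptor reports
 * EAGAIN (nothing left for now) or EOF.
 * Returns the total number of bytes consumed, or -1 on a real error. */
long drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;          /* a real loop would hand buf to a handler */
            continue;
        }
        if (n == 0)
            return total;        /* EOF */
        if (errno == EINTR)
            continue;            /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;        /* drained: back to the event loop */
        return -1;
    }
}
```

The same drain_fd() works whether fd is a pipe, a socket or any other readable descriptor, which is the point being argued.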

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2/5] signalfd v2 - signalfd core ...
Date: Thu, 08 Mar 2007 17:29:00 UTC
Message-ID: <fa.36v6m+BTJbc0UZr7qfaU4I7I8F0@ifi.uio.no>

On Thu, 8 Mar 2007, Michael K. Edwards wrote:
>
> Make it a netlink socket and fetch your structures using recvmsg().
> siginfo_t belongs in ancillary data.

Gaah. That interface is horrible.

> The UNIX philosophy is "everything's a file". The Berkeley philosophy
> is "everything's a socket, except for files, which are feeble
> mini-sockets". I'd go with the Berkeley crowd here.

No, the berkeley crowd is totally out to lunch.

I might agree with you *if* you could actually do "recvmsg()" on arbitrary
file descriptors, but you cannot.

We could fix that in Linux, of course, but the fact is, "recvmsg()" is
*not* a superset of "read()". In general, it's a *subset*, exactly because
very few file descriptors support it.

So the normal way to read from a file descriptor (and the *only* way in
any generic select loop) is to use "read()". That's the only thing that
works for everything. And we shouldn't break that.

The sad part is that there really is no reason why the BSD crowd couldn't
have done recvmsg() as an "extended read with per-system call flags",
which would have made things like O_NONBLOCK etc unnecessary, because you
could do it just with MSG_DONTWAIT..

So anybody who would "go with the Berkeley crowd" really shows a lot of
bad taste, I'm afraid. The Berkeley crowd really messed up here, and it's
so long ago that I don't think there is any point in us trying to fix it
any more.

(But if somebody makes recvmsg a general VFS interface and makes it just
work for everything, I'd probably take the patch as a cleanup. I really
think it should have been a "struct file_operations" thing rather than
being a socket-only thing.. But since you couldn't portably use it anyway,
the thing is pretty moot)

		Linus
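
The claim above that recvmsg() is a subset of read() rather than a superset can be checked with a small sketch: on Linux, recvmsg() on a pipe fails with ENOTSOCK, while on a socket the per-call MSG_DONTWAIT flag gives non-blocking behaviour without touching O_NONBLOCK. The helper name try_recvmsg is an assumption for illustration.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: attempt a single non-blocking recvmsg() on fd.
 * Returns the number of bytes received, or the negated errno on failure
 * (e.g. -ENOTSOCK when fd is not a socket, -EAGAIN when nothing is queued). */
long try_recvmsg(int fd, char *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    /* MSG_DONTWAIT makes just this call non-blocking. */
    ssize_t n = recvmsg(fd, &msg, MSG_DONTWAIT);
    return n >= 0 ? (long)n : -(long)errno;
}
```

Calling this on the read end of a pipe yields -ENOTSOCK even when data is waiting, whereas a plain read() on that same descriptor succeeds.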

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2/5] signalfd v2 - signalfd core ...
Date: Thu, 08 Mar 2007 17:16:49 UTC
Message-ID: <fa.cYAvg0XfOZcR/yrH852IQWdtT9E@ifi.uio.no>

On Thu, 8 Mar 2007, Davide Libenzi wrote:
>
> The reason for the special function, was not to provide a non-blocking
> behaviour with zero timeout (that's just a side effect), but to read the
> siginfo. I was all about using read(2) (and v1 used it), but when you have
> to transfer complex structures over it, it becomes hell. How do you
> cleanly compat over a f_op->read callback for example?

I agree that it gets a bit "interesting", and one option might be that the
"read()" interface just gets the signal number and the minimal siginfo
information, which is, after all, what 99% of all apps actually care
about.

But "siginfo_t" is really a *horrible* structure. Nobody sane should ever
use siginfo_t, and the designer of that thing was so high on LSD that it's
not even funny. Re-using fields in a union? Values that depend on other
bits in the thing in random manners?

In other words, I bet that we could just make it a *lot* better by making
the read structure be:

 - 16 4-byte fields (fixed 64-byte packet), where each field is an
   uint32_t (we could even do it in network byte order if we care and if
   you want to just pipe the information from one machine to another, but
   that sounds a bit excessive ;)

 - Just put the fields people actually use at fixed offsets: si_signo,
   si_errno, si_pid, si_uid, si_band, si_fd.

 - that still leaves room for the other cases if anybody ever wants them
   (but I doubt it - things like si_addr are really only useful for
   synchronous signals that are actually done as *signals*, since you
   cannot defer a SIGBUS/SIGSEGV/SIGILL *anyway*).

So I bet 99% of users actually just want si_signo, while some small subset
might want the SIGCHLD info and some of the special cases (eg we might
want to add si_addr as a 64-bit thing just because the USB stack sends a
SI_ASYNCIO thing for completed URB's, so a libusb might want it, but
that's probably the only such user).

And it would be *cleaner* than the mess that is siginfo_t..

(I realize that siginfo_t is ugly because it built up over time, using the
same buffer for many different things. I'm just saying that we can
probably do better by *not* using it, and just laying things out in a
cleaner manner to begin with, which also solves any compatibility issues)

		Linus
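
The fixed 64-byte packet proposed above might look roughly like the following sketch. The struct name, the sp_ field prefix and the field order are assumptions made for illustration only; this is not the layout of the struct signalfd_siginfo that Linux eventually shipped.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for the proposal: sixteen uint32_t slots, no unions,
 * the commonly used fields at fixed offsets and the rest reserved. */
struct sig_packet {
    uint32_t sp_signo;        /* offset  0: signal number              */
    uint32_t sp_errno;        /* offset  4: si_errno                   */
    uint32_t sp_pid;          /* offset  8: sending pid                */
    uint32_t sp_uid;          /* offset 12: sending uid                */
    uint32_t sp_band;         /* offset 16: band event (SIGPOLL/SIGIO) */
    uint32_t sp_fd;           /* offset 20: file descriptor            */
    uint32_t sp_reserved[10]; /* room for later additions              */
};

/* The size is identical for 32-bit and 64-bit callers, which is what
 * sidesteps the compat problem raised in the quoted question. */
_Static_assert(sizeof(struct sig_packet) == 64,
               "packet must be exactly 16 four-byte fields");

/* Hypothetical helper: build a zeroed packet carrying just a signal number. */
struct sig_packet sig_packet_for(uint32_t signo)
{
    struct sig_packet p = { 0 };
    p.sp_signo = signo;
    return p;
}
```

Because every field is a fixed-width integer at a fixed offset, a reader can parse the packet from a plain read() without knowing anything about the kernel's word size.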

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2/5] signalfd v2 - signalfd core ...
Date: Thu, 08 Mar 2007 19:27:57 UTC
Message-ID: <fa.G19YT6NjsTW5pT0AnjFeFoS9Wpo@ifi.uio.no>

On Thu, 8 Mar 2007, Davide Libenzi wrote:
>
> So, to cut it short, I can do the pseudo-siginfo read(2), but I don't
> like it too much (little, actually). The siginfo, as bad as it is, is a
> standard used in many POSIX APIs (hence even in kernel), and IMO if we
> want to send that back, a struct siginfo should be.
> No?

I think it's perfectly fine if you make it "struct siginfo" (even though I
think it's a singularly ugly struct). It's just that then you'd have to
make your read() know whether it's a compat-read or not, which you really
can't.

Which is why you introduced a new system call, but that leads to all the
problems with the file descriptor no longer being *usable*.

Think scripts. It's easy to do reads in perl scripts, and parse the
output. In contrast, making perl use a new system call is quite
challenging.

And *that* is why "everything is a stream of bytes" is so important. You
don't know where the file descriptor has been, or who uses it. Special
system calls for special file descriptors are just *wrong*.

After all, that's why we'd have a signalfd() in the first place: exactly
so that you do *not* have to use special system calls, but can just pass
it on to any event waiting mechanism like select, poll, epoll. The same is
just *even*more*true* when it comes to reading the data!

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...
Date: Sat, 10 Mar 2007 21:44:53 UTC
Message-ID: <fa.jXxFH+f6+h145nuS3tS+QtqwPi0@ifi.uio.no>

On Sat, 10 Mar 2007, Nicholas Miell wrote:
>
> That's what the sigevent structure is for -- to describe how events
> should be signaled to userspace, whether by signal delivery, thread
> creation, or queuing to event completion ports. If you think
> extending it would be bad, I can show you the line in POSIX where it
> encourages the contrary.

I'm sorry, but by pointing to the POSIX timer stuff, you're just making
your argument weaker.

POSIX timers are a horrible crock and over-designed to be a union of
everything that has ever been done. Nasty. We had tons of bugs in the
original setup because they were so damn nasty.

I'd rather look at just about *anything* else for good design than from
some of the abortions that are posix-timers.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...
Date: Sat, 10 Mar 2007 22:42:49 UTC
Message-ID: <fa.VEzwAybcgwouCeDwgcbOkbLjnWw@ifi.uio.no>

On Sat, 10 Mar 2007, Nicholas Miell wrote:
>
> Care to elaborate on why they're a horrible crock?

It's a *classic* case of an interface that tries to do everything under
the sun.

Here's a clue: look at any system call that takes a union as part of its
arguments. Count them. I think we have two:

 - struct siginfo
 - struct sigevent

and they are both broken horrible interfaces where the data structures
depend on various flags.

It's just not the UNIX system call way. And none of it really makes sense
if you already have a file descriptor, since at that point you know what
the notification mechanism is.

I'd actually much rather do POSIX timers the other way around: associate a
generic notification mechanism with the file descriptor, and then
implement posix_timer_create() on top of timerfd. Now THAT sounds like a
clean unix-like interface ("everything is a file") and would imply that
you'd be able to do the same kind of notification for any file descriptor,
not just timers.

But posix timers as they are done now are just an abomination. They are
not unix-like at all.

> And are the bugs fixed? If so, why replace them? They work now.

.. but the reason for the bugs was largely a very baroque interface, which
didn't get fixed (because it's specified by the standard).

I'd rather have straightforward interfaces. The timerfd() one looked a lot
more straightforward than posix timers.

(That said, using "struct itimerspec" might be a good idea. That would
also obviate the need for TFD_TIMER_SEQ, since an itimerspec automatically
has both "base" and "incremental" parts).

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...
Date: Sun, 11 Mar 2007 00:35:57 UTC
Message-ID: <fa.ZHzG7Ky4QN/om/8pdENWfOkm+WE@ifi.uio.no>

On Sat, 10 Mar 2007, Nicholas Miell wrote:
> >
> > I'd actually much rather do POSIX timers the other way around: associate a
> > generic notification mechanism with the file descriptor, and then
> > implement posix_timer_create() on top of timerfd. Now THAT sounds like a
> > clean unix-like interface ("everything is a file") and would imply that
> > you'd be able to do the same kind of notification for any file descriptor,
> > not just timers.
> >
>
> But timers aren't files or even remotely file-like

What do you think "a file" is?

In UNIX, a file descriptor is pretty much anything. You could say that
sockets aren't remotely file-like, and you'd be right. What's your point?
If you can read on it, it's a file.

And the real point of the whole signalfd() is that there really *are* a
lot of UNIX interfaces that basically only work with file descriptors. Not
just read, but select/poll/epoll.

They currently have just one timeout, but the thing is, if UNIX had just
had "timer file descriptors", they'd not need even that one. And even with
the timeout, Davide's patch actually makes for a *better* timeout than the
ones provided by select/poll/epoll, exactly because you can do things like
repeating timers and absolute time etc.

Much more naturally than the timer interface we currently have for those
system calls.

The same goes for signals. The whole "pselect()" thing shows that signals
really *should* have been file descriptors, and suddenly you don't need
"pselect()" at all.

So the "not remotely file-like" is not actually a real argument. One of
the big *points* of UNIX was that it unified a lot under the general
umbrella of a "file descriptor". Davide just unifies even more.

		Linus