From: torvalds@transmeta.com (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: Re: NT kernel guy playing with Linux
Date: 27 Jun 1999 03:25:59 GMT

In article <7l3oan$mrj$1@wire.cadcamlab.org>,
Peter Samuelson <psamuels@sampo.creighton.edu> wrote:
>I say this in the hope of sparing others the confusion I had about what
>a spinlock is. (It looked a lot like what CS calls a semaphore, so why
>don't they just call it a semaphore, and what's this semaphore_t...?)

If your CS courses didn't tell you the difference between a semaphore
and a spinlock, your CS courses were bad (or just didn't cover much
about concurrency, which is fairly common). Blame your professors,
don't blame the Linux kernel code.

>A spinlock is a semaphore used for very short critical sections; while
>waiting for a spinlock to be released, the kernel sits on the CPU and
>does nothing.

A spinlock is a mutual exclusion mechanism, not a semaphore (a
semaphore is a very specific _kind_ of mutual exclusion). But yes,
you're right in that it's (by design) a busy-waiting one. That's why
they are called "spinlocks" - they "spin" waiting for the lock to go
away.

> A `semaphore_t' is a longer-term semaphore which has the
>advantage that the kernel puts the needy process to sleep and goes and
>does something else rather than busy-waiting.

A semaphore also has the ability to have more than one process enter
the critical region. Basically, semaphores were (as far as I know)
first proposed by Dijkstra, and they explicitly imply a
"sleep"/"wakeup" behaviour, ie they are _not_ spinlocks. They
originally had operations called "P()" and "V()", but nobody ever
remembers whether P() was down() or up(), so nobody uses those names
any more. Dijkstra was probably a bit heavy on drugs or something (I
think the official explanation is that P and V are the first letters in
some Dutch words, but I personally find the drug overdose story much
more believable).
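The busy-waiting behaviour described above is easy to sketch in
user-space C11 atomics (a hypothetical illustration, not the kernel's
implementation): contenders loop on an atomic test-and-set instead of
sleeping the way a semaphore does.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal user-space sketch of what makes a spinlock a spinlock:
 * contenders busy-wait ("spin") on the CPU instead of sleeping. */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
	/* "Spin" until the holder releases the lock. */
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		;
}

static bool spin_trylock(spinlock_t *l)
{
	return !atomic_flag_test_and_set_explicit(&l->locked,
						  memory_order_acquire);
}

static void spin_unlock(spinlock_t *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

A real kernel spinlock adds things like preemption and interrupt
control on top of this, but the "spin until the flag clears" core is
the defining property.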
Also, unlike basic spinlocks, a semaphore has a "count" value: each
process that does a down() operation decrements the count if positive,
until the count would go negative. Only then do they sleep. The
original intent of this was to allow multiple entries to the region
protected by the semaphore: by initializing the count to 4, you allow
four down() operations and only the fifth one will block.

However, almost all practical use of semaphores is a special case where
the counter is initialized to 1, and where they are used as simple
mutual exclusion with only one user allowed in the critical region.
Such a semaphore is often called a "mutex" semaphore for MUTual
EXclusion.

I've never really seen anybody use the more complex case of semaphores,
although I do know of cases where it can be useful. For example, one
use of a more complex semaphore is as a "throttle", where you do
something like this:

	/* Maximum concurrent users */
	#define MAX_CONCURRENT_USERS 20

	struct semaphore sem;

	init_sema(&sem, MAX_CONCURRENT_USERS);

and then each user does a down() on the semaphore before starting an
operation. It won't block until you have 20 users - you've not created
a mutual exclusion, but you HAVE created a throttling mechanism.

See? Potentially useful, as I said, but the common case (and the only
case currently in use in the Linux kernel - even though the
implementation definitely can handle the general case) is certainly the
mutex one.

>However, you need to grab a spinlock for the purpose of grabbing a
>semaphore_t, so for short critical sections a spinlock (even with the
>potential of busy-waiting) is more efficient.

Only in bad implementations or on bad hardware. All reasonably modern
CPU's can do a semaphore without having grabbed a spinlock. Often a
spinlock is needed for the _contention_ case, but if done right that is
very rare.

The contention case for a semaphore is usually the very expensive one,
because that's when you have to re-schedule etc.
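(init_sema() above is pseudo-code.) The counting behaviour being
described - down() decrements while the count is positive, and only the
21st caller would block - can be sketched in user space. In this
hypothetical sketch, try_down() is a non-blocking stand-in for down()
that returns false where a real semaphore would sleep:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of a counting semaphore used as a throttle (illustration
 * only): initialized to N, so N users get in and the (N+1)th blocks. */
#define MAX_CONCURRENT_USERS 20

static atomic_int available = MAX_CONCURRENT_USERS;

static bool try_down(void)
{
	int c = atomic_load(&available);
	while (c > 0) {
		/* On failure the CAS reloads c and we retry. */
		if (atomic_compare_exchange_weak(&available, &c, c - 1))
			return true;	/* admitted */
	}
	return false;			/* a real semaphore would sleep here */
}

static void up(void)
{
	atomic_fetch_add(&available, 1);
}
```

With the count initialized to 1 instead, the same code degenerates into
the "mutex" special case described above.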
So basically spinlocks are much simpler, and faster under short-lived
contention, so that's why they tend to be used. Also, semaphores cannot
be used by interrupt handlers, as Linux doesn't allow interrupt
handlers to sleep, so anything that protects data accessed from
interrupt handlers needs to be a spinlock.

In addition to semaphores, there are other mutual exclusion notions,
the most popular being a "read-write" lock - something that requires
exclusive access for writers, but allows any number of readers at a
time. Linux has the spinning version of this, but not the blocking one.
We'll probably add a blocking version some day, as it's often very
useful, but it hasn't been a major issue yet. For example, the per-VM
memory management semaphore could very usefully be a blocking
read-write lock, but without heavy thread contention a mutex semaphore
is basically equivalent.

> Since the usual practice
>is to make your critical sections as short as possible, the result is
>that the kernel uses a lot more spinlocks than semaphore_t's.

True. Semaphores are really only useful around anything that does IO,
for example. When the potential contention period is multiple
milliseconds as opposed to nano- or microseconds, blocking operations
(ie operations that cause a re-schedule on contention) are the right
way to go.

Note that some people believe in a mix-and-match approach, where you
have a spinlock that gets upgraded to a semaphore if it waits too long.
Personally, I think that only makes sense if (a) you're in user space
and don't know what the scheduling rules are or (b) your locking is so
badly designed that you have tons of short-lived contention on your
semaphores, so you want to try the light approach first.

		Linus

Date: Sun, 8 Apr 2001 20:08:13 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: rw_semaphores
Newsgroups: fa.linux.kernel

On Sun, 8 Apr 2001, Andrew Morton wrote:
>
> One issue with the current implementation is the (necessary)
> design where one reader (or writer) is sleeping in
> down_x_failed_biased(), with its bias in place, whereas
> all other sleepers are sleeping in down_x_failed(), with
> their bias not in place.

Why not do this the same way the _real_ semaphores do it? You have a
fast path that has a single integer, and a slow path which has more
state and is protected by a spinlock.

The _only_ worry I see is to make sure that we properly have a
"contention" state, because both readers and writers need to be able to
know when they should wake things up. But there's a _trivial_
contention state. Read on:

Forget about the BIAS stuff, and go back to the old simple "negative is
writer, positive is reader" implementation:

	0		- unlocked
	1..n		- 1-n readers
	0xc0000000	- writer
	other <0	- contention

Do you see anything wrong with this?

Implementation:

 - fast path:

	down_read:
		lock incl (%sem)
		js __down_read_failed

	down_write:
		xorl %eax,%eax
		movl $0xc0000000,%r
		lock cmpxchgl %r,(%sem)
		jne __down_write_failed

	up_read:
		lock decl (%sem)
		js __up_read_wakeup

	up_write:
		lock andl $0x3fffffff,(%sem)
		jne __up_write_wakeup

The above are all fairly obvious for the non-failure case, agreed?
Including the guarantee that _if_ there is contention, the "up()"
routines will always go to the slow path to handle the contention.

Now, the _slow_ case could be your generic "protected by a spinlock"
code, although I have a suggestion. As follows:

The only half-way "subtle" case is that __down_write_failed needs to
make sure that it marks itself as a contender (the down_read()
fast-case code will have done so already by virtue of incrementing the
counter, which is guaranteed to have resulted in a "contention" value).
While down_read() automatically gets a "contention value" on failure,
"down_write()" needs to do extra work. The extra work is not all that
expensive: the simplest thing is to just do

	subl $0x8000,(%sem)

at the beginning - which will cause a "contention" value regardless of
whether it was pure-reader before (it goes negative by the simple
assumption that there will never be more than 32k concurrent readers),
or whether it was locked by a writer (it stays negative due to the
0x40000000 bit in the write lock logic). In both cases, both an
up_write() and an up_read() will notice that they need to handle the
contention of a waiting writer-to-be.

Would you agree that the above fast-path implementation guarantees that
we always get to the spinlock-protected slow path?

Now, the above _heavily_ favours readers. There's no question that the
above is unfair. I think that's ok. I think fairness and efficiency
tend to be at odds. But queuing theory shows that the faster you can
get out of the critical region, the smaller your queue will be -
exponentially. So speed is worth it.

Let's take an example at this point:

 - lock is zero
 - writer(1) gets lock: lock is 0xc0000000
 - reader(2) tries to get in: lock becomes 0xc0000001, synchronizes on
   spinlock.
 - another writer(3) tries to get in: lock becomes 0xbfff8001,
   synchronizes on spinlock.
 - writer(1) does up_write: lock becomes 0x3fff8001, != 0, so writer
   decides it needs to wake up, and synchronizes on spinlock.
 - another reader(4) comes in on another CPU, increments, and notices
   that it can do so without it being negative: lock becomes 0x3fff8002
   and this one does NOT synchronize on the spinlock.

End result: reader(4) "stole base" and actually got the lock without
ever seeing any contention, and we now have (1), (2) and (3) who are
serialized inside the spinlock.

So we get to what the serializers have to do, ie the slow path:

First, let's do the __up_read/write_wakeup() case, because that one is
the easy one.
In fact, I think it ends up being the same function:

	spin_lock(&sem->lock);
	wake_up(&sem->waiters);
	spin_unlock(&sem->lock);

and we're all done. The only optimization here (which we should do for
regular semaphores too) is to use the same spinlock for the semaphore
lock and the wait-queue lock.

The above is fairly obviously correct, and sufficient: we have shown
that we'll get here if there is contention, and the only other thing
that the wakeup could sanely do would possibly be to select which
process to wake. Let's not do that yet.

The harder case is __down_read/write_failed(). Here is my suggested
pseudo-code:

	__down_write_failed(sem)
	{
		DECLARE_WAIT_QUEUE(wait, current);

		lock subl $0x8000,(%sem)	/* Contention marker */
		spin_lock(&sem->lock);
		add_wait_queue_exclusive(&sem->wait, &wait);
		for (;;) {
			unsigned int value, newvalue;

			set_task_state(TASK_SLEEPING);
			value = sem->value;

			/*
			 * Ignore other pending writers: but if there
			 * are pending readers or a write holder we
			 * should sleep
			 */
			if (value & 0xc0007fff) {
				spin_unlock(&sem->lock);
				schedule();
				spin_lock(&sem->lock);
				continue;
			}

			/*
			 * This also undoes our contention marker thing,
			 * while leaving other waiters' contention
			 * markers in place
			 */
			newvalue = (value + 0x8000) | 0xc0000000;
			if (lock_cmpxchg(sem->value, value, newvalue))
				break;	/* GOT IT! */

			/* Damn, somebody else changed it from under us */
			continue;
		}
		remove_wait_queue(&sem->wait, &wait);
		spin_unlock(&sem->lock);
	}

The down_read() slow case is equivalent, but ends up being much simpler
(because we don't need to play with contention markers or ignore other
people's contention markers):

	__down_read_failed(sem)
	{
		DECLARE_WAIT_QUEUE(wait, current);

		spin_lock(&sem->lock);
		add_wait_queue(&sem->wait, &wait);
		for (;;) {
			set_task_state(TASK_SLEEPING);

			/*
			 * Yah! We already did our "inc", so if we ever
			 * see a positive value we're all done.
			 */
			if (sem->value > 0)
				break;
			spin_unlock(&sem->lock);
			schedule();
			spin_lock(&sem->lock);
		}
		remove_wait_queue(&sem->wait, &wait);
		spin_unlock(&sem->lock);
	}

Can anybody shoot any holes in this? I haven't actually tested it, but
race conditions in locking primitives are slippery things, and I'd much
rather have an algorithm we can _think_ about and prove to be working.
And I think the above one is provably correct. Not that I want to go to
that kind of extremes.

Anybody? Andrew? Mind giving the above a whirl on your testbed? Or can
you see some thinko in it without even testing?

Note in particular how the above keeps the non-contention "down_read()"
/ "up_read()" cases as two single instructions, no slower than a
spinlock.

(There are the following assumptions in the code: there are at most 32k
active readers, and also at most 32k pending writers. The limits come
from the writer contention marker logic. You have the 32 bits split up
as:

 - 2 bits "write lock owner" (it's really only one bit, but the other
   bit is set so that the writer contention marker won't turn the most
   negative number into a positive one, so we have one "extra" bit set
   to keep the thing negative for the whole duration of a write lock)
 - 15 "reader count" bits
 - 15 "pending writer count" bits

and the 15 bits is why you have the 32k user limitation. I think it's
an acceptable one - and if not you can expand on it by adding extra
fields that are only accessed within the spinlock).

		Linus
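[An aside: the counter arithmetic of the fast paths proposed above -
including the reader(4) "stole base" walkthrough - can be replayed with
C11 atomics. The sketch below is purely illustrative (hypothetical
names, no actual sleeping or waking); each helper returns true when the
fast path succeeds and false when the caller must fall into the
spinlock-protected slow path.]

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative C rendition of the i386 fast paths, not the real asm. */
typedef struct { atomic_int_least32_t count; } rwsem_t;

static bool down_read_fast(rwsem_t *s)
{
	/* lock incl (%sem); js __down_read_failed */
	return (int32_t)(atomic_fetch_add(&s->count, 1) + 1) >= 0;
}

static bool down_write_fast(rwsem_t *s)
{
	/* xorl %eax,%eax; lock cmpxchgl: 0 -> 0xc0000000 */
	int_least32_t expected = 0;
	return atomic_compare_exchange_strong(&s->count, &expected,
					      (int_least32_t)0xc0000000);
}

static void mark_write_contention(rwsem_t *s)
{
	/* the writer-to-be's "subl $0x8000,(%sem)" contention marker */
	atomic_fetch_sub(&s->count, 0x8000);
}

static bool up_read_fast(rwsem_t *s)
{
	/* lock decl (%sem); js __up_read_wakeup */
	return (int32_t)(atomic_fetch_sub(&s->count, 1) - 1) >= 0;
}

static bool up_write_fast(rwsem_t *s)
{
	/* lock andl $0x3fffffff,(%sem); jne __up_write_wakeup */
	return ((uint32_t)atomic_fetch_and(&s->count, 0x3fffffff)
		& 0x3fffffff) == 0;
}
```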

Date: Sun, 8 Apr 2001 21:18:20 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: rw_semaphores
Newsgroups: fa.linux.kernel

The "down_writer_failed()" case was wrong:

On Sun, 8 Apr 2001, Linus Torvalds wrote:
>
> __down_write_failed(sem)
> {
>	....
>	/*
>	 * Ignore other pending writers: but if there
>	 * are pending readers or a write holder we should
>	 * sleep
>	 */
>	if (value & 0xc0007fff) {
	....

The "value & 0xc0007fff" test is wrong, because it's actually always
true for some normal contention cases (multiple pending writers waiting
for a reader to release the lock). Because even if there is no write
lock holder, other pending writers (and we're one) will have caused the
high bits to be set, so the above would end up causing us to think that
we can't get the lock even if there is no real lock holder.

The comment is right, though. It's just the test that is simplistic and
wrong.

The pending readers part is correct, and obvious enough:

	(value & 0x7fff) != 0

implies that there are readers. In which case we should try again.
Simple enough.

The pending writers part is slightly more subtle: we know that there is
at least _one_ pending writer, namely us. It turns out that we must
check the two high bits, and the logic is:

 - 11: only pending writers, no write lock holder (if we had a write
   lock holder, he would have set the bits to 11, but a pending writer
   would have borrowed from the lower bit, so you'd get bit pattern 10).

 - 10: we have a real lock holder, and the pending writers borrowed
   from the low lock bit when they did the "subl 0x8000" to mark off
   contention.

 - 01: must not happen. BUG.

 - 00: we had a real write lock holder, but he released the lock and
   cleared both bits.
So the "is there a write lock holder" test basically becomes

	(value & 0xc0000000) == 0x80000000

and the full test should be

	if ((value & 0x7fff) || ((value & 0xc0000000) == 0x80000000)) {
		spin_unlock();
		schedule();
		spin_lock();
		continue;
	}

which might be rewritten some simpler way. I'm too lazy to think about
it even more.

For similar reasons, the "newvalue" calculation was subtly bogus: we
must make sure that we maintain the correct logic for the two upper
bits in the presence of _other_ pending writers. We can't just do the
unconditional binary "or" operation to set the two upper bits, because
then we'd break the above rules if there are other pending writers. So
the newvalue calculation should be something along the lines of

	/* Undo _our_ contention marker */
	newvalue = value + 0x8000;

	/* Get rid of stale writer lock bits */
	newvalue &= 0x3fffffff;

	/*
	 * If there were no other pending writers (newvalue == 0), we set
	 * both high bits, otherwise we only set bit 31.
	 * (see above on the "borrow bit being clear" logic).
	 */
	if (!newvalue)
		newvalue = 0xc0000000;
	newvalue |= 0x80000000;

And THEN I think the algorithm in the email I'm following up to should
actually really work.

Does anybody find any other details I missed? And no, I have not yet
actually _tested_ any of this. But all my code works on the first try
(or, in this case, second try if you want to be a stickler for details).

No wonder we didn't get this right the first time through. It's not
really all that horribly complicated, but the _details_ kill you.

		Linus
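[An aside: the corrected tests above can be checked mechanically. The
following helpers are hypothetical, written only to encode the logic
from this mail; "value" is what __down_write_failed reads with our own
"subl $0x8000" contention marker already in place.]

```c
#include <stdint.h>
#include <stdbool.h>

/* The corrected "should we sleep" test (illustration only). */
static bool must_sleep(uint32_t value)
{
	return (value & 0x7fff) != 0 ||			/* pending readers */
	       (value & 0xc0000000) == 0x80000000;	/* live write holder */
}

/* The corrected "newvalue" computation for taking the write lock. */
static uint32_t take_write_lock(uint32_t value)
{
	uint32_t newvalue;

	newvalue = value + 0x8000;	/* undo _our_ contention marker */
	newvalue &= 0x3fffffff;		/* clear stale writer lock bits */
	if (!newvalue)			/* no other pending writers: bits 11 */
		newvalue = 0xc0000000;
	newvalue |= 0x80000000;		/* otherwise only bit 31: bits 10 */
	return newvalue;
}
```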

Date: Mon, 9 Apr 2001 22:43:53 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: rw_semaphores
Newsgroups: fa.linux.kernel

On Tue, 10 Apr 2001, Tachino Nobuhiro wrote:
>
> I am not familiar with semaphore or x86, so this may not be correct,
> but if the following sequence is possible, the writer can call wake_up()
> before the reader calls add_wait_queue() and reader may sleep forever.
> Is it possible?

The ordering is certainly possible, but if it happens,
__down_read_failed() won't actually sleep, because it will notice that
the value is positive and just return immediately. So it will do some
unnecessary work (add itself to the wait-queue only to remove itself
immediately again), but it will do the right thing.

		Linus

Date: Tue, 10 Apr 2001 12:42:07 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] i386 rw_semaphores fix
Newsgroups: fa.linux.kernel

On Tue, 10 Apr 2001, David Howells wrote:
>
> Here's a patch that fixes RW semaphores on the i386 architecture. It is
> very simple in the way it works.

XADD only works on Pentium+.

That's no problem if we make this SMP-specific - I doubt anybody
actually uses SMP on i486's even if the machines exist, as I think they
all had special glue logic that Linux would have trouble with anyway.

But the advantages of being able to use one generic kernel that works
on plain UP i386 machines as well as SMP P6+ machines is big enough
that I would want to be able to say "CONFIG_X86_GENERIC" +
"CONFIG_SMP". Even if it would be noticeably slower (ie a fallback to a
spinlock might be perfectly ok).

If you do this, I would suggest having asm-i386/{rwsem.h|rwsem-xadd.h},
and just having a

	#ifndef CONFIG_XADD
	#include <asm/rwsem.h>
	#else
	#include <asm/rwsem-xadd.h>
	#endif

(And adding "CONFIG_XADD" to the list of generated optimization
configuration options in arch/i386/config.in, of course).

That way we don't make the semaphore.h file even more unreadable.

		Linus

Date: Tue, 10 Apr 2001 13:16:10 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] i386 rw_semaphores fix
Newsgroups: fa.linux.kernel

On Tue, 10 Apr 2001, Andi Kleen wrote:
>
> I guess 386 could live with an exception handler that emulates it.

That approach is fine, although I'd personally prefer to take the
exception just once and just rewrite the instruction as a "call". The
places that need xadd would have to follow some strict guidelines (long
modrms or other instructions to pad out to enough size, and have the
arguments in fixed registers).

> (BTW a generic exception handler for CMPXCHG would also be very useful
> for glibc -- currently it has special checking code for 386 in its
> mutexes.)
> The 386's are so slow that nobody would probably notice a bit more
> slowness from a few exceptions.

Ehh. I find that the slower the machine is, the more easily I _notice_
that it is slow. So..

		Linus

Date: Tue, 10 Apr 2001 17:55:09 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] i386 rw_semaphores fix
Newsgroups: fa.linux.kernel

On Wed, 11 Apr 2001, David Weinehall wrote:
> >
> > Yes, and with CMPXCHG handler in the kernel it wouldn't be needed
> > (the other 686 optimizations like memcpy also work on 386)
>
> But the code would be much slower, and it's on 386's and similarly
> slow beasts you need every cycle you can get, NOT on a Pentium IV.

Note that the "fixup" approach is not necessarily very painful at all,
from a performance standpoint (either on 386 or on newer CPU's). It's
not really that hard to just replace the instruction in the "undefined
instruction" handler by having strict rules about how to use the "xadd"
instruction.

For example, you would not actually fix up the xadd to be a function
call to something that emulates "xadd" itself on a 386. You would fix
up the whole sequence of "inline down_write()" with a simple call to an
out-of-line "i386_down_write()" function.

Note that down_write() on an old 386 is likely to be complicated enough
that you want to do it out-of-line anyway, so the code-path you take
(after the first time you've trapped on that particular location) would
be the one you would take for an optimized 386 kernel anyway.

And similarly, the restrictions you place on non-386 code to make it
fixable are simple enough that it probably shouldn't make a difference
for performance on modern chips.

		Linus

Date: Wed, 11 Apr 2001 11:41:06 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] i386 rw_semaphores fix
Newsgroups: fa.linux.kernel

On Wed, 11 Apr 2001, David Howells wrote:
>
> > These numbers are infinity :)
>
> I know, but I think Linus may be happy with the resolution for the
> moment. It can be extended later by siphoning off excess quantities of
> waiters into a separate counter (as is done now) and by making the
> access count use a larger part of the variable.

I'm certainly ok with the count being limited to "thousands". I don't
see people being able to exploit it in any practical way.

But we should remember to be careful: starting thousands of threads and
trying to make them all take page faults and overflow the read counter
would be a _nasty_ attack. It would probably not be particularly easy
to arrange, but still.

Note that blocking locks are different from spinlocks: for spinlocks we
can get by with just 7 bits in a byte, and be guaranteed that that is
enough for any machine with less than 128 processors. For the blocking
locks, that is not true.

(Right now the default "max user processes" ulimit already protects us
from this exploit, I just wanted to make sure that people _think_ about
this).

So a 16-bit count is _fine_. And I could live with less. We should
remember the issue, though. If we ever decide to expand it, it would be
easy enough to make an alternative "rwsem-reallybig.h" that uses
cmpxchg8b instead, or whatever.

You don't need to start splitting the counts up to expand them past 16
bits, you could have a simple system where the _read_ locks only look
at one (fast) 32-bit word for their fast case, and only the write lock
actually needs to use cmpxchg8b.

(I think it's reasonable to optimize the read-lock more than the
write-lock: in the cases where you want to do rw-locking, the common
reason is because you really _want_ to allow many concurrent readers.
That also implies that the read case is the common one).
So you could fairly easily expand past 16-bit counters by using a
31-bit counter for the readers, and making the high bit in the reader
count be the "contention" bit. Then the slow path (and the write path)
would use the 64-bit operations offered by cmpxchg8b.

And yes, it's a Pentium+ instruction (and this time I -checked- ;), but
by the time you worry about hundreds of thousands of threads I think
you can safely just say "you'd better be running on a big, modern
machine", and just make the code conditional on CONFIG_PENTIUM+.

So no real trickiness is needed for expanding the number space, but
certainly also no real _reason_ for it at this time.

		Linus
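[An aside: the layout described here can be sketched with C11 atomics
(purely illustrative, hypothetical names - not kernel code). The read
fast path stays a single 32-bit atomic add, and it only falls to the
slow path when the high "contention" bit is set; writers and contended
readers would then go through the 64-bit cmpxchg8b slow path.]

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* 31-bit reader count in one 32-bit word; bit 31 = "contention". */
static atomic_uint_least32_t rcount;

static bool read_lock_fast(void)
{
	/* one atomic add; slow path only when a writer marked contention */
	uint32_t old = atomic_fetch_add(&rcount, 1);
	return (old & 0x80000000u) == 0;
}

static void mark_contention(void)
{
	/* what a writer would do before its cmpxchg8b slow path */
	atomic_fetch_or(&rcount, 0x80000000u);
}
```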

Date: Mon, 16 Apr 2001 10:05:57 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: rw_semaphores
Newsgroups: fa.linux.kernel

On Mon, 16 Apr 2001 yodaiken@fsmlabs.com wrote:
>
> I'm trying to imagine a case where 32,000 sharing a semaphore was
> anything but a major failure and I can't. To me: the result of an
> attempt by the 32,768th locker should be a kernel panic. Is there a
> reasonable scenario where this is wrong?

Hint: "I'm trying to imagine a case when writing all zeroes to
/etc/passwd is anything but a major failure, but I can't. So why don't
we make /etc/passwd world-writable?"

Right. Security.

There is _never_ any excuse for panic'ing because of some inherent
limitation of the data structures. You can return -ENOMEM, -EAGAIN or
something like that, but you must _not_ allow a panic (or a roll-over,
which would just result in corrupted kernel data structures).

Note that the limit is probably really easy to work around even without
extending the number of bits: a sleeper that notices that the count is
even _halfway_ to rolling around could easily do something like:

 - undo "this process" action
 - sleep for 1 second
 - try again from the beginning.

I certainly agree that no _reasonable_ pattern can cause the failure,
but we need to worry about people who are malicious. The above trivial
approach would take care of that, while not penalizing any
non-malicious users.

So I'm not worried about this at all. I just want people _always_ to
think about "how could I mis-use this if I was _truly_ evil", and
making sure it doesn't cause problems for others on the system.

(NOTE: This does not mean that the kernel has to do anything
_reasonable_ under all circumstances. There are cases where Linux has
decided that "this is not something a reasonable program can do, and if
you try to do it, we'll give you random results back - but they will
not be _security_ holes". We don't need to be _nice_ to unreasonable
requests.
We just must never panic or otherwise crash, or allow unreasonable
requests to mess up _other_ people.)

		Linus
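[An aside: the halfway-rollover defence described above is trivial to
express. The predicate below is a hypothetical illustration against the
15-bit reader count; the surrounding undo/sleep-1-second/retry loop is
the part that would need the real semaphore plumbing.]

```c
#include <stdbool.h>

#define READER_LIMIT 0x7fff	/* 15-bit reader count */

/* A sleeper that sees the count even _halfway_ to rolling around
 * should undo its own action, sleep, and retry from the beginning. */
static bool should_back_off(int reader_count)
{
	return reader_count >= READER_LIMIT / 2;
}
```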

Date: Fri, 20 Apr 2001 10:46:01 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [andrea@suse.de: Re: generic rwsem [Re: Alpha "process table
Newsgroups: fa.linux.kernel

On Fri, 20 Apr 2001, David Howells wrote:
>
> The file should only be used for the 80386 and maybe early 80486's where
> CMPXCHG doesn't work properly, everything above that can use the XADD
> implementation.

Why are those not using the generic files? The generic code is
obviously more maintainable.

> But if you want it totally non-inline, then that can be done. However,
> whilst developing it, I did notice that that slowed things down, hence
> why I wanted it kept in line.

I want to keep the _fast_ case in-line. I do not care at ALL about the
stupid spinlock version. That should be the _fallback_, and it should
be out-of-line. It is always going to be the slowest implementation,
modulo bugs in architecture-specific code.

For i386 and i486, there is no reason to try to maintain a complex fast
case. The machines are unquestionably going away - we should strive to
not burden them unnecessarily, but we should _not_ try to save two
cycles.

In short:

 - the only case that _really_ matters for performance is the
   uncontended read-lock for "reasonable" machines. An i386 no longer
   counts as reasonable, and designing for it would be silly. And the
   write-lock case is much less compelling.

 - We should avoid any inlines where the inline code is >2* the
   out-of-line code. Icache issues can overcome any cycle gains, and do
   not show up well in benchmarks (benchmarks tend to have very hot
   icaches). Note that this is less important for the out-of-line code
   in another segment that doesn't get brought into the icache at all
   for the non-contention case, but that should still be taken
   _somewhat_ into account, if only because of kernel size issues.

Both of the above rules imply that the generic spin-lock implementation
should be out-of-line.
> (1) asm-i386/rwsem-spin.h is wrong, and can probably be replaced with
>     the generic spinlock implementation without inconveniencing people
>     much. (though someone has commented that they'd want this to be
>     inline as cycles are precious on the slow 80386).

Icache is also precious on the 386, which has no L2 in 99% of all
cases. Make it out-of-line.

> (2) "fix up linux/rwsem-spinlock.h": do you want the whole generic
>     spinlock implementation made non-inline then?

Yes. People who care about performance _will_ have
architecture-specific inlines on architectures where they make sense
(ie 99% of them).

		Linus

Date: Fri, 20 Apr 2001 16:45:32 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: x86 rwsem in 2.4.4pre[234] are still buggy [was Re: rwsem
Newsgroups: fa.linux.kernel

On Fri, 20 Apr 2001, Andrea Arcangeli wrote:
>
> While dropping the list_empty check to speed up the fast path I faced
> the same complexity of the 2.4.4pre4 lib/rwsem.c and so before
> reinventing the wheel I read how the problem was solved in 2.4.4pre4.

I would suggest the following:

 - the generic semaphores should use the lock that already exists in
   the wait-queue as the semaphore spinlock.

 - the generic semaphores should _not_ drop the lock. Right now it
   drops the semaphore lock when it goes into the slow path, only to
   re-acquire it. This is due to bad interfacing with the generic
   slow-path routines. I suspect that this lock-drop is why Andrea sees
   problems with the generic semaphores. The changes to "count" and
   "sleeper" aren't actually atomic, because we don't hold the lock
   over them all.

   And re-using the lock means that we don't need the two levels of
   spinlocking for adding ourselves to the wait queue. Easily done by
   just moving the locking _out_ of the wait-queue helper functions,
   no?

 - the generic semaphores are entirely out-of-line, and are just
   declared universally as regular FASTCALL() functions.

The fast-path x86 code looks ok to me. The debugging stuff makes it
less readable than it should be, I suspect, and is probably not worth
it at this stage. The users of rw-semaphores are so well-defined (and
so well debugged) that the debugging code only makes the code harder to
follow right now.

Comments? Andrea? Your patches have looked ok, but I absolutely refuse
to see the non-inlined fast-path for reasonable x86 hardware..

		Linus

Date: Sat, 21 Apr 2001 10:18:06 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: x86 rwsem in 2.4.4pre[234] are still buggy [was Re: rwsem
Newsgroups: fa.linux.kernel

On Sat, 21 Apr 2001, Russell King wrote:
>
> Erm, spin_lock()? What if up_read or up_write gets called from interrupt
> context (is this allowed)?

Currently that is not allowed. We allow it for regular semaphores, but
not for rw-semaphores.

We may some day have to revisit that issue, but I suspect we won't have
much reason to.

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/19] MUTEX: Introduce simple mutex implementation
Date: Fri, 16 Dec 2005 21:43:03 UTC
Message-ID: <fa.fv9t5bi.g0g5qg@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0512161339140.3698@g5.osdl.org>

On Fri, 16 Dec 2005, Thomas Gleixner wrote:
> On Thu, 2005-12-15 at 21:32 +0100, Geert Uytterhoeven wrote:
> > > Why have the "MUTEX" part in there? Shouldn't that just be DECLARE_SEM
> > > (oops, I mean DEFINE_SEM). Especially that MUTEX_LOCKED! What is that?
> > > How does a MUTEX start off as locked. It can't, since a mutex must
> > > always have an owner (which, by the way, helped us in the -rt patch to
> > > find our "compat_semaphores"). So who's the owner of a
> > > DEFINE_SEM_MUTEX_LOCKED?
> >
> > No one. It's not really a mutex, but a completion.
>
> Well, then let us use a completion and not some semantically wrong
> workaround

It is _not_ wrong to have a semaphore start out in locked state.

For example, it makes perfect sense if the data structures that the
semaphore protects need initialization. The way you _should_ handle
that is to make the semaphore come up as locked, and the data
structures in some "don't matter" state, and then the thing that
initializes stuff can do so properly and then release the semaphore.

Yes, in some cases such a locked semaphore is only used once, and ends
up being a "completion", but that doesn't invalidate the fact that this
is a perfectly fine way to handle a real issue.

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/19] MUTEX: Introduce simple mutex implementation
Date: Fri, 16 Dec 2005 22:21:08 UTC
Message-ID: <fa.g0a54rd.h0o5at@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0512161414370.3698@g5.osdl.org>

On Fri, 16 Dec 2005, Thomas Gleixner wrote:
>
> Well, in case of a semaphore it is a semantically correct use case. In
> case of a mutex it is not.

I disagree.

Think of "initialization" as a user. The system starts out initializing
stuff, and as such the mutex should start out being held. It's that
simple. It _is_ mutual exclusion, with one user being the early bootup
state.

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/12]: MUTEX: Implement mutexes
Date: Sun, 18 Dec 2005 18:43:42 UTC
Message-ID: <fa.g09n3bh.i067av@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0512181027220.4827@g5.osdl.org>

On Sun, 18 Dec 2005, Russell King wrote:
> On Sat, Dec 17, 2005 at 10:30:41PM -0800, Linus Torvalds wrote:
> > An interrupt can never change the value without changing it back, except
> > for the old-fashioned use of "up()" as a completion (which I don't think
> > we do any more - we used to do it for IO completion a looong time ago).
>
> I doubt you can guarantee that statement, or has the kernel source
> been audited for this recently?

Well, _if_ it's a noticeable performance win, we should just do it. We
already know that people don't call "down()" in interrupts (it just
wouldn't work), we can instrument "up()" too. It's easy enough to add a
"might_sleep()" to the up(). Not strictly true, but conceptually it would
make sense to make up/down match in that sense.

We'd have to mark the (few) places that do down_trylock() + up() in
interrupt context with a special "up_in_interrupt()", but that would be
ok even from a documentation standpoint.

> However, the real consideration is stability - if a semaphore was
> used for a completion and it was merged, would it be found and
> fixed? Probably not, because it won't cause any problems on
> architectures where semaphores have atomic properties.

Actually, the reason we have completions is that using semaphores as
completions caused some really subtle problems that had nothing to do
with atomicity of the operations themselves, so if you find somebody who
uses a semaphore from an interrupt, I think we want to know about it.

Completions actually have another - and more important - property than
the fact that they have a more logical name for a particular usage.

The completion has "don't touch me" guarantees. A thread or interrupt
that does an "up()" on a semaphore may still touch the memory that was
allocated for the semaphore after the "down()" part has been released.

And THAT was the reason for the completions: we allocate them on the
stack or in temporary allocations, and the thing that does the "down()"
to wait for something to finish will also do the _release_ of the memory.
With semaphores, that caused problems, because the side doing the "up()"
would thus possibly touch memory that got released from under it.

This problem happens only on SMP (since you need to have the downer and
the upper running at the same time), but it's totally independent of the
other atomicity issues.

And almost any semaphore that is used as a completion from an interrupt
will have this problem, so yes, if you find somebody doing an "up()" in
interrupt context, we'll fix it. It would be good to make the rules
clear, that you can never touch a semaphore from irq context without
changing it back before you return.

Of course, that still leaves the following sequence

	if (!down_trylock(..)) {
		... do something ..
		up(..);
	}

which is actually used from interrupts too. At least the console layer
does that (printk() can happen from interrupts, and we do a down_trylock
on the console semaphore). But that one shouldn't mess with the _count_,
although it does mean that the wait-queue preparation etc (for when the
fast case fails) does still need to be protected against interrupts.

But that would be the slow case, so from a performance standpoint, it
would still allow the case that really _matters_ to be done with
interrupts enabled.

> Unless of course sparse can be extended to detect the use of unbalanced
> semaphores in interrupt contexts.

In theory, yes, but in practice I'd much rather just do the stupid brute
force things.

> > (Of course, maybe it's not worth it. It might not be a big performance
> > issue).
>
> Balancing the elimination of 4 instructions per semaphore operation,
> totalling about 4 to 6 cycles, vs stability I'd go for stability
> unless we can prove the above assertion via (eg) sparse.

I agree, if arm interrupt disables are fast.

For example, on x86 (where this isn't needed, because you can have an
"interrupt-safe" decrement by just having it as a single instruction,
even if it isn't SMP-safe), disabling and re-enabling interrupts is just
one instruction each, but the combination is usually something like 50+
cycles.

So if this was an issue on x86, we'd definitely care. But if you don't
think it's a big issue on ARM, it just doesn't matter.

Linus
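The "don't touch me" race described above is easiest to see as a sketch (illustrative kernel-style pseudocode, not code from the tree; `submit_io()` and the `struct request` layout are invented for the example):

```c
/* Sketch: why a stack-allocated semaphore-as-completion is racy on SMP. */

struct request {
	struct semaphore done;	/* lives on the waiter's stack */
	/* ... */
};

void waiter(void)
{
	struct request req;

	sema_init(&req.done, 0);	/* starts locked */
	submit_io(&req);		/* interrupt handler will up() it */
	down(&req.done);		/* sleep until the IO completes */

	/* RACE: on SMP we can return from down() while the interrupt
	 * handler is still inside up(), touching req.done's internals -
	 * but 'req' vanishes as soon as this stack frame unwinds. */
}

/* complete()/wait_for_completion() are specified so that once the
 * waiter returns, the completing side is guaranteed to be entirely
 * finished with the completion's memory. */
```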

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/12]: MUTEX: Implement mutexes
Date: Sun, 18 Dec 2005 19:55:47 UTC
Message-ID: <fa.fva543d.j0g6ir@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0512181153080.4827@g5.osdl.org>

On Sun, 18 Dec 2005, James Bottomley wrote:
>
> Actually, I don't think you want might_sleep(): there are a few cases
> where we do an up() from under a spinlock, which will spuriously trigger
> this. I'd suggest WARN_ON(in_interrupt()) instead.

Ahh, good point. Yes.

However, if even the arm people aren't all that interested in this, maybe
it simply doesn't matter. A lot of other architectures either have
"decrement in memory" or can already use ll/sc for it.

(of course, on some architectures, ll/sc is really really slow, so they
might well prefer using a normal load and store instead).

Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/15] Generic Mutex Subsystem
Date: Mon, 19 Dec 2005 19:12:41 UTC
Message-ID: <fa.g19j43e.h026iq@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0512191053400.4827@g5.osdl.org>

On Mon, 19 Dec 2005, Ingo Molnar wrote:
>
> in fact, generic mutexes are _more_ fair than struct semaphore in their
> wait logic. In the stock semaphore implementation, when a waiter is
> woken up, it will retry the lock, and if it fails, it goes back to the
> _tail_ of the queue again - waiting one full cycle again.

Ingo, I don't think that is true. It shouldn't be true, at least. The
whole point with the "sleeper" count was to not have that happen.

Of course, bugs happen, so I won't guarantee that's actually true, but ..

If you are woken up as a waiter on a semaphore, you shouldn't fail to get
it. You will be woken first, and nobody else will get at it, because the
count has been kept negative or zero even by the waiters, so that a
fast-path user shouldn't be able to get the lock without going through
the slow path and adding itself (last) to the list.

But hey, somebody should test it with <n> kernel threads that just do
down()/up() and some make-believe work in between to make sure there
really is contention, and count how many times they actually get the
semaphore. That code has been changed so many times that it may not work
the way it is advertised ;)

[ Oh. I'm looking at the semaphore code, and I realize that we have a
  "wake_up(&sem->wait)" in the __down() path because we had some race
  long ago that we fixed by band-aiding over it. Which means that we wake
  up sleepers that shouldn't be woken up. THAT may well be part of the
  performance problem.. The semaphores are really meant to wake up just
  one at a time, but because of that race hack they'll wake up _two_ at a
  time - once by up(), once by down().

  That also destroys the fairness. Does anybody remember why it's that
  way? ]

Ho humm.. That's interesting.

Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/15] Generic Mutex Subsystem
Date: Mon, 19 Dec 2005 19:57:13 UTC
Message-ID: <fa.g1a13rj.h0c6qh@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0512191148460.4827@g5.osdl.org>

On Mon, 19 Dec 2005, Benjamin LaHaise wrote:
>
> The only thing I can see as an improvement that a mutex can offer over
> the current semaphore implementation is if we can perform the same
> optimization that spinlocks perform in the unlock operation: don't use
> a locked, serialising instruction in the up() codepath. That might be
> a bit tricky to implement, but it's definitely a win on the P4 where the
> cost of serialisation can be quite high.

Good point. However, it really _is_ hard, because we also need to know if
the mutex was under contention. A spinlock doesn't care, so we can just
overwrite the lock value. A mutex would always care, in order to know
whether it needs to do the slow wakeup path.

So I suspect you can't avoid serializing the unlock path for a mutex. The
issue of "was there contention while I held it" fundamentally _is_ a
serializing question.

> > [ Oh. I'm looking at the semaphore code, and I realize that we have a
> >   "wake_up(&sem->wait)" in the __down() path because we had some race
> >   long ago that we fixed by band-aiding over it. Which means that we
> >   wake up sleepers that shouldn't be woken up. THAT may well be part
> >   of the performance problem.. The semaphores are really meant to wake
> >   up just one at a time, but because of that race hack they'll wake up
> >   _two_ at a time - once by up(), once by down().
> >
> >   That also destroys the fairness. Does anybody remember why it's that
> >   way? ]
>
> History?

Oh, absolutely, I already checked the old BK history too, and that extra
wake_up() has been there at least since before we even started using BK.

So it's very much historical, I'm just wondering if somebody remembers
far enough back that we'd know.

I don't see why it's needed (since we re-try the "atomic_add_negative()"
inside the semaphore wait lock, and any up() that saw contention should
have always been guaranteed to do a wakeup that should fill the race in
between that atomic_add_negative() and the thing going to sleep).

It may be that it is _purely_ historical, and simply isn't needed. That
would be funny/sad, in the sense that we've had it there for years and
years ;)

Linus
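The asymmetry in the discussion above - a spinlock can unlock with a plain store, while a mutex must learn about contention atomically with the release - can be shown in miniature with C11 atomics (a sketch with invented names and a simplified count convention of 1 = unlocked, 0 = locked, negative = contended; real kernel locks are more involved):

```c
#include <stdatomic.h>

static int wakeups;	/* stands in for waking the wait-queue */

static void wake_up_waiters(void)
{
	wakeups++;	/* a real kernel would walk the wait-queue here */
}

/* Spinlock unlock: nobody needs to know whether anyone was spinning,
 * so a plain release store is enough - no serialising read-modify-write. */
static void spin_unlock_sketch(atomic_int *lock)
{
	atomic_store_explicit(lock, 0, memory_order_release);
}

/* Mutex unlock: it must learn, atomically with the release, whether
 * there was contention while it held the lock - and answering that
 * question requires a serialising RMW (here, an exchange). */
static void mutex_unlock_sketch(atomic_int *count)
{
	int old = atomic_exchange_explicit(count, 1, memory_order_release);
	if (old < 0)
		wake_up_waiters();	/* slow wakeup path */
}
```

The exchange both releases the lock and returns the old count in one atomic step; splitting it into a load plus a store would let a waiter queue itself in the window between them and never be woken.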

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/15] Generic Mutex Subsystem
Date: Mon, 19 Dec 2005 20:14:24 UTC
Message-ID: <fa.fvpr2re.jg27qq@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0512191203120.4827@g5.osdl.org>

On Mon, 19 Dec 2005, Ingo Molnar wrote:
>
> average cost per op: 206.59 usecs
> average cost per op: 512.13 usecs

(mutex vs semaphore). That looks suspiciously like exactly double the
cost, so I do believe that the double wake_up() might be exactly what is
going on.

However:

> hm, removing that wakeup quickly causes hung test-tasks.

So clearly it really is still hiding some bug.

> and even considering that the current semaphore implementation may have
> a fairness bug, i cannot imagine that making it more fair would also
> speed it up.

That's not the point. The extra wakeup() in the semaphore code wakes up
two processes for every single up(), so the semaphores end up not just
being unfair, they also end up doing twice the work (because it will
result in the other processes effectively just doing the down() twice).

> I personally find the semaphore implementation clever but too complex,
> maybe that's a reason why such bugs might be hiding there. (possibly
> for many years already ...)

Oh, absolutely. It is too complex.

And don't get me wrong: if it's easier to just ignore the performance
bug, and introduce a new "struct mutex" that just doesn't have it, I'm
all for it.

However, if so, I do NOT want to do the unnecessary renaming. "struct
semaphore" should stay as "struct semaphore", and we should not affect
old code in the _least_. Then code can switch to "struct mutex" if people
want to. And if one reason for it ends up being that the code avoids a
performance bug in the process, all the better ;)

IOW, I really think this should be a series of small patches that don't
touch old users of "struct semaphore" at all. None of this "semaphore" to
"arch_semaphore" stuff, and the new "struct mutex" would not re-use _any_
of the names that the old "struct semaphore" uses.

Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
Date: Tue, 20 Dec 2005 20:13:01 UTC
Message-ID: <fa.g09n334.i0672i@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0512201202200.4827@g5.osdl.org>

On Tue, 20 Dec 2005, Russell King wrote:
>
> Also, Nico has an alternative idea for mutexes which does not
> involve decrementing or incrementing - it's an atomic swap.
> That works out at about the same cycle count on non-Intel ARM
> CPUs as the present semaphore path. I'm willing to bet that
> it will be faster than the present semaphore path on Intel ARM
> CPUs.

Don't be so sure, especially not in the future.

An atomic "swap" operation is, from a CPU design standpoint, fundamentally
more expensive than a "load + store".

Now, most ARM architectures don't notice this, because they are all
in-order, and not SMP-aware anyway. No subtle memory ordering, no
nothing. Which is the only case when "swap" basically becomes a cheap
"load + store".

What I'm trying to say is that a plain "load + store" is almost always
going to be the best option in the long run. It's also almost certainly
always the best option for UP + non-preempt, for both present and future
CPU's. The reason is simply that a microarchitecture will _always_ be
optimized for that case, since it's pretty much by definition the common
situation.

Is preemption even the common case on ARM? I'd assume not. Why are people
so interested in the preemption case?

IOW, why don't you just do

	ldr	lr, [%0]
	subs	lr, lr, %1
	str	lr, [%0]
	blmi	failure

as the _base_ timings, since that should be the common case. That's the
drop-dead fastest UP case.

There's an additional advantage of the regular load/store case: if some
CPU has scheduling issues, you can actually write this out as C code
(with an extra empty asm to make sure that the compiler doesn't move
anything out of the critical region).

Again, that probably doesn't matter on most ARM chips, but in the general
case it sure does matter.

(Btw, inlining _any_ of these except perhaps the above trivial case, is
probably wrong. None of the ARM chips tend to have caches all that big, I
bet).

Linus
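The C version Linus alludes to - the load/store fastpath with an empty asm as a compiler barrier - might look like this (a hedged sketch under the same UP, non-preempt assumption; `fastpath_lock` and `mutex_lock_slowpath` are invented names, and the slowpath is stubbed out):

```c
/* The "empty asm": a compiler barrier so nothing is hoisted out of the
 * critical region. (GCC/Clang syntax.) */
#define barrier() __asm__ __volatile__("" ::: "memory")

static int slowpath_calls;	/* stub: a real kernel would sleep here */

static void mutex_lock_slowpath(int *count)
{
	(void)count;
	slowpath_calls++;
}

/* UP, non-preempt assumption: no other CPU or interrupt can race with
 * this load/store pair, so no atomic RMW is needed. */
static void fastpath_lock(int *count)
{
	barrier();
	int val = *count;	/* ldr  lr, [%0]   */
	val -= 1;		/* subs lr, lr, %1 */
	*count = val;		/* str  lr, [%0]   */
	barrier();
	if (val < 0)		/* blmi failure    */
		mutex_lock_slowpath(count);
}
```

Because it is plain C, the compiler schedules the three memory operations however suits the target microarchitecture, which is exactly the portability advantage over a hand-picked atomic instruction.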

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
Date: Tue, 20 Dec 2005 22:07:04 UTC
Message-ID: <fa.g09r4b6.i066ak@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0512201354210.4827@g5.osdl.org>

On Tue, 20 Dec 2005, Nicolas Pitre wrote:
>
> I mean...... what is it with mutexes that you dislike to the point of
> bending backward that far, and even after seeing the numbers, with such
> a semaphore implementation that _I_ even wouldn't trust people to use
> correctly?

Quite frankly, what has disgusted me about this mutex discussion is the
totally specious arguments for the new mutexes that just rub me entirely
the wrong way.

If it had _started_ with a mutex implementation that was faster, simpler,
and didn't rename the old and working semaphores, I'd have been perfectly
fine with it.

As it is, the discussion has been pretty much everything but that. And
then people who argue about single cycles end up dismissing the single
cycles when I argue that "ld+st" is faster - like you just did.

Be consistent, dammit. If single cycles matter, they matter. If they
don't, then the existing code is better, since it's existing and works.
You can't have it both ways.

In other words: if people didn't mix up issues that had nothing to do
with this into it, I'd be happier. I've already said that a mutex that
does _not_ replace semaphores (and doesn't mess with naming) is
acceptable. We've done that before. But do it RIGHT, dammit. And don't
mix existing semaphores into it (for example, completions didn't change
any old users).

Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2/8] mutex subsystem, add asm-generic/mutex-[dec|xchg].h
Date: Thu, 22 Dec 2005 23:58:36 UTC
Message-ID: <fa.hb476cr.k76389@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0512221550290.14098@g5.osdl.org>

On Fri, 23 Dec 2005, Ingo Molnar wrote:
>
> add the two generic mutex fastpath implementations.

Now this looks more like it. This is readable code without any #ifdef's
in the middle.

Now the only #ifdef's seem to be for mutex debugging. Might it be
worthwhile to have a generic debugging version, one that just uses
spinlocks and accepts that it's going to be slow, but is shared across
absolutely all architectures?

Then you could have <linux/mutex.h> just doing a single

	#ifdef CONFIG_MUTEX_DEBUG
	# include <asm-generic/mutex-dbg.h>
	#else
	# include <asm/mutex.h>
	#endif

and have mutex-dbg.h just contain prototypes (no point in inlining them,
they're going to be big anyway) and then have a

	obj-$(CONFIG_MUTEX_DEBUG) += mutex-debug.c

in the kernel/ subdirectory?

That way you could _really_ have a clean separation, with absolutely zero
pollution of any architecture mess or debugging #ifdef's in any
implementation code.

At that point I'd like to switch to mutexes just because the code is
cleaner!

Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 08/19] mutex subsystem, core
Date: Tue, 03 Jan 2006 15:40:21 UTC
Message-ID: <fa.g0a14jd.i0g62p@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0601030736531.3668@g5.osdl.org>

On Tue, 3 Jan 2006, Ingo Molnar wrote:
> >
> > Is this an interrupt deadlock, or do you not allow interrupt contexts
> > to even trylock a mutex?
>
> correct, no irq contexts are allowed. This is also checked for if
> CONFIG_DEBUG_MUTEXES is enabled.

Note that semaphores are definitely used from interrupt context, and as
such you can't replace them with mutexes if you do this.

The prime example is the console semaphore. See kernel/printk.c, look for
"down_trylock()", and realize that they are all about interrupts.

Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Replace completions with semaphores
Date: Tue, 15 Apr 2008 17:00:38 UTC
Message-ID: <fa.z+KqPzoOtCdF/f8Xq6ejdEs5kZ4@ifi.uio.no>

On Tue, 15 Apr 2008, Andi Kleen wrote:
>
> > - probably add support for completions to do counting
>
> But that's just a semaphore, isn't it?

Exactly. But the point here is:

 - nobody should use semaphores anyway (use mutexes)
 - making *more* code use semaphores is wrong
 - completions have a different _mental_ model

IOW, this is not about implementation issues. It's about how you think
about the operations. We should _not_ implement completions as
semaphores, simply because we want to get *rid* of semaphores some day.

So rather than this long and involved patch series that first makes
semaphores generic, and then makes them be used as completions, I'd much
rather just skip this whole pointless exercise entirely.

Why have "generic semaphores" at all, if we want to get rid of them?

Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Replace completions with semaphores
Date: Tue, 15 Apr 2008 18:20:47 UTC
Message-ID: <fa.6KvRw3P6/PVf+3AnODUXgA0i0AU@ifi.uio.no>

On Tue, 15 Apr 2008, Matthew Wilcox wrote:
>
> > In other words, what makes me not like this is that we first turn
> > semaphores into the generic code (which is largely what completions
> > were: just a special case of the generic semaphores!) and then turn
> > completions into these things. That just doesn't make any sense to me!
>
> Blame me for not realising that completions were semaphores under a
> different name.

The origin of completions is literally the semaphore code - just
simplified to use spinlocks and be usable as just a mutex.

We used to use semaphores, and because of the subtle race with lockless
semaphores I wrote that stupid completion code as a "generic semaphore
with a very specific usage scenario" and called them "completions".

The completions _could_ have been extended/used as mutex semaphores, but
the difference was really the mental model for them. That then limited
the implementation of them: the functions working on completions are
defined on purpose to be limited - it doesn't really have "up()" and
"down()" functions: "complete()" is really an up(), but
"wait_for_completion()" is more like a
"wait_until_I_could_do_a_trydown()" function.

Would it make sense to use completions for countable events too? Yeah. In
fact, we have some things that really would like to do counting, both in
the sense of "wait for <n> events to all complete" _and_ in the sense of
"allow up to <n> events to be outstanding".

Both of those could be done with a counting version: just make
"complete()" increment the counter; then "wait for <n> events"
initializes it to a negative value, "allow <n> outstanding events" uses a
positive counter, and "wait_for_completion()" basically becomes a
"decrement and wait until it is zero".

IOW, completions really follow the same patterns as semaphores, and it
*does* make sense to just have one single code-base. But if we want to
make semaphores go away, I think that it would be better to implement
semaphores in terms of "extended completions" rather than the other way
around. That way, we could one day really get rid of semaphores entirely.

Linus
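The counting extension sketched above - complete() increments, and "wait for <n> events" starts the counter negative so the waiter blocks until it reaches zero - could be modelled like this (a userspace sketch with made-up names, built on pthreads; not a proposed kernel API):

```c
#include <pthread.h>

struct xcompletion {
	pthread_mutex_t lock;
	pthread_cond_t  wait;
	int count;	/* negative: that many events still outstanding */
};

/* "Wait for <n> events": initialize the counter to -n. */
static void xcompletion_init(struct xcompletion *x, int nevents)
{
	pthread_mutex_init(&x->lock, NULL);
	pthread_cond_init(&x->wait, NULL);
	x->count = -nevents;
}

/* complete(): increment the counter; wake waiters when it hits zero. */
static void xcomplete(struct xcompletion *x)
{
	pthread_mutex_lock(&x->lock);
	if (++x->count == 0)
		pthread_cond_broadcast(&x->wait);
	pthread_mutex_unlock(&x->lock);
}

/* wait_for_completion(): block until all outstanding events completed. */
static void xwait(struct xcompletion *x)
{
	pthread_mutex_lock(&x->lock);
	while (x->count < 0)
		pthread_cond_wait(&x->wait, &x->lock);
	pthread_mutex_unlock(&x->lock);
}
```

The "allow <n> outstanding events" throttle variant would initialize the counter positive instead and have the decrement-and-wait on the submitting side; the same counter and wait-queue serve both patterns, which is the point of the "one single code-base" argument.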