Didn't you just tell us to use a bigger machine?

Why, yes, I did.

There are two conflicting forces at play. Each individual node needs to be powerful enough to run cluster-supporting workloads, as well as a reasonable amount of your own. However, simply making a few big nodes means that whenever a node fails, your available capacity drops much more - and that means the knock-on effect on the rest of the cluster is much bigger. Even when you plan your capacity to account for this and use pod antiAffinities and multiple replicas, even if you run in the cloud with an autoscaler bringing up a replacement node immediately - the bigger the percentage of the cluster a single node represents, the worse the chaos after a failure.
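To make the "bigger percentage" point concrete, here is a back-of-the-envelope sketch - the core counts and node counts are made-up illustrative numbers, not from any real cluster:

```python
def capacity_lost_on_failure(node_count: int) -> float:
    """Fraction of cluster capacity that disappears when one node fails,
    assuming all nodes are sized identically."""
    return 1.0 / node_count

# The same total capacity split across 3 big nodes vs 12 small ones:
for nodes in (3, 12):
    lost = capacity_lost_on_failure(nodes)
    print(f"{nodes:>2} nodes: one failure removes {lost:.0%} of capacity")
# Output:
#  3 nodes: one failure removes 33% of capacity
# 12 nodes: one failure removes 8% of capacity
```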

Sounds like a simple, obvious thing, yet I have to admit I have made that mistake in the past.

A good rule of thumb is to size your nodes so that losing one of them doesn't take a significant portion of your capacity offline. Otherwise you risk that a single node failure leaves you without enough capacity, because the system loses the slack it needs to start replacements for the failed pods.
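One way to sanity-check node sizing against that rule of thumb is to ask whether the surviving nodes have enough headroom to absorb the pods from a failed node. A minimal sketch, assuming identically sized nodes and evenly spread load (the utilization figures below are hypothetical):

```python
def survives_single_node_failure(node_count: int, avg_utilization: float) -> bool:
    """Can the remaining N-1 nodes absorb the load of one failed node?

    avg_utilization is the fraction of each node's capacity in use
    before the failure (0.0 - 1.0), assuming load is spread evenly.
    """
    if node_count < 2:
        return False
    load_after_failure = avg_utilization * node_count / (node_count - 1)
    return load_after_failure <= 1.0

print(survives_single_node_failure(3, 0.75))   # False: survivors would need ~112% of capacity
print(survives_single_node_failure(12, 0.75))  # True: survivors land at ~82%
```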

The nature of Kubernetes as a much more dynamic system than rigid approaches like Pacemaker with manually allocated applications, while making it easier to use every last bit of capacity you have, also makes it a bit more chaotic (a good time to get those Chaos Engineering practices into play!).