Alright, moving my comment to an answer. Do note that there are multiple views of the IO monad and these ideas. You can achieve similar results with uniqueness types and whatnot.

I just happen to find this the simplest and coolest explanation.

Faking Impurity

Haskell is a purely functional language. This should mean that evaluating a Haskell program always produces the same result. However, that doesn't seem to be the case! Look at something like

    -- Echo.hs
    main :: IO ()
    main = getLine >>= putStrLn

This seems to do something different depending on user input.

Really, anything that lives in IO looks like it can do wildly different things depending on everything from the state of the moon to the life insurance costs of Schrödinger's cat.

Moreover, it seems like any language that can do anything useful must be impure. Unless you're interested in watching your CPU spin, producing side effects is all that programs exist for, after all!

Evil Interpreters

In fact this isn't actually the case! The appropriate mental model is to imagine IO as something like

    data IO a = PutStrLn String a
              | GetLine (String -> a)
              ...

So IO could just be a data structure representing a sort of "plan" for the program to execute. Then the evil impure Haskell runtime actually executes this plan, producing the results you see.
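To make the "plan" idea concrete, here's a minimal sketch of that mental model. Everything in it is made up for illustration: `Plan`, `andThen`, `runPlan` and `simulate` are hypothetical names, and GHC's real IO is not implemented this way. I've also tied the recursive knot by hand (the rest of the plan is itself a `Plan`), which is slightly different from the one-layer type above.

```haskell
-- Hypothetical stand-in for IO; the real IO type in GHC is not defined this way.
data Plan a
  = Done a                      -- finished, with a result
  | PutStrLn String (Plan a)    -- print a line, then continue
  | GetLine (String -> Plan a)  -- read a line, feed it to the rest

-- Sequencing: run the first plan, feed its result to the next.
andThen :: Plan a -> (a -> Plan b) -> Plan b
andThen (Done a)       f = f a
andThen (PutStrLn s k) f = PutStrLn s (k `andThen` f)
andThen (GetLine g)    f = GetLine (\s -> g s `andThen` f)

-- The echo program from earlier, as a plain value.
echo :: Plan ()
echo = GetLine (\s -> PutStrLn s (Done ()))

-- The "evil" interpreter: the only place real effects happen.
runPlan :: Plan a -> IO a
runPlan (Done a)       = return a
runPlan (PutStrLn s k) = putStrLn s >> runPlan k
runPlan (GetLine g)    = getLine >>= runPlan . g

-- A pure interpreter: canned input lines in, output lines out.
simulate :: [String] -> Plan a -> ([String], Maybe a)
simulate _      (Done a)       = ([], Just a)
simulate ins    (PutStrLn s k) = let (out, r) = simulate ins k in (s : out, r)
simulate (i:is) (GetLine g)    = simulate is (g i)
simulate []     (GetLine _)    = ([], Nothing)  -- ran out of input
```

Note that `echo` is the same value whether you hand it to `runPlan` (real terminal) or to `simulate` (a list of strings). The program itself never changes; only the interpreter does.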

This isn't just a minor semantic quibble, though. We can do something like

    runBackwards :: [IO ()] -> IO ()
    runBackwards = foldr (>>) (return ()) . reverse

In other words we can manipulate our "plans" as normal, first class values.

We can evaluate them, force them, drop a ton of bricks on them, even say mean things about them behind their backs, and they'll never produce a side effect! They can't, you see: normal Haskell code can only build up IO actions to be executed by the runtime. On its own it's incapable of doing anything noticeable.
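If you want to see this for yourself, here's a small sketch using real IO actions. The names `demo`, `record` and `plans` are mine; `runBackwards` is the definition from above. Each action appends to a log when it is run, so we can check what actually happened.

```haskell
import Data.IORef (newIORef, modifyIORef, readIORef)

-- Same definition as earlier in this answer.
runBackwards :: [IO ()] -> IO ()
runBackwards = foldr (>>) (return ()) . reverse

-- Builds three logging actions, pokes at them, and only then runs them.
-- Returns the log as seen before and after running.
demo :: IO ([String], [String])
demo = do
  logRef <- newIORef []
  let record s = modifyIORef logRef (++ [s])   -- an action, not yet run
      plans    = map record ["one", "two", "three"]
  -- Forcing the list and each element evaluates the plans without running them.
  length plans `seq` foldr seq () plans `seq` return ()
  before <- readIORef logRef
  runBackwards plans
  after <- readIORef logRef
  return (before, after)
```

The log is still empty after all the forcing, and only fills up (in reverse order) once `runBackwards plans` is itself sequenced into the program that the runtime executes.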

In a way you can almost view a Haskell program as the ultimate form of metaprogramming, producing programs on the fly during runtime and having them evaluated by some interpreter.

So when you say

foo = delay 20

You're not saying "Delay this program for 20 whatevers", you're saying "In the program that this code builds, pause its execution for 20 whatevers when it runs".

Who Cares

It's fair to ask "Who cares?" If this code gets run at some point, who cares who runs it? What good does it do to be purely functional in this way? It can actually have some interesting effects (heh).

For example, think of something like http://www.tryhaskell.org. Clearly it needs to run Haskell code, but it also can't just blindly execute whatever IO it gets! What it can do is provide a different implementation of IO, while exposing an identical API.

This new IO builds up a nice tree-like data structure which can be easily sanitized and checked by the web backend to ensure that it never runs something evil. We can even compile our fake IO structure to the normal one that GHC provides and execute it efficiently on the server! Since there's never anything evil in there to begin with, we only have to trust code we wrote.
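As a rough sketch of what such a replacement could look like (I'm not claiming this is how tryhaskell.org is actually built; `SafeIO`, `audit` and `compile` are hypothetical names), the restricted type simply has no constructor for anything dangerous, and a pure check can walk the structure before we ever touch real IO. Because `ReadLine` holds a function, the check has to feed it sample input, so this is a dry run rather than a static proof.

```haskell
-- Hypothetical restricted IO: no constructor exists for files, network, etc.
data SafeIO a
  = Return a
  | Print String (SafeIO a)
  | ReadLine (String -> SafeIO a)

-- Hypothetical policy check: dry-run the plan against sample input and reject
-- it if it prints more than `budget` lines or reads past the input.
audit :: Int -> [String] -> SafeIO a -> Either String a
audit _ _ (Return a) = Right a
audit budget ins (Print _ k)
  | budget <= 0 = Left "output budget exceeded"
  | otherwise   = audit (budget - 1) ins k
audit budget (i:is) (ReadLine g) = audit budget is (g i)
audit _ [] (ReadLine _) = Left "input exhausted"

-- Compile to the real IO that GHC provides.
compile :: SafeIO a -> IO a
compile (Return a)   = return a
compile (Print s k)  = putStrLn s >> compile k
compile (ReadLine g) = getLine >>= compile . g

-- A chatty program that would print forever.
spam :: SafeIO ()
spam = Print "spam" spam

-- A well-behaved one.
greet :: SafeIO ()
greet = ReadLine (\name -> Print ("Hello, " ++ name) (Return ()))
```

Here `audit 10 [] spam` rejects the infinite printer without printing anything, while `greet` passes and can be handed to `compile`. The worst a user program can do is whatever `compile` is willing to translate.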

No more endless applet-style security holes. By replacing IO we know beyond a shadow of a doubt that we can execute this code and it will never attempt to do something evil.

Evil Interpreters Everywhere

In fact, this notion of building up data structures is useful for more than just IO. It's a great way to structure any project that aims to provide a limited DSL. For example:

A query language

Game scripting

Code generation

Writing "client-side Haskell"

All of these can be solved by building up normal data structures and "compiling" them to the appropriate language. The usual trick for this is to use a free monad. If you're an intermediate Haskeller, go learn about 'em!
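To give a taste of the free monad trick, here's a minimal hand-rolled sketch. In practice you'd probably reach for the `free` package rather than defining `Free` yourself; the `Cmd` type, the `say`/`move` helpers and the target "language" emitted by `compileScript` are all invented for this example.

```haskell
-- A hand-rolled free monad, kept tiny for illustration.
data Free f a = Pure a | Free (f (Free f a))

instance Functor f => Functor (Free f) where
  fmap g (Pure a)  = Pure (g a)
  fmap g (Free fa) = Free (fmap (fmap g) fa)

instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure g  <*> x = fmap g x
  Free fg <*> x = Free (fmap (<*> x) fg)

instance Functor f => Monad (Free f) where
  Pure a  >>= k = k a
  Free fa >>= k = Free (fmap (>>= k) fa)

-- Hypothetical game-scripting commands; `next` is the rest of the script.
data Cmd next
  = Say String next
  | Move Int Int next

instance Functor Cmd where
  fmap f (Say s n)    = Say s (f n)
  fmap f (Move x y n) = Move x y (f n)

say :: String -> Free Cmd ()
say s = Free (Say s (Pure ()))

move :: Int -> Int -> Free Cmd ()
move x y = Free (Move x y (Pure ()))

-- An ordinary do-block that only builds a data structure.
script :: Free Cmd ()
script = do
  say "hello"
  move 1 2
  say "bye"

-- "Compile" the script to lines of a made-up target language.
compileScript :: Free Cmd a -> [String]
compileScript (Pure _)            = []
compileScript (Free (Say s n))    = ("say(" ++ show s ++ ");") : compileScript n
compileScript (Free (Move x y n)) = ("move(" ++ show x ++ ", " ++ show y ++ ");") : compileScript n
```

The nice part is that `script` reads like imperative code, yet it's just a value. You could write a second interpreter that runs it in IO, or a third that checks it for forbidden moves, without touching the script itself.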