Repetitive task scheduling in the cloud: patterns and anti-patterns

Some things just need to happen on a repetitive schedule: a daily calculation, an hourly cleanup, an import job every minute, a monthly summary. This is an old problem in computing, but not a trivial one. Surprisingly, I see lots of anti-patterns in the various implementations I encounter.

A naive approach

The naive approach is just scheduling inside your own main process, using whatever event loop or interval mechanism your programming language supports. This “solution” rarely works as expected. Your scheduling now shares memory, CPU, and the event loop/threads with the rest of your app. So scheduled tasks might consume the CPU available for handling incoming requests, and worse, if you are in the middle of a CPU-intensive stretch, your naive scheduler might not trigger on time (or at all). An even bigger problem hits you when you scale your process to multiple replicas: you now get multiple invocations of every task.
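To make the anti-pattern concrete, here is a minimal sketch of such an in-process scheduler (the function name and structure are illustrative, not from any particular framework). Note how a slow task directly delays the next tick, and how every replica of the app would run its own copy of this loop:

```python
import threading
import time

def schedule_every(interval_s, task):
    """Naive in-process scheduler: runs `task` every `interval_s` seconds
    inside the application's own process. It shares CPU and memory with
    request handling, drifts whenever a task overruns its interval, and
    fires once *per replica* when you scale out."""
    def loop():
        while True:
            started = time.monotonic()
            task()  # a slow or CPU-bound task pushes back every later tick
            # sleep only for whatever is left of the interval
            time.sleep(max(0.0, interval_s - (time.monotonic() - started)))

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

The daemon thread here also illustrates another weakness: when the process dies (deploy, crash, scale-in), the schedule silently dies with it.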

So it’s clear we need to separate scheduling from the regular lifecycle of our application (unless scheduling is the only thing our application does). But this doesn’t really solve the problem of multiple tasks competing for the limited resources of a single process, which might also prevent the next tasks in line from being scheduled. For anything more than a few short-running, non-CPU-intensive tasks, we’ll need to separate scheduling from the execution of the scheduled tasks.

Separation

How can we separate scheduling from processing? There are two common solutions here (and there might be more).

Starting a dedicated process per scheduled job

How? This can be done with a cloud-native orchestrator: K8s jobs, ECS tasks, scheduled cloud functions, etc. are all good fits. If resources are limited, you might need to reason about the number of concurrent scheduled jobs running, and limit this concurrency. This solution is scalable, easy to reason about, and has the added advantage that tasks run in a clean environment, without any leftover state from a previous task. The main con is startup time. If your container (and usually there is a container involved in this solution) has a long startup time compared to the length of the actual task, this won’t be practical, especially when you need to run lots and lots of short scheduled tasks on a tight schedule.

Using standby workers

In this pattern the scheduler “tells” an existing worker from a worker pool to run a task. How? A common practice is to do this with a message rather than http/grpc, so if there is no available worker the message waits until one is ready. An alternative is creating a workers orchestrator that takes the requests for scheduled runs and queues them until a worker is ready (and then, yet again, you have to reason about the HA of this orchestrator process, distributed state, etc.). This solution usually shines for lots of really short tasks, especially where the startup time of the process/container needed to run the task is long. For a 1-second job, you don’t want to wait 30 seconds for a dedicated process to start. But we do need to reason about the fleet size of available workers: auto scaling vs. the cost of maintaining a standby worker fleet. The distribution of your schedule over time matters a lot here. If all tasks run between 12:00 and 12:30 at midnight, you’ll have an unused worker fleet for 23.5 hours every day. A responsive, well-tuned auto scaling/downscaling solution comes in handy in this scenario.
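The message-based variant of this pattern can be sketched in a few lines. This is a toy model, with an in-process `queue.Queue` standing in for a real broker (SQS, RabbitMQ, etc.) and the function names being my own; the point it illustrates is that the scheduler only publishes, and a message simply waits in the queue until some worker is free:

```python
import queue
import threading

# Stand-in for a real message broker: if no worker is free, a published
# task message waits here until one picks it up.
task_queue: "queue.Queue" = queue.Queue()

def worker(results: list) -> None:
    """A standby worker: blocks on the queue, runs whatever arrives."""
    while True:
        task_id = task_queue.get()   # blocks until the scheduler publishes
        if task_id is None:          # sentinel to shut the worker down
            task_queue.task_done()
            return
        results.append(task_id)      # the actual task body would go here
        task_queue.task_done()

def publish_scheduled_run(task_id: str) -> None:
    """What the scheduler does on each tick: publish a message,
    never call a worker directly over http/grpc."""
    task_queue.put(task_id)
```

With a real broker the queue is also where you get durability: a scheduled run published while all workers are busy (or crashed) is not lost, just delayed.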

Common pitfalls

There are some common problems every scheduled-tasks solution needs to handle: task concurrency, error handling, and limits. How you reason about them really depends on your tasks and requirements.

Concurrency: Can a scheduled task run concurrently with a task of the same kind? Let’s assume you have a long-running calculation of metric A every 10 minutes; what happens if another task of the same kind starts (because of timing, or other problems) while the first one is still running? Would the answer be the same if the task involves “take all unsent notifications and send them”?

Error handling: How should we handle errors in our task? Rerun? How many times? Ignore? Stop further execution of the next instances of this scheduled task?

Limits: Besides a backoff limit for retries, is there another limit relevant here? If a task is still running after a day, is that expected behavior or should it be killed? How should tasks handle termination because of a deploy?

A robust scheduling solution might help with relevant configuration options, allowing you to handle all or some of those; but you still need to reason about them, and they all come with a price tag — there are no silver bullets. It usually boils down to two alternative semantics:

At most once — for tasks where the cost of running more than once, or not on time, is higher than the cost of not running at all. A non-idempotent task sending a push message to your clients might be a good example. If you go this path, be sure to have decent monitoring so you’ll know which of your tasks didn’t run; and you’ll eventually need a “manual” task API allowing system operators to fill in missing runs.

At least once — ensure the task will run at least once, and prepare for the possibility that it runs more than once, perhaps concurrently. Task idempotency is a basic requirement here.
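A common way to get that idempotency is to key each scheduled run by its identity — task name plus scheduled time — and make duplicate deliveries a no-op. A sketch under that assumption (the in-memory set stands in for a database table with a unique constraint on the run key; names are illustrative):

```python
# With at-least-once delivery, the same scheduled run can arrive twice.
# Keying each run by (task name, scheduled time) lets the handler
# recognize and skip the duplicate.
processed: set = set()

def handle_run(task_name: str, scheduled_at: str, task) -> bool:
    """Execute `task` once per (task_name, scheduled_at) pair; duplicate
    deliveries are ignored. Returns True if the task actually ran."""
    run_key = (task_name, scheduled_at)
    if run_key in processed:
        return False
    processed.add(run_key)
    task()
    return True
```

Note the design choice hidden here: marking the key *before* executing means a crash mid-task loses that run for this key. Marking after execution flips the trade-off back toward duplicates — which is why truly idempotent task bodies remain the safest option.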

When did you say?

Another pitfall waiting here for our schedules is time zones. A best practice is to schedule by UTC, and lots of schedulers just assume UTC and don’t allow any configuration around it. But if you need your schedules in some other time zone (like client time), then your schedules should be adapted accordingly, and you will probably need some automation to handle daylight saving time changes (unless your scheduler does support time zones). A common workaround is not scheduling for a specific hour (3:00 PM) but scheduling on an “every hour” pattern and checking, inside the task, whether this is a good time to run. While common, I believe this usually isn’t the right solution; investing in automating the adaptation of the scheduled time to the current UTC time is the more “correct” one.
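That adaptation can be small. A sketch of the idea using Python’s standard zoneinfo module (the function name and parameters are my own): compute the UTC instant of the local-time schedule per occurrence, rather than fixing one UTC hour forever, and DST shifts are absorbed automatically:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def utc_fire_time(local_date: str, hour: int, tz_name: str) -> datetime:
    """Translate 'at `hour` local time in `tz_name`' into the UTC instant
    for a given date. Recomputing this for every occurrence is what makes
    daylight saving time changes a non-event."""
    local = datetime.fromisoformat(local_date).replace(
        hour=hour, tzinfo=ZoneInfo(tz_name)
    )
    return local.astimezone(ZoneInfo("UTC"))
```

For example, “3:00 PM New York” resolves to a different UTC hour in January (EST) than in July (EDT) — exactly the shift a fixed-UTC cron line would get wrong twice a year.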

Accepting scheduling limits

An important mindset when planning our scheduling solution is accepting the limits of scheduling. The most important of them is accuracy: it’s very hard to ensure high accuracy with scheduling solutions, especially when dealing with very short schedules (seconds). The de facto standard scheduling configuration format, the crontab, doesn’t support scheduling accuracy finer than a minute. But even if your scheduling solution does support one, it’s very hard to deliver. If lots of items are scheduled for a specific second (00:00 UTC is a *very* common scheduling target, for example), the scheduler might not be able to deliver all of them at the exact moment, and even if it did — network latency, software layers, etc. might prevent you from processing, or even accepting, the task in that exact second. Accepting this limit will make your design more robust.

A related issue is spreading your tasks. We tend to schedule daily tasks when main usage is down — and when clients are spread across multiple time zones, even figuring out *when* that is becomes a question. This usually leads to trying to run a task at the exact end of day, so we end up with lots of tasks trying to run at exactly the same moment, and it turns out that, because of the enormous spike, some of them fail to run on schedule. It’s probably better if a daily task does not depend on its exact time of running. The Google SRE team even extended the crontab format with the ability to specify accepted time ranges and let the system choose an optimal timing, in order to mitigate the spike effect (with limited success, it seems).
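One lightweight way to get that spreading, if your tasks genuinely don’t care about the exact time, is deterministic jitter: hash each task’s id into an offset within an accepted window. This is a sketch of the idea (function name and window parameter are mine, not from any scheduler); being deterministic, each task keeps the same slot every day, so its runs stay evenly spaced:

```python
import hashlib

def spread_offset(task_id: str, window_s: int = 3600) -> int:
    """Deterministically spread tasks over a window instead of firing them
    all at 00:00. Hash the task id into an offset (in seconds) within the
    window; the same task always gets the same slot."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % window_s
```

The schedule then becomes “window start plus `spread_offset(task_id)`” rather than a single shared instant, flattening the midnight spike without any coordination between tasks.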

High availability scheduling

Now this is a tough problem, and you probably can’t solve this by yourself.

Unless you completely don’t care about double scheduling (which is a very rare case), you need a “master” publishing the schedule and a “secondary” (or several of them) ready to replace it without republishing the same tasks. So now you need some sort of distributed state and working failure detection over the network (a hard problem on its own). Unless you want to dive into implementing a highly available distributed system with a Multi-Paxos implementation (like the “Chubby” team at Google did — they documented their tough journey implementing Multi-Paxos in this context as a warning for all of us), you should probably use some cloud-provider-based scheduler, perhaps one triggering an event that starts a task on your workers, or one starting a container; each makes sense in a different scenario. In lots of scenarios you don’t really need highly available scheduling, or are not willing to pay the complexity cost, and you’d be better off monitoring the non-highly-available scheduler.