Defeat your 99th percentile with speculative task

Recently when running performance tests we have spotted a worrying phenomenon. The response times were quite satisfying at first glance. On average we got < 20ms which was meeting contract requirements. Unfortunately average is not always a good measure and in our case it has become a perfect manifestation of the flow.

Percentile [%] Value [ms] 50 15 90 28 95 30 99 118 99.9 350

The percentile values we observed had quite unusual distribution. Small percent of requests was executed in acceptable long times causing contract violations and timeouts. With help of NewRelic were able to quickly pinpoint the culprit - it was Redis. Surprisingly the long response times were occurring in isolation (responses following the long ones were normal) and there was no pattern.

Btw. our infrastructure sits on AWS. After spending some time tweaking Redis configuration we realised this was not going to be sufficient. We also got some feedback from other teams saying that they fall into similar problems and failed with taming that database.

I can imagine that in some cases one wouldn’t care about the last percentiles. Sometimes you are constrained by SLAs related to error rate ratio and long response times are signalled as errors.

Aggressive timeout?

The first idea we had was to define an aggressive timeout accompanied by retries for Redis operations. This would probably work in the most of situations. It has some limitations though - you need to squeeze the operation time along with the timeout within the time limit which was 50ms. This means that timeout would need to be set at half of this time - 25ms. In consequence given the percentile distribution we knew that significant percent of correct response times would fall into aggressive timeout trap.

Let’s give an example.

Service response time limit is 50ms To give chance for the second try the timeout on Redis is set to half of this time - 25ms. The first call to Redis takes 30ms failing the timeout criteria. Therefore after 25ms the second call to Redis starts.

The second call has chances to succeed but overall aftermath is not promising.

What then?

The inspiration came from BigData world. In Spark when task execution runs suspiciously long the application master starts the same task speculatively on a different executor but it lets the long running tasks to continue. The solution looked elegant:

Service response time limit is 50ms. If the first attempt doesn’t finish within 25ms start a new one, but keep the first thread running. Wait for either thread to finish and take result from the first one ready.

We named the pattern speculative task.

The selected Monix as a good candidate to help with the implementation. It appeared to be the right choice. What Monix has to offer is not only execution scheduling but also comes with time injection solution being invaluable tool for testing.

The code

import java.util.concurrent.TimeoutException import com.typesafe.scalalogging.StrictLogging import monix.eval.Task import monix.execution.atomic. { Atomic , AtomicAny } import monix.execution. { Cancelable , Scheduler } import scala.concurrent.duration.DurationLong import scala.concurrent. { Future , Promise } import scala.language.postfixOps import scala.util. { Failure , Success , Try } object SpeculativeCall { def speculatively [ R ]( attempts : Int , patience : Long , timeout : Long ) ( action : () => Future [ R ]) ( implicit scheduler : Scheduler ) : Future [ R ] = { val promise = Promise . apply [ R ]() val lastFailure = AtomicAny [ Option [ Throwable ]]( None ) val failedCount = Atomic ( 0 ) def complete ( result : Try [ R ]) : Unit = { promise . tryComplete ( result ) () } val scheduledTasks = ( for { attempt <- 1 until attempts } yield { val delay = attempt * patience attempt -> Task . deferFuture ( action ()). delayExecution ( delay . millis ) }). toList val firstTask = Task . deferFuture ( action ()) val tasks = ( 0 , firstTask ) :: scheduledTasks val cancelables = tasks . map { case ( attempt , task ) => task . runOnComplete { case s : Success [ R ] => complete ( s ) case Failure ( ex : TimeoutException ) => if ( failedCount . incrementAndGet ( 1 ) == attempts ) { complete ( Failure ( lastFailure . get . getOrElse ( ex ))) } case f @ Failure ( ex ) => lastFailure . set ( Some ( ex )) if ( failedCount . incrementAndGet ( 1 ) == attempts ) { complete ( f ) } } } Task { if (! promise . isCompleted ) { complete ( Failure ( lastFailure . get . getOrElse ( new TimeoutException ( s "All attempts timed-out" )))) } }. delayExecution ( timeout millis ). runAsync val result = promise . future result . onComplete { _ => Cancelable . cancelAll ( cancelables ) } result } }

Monix has quite clever solution for time injection to scheduler. That allows writing reliable and deterministic tests in our case. This is just an example of a test - I leave it to you to add some more.

class RetryTest extends AsyncFlatSpec with Matchers { implicit val scheduler = TestScheduler () def tasksSimulation [ R ]( results : ( Long , Try [ R ])*) : ( ArrayBuffer [ Int ], () => Future [ R ]) = { val runs = ArrayBuffer . empty [ Int ] val attempt = Atomic ( 0 ) ( runs , () => { val attemptNo = attempt . getAndIncrement () val ( delay , result ) = results ( attemptNo ) runs . append ( attemptNo ) Task { result match { case Success ( s ) => s case Failure ( ex ) => throw ex } }. delayExecution ( delay millis ). runAsync }) } it should "succeed even if the first attempt fails" in { val ( runs , simulation ) = tasksSimulation ( 1L -> Failure ( new RuntimeException ), 1L -> Success ( "result2" ), 100L -> Success ( "result3" ) ) val resultF = Retry . speculatively ( 3 , 30L , 300L , "description" )( simulation ) val testResult = resultF map { result => result shouldBe "result2" } scheduler . tick ( 250 millis ) runs should contain only ( 0 , 1 ) testResult } ... }

Conclusion

Speculative task has saved the day in our case. It has reduced the last percentiles. At the same time we haven’t observed increase load to Redis. In reality the speculative call takes place only once per a few hundreds calls.

Percentile [%] Value [ms] 50 15 90 28 95 30 99 35 99.9 65