Preamble

SObjectizer is one of the very few live and still evolving open-source frameworks for C++ that implement the actor model. We, the development team, not only work on SObjectizer but also actively use it in application development. In this article we want to share some lessons we have learnt during almost 13 years of using SObjectizer.

A very short introduction

SObjectizer was created in 2002. It has been used in production since the very first version.

The lessons described here were learnt while using SObjectizer in several finance and telecom applications: a payment gateway, a mobile banking platform and an SMS/USSD aggregation and distribution system, to name a few.

These are not examples of large-scale or extremely loaded applications. There are no record-breaking performance results, just a few million requests per day and one or two hundred million requests per month. Ordinary systems developed by very small teams on very tight schedules.

Some of those systems have been in use and under maintenance for more than 10 years. So there is some experience of using the actor model in real-world C++. We hope this experience will be interesting and helpful.

Some words about terminology

We will discuss SObjectizer-related things, so the term agent is used below. A SObjectizer agent is very similar to an Akka actor or to a lightweight process in Erlang. Because of that, speaking about SObjectizer's agents is much like speaking about Akka's actors or Erlang's processes.

Too many agents could be a problem, not a solution.

Almost every framework that implements the actor model declares the ability to create millions of actors. SObjectizer is no exception: there can be millions of agents, and their number is physically limited only by the amount of RAM.

This possibility creates a dangerous delusion. One may think that representing every work unit as an agent is a good idea and a rather good solution to the problem of concurrently processing thousands of independent requests.

It seems very easy: just create a separate agent for every request and everything will work fine.

Unfortunately, it is not as simple as it seems, because there is such a thing as spontaneous activity peaks.

For example, an agent initiates a request to another system and controls a timeout for the response. If there is no response within that timeout, then some reaction must be performed: a new request must be initiated, a negative response must be generated, some state must be updated in persistent storage and so on...

It looks simple: your agent sends a message to another agent and then sends a delayed timeout signal to itself. If the timeout signal arrives before the expected response, then the agent performs the appropriate reaction. It is obvious.

But it doesn't scale. If you have tens of thousands of agents, you can find yourself in a situation where your agents process only timeout signals and nothing more. Consider the following example.

10K agents send requests to a remote system at timepoint T. Then the next 10K agents send their requests to the same remote system at timepoint (T+1s). Then the next 10K agents send their requests at timepoint (T+2s). And so on.

None of them gets a response within the timeout because the remote system is overloaded, or is restarting right now, or your network is dead, or something else...

10K timeout signals occur at timepoint (T+timeout). You should handle these 10K timeout signals before the next 10K timeout signals occur at timepoint (T+1s+timeout). You should handle all of them before the next 10K timeout signals occur at timepoint (T+2s+timeout). And so on.
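A quick back-of-the-envelope simulation in plain C++ makes the burst pattern visible. The numbers (10K agents per wave, a 5-second timeout) are only the illustration values from the example above, not code from a real application:

```cpp
#include <cassert>
#include <map>

// Count how many timeout signals become due at every timepoint,
// assuming nobody answers and every request times out.
std::map<int, int> timeout_bursts(
      int waves,            // number of request waves
      int agents_per_wave,  // e.g. 10K agents sending at once
      int timeout_s )       // timeout value in seconds
{
   std::map<int, int> due; // timepoint (seconds after T) -> timeouts due
   for(int wave = 0; wave < waves; ++wave)
      // The wave sent at T+wave expires at T+wave+timeout.
      due[wave + timeout_s] += agents_per_wave;
   return due;
}
```

Every wave of requests comes back as an equally sized wave of timeout signals, shifted by the timeout value; nothing smooths the bursts out.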

It is good if you can handle 10K timeout signals per second. But what if you can't? Your event queues will grow. Your performance will drop. Some messages will be lost. On top of that, more timeout signals could be generated, and so on...

We made that mistake in the past, and this is one of the lessons we have learnt: it is not a good idea to have too many agents if they use timers. At some point you receive more timer events than you can handle before the next portion of timer events arrives.

This lesson extends to other cases. For example, if you have 100K agents which need to do independent updates to an RDBMS, you could find yourself in a situation where almost all of your agents are trying to do an update at the same time and your database does not have enough performance to survive...

What can we do about that?

We have found that instead of spawning many agents, each of which handles all stages of request processing, it is better to have a few agents, each of which represents just one particular stage of request processing.

This approach is also known as the SEDA way. There was a SEDA framework for Java, and the cornerstone of that framework was the separation of request processing into several stages, with every stage implemented by a separate handler.

This approach was successfully reused in SObjectizer-based applications. Stages of request processing were implemented as independent agents, and interaction between agents was implemented via asynchronous message passing.
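The idea can be sketched in plain C++ (this is not SObjectizer code; the stage names and the queue-based wiring are invented for illustration): a whole wave of requests flows through a handful of stage objects, each with its own inbox, instead of through thousands of per-request agents.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// One SEDA-like stage: a queue of pending items plus a handler which
// transforms an item and passes the result to the next stage (if any).
struct stage {
   std::queue<std::string> inbox;
   std::function<std::string(const std::string &)> handler;
   stage * next = nullptr;

   void drain() {
      while(!inbox.empty()) {
         auto out = handler(inbox.front());
         inbox.pop();
         if(next) next->inbox.push(out);
      }
   }
};

// Run a batch of requests through a validate -> process pipeline.
std::vector<std::string> run_pipeline(const std::vector<std::string> & requests) {
   std::vector<std::string> results;

   stage process;
   process.handler = [&results](const std::string & r) {
      results.push_back(r + ":done"); // final stage collects results
      return r;
   };

   stage validate;
   validate.handler = [](const std::string & r) { return r + ":ok"; };
   validate.next = &process;

   for(const auto & r : requests)
      validate.inbox.push(r);
   validate.drain();
   process.drain();
   return results;
}
```

With real agents each stage would run on its own working context and the inboxes would be the framework's message queues, but the shape of the solution is the same.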

As a result there were a few heavy agents in an application: usually several dozen agents, sometimes hundreds, but not thousands. Every agent can effectively control its own activity peaks without the risk of application performance degradation or a total loss of responsiveness.

Overload control for agents is a must-have feature.

An actor framework can declare very impressive performance numbers: something like several million messages per second. Or even tens of millions. Or even more :)

However, the performance of the underlying actor framework and the performance of a particular domain-specific agent are different things. An actor framework can deliver messages to an agent at a rate of several million per second, but due to its domain-specific logic the agent may handle only several hundred domain-specific messages per second.

Even worse, the performance of a particular agent can depend on several factors and can vary from hundreds of messages per second down to tens, or even one or two messages per second.

It means that agents can be overloaded: a sender can send more messages to a receiver than the receiver can handle.

If an agent has no overload defense, the overload leads to growth of the agent's message queue. This growth leads to degradation of the agent's performance, which in turn leads to further overload and degradation.

We have found that if no overload control mechanism is implemented inside an application, then agent overload and the consequent application performance degradation is just a question of time.

What can we do about that?

We think that a good overload control scheme must be application- and domain-specific. There is no simple "one size fits all" solution. Sometimes a request-reply mechanism is useful, sometimes it is possible to just drop some messages, sometimes a sender can be delayed for some time, sometimes several messages can be handled differently, and so on...

Because of that, overload control should be seen as a domain-specific mechanism which can use features of the underlying actor framework (like message limits or size-limited message chains in SObjectizer) together with practices applicable to a particular domain (like request-reply interaction, message grouping, delays for senders, different processing policies and so on).
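As an illustration of the simplest of those policies — dropping messages above a limit — here is a tiny plain C++ sketch of a size-limited mailbox. The class and its numbers are invented for this article; in SObjectizer itself, message limits and message chains are the framework-level tools for the same job:

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// A mailbox which refuses to grow above a fixed limit.
// New messages arriving at a full mailbox are simply dropped
// (a "limit then drop" policy).
template<typename Msg>
class bounded_mailbox {
   std::queue<Msg> queue_;
   std::size_t limit_;
   std::size_t dropped_ = 0;
public:
   explicit bounded_mailbox(std::size_t limit) : limit_(limit) {}

   // Returns true if the message was accepted.
   bool push(Msg m) {
      if(queue_.size() >= limit_) {
         ++dropped_;
         return false;
      }
      queue_.push(std::move(m));
      return true;
   }

   std::size_t size() const { return queue_.size(); }
   std::size_t dropped() const { return dropped_; }
};
```

Whether dropping is acceptable, and what the limit should be, is exactly the domain-specific part of the decision.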

It is hard to control an application without monitoring its internals.

An application built on top of the actor model is like a flock of birds: there are very simple and understandable rules of behaviour for every agent inside the application, but the behaviour of the whole application can be very hard to understand and predict.

Two very simple problems were described above: spontaneous activity peaks and overloading of agents. These problems can change the behaviour of your application significantly. For example, your application showed request processing times of several milliseconds just a few seconds ago, but now processing times have become seconds or even minutes. What happened?

If you do not monitor the internals of your application, it is very hard to answer that question.

Maybe some external system now responds significantly slower? Maybe your agents are busy handling several thousand timer messages? Maybe several agents hang in an infinite loop and there are not enough worker threads left for live agents which could process new requests?...

Without proper monitoring of the application's internals you just don't know what is going on and how you can manage it.

What can we do about that?

We have found that collecting various kinds of run-time statistics is very useful in agent-based applications, especially for applications which must work in 24x7 mode.

In some applications the number of data sources (i.e. providers of run-time information) was above several hundred (700-800 usually, sometimes almost 1000). A few percent of them were used for reactive on-line monitoring of the application. The others were used only occasionally, but there were situations when their values were necessary: they were very useful and sometimes pointed right at the center of the application's problem.

Note that collecting run-time statistics has its own price. Sometimes it is not cheap and has some cost both in development and at run time. But if your application handles hundreds of payment requests per second, it is better to pay for appropriate monitoring than to lose money on application downtime.

We have also found that a small number of agents in an application simplifies monitoring of the whole application. And the opposite: if you have a big number of agents, then you need additional actions to aggregate individual monitoring parameters from each of them.

C++ is not Erlang. "Let it crash" works differently.

Erlang can be seen as a safe and managed language. One can write something like this:

```erlang
devide_process() ->
    receive
        {From, X, Y} -> From ! X / Y
    end.
```

It is not a big problem to use something like that in an Erlang application with hundreds of lightweight processes inside. If devide_process receives 0 as the second value, it performs a division by zero. This aborts just one lightweight process, not the whole Erlang application. The crash of devide_process will not affect the other lightweight processes.

But what if you write similar code in a C++ application and a division by zero occurs in one of the working threads? If you do not handle that kind of error by some low-level mechanism, the whole application will crash.

It means that an error which can easily be ignored in a complex Erlang application can lead to the crash of the whole C++ application.

We have learnt that agents are very similar to lightweight processes in Erlang, except in the case of the "let it crash" principle. C++ has too many ways to leave a process in an unpredictable state: dangling pointers, repeated memory deallocation, memory corruption, an invalid agent state as a result of an unhandled exception and so on.

This leads to a simple consequence: agents in C++ must be developed more carefully than lightweight processes in Erlang or Akka actors in Scala/Java. "Crashing" an agent must be implemented as a graceful shutdown of the agent with correct cleanup of its resources. Unfortunately, it is not as easy as terminating a lightweight process in Erlang.

What can we do about that?

There is nothing really special about agents in comparison with any other approach to writing robust C++ code. The only thing which should be mentioned is exception safety for an agent's message handlers.

When an agent throws an exception from a message handler, the underlying framework should know that the agent provides at least the basic exception guarantee: there is no memory corruption and no resource leaks.

But the underlying framework should know more: is it safe to let that agent handle the next message? If an agent provides the strong exception guarantee, processing of the next messages is allowed. However, if an agent provides only the basic guarantee, the framework must disable further processing for that agent and remove the agent from the application.

This is the reason why agents must take additional care about exception safety guarantees, and the framework must understand the level of exception safety of a particular agent and implement the corresponding reaction depending on that level.
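The idea can be sketched in plain C++ (the names and the enum below are invented for this sketch, not a real framework API): the dispatching loop asks the agent what guarantee it declares and removes the agent after an exception unless the guarantee is strong.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

enum class exception_guarantee { basic, strong };

struct toy_agent {
   exception_guarantee guarantee;
   bool alive = true;
   int handled = 0;

   explicit toy_agent(exception_guarantee g) : guarantee(g) {}

   void on_message(const std::string & msg) {
      if(msg == "boom")
         throw std::runtime_error("processing failed");
      ++handled;
   }
};

// Deliver one message; on an exception, keep the agent only if it
// declares the strong guarantee (its state is known to be intact).
void deliver(toy_agent & agent, const std::string & msg) {
   if(!agent.alive) return; // a removed agent gets no more messages
   try {
      agent.on_message(msg);
   }
   catch(const std::exception &) {
      if(agent.guarantee != exception_guarantee::strong)
         agent.alive = false; // remove the agent from the application
   }
}
```

A real framework would also deregister the agent's cooperation and release its resources, but the decision point is the same: the declared guarantee drives the reaction.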

Agents will become fatter and fatter with time.

We have also learnt that the code size of agents tends to grow with time.

In the very first version your agent can be very simple and can handle just one message:

```cpp
class simple_action_performer {
   ...
   void on_action(const some_message & msg) {
      if(some_condition(msg))
         do_action_one(msg);
      else
         do_action_two(msg);
   }
};
```

Then you have to add some exception handling because message processing becomes more complex:

```cpp
void on_action(const some_message & msg) {
   try {
      if(some_condition(msg))
         do_complex_action_one(msg);
      else
         do_complex_action_two(msg);
   }
   catch(const error_case & ex) {
      handle_error_case(msg, ex);
   }
}
```

Then you add some logging:

```cpp
void on_action(const some_message & msg) {
   try {
      log_message_processing_start(msg);
      if(some_condition(msg))
         do_complex_action_one(msg);
      else
         do_complex_action_two(msg);
      log_message_processing_successful_finish(msg);
   }
   catch(const error_case & ex) {
      log_error_case(msg, ex);
      handle_error_case(msg, ex);
   }
}
```

Then you add collecting of some run-time monitoring data:

```cpp
void on_action(const some_message & msg) {
   try {
      log_message_processing_start(msg);
      update_incoming_msg_count(msg);
      if(some_condition(msg))
         do_complex_action_one(msg);
      else
         do_complex_action_two(msg);
      update_processed_msg_count(msg);
      log_message_processing_successful_finish(msg);
   }
   catch(const error_case & ex) {
      log_error_case(msg, ex);
      update_error_case_count(msg, ex);
      handle_error_case(msg, ex);
   }
}
```

Then you may find that the message processing must become even more complex, with more condition checks and more variants of processing actions.

If your application is under maintenance for years, you could easily find that a simple agent of about 50 lines of code becomes 500 lines in just two or three years.

The growth of agent size can cause maintenance problems, and it is good if your actor framework can cope with that.

What can we do about that?

We have found that the following properties of SObjectizer help us deal with growing code size and complexity:

- Representation of agents as objects of user-defined classes. If an agent is an object of some class, then growth of the agent's size can be handled much more easily than if the agent is represented as a function with a message handling loop inside. In some cases inheritance of agent classes can help significantly too. But these are rare cases, and inheritance and polymorphism are not as useful as encapsulation.

- Representation of messages as objects of user-defined classes. In SObjectizer you can't simply send two integer values as a message. You need to define a struct/class with two integer fields and send an instance of that struct/class. It requires more writing, but it significantly simplifies maintenance: it is very easy to change the number and types of fields inside the message struct/class, there are compile-time checks from the compiler, and so on. Likewise, the places where a message of a particular type is sent or handled can easily be found via an IDE, tools like Doxygen, or even a simple tool like grep.

- A possibility to use C++ templates in the implementation of agents. The first versions of SObjectizer (aka SO-4) didn't support templates, and this led to code duplication in some cases. This flaw is fixed in the current SObjectizer versions (aka SO-5), and sometimes the use of C++ templates greatly reduces code size and complexity.
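To illustrate the second point, a message in that style is just a small struct. The type and field names below are invented for this example, not taken from a real application:

```cpp
#include <cassert>
#include <string>

// A message type in "one struct per message" style. Adding, removing
// or retyping a field later is a compile-time-checked change, and every
// place which sends or handles take_payment is trivially greppable.
struct take_payment {
   std::string account;
   long amount; // in minor currency units
};

// A handler sketch: in a real agent this would be a message handler
// subscribed to take_payment messages.
long apply_payment(long balance, const take_payment & msg) {
   return balance + msg.amount;
}
```

Compare this with sending a bare pair of values: the struct carries the field names, the compiler checks every construction site, and the type itself documents the protocol.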

Conclusion

As we can see, the actor model has become more and more popular in recent years. But too much attention is paid to things of controversial value, like the ability to create millions of actors in one physical process or incredibly high message throughput in synthetic benchmarks. Our experience tells us that there are more important things which must be seriously considered when a C++ application is developed on top of the actor model. In this short article we have tried to briefly mention some of them.