Nowadays, servers need to receive and accept connections from tens of thousands of clients. With micro-services architecture, big data and its real time computations gaining popularity, we are looking at millions of connections. Traditional ways of developing these servers do not suffice for this kind of demand.

With libraries such as libevent and boost::asio readily available, developers tend to ignore the details. However, we can learn from the work done by others, and that is what we attempt to do here. The original system call interface for network programming did not allow servers to scale! The interface later evolved to support the modern-day need to serve millions of clients.

Socket programming

Socket programming is initially intimidating to quite a few. To overcome this, let us first focus on understanding it. As you get the hang of what it is and how it works, you can explore the links below for finer details. The links point to the documentation for each of the system calls explained below. This post provides the basic understanding needed to follow that documentation.

In an earlier post, Transition to Linux / Unix, I had mentioned that there is hardly any difference in using either Linux or Unix. The difference shows up when developers use system calls directly. We will focus on Linux here, but the essence remains the same.

socket system call:

To be able to receive connections, we need to create a socket by calling the socket system call. This socket is an endpoint to which our clients connect. We can get it to work for different kinds of needs. For example:

A socket can be an endpoint for sending a live video of a football match.

or

It can be used to receive requests and send responses for a service in our application.

While streaming a live football match video, losing a few packets of information or jumbling up a few does not matter. There will be a very short ‘glitch’ experienced by the viewer.

However, when you send a response to a request sent to your application, you expect a guarantee that the receiver has received the response. Also, that response should be complete and meaningful, not jumbled up.

Observe that both these sockets are different in nature.

We are specifically interested in a TCP server. Use the socket system call as follows:

work for internet protocols (the domain argument, AF_INET)

work reliably for two-way communication for request and response (the type argument, SOCK_STREAM)

work on TCP protocol (the protocol argument, can be 0 as well)

file descriptor of the created socket (the response)

For now, just consider the returned file descriptor as an ID for the created socket.

Later you can find out why it is called a ‘file descriptor’.
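As a minimal sketch of the socket call described above (the helper name create_tcp_socket is ours, purely for illustration), creating a TCP socket could look like this:

```c
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: returns the file descriptor of a new TCP socket,
   or -1 on failure. */
int create_tcp_socket(void)
{
    /* AF_INET: IPv4 internet protocols.
       SOCK_STREAM: reliable, ordered, two-way byte stream.
       IPPROTO_TCP: TCP (passing 0 selects the same default). */
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (fd == -1)
        perror("socket");
    return fd;
}
```

The returned integer is the ID we pass to every other call in this post.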

bind system call:

For the client to connect to your server, it needs an address. Hence, you need an address (an IP address and a port) for your TCP server. If another socket has already acquired this address, you will not get it and the call fails. Use the bind system call here:

Identify the socket which you want to bind (the sockfd argument)

The address you want it to bind to (the second and the third arguments)

This step is called “assigning a name to a socket”.
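Here is a small sketch of bind, assuming an IPv4 address on the loopback interface. The helper name bind_to_loopback and the choice of 127.0.0.1 are our assumptions for illustration, not something the system call requires.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: binds sockfd to 127.0.0.1:port.
   Returns 0 on success, -1 on failure. Port 0 lets the kernel pick one. */
int bind_to_loopback(int sockfd, unsigned short port)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);                   /* network byte order */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* 127.0.0.1 */

    /* Second and third arguments: the address and its size. */
    if (bind(sockfd, (struct sockaddr *)&addr, sizeof addr) == -1) {
        perror("bind"); /* e.g. EADDRINUSE when the address is taken */
        return -1;
    }
    return 0;
}
```

A real server would usually bind to INADDR_ANY or a configured interface instead of the loopback address.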

listen system call:

Now that your socket has an address, you can start listening for incoming connection requests from your clients. A lot of clients could be queuing up with their connection requests. We can specify the maximum length of this queue when making the listen system call.

Identify your socket for which you want to start listening (the sockfd argument)

max backlog (the backlog argument)
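A sketch of the listen call, wrapped in a helper we have named start_listening for illustration:

```c
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: marks sockfd as a listening socket whose queue of
   pending connections holds at most `backlog` entries.
   Returns 0 on success, -1 on failure. */
int start_listening(int sockfd, int backlog)
{
    if (listen(sockfd, backlog) == -1) {
        perror("listen");
        return -1;
    }
    return 0;
}
```

The kernel may cap the backlog you ask for; SOMAXCONN is a commonly used value when you have no specific number in mind.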

accept system call:

If your server does not accept incoming requests, the queue mentioned above will start building up. When you call accept, the first connection request in the queue is accepted, and we get the client’s address info and another socket. Let’s call the returned socket the ‘accepted socket’. It is used only for receiving messages from and sending messages to the connected client. The ‘original socket’ keeps listening for more incoming connections.

If there is no connection request coming in, your server will keep waiting.

This waiting is called blocking. To understand how to write a non-blocking TCP server, we need to understand all the blocking calls!

Use the accept system call as follows:

Identify the ‘original socket’ from which we want to accept incoming requests (the sockfd argument)

Collect the address info of the client (the second and the third arguments)

File descriptor for the accepted socket (the response)
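The accept call can be sketched as follows, assuming IPv4 clients. The helper name accept_client is ours.

```c
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: takes the first pending connection off the queue of
   listen_fd. On success, fills *client_addr with the client's address and
   returns the file descriptor of the accepted socket; returns -1 on failure.
   On a blocking socket this waits until a client connects. */
int accept_client(int listen_fd, struct sockaddr_in *client_addr)
{
    socklen_t len = sizeof *client_addr; /* in: buffer size, out: bytes used */
    int fd = accept(listen_fd, (struct sockaddr *)client_addr, &len);
    if (fd == -1)
        perror("accept");
    return fd;
}
```

The listening socket stays open and usable; only the returned descriptor talks to this particular client.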

read system call:

We are aware that the net is slow. The data to be read from the net might be in transit. During this time the read system call will block, i.e. wait for the in-transit data. The system call read is invoked with:

The accepted socket (the fd argument)

The memory location where we want data to be collected (the second and the third arguments)

The number of bytes read (the response)
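A minimal sketch of a blocking read follows. The helper name read_some is an assumption of ours; note that read may return fewer bytes than the buffer can hold, and returns 0 when the peer has closed the connection.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: reads up to `cap` bytes from fd into buf.
   Returns the number of bytes read, 0 if the peer closed the connection,
   or -1 on error. On a blocking descriptor it waits until data arrives. */
ssize_t read_some(int fd, char *buf, size_t cap)
{
    ssize_t n = read(fd, buf, cap);
    if (n == -1)
        perror("read");
    return n;
}
```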

While the read call is waiting for data, there could be incoming connection requests from other clients. These incoming requests will not get accepted, and the ‘listen queue’ (mentioned earlier) will start building up. To avoid this problem, the primitive style was to fork a new process after the server accepted a connection request. This new process would have the file descriptor of the accepted socket. Using the file descriptor, the process could perform reads from the client and respond with writes. In such cases, the number of forked processes would be equal to the number of connected clients!

If there are tens of thousands of clients, the server will not scale. One might argue that creating a new thread instead of forking a new process is better. With blocking reads it is not: every connected client still ties up one thread that spends most of its time waiting. Later in the post, we will see how to use threads effectively when we unblock the reads.

In Linux, preferring threads over processes yields little benefit at the scheduler level, because the kernel scheduler treats both as tasks and schedules them the same way. The details will be covered in a later post.

write system call:

After processing the client’s request, the server has to send the response. For this, use the write system call as follows:

The accepted socket’s file descriptor (the fd argument)

The data that needs to be sent (the second and third arguments)

Surprise, surprise! Even write is a blocking call. Remember, we expect guaranteed delivery of our message. The network layer therefore copies the data into a ‘send buffer’, splits it into small ‘chunks’ and transmits them, keeping each chunk in the buffer until the client acknowledges it. The write call returns once the data has been copied into the send buffer; it does not wait for the acknowledgements themselves. If the send buffer is full, because the data is large or the client is slow to acknowledge, write remains blocked until acknowledged chunks free up enough space.

The TCP protocol is very elaborate, and hence is not discussed here. For now, let us assume that it expects acknowledgements from the peer and ensures that the peer receives the intended data in the right order.
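A hedged sketch of sending a response with write: the call may accept fewer bytes than requested, so a common pattern is to loop until everything has been handed to the kernel. The helper name write_all is ours.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: keeps calling write until all `len` bytes of buf
   have been handed to the kernel. Returns len on success, -1 on error.
   Intended for blocking descriptors. */
ssize_t write_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(fd, buf + sent, len - sent);
        if (n == -1) {
            if (errno == EINTR)
                continue; /* interrupted by a signal: retry */
            perror("write");
            return -1;
        }
        sent += (size_t)n; /* partial write: continue with the rest */
    }
    return (ssize_t)sent;
}
```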

Non-blocking sockets

By default, all sockets are created in blocking mode. If you call accept, read or write on such a socket, the call may block (wait). We can flag these sockets as non-blocking by using the fcntl system call, which performs various kinds of operations on file descriptors. Note that F_SETFL replaces the whole set of file status flags, so the usual practice is to read the current flags with F_GETFL first and then add O_NONBLOCK to them. Use the fcntl system call as follows:

File descriptor of the socket to be marked as non-blocking (the fd argument)

To set a flag (the cmd argument, use F_SETFL command)

The value of the flag we want to set (O_NONBLOCK)
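As a small sketch (the helper name set_nonblocking is ours), marking a descriptor as non-blocking while preserving its other flags could look like this:

```c
#include <fcntl.h>

/* Hypothetical helper: adds O_NONBLOCK to the file status flags of fd.
   Returns 0 on success, -1 on failure. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0); /* read the current flags */
    if (flags == -1)
        return -1;
    /* F_SETFL overwrites the flags, so OR the new one into the old set. */
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    return 0;
}
```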

Multiplexing I/O

Here’s the basic idea for writing non-blocking servers:

When a request message is sent by the client, it arrives over the net as packets of data. The packets received so far may not be the complete message. Instead of waiting for the remaining packets, the same thread can go ahead and read data or parts of messages from other clients. This is called multiplexing.

Similarly, a thread whose write is waiting on the peer (the send buffer drains only as the peer acknowledges data) can also be multiplexed. Instead of merely waiting, it can serve other clients.

By setting the O_NONBLOCK flag on a file descriptor, we instruct the system calls to return immediately with an indication that the work cannot be completed yet (they fail with the error EAGAIN or EWOULDBLOCK), instead of just waiting. In this way, the thread can continue to serve other clients, and finish the remaining task whenever the socket is ready to be served again.

Knowing which clients are ready to be served

The key to multiplexing I/O is to call read or write only on those file descriptors which are ready to be served. Two of the mechanisms the kernel provides for choosing such file descriptors are the select system call and the epoll_wait system call. epoll scales far better than select when there are many file descriptors.

In the historic style of programming, the select system call is preferred over the primitive fork. In this style, we create fd_sets corresponding to the file descriptors from which we want to read or to which we want to write. The select system call takes in these fd_sets and marks the file descriptors which are ready to be served. The lesson here: the fd_sets are modified in place by the system call, so they have to be rebuilt before every call, and the caller of select has to iterate through the ‘list’ of file descriptors to find out which of them are ready. This is an O(N) operation, and it is also limited by the maximum size of fd_set supported (FD_SETSIZE, typically 1024). However, once an interface is exposed, it cannot be changed. It was created this way because the creators wanted to have a ‘stateless’ implementation of select.
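The select style can be sketched as below. The helper name select_readable and its parameters are our own; it only watches for read readiness.

```c
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: waits up to timeout_ms for any of fds[0..count-1]
   to become readable and copies the ready ones into `ready`.
   Returns how many are ready, 0 on timeout, -1 on error. */
int select_readable(const int *fds, int count, int *ready, int timeout_ms)
{
    fd_set read_set;
    int max_fd = -1;

    FD_ZERO(&read_set); /* the set is rebuilt from scratch on every call */
    for (int i = 0; i < count; i++) {
        if (fds[i] < 0 || fds[i] >= FD_SETSIZE)
            return -1; /* select cannot watch descriptors this large */
        FD_SET(fds[i], &read_set);
        if (fds[i] > max_fd)
            max_fd = fds[i];
    }

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int n = select(max_fd + 1, &read_set, NULL, NULL, &tv);
    if (n <= 0)
        return n;

    int found = 0;
    for (int i = 0; i < count; i++) /* the O(N) scan mentioned above */
        if (FD_ISSET(fds[i], &read_set))
            ready[found++] = fds[i];
    return found;
}
```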

epoll: A better way to know which file descriptors are ready

The epoll system is an event notification facility. It monitors file descriptors for various kinds of events. Whenever a file descriptor is ready to be served, it is identified as an event by the epoll system. The application using epoll system gets notified about such events through the epoll_wait system call.

epoll_create system call:

To use the epoll facility, an epoll instance must be created using epoll_create system call as follows:

number of file descriptors we want to monitor as a ‘hint’ (the size argument)

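The hint was originally provided to enable the kernel to allocate enough memory upfront, which increases the efficiency of the subsequent epoll_ctl system calls. It need not be accurate. Since Linux 2.6.8 the kernel sizes its data structures dynamically and ignores the hint, but the value must still be greater than zero.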
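A minimal sketch of creating an epoll instance (the helper name create_epoll_instance is ours):

```c
#include <stdio.h>
#include <sys/epoll.h>

/* Hypothetical helper: returns the file descriptor of a new epoll
   instance, or -1 on failure. */
int create_epoll_instance(void)
{
    /* The size argument is only a hint; modern kernels ignore it,
       but it must be greater than zero. */
    int epfd = epoll_create(1);
    if (epfd == -1)
        perror("epoll_create");
    return epfd;
}
```

Newer code often calls epoll_create1(0) instead, which drops the size argument altogether.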

epoll_ctl system call:

File descriptors which need to be monitored for readiness can be added to or removed from the epoll instance using this system call. The nature of monitoring that is needed, like read readiness or write readiness, is also set using this system call. Use the epoll_ctl system call as follows:

the epoll instance that we created earlier (the epfd argument)

add / remove / modify (the op argument)

the file descriptor for which we want monitoring (the fd argument)

the kind of events we are interested in, along with user data (the epoll_event argument)

epoll_wait system call:

The epoll_wait system call reports the file descriptors that are ready for I/O. If none of the file descriptors are ready, it waits until the given timeout (in milliseconds) expires and then returns zero events; epoll_wait can then be called again within a loop. A timeout of -1 waits indefinitely, and a timeout of 0 returns immediately. If any file descriptor is ready, epoll_wait responds with an event for that file descriptor. In case of a read event, the read system call can be invoked. The epoll_wait system call is used as follows:

the epoll instance (the epfd argument)

the buffer in which the read / write ready events and their file descriptors are returned, and its capacity (the second and third arguments)

timeout in milliseconds (the timeout argument)
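A sketch of waiting for events follows; the helper name wait_for_events is an assumption of ours. It retries when a signal interrupts the call, which restarts the timeout, so treat it as an approximation.

```c
#include <errno.h>
#include <sys/epoll.h>

/* Hypothetical helper: waits up to timeout_ms for events on epfd and
   stores at most max_events of them in `events`.
   Returns the number of events, 0 on timeout, -1 on error. */
int wait_for_events(int epfd, struct epoll_event *events,
                    int max_events, int timeout_ms)
{
    int n;
    do {
        n = epoll_wait(epfd, events, max_events, timeout_ms);
    } while (n == -1 && errno == EINTR); /* interrupted by a signal */
    return n;
}
```

Only the first n entries of the buffer are meaningful after the call, so unlike select there is no scan over descriptors that are not ready.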

The structure of a non-blocking TCP server

The system calls above are only explained superficially so that grasping the details will become easier. All the details are covered in the links in this post.

The non-blocking TCP server can be implemented as follows:

1. Create a socket
2. Mark it as non-blocking (this will make even the accept call non-blocking)
3. Bind it to an address
4. Listen on the socket
5. Create an epoll instance
6. Add your socket to the epoll instance (this way the incoming requests can be monitored through event notification)
7. Create a read event queue
8. Create threads for processing tasks from read queue
9. Create a write event queue
10. Create threads for processing tasks in the write queue
11. Wait for events on epoll instance in a loop
12. For incoming requests events
    a. call accept
    b. mark the accepted socket as non-blocking
    c. add it to the epoll instance for monitoring
13. For read events, push the file descriptor and user data to read event queue
14. For write events, push the file descriptor and user data to write event queue
15. For close events, remove the file descriptor from the epoll instance
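The steps above can be sketched roughly as follows. This is a deliberately simplified, single-threaded echo server: it handles read events inline instead of pushing them to read / write queues served by worker threads, and it binds to the loopback address. The names make_listener, make_event_loop and serve_once are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    return flags == -1 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

static int watch(int epfd, int fd)
{
    struct epoll_event ev;
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

static void drop(int epfd, int fd)
{
    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL); /* stop monitoring */
    close(fd);
}

/* Create the socket, mark it non-blocking, bind and listen.
   Port 0 lets the kernel choose a free port. */
int make_listener(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (set_nonblocking(fd) == -1 ||
        bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        listen(fd, SOMAXCONN) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Create the epoll instance and add the listening socket to it. */
int make_event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    if (epfd != -1 && watch(epfd, listen_fd) == -1) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* One pass of the event loop. Returns the number of events handled
   (0 on timeout) or -1 on error. A server calls this in a loop. */
int serve_once(int epfd, int listen_fd, int timeout_ms)
{
    struct epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, timeout_ms);
    for (int i = 0; i < n; i++) {
        int fd = events[i].data.fd;
        if (fd == listen_fd) {
            /* Incoming requests: accept until the queue is drained
               (accept then fails with EAGAIN) or an error occurs. */
            for (;;) {
                int client = accept(listen_fd, NULL, NULL);
                if (client == -1)
                    break;
                if (set_nonblocking(client) == -1 || watch(epfd, client) == -1)
                    close(client);
            }
        } else {
            /* Read event: echo inline. A fuller server would queue this
               work, and keep unsent bytes until a write event arrives. */
            char buf[4096];
            ssize_t got = read(fd, buf, sizeof buf);
            if (got > 0) {
                if (write(fd, buf, (size_t)got) == -1 && errno != EAGAIN)
                    drop(epfd, fd);
            } else if (got == 0 || errno != EAGAIN) {
                drop(epfd, fd); /* client closed, or a real error */
            }
        }
    }
    return n;
}
```

A caller would create the listener, build the event loop, and then run serve_once forever, for example with a timeout of -1.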

The way forward

We have covered the basics with the intention of demystifying the non-blocking and highly scalable TCP server. This certainly is not enough to start coding. However, that never was the intention of this blog. To get going with coding, study the links provided in detail.

Hope it was enjoyable and simple enough. Simplifying it was not as simple though! 😉