Web Services using a TCP proxy server (Part 1 of 3)

Last edited 2003-04-13 by Jay Nelson

This tutorial demonstrates a simple, but non-trivial, example of a TCP proxy server using the OTP construct gen_server. The goal is to implement a process that can be used to deliver web services to HTML browsers. The source code is here. This tutorial contains three parts: A generic TCP proxy server based on gen_server How to implement the ProxyModule interface A Web Service using the TCP proxy server Please, report all errors, omissions or improvements to the author. Description of a proxy server The proxy server interface The gen_server behaviour Architecting a server design Implementing gen_server callbacks Implementing the Accept Connection process Implementing the ProxyModule interface ProxyModule explained Testing with an echo proxy server A caching web proxy server Deploying Web Services

Recognizing requests Extracting from web pages Synonym dictionary Pattern matching dictionary Zoning a page Pulling content from a zone Reconstructing web pages 1. Description of a Proxy Server We would like to provide Web Services, such as a page with 30 results from Google or a single page with news from Yahoo! France, England, US and Argentina, however we do not have access to program the server. Instead we provide a proxy server that relays information to the website and reformats the response for display in a web browser. The proxy server can be used for other tasks, but as described, it expects to receive requests over a TCP socket. Because we expect the proxy server to be transparently available at all times, we will use erlang's OTP to manage the proxy server. OTP provides a generic model of server functionality as well as the means to automatically restart the server if it fails, among other capabilities. 1.1 The proxy server interface The proxy server has a very simple interface. It can be started, it can list the clients that are currently attached to it and it can be stopped. Once it has been started, the parent application may send it requests (the clients of the server will be making requests using TCP socket requests rather than function calls). It is implemented using the gen_server module. The proxy server is linked to the calling process so the caller will know when the server goes down. The code to implement the proxy server interface is in Listing 1. Other aspects of the server implementation are described in the following sections. tcp_proxy.erl contains the full source code for the TCP proxy server. Listing 1. tcp_proxy external interface. -module(tcp_proxy). -behaviour(gen_server). -define(MSGPORT, 8000). -define(MAX_CONNECTS, 3). -export([start_link/1, start_link/2, start_link/3, start_link/5, report_clients/1, handle_request/3, stop/1]). start_link(ProxyModule) -> start_link(ProxyModule, ProxyModule, ?MSGPORT, ?MAX_CONNECTS, ?CLIENT_TAB). start_link(ProxyModule, MaxConnects) -> start_link(ProxyModule, ProxyModule, ?MSGPORT, MaxConnects, ?CLIENT_TAB). start_link(ProxyModule, Port, MaxConnects) -> start_link(ProxyModule, ProxyModule, Port, MaxConnects, ?CLIENT_TAB). start_link(ServerName, ProxyModule, Port, MaxConnects, ClientTab) -> case gen_server:start_link({local, ServerName}, ?MODULE, [ProxyModule,Port,MaxConnects,ClientTab], []) of {ok, Pid} -> start_listen(Pid, Port); {error, {already_started, OldPid}} -> {ok, OldPid}; Error -> error_logger:error_report([{start_link, Error}]), Error end. report_clients(Server) > gen_server:call(Server, report_clients). handle_request(Server, Request, Data) -> gen_server:call(Server, {msg, Request, Data}). stop(Server) -> gen_server:call(Server, stop). start_link/4 creates a new process that listens on the specified TCP port or the default port 8000. There is a maximum number of simultaneous connections allowed to prevent the server from being overloaded. ProxyModule is an external module that is used by the application to define what the TCP proxy server is supposed to do. Part 2 of this tutorial will explain how that module works using simple examples and Part 3 will describe a complete example for providing web services. The name of the proxy module is stored internally by the tcp_proxy server until it is needed later. gen_server:start_link/4 takes a name, a module, an init list and a list of system options. It creates a new process linked to the calling application with the name tcp_proxy registered locally as the name of the new process. All callbacks used by gen_server are implemented in the tcp_proxy module; all callbacks used by tcp_proxy are implemented in the ProxyModule module. tcp_proxy:init/1 will be called with the list [ProxyModule, Port, MaxConnects] to initialize the state of the proxy server. After it has initialized, it will start listening for TCP requests as well as application-level requests. None of the debug or system options are used. If the parent application needs a list of attached clients, report_clients/0 is called. It returns the list of clients along with some statistics about them by using the gen_server callback tcp_proxy:handle_call(report_clients, From, State) described below. When an application-level request is received by the proxy server in handle_request/2 , it passes the request to gen_server to generate a callback to tcp_proxy:handle_call({msg, Request}, From, State) which is implemented later. When an application-level command to stop the server is called, tcp_proxy passes it off to gen_server which in turn will callback to tcp_proxy:handle_call(stop, From, State) . So far, we have seen that a separate process is running the gen_server behaviour. Whenever the parent application needs to know anything about the server, or wants to shutdown the server, a function call is used that passes the work off to the gen_server process (via a message send occurring inside the call to gen_server:call/2 ). The reason for the indirection is that gen_server provides the main loop of the server and all the debug and system messages that can be monitored. Any application requests have to be interleaved with the system messages and any debug requests have to be enforced. Also, the server starts up with no state information but initializes a state variable to be used for subsequent requests. gen_server manages the state variable and only passes it to the tcp_proxy when the callbacks described in the next section are invoked. In addition, using callbacks is a way of encapsulating the format and protocol enforced by gen_server to communicate information between the application process and the gen_server process. The function call interface prevents protocol translation errors. 1.2 The gen_server behaviour gen_server is what is called a behaviour. It is a collection of functions that together provide a generic service that can be adapted to a particular application by supplying callbacks. The callbacks that must be provided to instantiate a behaviour are available through the function behaviour_info/1 as shown in Listing 2. gen_server is defined in the stdlib module of erlang. Listing 2. Displaying gen_server required callbacks. Erlang (BEAM) emulator version 5.2 [source] [hipe] Eshell V5.2 (abort with ^G) 1> gen_server:behaviour_info(callbacks). [{init,1}, {handle_call,3}, {handle_cast,2}, {handle_info,2}, {terminate,2}, {code_change,3}] The callbacks serve the following purposes: init(List) -> {ok, State} | {stop, Reason} Called when the server is started. This function must create the State that will be used in all subsequent callbacks, as well as set up any resources that are expected to be available by other callbacks. This function is invoked by gen_server:start_link and gen_server:start . handle_call(Request, From, State) -> {reply, Reply, NewState} | {stop, Error, NewState} Used by gen_server:call to make synchronous callbacks for service requests. The server will block until this function returns. This is typically where the meat of a gen_server implementation module will do its work. handle_cast(Request, State) -> Result Used by gen_server:cast to make asynchronous callbacks. Is analogous to handle_call, except that it is non-blocking and it does not allow a message to go back to the From requester since that argument is not supplied by gen_server. handle_info(Info, State) -> Result Called when a timeout occurs or a message other than a gen_server:call , gen_server:cast or system message is received. Typically the implementation of this function just logs the Info for later analysis, but it is also useful for monitoring linked processes because any EXIT messages caused by them will be reported as a call to this interface. terminate(Reason, State) Called when gen_server is about to end so that the implementation can clean up anything allocated in init/1 . It also gives a chance for statistics reporting, saving collected data for the next start up and generally to allow an orderly shutdown. code_change(OldVsn, State, Extra) -> {ok, NewState} Called when a live upgrade of the code version occurs to notify the application that the old code is about to disappear. This function should take care of any issues that might occur with the unloading of the current code version and then return an updated state to be used when the new code is loaded. 1.3 Architecting a server design We now need to make some architectural design decisions. The proxy server will accept connections via telnet, web browsers or other similar means. Many processes may connect simultaneously and they all need to be served responsively, even though some sessions may be long and others short. We also would like the server to be generally useful for a broad set of server applications that are accessible via TCP/IP. Because we will have more than one simultaneous session, the code is simpler if there is a separate process for each connection. Some applications may be concerned about overburdening the server, so we should keep track of the current set of active connections and update it as connections are started or dropped. Also, we should provide a means for limiting the number of connections, as well as reporting to the client that the server is too busy to accept more requests. To make the server generally useful, a ProxyModule callback behavior is implemented. The application writer will create an instance of a tcp_proxy server and supply the module name that contains the callbacks implementing the proxy functionality. For a browser cache, one ProxyModule would be written, for a webserver, a different ProxyModule would be provided. In either case, the same gen_server process, protocol and semantics would apply but the implementation of the proxy functionality would differ. Since there may potentially be a large number of clients, we need an efficient data structure for tracking them and a lightweight method of watching whether they are still connected including disconnections caused by process failure or network problems. We will use the built in Erlang Term Storage (ets) data structure to track them since it runs in memory, provides constant time lookup to large amounts of data and has a convenient interface for managing the data. If the client processes are linked to our server, we would have to trap EXITs to ensure a failing client doesn't take down the whole server. Instead we will use the erlang:monitor/2 function so that a poorly implemented ProxyModule has less impact on the tcp_proxy server. The only remaining decision is the protocol for accepting new connections. The gen_tcp:accept/1 function blocks the currently running process until a TCP request to connect is received. To prevent it from blocking the server so that it cannot respond to requests, the protocol must be to create a new process to use when waiting for a client connection. Now the problem becomes one of managing and communciating the current State of the server. While the Accept process is waiting, other calls may arrive which cause the server's internal State to change. When the Accept process finally receives a connection, the information must be communicated to the server process so that the connection bookkeeping can be kept current. The moment immediately after a connection is received becomes a key synchronization point between the two processes because the gen_server needs to control how many connections are active while the Accept process has no knowledge of this information. Synchronization of the two processes is achieved via a message. Rather than having to pass a complex message containing an aggregate State from the gen_server, it is easier to pass the information about the new connection to the gen_server. Since the gen_server needs to know when the connection is no longer active, even if the failure is caused by a process fault, it needs to monitor the process containing the socket. The Accept process already owns the socket, and the gen_server is already monitoring the health of the Accept process, why not use it as the process that manages the connection until the client session ends? This means that the Accept process will only accept one connection. The server must replace the Accept process with a new Accept process to allow a simultaneous connection. This approach reduces the communication protocol that would be necessary if the Accept process continued looping and allowing new connections, ensures that a single Accept process is not running forever (thus making hot swap to a new code version easier), and allows the gen_server to decide what to do when too many connections are active. This last issue becomes a communication problem if the Accept process is looping independent of the gen_server. Where possible it is best to solve problems by avoiding them altogether. 1.4 Implementing gen_server callbacks Let's start by implementing the initialization callback. Listing 3 shows the init/1 function and the start_listen/2 function. After these two run, the server has created the Accept process which is waiting for a TCP connection, while the first gen_server process is waiting for either an application-level request for information, a request to stop the server, or a notification by the Accept process that a new connection has arrived. Listing 3. tcp_proxy initialization functions. -export([init/1]). -record(tp_state, {tcp_opts = ?TCP_OPTIONS, % TCP Socket options port, % Port connections accepted on listen, % The listen socket instance accept_pid, % The current accept Process Id accept_deaths = 0, % The number of times accept EXITed clients, % ETS table of currently active clients client_count = 0, % Current number of active requests max_active = 0, % Maximum number of active allowed requests = 0, % Total number of requests handled server_busy = 0, % Number of times server was busy pmodule, % ProxyModule name pm_state % ProxyModule's dynamic state }). init([ProxyModule, Port, MaxConnects, ClientTab]) -> % Let ProxyModule initialize, then setup State and return. case catch ProxyModule:init() of {ok, ProxyState} -> process_flag(trap_exit, true), {ok, #tp_state{port = Port, pmodule = ProxyModule, pm_state = ProxyState, max_active = MaxConnects, clients = ets:new(ClientTab, [named_table])}}; Error -> {stop, Error} end. start_listen(Pid, Port) -> case gen_server:call(Pid, {listen, Port}) of ok -> start_accept(Pid); Error -> error_logger:error_report([{start_listen, Error}]), Error end. start_accept (Pid) when pid(Pid) -> case gen_server:call(Pid, {accept, Pid}) of ok -> {ok, Pid}; Other -> error_logger:error_report([{start_accept, Other}]), Other end. The init function receives the module name and port to listen on. It creates a TCP Proxy State record (of type tp_state) that will be used in all subsequent calls to keep track of any persistent information that might be needed for bookkeeping purposes. A record is commonly used to represent the state of a gen_server process. We want to keep track of the TCP options used, the name of the ProxyModule, the port to listen on, the actual socket used for listening, the Accept process that is listening for socket connections, some statistics about connections as well as the ets table of connected processes. Since we need to catch failures of the accept process, the gen_server must trap exits. The init function must return the State representation to the gen_server. Just as the gen_server provided an opportunity for tcp_proxy to initialize its internal state, this init function has to provide the ProxyModule a chance to initialize. Later we will see the ProxyModule:init/0 function used to allocate a cache for storing webserver names. The call must be wrapped in a catch so that we can detect failure and tell gen_server to shut the process down. The ProxyModule initialization returns a state that tcp_proxy has to manage and send to the ProxyModule whenever a callback occurs. start_listen makes a call to the gen_server requesting a callback to tcp_proxy:handle_call({listen, Port}, State) , and if that succeeds, it requests a callback to tcp_proxy:handle_call(accept, State) . The first callback will set up the listen socket, the second will spawn a new process to accept connections. All the alternatives for handle_call/3 are shown in Listing 4, Listing 5 and Listing 6. Listing 4. tcp_proxy:handle_call listen implementation. -export([handle_call/3]). % Set up a new listen socket to prepare for connections handle_call({listen, NewPort}, _From, #tp_state{port = OldPort, tcp_opts = TcpOpts} = State) -> % Open a new listen port case catch gen_tcp:listen(NewPort, TcpOpts) of % If we succeed, modify the state (possibly closing the old % listen port) and return a successful reply {ok, LSocket} -> NewPortState = State#tp_state{port = NewPort, listen = LSocket}, case OldPort of undefined -> {reply, ok, NewPortState}; Port -> {reply, ok, NewPortState}; % Close the old port, but don't pay attention to result Other -> Table = State#tp_state.clients, ets:delete_all_objects(Table), gen_tcp:close(OldPort), {reply, ok, NewPortState#tp_state{client_count = 0}} end; % If the call fails, tell the caller and don't change state or % close the existing listen port since terminate/2 will finish NotOK -> {stop, NotOK, State} end; The first clause of handle_call/3 starts listening on the TCP port using the TCP options for the socket. If this call occurs after the tcp_proxy is already listening on another port, the old port is closed and the new port becomes active. If we change anything about the server state that was passed in, it needs to be updated in the state record and returned to the gen_server. If this function returns a tuple that begins with the atom 'stop', gen_server will call tcp_proxy:terminate/2 as soon as it can. Listing 5. tcp_proxy:handle_call accept and connect implementations. % Accept connections in a new process, recording it for later reference handle_call({accept, Server}, _From, #tp_state{accept_pid = undefined, listen = LSocket} = State) -> Pid = spawn_link(fun() -> accept(Server, LSocket) end), {reply, ok, State#tp_state{accept_pid = Pid}}; % Handle an incoming connection by replacing the Accept process handle_call({connect, Pid, Socket, Server}, _From, #tp_state{accept_pid = Pid, client_count = Connects, max_active = Max, clients = Table, listen = LSocket, requests = Requests, pmodule = ProxyModule, pm_state = ProxyState} = State) -> % Start a new accept process NewAcceptPid = spawn_link(fun() -> accept(Server, LSocket) end), erlang:unlink(Pid), % Avoids handle_info when too many connected case Connects < Max of % Keep the new client but change to a lightweight monitor true -> MonitorRef = erlang:monitor(process, Pid), % Add the client process to the set of current connections ets:insert(Table, {MonitorRef, Pid}), {reply, {ok, ProxyModule}, State#tp_state{accept_pid = NewAcceptPid, client_count = Connects+1, requests = Requests + 1}}; % Too many connections are currently active false -> Busy = State#tp_state.server_busy, {reply, {error, too_many_connections, ProxyModule}, State#tp_state{accept_pid = NewAcceptPid, requests = Requests+1, server_busy = Busy + 1}} end; The second clause receives an open listen socket and accepts new connections to this socket from other processes and/or machines. Because accept is a blocking call which will wait forever for a connection, a new process needs to wait so that the control thread of this process can be returned to gen_server to handle application-level requests. When a new connection does arrive, the accept process makes a callback to connect handle_call passing the socket and server id. A new accept process is spawned to replace the old one which will only live on as a client connection. The client connection is unlinked and changed to a monitored process, added to the table of clients and the state statistics are updated, unless there are already too many client connections. Listing 6. tcp_proxy:handle_call msg request, report_clients and stop implementation. % Allow direct calls to functions defined in ProxyModule handle_call({msg, Request, Data}, From, #tp_state{pmodule = ProxyModule, pm_state = PMState} = State) -> case catch ProxyModule:Request(From, PMState, Data) of {ok, NewProxyState} -> {reply, ok, State#tp_state{pm_state = NewProxyState}}; % Report failure in other cases, but don't take server down Other -> Fun = atom_to_list(ProxyModule) ++ ":" ++ atom_to_list(Request) ++ "/3", error_logger:error_report([{Fun, Other}, {'State', State}]), % Only return a new state if the ProxyModule detected and updated case Other of {error, Error, NewPMState} -> {reply, ok, State#tp_state{pm_state=NewPMState}}; _Other -> {reply, ok, State} end end; handle_call(report_clients, From, #tp_state{pmodule = ProxyModule, accept_deaths = Deaths, client_count = Active, clients = Table, max_active = Max, requests = Connects, server_busy = Busy} = State) -> {reply, {ok, [{proxy_module, ProxyModule}, {active_clients, Active}, {clients, ets:tab2list(Table)}, {max_active_clients, Max}, {total_requests, Connects}, {server_busy, Busy}, {accept_failures, Deaths}]}, State}; handle_call(stop, _From, State) -> {stop, requested, State}. The ProxyModule may implement data structures or processes that need to be periodically managed. For example, if we build a caching proxy server that stores a copy of all webpages visited, there may need to be a periodic call to flush the cache, reap old entries, write it to disk, or perform other maintenance tasks. The {msg Request} message is what triggers an arbitrary function in the ProxyModule, which is invoked by calling tcp_proxy:handle_request/2 . Since we can't anticipate the needs of the ProxyModule, a flexible run-time interface is used. This function calls out by using the ProxyModule name stored in the State record coupled with the Request atom which should correspond to a function name in the ProxyModule. Again, the ProxyState is passed to the ProxyModule and returned, potentially updated. The application may request a list of the attached clients. This is returned by the report_clients clause of handle_call. For simplicity this function returns a proplist of the important values. A clean way to shutdown the server is to call 'stop' on it. This causes gen_server to issue a terminate call and then shutdown. This is exactly the effect of the application-level interface call tcp_proxy:stop/1 . Listing 7. tcp_proxy:handle_info implementation. -export([handle_info/2]). handle_info({'EXIT', Pid, Reason}, #tp_state{accept_pid = Pid, accept_deaths = Deaths, listen = LSocket} = State) -> error_logger:error_report([{'Accept EXIT', Reason}, {'State', State}]), ServerId = self(), NewPid = spawn_link(fun() -> accept(ServerId, LSocket) end), {noreply, State#tp_state{accept_pid = NewPid, accept_deaths = Deaths + 1}}; handle_info({'DOWN', MonitorRef, _Type, _Object, _Info}, #tp_state{clients = Table, client_count = OldCount} = State) -> case ets:member(Table, MonitorRef) of true -> ets:delete(Table, MonitorRef), {noreply, State#tp_state{client_count = OldCount - 1}}; false -> {noreply, State} end; handle_info(Info, State) -> error_logger:info_report([{'INFO', Info}, {'State', State}]), {noreply, State}. The function handle_info/2 is called whenever any otherwise unhandled message is received by gen_server. We need to handle the case when the accept process goes down and when a client connection is no longer active. When the accept process dies, a new accept process is spawned and the death is counted for later analysis. If a client connection goes down, the client process is removed from the client table and the active count is reduced. All others are treated as informational messages and are logged. Listing 8. tcp_proxy:handle_call implementation. -export([handle_cast/2, terminate/2, code_change/3]). % Just log any asynchronous gen_server requests handle_cast(Cast, State) -> error_logger:info_report([{'CAST', Cast}, {'State', State}]), {noreply, State}. % Log the failure of a linked process and then die handle_info({'EXIT', _Pid, Reason}, State) -> error_logger:error_report([{'INFO-EXIT', Reason}, {'State', State}]), {stop, Reason, State}; % Just log any other informational messages handle_info(Info, State) -> error_logger:info_report([{'INFO', Info}, {'State', State}]), {noreply, State}. % Log the termination reason and close any open listen port terminate(Reason, State) -> % Shutdown the ProxyModule first ProxyModule = State#state.module, ProxyModule:terminate(State#state.pm_state), % Then the Socket error_logger:error_report([{'TERMINATE', Reason}, {'State', State}]), case State#state.listen of undefined -> ok; Port -> gen_tcp:close(State#state.listen) end. % Just log a code change request code_change(OldVsn, State, Extra) -> error_logger:info_report([{'CODE-CHANGE-OLDVSN', OldVsn}, {'State', State}, {'Extra', Extra}]), {ok, State}. The callback for terminate/2 just closes the listen socket and doesn't bother to wait for a response to the request as shown in Listing 7. When the gen_server terminates, the attached Accept process will receive an 'EXIT' message and will be shutdown automatically unless it chooses to respond to the message in a different way. The remaining callbacks handle_cast/2 and code_change are not needed by this example but must be included to avoid a failure when they are called by gen_server. In the case of handle_info we include a handler for when an attached process dies as an example. 1.5 Implementing the Accept Connection process When we started the listener, it spawned a new process that was linked to the gen_server. The new process executed tcp_proxy:accept/1 which is shown in Listing 9. This function blocks waiting for a TCP connection. Once it receives one, it asks the gen_server to spawn a new accept process. The existing process continues on relaying the client request to the ProxyModule as shown in Listing 10. If there are too many clients connected, a call is made to ProxyModule:server_busy/1 so that it can properly relate the current status and then it is shut down. Listing 9. tcp_proxy:accept implementation. accept(Server, LSocket) -> % Wait for a client to connect case gen_tcp:accept(LSocket) of {ok, Socket} -> % Request a new accept socket then start ProxyModule response Result = gen_server:call(Server, {connect, self(),Socket,Server}), case Result of {ok, ProxyModule} -> relay(Server, ProxyModule, Socket); % Let ProxyModule issue a server busy message {error, too_many_connections, ProxyModule} -> % Socket refuses to accept data if it hasn't been read gen_tcp:recv(Socket, 0, 1000), ProxyModule:server_busy(Socket), gen_tcp:close(Socket) end; % Just let this process die if gen_tcp:accept fails % The handle_info callback will notice and restart it. NotOK -> error_logger:info_report([{"gen_tcp:accept", NotOK}]) end. tcp_proxy:relay/1 is the final function of the tcp_proxy module. It reads all the data from the TCP socket as a binary and then hands this information to the function ProxyModule:react_to/3 including the Socket connection so that the proxy module can reply to the requester if necessary. The relay function makes sure that the Socket is closed when it is finished. If the socket could not be read properly, or the callback to the proxy module fails, the event is logged before the relay process exits. This function could easily be modified to route the request to any of several different ProxyModules or processes based on the format or contents of the received binary data. Listing 10. tcp_proxy:relay implementation. relay(Server, ProxyModule, Socket) -> % Get the full TCP request case gen_tcp:recv(Socket, 0) of % Let the ProxyModule deal with it {ok, BinData} -> case catch ProxyModule:react_to(Server, Socket, BinData) of ok -> ok; {'EXIT', ok} -> ok; {'EXIT', Reason} -> error_logger:info_report([{atom_to_list(ProxyModule) ++ ":react_to/3 failed", Reason}]); Other -> error_logger:info_report([{atom_to_list(ProxyModule) ++ ":react_to/3 failed", Other}]) end; % Log any Socket receive problem NotOK -> error_logger:info_report([{"gen_tcp:recv/2", NotOK}]) end, % Close the socket in all cases gen_tcp:close(Socket). Continued in Part 2.