The problem
Web and application servers suffer from memory leaks: it’s a fact. Just try googling “<your favorite application server> memory leaks”, e.g. “apache memory leaks”, “aolserver memory leaks”, “tomcat memory leaks”, “weblogic memory leaks”…
Web and application servers are complex pieces of software. They have to:
- accept HTTP requests
- dispatch these requests to worker threads
- interoperate and integrate with other pieces of software written by third parties (e.g. the Apache modules)
- support applications written by various people and possibly in different programming languages
- create and maintain a pool of connections to a database
- support web services and possibly some other communication mechanisms
- and so on…
In all these areas there is room for bad memory handling. On top of that, most of the time the major source of memory allocation problems is neither the platform nor the application server itself, but the applications written for that platform without adhering to the framework, the logic and the behaviour expected by the application server.
The end result is that every now and then web and application servers require a restart (or even a reboot of the system they are running on). This is unacceptable in a 24/24 7/7 operation.
The Solution – Live with that (but in an intelligent way)
The solution proposed by Spazio IT is not to try to remove all memory leaks from a given application server and the applications running on top of it (this may end up looking like Don Quixote tilting at windmills) but to allow for those memory leaks.
The basic idea is to have a pool, a cluster of application servers, in which each server may suspend its activities because of a restart or reboot. Whatever a single application server does, the pool as a whole still offers full 24/24 7/7 availability.
AS Distributed Sandbox – How it works
AS (Application Servers) Distributed Sandbox is a cluster configuration extended with a simple cooperation/coordination mechanism. This coordination mechanism relies on two components:
- the Application Servers Coordinator (ASC) – running on the reverse proxy machine
- the Application Server Monitor (ASM) – running on each application server machine
This is how the two components work together:
- When the cluster starts, the ASC waits for registration requests from the ASMs.
- As soon as an application server machine starts, its ASM sends a registration request to the ASC.
- When a registration request is received, the ASC reconfigures the actual reverse proxy software (e.g. NGINX) so that it starts using the related application server.
- The ASC expects a heartbeat signal from each registered ASM. If it doesn’t receive this signal from a registered ASM, it considers that machine dead and removes it from the reverse proxy configuration.
- When a particular condition occurs on an application server machine (e.g. memory usage exceeds a predefined limit), the ASM running on that machine sends a reboot request to the ASC. The ASC instructs the reverse proxy to gracefully stop sending requests to that application server, waits for the reverse proxy reconfiguration to complete and finally grants the ASM permission to reboot. The ASM initiates a graceful reboot, and when the application server machine comes back up the ASM registers with the ASC again (and so on…).
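The steps above can be sketched in a few lines of Python. This is only an illustration of the coordination logic, not Spazio IT's implementation: the names Coordinator and needs_reboot, the 10-second heartbeat timeout and the 90% memory limit are all assumptions, and removing a server from the active set stands in for the actual reverse proxy reconfiguration.

```python
import time
from dataclasses import dataclass, field


def needs_reboot(used_bytes, total_bytes, limit=0.9):
    """ASM side: True once memory use exceeds the predefined limit (a fraction).

    The 0.9 default is an assumed value, not one taken from the article.
    """
    return used_bytes / total_bytes > limit


@dataclass
class Coordinator:
    """Hypothetical ASC state: the servers the reverse proxy may use."""
    heartbeat_timeout: float = 10.0                  # seconds (assumed)
    last_seen: dict = field(default_factory=dict)    # server -> last heartbeat time

    def _clock(self, now):
        # Allow an explicit time to be passed in, which keeps the logic testable.
        return time.monotonic() if now is None else now

    def register(self, server, now=None):
        """An ASM registers: the server enters the proxy rotation."""
        self.last_seen[server] = self._clock(now)

    def heartbeat(self, server, now=None):
        """Record a heartbeat; heartbeats from unregistered servers are ignored."""
        if server in self.last_seen:
            self.last_seen[server] = self._clock(now)

    def expire(self, now=None):
        """Remove servers whose heartbeat is overdue and return them."""
        now = self._clock(now)
        dead = sorted(s for s, t in self.last_seen.items()
                      if now - t > self.heartbeat_timeout)
        for server in dead:
            del self.last_seen[server]
        return dead

    def request_reboot(self, server):
        """Take the server out of rotation first, then grant the reboot."""
        if server not in self.last_seen:
            return False
        del self.last_seen[server]   # stands in for the proxy reconfiguration
        return True                  # permission to reboot

    def active(self):
        """Servers the reverse proxy should currently route to."""
        return sorted(self.last_seen)
```

In a real deployment every change to the active set would trigger a rewrite and reload of the reverse proxy configuration, and the reboot permission would only be sent once in-flight requests to that server had completed.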
The mechanism is fully scalable: in case of poor performance, or when more reliability or availability is needed, it is enough to add one or more application server machines to the cluster. Having a separate file system server and database server makes the application servers fully interchangeable.
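To make the scaling argument concrete: with NGINX as the reverse proxy, adding a machine amounts to one more server line in an upstream block. The sketch below generates such a block from the list of registered servers. The function name render_upstream, the upstream name app_servers and the addresses are hypothetical, and the article does not say how the ASC actually rewrites the proxy configuration.

```python
def render_upstream(name, servers):
    """Render an NGINX upstream block listing the given application servers."""
    if not servers:
        # An upstream block with no servers is assumed to be rejected by NGINX,
        # so refuse to generate one.
        raise ValueError("at least one application server is required")
    lines = [f"upstream {name} {{"]
    # Sort so that the generated file is stable regardless of registration order.
    lines += [f"    server {server};" for server in sorted(servers)]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_upstream("app_servers", ["10.0.0.2:8000", "10.0.0.1:8000"]), end="")
```

After writing the generated block to a file included by the main configuration, the coordinator would ask NGINX to reload it (for example with nginx -s reload), which lets requests already in progress finish under the old configuration.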
Pros and Cons
On the one hand, AS Distributed Sandbox requires a cluster configuration; on the other hand, many production systems already run in this configuration.
The database server and the file system server may become a bottleneck for the performance of the overall cluster. If this is the case, various countermeasures can be taken, e.g.:
- the database server itself can be implemented as a cluster of database servers (with its own replication and redundancy mechanisms)
- the file system could evolve from a shared file system exposed by a machine via NFS or CIFS to a file system on a SAN or, for huge configurations, to a distributed file system like HDFS or XtreemFS.
To prove the concept, Spazio IT has successfully converted a physical machine with 12 GB of RAM and a given amount of hard disk space running AOLserver + OpenACS + PostgreSQL into a cluster of 5 virtual machines (1 reverse proxy, 2 application servers, 1 database server and 1 file system server) using 16 GB of RAM in total and 32 GB of hard disk space in addition to the original amount. The cluster of virtual machines offers performance comparable to that of the original physical system. The cluster supports 24/24 7/7 operations; the original system does not.
Offered Services
Spazio IT offers consultancy services to configure and install the AS Distributed Sandbox and make your system able to support 24/24 7/7 operations.