Video Transcript

Let It Crash: Best Practices for Handling Node.js Errors on Shutdown

This blog post is adapted from a talk given by Julián Duque at NodeConf EU 2019 titled "Let it crash!".

Before coming to Heroku, I did some consulting work as a Node.js solutions architect. My job was to visit various companies and make sure that they were successful in designing production-ready Node applications. Unfortunately, I witnessed many different problems when it came to error handling, especially on process shutdown. When an error occurred, there was often not enough visibility on why it happened, a lack of logging details, and bouts of downtime as applications attempted to recover from crashes.

Julián: Okay. So, as Brian said, my name is Julián Duque, it will be in proper Spanish. I come from a very beautiful town in Colombia called Medellín. So, if you haven't gone there, please visit us. We have an amazing community, as Brian said. Right now, I work as a senior developer advocate for Heroku. So, I live in the United States. Sadly, I'm away from my country, but I'm always in communication with my community, and that's pretty much through two main conferences that I organize. One is the NodeConf Colombia and the other is the JSConf Colombia. So, I know if you are like me right now, you are needing coffee. I'm needing coffee too. It's super early. So, please don't crash now. Let's wait until my talk finishes, and we can have some coffee to keep us awake.

Julián: So, a little bit of some background about this talk, why I presented this. These are pretty much lessons learned while I was working at NodeSource, previously. I was doing consulting work as a solutions architect, pleasing the customer, making sure they were using Node.js properly and they were successfully using Node. And I saw a lot of different bad patterns out there on how other companies were doing error handling, and especially when the process were crashing or the process were dying. They didn't have enough visibility. They didn't have logging strategies in place. They were missing the very important information about why the Node processes were having issues or were crashing. They were experiencing downtime, and we started to collect in a set of best practices and recommendations for them, that are aligned with the overall Node.js community.

Julián: If you go to the documentation, there are going to be pretty much the same recommendations that I'm going to be speaking about today. We add a couple of other things to make sure you have a very good exit strategy for your Node.js processes. These best practices apply pretty much to web and network-based applications because we are also going to cover graceful shutdowns, but you can use them for other types of Node.js applications that are constantly running. And Node, sadly, is not Erlang. If you know about Erlang, "let it crash" is a term that's very common in that community. When I started learning Erlang back in 2014, I loved the fault tolerance options that this platform and language has. And I always think about how to bring the same experience into Node.js. It's not the same because you can't do hot code reloading or function swapping on Node. You can do those things on Erlang, but still, Node is pretty lightweight, and you can easily restart and recover from a crash.

Julián: First, before getting into the bad place or when bad things happen, how to make sure that everything is good? What do we need to do to our Node applications to make sure they are running properly? So, first, as a recommendation, and there is going to be a workshop later about this specific thing, Cloud Native JS, don't miss this workshop by Beth. She's also going to mention how to add health checks to your Node.js processes. So, pretty much as our recommendation, add a health check route. It's a simple route that is going to return a 200 status code, and you will need to set something to monitor that route. You can do it at your load balancer level. If you are using a reverse proxy or a load balancer like nginx or HAProxy, or you're using ELB, ALB, any type of application that is the top layer of your Node.js process can be constantly monitoring that the health check is returning okay. So you are making sure that everything is fine.

Julián: And also, rely on APM, some tools that are going to monitor the performance and the health of your Node.js Processes. So, in order to make sure that everything is running fine, you will need to have tools, some very known tools, New Relic, App Dynamics, Dynatrace, and N|Solid. A lot of them in the market will give you way more visibility around the health of your Node.js processes, and you can live in peace when you are making sure your Node is running properly. But what to do if something bad and unexpected happens? So, what should we do with our Node.js processes? Letting them crash. If something bad and unexpected happened, I will let my Node.js process crash, but in order to be able to do it and drive, we will need to implement a set of best practices and follow some steps to make sure that the application is going to restart properly and continue running and serving to our customers and clients.

Julián: Before letting it crash, we will need to learn about the process lifecycle, especially on the shutdown side of things, some error handling best practices. There is going to be also another very recommended workshop around it. I'm not going to be covering how to properly handle errors in Node.js, just on shutdown, and this is pretty much so you stop worrying about unexpected errors and increased visibility of your Node.js processes, increased visibility of what happened when your process crashes and what might be the reason, so you can fix it and iterate over your application. So, similar to coming back to the Erlang concept, a Node.js process is very lightweight. It's a small in memory. It doesn't have a very big memory footprint, and the idea is to keep the processes very lean at a startup, so they can start like super fast. If you have a lot of operations, like high intensive CPU or synchronous operation at a startup, it might decrease the ability to restart super fast, your Node.js processes.

Julián: So, try to keep your processes very lean on startup. Use strategies like prebuilding, so you are not going to build on startup or on the bootstrap of your process. Do everything before you are going to start your process, and if something unexpected and bad happens, just exit and start a new Node.js process as soon as possible to avoid downtime. And pretty much this is called a restart. You're letting the process crash, and then starting a new one. But we will need to have some tools in place and settings to be able to have something that restarts all our Node.js processes. So, let's learn how to exit a Node.js process. So, there are two common methods on the process module that will help you to shut down or terminate a Node.js process. The most common one is process.exit. You can pass an exit code, zero if it's a successful exit or higher than zero, commonly one, if it's a failure. And this pretty much instructs Node.js to end the process with the specified exit code.

Julián: And there is the other one, which is process.abort. process.abort is going to cause the Node.js process to exit immediately and generate a core file, if your operating system has core dumps enabled. So, that's in order to be able to have more visibility on postmortem debugging, to be able to see what happened or what crashes your Node.js process. If there is a memory issue, you can call process.abort, it will generate a core dump, and then you can use tools like llnode, which is a plugin for lldb, to do C and C++ debugging of the core dump, and to see what might have happened on the native side of Node.js when your process crashes. So those are the two options you have to exit the Node.js process. How to handle exit events? So Node.js emits two different or two main events when your Node.js process is exiting. One is the beforeExit. So the beforeExit, it's a handler that can make asynchronous calls and the event loop will continue to work until it finishes.

Julián: So before the process is ending, you can schedule more work on the event loop, do more asynchronous tasks and then you can clean up your process. This event is not emitted on conditions that are causing explicit termination like on an uncaught exception or when I explicitly call process.exit. So this is for all other exit scenarios. And the exit event, it's a handler that can't make asynchronous calls. Only synchronous calls can happen in this part of the process life cycle because the event loop doesn't have any more work to do. So the event loop is paused in here. So, if you try to do any asynchronous call in here, it's not going to be executed. Only synchronous calls can happen here and this event is emitted when process.exit is called explicitly. It's commonly used if you want to log some information at the end, when your process exits with a specific exit code and you want to add some more context around the state of your application at the time that the process exits.

Julián: Some examples of how to use it. You attach those events on the process module. The beforeExit can do asynchronous code, so that setTimeout, even though the event loop is paused at that moment, when you schedule more asynchronous work, it will revive the event loop and continue until there is no more work to do. There's one thing I want to mention here, which is that normally a Node.js process exits when there is no more work scheduled on the event loop. When there is nothing else on the event loop, a process is going to exit. How does a server keep running? Because it has a handle registered on the event loop, like a socket waiting for connections, and that's why a web server is constantly running until you close the server or you interrupt the process. As long as there is something registered on the event loop, the Node.js process is going to continue running. So in this case, I execute setTimeout, schedule more work, and it will continue working until there is no more left to do.

Julián: On process.exit, pretty much just synchronous calls. I can't do anything here with the event loop. The event loop is totally paused, which is useful for logging information or saving the state of the application and exiting. There are a couple of signal events that are going to be useful on shutdown. There is the SIGTERM and SIGINT. SIGTERM, it's normally emitted when a process monitor sends a termination signal to your Node.js process to tell it that there is going to be a successful way to shut down your process. When you execute on systemd or using upstart, when you stop that service or stop that process, it's going to send that SIGTERM to your Node.js process and you can handle that event and do some work on that specific part of the life cycle. And the SIGINT, it's an interruption. It is emitted when you interrupt the Node.js process, normally when you do control-C when you are running on the console. You can also capture that event and do some work around it.

Julián: So these are two ways to expectedly finalize a Node.js process. So these two events are considered a successful termination. This is why I'm exiting here with the exit code zero because it is something that is expected. I say I don't want this process to continue running. And there are also the error events. So there are two main error events. One is the uncaughtException, the famous uncaughtException. And more recently, since promises were introduced to Node, the unhandledRejection. So the uncaught exception is emitted when a JavaScript error is not properly handled. So it pretty much represents a programmer error or represents a bug in your code. If an uncaughtException happens, the recommendation is to always crash your process, let it crash. Don't try to recover from an uncaughtException because it might give you some troubles. And even though the community does not totally agree on the second one,

Julián: I will say the same for an unhandledRejection. An unhandledRejection, it is emitted when a promise is rejected and there is no handler attached to the promise. So there is no catch attached to the promise. It may represent an operational error, it may represent a programmer error, so it depends on what happened here. But in both of those cases, it's better to log as much information as possible. Treat those as P1 issues that need to be fixed in the next iteration or in the next release. So if you don't have any strategy in place to be able to identify why your processes are crashing and you are not fixing and handling those properly, your applications are going to keep having bugs. So if it is an uncaught exception, that's a bug, that's a programming error, that is something that is not expected. Please crash, log and file an issue, so that it gets fixed.

Julián: If it is an unhandled rejection, see if this is a programmer error or if it's an operational error that needs to be handled and go update the code, add the proper handling to that promise and continue with your job. So as I say in both cases an error event, it's a cause of termination for your Node.js process. Always exit the process with an exit code different than zero. So it's going to be one. So your process monitor and your logs know that it was a failure and as I say, don't try to recover from an uncaught exception. While I was working as a consultant, I saw a lot of people trying to do a lot of magic to avoid the Node.js processes dying by adding some complex logic on uncaught exception. And that always ended your application on a bad state. They were having memory leaks or having sockets hanging and it was a mess. So it's cheaper to let it crash, start a new process from a scratch and continue receiving more requests.

Julián: So a couple of examples on uncaught exception and unhandled rejection. The uncaught exception receives as an argument an error instance. So you get the information about the error that was thrown or that wasn't handled in your Node.js code. And the unhandled rejection is going to give you a reason, which can be an error instance too, and it will give you the promise that was not properly handled. So those are useful pieces of information that you can have in your logs to know more about where things are failing in your code. But we saw how to handle the events, how to handle the errors, some of the best practices, but how to do it properly? What do we need to do better to be able to have a very good shutdown strategy for Node.js processes? So the first one is running more than one process at the same time. So rely on scaling load-balanced processes, having more than one. So in that way, if one of those processes crashes, there is another process that is alive and it's able to receive requests.

Julián: So it will give you time to do the restart and handle all the requests that are coming in. And maybe the only issue you are going to have is with the requests that were already happening in the Node.js process that crashes. But this is going to give you a little bit more leverage and prevent downtime. And what do you use for load balancing? Use whatever you have in hand. It can be nginx or HAProxy as a reverse proxy for your Node.js applications. If you are on AWS or on the cloud, you can use their Elastic Load Balancer, Application Load Balancers or the other load balancer solutions that the cloud offers. If you are on Kubernetes, you can use Ingress or other different load balancer strategies for your application. So pretty much make sure that you have more than one Node.js process running, so you can be more at peace if one of those processes crashes. You will need to have process monitoring, and process monitoring is pretty much something that is running in your operating system or an application that is constantly checking if your process is alive or not.

Julián: If it crashes, if there is a failure, the process monitor is in charge of restarting the process. So, the recommendation is to always use the native process monitoring that is available on your operating system. If it's Unix or Linux, you can use systemd or upstart, specifically adding the restart on failure, or respawn when you are working on upstart. If you are using containers, use whatever is available. Docker has the restart option, Kubernetes has the restart policy, and you can also configure your processes to retry a number of times when they fail. So you don't go into a crazy error that is going to constantly make your application crash and you end up in a crash loop. So you can add some retries in there but always have process monitoring in place. If you can't use any of these tools, as a last resort--but not recommended--use a Node.js process monitor like PM2 or forever.

Julián: But I will not recommend these to any customer of mine or any friend, but if you don't have any other resource, if you can't use the native stuff in your operating system or if you are not using containers, you can go this way. These tools are good for development. Don't get me wrong. If you are in development, they're very good tools to restart your processes when they crash. But for production, they might not be the best. Let's talk a little bit about graceful shutdown. So we have a web server running. The web server is getting requests and it's getting connections. Sometimes we have some established connections between our customers or clients and the server. But what happens when the process crashes? When the process crashes, if we are not doing a graceful shutdown, some of those sockets are going to be kept hanging and are going to wait until a timeout has been reached, and that might cause downtime and a degraded experience for your users. So it is better. So setting up an unreferenced timeout is going to let the server do its job.

Julián: So, we will need to close the server, it's explicitly say to the server, stop receiving connections so they can reject the new connection. So new connections are going to the new or to the other Node.js process that is running through the load balancer and it will be able to send a TCP packet to the clients that are already connected. So they are going to be finishing the connection immediately when the server dies. They are not going to stay waiting until a timeout is reach out. They are going to be closing that connection and on the next retry, we expect that the process has restarted at that point or they go to another process that is running. So one example of that, un-reference time out, when we are handling the signal or error event, which is the shutdown part of the life cycle. What we can do, it's too explicitly call server.close. If it is an instance of the net server, which is the same one that uses the http or https, Node modules, you can pass a callback.

Julián: So when it finishes closing the connection, it will exit the process successfully. But we will need to have our timeout in place because we don't want to wait for a long time. Imagine if we had a lot of different clients connected that it's taken a lot of time to clean up those processes. We need to have some way to have an internal timeout. So here, we are scheduling a new timeout, but that timeout is not on the event loop. That last part the, unref is not the scheduling the timeout on the event loop, so it is not adding more work to the event loop. So when the timeout is reach or the server close callback is reach, either of those paths are going to close the Node.js process. So this is a race between the two, between your time out that is not in the event loop or between the server close, whichever works better. And what timeout time we do need to put here depending on the needs of your applications.

Julián: We had customers that had the need to have very short timeouts or a small timeout because they were doing a lot of real-time trading and they needed the processes to restart as fast as possible. There are others that can have longer timeouts to lead, or when the connection finishes, so this depends on the use case. If you don't add the unref in here, since this timeout is going to be scheduled on the event loop, it's going to wait until it finishes and the process is going to end. So this is like a safeguard. So there is no more work scheduled on the event loop while we are exiting our process. Logging, this is one of the most important parts of having a very good exit strategy for Node.js processes. So implement a robust logging strategy for your application, especially on shutdown. If an error happens, please log as much information as possible. An error object will contain not only the message or the cause of the error, but it will also contain the stack trace.

Julián: So if you log the stack trace, you will be able to come back to your code and look specifically at why it failed and where it failed, and fix it. And you can rely on libraries like pino or winston and use transports to store the logs in an external service. You can use something like Splunk or Papertrail or use whatever you like to store the logs. But have a way to always go back to the logs, search for those uncaught exceptions and unhandled rejections and be able to identify why your processes are crashing. Fix those issues and continue with your work. So how can we put this all together? I have a pattern I use on my projects, but there are also a lot of modules on npm that are going to do the same thing even better than the approach I'm following here. So this is a pattern I use. I create a module called terminate or I use a file called terminate. I pass the server, the instance of that server that I'm going to be closing, and some configuration options, if I want to enable core dumps or not, and the timeout.

Julián: Usually when I want to enable the core dump of Node, I use an environment variable. When I am going to do some performance testing on my application or I want to replicate the error, I enable the core dump. I let it crash with the process.abort, I check out the core dump and get more information about it. So here, I have our exit function that switches between the abort or the process.exit, depending of the configuration you have here. And the rest, I'm returning a function that returns a function and that function is the one that I'm going to be using as the exit handler. And this is pretty much the code that I'm going to be using for uncaught exceptions, unhandled rejections, and signals. And here, log as much as possible. I'm using console log for simplicity, but please use a proper logging library here. And pretty much if there is an error and if that is an instance of the error, I want to get information about the message and the stack trace. And at the end, I'm going to be trying doing the graceful shutdown.

Julián: So this is the same thing I explained before. I will close this server and also I will have a timeout to also close the server after that timeout happens. So it depends on whatever ends first. And how to use this small module I have here, this is an example. I have an HTTP server. I have my terminate code that I use for my project. I create an exit handler with the server I'm running and with the different options I want to pass into my exit handler, and I attach that function to the different events. So here with the exit handler, on uncaught exception and unhandled rejection, I'm going to return an exit code of one and I can add a message to my logs to say what type of error or what type of handling this was, and the same with the signals. And with the signals, I'm passing an exit code of zero because it is something that is going to be successful.

Julián: So this is pretty much what I have for today and the presentation, and some resources that are going to be useful for you. Please don't miss Ruben Bridgewater's workshop later today. It's going to be called "Error Handling: doing it right". Again, it's going to be explaining how to avoid getting here. How to avoid getting into the uncaught exception side of things? How to properly create the error objects to have more visibility? How to handle promise rejections? So, this is going to be a very good presentation, and also the Cloud Native JS workshop by Beth. She's also going to be mentioning how to add monitoring and health checks to your application. So those are going to be good things if you want to run Node.js properly in production. Some npm modules to take a look at that pretty much solve the issue I was talking about today. There is a module I like, terminus, by the team at GoDaddy.

Julián: It supports adding health checks to your application. It has signal handlers too. It has a very good graceful shutdown strategy, way more complex than the one I presented to you. This is something that you can add to your projects pretty easily. Just create an instance of terminus, configure it, and add the different handlers there. There is another module called stoppable. Stoppable is a decorator over the server class that implements not a close function, but a stop function, and it's also going to be doing a lot of things around graceful shutdown. And there is also a module that pretty much is what I presented today. It's called http-graceful-shutdown. You also pass an instance of your HTTP server and it has different handlers, so you can see what happened when there is an error or which signals it's going to be monitoring.

Julián: It's pretty much... These are all going to be resources that are going to simplify your life and make you better at running Node in production, and you will be able to let it crash. One last thing, I want to invite you to NodeConf Colombia, so save the date. This is going to happen June 26 and 27, 2020. It's going to happen in Medellín, Colombia. More information at nodeconf.co. The CFP is not open yet, but I will expect a lot of you all sending proposals to go to Medellín. We pay for travel, we pay for a hotel. And if you want to know a little bit about the experience of speaking at a conference in Colombia, you can ask James, you can ask Anna, and I think you can ask Brian. There are a couple of folks here that have spoken there, and thank you very much. This is it.

We started to assemble a collection of best practices and recommendations on error handling, to ensure they were aligned with the overall Node.js community. In this post, I'll walk through some of the background on the Node.js process lifecycle and some strategies to properly handle graceful shutdown and quickly restart your application after a catastrophic error terminates your program.

The Node.js process lifecycle

Let's first explore briefly how Node.js operates. A Node.js process is very lightweight and has a small memory footprint. Because crashes are an inevitable part of programming, your primary goal when architecting an application is to keep the startup process very lean, so that your application can boot up quickly. If your startup operations include CPU-intensive work or synchronous operations, they might slow down how quickly your Node.js processes can restart.

A strategy you can use here is to prebuild as much as possible. That might mean preparing data or compiling assets during the build process. It may increase your deployment times, but it's better to spend that time outside of the startup process. Ultimately, this ensures that when a crash does happen, you can exit a process and start a new one without much downtime.

Node.js exit methods

Let's take a look at several ways you can terminate a Node.js process and the differences between them.

The most common function to use is process.exit(), which takes a single optional argument, an integer exit code. If the argument is 0, it represents a successful exit state. If it's greater than that, it indicates that an error occurred; 1 is a common exit code for failures here.

Another option is process.abort(). When this method is called, the Node.js process terminates immediately. More importantly, if your operating system allows it, Node will also generate a core dump file, which contains a ton of useful information about the process. You can use this core dump to do some postmortem debugging using tools like llnode.
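One way to switch between the two exit methods is a small helper like the sketch below. The createExit name, the coredump option, and the ENABLE_COREDUMP environment variable are all assumptions made for illustration; none of them is part of the Node.js API.

```javascript
// Hypothetical helper: picks how the process should terminate.
// `coredump` is an assumed option name, not a Node.js setting.
function createExit ({ coredump = false } = {}) {
  return code => {
    if (coredump) {
      // Terminates immediately; a core file is written only if the OS allows it.
      process.abort()
    } else {
      // Terminates with the given exit code after running 'exit' handlers.
      process.exit(code)
    }
  }
}

// Assumed convention: opt in to core dumps through an environment variable.
const exit = createExit({ coredump: process.env.ENABLE_COREDUMP === 'true' })

// Calling exit(1) here would end the process and report a failure.
```

Keeping the choice behind one function means the rest of the shutdown code only ever calls exit(code), and core dumps can be turned on for a debugging session without touching that code.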

Node.js exit events

Node.js is event-driven: the global process object is an event emitter, so you can listen for lifecycle events and act on them. When a Node.js process exits, it emits several of these events.

One of these is beforeExit, which is emitted when the event loop has run out of work and the process is about to exit. You can provide an event handler which can make asynchronous calls, and the event loop will continue to perform the work until it's all finished. It's important to note that this event is not emitted on explicit termination, such as a process.exit() call or an uncaught exception; we'll get into when you might use this event a little later.

Another event is exit, which is emitted when process.exit() is explicitly called or when the event loop has no more work to do. As it fires after the event loop has stopped, you can't do any asynchronous work in this handler.

The code sample below illustrates the differences between the two events:

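process.on('beforeExit', code => {
  // Can make asynchronous calls
  setTimeout(() => {
    console.log(`Process will exit with code: ${code}`)
    // Exit explicitly: the timer counts as new work, so without this call
    // beforeExit would be emitted again once the timer has run
    process.exit(code)
  }, 100)
})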

process.on('exit', code => {
  // Only synchronous calls
  console.log(`Process exited with code: ${code}`)
})
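To see when each event fires, the following sketch runs the same handlers in two child processes: one that ends because the event loop drains, and one that calls process.exit() explicitly. The runSnippet helper is a hypothetical name introduced here; it relies only on child_process.spawnSync and node -e.

```javascript
const { spawnSync } = require('child_process')

// Hypothetical helper: runs a snippet in a child Node.js process and
// returns its exit status and the lines it printed.
function runSnippet (source) {
  const result = spawnSync(process.execPath, ['-e', source], { encoding: 'utf8' })
  return { status: result.status, lines: result.stdout.trim().split('\n') }
}

const handlers = `
  process.on('beforeExit', () => console.log('beforeExit'))
  process.on('exit', code => {
    // Asynchronous work scheduled here is abandoned
    setTimeout(() => console.log('never printed'), 0)
    console.log('exit ' + code)
  })
`

// The event loop drains on its own: both events are emitted.
const natural = runSnippet(handlers)
console.log(natural.lines) // [ 'beforeExit', 'exit 0' ]

// Explicit process.exit(): beforeExit is skipped, exit still fires.
const explicit = runSnippet(handlers + 'process.exit(3)')
console.log(explicit.lines, explicit.status) // [ 'exit 3' ] 3
```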

OS signal events

Your operating system can also notify your Node.js process about circumstances occurring outside of your program. These notifications are referred to as signals, and Node.js exposes them as events on the process object. Two of the more common signals are SIGTERM and SIGINT.

SIGTERM is normally sent by a process monitor to ask Node.js to terminate gracefully. If you're running systemd or upstart to manage your Node application, and you stop the service, it sends a SIGTERM signal so that you can handle the process shutdown.

SIGINT is sent when a Node.js process is interrupted, usually as the result of a Ctrl+C (^C) keyboard event. You can also capture that event and do some work around it.

Here is an example showing how you may act on these signal events:

process.on('SIGTERM', signal => {
  console.log(`Process ${process.pid} received a SIGTERM signal`)
  process.exit(0)
})

process.on('SIGINT', signal => {
  console.log(`Process ${process.pid} has been interrupted`)
  process.exit(0)
})

Since both signals represent an expected, successful termination, we call process.exit() with an argument of 0.

JavaScript error events

At last, we arrive at higher-level error types: the events the process emits when a JavaScript error goes unhandled.

When a JavaScript error is not properly handled, an uncaughtException is emitted. These suggest the programmer has made an error, and they should be treated with the utmost priority. Usually, it means a bug occurred in a piece of logic that needed more testing, such as calling a method on a null value.

An unhandledRejection error is a newer concept. It is emitted when a promise is rejected (it failed) and there is no handler attached to respond. These errors can indicate an operational error or a programmer error, and they should also be treated as high priority.

In both of these cases, you should do something counterintuitive and let your program crash! Please don't try to be clever and introduce some complex logic trying to prevent a process restart. Doing so will almost always leave your application in a bad state, whether that's having a memory leak or leaving sockets hanging. It's simpler to let it crash, start a new process from scratch, and continue receiving more requests.

Here's some code indicating how you might best handle these events:

process.on('uncaughtException', err => {
  console.log(`Uncaught Exception: ${err.message}`)
  process.exit(1)
})

We’re explicitly “crashing” the Node.js process here! Don’t be afraid of this! More often than not, it is unsafe to continue. The Node.js documentation says:

Unhandled exceptions inherently mean that an application is in an undefined state...The correct use of 'uncaughtException' is to perform synchronous cleanup of allocated resources (e.g. file descriptors, handles, etc) before shutting down the process. It is not safe to resume normal operation after 'uncaughtException'.

process.on('unhandledRejection', (reason, promise) => {
  console.log('Unhandled rejection at ', promise, 'reason:', reason)
  process.exit(1)
})

unhandledRejection is such a common error that the Node.js maintainers have decided it should crash the process, and Node.js warns us that a future version will do exactly that:

[DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Run more than one process

Even if your process startup time is extremely quick, running just a single process is a risk to safe and uninterrupted application operation. We recommend running more than one process and using a load balancer to distribute requests among them. That way, if one of the processes crashes, another process is still alive and able to receive new requests. This gives you a little more leverage and helps prevent downtime.

Use whatever you have on hand for the load balancing. You can configure a reverse proxy like nginx or HAProxy to do this. If you're on Heroku, you can scale your application to increase the number of dynos. If you're on Kubernetes, you can use Ingress or other load balancer strategies for your application.
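If none of those options is available, Node's built-in cluster module is one more way to run several processes on a single machine. The following is only a minimal sketch of that alternative, not a replacement for a real load balancer. The startWorkers name and its clusterApi parameter are made up for this illustration; the parameter exists only so the function can be exercised without forking real processes.

```javascript
const cluster = require('cluster')
const os = require('os')

// Fork `count` workers and fork a replacement whenever one of them dies.
// `clusterApi` defaults to the real cluster module; it is injectable only
// so this sketch can be tried out without starting real processes.
function startWorkers (count = os.cpus().length, clusterApi = cluster) {
  for (let i = 0; i < count; i++) {
    clusterApi.fork()
  }

  clusterApi.on('exit', (worker, code, signal) => {
    console.log(`Worker ${worker.process.pid} exited (code: ${code}, signal: ${signal}), starting a replacement`)
    clusterApi.fork()
  })
}

// Typical use (not executed here):
// if (cluster.isMaster) {   // called `isPrimary` on Node.js 16 and later
//   startWorkers()
// } else {
//   require('http').createServer((req, res) => res.end('ok')).listen(3000)
// }
```

Each worker is still a single process that is allowed to crash; the primary process simply makes sure another one takes its place.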

Monitor your processes

You should have process monitoring in place: something running in your operating system or application environment that's constantly checking whether your Node.js process is alive. If the process crashes due to a failure, the process monitor is in charge of restarting it.

Our recommendation is to always use the native process monitoring that's available on your operating system. For example, if you're running on Unix or Linux, you can use systemd or upstart. If you're using containers, Docker has a --restart flag, and Kubernetes has restartPolicy, both of which are useful.

If you can't use any existing tools, use a Node.js process monitor like PM2 or forever as a last resort. These tools are okay for development environments, but I can't really recommend them for production use.

If your application is running on Heroku, don’t worry—we take care of the restart for you!

Graceful shutdowns

Let's say we have a server running. It's receiving requests and establishing connections with clients. But what happens if the process crashes? If we're not performing a graceful shutdown, some of those sockets are going to hang around and keep waiting for a response until a timeout has been reached. That unnecessary time spent consumes resources, eventually leading to downtime and a degraded experience for your users.

It's best to explicitly stop accepting connections, so that the server can finish and close its existing connections while it's recovering. Any new connections will go to the other Node.js processes running behind the load balancer.

To do this, you can call server.close(), which tells the server to stop accepting new connections. Most Node servers implement this method, and it accepts a callback function as an argument.

Now, imagine that your server has many clients connected, and the majority of them have not experienced an error or crashed. How can you close the server without abruptly disconnecting valid clients? We can use a timeout: if all the connections don't close within a certain limit, we shut down the server completely. This gives existing, healthy clients time to finish up without making the server wait an excessively long time to shut down.

Here's some sample code of what that might look like:

process.on('<signal or error event>', _ => {
  server.close(() => {
    process.exit(0)
  })
  // If server hasn't finished in 1000ms, shut down process
  setTimeout(() => {
    process.exit(0)
  }, 1000).unref() // Don't let this timer alone keep the event loop alive
})
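
If you'd rather know which of the two paths was taken, the same close-or-timeout idea can be wrapped in a promise. This is just a sketch under my own naming: closeGracefully and the 'closed' and 'timeout' result strings are made up for this example and are not part of any Node.js API.

```javascript
// Stop accepting new connections, then resolve once every open connection
// has ended or `timeoutMs` has elapsed, whichever comes first.
// Resolves with 'closed' or 'timeout' so the caller can log what happened.
function closeGracefully (server, timeoutMs = 1000) {
  return new Promise(resolve => {
    const timer = setTimeout(() => resolve('timeout'), timeoutMs)
    server.close(() => {
      clearTimeout(timer)
      resolve('closed')
    })
  })
}

// Typical use (not executed here):
// process.on('SIGTERM', async () => {
//   const result = await closeGracefully(server, 1000)
//   console.log(`Shutdown finished: ${result}`)
//   process.exit(0)
// })
```

Logging the result makes it easier to spot when clients regularly need longer than the timeout allows.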

Logging

Chances are you have already implemented a robust logging strategy for your running application, so I won't go into too much detail here. Just remember to log with the same rigor and amount of information when the application shuts down!

If a crash occurs, log as much relevant information as possible, including the errors and stack trace. Rely on libraries like pino or winston in your application, and store these logs using one of their transports for better visibility. You can also take a look at our various logging add-ons to find a provider which matches your application’s needs.
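
To make "as much relevant information as possible" a bit more concrete, here is a rough sketch of a single structured log line for a crash. It uses plain JSON.stringify rather than the pino or winston APIs, and the formatCrashLog name and its field names (level, event, time, and so on) are only my assumptions for this example. A logging library would normally produce something similar for you.

```javascript
// Build one JSON line describing why the process is going down.
// The field names are illustrative, not a standard.
function formatCrashLog (event, err, now = new Date()) {
  const isError = err instanceof Error
  return JSON.stringify({
    level: 'fatal',
    event, // e.g. 'uncaughtException' or 'unhandledRejection'
    time: now.toISOString(),
    pid: process.pid,
    // A promise can be rejected with any value, not only an Error
    message: isError ? err.message : String(err),
    // Left out of the JSON entirely when there is no stack trace
    stack: isError ? err.stack : undefined
  })
}

// Typical use (not executed here):
// process.on('uncaughtException', err => {
//   console.error(formatCrashLog('uncaughtException', err))
//   process.exit(1)
// })
```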

Make sure everything is still good

Last, and certainly not least, we recommend that you add a health check route. This is a simple endpoint that returns a 200 status code if your application is running:

// Add a health check route in express
app.get('/_health', (req, res) => {
  res.status(200).send('ok')
})

You can have a separate service continuously monitor that route. You can configure this in a number of ways, whether by using a reverse proxy, such as nginx or HAProxy, or a load balancer, like ELB or ALB.

Any application that sits in front of your Node.js process can be used to constantly check that the health check route is responding. This will also give you much more visibility into the health of your Node.js processes, and you can rest easy knowing that they are running properly. There are some great monitoring services to help you with this in the Add-ons section of our Elements Marketplace.
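
If you're curious what such a monitor does under the hood, this is a minimal sketch of a single probe written with Node's http module. The checkHealth name and the two-second default timeout are assumptions for this example; a real monitor such as your load balancer also handles scheduling, retries, and alerting.

```javascript
const http = require('http')

// Resolve to true only if the health route answers with HTTP 200 in time.
// Any other status code, a connection error, or a timeout resolves to false.
function checkHealth (url, timeoutMs = 2000) {
  return new Promise(resolve => {
    // agent: false uses a one-off connection instead of a pooled one
    const req = http.get(url, { agent: false }, res => {
      res.resume() // Drain the body so the socket is released
      resolve(res.statusCode === 200)
    })
    req.setTimeout(timeoutMs, () => {
      req.destroy()
      resolve(false)
    })
    req.on('error', () => resolve(false))
  })
}

// Typical use (not executed here):
// checkHealth('http://localhost:3000/_health').then(healthy => {
//   console.log(healthy ? 'Service is healthy' : 'Service is NOT healthy')
// })
```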

Putting it all together

Whenever I work on a new Node.js project, I use the same function to ensure that my crashes are logged and my recoveries are guaranteed. It looks something like this:

function terminate (server, options = { coredump: false, timeout: 500 }) {
  // Exit function: abort to produce a core dump, or exit with the given code
  const exit = code => {
    options.coredump ? process.abort() : process.exit(code)
  }

  return (code, reason) => (err, promise) => {
    if (err && err instanceof Error) {
      // Log error information, use a proper logging library here :)
      console.log(reason, err.message, err.stack)
    }

    // Attempt a graceful shutdown, exiting with the code chosen for this event
    server.close(() => exit(code))
    setTimeout(() => exit(code), options.timeout).unref()
  }
}

module.exports = terminate

Here, I've created a module called terminate. I pass the instance of that server that I'm going to be closing, and some configuration options, such as whether I want to enable core dumps, as well as the timeout. I usually use an environment variable to control when I want to enable a core dump. I enable them only when I am going to do some performance testing on my application or whenever I want to replicate the error.
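
As a sketch of that environment-variable approach, the options object could be built like this. ENABLE_COREDUMP and SHUTDOWN_TIMEOUT_MS are hypothetical variable names chosen for this example, as is the terminateOptionsFromEnv helper; use whatever naming fits your deployment.

```javascript
// Derive the options for terminate() from environment variables.
// ENABLE_COREDUMP and SHUTDOWN_TIMEOUT_MS are hypothetical names.
function terminateOptionsFromEnv (env = process.env) {
  const timeout = Number.parseInt(env.SHUTDOWN_TIMEOUT_MS, 10)
  return {
    // Core dumps stay off unless explicitly switched on
    coredump: env.ENABLE_COREDUMP === 'true',
    // Fall back to 500ms when the variable is missing or not a number
    timeout: Number.isNaN(timeout) ? 500 : timeout
  }
}

// Typical use (not executed here):
// const exitHandler = terminate(server, terminateOptionsFromEnv())
```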

This exported function can then be set to listen to our error events:

const http = require('http')
const terminate = require('./terminate')
const server = http.createServer(...)

const exitHandler = terminate(server, {
  coredump: false,
  timeout: 500
})

process.on('uncaughtException', exitHandler(1, 'Unexpected Error'))
process.on('unhandledRejection', exitHandler(1, 'Unhandled Promise'))
process.on('SIGTERM', exitHandler(0, 'SIGTERM'))
process.on('SIGINT', exitHandler(0, 'SIGINT'))

Additional resources

There are a number of existing npm modules that solve the aforementioned issues in similar ways. You can check these out as well:

Hopefully, this information will simplify your life and enable your Node app to run better and safer in production!

Originally published: December 17, 2019
