Wednesday, 21 August 2019

Adding a self-destruct with a reset timer to a Node.js program

Over the last couple of years, we've experienced multiple partial outages due to various bugs in the Node.js ecosystem that resulted in our program hanging, e.g. a database socket hanging without a timeout, Redis hanging, a proxy agent hanging, etc. Usually this goes unnoticed for a long time (1-2 hours) because it only affects a subset of workers and there are no errors in the logs – the program is just hanging, waiting for some event.

Long story short, we ran into another partial outage today when we noticed that even though we have 1k+ active agents in Kubernetes, there were only 400 active jobs. The bug details are not important, but it made me think about what safety mechanism we could add to prevent this from happening in the future.

I am currently thinking of adding a sort of self-destruct mechanism that requires the program to check in every X minutes; otherwise it terminates the process, e.g.

```
// @flow

const createTimeout = (interval: number) => {
  return setTimeout(() => {
    console.error('liveness monitor was not reset in time; terminating program');

    process.nextTick(() => {
      // eslint-disable-next-line no-process-exit
      process.exit(1);
    });
  }, interval);
};

/**
 * This exists because of numerous bugs that were encountered
 * relating to hanging database connection sockets, http agent,
 * Redis and other services that resulted in the Node.js process hanging.
 * This is a sledgehammer solution to ensure that we detect instances
 * when the program is unresponsive and terminate it with a loud error.
 * In practice, this timeout is never expected to be reached
 * unless there is a bug somewhere in the code.
 *
 * The liveness monitor must be created once and the reset trigger injected
 * wherever the program makes observable progress.
 */
const createLivenessMonitor = (interval: number = 15 * 60 * 1000) => {
  let lastTimeout = createTimeout(interval);

  return () => {
    clearTimeout(lastTimeout);

    lastTimeout = createTimeout(interval);
  };
};

export default createLivenessMonitor;
```

The program would call the liveness monitor whenever it does something that indicates progress (e.g. a new task picked up from the queue). If the program fails to check in for X minutes, the process is terminated with an error and possibly additional debug details.

What are your thoughts about this solution?
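To make the "check in on progress" idea concrete, here is a rough sketch of how the reset trigger might be wired into a worker loop. It is plain JavaScript (no Flow annotations) so it runs as-is in Node.js. The `onExpire` parameter, `runWorker`, `fetchTask` and `handleTask` are hypothetical names added for illustration only. The monitor above always calls `process.exit(1)`; the injectable handler is an assumption that makes the behaviour easier to try out without killing the process.

```javascript
// Variant of the monitor above with an injectable expiry handler.
// ASSUMPTION: `onExpire` is not part of the original code, which always
// calls process.exit(1); it is added here only so the sketch can be
// exercised without terminating the process.
const createLivenessMonitor = (
  interval = 15 * 60 * 1000,
  onExpire = () => process.exit(1),
) => {
  const arm = () => setTimeout(() => {
    console.error('liveness monitor was not reset in time; terminating program');
    onExpire();
  }, interval);

  let lastTimeout = arm();

  // Each call cancels the pending self-destruct and schedules a new one.
  return () => {
    clearTimeout(lastTimeout);
    lastTimeout = arm();
  };
};

// Hypothetical worker loop: `fetchTask` and `handleTask` are placeholders
// for whatever the real program uses to pull and process queue items.
const runWorker = async (fetchTask, handleTask, checkIn) => {
  for (;;) {
    const task = await fetchTask();

    if (task === null) {
      return; // queue drained; a real worker would likely keep polling
    }

    checkIn(); // a task was picked up, so the program is making progress

    await handleTask(task);
  }
};
```

With something like this, the worker would be started as `runWorker(fetchTask, handleTask, createLivenessMonitor())`. Note that the armed timer keeps the event loop alive, so a worker that finishes cleanly would still need to exit explicitly (or clear the timer) rather than wait for the self-destruct to fire.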

Submitted August 22, 2019 at 12:48AM by gajus0
