Refining Product Search With Advanced Open Search Techniques

AWS OpenSearch Nov 22, 2024

Introduction

At Halodoc, our mission to simplify the patient experience drives us to explore cutting-edge technologies. One key area of focus has been our search engine, which serves as the backbone of our platform, connecting users with the health solutions they need.

Search is integrated throughout our products—from medicine delivery to hospital appointment bookings—and plays a critical role in helping users find what they are looking for efficiently and accurately. Our search engine, powered by OpenSearch, utilizes a variety of query types, including match, fuzzy, wildcard, span, and prefix queries, to deliver highly relevant results. To make search smarter and more intuitive, we went beyond simple term matching and introduced enhancements such as shingle analyzers for multi-word queries, phonetic analyzers to handle misspellings, optional match queries for greater flexibility, and painless scripts to optimize wildcard functions.

By refining our search capabilities, we’ve achieved notable results: a 59% reduction in users unable to find relevant matches, a 3.3% increase in click-through rates, and a 2% growth in weekly order conversion rates. These improvements demonstrate the impact of creating a more intuitive and efficient search experience. In this blog, we discuss the strategies that enabled these changes, the challenges we addressed, and how they have improved the user experience at Halodoc.

Challenges in the Existing Search Engine

As our user base grew, we encountered several challenges that underscored the need for search enhancements:

Typos in Product Names: Users sometimes make significant spelling mistakes, which can prevent relevant products from appearing in search results. While fuzzy queries can handle minor errors, they have limitations when errors are more substantial. For example, if someone searches for "azytromicin" instead of "Azithromycin," the correct product may not appear, even though it exists in the catalogue.


Missing Spaces Between Words: Users sometimes enter product names as a single word instead of separating each term, which may cause relevant products to be overlooked in search results. For instance, if a user searches for "stopcold" instead of "stop cold," the correct product might not appear in the results, even though it's available in the catalogue.


Partial Keyword Matches in Multi-Word Searches: When users search with multiple keywords, our system only returns results if all keywords are present in a document. If even one keyword is incorrect, the system returns no results, even when the others match. This can prevent users from finding relevant products when not every word of a multi-word search exactly matches our documents.


Performance Issues with Wildcard Queries: We observed latency issues in searches that involved wildcard queries to capture variations. While wildcard queries are useful for expanding matches, they began to cause performance bottlenecks, resulting in high response times and increased CPU utilization, making it challenging to maintain optimal search performance. These limitations pointed us toward advanced OpenSearch techniques to improve user experience, boost search performance, and reduce load on our systems.

Strategies for Improved Search Performance

To address these challenges, we implemented several advanced techniques to enhance our search engine's functionality.

Shingle Analyzer

Shingles are effectively word n-grams. Given a stream of tokens, the shingle filter creates new tokens by concatenating adjacent terms.

Imagine we have the sentence: "Shingle is a viral disease." When processed with a shingle filter, it might generate tokens like:

  • Shingle is
  • is a
  • a viral
  • viral disease

By adjusting the min_shingle_size and max_shingle_size, we can control the length of these tokens, forming shingles of various sizes.
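As an illustration, shingle generation can be sketched in a few lines of Python. This is a simplified model using whitespace tokenization, not OpenSearch's actual analyzer:

```python
def shingles(text, min_size=2, max_size=5):
    # Mirror min_shingle_size / max_shingle_size: emit every run of
    # `size` adjacent tokens for each size in the configured range.
    tokens = text.lower().split()
    out = []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

print(shingles("Shingle is a viral disease", min_size=2, max_size=2))
# → ['shingle is', 'is a', 'a viral', 'viral disease']
```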

Shingles essentially let us prepare phrases in advance for phrase matching, which can be a big time-saver. Instead of creating phrases during each query, shingles are already in the index, leading to faster searches.

One trade-off is that adding shingles increases the size of the index, as more tokens are stored. This can also mean higher memory usage, especially if you need to sort or facet based on the shingled field. However, for many use cases, shingles offer a good balance of search relevance without the higher overhead of n-grams, which generate a large number of tokens.

While n-grams can help with handling spelling errors, shingles work better when spelling accuracy is already high.

Let’s look at how we set up a shingle filter in OpenSearch.

Example Mapping

In our mapping, we use an analyzer with the shingle filter for both indexing and searching, creating a single field called title:

"mappings": {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "analyzer_shingle",
            "search_analyzer": "analyzer_shingle"
        }
    }
}

Here, we specify the analyzer_shingle for both indexing and searching.

Analyzer with Shingle Filter

Next, we define the analyzer_shingle analyzer, which includes a tokenizer and multiple filters:

"analyzer_shingle": {
    "tokenizer": "standard",
    "filter": ["lowercase", "filter_stop", "filter_shingle"]
}

This analyzer tokenizes text using the standard tokenizer, converts it to lowercase, removes stopwords, and then applies the shingle filter.

Configuring the Stopword Filter

To prevent gaps (like underscores) when creating shingles, we configure the stopword filter (filter_stop) with enable_position_increments set to false:

"filter_stop": {
    "type": "stop",
    "enable_position_increments": "false"
}

This setting makes the token stream continuous, without gaps, resulting in cleaner shingle tokens.

Defining the Shingle Filter

Finally, we configure the filter_shingle:

"filter_shingle": {
    "type": "shingle",
    "max_shingle_size": 5,
    "min_shingle_size": 2,
    "output_unigrams": "true"
}

With these settings, we create shingle tokens between two and five words long. Enabling output_unigrams allows single-word tokens alongside multi-word tokens.

By default, shingles are separated by a single space. However, for our specific requirement, we need to create a custom shingle analyzer that produces shingles without spaces. Here’s how we can define such an analyzer:

"filter_shingle": {
    "type": "shingle",
    "max_shingle_size": 5,
    "min_shingle_size": 2,
    "output_unigrams": "true",
    "token_separator": ""
}

We added token_separator to produce tokens without spaces as well.
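To see why the empty token_separator matters, here is a small Python sketch (again a simplified whitespace-based model, not the real analyzer). With an empty separator, the indexed shingles include the fused form, so a space-less query can still match:

```python
def shingles(text, min_size=2, max_size=5, separator=" "):
    # Same shingling idea, but with a configurable separator,
    # mirroring the token_separator setting of the shingle filter.
    tokens = text.lower().split()
    return [separator.join(tokens[i:i + size])
            for size in range(min_size, max_size + 1)
            for i in range(len(tokens) - size + 1)]

# Indexing "stop cold" with an empty separator also produces the fused
# token "stopcold", so a query for "stopcold" can still match.
print("stopcold" in shingles("stop cold", separator=""))  # True
```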

Benefits of Shingles

Shingles offer flexibility for both exact matches and partial phrase matching. Exact matches score higher because they match all the tokens in a shingle, while partial matches still return results. Shingling also considers token frequency, so unique phrases score higher.

With the above setup, we were able to address use cases where users enter multi-word product names without spaces, which previously were difficult or impossible to match in search. For example, if a user searches for "stopcold" instead of "stop cold," the shingle filter can recognize this as a sequence of words and generate relevant shingles that match both versions. This significantly improves search functionality, making it more robust and user-friendly for handling these types of queries.

The filter setup can be customized with different tokenizers or stopword settings to fit your needs, making shingles a powerful, adaptable tool for improved search relevance.

While using a shingle analyzer can increase memory usage, the actual impact depends on the characteristics of the data it is applied to. In our implementation, we applied the shingle analyzer to product names, which typically consist of only a few tokens. Since shingles are generated at the token level, we observed that the increase in memory usage was negligible.

Phonetic Analyzer

To enhance search functionality and accommodate potential typos, many opt to enable fuzziness in their search queries. However, this method comes with its drawbacks. Firstly, there are limitations on the degree of fuzziness allowed per term, which varies with the length of the word. Secondly, employing fuzziness can lead to increased CPU utilization on your nodes. Additionally, the results may not always align with user intent; for example, a fuzzy search for "lead" might yield documents that include "leak," which could be irrelevant to the user's needs.

In contrast, phonetic search adopts a fundamentally different strategy. This process occurs during indexing, where tokens are transformed and stored in the inverted index as distinct representations. Unlike fuzzy search, which relies on edit distances (a method that requires term comparison at query time), phonetic search focuses on the pronunciation of terms and the arrangement of certain letters. This approach generates a normalized output, allowing similar-sounding names—such as "Azithromycin" and "Azythromicin"—to be indexed under the same representation, thus improving the likelihood of retrieving relevant results.
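We use the double_metaphone encoder in production; the underlying idea can be illustrated with the much simpler Soundex algorithm, which also maps similar-sounding spellings to a single code. This is a toy illustration of phonetic encoding, not the encoder OpenSearch applies:

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus three digits.
    A far simpler relative of double_metaphone, shown only to
    illustrate how phonetic encoders normalize similar spellings."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":           # h and w do not separate duplicate codes
            continue
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]  # pad/truncate to 4 characters

print(soundex("Azithromycin"))   # A236
print(soundex("Azythromicin"))   # A236 — both spellings share one index key
```

Because both spellings collapse to the same code at index time, a misspelled query still retrieves the correctly spelled product, with no edit-distance work at query time.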

Let’s look at how to set up a phonetic analyzer in OpenSearch.

Example Mapping

In our mapping, we add a phonetic sub-field on title that uses an analyzer with the phonetic filter:

"mappings": {
    "properties": {
        "title": {
            "type": "text",
            "fields": {
                "phonetic": {
                    "type": "text",
                    "analyzer": "analyzer_phonetic",
                    "search_analyzer": "analyzer_phonetic"
                }
            }
        }
    }
}

Here, we specify the analyzer_phonetic for both indexing and searching.

Analyzer with Phonetic Filter

Next, we define the analyzer_phonetic analyzer, which includes a tokenizer and multiple filters:

{
  "analysis": {
    "analyzer": {
      "analyzer_phonetic": {
        "filter": [
          "lowercase",
          "filter_stop",
          "filter_phonetic"
        ],
        "tokenizer": "standard"
      }
    }
  }
}

This analyzer tokenizes text using the standard tokenizer, converts it to lowercase, removes stopwords, and then applies the phonetic filter.

Defining the Phonetic Filter

Finally, we configure the filter_phonetic:

"filter_phonetic": {
    "type": "phonetic",
    "encoder": "double_metaphone",
    "replace": true
}

Note that the phonetic token filter is provided by the analysis-phonetic plugin, which must be installed on the cluster.

Encoder: We use the double_metaphone encoder to generate phonetic representations. This encoder is known for handling a broad range of phonetic similarities.

Replace: Setting replace to true means the phonetic tokens replace the original tokens in the phonetic sub-field, ensuring that only phonetic representations are stored.

Benefits of Phonetics

This approach improves the user experience by providing more relevant search results and reduces the computational overhead associated with fuzzy searching. With phonetic analysis in place, users can find what they're looking for even if they misspell a name or enter a variant spelling, thereby enhancing the overall search efficiency of our application.

Phonetic analysis is performed at indexing time, which lowers the computational burden during search time compared to fuzzy searches that require real-time evaluation of edit distances.

Replacing Wildcards with Painless Scripts

Wildcard queries are used for pattern matching within string fields, allowing for searches that accommodate variable text structures. A wildcard query to find documents with attribute values containing a specific pattern might look like this:

{
  "query": {
    "wildcard": {
      "name": "*paracetamol*"
    }
  }
}

While wildcard queries are useful for flexible searching, they can lead to slower query performance, especially when leading wildcards are used. This is because wildcard queries may require scanning many documents to find matches, resulting in higher CPU usage and longer response times.

Using Painless scripts has proven to be significantly faster and less resource-intensive compared to wildcard queries. For large datasets or performance-critical applications, leveraging Painless scripts can lead to substantial improvements in response times and system resource utilisation.

Converting the above wildcard query to its Painless script equivalent:

{
  "script": {
    "script": {
      "params": {
        "name": "paracetamol"
      },
      "lang": "painless",
      "source": "doc['name'].value.contains(params.name)"
    }
  }
}
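For patterns of the form *term* with no inner wildcards, the substring check is equivalent to the wildcard match, which is what makes the swap safe. A small Python sketch of that equivalence, using illustrative document values:

```python
import fnmatch

# Illustrative document values, not real catalogue data
docs = ["paracetamol 500 mg", "ibuprofen 400 mg", "paracetamol syrup"]

# Pattern matching, as the wildcard query does
wildcard_hits = [d for d in docs if fnmatch.fnmatch(d, "*paracetamol*")]

# Substring check, as doc['name'].value.contains(params.name) does in Painless
contains_hits = [d for d in docs if "paracetamol" in d]

print(wildcard_hits == contains_hits)  # True
```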

Painless scripts are faster than wildcard queries in OpenSearch for several key reasons:

Direct Field Access: Painless scripts access field values directly from the index, allowing efficient evaluations without full dataset scans.

Avoiding Full Scans: Wildcard queries, especially with leading wildcards, require scanning many documents, leading to higher CPU usage. Painless scripts evaluate specific conditions directly, reducing unnecessary evaluations.

Optimised Compilation: Painless scripts are compiled and optimized before execution, allowing for faster processing and reduced computation overhead.

Less Query Parsing Overhead: Wildcard queries involve complex pattern matching, which adds parsing overhead. Painless scripts execute simpler logical expressions, minimising this overhead.

Custom Logic: Painless allows for tailored logic that narrows results early in the process, leading to more efficient evaluations.

Switching from wildcard queries to Painless scripts led to a ~40% reduction in CPU usage and a 10x improvement in response time.

Optional Match Query

In OpenSearch, the ability to perform optional match queries can be crucial for enhancing search flexibility and relevance. By default, the match query operator is set to AND, meaning all specified terms must be present in the document for it to match. By changing the operator to OR, you can create a more permissive search that matches documents containing any of the specified terms.

{
  "query": {
    "match": {
      "name": {
        "query": "azithromisin azyth",
        "operator": "or",
        "boost": 5.0,
        "auto_generate_synonyms_phrase_query": true,
        "fuzziness": "AUTO",
        "fuzzy_transpositions": true,
        "lenient": false,
        "max_expansions": 50,
        "prefix_length": 0,
        "zero_terms_query": "none"
      }
    }
  }
}

This flexibility can sometimes lead to irrelevant results, particularly when common terms or numerical keywords are involved. To mitigate this issue, it's essential to implement strategies that filter out unnecessary results.
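The difference between the two operators can be sketched in Python, using a simplified substring notion of a term match (no fuzziness or scoring) and illustrative documents:

```python
docs = ["azithromycin 500 mg", "amoxicillin syrup", "azyth tablet"]
terms = "azithromycin azyth".split()

# operator AND: every term must match; operator OR: any term is enough
and_hits = [d for d in docs if all(t in d for t in terms)]
or_hits = [d for d in docs if any(t in d for t in terms)]

print(and_hits)  # []
print(or_hits)   # ['azithromycin 500 mg', 'azyth tablet']
```

With AND semantics, the one bad term wipes out all results; with OR semantics, documents matching either term are still returned.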

Adjusting Query Boosts for Scoring

To ensure that the results from the shingle, optional, and phonetic queries do not overshadow the results from existing queries, we apply a hierarchical approach to boosting these queries in our search logic. This scoring adjustment allows us to maintain the relevance of more traditional query results while still providing the flexibility and benefits of enhanced query techniques.

The boost order we establish is as follows:

Existing Queries > Shingle Queries > Optional Queries > Phonetic Queries
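The effect of that ordering can be sketched as a weighted sum of sub-query scores. The boost values and documents below are illustrative only, not our production settings:

```python
# Illustrative boost weights reflecting the ordering
# existing > shingle > optional > phonetic
boosts = {"existing": 8.0, "shingle": 4.0, "optional": 2.0, "phonetic": 1.0}

# Hypothetical per-document scores from each sub-query
docs = {
    "exact-match product":   {"existing": 1.0, "phonetic": 1.0},
    "phonetic-only product": {"phonetic": 1.0},
}

def final_score(subscores):
    # Each sub-query's score is scaled by its boost before summing
    return sum(boosts[q] * s for q, s in subscores.items())

ranked = sorted(docs, key=lambda d: final_score(docs[d]), reverse=True)
print(ranked)  # ['exact-match product', 'phonetic-only product']
```

A document found only by the more permissive phonetic query can still surface, but it never outranks a document the stricter queries already matched.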

Results and Impact

After implementing these improvements, we observed a significant improvement in search results. We have seen a remarkable 59% reduction in instances where users did not find relevant matches, decreasing from approximately 1,800 to around 740 daily users. This result far surpasses our initial goal of a 21% decrease in unique users experiencing difficulties. The remaining users primarily consist of those searching for banned products or items we do not carry.

Additionally, the click-through rate (CTR) for search results has improved by 3.3%, increasing from 84.6% to 87.9%. We have also observed an uptick in the weekly order conversion rate, which has risen by about 2% — from an average of 27% to 29%.

Conclusion


Through these enhancements, we have made significant strides in improving the search experience for our users. By addressing key challenges and implementing advanced OpenSearch techniques, we have not only optimized performance but also ensured that our users can easily find the products they need. At Halodoc, we remain committed to leveraging technology to enhance the patient experience, and we look forward to continuously evolving our services to meet their needs.

References

  • Transitioning from Elasticsearch to AWS OpenSearch: our migration from AWS Elasticsearch cluster version 6.4 to AWS OpenSearch version 2.9.
  • Migrating Halodoc Search from Algolia to Elastic Search: why we moved away from Algolia and how we overcame the migration challenges.
  • OpenSearch Mapping: How to View, Create & Update Mapping Types
  • Execute Painless script (OpenSearch documentation)

Join Us

Scalability, reliability and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels, and if solving complex problems with challenging requirements is your forte, please reach out to us with your resumé at careers.india@halodoc.com.

About Halodoc

Halodoc is the number 1 Healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 3500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals that allow patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek, and many more. We recently closed our Series D round and in total have raised over USD 100 million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.

Satish Kumar Agarwal

SDE with a Passion for Crafting Digital Excellence 💻🚀