Advanced Techniques for Company Name Searches in Elasticsearch

Building an accurate company name search query and index

You know how they say Elasticsearch is powerful? Well, that's like saying the sun is "kinda warm." This search engine is so feature-packed, it makes Swiss Army knives look like plastic spoons. But here's the kicker – with great power comes great confusion. Trying to set up Elasticsearch for a specific use case is like trying to assemble IKEA furniture with instructions written in hieroglyphics. Sure, you might end up with something functional, but there's a good chance it'll look nothing like what you had in mind.

In our case, we were on a quest to perfect company name search. Sounds simple, right? Ha! If only. We spent A LOT of time tweaking our Elasticsearch config. The problem seemed so common, we thought surely someone must have cracked this nut already. But nope!

Well, fear not, dear readers! After countless cups of coffee, a few existential crises, and one incident involving a rubber duck and a whiteboard that we don't talk about anymore, we've finally done it. We've tamed the Elasticsearch beast for company name search, and we're here to spill the beans.

Problem statement

  • Accurate company search by name is hard. Here are a few examples:

    1. A search for Boomi Technologies PTE LTD should not return Wayfarer Technologies PTE LTD

    2. but a search for WayBarer Technologies PTE LTD (note the typo) should find Wayfarer Technologies PTE LTD

    3. A search for Crazy hat PTE LTD should find CRAZY HAT

    4. GITA Nur YULIANI should not find NUR EKA Zuhrah

    5. IPRINT SG should find iPrint.Sg

    6. GITA WISMA LAILATUL M should not find WISMA alat Zuhrah

There are many more complications, but these should give you a sense of how complex the problem is.

Solution

You can find the index and query at the end of this article, so if you just need the answers, scroll to the end. Otherwise, I will cover both in detail below.

Index settings

Analysis configuration

The heart of our index configuration lies in its custom analysis settings. These are crucial for tailoring how Elasticsearch processes and searches our company data.

Let’s start by reviewing the building blocks. It’s important to understand what is happening here before we jump into the index itself and the query.

Character Filters

"char_filter": {
  "remove_punctuation": {
    "type": "pattern_replace",
    "pattern": "[.,]",
    "replacement": " "
  }
}

We've defined a custom character filter called remove_punctuation. This filter replaces periods and commas with spaces. The purpose is to normalize text by removing these punctuation marks, which can interfere with tokenization and searching.

Because of this, iPrint.id is split into 2 tokens: iPrint and id.
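To see what the character filter does to the raw text before any tokenisation happens, here is a minimal Python sketch. It only imitates the pattern_replace behaviour with a regular expression; the function name remove_punctuation is borrowed from the filter name for illustration and is not part of any Elasticsearch client.

```python
import re

# Same pattern as the "remove_punctuation" char filter: periods and commas.
PUNCTUATION = re.compile(r"[.,]")


def remove_punctuation(text: str) -> str:
    """Replace every period or comma with a single space (illustrative only)."""
    return PUNCTUATION.sub(" ", text)


print(remove_punctuation("iPrint.id PTE LTD"))
```

Because the replacement is a space, a tokenizer that runs afterwards sees iPrint and id as two separate words.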

Custom Filters

  1. Company Stop Filter
"company_stop": {
  "type": "stop",
  "stopwords": ["ltd", "pte", "pt", "bhd", "sdn", "pty", "limited", "enterprise", "and", "bin", "gmail", "pvt", "inc", "cv", "hotmail"]
}

This custom stop filter removes common company suffixes and irrelevant words, improving the quality of company name matches. I have analysed the frequency of each word in our database of 1m+ company names across South-East Asia, and picked the most common words that will negatively affect the scoring and will create noise.

If we are not careful, Shoes PTE LTD will match Hats PTE LTD because of the 2 shared tokens PTE and LTD.

  2. Company Length Filter
"company_length": {
  "type": "length",
  "min": 3
}

This filter removes very short tokens (fewer than 3 characters) from company names. Such tokens are rarely meaningful for search and mostly create noise.
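The effect of the two token filters can be sketched in a few lines of Python. This is only an approximation for illustration: the function names mirror the filter names, the tokens are assumed to be lowercased already (the stop filter compares tokens case-sensitively unless configured otherwise), and the shared_tokens helper is hypothetical.

```python
COMPANY_STOPWORDS = {
    "ltd", "pte", "pt", "bhd", "sdn", "pty", "limited", "enterprise",
    "and", "bin", "gmail", "pvt", "inc", "cv", "hotmail",
}


def company_stop(tokens):
    """Drop tokens listed as stop words (tokens are assumed lowercased)."""
    return [t for t in tokens if t not in COMPANY_STOPWORDS]


def company_length(tokens, min_length=3):
    """Drop tokens shorter than min_length characters."""
    return [t for t in tokens if len(t) >= min_length]


def shared_tokens(a, b):
    """Hypothetical helper: tokens two names have in common."""
    return set(a) & set(b)


shoes = ["shoes", "pte", "ltd"]
hats = ["hats", "pte", "ltd"]

# Without the stop filter the two names overlap on the legal suffixes.
print(sorted(shared_tokens(shoes, hats)))
# With the stop filter nothing is left to match on.
print(sorted(shared_tokens(company_stop(shoes), company_stop(hats))))
# The length filter drops two-letter leftovers such as "id" or "sg".
print(company_length(["iprint", "id", "sg"]))
```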

Analysers

  1. Company Name Analyzer
"company_name_analyzer": {
  "type": "custom",
  "char_filter": ["remove_punctuation"],
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding", "company_length", "company_stop"]
}

This is the core of successful company name search. The analyzer is specifically tailored for company names. It:

  • Removes punctuation with a char_filter before tokenisation. It’s important to use a char_filter here instead of a token filter, because a char_filter is applied before tokenisation, so the replacement space actually splits the text into separate tokens.

  • Uses standard tokenization (which is more sophisticated than whitespace tokenization)

  • Applies lowercase conversion, and ASCII folding

  • Removes common company suffixes and words (via company_stop)

  • Filters out very short tokens with length < 3 (via company_length)

This specialized analyzer helps in matching company names more accurately, accounting for common variations and irrelevant terms.

Let’s follow step by step what’s happening here on the example of iPrint.id PTE LTD:

  1. Punctuation removal turns it into iPrint id PTE LTD

  2. Standard tokenisation produces iPrint, id, PTE, LTD

  3. Lowercasing produces iprint, id, pte, ltd

  4. ASCII folding changes nothing in this example.

  5. Token length filtering leaves iprint, pte, ltd

  6. Stop word filtering leaves iprint

This is the final token that will be stored and compared with future searches.
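The whole chain can be approximated in plain Python to check the walkthrough by hand. This is a rough sketch under a few assumptions: a \w+ regular expression stands in for the standard tokenizer, Unicode decomposition stands in for asciifolding, and the function names are made up for this example. Real Elasticsearch output can differ on edge cases, so treat the _analyze API as the source of truth.

```python
import re
import unicodedata

COMPANY_STOPWORDS = {
    "ltd", "pte", "pt", "bhd", "sdn", "pty", "limited", "enterprise",
    "and", "bin", "gmail", "pvt", "inc", "cv", "hotmail",
}


def ascii_fold(token: str) -> str:
    """Approximate asciifolding: strip accents, drop other non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def analyze_company_name(text: str) -> list:
    text = re.sub(r"[.,]", " ", text)                          # char_filter: remove_punctuation
    tokens = re.findall(r"\w+", text)                          # stand-in for the standard tokenizer
    tokens = [t.lower() for t in tokens]                       # lowercase
    tokens = [ascii_fold(t) for t in tokens]                   # asciifolding
    tokens = [t for t in tokens if len(t) >= 3]                # company_length
    tokens = [t for t in tokens if t not in COMPANY_STOPWORDS]  # company_stop
    return tokens


for name in ["iPrint.id PTE LTD", "IPRINT SG", "iPrint.Sg", "Crazy hat PTE LTD", "CRAZY HAT"]:
    print(name, "->", analyze_company_name(name))
```

Under these assumptions IPRINT SG and iPrint.Sg reduce to the same single token, and Crazy hat PTE LTD reduces to the same tokens as CRAZY HAT, which is what the problem statement asks for.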

  2. Whitespace Lowercase Analyzer

"whitespace_lowercase": {
  "type": "custom",
  "tokenizer": "whitespace",
  "filter": ["lowercase", "asciifolding", "remove_punctuation"]
}

This analyzer is designed for general text processing. It:

  • Tokenizes on whitespace

  • Converts tokens to lowercase

  • Applies ASCII folding (normalizing accented characters)

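  • Replaces periods and commas inside each token with a space (via the remove_punctuation token filter declared in the final index; unlike the char filter of the same name, it runs after tokenisation and does not split tokens)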

This combination allows for flexible matching of text fields, accommodating variations in spacing, case, and accents.
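One detail is easy to miss: in this analyzer remove_punctuation sits in the filter list, so it runs on tokens that already exist. The Python sketch below imitates that order of operations. The function names are hypothetical, str.split stands in for the whitespace tokenizer, and Unicode decomposition stands in for asciifolding.

```python
import re
import unicodedata


def ascii_fold(token: str) -> str:
    """Approximate asciifolding by stripping accents."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def analyze_whitespace_lowercase(text: str) -> list:
    tokens = text.split()                               # whitespace tokenizer
    tokens = [t.lower() for t in tokens]                # lowercase
    tokens = [ascii_fold(t) for t in tokens]            # asciifolding
    # Token filter: rewrites each token in place, it never creates new tokens.
    tokens = [re.sub(r"[.,]", " ", t) for t in tokens]
    return tokens


print(analyze_whitespace_lowercase("TAX-123.456 Abc"))
```

The first token keeps its hyphen and ends up with a space inside it instead of being split in two, which is the practical difference between a token filter and a char filter with the same pattern.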

Mappings

Finally, we are ready to review the mapping itself. In our example, it’s important to understand the configuration of 3 fields:

  1. companyName - stored both as text, analysed with the custom company_name_analyzer for tokenised full-text search, and as a keyword sub-field for exact matching.

  2. parentCompanyId - keyword field for exact matching of the tenant.

  3. searchCompanyKey - uses the whitespace_lowercase analyzer for general search, also with a keyword sub-field. This field stores unique company identifiers (tax numbers, codes, etc.), so the keyword sub-field is what serves exact matches.
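The keyword sub-fields in the final index use a case_insensitive normalizer (lowercase plus asciifolding, no tokenisation). A small Python approximation shows what "exact match" means under that normalizer and why the analysed text field is still needed. The function name keyword_normalizer is made up for this sketch.

```python
import unicodedata


def keyword_normalizer(value: str) -> str:
    """Approximate the case_insensitive normalizer: lowercase + accent stripping."""
    lowered = value.lower()
    decomposed = unicodedata.normalize("NFKD", lowered)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Case differences disappear, so these two values count as an exact match.
print(keyword_normalizer("CRAZY HAT") == keyword_normalizer("crazy hat"))
# Punctuation and suffixes are kept, so these pairs do not match exactly
# and have to be caught by the analysed companyName field instead.
print(keyword_normalizer("IPRINT SG") == keyword_normalizer("iPrint.Sg"))
print(keyword_normalizer("Crazy hat PTE LTD") == keyword_normalizer("CRAZY HAT"))
```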

Query

The query is a Boolean query with filter, should, and minimum_should_match clauses. It's designed to find companies that match specific criteria while providing flexible, yet precise, matching on the company name.

Query Components

Bool query

The top-level query is a bool query, which allows us to combine multiple query clauses.

{
  "query": {
    "bool": {
      "filter": [...],
      "should": [...],
      "minimum_should_match": 1
    }
  },
  "size": 10,
  "from": 0,
  "sort": [
    {
      "_score": "desc"
    }
  ]
}

Filter Clause

"filter": [
  {
    "term": {
      "parentCompanyId": "0demoFileID"
    }
  },
  {
    "term": {
      "isCustomer": true
    }
  }
]

The filter clause contains two term queries:

  1. It filters for companies with a specific parentCompanyId.

  2. It ensures that only companies marked as customers (isCustomer: true) are returned.

These are hard requirements in my case: every result must satisfy both conditions.

Should Clause

To keep low-priority partial matches from competing with high-priority full matches, the should clause is split into two main parts, with minimum_should_match = 1:

  1. Nested bool query - high-priority full matches

  2. Simple query string - low-priority partial matches

Combining minimum_should_match with a must_not that negates the nested bool query ensures that any given document can match through only one of the two should groups.

The nested bool query tries to match the company name in three different ways:

  1. Exact Match: Uses a term query on companyName.keyword with the highest boost (10). This favors exact matches.

           {
             "term": {
               "companyName.keyword": {
                 "value": "crazy hat pte ltd",
                 "boost": 10
               }
             }
           }
    
  2. Flexible Match on searchCompanyKey: Uses a match query on searchCompanyKey with a boost of 5. This makes it possible to find a company by its tax number or another identifier.

           {
             "match": {
               "searchCompanyKey": {
                 "query": "crazy hat pte ltd",
                 "minimum_should_match": "1",
                 "boost": 5
               }
             }
           }
    
  3. Fuzzy Match on companyName: Uses a fuzzy match query on companyName with a boost of 3. This tolerates slight misspellings or variations but still requires a full match: the and operator means that every query token must be matched, in any order.

           {
             "match": {
               "companyName": {
                 "query": "crazy hat pte ltd",
                 "fuzziness": "AUTO",
                 "operator": "and",
                 "boost": 3
               }
             }
           }
    

The minimum_should_match: 1 ensures that at least one of these conditions must be met.
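To make the fuzzy rule concrete, here is a rough Python model of "fuzziness": "AUTO" combined with "operator": "and", applied to tokens that have already gone through the analyzer. It is a sketch, not Elasticsearch's implementation: it uses plain Levenshtein distance (Elasticsearch by default also counts a swap of two adjacent characters as a single edit), it uses the default AUTO thresholds (exact match for 1-2 characters, one edit for 3-5, two edits for longer terms), and all function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def auto_fuzziness(term: str) -> int:
    """Allowed edits for a term under the default AUTO setting."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_and_match(query_tokens, doc_tokens) -> bool:
    """Every query token must fuzzily match some document token, in any order."""
    return all(
        any(levenshtein(q, d) <= auto_fuzziness(q) for d in doc_tokens)
        for q in query_tokens
    )


wayfarer = ["wayfarer", "technologies"]
print(fuzzy_and_match(["waybarer", "technologies"], wayfarer))  # typo is tolerated
print(fuzzy_and_match(["boomi", "technologies"], wayfarer))     # "boomi" matches nothing
print(fuzzy_and_match(["gita", "nur", "yuliani"], ["nur", "eka", "zuhrah"]))
```

This is why a shared token such as technologies or nur is not enough on its own: with the and operator, one unmatched query token rejects the document.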

Simple Query String for low priority results

{
  "bool": {
    "should": [
      {
        "simple_query_string": {
          "query": "*crazy hat pte ltd*",
          "fields": [
            "companyName",
            "companyName._2gram",
            "companyName._3gram"
          ],
          "default_operator": "AND",
          "minimum_should_match": "100%",
          "analyze_wildcard": true,
          "boost": 0.5
        }
      }
    ],
    "must_not": [
      {
        "term": {
          "companyName.keyword": {
            "value": "crazy hat pte ltd"
          }
        }
      },
      {
        "match": {
          "searchCompanyKey": {
            "query": "crazy hat pte ltd",
            "minimum_should_match": "1"
          }
        }
      },
      {
        "match": {
          "companyName": {
            "query": "crazy hat pte ltd",
            "fuzziness": "AUTO",
            "operator": "and"
          }
        }
      }
    ],
    "minimum_should_match": 1
  }
}

This query uses simple_query_string for more advanced text matching:

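  • It wraps the company name in wildcards at the start and end. Note that simple_query_string documents * only as a trailing prefix operator on a term, so the leading wildcard may not behave as a true "contains" search.

  • It searches in companyName and its n-gram sub-fields, allowing for partial matches. The _2gram and _3gram sub-fields are generated by the search_as_you_type field type, so they only contribute if companyName is mapped that way; the mapping shown in this article declares companyName as plain text.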

  • The AND operator and "minimum_should_match": "100%" ensure all terms must be present.

  • analyze_wildcard: true allows the wildcard query to be analyzed.

  • It has a lower boost (0.5), making it less influential than the other queries.

  • must_not negates the clauses of the previous nested query. This ensures that a document which already satisfies one of the high-priority matches is not matched again through the wildcard branch.

Size and From

Usual pagination, nothing special.

"size": 10,
"from": 0,

Sort

"sort": [
  {
    "_score": "desc"
  }
]

This sorts the results by their relevance score in descending order.

Final Query

{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "parentCompanyId": "0demoFileID"
          }
        },
        {
          "term": {
            "isCustomer": true
          }
        }
      ],
      "should": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "companyName.keyword": {
                    "value": "crazy hat pte ltd",
                    "boost": 10
                  }
                }
              },
              {
                "match": {
                  "searchCompanyKey": {
                    "query": "crazy hat pte ltd",
                    "minimum_should_match": "1",
                    "boost": 5
                  }
                }
              },
              {
                "match": {
                  "companyName": {
                    "query": "crazy hat pte ltd",
                    "fuzziness": "AUTO",
                    "operator": "and",
                    "boost": 3
                  }
                }
              }
            ],
            "minimum_should_match": 1
          }
        },
        {
          "bool": {
            "must_not": [
              {
                "term": {
                  "companyName.keyword": {
                    "value": "crazy hat pte ltd"
                  }
                }
              },
              {
                "match": {
                  "searchCompanyKey": {
                    "query": "crazy hat pte ltd",
                    "minimum_should_match": "1"
                  }
                }
              },
              {
                "match": {
                  "companyName": {
                    "query": "crazy hat pte ltd",
                    "fuzziness": "AUTO",
                    "operator": "and"
                  }
                }
              }
            ],
            "should": [
              {
                "simple_query_string": {
                  "query": "*crazy hat pte ltd*",
                  "fields": [
                    "companyName",
                    "companyName._2gram",
                    "companyName._3gram"
                  ],
                  "default_operator": "AND",
                  "minimum_should_match": "100%",
                  "analyze_wildcard": true,
                  "boost": 0.5
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ],
      "minimum_should_match": 1
    }
  },
  "size": 10,
  "from": 0,
  "sort": [
    {
      "_score": "desc"
    }
  ]
}
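In the final query the search term appears seven times, so in application code it is less error-prone to generate the body than to paste it. Below is a minimal Python sketch of such a builder. The function build_company_query and its parameter names are hypothetical; the structure, field names, and boosts are copied from the query above, and the result is a plain dict that can be serialised to JSON and sent with whichever client you use.

```python
import json


def build_company_query(name: str, parent_company_id: str, size: int = 10, offset: int = 0) -> dict:
    """Assemble the two-tier bool query for one search term (hypothetical helper)."""

    def high_priority(with_boost: bool) -> list:
        def clause(body: dict, boost: int) -> dict:
            # Boosts only affect scoring, so they are left out inside must_not.
            return {**body, "boost": boost} if with_boost else body

        return [
            {"term": {"companyName.keyword": clause({"value": name}, 10)}},
            {"match": {"searchCompanyKey": clause({"query": name, "minimum_should_match": "1"}, 5)}},
            {"match": {"companyName": clause({"query": name, "fuzziness": "AUTO", "operator": "and"}, 3)}},
        ]

    low_priority = {
        "simple_query_string": {
            "query": f"*{name}*",
            "fields": ["companyName", "companyName._2gram", "companyName._3gram"],
            "default_operator": "AND",
            "minimum_should_match": "100%",
            "analyze_wildcard": True,
            "boost": 0.5,
        }
    }

    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"parentCompanyId": parent_company_id}},
                    {"term": {"isCustomer": True}},
                ],
                "should": [
                    # Tier 1: high-priority full matches, boosted.
                    {"bool": {"should": high_priority(True), "minimum_should_match": 1}},
                    # Tier 2: partial matches, only for documents that miss tier 1.
                    {
                        "bool": {
                            "must_not": high_priority(False),
                            "should": [low_priority],
                            "minimum_should_match": 1,
                        }
                    },
                ],
                "minimum_should_match": 1,
            }
        },
        "size": size,
        "from": offset,
        "sort": [{"_score": "desc"}],
    }


print(json.dumps(build_company_query("crazy hat pte ltd", "0demoFileID"), indent=2))
```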

Final index

{
    "indexName": "companies",
    "settings": {
        "analysis": {
            "char_filter": {
                "remove_punctuation": {
                    "type": "pattern_replace",
                    "pattern": "[.,]",
                    "replacement": " "
                }
            },
            "analyzer": {
                "whitespace_lowercase": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "asciifolding", "remove_punctuation"]
                },
                "company_name_analyzer": {
                    "type": "custom",
                    "char_filter": ["remove_punctuation"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding", "remove_punctuation", "company_length", "company_stop"]
                }
            },
            "normalizer": {
                "case_insensitive": {
                    "type": "custom",
                    "char_filter": [],
                    "filter": ["lowercase", "asciifolding"]
                }
            },
            "filter": {
                "remove_punctuation": {
                    "type": "pattern_replace",
                    "pattern": "[.,]",
                    "replacement": " "
                },
                "company_stop": {
                    "type": "stop",
                    "stopwords": ["ltd", "pte", "pt", "bhd", "sdn", "pty", "limited", "enterprise", "and", "bin", "gmail", "pvt", "inc", "cv", "hotmail"]
                },
                "company_length": {
                    "type": "length",
                    "min": 3
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {
                "type": "keyword"
            },
            "companyName": {
                "type": "text",
                "analyzer": "company_name_analyzer",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "normalizer": "case_insensitive"
                    }
                }
            },
            "parentCompanyId": {
                "type": "keyword"
            },
            "searchCompanyKey": {
                "type": "text",
                "analyzer": "whitespace_lowercase",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "normalizer": "case_insensitive"
                    }
                }
            },
            // other fields below ..
        }
    },
    "severity": "INFO",
    "message": ""
}