Advanced Techniques for Company Name Searches in Elasticsearch
Building an accurate company name search query and index
You know how they say Elasticsearch is powerful? Well, that's like saying the sun is "kinda warm." This search engine is so feature-packed, it makes Swiss Army knives look like plastic spoons. But here's the kicker – with great power comes great confusion. Trying to set up Elasticsearch for a specific use case is like trying to assemble IKEA furniture with instructions written in hieroglyphics. Sure, you might end up with something functional, but there's a good chance it'll look nothing like what you had in mind.
In our case, we were on a quest to perfect company name search. Sounds simple, right? Ha! If only. We spent A LOT of time tweaking our Elasticsearch config. The problem seemed so common, we thought surely someone must have cracked this nut already. But nope!
Well, fear not, dear readers! After countless cups of coffee, a few existential crises, and one incident involving a rubber duck and a whiteboard that we don't talk about anymore, we've finally done it. We've tamed the Elasticsearch beast for company name search, and we're here to spill the beans.
Problem statement
Accurate company search by name is hard. Here are a few examples:
- A search for Boomi Technologies PTE LTD should not return Wayfarer Technologies PTE LTD, but a search for Way**B**arer Technologies PTE LTD should find Wayfarer Technologies PTE LTD.
- Crazy hat PTE LTD should find CRAZY HAT.
- GITA Nur YULIANI should not find NUR EKA Zuhrah.
- IPRINT SG should find iPrint.Sg.
- GITA WISMA LAILATUL M should not find WISMA alat Zuhrah.
There are many more complications, but you should already be getting a sense of the complexity of the problem.
Solution
You can find the full index and query at the end of this article, so if you just need the answers, scroll to the end. Otherwise, I will cover both in detail below.
Index settings
Analysis configuration
The heart of our index configuration lies in its custom analysis settings. These are crucial for tailoring how Elasticsearch processes and searches our company data.
Let’s start the review with the building blocks; it’s important to understand what’s happening here before we jump into the index itself and the query.
Character Filters
"char_filter": {
"remove_punctuation": {
"type": "pattern_replace",
"pattern": "[.,]",
"replacement": " "
}
}
We've defined a custom character filter called remove_punctuation. This filter replaces periods and commas with spaces. The purpose is to normalize text by removing these punctuation marks, which can interfere with tokenization and searching. Because of this, I will be able to split iPrint.id into two tokens: iPrint and id.
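Before creating the index, you can sanity-check the character filter on its own with the _analyze API. A minimal sketch (the tokenizer and the sample text are only there for illustration):

POST /_analyze
{
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "[.,]",
      "replacement": " "
    }
  ],
  "tokenizer": "standard",
  "text": "iPrint.id PTE LTD"
}

Because the dot is replaced before tokenization, the response should list the tokens iPrint, id, PTE and LTD.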
Custom Filters
- Company Stop Filter
"company_stop": {
"type": "stop",
"stopwords": ["ltd", "pte", "pt", "bhd", "sdn", "pty", "limited", "enterprise", "and", "bin", "gmail", "pvt", "inc", "cv", "hotmail"]
}
This custom stop filter removes common company suffixes and irrelevant words, improving the quality of company name matches. I analysed the frequency of each word in our database of 1m+ company names across South-East Asia and picked the most common words that would negatively affect scoring and create noise.
If we are not careful, Shoes PTE LTD will match Hats PTE LTD because of the two shared tokens PTE and LTD.
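To see the noise these suffix tokens create, here is a quick sketch that analyses a name with only the standard tokenizer and a lowercase filter, i.e. without company_stop:

POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Shoes PTE LTD"
}

Both Shoes PTE LTD and Hats PTE LTD would produce the shared tokens pte and ltd, which is exactly the overlap that company_stop removes.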
- Company Length Filter
"company_length": {
"type": "length",
"min": 3
}
This filter removes very short tokens (fewer than 3 characters) from company names; they are often not meaningful for search and only create noise.
Analysers
- Company Name Analyzer
"company_name_analyzer": {
"type": "custom",
"char_filter": ["remove_punctuation"],
"tokenizer": "standard",
"filter": ["lowercase", "asciifolding", "company_length", "company_stop"]
}
This analyzer is the core of successful company name search. It is specifically tailored for company names. It:
- Removes punctuation with the remove_punctuation char_filter before tokenisation. It’s important to use a char_filter here instead of a token filter, because a char_filter is applied before tokenisation.
- Uses standard tokenization (which is more sophisticated than whitespace tokenization)
- Applies lowercase conversion and ASCII folding
- Removes common company suffixes and words (via company_stop)
- Filters out very short tokens with length < 3 (via company_length)
This specialized analyzer helps in matching company names more accurately, accounting for common variations and irrelevant terms.
Let’s follow step by step what happens to iPrint.id PTE LTD:
1. Punctuation removal: it becomes iPrint id PTE LTD
2. Standard tokenisation into iPrint, id, PTE, LTD
3. Lowercase filtering into iprint, id, pte, ltd
4. ASCII folding does not change anything in this example
5. Token length filtering into iprint, pte, ltd
6. Stop word filtering into iprint

This is the final token that will be stored and compared against future searches.
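Once the index exists, you can confirm this end to end with the _analyze API against the index (a quick sketch, assuming the index is named companies as in the mapping at the end of the article):

POST /companies/_analyze
{
  "analyzer": "company_name_analyzer",
  "text": "iPrint.id PTE LTD"
}

If everything is configured as described, the response contains the single token iprint.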
Whitespace Lowercase Analyzer
"whitespace_lowercase": {
"type": "custom",
"tokenizer": "whitespace",
"filter": ["lowercase", "asciifolding", "remove_punctuation"]
}
This analyzer is designed for general text processing. It:
- Tokenizes on whitespace
- Converts tokens to lowercase
- Applies ASCII folding (normalizing accented characters)
- Removes punctuation
This combination allows for flexible matching of text fields, accommodating variations in spacing, case, and accents.
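The same _analyze check works here. One subtlety worth knowing (and the reason company_name_analyzer uses a char_filter instead): remove_punctuation is applied as a token filter in this analyzer, so it runs after tokenization and will not split a token in two. A small sketch with a made-up identifier:

POST /companies/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "iPrint.Sg 201912345K"
}

The made-up key 201912345K is simply lowercased, while iPrint.Sg remains a single token with the dot replaced by a space inside it.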
Mappings
Finally, we are ready to review the mapping itself. In our example it’s important to understand the configuration of 3 fields:
- companyName - uses the custom company_name_analyzer for flexible text matching, with an additional keyword sub-field for exact matching. We store companyName both as text for full-text tokenised search and as a keyword for exact matching.
- parentCompanyId - a keyword field for exact matching of the tenant.
- searchCompanyKey - uses the whitespace_lowercase analyzer for general search functionality, also with a keyword sub-field. We only need the keyword field here for exact matches, because we store unique company ids (tax numbers, codes, etc.) in it.
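To make the field usage concrete, here is what indexing one company document could look like. This is an entirely made-up example; isCustomer is one of the "other fields" omitted from the mapping excerpt at the end of the article, but it is used by the query's filter clause:

PUT /companies/_doc/1
{
  "id": "1",
  "companyName": "Crazy Hat PTE LTD",
  "parentCompanyId": "0demoFileID",
  "searchCompanyKey": "201912345K",
  "isCustomer": true
}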
Query
The query is a Boolean query with filter, should, and minimum_should_match clauses. It's designed to find companies that match specific criteria while providing flexible, yet precise, matching on the company name.
Query Components
Bool query
The top-level query is a bool query, which allows us to combine multiple query clauses.
{
"query": {
"bool": {
"filter": [...],
"should": [...],
"minimum_should_match": 1
}
},
"size": 10,
"from": 0,
"sort": [
{
"_score": "desc"
}
]
}
Filter Clause
"filter": [
{
"term": {
"parentCompanyId": "0demoFileID"
}
},
{
"term": {
"isCustomer": true
}
}
]
The filter clause contains two term queries:
- It filters for companies with a specific parentCompanyId.
- It ensures that only companies marked as customers (isCustomer: true) are returned.

This is a hard requirement in my case: all results must satisfy these values.
Should Clause
To prevent returning low-priority partial matches when there is a high-priority full match, the should clause is split into two main parts with minimum_should_match = 1:
- Nested bool query - high-priority full matches
- Simple query string - low-priority partial matches

minimum_should_match, in combination with a must_not that negates the nested bool query, ensures that only one of the two should groups can match a given document.
This nested bool query tries to match the company name in three different ways:

Exact Match: uses a term query on companyName.keyword with the highest boost (10). This favours exact matches.

{
  "term": {
    "companyName.keyword": {
      "value": "crazy hat pte ltd",
      "boost": 10
    }
  }
}

Flexible Match on searchCompanyKey: uses a match query on searchCompanyKey with a boost of 5. This allows matching the company by tax numbers and other unique keys.

{
  "match": {
    "searchCompanyKey": {
      "query": "crazy hat pte ltd",
      "minimum_should_match": "1",
      "boost": 5
    }
  }
}

Fuzzy Match on companyName: uses a match query on companyName with a boost of 3. This allows for slight misspellings or variations, but still requires a full match. The and operator means every query token must be matched, in no particular order.

{
  "match": {
    "companyName": {
      "query": "crazy hat pte ltd",
      "fuzziness": "AUTO",
      "operator": "and",
      "boost": 3
    }
  }
}

The minimum_should_match: 1 ensures that at least one of these conditions must be met.
Simple Query String for low priority results
{
"bool": {
"should": [
{
"simple_query_string": {
"query": "*crazy hat pte ltd*",
"fields": [
"companyName",
"companyName._2gram",
"companyName._3gram"
],
"default_operator": "AND",
"minimum_should_match": "100%",
"analyze_wildcard": true,
"boost": 0.5
}
}
],
"must_not": [
{
"term": {
"companyName.keyword": {
"value": "crazy hat pte ltd",
"boost": 10
}
}
},
{
"match": {
"searchCompanyKey": {
"query": "crazy hat pte ltd",
"minimum_should_match": "1",
"boost": 5
}
}
},
{
"match": {
"companyName": {
"query": "crazy hat pte ltd",
"fuzziness": "AUTO",
"operator": "and",
"boost": 3
}
}
}
],
"minimum_should_match": 1
}
}
This query uses simple_query_string for more advanced text matching:
- It searches for the company name with wildcards at the start and end.
- It searches in companyName and its n-gram fields, allowing for partial matches.
- The AND operator and "minimum_should_match": "100%" ensure all terms must be present.
- analyze_wildcard: true allows the wildcard query to be analyzed.
- It has a lower boost (0.5), making it less influential than the other queries.

The must_not clause negates the previous nested query. It ensures that we don't show any wildcard results if any of the exact matches are met.
Size and From
Usual pagination, nothing special.
"size": 10,
"from": 0,
Sort
"sort": [
{
"_score": "desc"
}
]
This sorts the results by their relevance score in descending order.
Final Query
{
"query": {
"bool": {
"filter": [
{
"term": {
"parentCompanyId": "0demoFileID"
}
},
{
"term": {
"isCustomer": true
}
}
],
"should": [
{
"bool": {
"should": [
{
"term": {
"companyName.keyword": {
"value": "crazy hat pte ltd",
"boost": 10
}
}
},
{
"match": {
"searchCompanyKey": {
"query": "crazy hat pte ltd",
"minimum_should_match": "1",
"boost": 5
}
}
},
{
"match": {
"companyName": {
"query": "crazy hat pte ltd",
"fuzziness": "AUTO",
"operator": "and",
"boost": 3
}
}
}
],
"minimum_should_match": 1
}
},
{
"bool": {
"must_not": [
{
"term": {
"companyName.keyword": {
"value": "crazy hat pte ltd"
}
}
},
{
"match": {
"searchCompanyKey": {
"query": "crazy hat pte ltd",
"minimum_should_match": "1"
}
}
},
{
"match": {
"companyName": {
"query": "crazy hat pte ltd",
"fuzziness": "AUTO",
"operator": "and"
}
}
}
],
"should": [
{
"simple_query_string": {
"query": "*crazy hat pte ltd*",
"fields": [
"companyName",
"companyName._2gram",
"companyName._3gram"
],
"default_operator": "AND",
"minimum_should_match": "100%",
"analyze_wildcard": true,
"boost": 0.5
}
}
],
"minimum_should_match": 1
}
}
],
"minimum_should_match": 1
}
},
"size": 10,
"from": 0,
"sort": [
{
"_score": "desc"
}
]
}
Final index
{
"indexName": "companies",
"settings": {
"analysis": {
"char_filter": {
"remove_punctuation": {
"type": "pattern_replace",
"pattern": "[.,]",
"replacement": " "
}
},
"analyzer": {
"whitespace_lowercase": {
"type": "custom",
"tokenizer": "whitespace",
"filter": ["lowercase", "asciifolding", "remove_punctuation"]
},
"company_name_analyzer": {
"type": "custom",
"char_filter": ["remove_punctuation"],
"tokenizer": "standard",
"filter": ["lowercase", "asciifolding", "remove_punctuation", "company_length", "company_stop"]
}
},
"normalizer": {
"case_insensitive": {
"type": "custom",
"char_filter": [],
"filter": ["lowercase", "asciifolding"]
}
},
"filter": {
"remove_punctuation": {
"type": "pattern_replace",
"pattern": "[.,]",
"replacement": " "
},
"company_stop": {
"type": "stop",
"stopwords": ["ltd", "pte", "pt", "bhd", "sdn", "pty", "limited", "enterprise", "and", "bin", "gmail", "pvt", "inc", "cv", "hotmail"]
},
"company_length": {
"type": "length",
"min": 3
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "keyword"
},
"companyName": {
"type": "text",
"analyzer": "company_name_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "case_insensitive"
}
}
},
"parentCompanyId": {
"type": "keyword"
},
"searchCompanyKey": {
"type": "text",
"analyzer": "whitespace_lowercase",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "case_insensitive"
}
}
},
// other fields below ..
}
},
"severity": "INFO",
"message": ""
}
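One final note if you want to reproduce this setup directly: indexName, severity, and message belong to the wrapper object above, not to the Elasticsearch API. Only settings and mappings go into the create-index request, roughly like this (the // comments mark where the blocks shown above are pasted):

PUT /companies
{
  "settings": {
    "analysis": {
      // char_filter, analyzer, normalizer and filter definitions from above
    }
  },
  "mappings": {
    "properties": {
      // field definitions from above
    }
  }
}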