Building a Domestic Business Categorizer

Published on 2026-01-16

While working on a personal budgeting project, I ran into a problem: it is very hard to categorize transactions automatically from their descriptions. A transaction description often contains only a business name, and it is hard for an LLM to infer the budget category from the name alone. Many business names, such as Apple, have nothing to do with the company's economic activity; nowadays a name is more often a marketing tool. The problem is even more pronounced for transactions with local businesses. For example, Kim Phat is an Asian grocery store that exists only in Quebec, and its name has nothing to do with groceries. Furthermore, many business names in Quebec are in French, which makes automatic categorization even more difficult.

After some research, I found a few possible solutions.

The easiest is to use a large LLM, for example one with 70 billion parameters, and hope that the model knows the business from its training data. However, this is both expensive and unreliable. It is expensive because it requires purchasing API tokens from a cloud provider such as OpenAI. It is unreliable because an LLM's knowledge depends entirely on its training data: a business created yesterday is not in that data, so the model cannot know it.

The second option is to find a public API that can look up a business by name and return its category or description. I could not find such an API. The Quebec government provides a website for searching businesses, but it does not expose an API. It would be possible to write a web crawler against the website, but that would be slow and could cause legal issues. There is, however, a way to use the Google Maps API to search for the business and then filter the results with an LLM or simply a regex.

The third solution is to take a zero-shot model and use supervised fine-tuning (SFT) to teach it to recognize all the domestic businesses. Although this also requires additional resources, such as powerful hardware and a high-quality dataset, it is very feasible. It is my plan B because I have limited experience with SFT, so a good final result is not guaranteed.

The last solution is a hybrid RAG approach. It requires the fewest resources but should still give high-quality results. I found a dataset of all business registrations in Quebec on Données Québec; it contains each business's name and description. The plan is to look up the description by business name and then feed it to a zero-shot model to categorize the business, and thus the transaction. I will first run a full-text search on business names. If that does not work because of masked names or acronyms, I will fall back to a semantic search. Once I have the matching business name, I have its description, and a zero-shot classification over that description should put the business into a suitable personal budget category. I do not pre-label each business activity when creating the embeddings because I do not want to lose flexibility. For example, if someone else wants to use my tool, zero-shot classification lets them pass in their own categories, even in a different language.
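The full-text lookup step could be sketched roughly as follows. This is only an illustration under assumptions: the post does not say which database is used, so SQLite with its FTS5 extension (which must be compiled into the local SQLite build) is my choice here, and the table name `business`, the helpers `build_index` and `find_description`, and the sample rows are all hypothetical.

```python
import re
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 index from (name, description) pairs.

    Assumption: the registry has already been reduced to these two fields.
    """
    conn = sqlite3.connect(":memory:")
    # Only the name is searchable; the description is stored but not indexed.
    conn.execute(
        "CREATE VIRTUAL TABLE business USING fts5(name, description UNINDEXED)"
    )
    conn.executemany(
        "INSERT INTO business (name, description) VALUES (?, ?)", rows
    )
    return conn


def find_description(conn, transaction_text):
    """Return the best (name, description) match, or None if nothing matches."""
    tokens = re.findall(r"\w+", transaction_text.lower())
    if not tokens:
        return None
    # Quote every token so it is treated as a plain term, and OR them together
    # so that the row matching the most keywords tends to rank first.
    query = " OR ".join(f'"{token}"' for token in tokens)
    return conn.execute(
        "SELECT name, description FROM business "
        "WHERE business MATCH ? ORDER BY rank LIMIT 1",
        (query,),
    ).fetchone()
```

A transaction string such as "KIM PHAT BROSSARD QC" would be split into keywords, and the registry row whose name matches best would be returned together with its description, ready to be handed to the zero-shot classifier.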

For businesses that are not in the database, I will implement a fallback that uses an MCP server such as Brave Search. The categorizer will then try to categorize the business from the search results.
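The control flow of this fallback might look like the sketch below. It is not the actual implementation: the function `categorize` and its three injected callables (`lookup_description`, `web_search`, `classify`) are hypothetical names, chosen so that the database lookup, the search tool, and the zero-shot model can each be swapped out.

```python
from typing import Callable, Optional, Sequence


def categorize(
    business_name: str,
    categories: Sequence[str],
    lookup_description: Callable[[str], Optional[str]],
    web_search: Callable[[str], Optional[str]],
    classify: Callable[[str, Sequence[str]], str],
) -> Optional[str]:
    """Categorize a business, falling back to web search when needed.

    lookup_description: returns the registry description, or None if absent.
    web_search: returns text gathered from a search tool, or None.
    classify: picks one of the categories for a piece of text
              (for example, a zero-shot classifier).
    """
    text = lookup_description(business_name)
    if text is None:
        # Not in the local database: try the web search fallback.
        text = web_search(business_name)
    if text is None:
        # Nothing to classify; let the caller decide what to do.
        return None
    return classify(text, categories)
```

Because the categories are passed in at call time, a user can supply their own list, which matches the flexibility goal described earlier.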

Update 2026-02-14: Search Strategy Change

After a lot of testing, I decided to abandon semantic search on business names because it is not effective and is not a good fit for this use case. Without more context or a description, a business name carries very little semantic meaning. A semantic search will always return a result, but that result may be irrelevant to the name being searched. Full-text search (FTS) is better suited here, as I only need to find the business names that match the most keywords. I still use semantic search, but in another place: the web search fallback for when the business name is not found in the database. In fact, it is common for a business name to be missing from my database, since many businesses are international, such as Amazon and apple.com. In that case, I search for the business using Google's API and use semantic search to filter out irrelevant results. This approach has given me the highest-quality results so far.
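Filtering web search results by semantic similarity could be sketched like this. The post does not say which embedding model or threshold is used, so the `embed` callable, the `filter_results` helper, and the default threshold of 0.5 are assumptions; any function that maps text to a vector can be plugged in.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_results(
    query: str,
    snippets: Sequence[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off; tune on real data
) -> List[str]:
    """Keep snippets similar enough to the query, most similar first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(snippet)), snippet) for snippet in snippets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for score, snippet in scored if score >= threshold]
```

Unlike a nearest-neighbour search over business names, this keeps nothing when no result is similar enough, so an irrelevant hit is dropped rather than passed on to the classifier.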