How to implement a visual search in no time
I’ve been using the Google Lens app for a while now and have to admit it has helped me more times than I expected. Whenever I had to remind myself how often I should water my plants but didn’t remember their names, or when I wanted to check the price of an item, I would grab my smartphone, take a photo, and have the answer in a few seconds. It is a great general-purpose tool, but it struggles if we have some additional criteria to apply or are interested in a specific domain.
The decline in the quality of Google search results has been broadly discussed. There is a great Twitter thread describing some domains in which the results are typically irrelevant.
I’m not sure if SEO for images is a thing yet, but Google used to produce better results for textual queries before everybody started “optimizing” their content. Even if the algorithm behind image search is not yet affected, it has another drawback. What if I don’t want to search the whole web, but want to find an interesting product in the stock of a particular shop? And what if I know the item should have a similar shape, but some other attributes should differ from the ones in the image? I may want to buy a particular car model, but not necessarily in the colour seen in the photograph.
Combining image search capabilities with traditional faceted search seems to be a great idea in many domains. And it turns out it is not that hard to create your own powerful search engine.
Custom visual search engine
The H&M Personalized Fashion Recommendations competition on Kaggle has come to an end. It challenged participants to predict which items H&M’s customers were going to buy, giving them lots of tabular, textual and image data and leaving the choice of approach open. For us, the most important part is the photos of real products, along with metadata like colour, product type, etc.
Those images, together with product attributes, make a powerful dataset to create a visual search system.
Process design
Embeddings / vectorization
Given an image, we need to encode it into a vector of fixed size. This representation, typically called an embedding, is usually generated by a deep neural network. In the case of the H&M images, I decided to use a pre-trained ResNet18 without the last layer, which performs classification into 1000 classes. The layer before it has 512 neurons and directly gives us an embedding in a 512-dimensional space.
As the original ResNet network had been trained on 3-channel, 224x224 pixel images, our database required some conversion, but that’s done automatically by img2vec, a library that reduces the boilerplate required to create image embeddings. Just a few lines are needed to have the vectorization done:
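A minimal sketch of that step, assuming the img2vec_pytorch package and a hypothetical image path, might look like this (the exact snippet used in the project may differ):

```python
from img2vec_pytorch import Img2Vec
from PIL import Image

# Img2Vec defaults to a pre-trained ResNet-18 and returns the 512-dimensional
# activations of the layer just before the final classification head.
img2vec = Img2Vec(cuda=False)

# Hypothetical path to one of the H&M product photos
image = Image.open("images/0108775015.jpg").convert("RGB")

embedding = img2vec.get_vec(image)  # numpy array of shape (512,)
```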
This process was repeated for all the images, so at the end we had over 100k embeddings, along with some other attributes of each article.
It’s worth noting that I didn’t fine-tune the ResNet18 model and just used the original weights, so I literally didn’t spend any time on neural network training.
Vector lookup
We can now compare new examples with the ones that we have in our database and choose the most similar ones, using a selected distance function. The naive approach would be to run a k-NN algorithm, but that would require comparing the query with all the other vectors at each search, which wouldn’t be efficient once we had hundreds of thousands or even millions of entries. It’s somewhat similar to a full table scan in SQL databases. As long as the number of rows is low, we don’t need to worry. However, if we want to keep performance at a reasonable level, then we need to approach both problems differently. For SQL databases we typically create indices, so we can access specific subsets of the tables. For vectors, the only reasonable answer is Approximate Nearest Neighbors (ANN) search.
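To make the cost of the naive approach concrete, here is a small NumPy sketch of a brute-force nearest-neighbour lookup over cosine similarity; every query has to touch all stored vectors, which is exactly the “full table scan” problem described above:

```python
import numpy as np

def knn_full_scan(query: np.ndarray, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k most similar embeddings (cosine similarity)."""
    # Normalise vectors so a plain dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = normed @ query            # one similarity score per stored vector
    return np.argsort(-scores)[:k]     # highest similarity first

# With ~100k 512-dimensional embeddings, every single query scans the whole matrix
embeddings = np.random.rand(100_000, 512).astype(np.float32)
query = np.random.rand(512).astype(np.float32)
top5 = knn_full_scan(query, embeddings)
```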
Nowadays you can perform ANN search without going much into detail, as there are several vector libraries and databases available. Since I wanted to apply faceted filters on top of the results, Qdrant was a great choice here. It uses HNSW with some tweaks that allow those additional filtering capabilities. They are especially useful if we are looking for a piece of clothing with a cut we like, but in a different colour or fabric.
Qdrant exposes REST and gRPC interfaces, so it is fairly easy to integrate, and if you’re using Python, there is already a qdrant-client library that makes things even easier.
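A minimal sketch of how the articles could be stored and queried with recent versions of qdrant-client is shown below. The collection name, payload fields and id are assumptions made for illustration, and `embedding` / `query_embedding` are 512-dimensional vectors produced by the img2vec step above:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)

client = QdrantClient("localhost", port=6333)

# Create a collection sized for the 512-dimensional ResNet18 embeddings
client.recreate_collection(
    collection_name="hm-articles",  # assumed collection name
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

# Store each embedding together with the product attributes as payload
client.upsert(
    collection_name="hm-articles",
    points=[
        PointStruct(
            id=108775015,  # article id from the dataset
            vector=embedding.tolist(),
            payload={"product_type": "Jacket", "colour": "Metallic"},
        ),
    ],
)

# Search for visually similar items, restricted to a given colour
hits = client.search(
    collection_name="hm-articles",
    query_vector=query_embedding.tolist(),
    query_filter=Filter(
        must=[FieldCondition(key="colour", match=MatchValue(value="Metallic"))]
    ),
    limit=5,
)
```

The payload filter is what turns plain vector similarity into faceted search: the HNSW traversal only considers points whose attributes match the given conditions.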
Live demo
Since CLI tools are rarely used by people looking for new clothes, we wrapped everything in Streamlit and ended up with a simple UI. We’ve published a demo that allows you to search for visually similar products based on a provided image or one chosen from a predefined list.
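A stripped-down sketch of such a Streamlit front end, tying the previous pieces together (again with assumed collection and field names, not the exact demo code), could look like this:

```python
import streamlit as st
from PIL import Image
from img2vec_pytorch import Img2Vec
from qdrant_client import QdrantClient

img2vec = Img2Vec(cuda=False)                 # ResNet18 by default, 512-d output
client = QdrantClient("localhost", port=6333)

uploaded = st.file_uploader("Upload a product photo", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Query image")

    # Embed the uploaded photo and look up the most similar articles
    vector = img2vec.get_vec(image)
    hits = client.search(
        collection_name="hm-articles",        # assumed collection name
        query_vector=vector.tolist(),
        limit=6,
    )
    for hit in hits:
        st.write(hit.payload)                 # product attributes of each match
```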
Looking for a jacket, but prefer a metallic one? That’s now pretty simple.
If you are interested in more technical details, the source code is open and available on GitHub: https://github.com/qdrant/demo-hnm.