Building a semantic search system with txtai and Qdrant

Kacper Łukawski
5 min read · Oct 11, 2022


Building a semantic search application requires putting several pieces together. Assuming you already have some data, you still have to choose a model that converts it into embeddings. There is a wide variety of pretrained models available, but building your own service around the selected model takes quite a lot of time. This is where txtai can save you a lot of effort, as it reduces the boilerplate you would otherwise need to write to implement everything on your own.

Semantic search with txtai and Qdrant

According to the official docs, txtai is, among other things, a “large-scale similarity search with multiple index backends”. However, all the built-in txtai ANN backends are libraries, not production-grade tools, and search might become a bottleneck if you ever want to scale things up. If you take semantic search seriously, you are better off starting with a proper, reliable database designed to perform the search efficiently, like Qdrant. You would never build a service around an in-memory engine for relational data instead of keeping it in a relational database, so why do that for vectors? A vector database is key if you want to make the system reliable and be able to scale it up in the future.

Fortunately, txtai 5.0 allows integrating a custom backend. We created a simple wrapper library so you can avoid doing that yourself. It is available as a separate package which you can install using pip.

pip install qdrant-txtai
Installing qdrant-txtai with pip

If you already use txtai, then the transition to Qdrant is seamless and requires setting the backend property of your configuration:

backend: qdrant_txtai.ann.qdrant.Qdrant
YAML configuration for using Qdrant backend in txtai
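A fuller configuration might look like the following sketch. The keys under the `qdrant` section (`host`, `port`) follow qdrant-txtai conventions, but check the library's README for the options supported by your version:

```yaml
path: sentence-transformers/all-MiniLM-L6-v2
backend: qdrant_txtai.ann.qdrant.Qdrant
qdrant:
  host: localhost
  port: 6333
```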

There is also a programmatic way of doing that. In this case, the relevant entry of your application configuration will look like this:

"backend": "qdrant_txtai.ann.qdrant.Qdrant"
Pythonic way of setting up Qdrant backend

End-to-end semantic search system

Let's take a more real-world example. There are a lot of Data Science libraries available on the market, and it would be extremely tough to know all of them perfectly. Thus, your stack choices may be biased if you always pick the tools you already know well. It would be great to have an alternative: a system that shows you what else might be well-suited for a specific use case.

Running the vector database

First of all, we need the Qdrant instance to be running. In the simplest case, you can run it locally with Docker, as follows:

docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant:v0.10.2
Running Qdrant with Docker

Writing the semantic search application with txtai

Now we are finally able to set up the semantic search application with txtai. We're going to do that programmatically with Python. Our application will be pretty straightforward: we're not going to fine-tune any existing model, but simply use one of the pre-trained ones directly, namely all-MiniLM-L6-v2. The model is a configurable property in txtai. We're also going to use Qdrant as the embeddings backend, with cosine distance as the similarity metric.

Setting up all-MiniLM-L6-v2 and Qdrant in txtai
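The setup can be sketched as the configuration below. The exact keys (`metric`, the `qdrant` section) are assumptions based on qdrant-txtai conventions and may differ between versions:

```python
# Programmatic txtai configuration for the Qdrant backend.
config = {
    # Pre-trained sentence-transformers model used to produce embeddings
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    # Use Qdrant instead of the built-in ANN backends
    "backend": "qdrant_txtai.ann.qdrant.Qdrant",
    # Cosine distance as the similarity metric
    "metric": "cosine",
    # Connection details for the local Qdrant instance started above
    "qdrant": {
        "host": "localhost",
        "port": 6333,
    },
}

# Requires `pip install qdrant-txtai` and a running Qdrant instance:
# from txtai.embeddings import Embeddings
# embeddings = Embeddings(config)
```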

First of all, we need to feed the database with some vectors. It can be done by calling the .index method of the created Embeddings instance. It accepts triplets, with the first value being the identifier of a particular entry, the second being the document itself, and the third representing the tags associated with that entry.

Indexing the dataset in txtai
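The dataset can be sketched as below. The entries are hypothetical library descriptions, not the original article's exact data:

```python
# Triplets of (id, document, tags) accepted by Embeddings.index
documents = [
    (0, "PyTorch is an open source deep learning framework", None),
    (1, "pandas provides data structures for tabular data analysis", None),
    (2, "scikit-learn implements classic ML algorithms such as random forests", None),
]

# With the configured Embeddings instance and a running Qdrant server:
# embeddings.index(documents)
```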

In the background, all the texts are going to be vectorized with the selected model and put into Qdrant so we can query them later on.

Querying the system

With just a few lines of code, we've created a semantic search system. It has just a few entries but already allows us to look for a library that might be well-suited for our use case. Let's assume we want to train a random forest model. Which library should we use? The choice is far simpler with our search system.

Performing a search with embeddings and txtai
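A search can be sketched as follows. The `search` call returns (id, score) pairs sorted by decreasing similarity; the scores shown here are hypothetical, for illustration only:

```python
query = "I want to train a random forest model"

# With txtai and a running Qdrant instance this would be:
# results = embeddings.search(query, limit=3)

# Hypothetical results, illustrating the shape of the output:
results = [(2, 0.61), (1, 0.27), (0, 0.19)]

# The top hit is the scikit-learn entry (id = 2)
best_id, best_score = results[0]
```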

We've selected the top 3 entries, and the winner is clearly the document with id = 2, as its score is more than twice as high as the second result's. So scikit-learn is the best choice among all the options available in our database.

Summary

We've just created a simple search system. Thanks to txtai, Qdrant and pretrained language models, it was easy to build, yet already powerful enough to support even some more complex cases. The cool thing about semantic search is that none of the words used in a query needs to appear in any document in our dataset, as the model is capable of capturing synonyms. This is a huge advantage over conventional search algorithms like BM25.

If you want to start using txtai together with Qdrant, the library is available on GitHub: https://github.com/qdrant/qdrant-txtai.
