
Build a Semantic Search Engine with Python & ChromaDB (Tutorial)

Building advanced AI and NLP applications often requires more than simple keyword matching. Developers and data scientists frequently encounter scenarios where users express their intent using different phrasing than the exact words in the documents. This limitation makes traditional search engines less effective for sophisticated projects. To overcome this, creating a semantic search engine with Python offers a powerful solution, allowing systems to understand the meaning and context behind queries, not just the keywords.

This tutorial will guide you through creating a semantic search engine using Python, leveraging Sentence Transformers for generating meaningful text embeddings and ChromaDB for efficient vector storage and retrieval. As we will see, querying a database this way returns results by similarity of meaning, not by keyword overlap.


Beyond Keyword Search: What is Semantic Search?

Imagine you search for "Jaguar speed." A traditional keyword-based search might return results about both Jaguar cars and the animal, as both contain the word "Jaguar." It lacks the contextual understanding to differentiate your intent. Now, if your subsequent query is "how many wheels does it have?", a semantic search engine would understand that you are referring to the car, because it comprehends the meaning and relationship between words and phrases, not just their literal presence.

Semantic search moves beyond simple lexical matching by using machine learning models to convert text into numerical representations called embeddings. These embeddings capture the semantic meaning of text, allowing the system to find documents that are conceptually similar to a query, even if they don't share exact keywords.
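To make the idea of "conceptually similar" concrete, here is a minimal sketch using toy 3-dimensional vectors (real embedding models produce hundreds of dimensions). Cosine similarity, a common way to compare embeddings, measures how closely two vectors point in the same direction. The vectors and their values below are invented purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: close to 1.0 means similar meaning, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, far smaller than real model output)
jaguar_car = [0.9, 0.1, 0.2]
sports_car = [0.8, 0.2, 0.1]
jungle_cat = [0.1, 0.9, 0.8]

print(cosine_similarity(jaguar_car, sports_car))  # high: related concepts
print(cosine_similarity(jaguar_car, jungle_cat))  # low: different concepts
```

A real model places texts with related meanings close together in exactly this sense, which is what lets a semantic engine match "how many wheels does it have?" to car-related documents.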

Our Project: A Search Engine for Famous Quotes

For this tutorial, we will build a semantic search engine for a collection of famous quotes. Our goal is to index a list of insightful quotes and then let users find the most relevant quote for a given query. For instance, a user might ask "what makes a person strong?" or "what are the top quotes on life motivation?". Our engine should return quotes that semantically align with strength, resilience, or personal growth, even if those exact words aren't present in the best-matching quote.

This project is a compact example of vector search in Python, demonstrating how to handle unstructured text data and retrieve information based on conceptual similarity. It also lays the groundwork for more complex applications, such as recommendation systems or advanced chatbots.

The Tech Stack: Sentence Transformers + ChromaDB

To achieve our semantic search capabilities, we'll rely on two powerful tools:

- Sentence Transformers: a Python library of pre-trained transformer models that convert sentences into dense vector embeddings capturing their meaning.
- ChromaDB: an open-source vector database that stores those embeddings and efficiently retrieves the ones closest to a query.

We will import both libraries as we go, setting up our environment for embedding generation, storage, and retrieval.

Step 1: Setup and Data Preparation

First, let's install the required libraries and prepare our dataset. We'll use a simple list of famous quotes for this example. This initial step ensures all dependencies are met for the tools we'll be importing and using, making our raw text data ready for processing.


# Install necessary libraries
!pip install sentence-transformers chromadb

# Our dataset: a list of famous quotes
quotes = [
    "The only way to do great work is to love what you do.",
    "Believe you can and you're halfway there.",
    "The future belongs to those who believe in the beauty of their dreams.",
    "Strive not to be a success, but rather to be of value.",
    "The mind is everything. What you think you become.",
    "That which does not kill us makes us stronger.",
    "The best way to predict the future is to create it.",
    "Life is 10% what happens to us and 90% how we react to it.",
    "The only impossible journey is the one you never begin.",
    "Success is not final, failure is not fatal: it is the courage to continue that counts."
]
    

Understanding how to prepare data is a foundational skill for anyone looking to build a search engine with NLP capabilities.
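Our quote list above is already clean, but raw text rarely is. As a minimal sketch of typical preparation, the hypothetical helper below strips whitespace, drops empty strings, and removes case-insensitive duplicates before indexing (duplicate documents would otherwise crowd the search results):

```python
def prepare_quotes(raw_quotes):
    """Strip whitespace, drop empty strings, and remove case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for q in raw_quotes:
        q = q.strip()
        key = q.lower()
        if q and key not in seen:
            seen.add(key)
            cleaned.append(q)
    return cleaned

raw = [
    "  Believe you can and you're halfway there. ",   # extra whitespace
    "believe you can and you're halfway there.",      # duplicate, different case
    "",                                               # empty entry
]
print(prepare_quotes(raw))  # a single cleaned quote remains
```

Steps like these keep the index small and prevent near-identical results from dominating the top matches.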

Step 2: Generating Embeddings

Now, we'll load a pre-trained Sentence Transformer model and convert our list of quotes into numerical vector embeddings. We'll use 'all-MiniLM-L6-v2', a compact general-purpose model that offers a good balance of speed and quality.


from sentence_transformers import SentenceTransformer

# Load a pre-trained Sentence Transformer model
# 'all-MiniLM-L6-v2' is a good balance of speed and performance
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for each quote
quote_embeddings = model.encode(quotes)

print(f"Number of quotes: {len(quotes)}")
print(f"Shape of embeddings: {quote_embeddings.shape}")
    

Each quote is now represented by a high-dimensional vector. These vectors are the core of semantic search with Sentence Transformers, allowing us to compare the meaning of different pieces of text.

Step 3: Storing Embeddings in ChromaDB

With our embeddings generated, the next step is to store them efficiently in ChromaDB. We'll initialize a ChromaDB client, create a collection (which is like a table in a traditional database), and then add our quotes and their corresponding embeddings. This utilizes the ChromaDB library we imported earlier.


import chromadb

# Initialize ChromaDB client (using a persistent client for local storage)
client = chromadb.PersistentClient(path="./chroma_db")

# Create or get a collection
# A collection is where your embeddings, documents, and metadata are stored
collection_name = "famous_quotes_collection"
collection = client.get_or_create_collection(name=collection_name)

# Prepare data for ChromaDB
# Each document needs a unique ID
ids = [f"quote_{i}" for i in range(len(quotes))]

# Add documents, embeddings, and (optionally) metadata to the collection
# Note: re-running add() with the same ids raises a duplicate-ID error;
# collection.upsert() makes re-runs idempotent
collection.add(
    documents=quotes,
    embeddings=quote_embeddings.tolist(), # convert the NumPy array to plain Python lists
    ids=ids
)

print(f"Added {collection.count()} documents to the collection '{collection_name}'.")
    

ChromaDB handles the indexing and storage, making it easy to perform fast similarity searches later. This step is fundamental to any practical semantic search engine implementation in Python.
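To demystify what the database is doing, here is a brute-force sketch of the same idea in plain Python: compute the L2 distance (ChromaDB's default metric) from the query vector to every stored vector and keep the closest ones. The tiny 2-D vectors below are hypothetical stand-ins for real embeddings; ChromaDB's approximate index does this far more efficiently at scale.

```python
import math

def top_n_by_l2(query_vec, doc_vecs, docs, n=3):
    """Brute-force nearest neighbours: L2 distance to every vector, then sort."""
    def l2(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scored = sorted(
        zip(docs, (l2(query_vec, v) for v in doc_vecs)),
        key=lambda pair: pair[1],  # smaller distance = more similar
    )
    return scored[:n]

# Hypothetical 2-D embeddings standing in for real model output
docs = ["strength quote", "dream quote", "success quote"]
vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.9, 0.1]

print(top_n_by_l2(query, vecs, docs, n=2))  # nearest documents first
```

This linear scan is O(number of documents) per query; vector databases exist precisely to avoid that cost on large collections while returning the same kind of ranked, distance-scored results.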

Step 4: Building the Search Function

Finally, let's create a Python function that encapsulates the search logic. This function will take a user query, encode it into an embedding using the same Sentence Transformer model, and then query our ChromaDB collection to find the top N most similar documents. This forms the core of the semantic search engine we set out to build.


def semantic_search(query: str, top_n: int = 3):
    """
    Performs a semantic search against the ChromaDB collection.

    Args:
        query (str): The user's search query.
        top_n (int): The number of top similar results to return.

    Returns:
        list: A list of dictionaries, each containing 'document' and 'distance'.
    """
    # Encode the user query into an embedding
    query_embedding = model.encode([query]).tolist()

    # Query the ChromaDB collection
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_n,
        include=['documents', 'distances']
    )

    # Process and return the results
    search_results = []
    if results and results['documents']:
        for i in range(len(results['documents'][0])):
            search_results.append({
                "document": results['documents'][0][i],
                "distance": results['distances'][0][i]
            })
    return search_results

print("Semantic search function ready.")
    

This function forms the core of our search engine, bridging the user's natural language input with the vector-based storage. For those interested in further exploring advanced NLP techniques, Juno School offers a full NLP Course that covers topics like this in more detail.

Putting It All Together: Let's Search!

Now that all components are in place, let's test our semantic search engine with a few example queries. We'll observe how it intelligently retrieves relevant quotes based on meaning, not just keywords.


# Example Query 1: What makes a person strong?
print("Query: What makes a person strong?")
results_strong = semantic_search("What makes a person strong?", top_n=3)
for res in results_strong:
    print(f"  - Document: \"{res['document']}\" (Distance: {res['distance']:.4f})")
# Expected: "That which does not kill us makes us stronger."

print("\n")

# Example Query 2: Top quotes on life motivation
print("Query: Top quotes on life motivation")
results_motivation = semantic_search("Top quotes on life motivation", top_n=3)
for res in results_motivation:
    print(f"  - Document: \"{res['document']}\" (Distance: {res['distance']:.4f})")
    

When searching for "what makes a person strong," our semantic engine provides highly relevant matches, such as the quote "That which does not kill us makes us stronger." This demonstrates its ability to understand the intent behind the query rather than just matching keywords. Similarly, a query asking for "top quotes on life motivation" would yield quotes that inspire and encourage, even if the words "life" or "motivation" aren't explicitly in every result.

This example showcases the power of a semantic search engine implemented in Python. It highlights how vector databases like ChromaDB, combined with modern embedding models, can transform information retrieval, making it more intelligent and user-friendly, and it provides a foundation for building more sophisticated NLP applications.

Ready to level up your career?

Join 5 lakh+ learners on the Juno app. Certificate courses in Hindi and English.
