How to build semantic search over HubSpot in less than an hour

Dave Chaplin

Jul 10, 2025

Picture this: You open HubSpot to find a quick answer: What’s the latest update on the Client A deal?

Instead of a direct answer, you get lots of tabs, including a contact record, the company page, an email thread from February, and a custom object labeled “Q4 Playbook – Client A” created by someone who has since left the company.

Each one holds a piece of the answer, but none of them tells you the full story.

So, you try to get more specific and type “client A negotiation terms” into the search bar. But it lists out every mention of “terms” across your CRM. Again, not helpful.

As a result, you start scrolling. And clicking. And copy-pasting. Ten minutes later, you still don’t have an answer.

It’s not a data problem. It’s a retrieval problem. HubSpot knows everything, but it can’t tell you anything (not without searching for an hour at least). That’s why we built semantic search for HubSpot, so you can finally talk to your sales data: ask it questions like you would a teammate, and get answers like a teammate would give.

Here’s your step-by-step guide to get from CRM chaos to clarity in under an hour, without custom scripts, ETL pipelines, or vector DB guesswork.

Stack overview

Component | Purpose
Python | Core scripting and orchestration
HubSpot SDK | Access and interact with HubSpot data
Ducky API | Semantic search, embeddings, retrieval
API framework (optional) | Build integration endpoints (e.g., FastAPI/Flask)
Slack API | Frontend integration (slack-bolt and slack-sdk)

Step-by-step walkthrough of building semantic search over HubSpot

Here is a step-by-step walkthrough of building semantic search over HubSpot. The same approach works for any CRM; we chose HubSpot as the example.

1. Set up your environment

Start by preparing your local machine by installing dependencies and configuring API credentials for HubSpot and Ducky.

  • Install Python and required libraries:

pip install hubspot-api-client ducky-client
# Also imported by the snippets later in this guide:
pip install python-dotenv requests slack-bolt
  • Obtain API credentials for:

    • HubSpot

    • Ducky
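One way to keep those credentials out of your code is to read them from environment variables and fail fast when one is missing. The sketch below is only an illustration: DUCKY_API_KEY and DUCKY_INDEX_NAME match the names used later in this walkthrough, while HUBSPOT_ACCESS_TOKEN is an assumed name that you can change to fit your setup.

```python
import os
from typing import Dict, Mapping

# DUCKY_* names match the indexing snippet in this guide;
# HUBSPOT_ACCESS_TOKEN is an assumed name for your HubSpot private app token.
REQUIRED_VARS = ("HUBSPOT_ACCESS_TOKEN", "DUCKY_API_KEY", "DUCKY_INDEX_NAME")


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling load_settings() once at startup surfaces a missing key immediately, rather than halfway through an export or indexing run.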

2. Connect to HubSpot

Once your environment is ready, use the HubSpot SDK to authenticate and fetch CRM data (contacts, deals, notes, etc.).

from hubspot import HubSpot
# HubSpot retired API keys in late 2022; pass a private app access token instead
hs = HubSpot(access_token="YOUR_ACCESS_TOKEN")
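Fetching more than one page of records means following HubSpot's paging cursor. The following is a minimal sketch of that loop, not the logic inside fetch_hubspot.py. It assumes a private app access token and the SDK's crm.deals.basic_api.get_page call; the helper names iter_pages and fetch_all_deals are made up for illustration.

```python
from typing import Callable, Iterator, List, Optional


def iter_pages(get_page: Callable[..., object],
               properties: List[str],
               limit: int = 100) -> Iterator[object]:
    """Yield every record from a cursor-paginated HubSpot list endpoint."""
    after: Optional[str] = None
    while True:
        page = get_page(limit=limit, after=after, properties=properties)
        yield from page.results
        # The last page carries no paging.next cursor
        paging = getattr(page, "paging", None)
        next_page = getattr(paging, "next", None) if paging else None
        if next_page is None:
            break
        after = next_page.after


def fetch_all_deals(access_token: str) -> list:
    """Collect all deals with a few properties (hypothetical helper)."""
    from hubspot import HubSpot  # imported here so iter_pages works without the SDK

    hs = HubSpot(access_token=access_token)
    return list(iter_pages(hs.crm.deals.basic_api.get_page,
                           properties=["dealname", "amount", "dealstage"]))
```

Because iter_pages only needs a get_page callable, the same loop should work for contacts or companies by passing the matching basic_api.get_page method.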

Example: Fetching and exporting HubSpot data

You can also use the included fetch_hubspot.py script to handle pagination, associations, and CSV output automatically:

python src/fetch_hubspot.py
# Generates data/hubspot_multi_with_activities.csv with deals, contacts, companies, and activities

Understanding the CSV columns

This example outputs a CSV with fields like Deal ID, Deal Name, Amount, Deal Stage, Pipeline, Close Date, Contact ID, Activity Type, Activity Date, Activity Subject, and Activity Body. You can modify the csv_headers list in fetch_hubspot.py to fetch any HubSpot properties you need.

Snippet from fetch_hubspot.py showing header definition and row construction:

csv_headers = [
    'Deal ID', 'Deal Name', 'Amount', 'Deal Stage', 'Pipeline', 'Close Date',
    # ... other contact/company fields ...
    'Activity ID', 'Activity Type', 'Activity Date', 'Activity Subject', 'Activity Body'
]
# When writing rows:
row = {
    'Deal ID': deal.id,
    'Deal Name': get_property_value(deal, 'dealname'),
    # ...
    'Activity Date': format_date(get_property_value(activity_details, 'hs_timestamp'), '%m/%d/%Y %H:%M'),
    'Activity Subject': get_property_value(activity_details, 'hs_call_title') or get_property_value(activity_details, 'hs_meeting_title'),
    'Activity Body': get_property_value(activity_details, f"hs_{activity_type}_body"),
}
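The snippet above relies on two helpers, get_property_value and format_date, whose bodies are not shown. A plausible reading of what they do is sketched below; these are guesses rather than the script's actual code, and they assume HubSpot returns hs_timestamp as an ISO 8601 string.

```python
from datetime import datetime
from types import SimpleNamespace
from typing import Any, Optional


def get_property_value(obj: Any, name: str, default: str = "") -> str:
    """Read one property from a HubSpot SDK object, tolerating missing values."""
    properties = getattr(obj, "properties", None) or {}
    value = properties.get(name)
    return default if value is None else str(value)


def format_date(value: Optional[str], fmt: str) -> str:
    """Reformat an ISO 8601 timestamp such as '2025-02-14T16:30:00Z'."""
    if not value:
        return ""
    # datetime.fromisoformat() on older Pythons does not accept a trailing 'Z'
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed.strftime(fmt)


activity = SimpleNamespace(properties={"hs_timestamp": "2025-02-14T16:30:00Z"})
print(format_date(get_property_value(activity, "hs_timestamp"), "%m/%d/%Y %H:%M"))
# prints 02/14/2025 16:30
```

Returning an empty string for missing properties is what lets the `or` fallback between hs_call_title and hs_meeting_title work as written.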

3. Extract and preprocess data

The next step is to select and clean relevant CRM fields… or don't. Ducky will do the chunking for you.

  • Select relevant fields:

    • Notes

    • Emails

    • Deal descriptions

  • Optional cleaning of data before sending to Ducky.

Quick tip: The fetch_hubspot.py script already formats and writes a comprehensive CSV file (hubspot_multi_with_activities.csv) with all necessary properties, so you can jump straight into preprocessing.
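If you do choose to clean up before indexing, note and email bodies are the usual candidates, since they often carry HTML markup. Here is a small standard-library sketch; the function name clean_body is hypothetical, and the regex is a rough tag stripper rather than a full HTML parser.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")


def clean_body(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    without_tags = _TAG_RE.sub(" ", raw or "")
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


print(clean_body("<p>Discussed&nbsp;<b>renewal terms</b> with Client A</p>"))
# prints Discussed renewal terms with Client A
```

Tags are removed before entities are decoded so that escaped text such as `&lt;b&gt;` is kept as literal characters instead of being treated as markup.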

4. Send data to Ducky for semantic indexing

Use the Ducky API to upload your CRM data. Ducky will handle:

  • Embedding generation

  • Vector storage

  • Semantic search infrastructure, including chunking & smart re-ranking

Example: Indexing HubSpot activities

Here’s a Python snippet that reads your HubSpot activity CSV and indexes each record:

from duckyai import DuckyAI
import os
from dotenv import load_dotenv
import csv
# Load environment variables
load_dotenv()
client = DuckyAI(api_key=os.getenv("DUCKY_API_KEY"))
index_name = os.getenv("DUCKY_INDEX_NAME")
with open("data/hubspot_multi_with_activities.csv", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        content = f"Deal: {row['Deal Name']} (ID: {row['Deal ID']})\n..."
        metadata = {
            "deal_id": row['Deal ID'],
            "activity_id": row['Activity ID'],
            # ...other fields...
        }
        client.documents.index(
            index_name=index_name,
            content=content,
            metadata=metadata,
        )
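The content string above is abbreviated. One possible way to assemble it from a CSV row is sketched here; the column names come from csv_headers, but which fields to include, and the build_content helper itself, are assumptions rather than what the original script does.

```python
from typing import Mapping

# Column names come from csv_headers; the selection and order are a guess.
CONTENT_FIELDS = (
    "Deal Stage", "Amount", "Close Date",
    "Activity Type", "Activity Date", "Activity Subject", "Activity Body",
)


def build_content(row: Mapping[str, str]) -> str:
    """Turn one CSV row into a labeled, human-readable text block."""
    lines = [f"Deal: {row['Deal Name']} (ID: {row['Deal ID']})"]
    for field in CONTENT_FIELDS:
        value = (row.get(field) or "").strip()
        if value:  # skip empty cells so the text stays compact
            lines.append(f"{field}: {value}")
    return "\n".join(lines)
```

Labeling each value ("Deal Stage: ...") keeps the indexed text readable when it is later shown back to a user in search results.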

5. Query Ducky for semantic search

Use the Ducky API to send search queries and receive relevant CRM records. Integrate results into your workflow, dashboard, or assistant as needed.
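A query wrapper might look like the sketch below. Treat the Ducky call as an assumption: the method name documents.retrieve and its index_name, query, and top_k parameters are modeled on the documents.index call used for indexing and should be checked against the Ducky SDK reference before use.

```python
from typing import Any


def search_crm(client: Any, index_name: str, query: str, top_k: int = 5) -> Any:
    """Run one semantic query against a Ducky index and return the raw response."""
    if not query.strip():
        raise ValueError("query must not be empty")
    # ASSUMPTION: documents.retrieve(index_name=..., query=..., top_k=...) mirrors
    # the documents.index call used for indexing; confirm in the Ducky SDK docs.
    return client.documents.retrieve(index_name=index_name, query=query, top_k=top_k)
```

Keeping the call behind one small function means only this wrapper changes if the SDK's method name or parameters differ from what is assumed here.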

6. Integration with Slack as UI

Use Slack Bolt (Socket Mode) to listen for mentions and forward messages to your FastAPI endpoint:

import os, threading, requests
from slack_bolt import App as SlackApp
from slack_bolt.adapter.socket_mode import SocketModeHandler
# Load environment variables
from dotenv import load_dotenv
load_dotenv()
# Initialize Bolt app
bolt = SlackApp(token=os.getenv("SLACK_BOT_TOKEN"))
app_token = os.getenv("SLACK_APP_TOKEN")
@bolt.event("app_mention")
def handle_mention(body, say):
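    # Drop the leading "<@BOT_ID>" mention; fall back to "" if nothing follows it
    parts = body["event"]["text"].split(maxsplit=1)
    user_text = parts[1] if len(parts) > 1 else ""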
    # Call local FastAPI /chat endpoint
    resp = requests.post(
        "http://localhost:8000/chat",
        json={"message": user_text}, timeout=15
    ).json()
    reply = resp.get("response", "Sorry, I don't know how to respond yet.")
    say(text=reply, thread_ts=body["event"]["ts"])
# Start Bolt in background
def _start_bolt():
    SocketModeHandler(bolt, app_token).start()
threading.Thread(target=_start_bolt, daemon=True).start()

This runs the Slack listener alongside your FastAPI server, enabling a fully interactive Slack UI. Feel free to customize event handlers or add interactive components as needed.
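The Slack handler posts {"message": ...} to /chat and reads the "response" key from the reply, but the endpoint itself is not shown. A minimal sketch of what it could look like follows. The answer and create_app names are hypothetical, and search stands in for whatever function queries Ducky and returns a text reply.

```python
from typing import Callable, Dict


def answer(message: str, search: Callable[[str], str]) -> Dict[str, str]:
    """Build the JSON body the Slack handler reads its reply from."""
    text = message.strip()
    if not text:
        return {"response": "Ask me something about your HubSpot data."}
    return {"response": search(text)}


def create_app(search: Callable[[str], str]):
    """Wire answer() into a FastAPI app exposing POST /chat."""
    from fastapi import FastAPI  # imported lazily so answer() is usable without it
    from pydantic import BaseModel

    class ChatRequest(BaseModel):
        message: str

    app = FastAPI()

    @app.post("/chat")
    def chat(req: ChatRequest):
        return answer(req.message, search)

    return app
```

Separating answer() from the FastAPI wiring makes the reply logic easy to unit test without starting a server.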

7. Test and iterate

Finally, test your setup with real queries. Use the feedback to fine-tune your data extraction and indexing to improve results over time.
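To make "test and iterate" measurable, you could keep a short list of real questions paired with the deal you expect to come back, then track how often that deal shows up near the top. The sketch below assumes a search callable that returns deal IDs in ranked order; that callable and the hit_rate helper are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def hit_rate(cases: Iterable[Tuple[str, str]],
             search: Callable[[str], List[str]],
             k: int = 5) -> float:
    """Fraction of queries whose expected deal ID appears in the top-k results."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(1 for query, expected in cases if expected in search(query)[:k])
    return hits / len(cases)
```

Re-running the same cases after each change to your extraction or content format shows whether the change helped or hurt retrieval.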

How Ducky takes you from HubSpot chaos to search clarity in under an hour

Adding semantic search to your CRM doesn't have to be complicated. Ducky is built to remove the usual roadblocks, such as data cleanup, infrastructure overhead, and ML complexity, so advanced search and retrieval stay simple.

Here's how it all comes together in under an hour.

Doesn't require data cleaning

No need to clean data, write custom chunkers, or fiddle with vector DBs. Ducky is designed to work with the real-world messiness of CRM systems. Instead of forcing you to normalize and clean everything before it’s usable, Ducky ingests your raw data as-is. Just use the API or SDK to upload your data, and Ducky will chunk, embed, and index it automatically.

Handles the messiness of your CRM 

CRM records usually include notes scribbled during calls, half-filled custom objects, and fragmented timelines. Ducky handles it all. It automatically chunks and indexes any format within a vector database for you, so your search actually works across the messy middle.

Ship in under an hour

With Ducky, there’s no waiting on infrastructure setup, provisioning databases, or spending days writing glue code. You can pull your HubSpot data, send it to our API, and start querying almost immediately. This means faster iteration, faster user feedback, and faster value delivery.

No ML expertise required

You don’t need to understand the nuances of machine learning to deliver powerful search. Ducky abstracts complex retrieval logic like reranking, embedding selection, and vector search tuning. You get high-quality semantic results without touching a single model. Just plug in your data and let Ducky handle the rest.

Works for any CRM

Finally, Ducky allows you to build a semantic search for your specific CRM, whether it’s HubSpot, Salesforce, or something custom.

While this walkthrough focused on HubSpot, the same approach applies to other CRMs too. Ducky’s infrastructure is flexible enough to handle a wide variety of CRM tools and internal data sources without changes to your underlying stack. 

Need help with your specific CRM? Reach out to our experts and set up semantic search in less than an hour.

Or, get your Ducky API key and hook up your HubSpot to semantic search.

Other RAG examples you can build today

Here are some other practical ways teams are already using Ducky today. 

No credit card required - we have a generous free tier to support builders