Moss - Real-time Semantic Search for Conversational AI

Wire Moss into a LangGraph graph as a dedicated retrieve node. The graph passes the user query through retrieval, writes the Moss results into shared state, then feeds that context to a generate node — keeping all LLM responses grounded in your knowledge base.

Full example: see the LangGraph cookbook for the complete runnable demo with interactive mode, metadata filter support, and tests.

Why use Moss with LangGraph?

LangGraph’s state-machine model is a natural fit for retrieval-augmented workflows: each node reads from and writes to a shared typed state, so retrieval latency and results are transparent at every step. Moss plugs in as a single async node and keeps query latency in the 1–10ms range when the index is loaded locally — fast enough that retrieval never becomes the bottleneck in a multi-node graph.

Required tools

Moss account with project credentials
Groq API key (or swap in any LangChain-compatible LLM)
Python 3.11+

Integration guide

Installation

uv add langgraph langchain-groq moss python-dotenv

Environment setup

Create a .env file in your project root:

.env

MOSS_PROJECT_ID=your-project-id
MOSS_PROJECT_KEY=your-project-key
MOSS_INDEX_NAME=your-index-name
GROQ_API_KEY=your-groq-api-key
GROQ_MODEL=llama-3.3-70b-versatile
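
A missing credential tends to surface late, as an authentication error inside the graph. As a minimal sketch, the check below reports unset variables up front. `REQUIRED_VARS` and `missing_vars` are hypothetical names, and treating `GROQ_MODEL` as optional is an assumption of this sketch rather than something the guide requires.

```python
import os
from collections.abc import Mapping

# Names taken from the .env file above; GROQ_MODEL is left out on the
# assumption that the application can fall back to a default model.
REQUIRED_VARS = (
    "MOSS_PROJECT_ID",
    "MOSS_PROJECT_KEY",
    "MOSS_INDEX_NAME",
    "GROQ_API_KEY",
)


def missing_vars(env: Mapping[str, str]) -> list[str]:
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# In a real script, call load_dotenv() from python-dotenv first so that the
# values from .env are present in os.environ.
missing = missing_vars(os.environ)
if missing:
    print("Missing settings: " + ", ".join(missing))
```

Passing the environment in as a mapping keeps the check easy to exercise with a plain dict.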

Define the graph state

LangGraph nodes communicate through a shared TypedDict. Moss results slot in naturally alongside the query and answer fields.

from typing import Any, NotRequired, TypedDict
from moss import SearchResult

class MossGraphState(TypedDict):
    query: str
    metadata_filter: NotRequired[dict[str, Any] | None]
    top_k: NotRequired[int]
    retrieval_results: NotRequired[SearchResult]
    retrieval_context: NotRequired[str]
    answer: NotRequired[str]
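
To make the shared-state model concrete, here is a minimal sketch of how a partial update returned by a node ends up in the state. It assumes LangGraph's default behaviour for keys without a reducer, where the returned value replaces the previous one. `SketchState` is a simplified stand-in for `MossGraphState`, and `apply_update` is a hypothetical helper that imitates what the graph runtime does, so the snippet runs without LangGraph or Moss installed.

```python
from typing import Any, TypedDict


class SketchState(TypedDict, total=False):
    # Simplified stand-in for MossGraphState; only `query` is set by the caller.
    query: str
    top_k: int
    retrieval_context: str
    answer: str


def apply_update(state: SketchState, update: dict[str, Any]) -> SketchState:
    """Hypothetical helper: merge a node's partial update into the state.

    Mirrors the default rule for keys without a reducer: the new value
    replaces the old one, and untouched keys are carried over.
    """
    merged = dict(state)
    merged.update(update)
    return merged  # type: ignore[return-value]


state: SketchState = {"query": "What is the refund policy?"}

# Optional keys are read with .get() and a default, as a node would do.
top_k = state.get("top_k", 4)

# Each node returns only the keys it wants to write.
state = apply_update(state, {"retrieval_context": "[1] score=0.912\nRefunds within 30 days."})
state = apply_update(state, {"answer": "Refunds are accepted within 30 days."})
```

Because every key except `query` is optional, the caller only supplies the question and each node fills in its own fields.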

Build the retrieve → generate graph

The retrieve node queries Moss and writes results to state. The generate node reads that context and produces the final answer.

from langgraph.graph import END, START, StateGraph
from moss import MossClient, QueryOptions

def build_moss_graph(client: MossClient, index_name: str, llm):
    async def retrieve(state: MossGraphState) -> dict:
        result = await client.query(
            index_name,
            state["query"],
            QueryOptions(
                top_k=state.get("top_k", 4),
                filter=state.get("metadata_filter"),
            ),
        )
        context = "\n\n".join(
            f"[{i+1}] score={doc.score:.3f}\n{doc.text}"
            for i, doc in enumerate(result.docs)
        )
        return {
            "retrieval_results": result,
            "retrieval_context": context,
        }

    async def generate(state: MossGraphState) -> dict:
        response = await llm.ainvoke([
            (
                "system",
                "Answer only from the Moss context below. "
                "If the context is insufficient, say so clearly.",
            ),
            (
                "human",
                f"Question:\n{state['query']}\n\n"
                f"Context:\n{state.get('retrieval_context', 'None')}",
            ),
        ])
        return {"answer": response.content}

    graph = StateGraph(MossGraphState)
    graph.add_node("retrieve", retrieve)
    graph.add_node("generate", generate)
    graph.add_edge(START, "retrieve")
    graph.add_edge("retrieve", "generate")
    graph.add_edge("generate", END)
    return graph.compile()
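
The context string built in the retrieve node can be checked in isolation. The sketch below pulls the same formatting into a standalone function and feeds it stub documents. `FakeDoc` and `format_context` are hypothetical names; the only assumption about Moss is the one the node above already makes, that each result exposes `score` and `text`.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    # Stub with the two attributes the retrieve node reads from each result.
    score: float
    text: str


def format_context(docs: list[FakeDoc]) -> str:
    """Number each document and prefix it with its score, as retrieve() does."""
    return "\n\n".join(
        f"[{i + 1}] score={doc.score:.3f}\n{doc.text}"
        for i, doc in enumerate(docs)
    )


docs = [
    FakeDoc(score=0.9123, text="Refunds are accepted within 30 days."),
    FakeDoc(score=0.75, text="Shipping fees are non-refundable."),
]
context = format_context(docs)
print(context)
```

An empty result list produces an empty string. Since the retrieve node always writes `retrieval_context`, the `'None'` default in the generate node applies only when that key is absent from state, not when retrieval returns nothing.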

Load the index and run

Call load_index() before the graph runs. This pulls the index into local memory and keeps retrieval on the ~1–10ms in-memory path instead of the cloud fallback (~100–500ms). Metadata filters also require a locally loaded index to work correctly.

import asyncio
import os

from dotenv import load_dotenv
from langchain_groq import ChatGroq
from moss import MossClient

async def main():
    # Read the credentials from the .env file created earlier
    load_dotenv()
    index_name = os.environ["MOSS_INDEX_NAME"]
    client = MossClient(
        os.environ["MOSS_PROJECT_ID"],
        os.environ["MOSS_PROJECT_KEY"],
    )

    # Load once before the graph starts
    await client.load_index(index_name)

    llm = ChatGroq(
        model=os.environ.get("GROQ_MODEL", "llama-3.3-70b-versatile"),
        api_key=os.environ["GROQ_API_KEY"],
        temperature=0,
    )
    graph = build_moss_graph(client, index_name, llm)

    result = await graph.ainvoke({"query": "What is the refund policy?"})
    print(result["answer"])

asyncio.run(main())

You can pass an optional metadata_filter through graph state to scope retrieval to a specific category:

result = await graph.ainvoke({
    "query": "What is the refund policy?",
    "metadata_filter": {"field": "category", "condition": {"$eq": "returns"}},
})
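
When the filter and `top_k` are optional, it helps to leave unset keys out of the input entirely, so that the `state.get("top_k", 4)` default in the retrieve node still applies (an explicit `None` would be passed through instead). The helper below is a hypothetical sketch of that idea. It reuses the filter shape from the example above and does not assume any operator beyond `$eq`.

```python
from typing import Any, Optional


def build_graph_input(
    query: str,
    category: Optional[str] = None,
    top_k: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical helper: assemble the input dict for graph.ainvoke()."""
    payload: dict[str, Any] = {"query": query}
    if category is not None:
        # Same filter shape as the example above; only `$eq` is shown in this guide.
        payload["metadata_filter"] = {
            "field": "category",
            "condition": {"$eq": category},
        }
    if top_k is not None:
        # Omit the key when unset so state.get("top_k", 4) falls back to 4.
        payload["top_k"] = top_k
    return payload


scoped = build_graph_input("What is the refund policy?", category="returns", top_k=2)
unscoped = build_graph_input("What is the refund policy?")
```

Either dict can be passed directly to `graph.ainvoke(...)`.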

How it works

User question
      │
      ▼
 retrieve node  ──▶  client.query()  ──▶  Moss index (local, ~1–10ms)
      │
      │  writes retrieval_results + retrieval_context to state
      ▼
 generate node  ──▶  LLM (Groq)  ──▶  grounded answer
      │
      ▼
   Answer

The index is loaded into local memory once at startup. Every subsequent query() call inside the graph hits the in-memory path, so even graphs with many retrieval steps stay fast.
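
If several graph runs can start concurrently, the "load once" rule can be enforced in code rather than by convention. The following is a sketch under stated assumptions: `LoadOnce` is a hypothetical guard, and `StubClient` stands in for `MossClient`, assuming only that `load_index()` is an awaitable taking the index name, as used earlier in this guide.

```python
import asyncio


class StubClient:
    """Stand-in for MossClient that only counts load_index() calls."""

    def __init__(self) -> None:
        self.loads = 0

    async def load_index(self, index_name: str) -> None:
        await asyncio.sleep(0)  # yield control, as a real network call would
        self.loads += 1


class LoadOnce:
    """Hypothetical guard: runs load_index() at most once per index name."""

    def __init__(self, client) -> None:
        self._client = client
        self._loaded: set[str] = set()
        self._lock = asyncio.Lock()

    async def ensure(self, index_name: str) -> None:
        # The lock serializes callers, so only the first one triggers a load.
        async with self._lock:
            if index_name not in self._loaded:
                await self._client.load_index(index_name)
                self._loaded.add(index_name)


async def demo() -> int:
    client = StubClient()
    guard = LoadOnce(client)
    # Five concurrent callers, e.g. five graph runs starting together.
    await asyncio.gather(*(guard.ensure("your-index-name") for _ in range(5)))
    return client.loads


load_count = asyncio.run(demo())
print(load_count)
```

Calling `await guard.ensure(index_name)` at the top of the retrieve node would then be cheap after the first run, since later calls only check a set under the lock.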
