Skip to main content
Most retrieval systems treat search as a service you query over the network: every lookup pays for a round trip, TLS, serialization, and a remote vector scan. For a search box that’s fine. For a real-time agent, that round trip is the latency you can feel. Moss reframes retrieval as a function you call, not a service you query. With a session, embedding and search run in-process on the device, so a lookup is a local function call rather than a network request. In practice that moves median retrieval from roughly 67 ms to about 5 ms, and tail latency (p99) from roughly 222 ms to about 13.5 ms.
Sessions are available in the Python, Swift, Elixir, and C SDKs today. JavaScript (Node) session support is coming.

How a session works

A SessionIndex is a local index you open with client.session(name). Every operation runs without a cloud round trip:
  • add_docs - embeds and indexes locally (~1-5 ms per batch).
  • query - semantic search entirely in-memory (~1-10 ms).
  • push_index - optionally persists the index to the cloud when you’re done.
If a cloud index with the session’s name already exists, session() auto-loads it (no re-embedding); otherwise the session starts empty. The workflow is identical either way.

Example

import asyncio
from moss import DocumentInfo, MossClient, QueryOptions

async def main():
    client = MossClient(MOSS_PROJECT_ID, MOSS_PROJECT_KEY)

    # Open a local session (starts empty, or auto-loads an existing cloud index by name).
    session = await client.session(index_name="notes")

    # Index locally - no network.
    added, updated = await session.add_docs([
        DocumentInfo(id="1", text="Ship the on-device SDK by Friday."),
        DocumentInfo(id="2", text="Follow up with the LiveKit team about latency."),
    ])
    print(f"{added} added, {updated} updated, {session.doc_count} total")

    # Query locally - sub-10ms, no network.
    results = await session.query("what's due this week", QueryOptions(top_k=3))
    for doc in results.docs:
        print(f"{doc.id} score={doc.score:.3f} {doc.text}")

asyncio.run(main())

Why local is faster

Moving retrieval in-process removes the parts of a query that never show up in a vector database’s own benchmark but dominate real-world latency:
  • Network round trips - even same-region calls cost ~10-50 ms; cross-region or edge deployments cost far more. In-process, that drops to microseconds.
  • Serialization and connection overhead - no request encoding, TLS, or connection-pool management on the hot path.
  • Embedding - the query is embedded by a local quantized model in ~3 ms instead of a separate ~22 ms network embedding step.
Think of a local index like a CDN for your data: move the data to where it’s consumed rather than sending every query to where the data lives. The workload that suits this is small, fast, repeated, and latency-critical - utterances of a few dozen tokens against a knowledge base of roughly 10K-50K chunks per agent.

When to use a session vs. a loaded cloud index

Use a session when…Use a loaded cloud index when…
You’re indexing data that’s being created right now (a live call, a chat, a working set)You’re querying a large, stable corpus built ahead of time
You need writes and reads in the same millisecond budgetReads dominate and the data changes infrequently
Context is per-conversation or per-user and short-livedKnowledge is shared across many sessions
You can combine both - see Live-call context.

Further reading

Sessions guide

The full session lifecycle.

Live-call context

Short-term + long-term context together.