In a live conversation, retrieval, not the language model, is usually the dominant source of user-facing latency. Humans read a pause beyond roughly 300 ms as unnatural, and beyond ~500 ms as confusion or disengagement. Yet a single agent turn already spends its budget on speech recognition, inference, and synthesis; a slow retrieval call on top pushes the response past the point where it still feels live.

During a call, an agent needs two kinds of context at once:
  • Long-term context - durable knowledge and account facts the agent was built with (FAQs, policies, profile). You load this once at the start of the call.
  • Short-term context - the working set of what the customer just said, this call, this minute. You accumulate this in a session as the call unfolds.
Moss serves both from the same sub-10 ms local runtime, so neither one spends your turn’s latency budget on a network round trip.
Sessions are available in the Python, Swift, Elixir, and C SDKs today. JavaScript (Node) session support is coming - use Python or Swift for live-call context for now.
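To make the budget arithmetic concrete, here is a minimal sketch that sums per-stage latencies for one turn and compares the total with the ~300 ms and ~500 ms thresholds mentioned above. The stage timings and the 200 ms remote retrieval figure are illustrative assumptions, not measurements from Moss or any particular stack; only the thresholds and the sub-10 ms local figure come from this page.

```python
# Thresholds from the text above; all stage timings below are hypothetical.
NATURAL_PAUSE_MS = 300
DISENGAGED_PAUSE_MS = 500

def classify_turn(stage_ms: dict[str, float]) -> tuple[float, str]:
    """Sum the per-stage latencies and label the resulting pause."""
    total = sum(stage_ms.values())
    if total <= NATURAL_PAUSE_MS:
        label = "feels live"
    elif total <= DISENGAGED_PAUSE_MS:
        label = "unnatural pause"
    else:
        label = "reads as confusion or disengagement"
    return total, label

# Illustrative numbers only: speech recognition, inference, synthesis.
base_turn = {"speech_recognition": 80, "inference": 150, "synthesis": 60}

# The same turn with a hypothetical 200 ms remote retrieval call
# versus a 10 ms local lookup.
remote = classify_turn({**base_turn, "retrieval": 200})
local = classify_turn({**base_turn, "retrieval": 10})

print(f"remote retrieval: {remote[0]:.0f} ms -> {remote[1]}")
print(f"local retrieval:  {local[0]:.0f} ms -> {local[1]}")
```

With these made-up numbers the other stages already consume most of the budget, so the retrieval call alone decides which side of the 300 ms line the turn lands on.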

Figure: A single agent turn.

How it works

1. Load long-term context from the cloud
   Load your persistent knowledge index into memory so it’s ready for instant queries.

2. Open a session for the call
   client.session(call_id) returns a local SessionIndex. If an index with that name already exists in the cloud, it auto-loads; otherwise it starts empty.

3. Index transcript turns as they arrive
   Each add_docs call embeds and indexes locally in ~1-5 ms - fast enough to run inline during the conversation.

4. Query both indexes for the next agent turn
   Pull relevant long-term knowledge and recent short-term context together to ground the response.

5. Persist the session at call end
   session.push_index() saves the call’s context to the cloud so the next interaction can resume it.

Example

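import asyncio
import os
from datetime import datetime
from moss import DocumentInfo, MossClient, QueryOptions

# Credentials are read from the environment (the variable names are an assumption).
MOSS_PROJECT_ID = os.environ["MOSS_PROJECT_ID"]
MOSS_PROJECT_KEY = os.environ["MOSS_PROJECT_KEY"]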

async def main():
    client = MossClient(MOSS_PROJECT_ID, MOSS_PROJECT_KEY)

    # 1. Long-term context: load a persistent knowledge index from the cloud.
    await client.load_index("support-faqs")

    # 2. Short-term context: open a session for this call.
    call_id = f"call-{datetime.now():%Y%m%d-%H%M%S}"
    session = await client.session(index_name=call_id)

    # 3. Index transcript turns locally as the call unfolds (~1-5ms each).
    await session.add_docs([
        DocumentInfo(id="turn-1", text="Customer was billed twice for the same renewal."),
        DocumentInfo(id="turn-2", text="Customer requested a refund for the duplicate $49.99 charge."),
    ])

    # 4. Ground the next agent turn in BOTH long-term and short-term context.
    knowledge = await client.query("support-faqs", "duplicate charge refund policy", QueryOptions(top_k=3))
    recent = await session.query("refund request", QueryOptions(top_k=3))

    for doc in recent.docs:
        print(f"[session] {doc.id} score={doc.score:.3f} {doc.text}")
    for doc in knowledge.docs:
        print(f"[faqs]    {doc.id} score={doc.score:.3f} {doc.text}")

    # 5. Persist the call's context at the end.
    result = await session.push_index()
    print(f"Saved {result.doc_count} turns to cloud index {result.index_name!r}")

asyncio.run(main())
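The example above indexes two turns in a single batch. On a live call, turns arrive one at a time, so a loop along these lines may be closer to what step 3 describes. This is a sketch under assumptions: `TurnDoc` and `RecordingSession` are hypothetical stand-ins so the snippet runs without the SDK. With Moss you would build `DocumentInfo(id=..., text=...)` objects and pass the real session, relying only on the `add_docs` call shown above.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class TurnDoc:
    """Hypothetical stand-in for moss.DocumentInfo (id and text only)."""
    id: str
    text: str

@dataclass
class RecordingSession:
    """Hypothetical stand-in for a session; it only records what it is given."""
    docs: list = field(default_factory=list)

    async def add_docs(self, docs):
        self.docs.extend(docs)

async def index_turns(session, turns):
    """Index each non-empty transcript turn as it arrives, with sequential ids."""
    count = 0
    async for text in turns:
        text = text.strip()
        if not text:
            continue  # skip silence or empty recognition results
        count += 1
        await session.add_docs([TurnDoc(id=f"turn-{count}", text=text)])
    return count

async def demo_turns():
    """Simulated transcript stream; a real one would come from speech recognition."""
    for text in ["I was billed twice for my renewal.", "  ", "I'd like a refund."]:
        yield text

session = RecordingSession()
indexed = asyncio.run(index_turns(session, demo_turns()))
print(indexed, [d.id for d in session.docs])
```

Indexing one turn per call keeps the session queryable after every utterance, so the next agent turn can already see what the customer just said.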

Two kinds of context

|          | Short-term context | Long-term context |
|----------|--------------------|-------------------|
| What     | What was just said this call - working insights, the live transcript | Durable knowledge and account facts: FAQs, policies, profile, history |
| Where    | A local session, built turn by turn | A persistent cloud index, loaded once |
| Lifetime | The current interaction (optionally persisted at the end) | Across every interaction |

Querying both on each turn lets the agent answer from durable knowledge while staying grounded in the live conversation, instead of treating every turn as isolated.
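One way to combine the two result sets is to format them into a single grounding block for the model prompt. The sketch below assumes result documents expose `id`, `score`, and `text`, as in the example's print loop. `Hit` is a hypothetical stand-in for those objects, the FAQ sentence is made-up sample data, and the section labels and `min_score` cutoff are illustrative choices, not part of Moss.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical stand-in for a query result document."""
    id: str
    score: float
    text: str

def build_grounding(knowledge, recent, min_score=0.0):
    """Format long-term and short-term hits into one labeled text block."""
    def section(title, hits):
        # Keep hits at or above the cutoff, highest score first.
        kept = sorted((h for h in hits if h.score >= min_score),
                      key=lambda h: h.score, reverse=True)
        lines = [f"- ({h.id}) {h.text}" for h in kept]
        return [title, *lines] if lines else []

    parts = section("Known policy and account facts:", knowledge)
    parts += section("Said earlier on this call:", recent)
    return "\n".join(parts)

# Made-up sample data for illustration.
knowledge = [Hit("faq-12", 0.81, "Duplicate charges are refunded within 5 business days.")]
recent = [
    Hit("turn-1", 0.42, "Customer was billed twice for the same renewal."),
    Hit("turn-2", 0.77, "Customer requested a refund for the duplicate $49.99 charge."),
]
print(build_grounding(knowledge, recent, min_score=0.5))
```

Keeping the two sources under separate labels lets the model tell durable policy apart from what the customer said a moment ago.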

Result

Every agent turn is grounded in durable knowledge and the live conversation, with no network latency on the hot path. At call end, the session is persisted to the cloud - the next call can resume it (see Cross-agent context & omni-channel handoff).
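Since a session auto-loads only when an index with that name already exists, resuming presumably requires opening the session under the same name on the next call; the timestamp-based `call_id` in the example would produce a fresh name each time. A minimal sketch of deriving a stable name from a customer identifier follows. The `session_name` helper, the `customer` prefix, and the restriction to lowercase letters, digits, and hyphens are all assumptions for illustration; this page does not state which characters index names allow.

```python
import re

def session_name(customer_id: str, prefix: str = "customer") -> str:
    """Derive a stable, lowercase session name from a customer identifier."""
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", customer_id.strip().lower()).strip("-")
    if not slug:
        raise ValueError("customer_id must contain at least one letter or digit")
    return f"{prefix}-{slug}"

print(session_name("ACME Corp / 00123"))
```

The resulting string could then be passed as `index_name` when opening the session, so that a later call for the same customer opens the same name.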

Further reading

  • Sessions - The session lifecycle in depth.
  • Real-time local indexing - Why local sessions are sub-10 ms.