
Why local

Embedding on-device keeps data on the machine and takes the network off the hot path. It is also faster: a local quantized model embeds a short query in around 3 ms, versus roughly 22 ms for a separate network embedding call. For real-time work, that difference of roughly 19 ms, paid on every query, is the gap between a response that feels instant and one that lags.

How it works

Setup

Pick the on-device model at index creation: moss-minilm (fast, lightweight) or moss-mediumlm (higher accuracy). Moss embeds your documents on-device with the model you choose. If you’d rather supply precomputed vectors from your own pipeline, see Custom embeddings.
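import { MossClient } from '@moss-dev/moss'
// Credentials are read from the environment.
const client = new MossClient(process.env.MOSS_PROJECT_ID!, process.env.MOSS_PROJECT_KEY!)
// `docs` is your array of documents, defined elsewhere. The modelId selects the on-device model.
await client.createIndex('local-embeddings', docs, { modelId: 'moss-minilm' })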
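
If the choice between the two models depends on the deployment, it can help to make the trade-off explicit in one place. The sketch below is only an illustration: `chooseModel` and the `MossModelId` type are hypothetical helpers, not part of the Moss SDK. The only things taken from this page are the two model names and the `modelId` option.

```typescript
// Hypothetical helper, not part of the Moss SDK: maps a stated priority
// to one of the two on-device model names listed above.
type MossModelId = 'moss-minilm' | 'moss-mediumlm'

function chooseModel(priority: 'speed' | 'accuracy'): MossModelId {
  // moss-mediumlm is described as higher accuracy, moss-minilm as fast and lightweight.
  return priority === 'accuracy' ? 'moss-mediumlm' : 'moss-minilm'
}

// The result would be passed as the options argument to createIndex.
const indexOptions = { modelId: chooseModel('speed') }
```

Typing the model name as a union means a misspelled model ID fails at compile time rather than at index creation.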

Tips

  • Batch inputs for speed
  • Cache vectors for unchanged content
  • Use hybrid retrieval for best relevance
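
The first two tips can be combined in a small wrapper around whatever embedding call you use. The following is a minimal sketch, not Moss's implementation: `VectorCache` and the `EmbedBatch` function type are hypothetical names, and the embedder is assumed to take an array of texts and return one vector per text. Content is keyed by its SHA-256 hash, so unchanged text is never embedded twice, and anything new is sent in batches.

```typescript
import { createHash } from 'node:crypto'

type Vector = number[]
// Assumed embedder shape: a batch of texts in, one vector per text out.
type EmbedBatch = (texts: string[]) => Promise<Vector[]>

// Identical content always produces the same key.
const contentKey = (text: string): string =>
  createHash('sha256').update(text, 'utf8').digest('hex')

class VectorCache {
  private readonly vectors = new Map<string, Vector>()
  private readonly embedBatch: EmbedBatch
  private readonly batchSize: number

  constructor(embedBatch: EmbedBatch, batchSize = 32) {
    this.embedBatch = embedBatch
    this.batchSize = batchSize
  }

  async embedAll(texts: string[]): Promise<Vector[]> {
    // Collect content that has no cached vector yet, deduplicated by hash.
    const missing = new Map<string, string>()
    for (const text of texts) {
      const key = contentKey(text)
      if (!this.vectors.has(key)) missing.set(key, text)
    }

    // Embed the uncached content in batches of at most batchSize.
    const entries = [...missing.entries()]
    for (let i = 0; i < entries.length; i += this.batchSize) {
      const chunk = entries.slice(i, i + this.batchSize)
      const vectors = await this.embedBatch(chunk.map(([, text]) => text))
      chunk.forEach(([key], j) => this.vectors.set(key, vectors[j]))
    }

    // Return vectors in the same order as the input texts.
    return texts.map((text) => this.vectors.get(contentKey(text))!)
  }
}
```

This cache lives in memory and never evicts entries; a long-running process would likely want a size limit or a persistent store.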

Custom embeddings

Bring your own precomputed vectors.

Sub-10ms knowledge retrieval

The local retrieval pipeline.