Building DocuMind: a production RAG case study

The first version of DocuMind worked. It also embarrassed me.

A user could upload a PDF, ask a question, and get an answer back. On paper that's a RAG app. In practice it took twelve seconds to respond, hallucinated half its citations, and re-embedded the same document if you re-uploaded it. The code path for "answer a question" touched eight files and three services. I had built the demo, not the product.

This post is about what changed between v1 and the version that's live now. It's not a tutorial. It's the seven decisions I had to make twice — once badly, once on purpose — and what I'd tell myself if I were starting today.

What DocuMind is

DocuMind is a document Q&A SaaS. You sign up, upload a PDF, and ask questions in natural language. The answer streams back token by token, with the source chunks it came from shown alongside. It's live at docu-mind-neon.vercel.app on a free plan that allows two documents and ten questions per rolling seven days. The frontend is React 19 on Create React App; the backend is Express 5 on Render; state lives in MongoDB Atlas; vectors live in Pinecone; the LLM is OpenAI's gpt-4o-mini with text-embedding-3-small for embeddings. The code is at github.com/HAONANTAO/DocuMind.

Architecture at a glance

DocuMind has two pipelines you can hold in your head. The first runs once per upload. The second runs every time someone asks a question. Most of the work between v1 and v2 was rebalancing what happens in each.

PDF (multer)  ->  pdf-parse  ->  RecursiveCharacterTextSplitter
                                   |  (1,000 chars, 200 overlap)
                                   v
                                chunks  ->  text-embedding-3-small
                                              |  (1,536-dim vectors)
                                              v
                                       Pinecone  (namespace: user_<userId>,
                                                 metadata: { documentId })
                                              |
                                              v
                                MongoDB: Document.status = 'ready', chunkCount = N

Indexing is the expensive, off-the-critical-path work. Query is what the user actually waits on. Treating them differently was the single biggest change between v1 and v2 — most of v1's slowness came from doing too much at query time.

question  ->  Pinecone.similaritySearch(top-k = 5,
                                        namespace: user_<userId>,
                                        filter: { documentId })
                  |
                  v
              top 5 chunks
                  |
                  v
       prompt = system  +  last 6 messages  +  context (5 chunks)  +  question
                  |
                  v
              gpt-4o-mini (streaming)
                  |
                  v
              SSE: data: { token }   x N
                  |
                  v
              SSE: data: { done: true, sources: [...] }
                  |
                  v
              Mongo: Conversation.messages.push({user}, {assistant})

The v1 pipeline had the same shape, but every step blocked on the one before it and the server sent nothing until the full completion was ready. That's where the twelve seconds came from.

The seven decisions that mattered

1. Chunk size: 1,000 characters with 200 overlap

I picked 1,000 characters per chunk, 200 characters of overlap, and top-5 retrieval. The sweet spot is roughly one paragraph-sized idea per chunk. A thousand characters is around 250 tokens of English text: small enough that the embedding still captures specifics, big enough to be a complete thought. The 200-character overlap means that a sentence of up to 200 characters cut by the splitter still survives whole in at least one of the two neighbouring chunks, so retrieval almost never misses across that boundary.

The other side: smaller chunks (256 chars) make retrieval noisier. You fetch five fragments and the model has to reconstruct the context, which it sometimes does badly. Citations become useless because each cited passage is too short to read on its own. Larger chunks (4,000 chars) blow the prompt budget. Top-5 by 4,000 chars is twenty thousand characters of context, and you're paying for a lot of text that's mostly off-topic.
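To make the window arithmetic concrete, here is a minimal sketch of a fixed-size sliding-window chunker plus the context-budget calculation. It is a simplification, not the RecursiveCharacterTextSplitter itself (which also prefers to break on paragraph and sentence separators), and the function names and the four-characters-per-token ratio are my assumptions for illustration.

```javascript
// Fixed-size sliding-window chunker (illustrative; NOT the recursive splitter).
function chunkText(text, size = 1000, overlap = 200) {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const step = size - overlap; // each chunk starts 800 chars after the previous one
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this chunk already reached the end
  }
  return chunks;
}

// Rough prompt budget for retrieved context.
// Assumption: about 4 characters per token for English text.
function contextBudget(chunkSize, topK = 5, charsPerToken = 4) {
  const chars = chunkSize * topK;
  return { chars, approxTokens: Math.round(chars / charsPerToken) };
}

const sample = "x".repeat(2600);
const sampleChunks = chunkText(sample);
console.log(sampleChunks.length, sampleChunks.map((c) => c.length));
console.log(contextBudget(1000), contextBudget(4000));
```

With these settings each chunk repeats the last 200 characters of its predecessor, and moving from 1,000 to 4,000 characters per chunk quadruples the context the model has to read on every question.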

2. Multi-tenancy: Pinecone namespace plus metadata filter

Every user's vectors land in namespace: user_<userId>. Every chunk also carries metadata: { documentId }, and retrieval filters on both. Two layers, because security mistakes happen at the application boundary, not the infrastructure boundary. The namespace is a hard wall — a wrong query in code physically can't see another tenant's data. The metadata filter narrows a single user's search to one document at a time.

// backend/src/config/retriever.js
const retrieveChunks = async (question, documentId, userId) => {
  const pineconeIndex = getPineconeIndex();
  const vectorStore = await PineconeStore.fromExistingIndex(embeddings, {
    pineconeIndex,
    namespace: `user_${userId}`,
    filter: { documentId: documentId.toString() },
  });
  return vectorStore.similaritySearch(question, 5);
};

With only the namespace, one user querying document A would still retrieve hits from document B inside their own data. Quietly broken UX, embarrassing in a screenshot. With only the metadata filter, a single forgotten clause leaks one user's data into another user's response. Catastrophic and silent. Defence in depth on a small project costs ten extra characters per query and saves you from a bug you didn't write.
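One way to make both layers hard to forget is to build them in a single place and fail closed when either identifier is missing. The helper below is a hypothetical sketch (the name tenantScope is mine, not something from the repo); it only produces the namespace and filter values that the retriever already passes to PineconeStore.fromExistingIndex.

```javascript
// Hypothetical helper: build the Pinecone scope for one user and one document.
// Throws instead of returning a partial scope, so a missing id can never
// silently widen the search.
function tenantScope(userId, documentId) {
  if (!userId) throw new Error("tenantScope: userId is required");
  if (!documentId) throw new Error("tenantScope: documentId is required");
  return {
    namespace: `user_${userId}`,
    filter: { documentId: String(documentId) },
  };
}

// Possible usage inside the retriever (sketch only):
// PineconeStore.fromExistingIndex(embeddings, {
//   pineconeIndex,
//   ...tenantScope(userId, documentId),
// });
console.log(tenantScope("abc123", "doc42"));
```

The design choice is that an undefined userId throws rather than producing the namespace "user_undefined", which would otherwise be a valid but wrong place to search.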

A small operational note that fits here: PineconeStore.fromExistingIndex above is a thin LangChain wrapper that does what I want. Its .delete() method, by contrast, accepts only { ids } or { deleteAll } — not { filter }. So the route that deletes a document's vectors drops into the Pinecone SDK directly:

// backend/src/routes/documents.js — DELETE /:id
await pineconeIndex
  .namespace(`user_${req.userId}`)
  .deleteMany({ documentId: document._id.toString() });

Same index, two different abstractions. I take the LangChain wrapper where it removes boilerplate and drop into the native SDK when the abstraction hides the wrong thing.

3. SSE over WebSocket for streaming

Streaming chat is one-way: server to client. SSE is plain chunked HTTP, so every proxy, CDN, and load balancer already handles it. Auth flows through the existing Authorization header. Reconnection is a second fetch(). The frontend uses native fetch with body.getReader() rather than the built-in EventSource, because EventSource doesn't support custom headers and the API is JWT-protected.

// backend/src/routes/chat.js
res.setHeader("Content-Type", "text/event-stream");
res.setHeader("Cache-Control", "no-cache");
res.setHeader("Connection", "keep-alive");

let fullAnswer = "";
const result = await ragQuery(question, documentId, req.userId, chatHistory, (token) => {
  res.write(`data: ${JSON.stringify({ token })}\n\n`);
  fullAnswer += token;
});
res.write(`data: ${JSON.stringify({ done: true, sources: result.sources })}\n\n`);
res.end();

WebSocket is a bidirectional pipe and charges you for the duplex you don't need: reconnection logic, heartbeat frames, binary framing, auth-on-connect. None of that buys a feature for a chat interface. The week you spend debugging "why does the connection drop in production every sixty seconds" is a week you weren't shipping.

The bigger gain wasn't the protocol; it was sending the first byte sooner. v1 waited for the full completion before responding. v2 starts streaming the moment the model emits its first token. Total time is the same. Time-to-first-token dropped from twelve seconds to roughly six hundred milliseconds. The user calls it "fast" now, even though end-to-end it isn't.
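On the client side, reading the stream with fetch and body.getReader() means parsing the "data: ..." frames by hand, because network chunks do not line up with frame boundaries. The following is a sketch of how that could look, not the frontend's actual code: createSseParser and streamAnswer are names I made up, and the URL and request body are whatever the caller supplies. It assumes the server separates frames with "\n\n", as the route above does.

```javascript
// Incremental parser for the "data: <json>\n\n" frames the server writes.
function createSseParser(onEvent) {
  let buffer = "";
  return function feed(text) {
    buffer += text;
    const frames = buffer.split("\n\n");
    buffer = frames.pop(); // the last piece may be an incomplete frame; keep it
    for (const frame of frames) {
      const data = frame
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).trimStart())
        .join("\n");
      if (data) onEvent(JSON.parse(data));
    }
  };
}

// Hypothetical client call: url, jwt, and body are supplied by the caller.
async function streamAnswer(url, jwt, body, onToken) {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${jwt}` },
    body: JSON.stringify(body),
  });
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);

  let sources = [];
  const feed = createSseParser((event) => {
    if (event.done) sources = event.sources;
    else if (event.token !== undefined) onToken(event.token);
  });

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters split across chunks intact
    feed(decoder.decode(value, { stream: true }));
  }
  return sources;
}
```

The buffer is the important part: a token frame can arrive split across two reads, and parsing each read on its own would throw on the half-finished JSON.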

4. Async indexing: 201 immediately, embed in the background

A 50-page PDF takes around thirty seconds to embed. If POST /documents blocks on that, the user gets a spinner with no progress, the browser times out, and a frustrated refresh re-uploads the same file. Instead, the route writes the upload to disk, creates a Mongo document with status: 'uploading', returns 201, and an IIFE on the same Node process runs the pipeline asynchronously, updating the document to 'processing' then 'ready' (or 'error').

// backend/src/routes/documents.js — POST /upload
res.status(201).json({ message: "File uploaded, processing started", document });

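(async () => {
  try {
    await Document.findByIdAndUpdate(document._id, { status: "processing" });
    const chunkCount = await processDocument(req.file.buffer, document._id, req.userId);
    await Document.findByIdAndUpdate(document._id, { status: "ready", chunkCount });
  } catch (err) {
    // Record the failure. If even this write fails, log it instead of
    // leaving an unhandled rejection on the API process.
    await Document.findByIdAndUpdate(document._id, {
      status: "error",
      errorMessage: err.message,
    }).catch((updateErr) => console.error("Failed to mark document as error:", updateErr));
  }
})();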

The catch is that an IIFE on the API process is not a real job queue. If the server crashes mid-embed, the document is stuck at 'processing' forever. The current mitigation is a boot-time janitor that flips any stuck row to 'error' on every server start:

// backend/src/index.js
const { modifiedCount } = await Document.updateMany(
  { status: { $in: ["uploading", "processing"] } },
  { status: "error", errorMessage: "Server restarted during processing — please re-upload." }
);
if (modifiedCount > 0) console.log(`Marked ${modifiedCount} stuck document(s) as error`);

This is honest, not clever. A real fix is BullMQ on Redis with retries and dead-lettering. It's on the roadmap. It's not the highest priority while the API process is rarely restarted and the boot-time sweep catches the failure mode that actually happens.
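One caveat with a sweep that runs on every start: it assumes a single API process. If a second instance boots while another is mid-embed, the sweep would flag work that is still in flight. A cheap hedge, short of a real queue, is to sweep only documents that have been stuck for longer than an embed could plausibly take. The sketch below builds such a filter; it assumes the Document schema has Mongoose timestamps enabled (an updatedAt field), which I have not confirmed, and the ten-minute default is an arbitrary choice.

```javascript
// Statuses that mean "work was in progress" (taken from the pipeline above).
const STUCK_STATUSES = ["uploading", "processing"];

// Build a filter for documents stuck in a non-terminal status for too long.
// Assumption: the schema uses timestamps, so `updatedAt` exists.
function stuckDocumentFilter(now = new Date(), maxAgeMs = 10 * 60 * 1000) {
  return {
    status: { $in: STUCK_STATUSES },
    updatedAt: { $lt: new Date(now.getTime() - maxAgeMs) },
  };
}

// Hypothetical usage, at boot and then on an interval:
// await Document.updateMany(stuckDocumentFilter(), {
//   status: "error",
//   errorMessage: "Processing timed out. Please re-upload.",
// });
console.log(stuckDocumentFilter(new Date("2025-01-01T00:10:00Z")));
```

This does not replace retries or dead-lettering; it only narrows the sweep so it cannot race with healthy work.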

5. Model choice: gpt-4o-mini, not gpt-4

I started on gpt-4. On retrieval-grounded document Q&A, the model's job is reading comprehension over passages I just handed it — not first-principles reasoning. gpt-4o-mini matches answer quality on this task at roughly ten times lower cost and around twice the speed. I run temperature: 0 because reproducibility on the eval set matters more than creativity in this product.

The "use the big model" instinct is wrong if you haven't measured. With gpt-4 my eval set didn't move; I was paying ten times the price for outcomes I couldn't tell apart. The big model's reasoning ability is wasted when the answer is sitting in the retrieved chunks already.

6. Conversation memory: the last six messages

Memory is conversation.messages.slice(-6) — three user-assistant pairs prepended to the prompt as chat history. Real follow-ups in this product reference one or two turns back ("what about clause 4.2?" → "is that different from the original draft?"). Six messages cover that natural depth without dragging stale context into the prompt.

With longer windows (the last twenty messages, say), prompt cost grows linearly and quality often goes down. The model anchors on early-context details that no longer matter and drifts off the current question. With shorter windows (the last two), the user has to repeat themselves every turn. They don't, and they leave.
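The prompt layout from the architecture diagram (system, last six messages, context, question) can be sketched as a small pure function. This is an illustration of the shape, not the real prompt: buildMessages, the "Context:" and "Question:" labels, and the numbered-chunk formatting are my assumptions.

```javascript
// How many history messages to carry forward (three user-assistant pairs).
const HISTORY_WINDOW = 6;

// Assemble chat messages: system + last 6 history messages + context + question.
function buildMessages({ system, history, chunks, question }) {
  // Number the chunks so the model can refer to them as [1], [2], ...
  const context = chunks.map((chunk, i) => `[${i + 1}] ${chunk}`).join("\n\n");
  return [
    { role: "system", content: system },
    // slice(-6) returns the whole history when there are fewer than 6 messages
    ...history.slice(-HISTORY_WINDOW).map(({ role, content }) => ({ role, content })),
    { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
  ];
}

const demo = buildMessages({
  system: "Answer only from the provided context.",
  history: [
    { role: "user", content: "What about clause 4.2?" },
    { role: "assistant", content: "Clause 4.2 covers termination." },
  ],
  chunks: ["First retrieved passage.", "Second retrieved passage."],
  question: "Is that different from the original draft?",
});
console.log(demo.length);
```

Keeping the window in one constant makes it a single-line change to test the twenty-message or two-message variants against an eval set.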

7. A compound unique index on Conversation

A chat session is per-(user, document). The first-message handler does findOne() and create() if missing. Without a unique index on { userId, documentId }, two requests racing on the very first message both pass findOne() (sees nothing), both create(), and the user ends up with two Conversation documents splitting their history.

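Mongoose makes this one line on the schema: conversationSchema.index({ userId: 1, documentId: 1 }, { unique: true }), where conversationSchema stands for whatever the Conversation schema is called (index() is a schema method, not a model method). The race is otherwise silent: you only notice when a user complains the bot "forgot what we just talked about." That's the kind of bug that hides for weeks until a user types fast enough to hit it. Better to make the database refuse the duplicate than rely on the application to prevent it. Once the index exists, the losing create() fails with a duplicate-key error (E11000), so the handler has to catch that and re-read the conversation the winner created.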
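With the unique index in place, the get-or-create step can lean on the database instead of a findOne() followed by create(). The sketch below is one way to do that, under the assumption that the handler is free to use an upsert; getOrCreateConversation is a hypothetical helper, and the model is passed in as a parameter so the function does not depend on how the repo exports it.

```javascript
// Get-or-create that stays correct under a unique { userId, documentId } index.
// `Conversation` is any Mongoose-style model supplied by the caller.
async function getOrCreateConversation(Conversation, userId, documentId) {
  const filter = { userId, documentId };
  const upsert = () =>
    Conversation.findOneAndUpdate(
      filter,
      { $setOnInsert: { messages: [] } }, // only applied when inserting
      { upsert: true, new: true } // return the document after the write
    );
  try {
    return await upsert();
  } catch (err) {
    // Two concurrent upserts can both attempt the insert; the loser gets
    // E11000. By then the winner's document exists, so one retry finds it.
    if (err && err.code === 11000) return upsert();
    throw err;
  }
}
```

Any error other than the duplicate-key case is rethrown, so real failures (a dropped connection, a validation error) still surface to the route's error handling.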

Security on a side-project budget

DocuMind is a side project with one developer. The security work has to be cheap, layered, and impossible to forget. Five things that earned their keep:

Boot-time CORS validation. In production, the server throws at startup if ALLOWED_ORIGIN is unset. A missing CORS env in production usually means default-allow-everything. Refusing to boot is safer than booting in an insecure default and shipping it to traffic.

Two-tier rate limiting, keyed appropriately. Auth routes (login and register) are 10 per 15 minutes, IP-keyed — the attacker is the IP. Chat is 30 per minute, userId-keyed — the attacker (or the runaway client script) already has a token, and multiple users behind one NAT shouldn't share a budget.

Anti-enumeration on login. Same error string ("Invalid email or password") for unknown email and wrong password. Without this, a probe can confirm which emails are registered without ever signing up.

zod validation on every state-changing route. A small validate(schema) middleware runs safeParse on req.body, returns 400 with the first error message, and replaces req.body with the parsed (normalised) data on success. Bad password rules, oversized questions, malformed documentIds all 400 before they touch business logic.

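// backend/src/lib/validators.js
const { z } = require("zod");

const chatSchema = z.object({
  documentId: z.string().min(1),
  question:   z.string().min(1).max(2000),
});

// Validate req.body against a schema: reply 400 with the first issue, or
// replace req.body with the parsed (normalised) data and continue.
function validate(schema) {
  return (req, res, next) => {
    const result = schema.safeParse(req.body);
    if (!result.success) {
      // .issues is the canonical list on ZodError (.errors is only a Zod 3 alias)
      return res.status(400).json({ message: result.error.issues[0].message });
    }
    req.body = result.data;
    next();
  };
}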

Tenant isolation defence-in-depth. Pinecone namespace and metadata filter, both layers, every query. The namespace is the hard wall; the metadata filter is the precision tool. If the application code is wrong, the namespace still saves you.
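The rate-limiting item above hinges on which key each limiter counts against. To show that choice in isolation, here is a small in-memory fixed-window limiter. It is an illustration only, not the middleware DocuMind runs: createLimiter is a name I made up, the map is never pruned, and state would not be shared across processes.

```javascript
// In-memory fixed-window limiter, for illustration only.
function createLimiter({ windowMs, max, keyFn }) {
  const windows = new Map(); // key -> { start, count }
  return function allow(req, now = Date.now()) {
    const key = keyFn(req);
    const current = windows.get(key);
    if (!current || now - current.start >= windowMs) {
      windows.set(key, { start: now, count: 1 }); // first hit opens a new window
      return true;
    }
    current.count += 1;
    return current.count <= max;
  };
}

// Auth: 10 per 15 minutes, keyed by IP (the caller has no identity yet).
const allowAuth = createLimiter({
  windowMs: 15 * 60 * 1000,
  max: 10,
  keyFn: (req) => req.ip,
});

// Chat: 30 per minute, keyed by userId (users behind one NAT get separate budgets).
const allowChat = createLimiter({
  windowMs: 60 * 1000,
  max: 30,
  keyFn: (req) => req.userId,
});
```

Swapping keyFn is the whole difference between the two tiers: the same counter logic protects against anonymous probing in one case and against a runaway authenticated client in the other.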

The honest list of things still wrong

I'd rather list these myself than let a reader find them.

JWT in localStorage. It's XSS-readable. Mitigated today by Helmet's CSP and DOMPurify-sanitised Markdown rendering, but the right fix is an httpOnly cookie plus a CSRF token. On the roadmap.

No automated tests. Manual testing only. The README admits this. An integration suite is the next priority — the routes are small enough that supertest plus an in-memory Mongo would cover most regressions in a weekend.

Async indexing is an IIFE on the API process, not a real job queue. Server crashes mid-embed leave documents stuck at 'processing'. The boot-time janitor (snippet in decision #4) is the cheap fix that catches the failure mode that actually happens, but it's not the right long-term answer. BullMQ on Redis is.

Free plan limits hardcoded in three places. The numbers 2 docs / 10 questions per 7 days are duplicated across routes/documents.js, routes/chat.js, and routes/auth.js. Small refactor debt; one constants file would do it. Today the duplication has bitten me zero times. It will eventually.
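For the last item, the one-constants-file refactor could look roughly like this. It is a sketch under assumptions: the file path in the comment and the names PLAN_LIMITS and questionsRemaining are hypothetical, and I am assuming question timestamps are available as an array, which may not match how the routes count usage today.

```javascript
// Hypothetical shared constants, e.g. backend/src/config/planLimits.js
const PLAN_LIMITS = Object.freeze({
  free: Object.freeze({ maxDocuments: 2, maxQuestions: 10, windowDays: 7 }),
});

// Count questions inside the rolling window and report how many are left.
// Assumption: `questionTimes` is an array of Dates or ISO date strings.
function questionsRemaining(questionTimes, now = new Date(), plan = "free") {
  const { maxQuestions, windowDays } = PLAN_LIMITS[plan];
  const cutoff = now.getTime() - windowDays * 24 * 60 * 60 * 1000;
  const used = questionTimes.filter((t) => new Date(t).getTime() > cutoff).length;
  return Math.max(0, maxQuestions - used);
}

console.log(
  questionsRemaining(
    ["2025-01-07T09:00:00Z", "2024-12-31T09:00:00Z"],
    new Date("2025-01-08T00:00:00Z")
  )
);
```

Each route would then import the same object, so changing the free plan means editing one line instead of three files.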

What I'd do differently next time

If I were starting again with what I know now:

Build the eval set on day one. Twenty Q&A pairs with known-good answers. Run them every time you change chunking, k, or the prompt. I built this in week six. Should've been week zero — most of the v1 → v2 work would have been faster, and several decisions I made by feel I would have made by measurement.
Pick the smallest model that works. I started on GPT-4. GPT-4o-mini hits the same quality on this task at roughly 10× lower cost. The "use the big model" instinct is wrong if you haven't measured. The eval set is what makes "smallest that works" a question you can answer instead of a guess.
Citations are the product. Users don't trust AI answers — they trust AI answers with sources they can click. I treated citations as a UI feature. They're actually the trust mechanism the whole thing rides on. Get them right early; the rest of the UI can be ugly.
Stream from the first byte. Even if the backend isn't ready, fake the streaming. Perceived latency is the difference between "this is slow" and "this is alive." v1 took the same total time as v2 from request to last byte; it felt twenty times slower because the user saw nothing for twelve seconds.
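For the first point, a day-one eval harness does not need to be sophisticated. The sketch below is an assumption about how one could be shaped, not the eval set DocuMind uses: each case lists phrases a good answer must contain, and answerFn is a stand-in for whatever calls the RAG pipeline. Phrase matching is a crude proxy for answer quality, but it is enough to catch regressions when chunking, k, or the prompt changes.

```javascript
// A case passes when the answer contains every required phrase (case-insensitive).
function scoreAnswer(answer, mustInclude) {
  const text = answer.toLowerCase();
  return mustInclude.every((phrase) => text.includes(phrase.toLowerCase()));
}

// `answerFn` is injected so the same cases can be run against any variant
// of the pipeline. It takes a question and resolves to an answer string.
async function runEval(cases, answerFn) {
  const failures = [];
  for (const { question, mustInclude } of cases) {
    const answer = await answerFn(question);
    if (!scoreAnswer(answer, mustInclude)) failures.push(question);
  }
  return { total: cases.length, passed: cases.length - failures.length, failures };
}

// Hypothetical usage with made-up cases:
// const report = await runEval(
//   [{ question: "What is the notice period?", mustInclude: ["30 days"] }],
//   (q) => askDocuMind(q) // stand-in for the real pipeline call
// );
// console.log(`${report.passed}/${report.total} passed`, report.failures);
```

Running the cases sequentially keeps the harness simple and avoids hammering rate limits; at twenty cases the extra wall-clock time is small.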

What's next

The roadmap, in priority order:

PDF viewer with highlighted citations. Clicking a source jumps to the page and highlights the matched passage. The biggest single thing I can do for trust.
Stripe billing for Pro. Upgrade flow, webhooks, subscription state. The free plan caps are how I keep my OpenAI bill predictable; Pro is how the project funds itself.
Agent mode. Let the LLM decide when to retrieve versus when to answer directly. Current behaviour retrieves on every turn, which is wasteful for follow-ups like "summarise that."
Team workspaces. Shared library, owner/editor/viewer roles. Real teams want this; individual users don't.
More file types. DOCX, TXT, Markdown, web URLs. Mostly a parsing layer; the rest of the pipeline doesn't care.
Mobile client. React Native. Furthest out — only after the web product earns it.

The point

DocuMind isn't novel. It's the same RAG pattern a thousand teams shipped in 2024. The point of building it wasn't to invent something. It was to internalise, by hand, every layer between "user types a question" and "model returns an answer." Vector stores stop being mysterious once you've watched your own embeddings fail to retrieve obvious matches. Streaming stops being magic once you've stared at a hung response and realised the bottleneck was you, not the model.

The repo is on GitHub and the demo is at docu-mind-neon.vercel.app. Happy to talk about any of this — the trade-offs are the interesting part, and most of them have a defensible argument on the other side.