Security

medium

mid

How do you securely call and handle responses from AI APIs like OpenAI or Hugging Face?

Proxy every call through your server (never put the API key in the browser), stream responses over SSE for a responsive UX, schema-validate structured outputs (Zod), and sanitize any HTML output (DOMPurify). Handle rate limits, timeouts, and errors with retry, backoff, and a circuit breaker; redact what you log; enforce a per-user budget. For OpenAI specifically: stream: true, structured outputs via JSON schema mode, and function calling for tool use. Treat AI responses as untrusted input: never eval them and never render them as raw HTML.


Calling LLM APIs has a few production-grade requirements that go beyond the basic curl example in provider docs.

Architecture: always proxy via your server

Browser ↔ Your API ↔ OpenAI / Anthropic / etc.

Your server:

Holds the API key (never ship it to the browser: provider keys are bearer tokens with no per-user scope, so anyone who extracts one can spend on your account).
Authenticates the user.
Enforces rate limits, quotas, budgets.
Redacts sensitive input before forwarding.
Sanitizes/validates output before returning.
Logs (redacted) for audit + debug.
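
The quota and budget item in the list above can be sketched as a small in-memory tracker. This is a minimal sketch under stated assumptions: `TokenBudget`, the daily UTC window, and the limit value are hypothetical, and it only works for a single server process. A real deployment would more likely keep the counters in a shared store.

```typescript
// Hypothetical per-user daily token budget, held in memory (single process only).
class TokenBudget {
  private used = new Map<string, { day: string; tokens: number }>();
  private readonly dailyLimit: number;

  constructor(dailyLimit: number) {
    this.dailyLimit = dailyLimit;
  }

  // True if the user can still spend `estimate` tokens today.
  canSpend(userId: string, estimate: number, now: Date = new Date()): boolean {
    return this.usedToday(userId, now) + estimate <= this.dailyLimit;
  }

  // Record the actual usage once the provider reports it.
  record(userId: string, tokens: number, now: Date = new Date()): void {
    const total = this.usedToday(userId, now) + tokens;
    this.used.set(userId, { day: dayKey(now), tokens: total });
  }

  private usedToday(userId: string, now: Date): number {
    const entry = this.used.get(userId);
    // Usage recorded on an earlier UTC day no longer counts.
    return entry && entry.day === dayKey(now) ? entry.tokens : 0;
  }
}

// UTC calendar day in YYYY-MM-DD form.
function dayKey(d: Date): string {
  return d.toISOString().slice(0, 10);
}
```

Checking `canSpend` before the call and calling `record` afterwards keeps the check cheap; the estimate could simply be the request's max_tokens.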

Basic call (OpenAI streaming chat)

Server (Node):

import OpenAI from 'openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

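export async function POST(req: Request) {
  // Derive the user from the authenticated session, never from the request body.
  // getAuthenticatedUserId is a hypothetical helper, like rateLimitOk and redactPII.
  const userId = await getAuthenticatedUserId(req);
  if (!userId) return new Response('Unauthorized', { status: 401 });
  const { messages } = await req.json();
  if (!await rateLimitOk(userId)) return new Response('Rate limited', { status: 429 });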

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a helpful assistant. Be concise.' },
      ...messages.map(redactPII),
    ],
    stream: true,
    max_tokens: 1000,
  });

  const encoder = new TextEncoder();
  return new Response(
    new ReadableStream({
      async start(controller) {
        try {
          for await (const chunk of stream) {
            const token = chunk.choices[0]?.delta?.content ?? '';
            if (token) controller.enqueue(encoder.encode(`data: ${JSON.stringify(token)}\n\n`));
          }
          controller.enqueue(encoder.encode('data: [DONE]\n\n'));
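        } catch (e) {
          // Log details server-side; do not leak provider error text to the client.
          console.error('LLM stream failed', e);
          controller.enqueue(encoder.encode(`data: ${JSON.stringify({ error: 'Upstream error' })}\n\n`));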
        } finally {
          controller.close();
        }
      },
    }),
    { headers: { 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache' } }
  );
}

Client (React):

async function send(input: string) {
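  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [...history, { role: 'user', content: input }] }),
  });
  // A 401/429/5xx response has no token stream to read.
  if (!res.ok || !res.body) throw new Error(`Chat request failed: ${res.status}`);
  const reader = res.body.getReader();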
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n\n');
    buffer = lines.pop()!;
    for (const line of lines) {
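      const data = line.replace(/^data: /, '');
      if (data === '[DONE]') return;
      const payload = JSON.parse(data);
      // Tokens arrive as JSON strings; the server sends an { error } object on failure.
      if (typeof payload !== 'string') throw new Error('Stream failed');
      appendToken(payload);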
    }
  }
}
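
The client loop above mixes buffering and parsing. As a sketch, the frame handling can be pulled into a pure function that is easier to test. `parseSseChunk` and its return shape are hypothetical, and the function assumes the framing used by the server example: `data: <JSON>` frames separated by a blank line, ending with `data: [DONE]`.

```typescript
type SseResult = { tokens: string[]; rest: string; done: boolean; error?: string };

// Consumes every complete frame in `buffer`; the trailing partial frame is returned in `rest`.
function parseSseChunk(buffer: string): SseResult {
  const frames = buffer.split('\n\n');
  const rest = frames.pop() ?? ''; // the last piece may be an incomplete frame
  const tokens: string[] = [];

  for (const frame of frames) {
    if (!frame.startsWith('data: ')) continue; // skip comments and unknown fields
    const data = frame.slice('data: '.length);
    if (data === '[DONE]') return { tokens, rest: '', done: true };

    let payload: unknown;
    try {
      payload = JSON.parse(data);
    } catch {
      return { tokens, rest: '', done: true, error: 'Malformed frame' };
    }

    if (typeof payload === 'string') {
      tokens.push(payload);
    } else {
      // Anything that is not a string token is treated as a server-side failure.
      return { tokens, rest: '', done: true, error: 'Upstream error' };
    }
  }
  return { tokens, rest, done: false };
}
```

The reader loop then reduces to: append the decoded bytes to `rest`, call `parseSseChunk`, render `tokens`, and stop when `done` is true.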

Structured outputs

When you need JSON, use the provider's structured-output mode (Zod schema → response):

import { zodResponseFormat } from 'openai/helpers/zod';
import { z } from 'zod';

const Schema = z.object({
  summary: z.string(),
  topics: z.array(z.string()),
  sentiment: z.enum(['positive', 'neutral', 'negative']),
});

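// Older SDK releases expose this helper as openai.beta.chat.completions.parse.
const completion = await openai.chat.completions.parse({
  model: 'gpt-4o-2024-08-06',
  messages,
  response_format: zodResponseFormat(Schema, 'analysis'),
});
// parsed can be null (for example when the model refuses), so re-validate before use.
const result = Schema.safeParse(completion.choices[0].message.parsed);
if (!result.success) throw new Error('Model output failed schema validation');
const data = result.data;  // typed and validated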

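Even with structured output, validate: the response can be a refusal, can be cut off by max_tokens, or can come from a model or mode that does not enforce the schema. Wrap the call in try/catch and run Zod safeParse on the result.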

Function / tool calling

const tools = [{
  type: 'function',
  function: {
    name: 'getWeather',
    description: 'Get current weather for a city',
    parameters: { type: 'object', properties: { city: { type: 'string' } }, required: ['city'] },
  },
}];

const res = await openai.chat.completions.create({ model, messages, tools });
const call = res.choices[0].message.tool_calls?.[0];
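if (call?.function.name === 'getWeather') {
  let args: { city?: unknown } | null;
  try {
    args = JSON.parse(call.function.arguments);   // model output: may be invalid JSON
  } catch {
    throw new Error('Tool arguments were not valid JSON');
  }
  if (typeof args?.city !== 'string') throw new Error('Invalid city argument');
  const weather = await fetchWeather(args.city);   // ← YOUR code executes
  // send back to LLM with the result
}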

Critical: the LLM only suggests the call. Your code decides whether to execute and validates inputs. Never blindly execute arbitrary tool calls.
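
One way to make "your code decides" concrete is to route every suggested call through a gate that checks an allowlist, parses the arguments defensively, and validates them before anything runs. The sketch below is dependency-free; `resolveToolCall`, `ALLOWED_TOOLS`, and the 100-character limit are hypothetical choices, not part of any SDK.

```typescript
// `arguments` is the raw JSON string produced by the model.
type ToolCall = { name: string; arguments: string };

type Resolved =
  | { ok: true; name: 'getWeather'; args: { city: string } }
  | { ok: false; reason: string };

// Only tools listed here may ever be executed.
const ALLOWED_TOOLS = new Set(['getWeather']);

function resolveToolCall(call: ToolCall): Resolved {
  if (!ALLOWED_TOOLS.has(call.name)) {
    return { ok: false, reason: `Unknown tool: ${call.name}` };
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(call.arguments); // may be invalid JSON
  } catch {
    return { ok: false, reason: 'Arguments are not valid JSON' };
  }

  if (typeof parsed !== 'object' || parsed === null) {
    return { ok: false, reason: 'Arguments must be an object' };
  }

  const city = (parsed as Record<string, unknown>).city;
  if (typeof city !== 'string' || city.length === 0 || city.length > 100) {
    return { ok: false, reason: 'Invalid city' };
  }

  // Only the validated field is passed on; extra properties are dropped.
  return { ok: true, name: 'getWeather', args: { city } };
}
```

Authorization (is this user allowed to trigger this tool right now?) would sit next to this check and is left out here.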

Output handling

LLM outputs are untrusted. Treat them like user input.

HTML rendering: <div dangerouslySetInnerHTML={{ __html: DOMPurify.sanitize(output) }} />.
Code execution: never eval. If user wants to run AI-generated code, sandbox (iframe with restrictive CSP, WebWorker, server-side container).
URLs: validate against allowlist before rendering as clickable links.
JSON: schema-validate.
Tool calls: re-authorize, validate args.
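
The URL item in the list above can be sketched with the standard `URL` parser. `isSafeLink` and the hosts in `ALLOWED_HOSTS` are hypothetical; the point is to compare the parsed hostname rather than to string-match the raw text.

```typescript
// Hypothetical allowlist; replace with the hosts your product actually links to.
const ALLOWED_HOSTS = new Set(['example.com', 'docs.example.com']);

function isSafeLink(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw); // throws on relative or malformed input
  } catch {
    return false;
  }
  // Rejects javascript:, data:, and plain http.
  if (url.protocol !== 'https:') return false;
  // Rejects embedded credentials such as https://example.com@evil.test/
  if (url.username !== '' || url.password !== '') return false;
  // Exact hostname match, so example.com.evil.test does not pass.
  return ALLOWED_HOSTS.has(url.hostname);
}
```

Links that fail the check can still be shown as plain text, which keeps the content visible without making it clickable.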

Error handling

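const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// The SDK also retries on its own (maxRetries, default 2); set maxRetries: 0 on the
// client if you retry here, otherwise the two layers multiply.
async function safeLLMCall(req: OpenAI.Chat.ChatCompletionCreateParamsNonStreaming, maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await openai.chat.completions.create(req);
    } catch (e: any) {
      const retryable = e.status === 429 || e.status >= 500;
      // Rethrow non-retryable errors, and do not sleep after the final attempt.
      if (!retryable || attempt === maxAttempts - 1) throw e;
      // headers is a Headers object or a plain object depending on the SDK version.
      const retryAfter = Number(e.headers?.get?.('retry-after') ?? e.headers?.['retry-after']);
      const wait = retryAfter > 0 ? retryAfter * 1000 : 1000 * 2 ** attempt + Math.random() * 500;
      await sleep(wait);
    }
  }
  throw new Error('Exhausted retries'); // unreachable; keeps the return type definite
}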

Different providers have different status semantics — read their docs.
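
Retry with backoff handles short blips. For a sustained outage, a circuit breaker stops sending requests for a cooldown period so that users fail fast instead of waiting through every retry. The following is a minimal sketch: `CircuitBreaker`, its threshold, and its cooldown are hypothetical, and timestamps are passed in so the logic can be tested without real clocks.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold: number, cooldownMs: number) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // Should the next request be attempted at all?
  allow(now: number = Date.now()): boolean {
    if (this.state === 'open' && now - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: let trial requests through until one succeeds or fails.
      this.state = 'half-open';
    }
    return this.state !== 'open';
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  onFailure(now: number = Date.now()): void {
    this.failures += 1;
    // A failed trial request, or too many consecutive failures, opens the circuit.
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';
      this.openedAt = now;
    }
  }
}
```

A caller would check `allow()` before the provider call, return a 503 when it is false, and report the outcome with `onSuccess()` or `onFailure()`.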

Aborting

Pass an AbortController signal so that a user-initiated cancel propagates to the provider:

const ctrl = new AbortController();
const stream = await openai.chat.completions.create({...}, { signal: ctrl.signal });
// later: ctrl.abort() stops generation, saves tokens
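
User cancel is one abort source; a server-side timeout is usually the other. As a sketch, both can feed a single signal that is handed to the SDK call. `withTimeout` is a hypothetical helper built only on the standard `AbortController` API.

```typescript
// Returns a signal that aborts when `parent` aborts or when `ms` elapses, whichever is first.
function withTimeout(parent: AbortSignal, ms: number): { signal: AbortSignal; cancel: () => void } {
  const ctrl = new AbortController();
  const onParentAbort = () => ctrl.abort();
  const timer = setTimeout(() => ctrl.abort(), ms);

  if (parent.aborted) {
    ctrl.abort(); // caller already gave up
  } else {
    parent.addEventListener('abort', onParentAbort, { once: true });
  }

  // Call cancel() once the request settles so the timer and listener are released.
  const cancel = () => {
    clearTimeout(timer);
    parent.removeEventListener('abort', onParentAbort);
  };

  return { signal: ctrl.signal, cancel };
}
```

In a route handler the incoming request's `signal` is a natural parent, although whether it fires on client disconnect depends on the framework and runtime.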

Pitfalls

API key in browser code.
No streaming — blank screen for 10s.
Trusting structured output without schema validation.
Assigning the response to innerHTML → XSS.
Executing tool calls without re-validation.
No retry on 429 → user sees errors at moderate load.
No abort → user navigates away but the call keeps running, burning tokens.
No max_tokens — runaway generation, $$$.
Logging full prompts with PII.
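
The last pitfall is usually handled with a redaction pass before anything reaches the log. The sketch below is deliberately crude: `redact` and its two patterns are hypothetical, cover only email addresses and card-like digit runs, and will miss plenty of real PII. Treat it as the shape of the step rather than a complete filter.

```typescript
// Simplified email pattern; not a full RFC 5322 matcher.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
// 13 to 16 digits, optionally separated by single spaces or hyphens.
const CARD_LIKE = /\b\d(?:[ -]?\d){12,15}\b/g;

// Replaces matches with fixed placeholders so logs stay readable.
function redact(text: string): string {
  return text.replace(EMAIL, '[EMAIL]').replace(CARD_LIKE, '[NUMBER]');
}
```

Applying the same function to prompts before they are forwarded and to log lines before they are written keeps the two paths consistent.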

Mental model

LLM calls are: proxy through your server, stream for UX, schema for structure, sanitize for safety, retry for reliability, abort for cost. Validate everything coming back — the LLM is a probabilistic system, not a deterministic API.

Follow-up questions

•How do you handle function calls safely?
•What's the right way to stream responses to the client?
•How do you validate structured JSON from an LLM?
•What should you log and what should you redact?

Common mistakes

•API key in browser.
•No streaming — blank screen UX.
•Trusting structured output without schema check.
•Rendering LLM HTML without sanitization — XSS.
•Auto-executing tool calls — privilege escalation.
•No max_tokens — runaway generation cost.

Performance considerations

•Streaming cuts perceived latency sharply: the first tokens render as soon as they are generated instead of after the full completion. Caching deterministic queries reduces cost for repeated fact lookups. Aborting on user cancel prevents pointless token burn.
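
A rough sketch of the caching idea follows. It assumes that only requests sent with temperature 0 are worth caching (even those are not guaranteed to be identical across calls); `cacheKey` and `TtlCache` are hypothetical names.

```typescript
type ChatMessage = { role: string; content: string };

// Returns a stable key for repeatable requests, or null when the request should not be cached.
function cacheKey(model: string, temperature: number, messages: ChatMessage[]): string | null {
  if (temperature !== 0) return null; // sampled output: skip the cache
  // Normalize to [role, content] pairs so property order cannot change the key.
  const normalized = messages.map((m) => [m.role, m.content.trim()]);
  return JSON.stringify([model, normalized]);
}

// Minimal in-memory cache with a fixed time-to-live per entry.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= now) {
      this.entries.delete(key); // expired: drop it
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```

If responses can contain user-specific data, the user or tenant id belongs in the key as well, so one user's answer is never served to another.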

Edge cases

•Stream connection drops mid-response — client should display partial + offer retry.
•Tokenization differs by model — count via the right tokenizer.
•Function call args may be invalid JSON — wrap in try/catch.
•Very long single turn can exceed context — chunk or summarize.
•Multilingual content + emoji affects token count significantly.
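
For the "chunk or summarize" case above, a character-based splitter is a common first approximation. The sketch below assumes that a character limit is an acceptable stand-in for a token limit, which is only roughly true given the tokenization caveats in this list; `chunkText` and `hardSplit` are hypothetical helpers.

```typescript
// Splits text into chunks of at most `maxChars`, preferring paragraph boundaries.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars < 1) throw new Error('maxChars must be at least 1');
  const chunks: string[] = [];
  let current = '';

  for (const para of text.split(/\n{2,}/)) {
    // A paragraph longer than the limit is cut into fixed-size pieces.
    const pieces = para.length > maxChars ? hardSplit(para, maxChars) : [para];
    for (const piece of pieces) {
      const candidate = current ? `${current}\n\n${piece}` : piece;
      if (candidate.length <= maxChars) {
        current = candidate; // still fits: keep packing
      } else {
        if (current) chunks.push(current);
        current = piece;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Fixed-size split by UTF-16 code units; may cut an emoji in half at a boundary.
function hardSplit(s: string, size: number): string[] {
  const out: string[] = [];
  for (let i = 0; i < s.length; i += size) out.push(s.slice(i, i + size));
  return out;
}
```

Each chunk can then be summarized separately and the summaries combined, with the limit set well below the model's context window to leave room for the prompt and the reply.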

Real-world examples

•ChatGPT, Claude.ai, Perplexity — all stream over SSE.
•OpenAI SDK has built-in streaming + structured output helpers.
•Vercel AI SDK abstracts streaming + provider switching in React.
•LangChain / LlamaIndex for more complex orchestration.

Senior engineer discussion

Seniors design LLM integration as untrusted-input handling first, UX second. They proxy, stream, validate, sanitize, retry, abort, log redacted. They treat tool calls as suggestions, not commands, and structured outputs as schema-validate-or-reject.