Definition

LLM

An LLM is a neural network trained on massive text corpora to predict the next token. Modern LLMs power coding agents, chat, and tool-using assistants.

A large language model (LLM) is a neural network — almost always a transformer — trained on very large text corpora to predict the next token given the previous tokens. "Large" here means billions to trillions of parameters. Modern LLMs like Claude, GPT, Gemini, Qwen, and Kimi are the engines behind most modern AI developer tools.

Why it matters

Every agentic coding CLI is just a thin orchestration loop around an LLM. When you run Claude Code, Codex CLI, or Qwen Code, the CLI packages your files and instructions into a prompt, sends it to an LLM, and interprets the response as text or tool-use calls. The model is doing the thinking; the CLI is doing the plumbing.

Understanding LLMs — their context window, their tendency to hallucinate, the effect of a system prompt — makes you a better user of every AI coding tool, including the ones SpaceSpider hosts.

How it works

At inference time an LLM takes a sequence of input tokens and produces a probability distribution over the next token. The client samples from that distribution (with temperature, top-p, and top-k parameters tuning how random the pick is), appends the chosen token, and repeats. This happens one token at a time, which is why you see streaming output.
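The sampling step can be sketched in plain Python. This is a toy version — real inference runs over tensors on a GPU — but the temperature and top-k arithmetic is the same:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick the next token id from raw logits.

    temperature < 1 sharpens the distribution (more deterministic);
    temperature > 1 flattens it (more random).
    top_k restricts sampling to the k most likely tokens.
    """
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Optionally keep only the k most probable tokens.
    candidates = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]
    # Sample from the remaining candidates in proportion to their probability.
    weights = [probs[i] for i in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

# top_k=1 is greedy decoding: always pick the argmax.
print(sample_next_token([2.0, 0.5, -1.0, 0.1], temperature=1.0, top_k=1))  # 0
```

Top-p (nucleus) sampling works the same way, except the candidate set is the smallest prefix of tokens whose cumulative probability exceeds p rather than a fixed count.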

Key properties developers care about:

  • Context window — the maximum number of tokens a model can attend to (8k, 200k, 1M+)
  • Training cutoff — the date after which the model has no knowledge (without retrieval)
  • Capability tier — frontier vs. smaller/cheaper models, with tradeoffs in speed and cost
  • Tool use — whether the model can emit structured function calls
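Tool use is worth making concrete. The exact wire format varies by provider, but a tool definition and the call a model emits typically look something like the following — the `read_file` tool, its schema, and the path are made-up examples, not any vendor's actual API:

```python
import json

# A tool definition the client advertises to the model
# (JSON-Schema-style parameter description, common across providers).
read_file_tool = {
    "name": "read_file",
    "description": "Read a file from the workspace",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

# A tool call as a model might emit it: a tool name plus
# JSON-encoded arguments, which the client must parse and execute.
raw_call = '{"name": "read_file", "arguments": "{\\"path\\": \\"src/main.py\\"}"}'

call = json.loads(raw_call)
args = json.loads(call["arguments"])
print(call["name"], args["path"])  # read_file src/main.py
```

The key point: the model only produces structured text; the client is responsible for validating the arguments and actually running the tool.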

Post-training includes instruction tuning, RLHF, and often fine-tuning on code-specific data, which is what makes coding-specialized models good at diffs, compilation errors, and test output.

How it's used

In an agentic CLI loop:

  1. Client builds a prompt: system prompt + conversation history + available tools
  2. LLM emits text or a tool call
  3. Client executes the tool call, adds the result to the conversation
  4. Repeat from step 2 until the model replies with plain text instead of a tool call
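The steps above can be sketched as a minimal Python loop. The `model` callable, the `tools` dict, and the message shapes are assumptions standing in for a real provider SDK:

```python
def run_agent(model, tools, user_message, max_steps=10):
    """Minimal agentic loop. `model` is a callable taking the message list
    and returning a response dict; `tools` maps tool names to Python
    functions. Both are illustrative stand-ins, not a vendor API."""
    messages = [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_steps):
        response = model(messages)            # steps 1-2: prompt, get a reply
        messages.append(response)
        if response.get("tool_call") is None:
            return response["content"]        # plain text: task is done
        call = response["tool_call"]
        result = tools[call["name"]](**call["args"])   # step 3: run the tool
        messages.append({"role": "tool", "content": str(result)})
        # step 4: loop — the model sees the tool result on the next turn
    return None  # gave up after max_steps

# A tiny fake model and tool, just to exercise the loop end to end.
def fake_model(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "2 files found", "tool_call": None}
    return {"role": "assistant", "content": "",
            "tool_call": {"name": "list_files", "args": {"dir": "."}}}

print(run_agent(fake_model, {"list_files": lambda dir: ["a.py", "b.py"]},
                "How many files?"))  # 2 files found
```

Real CLIs add plumbing around this skeleton — permission prompts before running tools, retry logic, streaming — but the control flow is this loop.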

Techniques like RAG, embedding-based retrieval, and context summarization keep long tasks within the context window, so conversations stay productive instead of overflowing it.
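A minimal sketch of the summarization half, assuming a crude 4-characters-per-token estimate and a stub summary where a real CLI would ask the LLM to summarize the dropped turns:

```python
def trim_history(messages, budget, count_tokens=lambda m: len(m["content"]) // 4):
    """Keep a conversation under a token budget by dropping the oldest
    turns (never the system prompt) and leaving a stub summary in their
    place. The 4-chars-per-token heuristic and the stub summary are
    illustrative assumptions, not what any particular CLI does."""
    system, rest = messages[0], messages[1:]
    total = sum(count_tokens(m) for m in messages)
    dropped = 0
    while rest and total > budget:
        total -= count_tokens(rest.pop(0))   # drop the oldest turn first
        dropped += 1
    if dropped:
        rest.insert(0, {"role": "assistant",
                        "content": f"[summary of {dropped} earlier messages]"})
    return [system] + rest

history = [{"role": "system", "content": "s" * 40}] + \
          [{"role": "user", "content": "x" * 400} for _ in range(5)]
trimmed = trim_history(history, budget=300)
print(len(trimmed))  # system + summary stub + the 2 most recent turns -> 4
```

The design choice to protect the system prompt and the most recent turns reflects what actually matters in a coding session: the instructions and the current task state.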

Related terms

  • Token — the atomic unit LLMs operate on
  • Context window — how much the model can see at once
  • Hallucination — the failure mode you care about
  • Fine-tuning — how models get specialized
  • RAG — bolting external knowledge onto an LLM

FAQ

Can I run an LLM locally?

Yes — open-weights models (Llama, Qwen, Mistral, DeepSeek) run on consumer GPUs via llama.cpp, vLLM, or Ollama. Frontier closed models (Claude, GPT-4/5, Gemini Ultra) don't run locally.

Why are coding-specialized LLMs better at code?

They've seen far more code during training and are often fine-tuned on diffs, test cases, and error output. The architecture is the same; the data is what differs.