Ollama

Ollama is an open-source runtime that makes running large language models on your own laptop, workstation, or server trivial. One command, one binary, and an open-weight model, pulled from the Ollama library or imported as a GGUF file from Hugging Face, is serving requests behind an OpenAI-compatible API.

What is it?

Ollama is a local LLM runtime and model manager that bundles llama.cpp, GGUF quantisation, and an HTTP server into a single binary. It runs on macOS, Linux, and Windows, and exposes an OpenAI-compatible API on localhost without any cloud account.
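For example, once the daemon is running, the OpenAI-compatible endpoint can be called with nothing more than curl. A minimal sketch, assuming the `llama3` model has already been pulled and Ollama is listening on its default port 11434:

```bash
# Ask a locally served model a question via the OpenAI-compatible endpoint.
# Assumes `ollama serve` is running and `ollama pull llama3` has completed.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [
      {"role": "user", "content": "Explain GGUF quantisation in one sentence."}
    ]
  }'
```

Because the route mirrors the OpenAI API, existing SDKs can usually be pointed at `http://localhost:11434/v1` by overriding the base URL; no real API key is needed.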

What does it do?

Ollama downloads pre-quantised open-weight models and serves them with a one-line command such as `ollama run llama3`. It handles GPU acceleration via Metal, CUDA, or ROCm, streams tokens, supports multi-modal inputs, and manages concurrent sessions, all without writing a line of Python.
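In practice the whole workflow fits in a few shell commands. A minimal sketch (model tags and exact output will vary with your Ollama version):

```bash
# Download the model weights (pre-quantised GGUF) from the Ollama library.
ollama pull llama3

# Start an interactive chat session in the terminal.
ollama run llama3

# Or hit the native REST API; tokens are streamed back as JSON lines.
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Why is the sky blue?"}'
```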

Where is it used?

Ollama has become the default on-ramp for developer LLM experimentation and on-device AI prototyping. It powers local-first coding copilots, offline chatbots, privacy-preserving RAG demos, and air-gapped enterprise pilots where data cannot leave the perimeter.
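For the RAG-style demos mentioned above, the same daemon can also produce embeddings entirely offline. A minimal sketch, assuming an embedding model such as `nomic-embed-text` has been pulled (the model choice is illustrative):

```bash
# Generate an embedding vector for a document chunk, with no data leaving the machine.
ollama pull nomic-embed-text
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "Ollama runs LLMs locally."}'
```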

When & why it emerged

Ollama was released in 2023 to make local LLM inference as frictionless as `docker run`. It abstracted away the complexity of compiling llama.cpp, picking a quantisation format, and wiring up an HTTP server, a process that had kept non-specialist developers from trying open-weight models.

Why we use it at Internative

We use Ollama for rapid LLM proofs-of-concept, offline demos, and client workshops, where handing over a laptop that runs an LLM entirely without the internet changes the conversation. For production we graduate workloads to vLLM on dedicated inference infrastructure, but Ollama is how most of our discovery-phase work starts.