Lesson 1·25 min·Free

What is a Neural Network?

The fundamental building block of all modern AI

Why Neural Networks Matter

Every large language model — ChatGPT, Claude, Gemini, Llama — is built on neural networks. Before you can build an LLM, you need to understand what a neural network actually is, not as a metaphor, but as a mathematical function.

A neural network is a function that takes numbers as input, transforms them through a series of operations, and produces numbers as output. That's it. The magic isn't in the structure — it's in how the network learns which transformations to apply.

Neurons: The Simplest Unit

A single artificial neuron does three things:

  1. Receives inputs — a set of numbers (x₁, x₂, ..., xₙ)
  2. Computes a weighted sum — w₁x₁ + w₂x₂ + ... + wₙxₙ + b, where each wᵢ is a weight and b is a bias term
  3. Applies an activation function — transforms the sum through a nonlinear function like ReLU: max(0, result)
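
The three steps above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not any library's API; the `neuron` function and the example numbers are made up for demonstration:

```python
# A single artificial neuron: weighted sum of inputs plus bias, then ReLU.
def neuron(inputs, weights, bias):
    # Steps 1-2: receive inputs and compute the weighted sum plus bias
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 3: ReLU activation, max(0, result)
    return max(0.0, total)

# Two inputs, two weights, one bias: 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1
print(neuron([1.0, 2.0], [0.5, -0.25], 0.1))
```

The weights and bias here are fixed by hand; in a real network they start random and are adjusted by training, as described below.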

The weights and bias are the parameters — the numbers the network learns during training. When people say GPT-4 has "trillions of parameters," they mean trillions of these weight values.

Layers: Building Complexity

A single neuron can only learn simple patterns (linear boundaries). Stack neurons into layers, and layers into a network, and you can approximate any continuous function — this is the Universal Approximation Theorem.

A typical neural network has:

  • Input layer — receives the raw data (e.g., pixel values of an image, token IDs of text)
  • Hidden layers — where the learning happens. Each layer extracts increasingly abstract features
  • Output layer — produces the final result (e.g., probability of each possible next word)

The "deep" in "deep learning" simply means many hidden layers. GPT-3 has 96 layers. Each layer transforms the data, building richer representations of meaning.
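
Stacking neurons into layers can be sketched directly from the definition above. This toy forward pass (3 inputs, one hidden layer of 4 neurons, 1 output) uses random weights purely for illustration; the shapes and seed are arbitrary assumptions, not taken from any real model:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def relu(x):
    return max(0.0, x)

def layer(inputs, weights, biases):
    # Each row of `weights` defines one neuron: weighted sum + bias, then ReLU.
    return [relu(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Input layer: 3 raw numbers -> hidden layer: 4 neurons -> output layer: 1 neuron
w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
b1 = [0.0] * 4
w2 = [[random.uniform(-1, 1) for _ in range(4)]]
b2 = [0.0]

hidden = layer([0.2, -0.5, 1.0], w1, b1)  # hidden layer: intermediate features
output = layer(hidden, w2, b2)            # output layer: final result
print(output)
```

Each call to `layer` is one transformation of the data; a deep network is simply many such calls chained together.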

How Learning Works: Gradient Descent

Training a neural network means finding the right values for all those weights. The process:

  1. Forward pass — run an input through the network, get a prediction
  2. Calculate loss — measure how wrong the prediction was (using a loss function)
  3. Backward pass (backpropagation) — calculate how each weight contributed to the error
  4. Update weights — nudge each weight slightly in the direction that reduces the error

Repeat this millions of times with millions of examples, and the network gradually learns to make accurate predictions. This process is called gradient descent — "gradient" because we follow the slope of the error surface, "descent" because we're going downhill toward lower error.

Why This Matters for LLMs

A Large Language Model is a neural network trained to predict the next word (technically, the next token) in a sequence. The network sees "The cat sat on the" and learns to predict "mat" with high probability.
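
The "high probability" part can be made concrete. An LLM's output layer produces one score (a logit) per vocabulary token, and a softmax turns those scores into probabilities. The tiny vocabulary and logit values below are invented for illustration:

```python
import math

# Toy next-token prediction after "The cat sat on the".
# One logit per vocabulary token; softmax converts logits to probabilities.
vocab = ["mat", "dog", "moon", "sat"]
logits = [3.0, 0.5, 0.1, -1.0]  # made-up scores from a hypothetical network

exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = {tok: e / total for tok, e in zip(vocab, exps)}

print(max(probs, key=probs.get))  # "mat" gets the highest probability
```

Training adjusts the weights so that, given this context, the logit for "mat" ends up larger than the others.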

What makes LLMs remarkable is that this simple objective — predict the next token — when applied at massive scale (trillions of tokens, billions of parameters), produces emergent capabilities: reasoning, translation, coding, summarization, and more.

The neural network is the engine. The transformer architecture (which we'll cover in the next lesson) is the specific engine design that made LLMs possible. And the data is the fuel.

Key Concepts to Remember

  • Parameters = the learnable weights in the network (what "175 billion parameters" means)
  • Layers = sequential transformations of data (deeper = more abstract features)
  • Activation functions = nonlinear functions (ReLU, GELU) that let networks learn complex patterns
  • Gradient descent = the learning algorithm that adjusts weights to reduce prediction error
  • Backpropagation = the method for calculating how much each weight contributed to the error
  • Loss function = the mathematical measure of "how wrong was the prediction"

In the next lesson, we'll see how the transformer architecture arranges these neurons into the specific pattern that revolutionized AI.