SARATH THARAYILHS.T.[W] WRITEUPSWWRITEUPS[K] CONCEPTSKCONCEPTS[P] PROJECTSPPROJECTS[A] ABOUTAABOUT
മ
/ SYSTEM

Building thoughtful software, writing notes, and shipping experiments across data, AI, and the web.

No cookies, no tracking. Preferences are stored locally in your browser. Anonymous view counts are kept server-side.

© 2026 Sarath Tharayil/IST --:--:--
++++

Database Normalization, From Chaos to Clean

Database Normalization, From Chaos to Clean

/ METADATA
DATE:2026.5.3
AUTHOR:SARATH THARAYIL
READING TIME:15 MIN READ
ENGAGEMENT:--
CATEGORIES:
DatabasesSQLLearning
NAVIGATE:← GO BACK
2026.5.3·15 MIN READ← back
DatabasesSQLLearning
/ ARTICLE

Databases have a natural tendency to get messy. Not because the people building them are careless, but because when you're moving fast, the easiest thing to do is stuff everything into one table and deal with the consequences later. The consequences usually show up as bugs that are hard to trace: a customer's city is listed as "New York" in three rows and "NYC" in one, a product's price is slightly different depending on which order you pull it from, or deleting an old invoice somehow loses a supplier's contact information forever.

These problems are not random. They follow predictable patterns, and they all come from the same root cause: storing the same fact in more than one place, or storing multiple independent facts in the same row. Database normalization is the process of eliminating these patterns. It gives you a set of rules, called normal forms, that tell you when a table is clean and when it still has a structural problem hiding inside it.

There are six standard normal forms. Each one is stricter than the last, catching a different category of redundancy. This post walks through all six using a single example: GigaStore, a fictional e-commerce company whose database has grown in exactly the wrong direction.

The Starting Point: GigaStore's Mess

GigaStore started as a spreadsheet. Someone exported it to a database table without thinking much about structure. This is what they ended up with:

OrderIDCustomerNameCustomerContactOrderedItemsOrderDate
1001Alice Johnsonalice@gs.com, 555-0101Laptop Pro x2 @999,USBHubx1@999, USB Hub x1 @999,USBHubx1@292024-01-15
1002Bob Williamsbob@gs.com, 555-0202Gaming Mouse x1 @49,Mousepadx2@49, Mousepad x2 @49,Mousepadx2@152024-01-16
1003Alice Johnsonalice@gs.com, 555-0101Laptop Pro x1 @$9992024-02-03

Look at the CustomerContact column. It has Alice's email and phone number shoved into one cell, separated by a comma. The database has no idea that there are two different things in there. If you run a query to find all customers by phone number, it cannot do it. Not reliably. It would have to search inside a string for a partial match, which is slow, fragile, and easy to get wrong.

OrderedItems is even worse. It has product names, quantities, and prices all mashed together. How do you write a query to find the total revenue for Laptop Pro sales? You cannot do it with SQL as it stands. You would have to extract the data from those strings manually, in application code, one row at a time.

Alice also appears twice. Her email and phone are repeated in row 1001 and row 1003. If she updates her phone number, someone has to find every row she appears in and update them all. Miss one and the database has two different phone numbers for the same person. Both look equally valid. There is no way to know which one is correct.

This is what bad structure costs you. Not just in performance, but in trust. You stop being able to rely on what the data says.


First Normal Form: One Value Per Cell

The first normal form has a simple rule: every cell in every table must contain exactly one atomic value. Atomic means it cannot be broken down into smaller parts that the database would need to reason about separately.

A cell with "alice@gs.com, 555-0101" is not atomic. It has two values: an email address and a phone number. A cell with "Laptop Pro x2 @999,USBHubx1@999, USB Hub x1 @999,USBHubx1@29" is definitely not atomic. It has multiple products, each with its own quantity and price.

Think of each cell as a box that can hold exactly one thing. If you are jamming multiple things into the box, separated by commas or semicolons or any other delimiter, you have broken the first rule. The database will treat that whole mess as a single opaque string. It cannot search inside it, cannot sort by any part of it, cannot count distinct values within it, and cannot enforce any constraints on the individual pieces.

Repeating groups also violate first normal form. A repeating group is when you have multiple columns that represent the same kind of thing, like Product1, Product2, Product3. This is the column version of the same problem. If a new product shows up and you only have three columns, you either need to add a new column or leave it out. That is not a database, that is a spreadsheet with delusions.

The fix for first normal form is to give each atomic piece of data its own column and its own row. For GigaStore, that means splitting CustomerContact into CustomerEmail and CustomerPhone, and exploding each order's items into separate rows. The table gets taller, but every cell now has exactly one value, and the database can work with each value individually.

Before

orders_raw
OrderIDCustomerName!CustomerContact!OrderedItemsOrderDate
1001Alice Johnsonalice@gs.com, 555-0101Laptop Pro x2 @$999, USB Hub x1 @$292024-01-15
1002Bob Williamsbob@gs.com, 555-0202Gaming Mouse x1 @$49, Mousepad x2 @$152024-01-16
1003Alice Johnsonalice@gs.com, 555-0101Laptop Pro x1 @$9992024-02-03

SQL

-- CustomerContact has two values crammed into one cell.
-- OrderedItems has multiple products with prices and quantities.
-- Fix: give each piece of data its own column, and one product per row.

CREATE TABLE orders (
  OrderID       INT,
  CustomerName  VARCHAR(100),
  CustomerEmail VARCHAR(100),
  CustomerPhone VARCHAR(20),
  ProductName   VARCHAR(100),
  UnitPrice     DECIMAL(10, 2),
  Qty           INT,
  OrderDate     DATE,
  PRIMARY KEY (OrderID, ProductName)
);

INSERT INTO orders VALUES
  (1001, 'Alice Johnson', 'alice@gs.com', '555-0101', 'Laptop Pro',   999.00, 2, '2024-01-15'),
  (1001, 'Alice Johnson', 'alice@gs.com', '555-0101', 'USB Hub',       29.00, 1, '2024-01-15'),
  (1002, 'Bob Williams',  'bob@gs.com',   '555-0202', 'Gaming Mouse',  49.00, 1, '2024-01-16'),
  (1002, 'Bob Williams',  'bob@gs.com',   '555-0202', 'Mousepad',      15.00, 2, '2024-01-16'),
  (1003, 'Alice Johnson', 'alice@gs.com', '555-0101', 'Laptop Pro',   999.00, 1, '2024-02-03');

One thing to notice: the table now needs a composite primary key. Before, there was no primary key at all. After the fix, a single order can have multiple rows (one per product), so the combination of OrderID and ProductName together is what makes each row unique.

The table is bigger now. That is expected. First normal form trades compactness for correctness. You will get the compactness back in the next step.


Second Normal Form: Every Column Earns Its Place

After first normal form, the orders table has a composite primary key: the pair (OrderID, ProductName). For a table to be in second normal form, every non-key column must depend on the entire primary key, not just part of it.

Look at CustomerName, CustomerEmail, CustomerPhone, and OrderDate. These columns do not need to know the ProductName part of the key. They only depend on OrderID. If you know the OrderID, you know who the customer is and when the order was placed, regardless of which product you are looking at. That is called a partial dependency: a non-key column depends on only part of the composite key.

UnitPrice is the same problem in the other direction. If you know the ProductName, you know the price. It has nothing to do with which OrderID it is attached to. Laptop Pro costs 999inorder1001and999 in order 1001 and 999inorder1001and999 in order 1003 because it is the same product. The price depends on the product, not the order.

Partial dependencies cause three classic problems. First, there is the update anomaly: if Alice moves, you have to update every row she appears in, and if you miss one, the database now has conflicting information about her. Second, there is the delete anomaly: if you delete order 1002, the only order Bob Williams ever placed, you lose Bob's contact information entirely. He vanishes. Third, there is the insert anomaly: you cannot add a new product to the catalog unless it is already part of an order. The product can only exist if it is attached to a row that also needs an OrderID and a customer.

The fix is to take each column and ask: what is the smallest thing it actually depends on? CustomerName depends on CustomerEmail. UnitPrice depends on ProductName. OrderDate depends on OrderID. Qty depends on both OrderID and ProductName together. Put each column in the table that matches exactly what it depends on.

Before

ordersPK: (OrderID, ProductName)
OrderID!CustomerName!CustomerEmail!CustomerPhoneProductName!UnitPriceQty!OrderDate
1001Alice Johnsonalice@gs.com555-0101Laptop Pro$999.0022024-01-15
1001Alice Johnsonalice@gs.com555-0101USB Hub$29.0012024-01-15
1002Bob Williamsbob@gs.com555-0202Gaming Mouse$49.0012024-01-16
1002Bob Williamsbob@gs.com555-0202Mousepad$15.0022024-01-16
1003Alice Johnsonalice@gs.com555-0101Laptop Pro$999.0012024-02-03

SQL

-- CustomerName, Email, Phone, and OrderDate only need OrderID to be known.
-- UnitPrice only needs ProductName to be known.
-- Neither depends on the full (OrderID, ProductName) key.
-- Fix: put each fact in the table that matches what it actually depends on.

CREATE TABLE customers AS
  SELECT DISTINCT CustomerEmail, CustomerName, CustomerPhone FROM orders;

CREATE TABLE products AS
  SELECT DISTINCT ProductName, UnitPrice FROM orders;

CREATE TABLE orders_v2 AS
  SELECT DISTINCT OrderID, CustomerEmail, OrderDate FROM orders;

CREATE TABLE order_items AS
  SELECT OrderID, ProductName, Qty FROM orders;

DROP TABLE orders;

Four tables instead of one. Each table now represents a single concept: a customer, a product, an order, or a line item. The redundancy is gone. Alice's information exists once, in the customers table. Updating her phone number is a single-row change. Bob Williams survives the deletion of his order because his customer record is separate.


Third Normal Form: No Detours

Third normal form deals with a different kind of redundancy: transitive dependencies. A transitive dependency is when column A determines column B, and column B determines column C. So column C depends on the primary key only because it takes a detour through column B. Column C is not directly connected to the primary key.

To make this concrete: imagine the customers table now includes CustomerZip, CustomerCity, and CustomerState. The primary key is CustomerEmail. CustomerEmail determines CustomerZip (each customer has one zip code). CustomerZip determines CustomerCity and CustomerState (each zip code maps to one city and state). So CustomerEmail does determine CustomerCity, but only indirectly, by going through CustomerZip first.

This creates the same kinds of problems as second normal form did. If zip code 10001 gets reorganized into a new borough and its city name changes, you have to update every customer row that has that zip code. Alice and Carol both live in 10001. Update Alice's row but miss Carol's, and now the database has two different city names for the same zip code.

The same pattern shows up in the products table. ProductName determines SupplierID, and SupplierID determines SupplierName and SupplierPhone. SupplierName does not depend on ProductName directly. It only knows the supplier because the product tells it the SupplierID first.

The rule for third normal form is this: every non-key column must depend directly on the primary key, not on another non-key column. Any time you find a chain where the key leads to a middleman, which then leads to another column, break the chain.

Before

customersPK: CustomerEmail
CustomerEmailCustomerNameCustomerPhone!CustomerZip!CustomerCity!CustomerState
alice@gs.comAlice Johnson555-010110001New YorkNY
bob@gs.comBob Williams555-020202101BostonMA
carol@gs.comCarol Davis555-030310001New YorkNY
productsPK: ProductName
ProductNameUnitPriceCategory!SupplierID!SupplierName!SupplierPhone
Laptop Pro$999.00ElectronicsSUP-01TechWorld800-555-1234
USB Hub$29.00ElectronicsSUP-01TechWorld800-555-1234
Gaming Mouse$49.00PeripheralsSUP-02PeriphPlus800-555-5678
Mousepad$15.00PeripheralsSUP-02PeriphPlus800-555-5678

SQL

-- In customers: CustomerEmail -> CustomerZip -> CustomerCity/State.
-- CustomerCity and CustomerState do not depend on CustomerEmail directly.
-- They depend on CustomerZip, which depends on CustomerEmail.
-- That is a transitive chain. Pull zip data into its own table.

CREATE TABLE zip_codes AS
  SELECT DISTINCT CustomerZip AS Zip, CustomerCity AS City, CustomerState AS State
  FROM customers;

ALTER TABLE customers DROP COLUMN CustomerCity;
ALTER TABLE customers DROP COLUMN CustomerState;

-- In products: ProductName -> SupplierID -> SupplierName/Phone.
-- SupplierName and SupplierPhone depend on SupplierID, not ProductName.
-- Pull supplier data into its own table.

CREATE TABLE suppliers AS
  SELECT DISTINCT SupplierID, SupplierName, SupplierPhone
  FROM products;

ALTER TABLE products DROP COLUMN SupplierName;
ALTER TABLE products DROP COLUMN SupplierPhone;

Zip code data now lives once in zip_codes. Two customers can share a zip code without duplicating the city and state. Supplier contact information lives once in suppliers. To look up where Alice lives, you join customers with zip_codes on the CustomerZip column. This is not extra work. This is how relational databases are designed to work.


Boyce-Codd Normal Form: Closing the Loophole

Boyce-Codd Normal Form, often abbreviated BCNF, is a slightly stricter version of third normal form. In most cases, a table that satisfies 3NF also satisfies BCNF. But there is a narrow edge case where 3NF's wording allows something through that still causes redundancy. BCNF closes that gap.

To see the difference, you need to understand what a candidate key is. A candidate key is any column or combination of columns that can uniquely identify every row in the table. A table can have multiple candidate keys. The primary key is just the one you choose to use.

Third normal form says: for every functional dependency (A determines B), either A must be a superkey, or B must be part of some candidate key. That second option is the loophole. BCNF removes it entirely. BCNF says: for every functional dependency, A must always be a superkey. No exceptions.

Here is a situation where the loophole shows up. GigaStore's training platform tracks which students are enrolled in which courses with which instructors. Each instructor teaches exactly one course. So if you know the instructor, you know the course. That is the dependency: InstructorName determines CourseName.

The table has two candidate keys: (StudentEmail, CourseName) and (StudentEmail, InstructorName). Both pairs can uniquely identify a row. All three columns are part of at least one candidate key, which makes them all "prime attributes" in technical terms. Third normal form's wording allows InstructorName to determine CourseName because CourseName is prime. So the table is in 3NF.

But look at the actual data. Dr. Smith always teaches Databases. That fact is repeated in every single row where Dr. Smith appears. If Dr. Smith switches courses, you have to update every row that mentions him. That is a redundancy and BCNF catches it.

Before

enrollmentCandidate keys: (StudentEmail, CourseName) and (StudentEmail, InstructorName)
StudentEmailInstructorName!CourseName
alice@gs.comDr. SmithDatabases
alice@gs.comDr. JonesAlgorithms
bob@gs.comDr. SmithDatabases
carol@gs.comDr. SmithDatabases
carol@gs.comDr. JonesAlgorithms

SQL

-- InstructorName -> CourseName: each instructor teaches exactly one course.
-- But InstructorName is not a superkey. It determines CourseName without
-- being able to uniquely identify a row on its own.
-- This slips through 3NF because CourseName is part of a candidate key.
-- BCNF catches it: every determinant must be a superkey.

CREATE TABLE instructor_courses AS
  SELECT DISTINCT InstructorName, CourseName
  FROM enrollment;
-- InstructorName is the primary key.

CREATE TABLE student_instructors AS
  SELECT StudentEmail, InstructorName
  FROM enrollment;
-- (StudentEmail, InstructorName) is the primary key.

DROP TABLE enrollment;

Now Dr. Smith's course is stored exactly once. The enrollment information is separate. Join instructor_courses and student_instructors on InstructorName and you get back the exact same data as before, with nothing lost.

A practical note: 3NF and BCNF are the same for most tables you will write. The edge case only matters when a table has overlapping candidate keys, where one candidate key overlaps with another. That is relatively rare in practice, but when it does happen, BCNF is the form you actually want.


Fourth Normal Form: One Fact at a Time

Fourth normal form deals with multi-valued dependencies. To understand these, you first need to understand what they are, because they are slightly different from the functional dependencies we have been working with so far.

A functional dependency says: if you know A, you know exactly one value of B. It is a one-to-one relationship from A's perspective.

A multi-valued dependency says: if you know A, you know a set of values for B. It is a one-to-many relationship. The notation is A -->> B, sometimes written as A multi-determines B.

Multi-valued dependencies are not automatically a problem. The issue arises when a table has two independent multi-valued dependencies at the same time. If StaffEmail -->> Skill and StaffEmail -->> Language, and skills and languages have nothing to do with each other, then putting them in the same table forces every skill to appear with every language. The table stores the cross-product of two independent sets.

Imagine Alice speaks English and Spanish, and knows Python and SQL. In the combined table, you get four rows: Python/English, Python/Spanish, SQL/English, SQL/Spanish. Those rows are not four independent facts. They are the mechanical result of combining two separate lists. Now imagine Alice learns a new programming language, Go. You do not just add one row. You add one row per language she speaks, because the structure demands it. That is two new rows for something that should be one fact.

This is the anomaly fourth normal form prevents. The fix is to give each independent multi-valued dependency its own table.

Before

staff_profilesPK: (StaffEmail, Skill, Language) - all three columns
StaffEmail!Skill!Language
alice@gs.comPythonEnglish
alice@gs.comPythonSpanish
alice@gs.comSQLEnglish
alice@gs.comSQLSpanish
bob@gs.comJavaEnglish
bob@gs.comJavaFrench

SQL

-- StaffEmail ->-> Skill (each person has multiple skills).
-- StaffEmail ->-> Language (each person speaks multiple languages).
-- These two sets are completely independent of each other.
-- The table forces every skill to pair with every language.
-- Adding one new skill for Alice requires inserting one row per language.
-- Fix: one table per independent fact.

CREATE TABLE staff_skills AS
  SELECT DISTINCT StaffEmail, Skill
  FROM staff_profiles;

CREATE TABLE staff_languages AS
  SELECT DISTINCT StaffEmail, Language
  FROM staff_profiles;

DROP TABLE staff_profiles;

The six-row table became two tables with three and four rows respectively. The artificial cross-product is gone. Adding a new skill is one row. Adding a new language is one row. The two facts are completely independent of each other, as they should be.


Fifth Normal Form: The Full Decomposition

Fifth normal form is the most advanced and the hardest to explain intuitively. It deals with join dependencies. A join dependency says: a table is equal to the natural join of some number of its own projections. That sounds abstract, so here is what it means in practice.

Sometimes three entities have a relationship with each other that cannot be described by any two of them alone. The relationship only makes sense when all three are in the room at the same time. When that happens, any attempt to split the table into two binary tables will either lose information or create rows that should not exist.

GigaStore's fulfillment system has this kind of relationship. The business rule is: a supplier ships a product to a warehouse only when three conditions all hold at the same time. One, the supplier stocks that product. Two, the supplier ships to that warehouse. Three, that warehouse handles that product. All three must be true for the route to exist.

Because of this three-way constraint, the table supply_routes cannot be losslessly decomposed into two binary tables. If you split it into (SupplierID, ProductName) and (ProductName, WarehouseID), then joining them back together might produce rows that do not actually exist. For example, SUP-01 supplies Laptop Pro and USB Hub, and SUP-01 ships to WH-NYC and WH-LA. But WH-LA only handles Laptop Pro, not USB Hub. A naive two-way join would invent the route (SUP-01, USB Hub, WH-LA), which is wrong.

The only correct decomposition is into three binary tables, one for each pair of entities. The original table can then be reconstructed by joining all three. This is called a join dependency, and eliminating it is what fifth normal form requires.

Before

supply_routesPK: (SupplierID, ProductName, WarehouseID)
!SupplierID!ProductName!WarehouseID
SUP-01Laptop ProWH-NYC
SUP-01Laptop ProWH-LA
SUP-01USB HubWH-NYC
SUP-02Gaming MouseWH-NYC
SUP-02MousepadWH-NYC

SQL

-- supply_routes encodes a three-way constraint:
-- a supplier ships a product to a warehouse only if all three hold:
--   (1) the supplier stocks that product
--   (2) the supplier ships to that warehouse
--   (3) that warehouse handles that product
-- This table equals the join of its three binary projections.
-- No two-way split can represent it losslessly.

CREATE TABLE supplier_products AS
  SELECT DISTINCT SupplierID, ProductName FROM supply_routes;

CREATE TABLE supplier_warehouses AS
  SELECT DISTINCT SupplierID, WarehouseID FROM supply_routes;

CREATE TABLE product_warehouses AS
  SELECT DISTINCT ProductName, WarehouseID FROM supply_routes;

-- To reconstruct:
-- SELECT * FROM supplier_products sp
--   JOIN supplier_warehouses sw USING (SupplierID)
--   JOIN product_warehouses pw USING (ProductName, WarehouseID);

DROP TABLE supply_routes;

The three binary tables each record one pair of facts. To get back the full set of supply routes, you join all three on their shared keys. The join will produce exactly the original rows, no more and no less. No phantom routes, no missing combinations.


When to Stop

Most production databases stop at third normal form or BCNF. That is a reasonable decision in most cases.

Fourth and fifth normal form decompositions come with a real cost. Every query that needs to reconstruct the original data now requires joining three or more tables. Joins are not free. On large datasets, they add planning time and execution time. If your data is read far more often than it is written, the query cost might outweigh the benefit of the cleaner structure. This is why you will see deliberate denormalization in analytics databases and data warehouses: they trade write integrity for read speed.

The decision comes down to the shape of your data and how it changes. If you have two genuinely independent multi-valued facts about the same entity and they grow at different rates, fourth normal form will save you from a lot of unnecessary rows and confusing anomalies. If you have a three-way relationship governed by a cyclic constraint, fifth normal form is worth the extra joins because the alternative is silently storing incorrect data.

Getting to third normal form eliminates the bugs that actually show up in production: the customer whose address is wrong in half the rows, the product whose price disagrees with itself, the record that disappears when an unrelated thing is deleted. BCNF catches the remaining edge cases. For most applications, that is enough.

The value of knowing all six forms is not that you always apply all six. It is that you understand what each form is protecting you from. When you choose to stop at a lower form, you are making an informed decision, not an accident.

If this was worth sharing, send it to someone on 𝕏 or LinkedIn. Got a question or a thought? Drop me a message — I read everything. If this was worth your time, .

Sarath Tharayil
/ CONTENTS(8)
The Starting Point: GigaStore's MessFirst Normal Form: One Value Per CellSecond Normal Form: Every Column Earns Its PlaceThird Normal Form: No DetoursBoyce-Codd Normal Form: Closing the LoopholeFourth Normal Form: One Fact at a TimeFifth Normal Form: The Full DecompositionWhen to Stop
--
/ THAT'S A WRAP

Have a great day.

Thanks for reading all the way to the end.