Database Normalization, From Chaos to Clean
Databases have a natural tendency to get messy. Not because the people building them are careless, but because when you're moving fast, the easiest thing to do is stuff everything into one table and deal with the consequences later. The consequences usually show up as bugs that are hard to trace: a customer's city is listed as "New York" in three rows and "NYC" in one, a product's price is slightly different depending on which order you pull it from, or deleting an old invoice somehow loses a supplier's contact information forever.
These problems are not random. They follow predictable patterns, and they all come from the same root cause: storing the same fact in more than one place, or storing multiple independent facts in the same row. Database normalization is the process of eliminating these patterns. It gives you a set of rules, called normal forms, that tell you when a table is clean and when it still has a structural problem hiding inside it.
There are six standard normal forms. Each one is stricter than the last, catching a different category of redundancy. This post walks through all six using a single example: GigaStore, a fictional e-commerce company whose database has grown in exactly the wrong direction.
The Starting Point: GigaStore's Mess
GigaStore started as a spreadsheet. Someone exported it to a database table without thinking much about structure. This is what they ended up with:
| OrderID | CustomerName | CustomerContact | OrderedItems | OrderDate |
|---|---|---|---|---|
| 1001 | Alice Johnson | alice@gs.com, 555-0101 | Laptop Pro x2 @29 | 2024-01-15 |
| 1002 | Bob Williams | bob@gs.com, 555-0202 | Gaming Mouse x1 @15 | 2024-01-16 |
| 1003 | Alice Johnson | alice@gs.com, 555-0101 | Laptop Pro x1 @$999 | 2024-02-03 |
Look at the CustomerContact column. It has Alice's email and phone number shoved into one cell, separated by a comma. The database has no idea that there are two different things in there. If you run a query to find all customers by phone number, it cannot do it. Not reliably. It would have to search inside a string for a partial match, which is slow, fragile, and easy to get wrong.
OrderedItems is even worse. It has product names, quantities, and prices all mashed together. How do you write a query to find the total revenue for Laptop Pro sales? You cannot do it with SQL as it stands. You would have to extract the data from those strings manually, in application code, one row at a time.
Alice also appears twice. Her email and phone are repeated in row 1001 and row 1003. If she updates her phone number, someone has to find every row she appears in and update them all. Miss one and the database has two different phone numbers for the same person. Both look equally valid. There is no way to know which one is correct.
This is what bad structure costs you. Not just in performance, but in trust. You stop being able to rely on what the data says.
First Normal Form: One Value Per Cell
The first normal form has a simple rule: every cell in every table must contain exactly one atomic value. Atomic means it cannot be broken down into smaller parts that the database would need to reason about separately.
A cell with "alice@gs.com, 555-0101" is not atomic. It has two values: an email address and a phone number. A cell with "Laptop Pro x2 @29" is definitely not atomic. It has multiple products, each with its own quantity and price.
Think of each cell as a box that can hold exactly one thing. If you are jamming multiple things into the box, separated by commas or semicolons or any other delimiter, you have broken the first rule. The database will treat that whole mess as a single opaque string. It cannot search inside it, cannot sort by any part of it, cannot count distinct values within it, and cannot enforce any constraints on the individual pieces.
Repeating groups also violate first normal form. A repeating group is when you have multiple columns that represent the same kind of thing, like Product1, Product2, Product3. This is the column version of the same problem. If a new product shows up and you only have three columns, you either need to add a new column or leave it out. That is not a database, that is a spreadsheet with delusions.
The fix for first normal form is to give each atomic piece of data its own column and its own row. For GigaStore, that means splitting CustomerContact into CustomerEmail and CustomerPhone, and exploding each order's items into separate rows. The table gets taller, but every cell now has exactly one value, and the database can work with each value individually.
One thing to notice: the table now needs a composite primary key. Before, there was no primary key at all. After the fix, a single order can have multiple rows (one per product), so the combination of OrderID and ProductName together is what makes each row unique.
The table is bigger now. That is expected. First normal form trades compactness for correctness. You will get the compactness back in the next step.
Second Normal Form: Every Column Earns Its Place
After first normal form, the orders table has a composite primary key: the pair (OrderID, ProductName). For a table to be in second normal form, every non-key column must depend on the entire primary key, not just part of it.
Look at CustomerName, CustomerEmail, CustomerPhone, and OrderDate. These columns do not need to know the ProductName part of the key. They only depend on OrderID. If you know the OrderID, you know who the customer is and when the order was placed, regardless of which product you are looking at. That is called a partial dependency: a non-key column depends on only part of the composite key.
UnitPrice is the same problem in the other direction. If you know the ProductName, you know the price. It has nothing to do with which OrderID it is attached to. Laptop Pro costs 999 in order 1003 because it is the same product. The price depends on the product, not the order.
Partial dependencies cause three classic problems. First, there is the update anomaly: if Alice moves, you have to update every row she appears in, and if you miss one, the database now has conflicting information about her. Second, there is the delete anomaly: if you delete order 1002, the only order Bob Williams ever placed, you lose Bob's contact information entirely. He vanishes. Third, there is the insert anomaly: you cannot add a new product to the catalog unless it is already part of an order. The product can only exist if it is attached to a row that also needs an OrderID and a customer.
The fix is to take each column and ask: what is the smallest thing it actually depends on? CustomerName depends on CustomerEmail. UnitPrice depends on ProductName. OrderDate depends on OrderID. Qty depends on both OrderID and ProductName together. Put each column in the table that matches exactly what it depends on.
Four tables instead of one. Each table now represents a single concept: a customer, a product, an order, or a line item. The redundancy is gone. Alice's information exists once, in the customers table. Updating her phone number is a single-row change. Bob Williams survives the deletion of his order because his customer record is separate.
Third Normal Form: No Detours
Third normal form deals with a different kind of redundancy: transitive dependencies. A transitive dependency is when column A determines column B, and column B determines column C. So column C depends on the primary key only because it takes a detour through column B. Column C is not directly connected to the primary key.
To make this concrete: imagine the customers table now includes CustomerZip, CustomerCity, and CustomerState. The primary key is CustomerEmail. CustomerEmail determines CustomerZip (each customer has one zip code). CustomerZip determines CustomerCity and CustomerState (each zip code maps to one city and state). So CustomerEmail does determine CustomerCity, but only indirectly, by going through CustomerZip first.
This creates the same kinds of problems as second normal form did. If zip code 10001 gets reorganized into a new borough and its city name changes, you have to update every customer row that has that zip code. Alice and Carol both live in 10001. Update Alice's row but miss Carol's, and now the database has two different city names for the same zip code.
The same pattern shows up in the products table. ProductName determines SupplierID, and SupplierID determines SupplierName and SupplierPhone. SupplierName does not depend on ProductName directly. It only knows the supplier because the product tells it the SupplierID first.
The rule for third normal form is this: every non-key column must depend directly on the primary key, not on another non-key column. Any time you find a chain where the key leads to a middleman, which then leads to another column, break the chain.
Zip code data now lives once in zip_codes. Two customers can share a zip code without duplicating the city and state. Supplier contact information lives once in suppliers. To look up where Alice lives, you join customers with zip_codes on the CustomerZip column. This is not extra work. This is how relational databases are designed to work.
Boyce-Codd Normal Form: Closing the Loophole
Boyce-Codd Normal Form, often abbreviated BCNF, is a slightly stricter version of third normal form. In most cases, a table that satisfies 3NF also satisfies BCNF. But there is a narrow edge case where 3NF's wording allows something through that still causes redundancy. BCNF closes that gap.
To see the difference, you need to understand what a candidate key is. A candidate key is any column or combination of columns that can uniquely identify every row in the table. A table can have multiple candidate keys. The primary key is just the one you choose to use.
Third normal form says: for every functional dependency (A determines B), either A must be a superkey, or B must be part of some candidate key. That second option is the loophole. BCNF removes it entirely. BCNF says: for every functional dependency, A must always be a superkey. No exceptions.
Here is a situation where the loophole shows up. GigaStore's training platform tracks which students are enrolled in which courses with which instructors. Each instructor teaches exactly one course. So if you know the instructor, you know the course. That is the dependency: InstructorName determines CourseName.
The table has two candidate keys: (StudentEmail, CourseName) and (StudentEmail, InstructorName). Both pairs can uniquely identify a row. All three columns are part of at least one candidate key, which makes them all "prime attributes" in technical terms. Third normal form's wording allows InstructorName to determine CourseName because CourseName is prime. So the table is in 3NF.
But look at the actual data. Dr. Smith always teaches Databases. That fact is repeated in every single row where Dr. Smith appears. If Dr. Smith switches courses, you have to update every row that mentions him. That is a redundancy and BCNF catches it.
Now Dr. Smith's course is stored exactly once. The enrollment information is separate. Join instructor_courses and student_instructors on InstructorName and you get back the exact same data as before, with nothing lost.
A practical note: 3NF and BCNF are the same for most tables you will write. The edge case only matters when a table has overlapping candidate keys, where one candidate key overlaps with another. That is relatively rare in practice, but when it does happen, BCNF is the form you actually want.
Fourth Normal Form: One Fact at a Time
Fourth normal form deals with multi-valued dependencies. To understand these, you first need to understand what they are, because they are slightly different from the functional dependencies we have been working with so far.
A functional dependency says: if you know A, you know exactly one value of B. It is a one-to-one relationship from A's perspective.
A multi-valued dependency says: if you know A, you know a set of values for B. It is a one-to-many relationship. The notation is A -->> B, sometimes written as A multi-determines B.
Multi-valued dependencies are not automatically a problem. The issue arises when a table has two independent multi-valued dependencies at the same time. If StaffEmail -->> Skill and StaffEmail -->> Language, and skills and languages have nothing to do with each other, then putting them in the same table forces every skill to appear with every language. The table stores the cross-product of two independent sets.
Imagine Alice speaks English and Spanish, and knows Python and SQL. In the combined table, you get four rows: Python/English, Python/Spanish, SQL/English, SQL/Spanish. Those rows are not four independent facts. They are the mechanical result of combining two separate lists. Now imagine Alice learns a new programming language, Go. You do not just add one row. You add one row per language she speaks, because the structure demands it. That is two new rows for something that should be one fact.
This is the anomaly fourth normal form prevents. The fix is to give each independent multi-valued dependency its own table.
The six-row table became two tables with three and four rows respectively. The artificial cross-product is gone. Adding a new skill is one row. Adding a new language is one row. The two facts are completely independent of each other, as they should be.
Fifth Normal Form: The Full Decomposition
Fifth normal form is the most advanced and the hardest to explain intuitively. It deals with join dependencies. A join dependency says: a table is equal to the natural join of some number of its own projections. That sounds abstract, so here is what it means in practice.
Sometimes three entities have a relationship with each other that cannot be described by any two of them alone. The relationship only makes sense when all three are in the room at the same time. When that happens, any attempt to split the table into two binary tables will either lose information or create rows that should not exist.
GigaStore's fulfillment system has this kind of relationship. The business rule is: a supplier ships a product to a warehouse only when three conditions all hold at the same time. One, the supplier stocks that product. Two, the supplier ships to that warehouse. Three, that warehouse handles that product. All three must be true for the route to exist.
Because of this three-way constraint, the table supply_routes cannot be losslessly decomposed into two binary tables. If you split it into (SupplierID, ProductName) and (ProductName, WarehouseID), then joining them back together might produce rows that do not actually exist. For example, SUP-01 supplies Laptop Pro and USB Hub, and SUP-01 ships to WH-NYC and WH-LA. But WH-LA only handles Laptop Pro, not USB Hub. A naive two-way join would invent the route (SUP-01, USB Hub, WH-LA), which is wrong.
The only correct decomposition is into three binary tables, one for each pair of entities. The original table can then be reconstructed by joining all three. This is called a join dependency, and eliminating it is what fifth normal form requires.
The three binary tables each record one pair of facts. To get back the full set of supply routes, you join all three on their shared keys. The join will produce exactly the original rows, no more and no less. No phantom routes, no missing combinations.
When to Stop
Most production databases stop at third normal form or BCNF. That is a reasonable decision in most cases.
Fourth and fifth normal form decompositions come with a real cost. Every query that needs to reconstruct the original data now requires joining three or more tables. Joins are not free. On large datasets, they add planning time and execution time. If your data is read far more often than it is written, the query cost might outweigh the benefit of the cleaner structure. This is why you will see deliberate denormalization in analytics databases and data warehouses: they trade write integrity for read speed.
The decision comes down to the shape of your data and how it changes. If you have two genuinely independent multi-valued facts about the same entity and they grow at different rates, fourth normal form will save you from a lot of unnecessary rows and confusing anomalies. If you have a three-way relationship governed by a cyclic constraint, fifth normal form is worth the extra joins because the alternative is silently storing incorrect data.
Getting to third normal form eliminates the bugs that actually show up in production: the customer whose address is wrong in half the rows, the product whose price disagrees with itself, the record that disappears when an unrelated thing is deleted. BCNF catches the remaining edge cases. For most applications, that is enough.
The value of knowing all six forms is not that you always apply all six. It is that you understand what each form is protecting you from. When you choose to stop at a lower form, you are making an informed decision, not an accident.
If this was worth sharing, send it to someone on 𝕏 or LinkedIn. Got a question or a thought? Drop me a message — I read everything. If this was worth your time, .