MongoDB Data Modeling Best Practices

Welcome to Part 4 of our MongoDB Zero to Hero series. After mastering CRUD operations, it's time to understand how to structure your data effectively in MongoDB.

Understanding Document-Oriented Design

Unlike relational databases with rigid schemas, MongoDB offers flexible document-based data modeling. This flexibility is both powerful and potentially problematic if not used wisely.

Key Principles

Model for Your Application: Design around your application's data access patterns
Embrace Denormalization: It's often better than complex joins
Think in Terms of Documents: Not tables and rows
Consider Read/Write Patterns: Optimize for your most common operations

Data Modeling Patterns

1. Embedding (One-to-One and One-to-Few)

When to Use: When you have related data that's accessed together and doesn't grow unbounded.

// User with embedded address (One-to-One)
{
  _id: ObjectId("..."),
  name: "John Doe",
  email: "john@example.com",
  address: {
    street: "123 Main St",
    city: "New York",
    state: "NY",
    zipCode: "10001",
    country: "USA"
  },
  createdAt: ISODate("2024-01-15")
}

// Blog post with embedded comments (One-to-Few)
{
  _id: ObjectId("..."),
  title: "MongoDB Data Modeling",
  content: "Content of the blog post...",
  author: "Jane Smith",
  tags: ["mongodb", "database", "nosql"],
  comments: [
    {
      _id: ObjectId("..."),
      author: "Reader1",
      text: "Great post!",
      date: ISODate("2024-01-16")
    },
    {
      _id: ObjectId("..."),
      author: "Reader2",
      text: "Very helpful, thanks!",
      date: ISODate("2024-01-17")
    }
  ],
  publishedAt: ISODate("2024-01-15")
}

Advantages:

Single query to get all related data
Atomic updates (a write to a single document is atomic)
Better performance for read operations

Disadvantages:

Document size can grow toward the 16 MB BSON document limit
Potential for data duplication
Complex updates when embedded data changes
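
One way to keep an embedded array from growing without bound is to cap it at write time. The sketch below builds an update document that appends a comment and keeps only the newest entries, using `$push` with the `$each` and `$slice` modifiers. The helper name `buildCappedCommentPush` and the cap value are assumptions for illustration; older comments that fall off the end would need to live somewhere else, such as a separate collection.

```javascript
// Hypothetical helper: builds an update document that appends one comment
// and trims the embedded array to the newest `maxComments` entries.
function buildCappedCommentPush(comment, maxComments = 50) {
  if (!Number.isInteger(maxComments) || maxComments <= 0) {
    throw new RangeError("maxComments must be a positive integer");
  }
  return {
    $push: {
      comments: {
        $each: [comment],     // $slice must be used together with $each
        $slice: -maxComments  // a negative value keeps the last N elements
      }
    }
  };
}

// Possible use in mongosh (assumes a `posts` collection and a known postId):
// db.posts.updateOne({ _id: postId }, buildCappedCommentPush(newComment, 50));
const update = buildCappedCommentPush({ author: "Reader3", text: "Nice!" }, 2);
console.log(JSON.stringify(update));
```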

2. Referencing (One-to-Many and Many-to-Many)

When to Use: When you have large amounts of related data or many-to-many relationships.

// User document
{
  _id: ObjectId("user1"),
  name: "Alice Johnson",
  email: "alice@example.com",
  createdAt: ISODate("2024-01-15")
}

// Order documents (One-to-Many)
{
  _id: ObjectId("order1"),
  userId: ObjectId("user1"),  // Reference to user
  items: [
    {
      productId: ObjectId("product1"),
      name: "MongoDB Book",
      price: 29.99,
      quantity: 1
    }
  ],
  total: 29.99,
  status: "completed",
  orderDate: ISODate("2024-01-16")
}

// Many-to-Many: Users and Roles
// User document
{
  _id: ObjectId("user1"),
  name: "Bob Smith",
  email: "bob@example.com",
  roleIds: [ObjectId("role1"), ObjectId("role2")]  // Array of references
}

// Role documents
{
  _id: ObjectId("role1"),
  name: "admin",
  permissions: ["read", "write", "delete"]
}
{
  _id: ObjectId("role2"),
  name: "editor",
  permissions: ["read", "write"]
}

Advantages:

Avoids data duplication
Better for frequently changing data
Supports large datasets
Easier to maintain consistency

Disadvantages:

Requires multiple queries or $lookup
No foreign key constraints
More complex application logic
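
To make the `$lookup` cost concrete, here is a minimal sketch of an aggregation pipeline that returns one user's orders with the user document joined in. The collection name `users` and the helper name `buildOrdersWithUserPipeline` are assumptions based on the example documents above.

```javascript
// Hypothetical helper: builds a pipeline that joins each order to its user.
function buildOrdersWithUserPipeline(userId) {
  return [
    { $match: { userId: userId } },  // filter first so the join touches fewer documents
    {
      $lookup: {
        from: "users",         // collection to join against (assumed name)
        localField: "userId",  // reference stored on the order
        foreignField: "_id",   // key on the user document
        as: "user"             // joined documents land in this array field
      }
    },
    // Turn the one-element array into an object.
    // Note: $unwind drops orders whose user no longer exists.
    { $unwind: "$user" }
  ];
}

// Possible use in mongosh:
// db.orders.aggregate(buildOrdersWithUserPipeline(someUserId));
```

Because nothing enforces the reference, the application has to decide what a missing user means; the `$unwind` stage here silently skips such orders.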

3. Hybrid Approach

Often, the best solution combines embedding and referencing:

// E-commerce product with embedded variants but referenced reviews
{
  _id: ObjectId("product1"),
  name: "Laptop",
  brand: "TechBrand",
  category: "Electronics",

  // Embedded variants (few, stable)
  variants: [
    {
      sku: "LAPTOP-001-BLK",
      color: "Black",
      storage: "256GB",
      price: 999.99,
      inventory: 50
    },
    {
      sku: "LAPTOP-001-SLV",
      color: "Silver",
      storage: "512GB",
      price: 1199.99,
      inventory: 30
    }
  ],

  // Basic review stats (frequently accessed)
  reviewSummary: {
    averageRating: 4.2,
    totalReviews: 156,
    lastUpdated: ISODate("2024-01-16")
  },

  // Reference to detailed reviews (many, growing)
  // Reviews stored in separate collection

  createdAt: ISODate("2024-01-10")
}

// Separate reviews collection
{
  _id: ObjectId("review1"),
  productId: ObjectId("product1"),
  userId: ObjectId("user1"),
  rating: 5,
  title: "Excellent laptop!",
  content: "Fast performance, great build quality...",
  helpful: 23,
  verified: true,
  createdAt: ISODate("2024-01-16")
}
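
The embedded `reviewSummary` has to be kept in step with the reviews collection. A rough sketch of how the summary could be recomputed when a new review arrives is shown below; the function name `applyReviewToSummary` and the 1 to 5 rating range are assumptions, not part of the original schema.

```javascript
// Hypothetical helper: returns a new summary that includes one more rating.
function applyReviewToSummary(summary, rating, now = new Date()) {
  if (typeof rating !== "number" || rating < 1 || rating > 5) {
    throw new RangeError("rating must be a number between 1 and 5");
  }
  const totalReviews = summary.totalReviews + 1;
  // Rebuild the running total from the old average, then add the new rating.
  const averageRating =
    (summary.averageRating * summary.totalReviews + rating) / totalReviews;
  return { averageRating, totalReviews, lastUpdated: now };
}

const next = applyReviewToSummary({ averageRating: 4.2, totalReviews: 156 }, 5);
console.log(next.totalReviews, next.averageRating.toFixed(2));
```

Recomputing from a stored average accumulates floating-point error over time; storing a running sum of ratings next to the count would be one way to avoid that.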

Schema Design Patterns

1. Polymorphic Pattern

Store different types of entities in the same collection:

// Events collection with different event types
{
  _id: ObjectId("event1"),
  type: "user_registration",
  timestamp: ISODate("2024-01-16"),
  userId: ObjectId("user1"),
  data: {
    email: "user@example.com",
    source: "website"
  }
}

{
  _id: ObjectId("event2"),
  type: "purchase",
  timestamp: ISODate("2024-01-16"),
  userId: ObjectId("user1"),
  data: {
    orderId: ObjectId("order1"),
    amount: 99.99,
    paymentMethod: "credit_card"
  }
}

{
  _id: ObjectId("event3"),
  type: "page_view",
  timestamp: ISODate("2024-01-16"),
  userId: ObjectId("user1"),
  data: {
    page: "/products/laptop",
    referrer: "https://google.com",
    duration: 45
  }
}
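
On the application side, a polymorphic collection usually means branching on the `type` field. A minimal sketch of that dispatch, using the three event shapes above, might look like this; the handler table and the `describeEvent` name are illustrative assumptions.

```javascript
// One handler per event type; each knows the shape of its own `data` field.
const eventHandlers = new Map([
  ["user_registration", (e) => `registered ${e.data.email} via ${e.data.source}`],
  ["purchase", (e) => `purchase of ${e.data.amount} paid by ${e.data.paymentMethod}`],
  ["page_view", (e) => `viewed ${e.data.page} for ${e.data.duration}s`]
]);

// Hypothetical helper: routes an event document to the handler for its type.
function describeEvent(event) {
  const handler = eventHandlers.get(event.type);
  if (!handler) {
    // Unknown types are an error here; another policy could skip or log them.
    throw new Error(`Unknown event type: ${event.type}`);
  }
  return handler(event);
}

console.log(describeEvent({
  type: "page_view",
  data: { page: "/products/laptop", referrer: "https://google.com", duration: 45 }
}));
```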

2. Attribute Pattern

Handle documents with many similar fields or sparse data:

// Traditional approach (sparse, many null values)
{
  _id: ObjectId("product1"),
  name: "Laptop",
  color: "Black",
  weight: 2.5,
  screenSize: 15.6,
  ramSize: 16,
  storageSize: 512,
  cpuSpeed: 2.4,
  // ... potentially hundreds of other attributes
  batteryLife: null,
  waterproof: null,
  // ... many null values
}

// Attribute pattern (flexible, efficient)
{
  _id: ObjectId("product1"),
  name: "Laptop",
  category: "Electronics",
  attributes: [
    { name: "color", value: "Black", type: "string" },
    { name: "weight", value: 2.5, type: "number", unit: "kg" },
    { name: "screenSize", value: 15.6, type: "number", unit: "inches" },
    { name: "ramSize", value: 16, type: "number", unit: "GB" },
    { name: "storageSize", value: 512, type: "number", unit: "GB" },
    { name: "cpuSpeed", value: 2.4, type: "number", unit: "GHz" }
  ]
}

// Create index for efficient attribute queries
db.products.createIndex({ "attributes.name": 1, "attributes.value": 1 })

// Query example
db.products.find({
  "attributes": {
    $elemMatch: {
      "name": "ramSize",
      "value": { $gte: 16 }
    }
  }
})
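
Moving from the sparse layout to the attribute layout is a mechanical transformation. The following sketch converts one document, dropping null values along the way. The function name `toAttributePattern` and the list of fields kept at the top level are assumptions, and units are left out because the sparse document does not carry them.

```javascript
// Hypothetical helper: converts a sparse product document to the attribute pattern.
function toAttributePattern(doc, fixedFields = ["_id", "name", "category"]) {
  const fixed = {};
  const attributes = [];
  for (const [key, value] of Object.entries(doc)) {
    if (fixedFields.includes(key)) {
      fixed[key] = value;  // stays a regular top-level field
    } else if (value !== null && value !== undefined) {
      // Only attributes that actually have a value are stored.
      attributes.push({ name: key, value, type: typeof value });
    }
  }
  return { ...fixed, attributes };
}

const sparse = { _id: "product1", name: "Laptop", color: "Black", weight: 2.5, batteryLife: null };
console.log(JSON.stringify(toAttributePattern(sparse), null, 2));
```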

3. Bucket Pattern

Aggregate time-series or similar data:

// Instead of one document per data point
{
  _id: ObjectId("reading1"),
  sensorId: "sensor001",
  timestamp: ISODate("2024-01-16T10:00:00Z"),
  temperature: 23.5,
  humidity: 45.2
}

// Bucket pattern: group multiple readings
{
  _id: ObjectId("bucket1"),
  sensorId: "sensor001",
  date: ISODate("2024-01-16"),
  hour: 10,
  readings: [
    {
      minute: 0,
      temperature: 23.5,
      humidity: 45.2
    },
    {
      minute: 1,
      temperature: 23.6,
      humidity: 45.1
    },
    // ... up to 60 readings per hour
  ],
  count: 60,
  averageTemp: 23.8,
  averageHumidity: 45.0
}
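
As an illustration of the grouping step, this sketch turns individual readings into hourly buckets in memory and computes the per-bucket averages. It assumes timestamps are handled in UTC and uses a hypothetical function name, `bucketReadings`; in a live system the same result would more likely be built incrementally with upserts.

```javascript
// Hypothetical helper: groups per-reading documents into one bucket per sensor per hour.
function bucketReadings(readings) {
  const buckets = new Map();
  for (const r of readings) {
    const ts = new Date(r.timestamp);
    const day = ts.toISOString().slice(0, 10);  // UTC date, "YYYY-MM-DD"
    const hour = ts.getUTCHours();
    const key = `${r.sensorId}|${day}|${hour}`;
    let bucket = buckets.get(key);
    if (!bucket) {
      bucket = {
        sensorId: r.sensorId,
        date: new Date(`${day}T00:00:00Z`),
        hour,
        readings: [],
        sumTemp: 0,
        sumHumidity: 0
      };
      buckets.set(key, bucket);
    }
    bucket.readings.push({
      minute: ts.getUTCMinutes(),
      temperature: r.temperature,
      humidity: r.humidity
    });
    bucket.sumTemp += r.temperature;
    bucket.sumHumidity += r.humidity;
  }
  // Replace the running sums with the summary fields stored on the bucket.
  return [...buckets.values()].map(({ sumTemp, sumHumidity, ...rest }) => {
    const count = rest.readings.length;
    return {
      ...rest,
      count,
      averageTemp: sumTemp / count,
      averageHumidity: sumHumidity / count
    };
  });
}
```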

4. Outlier Pattern

Handle documents that don't fit the normal pattern:

// Normal social media post
{
  _id: ObjectId("post1"),
  userId: ObjectId("user1"),
  content: "Just learned MongoDB data modeling!",
  likes: ["user2", "user3", "user4"],  // Few likes, can embed
  comments: [
    { userId: ObjectId("user2"), text: "Great!" },
    { userId: ObjectId("user3"), text: "Awesome!" }
  ],
  createdAt: ISODate("2024-01-16")
}

// Viral post (outlier with many likes)
{
  _id: ObjectId("post2"),
  userId: ObjectId("user1"),
  content: "Viral post content...",
  likes: {
    count: 50000,
    isOverflow: true  // Flag indicating likes are in separate collection
  },
  comments: {
    count: 5000,
    isOverflow: true  // Comments also in separate collection
  },
  createdAt: ISODate("2024-01-16")
}

// Separate collections for overflow data
// likes_overflow collection
{
  _id: ObjectId("likes1"),
  postId: ObjectId("post2"),
  userIds: ["user1", "user2", ..., "user1000"]  // Batch of 1000 user IDs
}

// comments_overflow collection
{
  _id: ObjectId("comments1"),
  postId: ObjectId("post2"),
  comments: [
    { userId: ObjectId("user1"), text: "Amazing!", date: ISODate("...") },
    // ... more comments
  ]
}
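
Application code then has to cope with both shapes of the `likes` field. The sketch below shows one way to read the like count uniformly and to split a long list of user IDs into overflow documents of 1000 each. The function names are hypothetical, and the batch size simply mirrors the comment in the example above.

```javascript
// Hypothetical helper: works for both the embedded-array and the overflow shape.
function getLikeCount(post) {
  return Array.isArray(post.likes) ? post.likes.length : post.likes.count;
}

// True when the full list of likes must be fetched from the overflow collection.
function needsOverflowLookup(post) {
  return !Array.isArray(post.likes) && post.likes.isOverflow === true;
}

// Hypothetical helper: splits user IDs into overflow documents of `batchSize` each.
function toOverflowDocs(postId, userIds, batchSize = 1000) {
  const docs = [];
  for (let i = 0; i < userIds.length; i += batchSize) {
    docs.push({ postId, userIds: userIds.slice(i, i + batchSize) });
  }
  return docs;
}

const normalPost = { likes: ["user2", "user3", "user4"] };
const viralPost = { likes: { count: 50000, isOverflow: true } };
console.log(getLikeCount(normalPost), getLikeCount(viralPost));
```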

Relationships in MongoDB

One-to-One Relationships

// Embed when data is accessed together
{
  _id: ObjectId("user1"),
  name: "John Doe",
  email: "john@example.com",
  profile: {  // One-to-one embedded
    bio: "Software developer...",
    avatar: "https://example.com/avatar.jpg",
    preferences: {
      theme: "dark",
      notifications: true
    }
  }
}

// Reference when data is large or accessed separately
{
  _id: ObjectId("user1"),
  name: "John Doe",
  email: "john@example.com",
  profileId: ObjectId("profile1")  // Reference to separate profile document
}

One-to-Many Relationships

// Embed Many in One (when "many" is limited)
{
  _id: ObjectId("order1"),
  customerId: ObjectId("customer1"),
  items: [  // Embedded line items
    {
      productId: ObjectId("product1"),
      name: "Product Name",
      price: 19.99,
      quantity: 2
    }
  ],
  total: 39.98
}

// Reference One from Many (when "many" is unlimited)
// Customer document
{
  _id: ObjectId("customer1"),
  name: "Customer Name",
  email: "customer@example.com"
}

// Many order documents
{
  _id: ObjectId("order1"),
  customerId: ObjectId("customer1"),  // Reference to customer
  total: 39.98,
  date: ISODate("2024-01-16")
}

Many-to-Many Relationships

// Students and Courses (embed array of references)
// Student document
{
  _id: ObjectId("student1"),
  name: "Alice Smith",
  email: "alice@university.edu",
  courseIds: [  // Array of course references
    ObjectId("course1"),
    ObjectId("course2"),
    ObjectId("course3")
  ]
}

// Course document
{
  _id: ObjectId("course1"),
  name: "Database Systems",
  code: "CS301",
  instructor: "Dr. Johnson",
  studentIds: [  // Array of student references
    ObjectId("student1"),
    ObjectId("student2"),
    // ... more students
  ]
}

// Alternative: Junction collection for complex many-to-many
// enrollment collection
{
  _id: ObjectId("enrollment1"),
  studentId: ObjectId("student1"),
  courseId: ObjectId("course1"),
  enrollmentDate: ISODate("2024-01-16"),
  grade: null,
  status: "active"
}
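
With two-way referencing, enrolling a student means writing to both documents. The sketch below builds the pair of updates using `$addToSet`, which skips the write if the ID is already in the array. The helper name `buildEnrollmentUpdates` is an assumption; the field names follow the student and course documents above.

```javascript
// Hypothetical helper: builds the two updates needed to link a student and a course.
function buildEnrollmentUpdates(studentId, courseId) {
  return {
    student: {
      filter: { _id: studentId },
      update: { $addToSet: { courseIds: courseId } }   // no duplicate if already enrolled
    },
    course: {
      filter: { _id: courseId },
      update: { $addToSet: { studentIds: studentId } }
    }
  };
}

// Possible use in mongosh (collection names are assumed):
// const ops = buildEnrollmentUpdates(studentId, courseId);
// db.students.updateOne(ops.student.filter, ops.student.update);
// db.courses.updateOne(ops.course.filter, ops.course.update);
```

The two writes are independent, so a failure between them leaves the arrays out of step. The junction collection avoids that by recording each enrollment in a single document.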

Denormalization Strategies

When to Denormalize

Frequently accessed together: Data that's always read together
Read-heavy workloads: Optimize for query performance
Stable data: Information that doesn't change often
Acceptable redundancy: When the extra storage costs less than the query complexity it removes

Denormalization Example

// Normalized approach (requires multiple queries)
// Users collection
{
  _id: ObjectId("user1"),
  name: "Alice Johnson",
  email: "alice@example.com"
}

// Posts collection
{
  _id: ObjectId("post1"),
  title: "My First Post",
  content: "Post content...",
  authorId: ObjectId("user1"),
  createdAt: ISODate("2024-01-16")
}

// Denormalized approach (single query)
// Posts collection with embedded author info
{
  _id: ObjectId("post1"),
  title: "My First Post",
  content: "Post content...",
  author: {  // Denormalized author data
    _id: ObjectId("user1"),
    name: "Alice Johnson",
    email: "alice@example.com"
  },
  createdAt: ISODate("2024-01-16")
}

Managing Denormalized Data

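// Update user name in both users and posts collections.
// These are two separate writes: without a multi-document transaction,
// a reader can briefly see the new name in users and the old one in posts.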
function updateUserName(userId, newName) {
    // Update users collection
    db.users.updateOne({ _id: userId }, { $set: { name: newName } });

    // Update denormalized data in posts
    db.posts.updateMany({ 'author._id': userId }, { $set: { 'author.name': newName } });
}

Performance Considerations

Document Size Limits

16 MB limit: A single BSON document cannot exceed 16 MB, so keep documents well below it
Working set: Frequently accessed documents should fit in memory
Index size: Consider index size when designing schema
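
For a quick sanity check in application code, the serialized JSON length can serve as a rough proxy for document size. This is only an approximation, since BSON encodes types differently from JSON; the `$bsonSize` aggregation operator reports the real figure on the server. The function names and the 50% warning threshold below are assumptions.

```javascript
const MAX_BSON_BYTES = 16 * 1024 * 1024;  // 16 MB document limit, in bytes

// Rough proxy only: UTF-8 byte length of the JSON form, not the true BSON size.
function approxJsonBytes(doc) {
  return new TextEncoder().encode(JSON.stringify(doc)).length;
}

// Hypothetical check: flags documents that have used up `fraction` of the limit.
function isNearSizeLimit(doc, fraction = 0.5) {
  return approxJsonBytes(doc) >= MAX_BSON_BYTES * fraction;
}

const post = { title: "Popular Post", comments: new Array(1000).fill({ text: "Great post!" }) };
console.log(approxJsonBytes(post), isNearSizeLimit(post));
```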

Query Patterns

// Design for your most common queries
// If you frequently query posts by author and date:
{
  _id: ObjectId("post1"),
  authorId: ObjectId("user1"),  // Index: { authorId: 1, publishedAt: -1 }
  title: "Post Title",
  publishedAt: ISODate("2024-01-16"),
  // ... other fields
}

// Create compound index
db.posts.createIndex({ authorId: 1, publishedAt: -1 })

Write Patterns

// Optimize for write-heavy workloads
// Time-series data with bucketing
{
  _id: "metrics_2024_01_16_10",  // string _id that encodes the bucket's date and hour
  date: ISODate("2024-01-16"),
  hour: 10,
  data: [
    { minute: 0, cpu: 45.2, memory: 67.8 },
    { minute: 1, cpu: 46.1, memory: 68.2 },
    // ... more data points
  ]
}

Schema Evolution

Versioning Strategy

// Version field approach
{
  _id: ObjectId("user1"),
  schemaVersion: 2,
  name: "John Doe",
  email: "john@example.com",
  // New fields in version 2
  preferences: {
    theme: "dark",
    notifications: true
  }
}

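// Handle different versions in application code
function getUser(userId) {
  const user = db.users.findOne({ _id: userId });

  // findOne returns null when no document matches
  if (!user) {
    return null;
  }

  // Documents written before versioning have no schemaVersion; treat them as version 1
  if ((user.schemaVersion ?? 1) === 1) {
    // Provide defaults for fields added in version 2
    user.preferences = {
      theme: "light",
      notifications: true
    };
  }

  return user;
}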

Migration Strategies

// Bulk migration: update every unversioned document in a single pass
db.users.updateMany(
    { schemaVersion: { $exists: false } },
    {
        $set: {
            schemaVersion: 2,
            preferences: {
                theme: 'light',
                notifications: true,
            },
        },
    },
);

// Progressive migration script
const cursor = db.users.find({ schemaVersion: 1 });
while (cursor.hasNext()) {
    const user = cursor.next();

    // Perform migration (migrateUserToV2 is an application-defined helper)
    const updatedUser = migrateUserToV2(user);

    db.users.replaceOne({ _id: user._id }, updatedUser);
}
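
The progressive script above calls `migrateUserToV2` without defining it. Below is a hypothetical sketch of what such a helper could look like, together with a read path that migrates lazily in the strict sense: a document is upgraded and written back only when it is actually read. The `users` argument stands for any object with `findOne` and `replaceOne` methods (for example `db.users` in mongosh); both function bodies are assumptions, not the author's implementation.

```javascript
// Hypothetical helper: returns a version 2 copy of a user document.
function migrateUserToV2(user) {
  if (user.schemaVersion === 2) {
    return user;  // already current, nothing to do
  }
  return {
    ...user,
    schemaVersion: 2,
    // Keep any preferences already present; otherwise fall back to defaults.
    preferences: { theme: "light", notifications: true, ...(user.preferences || {}) }
  };
}

// Lazy migration on read: upgrade the document the first time it is accessed.
function getUserLazily(users, userId) {
  const user = users.findOne({ _id: userId });
  if (!user || user.schemaVersion === 2) {
    return user;  // missing or already migrated
  }
  const migrated = migrateUserToV2(user);
  users.replaceOne({ _id: userId }, migrated);  // persist so later reads skip this step
  return migrated;
}

// Possible use in mongosh: getUserLazily(db.users, someUserId)
```

This spreads the migration cost across normal traffic, at the price of carrying both versions in the collection until every document has been read at least once.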

Common Anti-Patterns to Avoid

1. Unnecessary Normalization

// Anti-pattern: Over-normalization
{
  _id: ObjectId("address1"),
  street: "123 Main St",
  cityId: ObjectId("city1")  // Unnecessary reference
}

{
  _id: ObjectId("city1"),
  name: "New York",
  stateId: ObjectId("state1")  // Another unnecessary reference
}

// Better: Embed stable data
{
  _id: ObjectId("user1"),
  name: "John Doe",
  address: {
    street: "123 Main St",
    city: "New York",
    state: "NY",
    country: "USA"
  }
}

2. Massive Arrays

// Anti-pattern: Unbounded array growth
{
  _id: ObjectId("post1"),
  title: "Popular Post",
  likes: [userId1, userId2, ..., userId50000]  // Too many elements
}

// Better: Use separate collection or bucketing
{
  _id: ObjectId("post1"),
  title: "Popular Post",
  likeCount: 50000,
  // Store likes in separate collection
}

3. Inappropriate Embedding

// Anti-pattern: Embedding frequently changing data
{
  _id: ObjectId("user1"),
  name: "John Doe",
  orders: [  // Orders change frequently, grow unbounded
    { orderId: ObjectId("order1"), total: 99.99, status: "shipped" },
    { orderId: ObjectId("order2"), total: 149.99, status: "pending" },
    // ... potentially thousands of orders
  ]
}

// Better: Use references
{
  _id: ObjectId("order1"),
  userId: ObjectId("user1"),
  total: 99.99,
  status: "shipped"
}

Best Practices Summary

Understand your data access patterns
Favor embedding for one-to-few relationships
Use references for one-to-many and many-to-many
Denormalize frequently accessed data
Keep document size reasonable
Plan for schema evolution
Index your queries
Monitor and optimize performance

What's Next?

Now that you understand data modeling, it's time to optimize your queries with Indexing and Performance, or learn how to process your data with the Aggregation Pipeline.

Series Navigation

Previous: MongoDB CRUD Operations
Next: MongoDB Indexing and Performance
Hub: MongoDB Zero to Hero - Complete Guide

This is Part 4 of the MongoDB Zero to Hero series. Data modeling is crucial for building scalable MongoDB applications - take time to understand these patterns before moving to more advanced topics.
