Lesson 3 - Hands-On with MongoDB

Welcome to MongoDB

In Lesson 2 you learned what NoSQL is and when to use it. Now you put document databases to work. MongoDB is one of the most popular NoSQL databases, and if you are coming from SQL it can feel like learning a new language, so this lesson is fully hands-on. You will see exactly how document databases solve real data engineering problems.

Here is the scenario you will build around. You are a data engineer at a growing e-commerce company. Your customer review system started simple: star ratings and text in a SQL database. But success brought complexity. Marketing wants verified-purchase badges. The mobile team is adding photo uploads. Product management is launching video reviews. In SQL, each change means a schema migration, and on tables with millions of rows a migration can take hours. This is the schema evolution problem that drives data engineers to NoSQL, and you are about to see how MongoDB sidesteps it.

By the end of this lesson, you will be able to:

  • Set up a MongoDB Atlas cluster and configure access
  • Insert flexible documents and evolve the schema with zero migrations
  • Query documents with field matches, operators, and existence checks
  • Build aggregation pipelines that answer real business questions
  • Connect MongoDB to Python with PyMongo for a complete data pipeline

You only need a little Python comfort. Let’s get started.

Data for this lesson

Platform: MongoDB (via MongoDB Atlas + Compass).

Data: you’ll create your own database and collections during the lesson — no SQL tables and no download needed.


Setting Up MongoDB Atlas

You will use MongoDB Atlas, MongoDB’s managed cloud service. Atlas mirrors how you would deploy MongoDB professionally, and it is fast to set up so you get straight to the concepts. (You could install MongoDB locally instead, but Atlas keeps the setup out of your way.)

Create your account and cluster

Go to MongoDB’s Atlas page and create a free account (no credit card required). The free tier gives you 512 MB of storage, more than enough for learning and small prototypes.

Click Build a Database and choose the free shared cluster. Pick any cloud provider and a region near you; the defaults are fine because you are learning concepts, not optimizing performance. Name the cluster something simple like learning-cluster and click Create.

Configure access: user and network

While Atlas builds your cluster — yes, even the free tier is distributed across multiple servers — you configure two things: a database user and network access.

For the database user, click Database Access in the left menu and add a user. Choose password authentication, create credentials you will remember, and set permissions to Read and write to any database.

For network access, click Network Access. Atlas may have configured this during signup. If not, click Add IP Address and choose Allow Access from Anywhere for now.

Learning settings are not production settings

“Read and write to any database” and “Allow Access from Anywhere” are convenient for learning, but they are not production practices. In production you would restrict network access to specific IP addresses and limit each user to only the databases and permissions they need. Atlas provides the security features — it is on you to configure them.

Connect with Compass

Your cluster is ready in about three minutes. Click Connect on the cluster, then choose MongoDB Compass — MongoDB’s GUI for exploring data visually. Download Compass if needed, then copy your connection string. It looks like this:

mongodb+srv://myuser:<password>@learning-cluster.abc12.mongodb.net/

Replace <password> with your actual password and connect. On success you will see your cluster with a few pre-populated databases like admin, local, and maybe sample_mflix (MongoDB’s demo movie data). Those are system and sample databases — you will create your own next.
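One practical wrinkle when you paste a password into a connection string: characters such as @, : or / have special meaning in a URI and must be percent-encoded first. The following is a minimal sketch using only Python’s standard library. The function name build_atlas_uri and the sample password are made up for illustration, and the host is the placeholder from the example string.

```python
from urllib.parse import quote_plus


def build_atlas_uri(user: str, password: str, host: str) -> str:
    """Assemble a mongodb+srv URI, percent-encoding the credentials.

    build_atlas_uri is a hypothetical helper for this lesson, not part of
    any MongoDB library.
    """
    return f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}/"


# The password below is an invented example containing URI-reserved characters.
uri = build_atlas_uri("myuser", "p@ss/word:1", "learning-cluster.abc12.mongodb.net")
print(uri)
```

If your password contains only letters and digits, the encoded and raw forms are identical, so this step changes nothing.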

Connection timeout? Check network access

If Compass times out while connecting, the cause is almost always network access, not your password. A wrong password gives an authentication error; a timeout means your IP address is not allowed. Add your IP (or “Allow Access from Anywhere” while learning) under Network Access.

You have just stood up a distributed database that can scale to millions of documents — the same process whether you are learning or launching a startup.


Understanding Documents Through Real Data

In Compass, click the green Create Database button. Create a database called ecommerce_analytics with a collection called customer_reviews.

A quick note on terminology. In MongoDB, a database contains collections, and collections contain documents. Coming from SQL, think of collections like tables and documents like rows — except documents are far more flexible.

Click into your collection and open the built-in shell: at the top right of Compass, find the shell icon (>_) and click Open MongoDB shell. First, point the shell at your database:

use ecommerce_analytics

Insert your first document

Insert a customer review with insertOne:

db.customer_reviews.insertOne({
  customer_id: "cust_12345",
  product_id: "wireless_headphones_pro",
  rating: 4,
  review_text: "Great sound quality, battery lasts all day. Wish they were a bit more comfortable for long sessions.",
  review_date: new Date("2024-10-15"),
  helpful_votes: 23,
  verified_purchase: true,
  purchase_date: new Date("2024-10-01")
})

MongoDB confirms it worked:

{
  acknowledged: true,
  insertedId: ObjectId('68d31786d59c69a691408ede')
}

This is a complete review stored as a single document. In SQL this information might span several tables — reviews, votes, maybe purchases for verification. Here, all the related data lives together in one place.

Evolve the schema with no migration

Now the scenario that breaks SQL schemas: the mobile team ships photo uploads. Instead of planning a migration, they just start storing photos:

db.customer_reviews.insertOne({
  customer_id: "cust_67890",
  product_id: "wireless_headphones_pro",
  rating: 5,
  review_text: "Perfect headphones! See the photo for size comparison.",
  review_date: new Date("2024-10-20"),
  helpful_votes: 45,
  verified_purchase: true,
  purchase_date: new Date("2024-10-10"),
  photo_url: "https://cdn.example.com/reviews/img_2024_10_20_abc123.jpg",
  device_type: "mobile_ios"
})

Notice the two new fields, photo_url and device_type. MongoDB did not complain about missing columns or demand a migration — each document simply stores what makes sense for it. To add a brand-new field to some reviews, you do exactly nothing first: you just insert documents with the field. There is no ALTER TABLE, no need to backfill existing documents, no separate collection. The flexibility has a trade-off, though: your application code must handle documents that differ. You will need to check whether a photo exists before trying to display it.
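To make that trade-off concrete, here is a small sketch of application code that tolerates documents with different fields. It uses plain Python dictionaries, which is how a driver such as PyMongo hands documents to your code. The function render_review is hypothetical and only illustrates the check-before-use pattern.

```python
def render_review(review: dict) -> str:
    """Build a plain-text summary that tolerates missing optional fields."""
    lines = [f"{review['rating']}/5 - {review['review_text']}"]

    # .get() returns None instead of raising KeyError when the field is absent.
    photo_url = review.get("photo_url")
    if photo_url:
        lines.append(f"Photo: {photo_url}")

    video_url = review.get("video_url")
    if video_url:
        duration = review.get("video_duration_seconds", "unknown")
        lines.append(f"Video ({duration}s): {video_url}")

    return "\n".join(lines)


plain = {"rating": 3, "review_text": "Does the job but feels flimsy"}
with_photo = {
    "rating": 5,
    "review_text": "Perfect headphones! See the photo for size comparison.",
    "photo_url": "https://cdn.example.com/reviews/img_2024_10_20_abc123.jpg",
}

print(render_review(plain))
print(render_review(with_photo))
```

Required fields (rating, review_text) are read directly so a malformed document fails loudly, while optional fields go through .get() so their absence is treated as normal.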

Add a few more reviews at once with insertMany:

db.customer_reviews.insertMany([
  {
    customer_id: "cust_11111",
    product_id: "laptop_stand_adjustable",
    rating: 3,
    review_text: "Does the job but feels flimsy",
    review_date: new Date("2024-10-18"),
    helpful_votes: 5,
    verified_purchase: false
  },
  {
    customer_id: "cust_22222",
    product_id: "wireless_headphones_pro",
    rating: 5,
    review_text: "Excelente producto! La calidad de sonido es increíble.",
    review_date: new Date("2024-10-22"),
    helpful_votes: 12,
    verified_purchase: true,
    purchase_date: new Date("2024-10-15"),
    video_url: "https://cdn.example.com/reviews/vid_2024_10_22_xyz789.mp4",
    video_duration_seconds: 45,
    language: "es"
  },
  {
    customer_id: "cust_33333",
    product_id: "laptop_stand_adjustable",
    rating: 5,
    review_text: "Much sturdier than expected. Height adjustment is smooth.",
    review_date: new Date("2024-10-23"),
    helpful_votes: 8,
    verified_purchase: true,
    sentiment_score: 0.92,
    sentiment_label: "very_positive"
  }
])

Look at what you just created. One document has video metadata, another has sentiment scores, one is in Spanish. In a SQL world you would be juggling nullable columns or extra tables. Here, each review carries exactly the data that makes sense for it.


Querying Documents

MongoDB’s query language uses JSON-like syntax that feels natural once you see the pattern.

Find by exact match

Find documents by passing field names as keys:

// Find all 5-star reviews
db.customer_reviews.find({ rating: 5 })

// Find reviews for a specific product
db.customer_reviews.find({ product_id: "wireless_headphones_pro" })

Use operators for richer queries

MongoDB has operators like $gte (greater than or equal), $lt (less than), and $ne (not equal):

// Highly-rated reviews (4 stars or higher)
db.customer_reviews.find({ rating: { $gte: 4 } })

// Recent verified-purchase reviews
db.customer_reviews.find({
  verified_purchase: true,
  review_date: { $gte: new Date("2024-10-15") }
})

Query fields that might not exist

Here is something that would be painful in SQL — querying for fields only some documents have, using $exists:

// Reviews that include a video
db.customer_reviews.find({ video_url: { $exists: true } })

// Reviews that have sentiment analysis
db.customer_reviews.find({ sentiment_score: { $exists: true } })

These queries do not fail on documents that lack the field. MongoDB simply returns the documents that do match and ignores the rest — a direct payoff of the flexible schema.
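If it helps to see the matching rule spelled out, the sketch below imitates it in plain Python over a trimmed copy of the lesson’s reviews. The matches function is a deliberately simplified stand-in written for this lesson (it handles only equality, $gte, $lt, and $exists) and is not how MongoDB is implemented. The filter dictionaries themselves have the same shape you would pass to find from Python.

```python
def matches(document: dict, query: dict) -> bool:
    """Simplified filter check: equality, $gte, $lt and $exists only."""
    for field, condition in query.items():
        present = field in document
        value = document.get(field)
        if isinstance(condition, dict):
            for op, operand in condition.items():
                if op == "$exists":
                    # Presence test only; the stored value is never inspected.
                    if present != operand:
                        return False
                elif op == "$gte":
                    if value is None or not value >= operand:
                        return False
                elif op == "$lt":
                    if value is None or not value < operand:
                        return False
                else:
                    raise ValueError(f"unsupported operator: {op}")
        elif not present or value != condition:
            return False
    return True


reviews = [
    {"customer_id": "cust_12345", "rating": 4},
    {"customer_id": "cust_67890", "rating": 5,
     "photo_url": "https://cdn.example.com/reviews/img_2024_10_20_abc123.jpg"},
    {"customer_id": "cust_11111", "rating": 3},
    {"customer_id": "cust_22222", "rating": 5,
     "video_url": "https://cdn.example.com/reviews/vid_2024_10_22_xyz789.mp4"},
    {"customer_id": "cust_33333", "rating": 5, "sentiment_score": 0.92},
]

video_filter = {"video_url": {"$exists": True}}
high_rated_without_video = {"rating": {"$gte": 4}, "video_url": {"$exists": False}}

with_video = [r["customer_id"] for r in reviews if matches(r, video_filter)]
no_video = [r["customer_id"] for r in reviews if matches(r, high_rated_without_video)]
print(with_video)
print(no_video)
```

Documents that lack video_url are skipped rather than raising an error, and $exists: false selects exactly those documents, which is useful for finding reviews that still need a field backfilled.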

Speed things up with an index

As a collection grows past a few thousand documents, create an index on fields you query often. Think of it like the index in a book: instead of flipping through every page, you jump straight to the right section.

db.customer_reviews.createIndex({ product_id: 1 })

The 1 means ascending order (-1 is descending). MongoDB now keeps a sorted reference to all product_id values, making product queries fast even with millions of reviews. You do not change your queries at all — MongoDB uses the index automatically when it helps.
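The book-index analogy can be sketched in a few lines of Python. This is only an illustration of why a sorted structure avoids scanning every document. MongoDB’s real indexes are B-tree structures maintained by the server, and the names index, keys, and find_by_product below are invented for this example.

```python
from bisect import bisect_left, bisect_right

reviews = [
    {"customer_id": "cust_12345", "product_id": "wireless_headphones_pro"},
    {"customer_id": "cust_67890", "product_id": "wireless_headphones_pro"},
    {"customer_id": "cust_11111", "product_id": "laptop_stand_adjustable"},
    {"customer_id": "cust_22222", "product_id": "wireless_headphones_pro"},
    {"customer_id": "cust_33333", "product_id": "laptop_stand_adjustable"},
]

# The "index": (product_id, position) pairs kept in ascending key order.
index = sorted((doc["product_id"], pos) for pos, doc in enumerate(reviews))
keys = [key for key, _ in index]


def find_by_product(product_id: str) -> list:
    """Binary-search the sorted keys instead of scanning every review."""
    lo = bisect_left(keys, product_id)
    hi = bisect_right(keys, product_id)
    return [reviews[pos] for _, pos in index[lo:hi]]


stand_reviews = find_by_product("laptop_stand_adjustable")
print([doc["customer_id"] for doc in stand_reviews])
```

The lookup touches only the matching slice of the sorted list. The cost is that the index must be updated on every insert, which is also why real databases do not index every field by default.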

Update existing documents

Updating is just as flexible. Say the customer service team starts adding sentiment scores. Use updateOne with $set to add or modify fields:

db.customer_reviews.updateOne(
  { customer_id: "cust_12345" },
  {
    $set: {
      sentiment_score: 0.72,
      sentiment_label: "positive"
    }
  }
)

MongoDB tells you exactly what happened:

{
  acknowledged: true,
  insertedId: null,
  matchedCount: 1,
  modifiedCount: 1,
  upsertedCount: 0
}

You added new fields to one document; the others are untouched, with no migration required.

When someone marks a review helpful, increment the count with $inc:

db.customer_reviews.updateOne(
  { customer_id: "cust_67890" },
  { $inc: { helpful_votes: 1 } }
)

This operation is atomic — it is safe even when many users vote at the same time. That matters: the naive read-then-write approach (read the count, add one, write it back) has a race condition where simultaneous updates lose each other. $inc performs the increment as a single indivisible operation, so it never loses a vote.
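The lost-update problem is easy to reproduce in miniature. The sketch below simulates two clients with an in-memory counter. VoteCounter is a hypothetical class standing in for the stored document, and its lock-protected increment plays the role that $inc plays on the server. It is an analogy for the behavior, not MongoDB’s implementation.

```python
import threading


class VoteCounter:
    """In-memory stand-in for a review document's helpful_votes field."""

    def __init__(self, start: int) -> None:
        self.helpful_votes = start
        self._lock = threading.Lock()

    def read(self) -> int:
        return self.helpful_votes

    def write(self, value: int) -> None:
        self.helpful_votes = value

    def increment(self, amount: int = 1) -> None:
        # Read, add and write happen as one indivisible step.
        with self._lock:
            self.helpful_votes += amount


# Naive read-then-write: both clients read before either one writes.
naive = VoteCounter(45)
seen_by_a = naive.read()
seen_by_b = naive.read()
naive.write(seen_by_a + 1)
naive.write(seen_by_b + 1)  # overwrites client A's vote
naive_total = naive.helpful_votes

# Atomic increment: 100 concurrent voters, none lost.
atomic = VoteCounter(45)
voters = [threading.Thread(target=atomic.increment) for _ in range(100)]
for voter in voters:
    voter.start()
for voter in voters:
    voter.join()
atomic_total = atomic.helpful_votes

print(naive_total, atomic_total)
```

Two votes were cast in the naive case but the counter advanced by only one, while the atomic version accounts for every vote.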


Analytics Without Leaving MongoDB

MongoDB’s aggregate method runs analytics directly on your operational data using an aggregation pipeline — a series of stages that transform your documents step by step.

Average rating and review count per product

Answer a real business question: what is the average rating and review count for each product?

db.customer_reviews.aggregate([
  {
    $group: {
      _id: "$product_id",
      avg_rating: { $avg: "$rating" },
      review_count: { $sum: 1 },
      total_helpful_votes: { $sum: "$helpful_votes" }
    }
  },
  {
    $sort: { avg_rating: -1 }
  }
])

Output:

{
  _id: 'wireless_headphones_pro',
  avg_rating: 4.666666666666667,
  review_count: 3,
  total_helpful_votes: 81
}
{
  _id: 'laptop_stand_adjustable',
  avg_rating: 4,
  review_count: 2,
  total_helpful_votes: 13
}

The pipeline runs in stages. First $group groups documents by product_id and computes metrics for each group with operators like $avg and $sum ($sum: 1 counts documents). Then $sort orders the groups by average rating, with -1 for descending. The result is exactly what a product manager needs.
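To check your understanding of what each stage does, here is the same computation written as ordinary Python over a trimmed copy of the five reviews. This is a sketch for intuition only; in practice the pipeline runs inside MongoDB. The helpful_votes value of 46 for cust_67890 assumes you already ran the $inc update earlier in the lesson.

```python
from collections import defaultdict

reviews = [
    {"product_id": "wireless_headphones_pro", "rating": 4, "helpful_votes": 23},
    {"product_id": "wireless_headphones_pro", "rating": 5, "helpful_votes": 46},
    {"product_id": "laptop_stand_adjustable", "rating": 3, "helpful_votes": 5},
    {"product_id": "wireless_headphones_pro", "rating": 5, "helpful_votes": 12},
    {"product_id": "laptop_stand_adjustable", "rating": 5, "helpful_votes": 8},
]

# Stage 1 ($group): bucket the documents by product_id.
groups = defaultdict(list)
for review in reviews:
    groups[review["product_id"]].append(review)

results = []
for product_id, docs in groups.items():
    results.append({
        "_id": product_id,
        "avg_rating": sum(d["rating"] for d in docs) / len(docs),      # $avg
        "review_count": len(docs),                                     # $sum: 1
        "total_helpful_votes": sum(d["helpful_votes"] for d in docs),  # $sum
    })

# Stage 2 ($sort): order by avg_rating, descending (-1).
results.sort(key=lambda row: row["avg_rating"], reverse=True)

for row in results:
    print(row)
```

Each stage consumes the output of the one before it, which is the property that lets you build longer pipelines by appending stages.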

Now something more advanced — review trends by month:

db.customer_reviews.aggregate([
  {
    $group: {
      _id: {
        month: { $month: "$review_date" },
        year: { $year: "$review_date" }
      },
      review_count: { $sum: 1 },
      avg_rating: { $avg: "$rating" },
      verified_percentage: {
        $avg: { $cond: ["$verified_purchase", 1, 0] }
      }
    }
  },
  {
    $sort: { "_id.year": 1, "_id.month": 1 }
  }
])

Output:

{
  _id: {
    month: 10,
    year: 2024
  },
  review_count: 5,
  avg_rating: 4.4,
  verified_percentage: 0.8
}

This groups reviews by month and year using MongoDB’s date operators $month and $year. To compute the verified-purchase percentage, it uses $cond to convert each true/false into 1/0, then $avg to average them — so 0.8 means 80% of reviews were verified. Two details worth remembering: to group by more than one field, _id must be an object with named fields (here month and year), and $avg will not treat booleans as numbers on its own, which is why the $cond conversion is needed.
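The arithmetic behind verified_percentage is worth seeing once by hand. The sketch below repeats it in plain Python for the lesson’s five reviews, grouping on a (year, month) key and mapping each boolean to 1 or 0 before averaging. It illustrates the calculation and is not a substitute for running the pipeline in MongoDB.

```python
from collections import defaultdict
from datetime import date

reviews = [
    {"review_date": date(2024, 10, 15), "rating": 4, "verified_purchase": True},
    {"review_date": date(2024, 10, 20), "rating": 5, "verified_purchase": True},
    {"review_date": date(2024, 10, 18), "rating": 3, "verified_purchase": False},
    {"review_date": date(2024, 10, 22), "rating": 5, "verified_purchase": True},
    {"review_date": date(2024, 10, 23), "rating": 5, "verified_purchase": True},
]

# Compound group key, like the object-valued _id in the pipeline.
by_month = defaultdict(list)
for review in reviews:
    key = (review["review_date"].year, review["review_date"].month)
    by_month[key].append(review)

monthly = {}
for key in sorted(by_month):
    docs = by_month[key]
    # Same idea as $cond: true becomes 1, false becomes 0, then average.
    flags = [1 if d["verified_purchase"] else 0 for d in docs]
    monthly[key] = {
        "review_count": len(docs),
        "avg_rating": sum(d["rating"] for d in docs) / len(docs),
        "verified_percentage": sum(flags) / len(flags),
    }

print(monthly)
```

Four of the five flags are 1, so the average is 4 / 5 = 0.8, matching the output above. Multiply by 100 if you want a true percentage rather than a fraction.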

These queries answer real questions directly on operational data. Next, connect this to Python.


Connecting MongoDB to Python

Real data engineering connects systems. MongoDB rarely works in isolation — it is one part of a larger ecosystem. Let’s bridge it to Python, where you can integrate it with the rest of a pipeline.

First, install the official MongoDB driver, PyMongo, along with pandas:

pip install pymongo pandas

Here is a practical extract-and-analyze example:

from pymongo import MongoClient
import pandas as pd

# In production, store this as an environment variable for security
connection_string = "mongodb+srv://username:<password>@learning-cluster.abc12.mongodb.net/"
client = MongoClient(connection_string)
db = client.ecommerce_analytics

# Query high-rated reviews
high_rated_reviews = list(
    db.customer_reviews.find({
        "rating": {"$gte": 4}
    })
)

# Convert to a DataFrame for analysis
df = pd.DataFrame(high_rated_reviews)

# Clean up MongoDB's internal _id field
if '_id' in df.columns:
    df = df.drop('_id', axis=1)

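# Handle optional fields gracefully (remember our schema flexibility?)
# If no returned review has a field, pandas never creates its column,
# so add any missing optional column before testing it
for optional_field in ('photo_url', 'video_url'):
    if optional_field not in df.columns:
        df[optional_field] = pd.NA
df['has_photo'] = df['photo_url'].notna()
df['has_video'] = df['video_url'].notna()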

# Analyze product performance
product_metrics = df.groupby('product_id').agg({
    'rating': 'mean',
    'helpful_votes': 'sum',
    'customer_id': 'count'
}).rename(columns={'customer_id': 'review_count'})

print("Product Performance (Reviews Rated 4+):")
print(product_metrics)

# Export for downstream processing
df.to_csv('high_rated_reviews.csv', index=False)
print(f"\nExported {len(df)} reviews for downstream processing")

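A few things to notice. The same $gte operator you used in the shell works inside Python; the only difference is that the filter is written as a Python dictionary with quoted keys. The flexible schema follows the data into pandas: not every review has photo_url or video_url, so pandas fills the gaps with NaN and .notna() turns each column into clean booleans. That only works when the column exists, and pandas creates it only if at least one returned document carries the field, so a column that might be absent has to be added before you call .notna(). And MongoDB’s _id field holds an ObjectId, a BSON-specific type that most downstream tools do not understand, so dropping it (or converting it to a string with str() if you need it) is the standard move before exporting.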
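The comment in the script about storing the connection string in an environment variable can be sketched as follows. The variable name MONGODB_URI and the helper get_connection_string are assumptions chosen for this example; use whatever name your team or deployment platform prefers.

```python
import os


def get_connection_string(env_var: str = "MONGODB_URI") -> str:
    """Read the connection string from the environment, failing loudly if unset.

    MONGODB_URI is an assumed variable name, not a MongoDB requirement.
    """
    uri = os.environ.get(env_var)
    if not uri:
        raise RuntimeError(f"Set the {env_var} environment variable before running")
    return uri


# For demonstration only: normally you export this in your shell or
# deployment settings, never in source code.
os.environ.setdefault(
    "MONGODB_URI",
    "mongodb+srv://username:<password>@learning-cluster.abc12.mongodb.net/",
)

connection_string = get_connection_string()
print(connection_string.split("@")[-1])  # print only the host part, not the credentials
```

The resulting connection_string can be passed to MongoClient exactly as in the script, and the credentials no longer appear in files you commit to version control.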

Where MongoDB fits in larger architectures

This pattern — different databases for different purposes — is called polyglot persistence. In production it typically looks like this:

  • MongoDB handles operational workloads — flexible schemas, high write volumes, real-time applications.
  • SQL databases handle analytical workloads — complex queries, reporting, business intelligence.
  • Python bridges the gap — extracting, transforming, and loading data between systems.

You might capture raw user events in MongoDB in real time, then periodically extract and transform them into a PostgreSQL data warehouse where analysts run complex reports. Each database does what it does best. And note that this is not because PostgreSQL cannot handle JSON — it can, with its JSONB type — it is because MongoDB’s native flexibility and horizontal scaling fit high-volume, rapidly evolving event data better, while PostgreSQL’s optimizer and joins make it ideal for analytics. Modern data engineering is not MongoDB or SQL; it is both, combined thoughtfully.
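Here is a minimal sketch of that extract-transform-load step. To keep it runnable without any servers, a list of dictionaries stands in for documents read from MongoDB, and Python’s built-in sqlite3 module stands in for the PostgreSQL warehouse. The table layout, the COLUMNS tuple, and the to_row helper are all assumptions made for this example.

```python
import sqlite3

# Stand-in for documents fetched from MongoDB (flexible shape).
documents = [
    {"customer_id": "cust_12345", "product_id": "wireless_headphones_pro",
     "rating": 4, "verified_purchase": True},
    {"customer_id": "cust_67890", "product_id": "wireless_headphones_pro",
     "rating": 5, "verified_purchase": True,
     "photo_url": "https://cdn.example.com/reviews/img_2024_10_20_abc123.jpg"},
    {"customer_id": "cust_11111", "product_id": "laptop_stand_adjustable",
     "rating": 3, "verified_purchase": False},
    {"customer_id": "cust_22222", "product_id": "wireless_headphones_pro",
     "rating": 5, "verified_purchase": True},
    {"customer_id": "cust_33333", "product_id": "laptop_stand_adjustable",
     "rating": 5, "verified_purchase": True},
]

# Transform: project every document onto one fixed set of columns.
COLUMNS = ("customer_id", "product_id", "rating", "verified_purchase", "photo_url")


def to_row(document: dict) -> tuple:
    """Missing optional fields become None, which SQL stores as NULL."""
    return tuple(document.get(column) for column in COLUMNS)


# Load: sqlite3 in memory stands in for the PostgreSQL warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (customer_id TEXT, product_id TEXT, rating INTEGER, "
    "verified_purchase INTEGER, photo_url TEXT)"
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?, ?)",
    [to_row(d) for d in documents],
)

# Analysts can now use ordinary SQL on the fixed-schema copy.
report = conn.execute(
    "SELECT product_id, COUNT(*), AVG(rating) FROM reviews "
    "GROUP BY product_id ORDER BY product_id"
).fetchall()
conn.close()

for row in report:
    print(row)
```

The transform step is where the flexible schema gets pinned down: you decide which fields the warehouse cares about and map everything else to NULL or leave it behind.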


Practice Exercises

Try these against the customer_reviews collection you built.

Exercise 1: Insert and evolve

Insert a new review for a product called usb_c_hub that includes a brand-new field the collection has never seen before, such as reviewer_age: 34. Confirm the insert succeeds without any schema change, then write a query that returns only the reviews that have a reviewer_age field.

Hint

You do not need to prepare the collection at all — just insertOne with the new field. To find documents that have it, use db.customer_reviews.find({ reviewer_age: { $exists: true } }).

Exercise 2: Query with operators

Write queries to find:

  1. All reviews with a rating below 4.
  2. All verified-purchase reviews with more than 10 helpful votes.
  3. All reviews that include a photo_url.

// Your queries here

Hint

Use $lt for “below 4,” combine verified_purchase: true with helpful_votes: { $gt: 10 } in one filter, and use { $exists: true } for the photo field.

Exercise 3: Build an aggregation pipeline

Write an aggregation that, for each product, returns the number of verified reviews and the average rating, sorted from highest average rating to lowest.

Hint

Start your pipeline with a $match stage on verified_purchase: true, then $group by $product_id using $sum: 1 and $avg: "$rating", and finish with $sort: { avg_rating: -1 }.


Summary

You built a real document database from scratch, evolved its schema without a single migration, queried it with operators and existence checks, ran analytics with aggregation pipelines, and connected it to Python — the exact workflow data engineers use in production.

Key Concepts

  • Atlas, cluster, Compass: Atlas is MongoDB’s managed cloud service; a cluster is your distributed database; Compass is the GUI for exploring it. Connecting requires both a database user and network access.
  • Database, collection, document: a database holds collections (like tables) which hold documents (like flexible rows).
  • Flexible schema: you can insert documents with new fields anytime, with no migration; your application code must handle documents that differ.
  • Query operators: $gte, $lt, $ne, and $exists build expressive filters; $exists safely queries fields that only some documents have.
  • Indexes: createIndex speeds up frequent queries; MongoDB uses them automatically.
  • Updates: $set adds or changes fields; $inc increments atomically, avoiding race conditions.
  • Aggregation pipeline: $group, $sort, $avg, $sum, and $cond run analytics directly on operational data; group by multiple fields with an object _id.
  • PyMongo and polyglot persistence: Python bridges MongoDB to the rest of your pipeline; drop the ObjectId _id before exporting, and combine MongoDB with SQL where each excels.

Why This Matters

You worked through the same challenges that come up in real projects: evolving data structures, flexible document storage, and integrating NoSQL with analytical tools. These concepts reach beyond MongoDB — document flexibility appears in DynamoDB and CouchDB, aggregation pipelines exist in Elasticsearch, and polyglot persistence is standard in modern systems. The next time requirements change rapidly or you need to scale past a single server, you will recognize these as problems NoSQL was built to solve, and you will know how to combine it with SQL to build systems that are both flexible and powerful.


Next Steps

Congratulations — by finishing this lesson you have completed the entire SQL & Databases course! You started by writing your first SELECT statement and have arrived at running cloud data warehouses, reasoning about NoSQL trade-offs, and building real document databases with MongoDB. That is the full arc from querying data to engineering the systems that store it, and it is exactly the toolkit modern data teams expect. Be proud of how far you have come.


Continue Building Your Skills

Keep the momentum going. If you want to go deeper into MongoDB, explore indexing strategies, change streams for real-time processing, and time-series collections. If you want the broader NoSQL ecosystem, try Redis for caching or Neo4j for relationship analysis. And if you want to build production systems, wire MongoDB into a full ETL pipeline that lands in PostgreSQL. The best portfolios show real systems thinking — combining the right databases for the right jobs. You now have the skills to build exactly that. Well done, and keep building.