← All articles
PythonData Analysis

Cleaning Messy Data with Pandas: A Practical Guide

Real datasets are never as clean as tutorial datasets. This guide builds a detect-decide-fix workflow for pandas, then applies it to a real, freely-licensed museum collection dataset — missing values, disguised placeholders, inconsistent text, duplicates, and messy dates included.

Every dataset you download is lying to you a little. Not maliciously — a spreadsheet that’s been edited by a dozen people over a decade will always end up with a blank cell here, a typo’d category there, and a date typed three different ways depending on who was filling it in that day. The real question isn’t “how do I clean this data” so much as “how do I decide what to do with each kind of mess I find,” because dropping a row and filling it in are both correct answers, just to different problems.

This is where a lot of people freeze up: .dropna() feels like the safe default, but a careless one can silently throw away a third of your dataset, and you won’t notice until a chart looks wrong three steps later. This guide gives you a small mental model for making that call deliberately, then walks it through a real, messy museum collection dataset — missing values, inconsistent text, duplicate rows, and dates written half a dozen different ways.

The Mental Model: Detect, Decide, Fix

Cleaning is not one operation, it’s a loop you run once per problem you find, and it only has three steps:

  1. Detect. Find out what’s actually wrong — which columns have missing values, which categories are secretly the same thing spelled two ways, which rows are duplicates. You do this with .info(), .isna(), .value_counts(), and .duplicated(), not by scrolling through rows.
  2. Decide. For this specific column, is a missing or messy value something you drop, fill, or leave alone? The right answer depends on what the column means and what question you’re trying to answer — there’s no universal rule.
  3. Fix. Apply the decision, then re-run the detect step to confirm it actually did what you expected. Cleaning one column often reveals the next problem.
Diagram of the detect, decide, fix cycle for data cleaning: detect problems with methods like info, isna, and value_counts; decide whether to drop, fill, or leave each column based on what the data means; fix by applying the decision and looping back to detect again.

Notice this loop never says “drop everything with a missing value” — that’s a single, specific decision you might make for one column and reject for another. Keep the three steps separate in your head and the rest of this post is just applying them, one column at a time.

A Dataset You Can Reproduce

Toy datasets clean up too easily to teach anything. For real mess, this post uses a sample of the Museum of Modern Art’s own collection data — the museum publishes metadata for its full collection (title, artist, nationality, medium, acquisition date, and more) as an open dataset on GitHub, released into the public domain under a CC0 license. It’s genuinely messy in the ways real catalog data is: missing attributions, inconsistent formatting inherited from decades of paper records, and dates written as anything from a plain year to "c. 1963, assembled 1964–65".

Data: a 313-row sample of MoMA’s Artworks.csv collection dataset (CC0-1.0), via github.com/MuseumofModernArt/collection, vendored at /datasets/blog/cleaning-messy-data-with-pandas/moma-artworks-sample.csv. The sample was drawn with a fixed seed so these numbers are reproducible, and deliberately includes a mix of missing values, disguised placeholders, and duplicate-looking rows to give the cleaning steps something real to do.

import pandas as pd

art = pd.read_csv(
    "https://datatweets.com/datasets/blog/cleaning-messy-data-with-pandas/moma-artworks-sample.csv"
)
art.shape
(313, 8)
art.columns.tolist()
['Title', 'Artist', 'Nationality', 'Gender', 'Date', 'Medium', 'Department', 'DateAcquired']

313 artworks, 8 columns. (The outputs in this post come from pandas 3.0 — the general approach also works on pandas 2.x, though a couple of the exact warning behaviors near the end are pandas 3.0-specific, and called out where it matters.)

Detecting the Mess: .info() and .isna().sum()

Before touching anything, find out what you’re actually dealing with:

art[["Title", "Artist", "Nationality", "Gender", "Date"]].head(3)
                                               Title  ...  Date
0                              9 Bilder (9 Pictures)  ...  1971
1  Patriotic propaganda postcard with verse by Vl...  ...  1914
2                              County Clare, Ireland  ...  1954

[3 rows x 5 columns]
art.info()
<class 'pandas.DataFrame'>
RangeIndex: 313 entries, 0 to 312
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Title         313 non-null    str
 1   Artist        307 non-null    str
 2   Nationality   307 non-null    str
 3   Gender        307 non-null    str
 4   Date          308 non-null    str
 5   Medium        275 non-null    str
 6   Department    313 non-null    str
 7   DateAcquired  298 non-null    str
dtypes: str(8)
memory usage: 19.7 KB

.info() gives you the shape of the problem at a glance: every column is loaded as plain text (more on that later), and five of the eight columns have gaps. But it collapses each column to one number — you still need .isna().sum() for the exact counts you’ll act on:

art.isna().sum()
Title            0
Artist           6
Nationality      6
Gender           6
Date             5
Medium          38
Department       0
DateAcquired    15
dtype: int64

And one more detection pass before deciding anything — are any rows exact duplicates of each other?

print(art.duplicated().sum())
12

Twelve full-row duplicates already, using nothing more than the default settings. Keep that number in mind; it’s about to get bigger once we start comparing rows more loosely.

Deciding What Missing Actually Means

Artist is missing for 6 rows — but “missing” isn’t the whole story here. MoMA’s own cataloguers already have a label for artworks with no known creator:

print("true NaN Artist:", art["Artist"].isna().sum())
print("'Unknown artist' string:", (art["Artist"] == "Unknown artist").sum())
true NaN Artist: 6
'Unknown artist' string: 6

Six rows are genuinely blank, and a separate six already say "Unknown artist" in plain text. Those mean the same thing to a human, but .isna() only sees the first group. Grouping or counting by Artist right now would quietly split “unknown” into two different buckets. The fix is to fold the true blanks into the museum’s own convention rather than inventing a new label:

art["Artist"] = art["Artist"].fillna("Unknown artist")
print((art["Artist"] == "Unknown artist").sum())
12

Medium (the material and technique — “gelatin silver print,” “pencil on paper”) is a different kind of decision. Nearly one in eight rows is missing it:

print(art["Medium"].isna().sum())
38

Dropping those rows would work, but it throws away real artworks just because one descriptive field is blank:

len(art), len(art.dropna(subset=["Medium"]))
(313, 275)

That’s 38 artworks gone — a meaningful chunk of the sample — for a field that doesn’t affect most other questions you’d ask of this data (which department has the most acquisitions, which decade an artist was most active). Filling with an explicit placeholder keeps every row and makes the gap visible instead of silently absent:

art["Medium"] = art["Medium"].fillna("not recorded")
print(art["Medium"].isna().sum())
0

Same missing-value problem, two different fixes, because the two columns answer different questions. That’s the “decide” step doing its job — see pandas’ user guide on working with missing data for the full menu of fillna strategies (forward-fill, interpolation, per-group fill) beyond the two used here.

Fixing Inconsistent Text: .value_counts() and .str

Nationality and Gender both come wrapped in parentheses, straight from how MoMA’s source records store them:

art["Nationality"].value_counts(dropna=False).head(8)
Nationality
(American)    154
(French)       40
()             26
(German)       23
(British)      13
(Italian)       9
NaN             6
(Chilean)       6
Name: count, dtype: int64
art["Gender"].value_counts(dropna=False)
Gender
(male)          242
(female)         38
()               26
NaN               6
(non-binary)      1
Name: count, dtype: int64

.value_counts() is the right detection tool for text columns — it surfaces exactly this kind of thing: an empty "()" sitting right next to NaN, both meaning “unknown,” neither one obviously “missing” until you look. Strip the parentheses with .str.strip(), and collapse both flavors of unknown into one label:

art["Nationality"] = art["Nationality"].fillna("()").str.strip("()")
art["Nationality"] = art["Nationality"].replace("", "unknown")

art["Gender"] = art["Gender"].fillna("()").str.strip("()")
art["Gender"] = art["Gender"].replace("", "unknown")

art["Nationality"].value_counts().head(8)
Nationality
American    154
French       40
unknown      32
German       23
British      13
Italian       9
Chilean       6
Dutch         5
Name: count, dtype: int64
art["Gender"].value_counts()
Gender
male          242
female         38
unknown        32
non-binary      1
Name: count, dtype: int64

unknown now correctly reads as 32 (26 empty parens plus 6 true NaN) for both columns — one clean category instead of two disguised ones. That’s the pattern worth remembering: any time you see a .str column, run .value_counts() on it before you trust it. Typos, casing differences, and placeholder values almost always show up there first.

Finding and Handling Duplicates: .duplicated()

Earlier, .duplicated() with default settings found 12 exact-duplicate rows. But two rows can describe the same artwork without matching on every single column — DateAcquired or a formatting quirk in Nationality can differ even when the artwork is clearly the same print catalogued twice. Comparing on a meaningful subset of columns catches more:

subset_cols = ["Title", "Artist", "Date", "Medium"]
dup_mask = art.duplicated(subset=subset_cols, keep=False)
print(dup_mask.sum())
26
art.loc[dup_mask, subset_cols].sort_values(subset_cols).head(6)
                                                 Title  ...                                             Medium
0                                9 Bilder (9 Pictures)  ...  Staple-bound book of half-tone photographs, 12...
236                              9 Bilder (9 Pictures)  ...  Staple-bound book of half-tone photographs, 12...
69   Gift 13: Paper Cutting (Kindergarten material ...  ...                                  Various materials
96   Gift 13: Paper Cutting (Kindergarten material ...  ...                                  Various materials
110                                   Medals, New York  ...                               Gelatin silver print
278                                   Medals, New York  ...                               Gelatin silver print

[6 rows x 4 columns]

More than double the count from the strict full-row check — the choice of subset is doing real work here. keep=False marks every row in a duplicate group so you can inspect them before deciding, rather than blindly deleting. Once you’re satisfied the matches are genuine, drop_duplicates keeps the first occurrence of each group and removes the rest:

before = len(art)
art = art.drop_duplicates(subset=subset_cols, keep="first")
before, len(art)
(313, 299)

14 rows removed — 2 more than the naive full-row check found, because a couple of those “duplicate” pairs differed only in Nationality formatting or DateAcquired, not in what artwork they actually describe.

Fixing Dates: pd.to_datetime on a Column That Resists It

DateAcquired (when the museum acquired the piece) is already a clean, consistent format:

art["DateAcquired"] = pd.to_datetime(art["DateAcquired"], errors="coerce")
print(art["DateAcquired"].isna().sum(), "missing, out of", len(art))
print(art["DateAcquired"].dtype)
14 missing, out of 299
datetime64[us]

14 missing values, matching the 15 minus one lost to deduplication — no surprises, because acquisition dates are entered by the museum’s own registrar in one system. Date (when the artwork was made) is a different story, because it’s transcribed from a century of catalog cards, wall labels, and artist statements:

art["Date"].sample(8, random_state=42).tolist()
['(1946)', '1931', '1991', 'c. 1927', '1893-1903', '1925, published 1926', '1910', '1981']

A plain year, a parenthesized year, an approximate year ("c." for circa), a date range, and a compound date all sitting in the same column. Asking pd.to_datetime to parse it as a single 4-digit year is a reasonable first guess — and it mostly fails:

parsed_year = pd.to_datetime(art["Date"], format="%Y", errors="coerce")
print(parsed_year.isna().sum())
121

121 out of 299 rows — errors="coerce" at least turns unparseable values into NaT instead of crashing, but that’s still two in five rows lost. For this dataset, a full calendar date is more precision than the source data actually has anyway; what you usually want is just the year. A regular expression pulling the first 4-digit number out of the string recovers almost everything:

year = art["Date"].str.extract(r"(\d{4})")[0]
art["year_made"] = pd.to_numeric(year, errors="coerce")
print(art["year_made"].isna().sum())
23
art["year_made"].describe()
count     276.000000
mean     1952.630435
std        33.907055
min      1849.000000
25%      1927.750000
50%      1953.000000
75%      1977.250000
max      2023.000000
Name: year_made, dtype: float64

Down to 23 unrecoverable rows — the ones genuinely marked "n.d." (“no date,” the archival convention for “we don’t know”) or similarly undated, which no amount of regex can fix, because the information was never recorded in the first place. That’s a real limit worth knowing rather than working around: some missing values are missing at the source, and no cleaning step invents data that was never captured. to_datetime has plenty more options than the plain format= string used here if your own dates need them.

Three Gotchas Worth Knowing

A bare .dropna() drops a row if any column has a gap — not just the one you’re worried about.

before = len(art)
after = len(art.dropna())
before, after
(299, 263)

36 rows gone, over a tenth of what’s left, even though most of those rows are only missing something harmless like DateAcquired. Always pass subset=[...] unless you genuinely mean “drop a row for any gap anywhere.”

Missing and “recorded as unknown” are not the same story, and .isna() only catches one of them.

print("truly blank:", art["Date"].isna().sum())
print("literal 'n.d.':", (art["Date"] == "n.d.").sum())
truly blank: 5
literal 'n.d.': 17

Five rows are truly blank cells; seventeen more say "n.d." in plain text and would sail right past a .isna().sum() check. Before you trust a missing-value count, run .value_counts() on the column too and look for suspiciously common placeholder strings — "n.d.", "unknown", "N/A", even a lone dash. Treating a placeholder and a genuinely absent value as the same thing (or, just as easily, missing one because you only checked the other) skews whatever you compute next.

Chained assignment on a filtered slice doesn’t update your DataFrame — and in pandas 3.0, it tells you so.

import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    art[art["Department"] == "Photography"]["Medium"] = "RETOUCHED"

print(caught[0].category.__name__)
ChainedAssignmentError
print((art["Medium"] == "RETOUCHED").any())
False

The assignment silently does nothing to art — pandas 3.0 raises a ChainedAssignmentError warning explaining exactly why: art[mask]["Medium"] builds a temporary copy, and setting a value on a copy never reaches the original. The fix is to make the copy explicit and assign into it directly, or use .loc in one step:

photos = art[art["Department"] == "Photography"].copy()
photos["Medium"] = photos["Medium"].str.upper()
photos["Medium"].head(3).tolist()
['GELATIN SILVER PRINT', 'GELATIN SILVER PRINT', 'GELATIN SILVER PRINT']

.copy() says “I mean to work on an independent DataFrame,” which is exactly what you want when you’re building a filtered subset on purpose.

Wrapping Up

Every cleaning decision in this post came from the same loop:

  • Detect with .info(), .isna().sum(), .value_counts(), and .duplicated() — never assume, always count.
  • Decide per column: fill when the column has an existing “unknown” convention worth reusing (Artist, Nationality, Gender) or when dropping would cost you real rows for a non-essential field (Medium), or accept a smaller-but-honest recovery when the source data itself has a hard limit (Date’s 23 truly undated rows).
  • Fix, then re-run detect to confirm the fix did what you expected — cleaning Nationality is what revealed the empty-parens-versus-NaN gotcha in the first place.

None of this required a library beyond pandas itself, and the whole workflow scales down to a five-column CSV or up to a real production pipeline without changing shape.

Once your data is actually clean, grouping and aggregating it is the natural next step — our guide to pandas GroupBy picks up exactly there with the same split-apply-combine dataset approach. If you want the fuller course context — reading data, selecting rows, handling missing data, duplicates, and dtypes as a connected sequence rather than one blog post — the Handling Missing Data and Duplicates & Outliers lessons in our free Python for Data Analytics course cover this material in depth, with exercises.

More from the blog