🟢 How to Find the Bad Data Before It Finds You (Simple SQL Audits)


Hello Reader,

Here's a scenario that happens more often than anyone admits:

You build a report. Your director uses it in a board presentation. A week later, someone notices the customer count includes 58 records with obviously fake email addresses like evan.clarke743@noemail.

The report wasn't wrong. The data underneath it was dirty. And now your credibility takes a hit. Bummer. Not because of your SQL skills, but because you didn't check the final product before sending it off.

Data quality auditing is one of the most practical skills you can develop. It takes 10 minutes per query, it saves you from embarrassing situations, and the same patterns work in every database you'll ever touch.

Build Your Own Data Quality Audit Process

The idea here is to build data quality checks for what you need (invalid emails is the example I use here). Then, combine those into a unified audit script to run all checks at once.

I ran these against the Summit Adventures database (the fake adventure tourism company I created to help people learn business analytics). This database has intentional data quality issues built in. About 15% of records have some kind of anomaly specifically so students can practice finding them.

Check #1: Invalid Email Addresses

-- Find emails that don't follow standard format
SELECT 
    email,
    first_name || ' ' || last_name AS customer_name
FROM customers
WHERE email LIKE '%@noemail'
    OR email NOT LIKE '%@%.%'
    OR email LIKE '%@%@%'
    OR email LIKE '% %'
LIMIT 10;

These are records where no valid email was provided. In a real company, these customers are unreachable by email — which means any marketing campaign that targets them is wasting resources.

Check #2: The Full Audit Summary

Here's the powerful part. One query that summarizes all major data quality issues:

-- Complete data quality audit summary
SELECT 
    'Invalid emails' AS issue_type,
    COUNT(*) AS issue_count,
    ROUND(
        COUNT(*)::numeric / (SELECT COUNT(*) FROM customers) * 100, 1
    ) AS pct_affected
FROM customers
WHERE email LIKE '%@noemail'
UNION ALL 
SELECT 
    'Placeholder phones' AS issue_type,
    COUNT(*) AS issue_count,
    ROUND(
        COUNT(*)::numeric / (SELECT COUNT(*) FROM customers) * 100, 1
    ) AS pct_affected
FROM customers
WHERE phone LIKE '000-%' OR phone LIKE '555-%'
UNION ALL
SELECT 
    'Unrealistic ages' AS issue_type,
    COUNT(*) AS issue_count,
    ROUND(
        COUNT(*)::numeric / (SELECT COUNT(*) FROM customers) * 100, 1
    ) AS pct_affected
FROM customers
WHERE EXTRACT(YEAR FROM AGE(date_of_birth)) < 18 
   OR EXTRACT(YEAR FROM AGE(date_of_birth)) > 100
UNION ALL
SELECT 
    'Missing dietary info' AS issue_type,
    COUNT(*) AS issue_count,
    ROUND(
        COUNT(*)::numeric / (SELECT COUNT(*) FROM customers) * 100, 1
    ) AS pct_affected
FROM customers
WHERE dietary_restrictions IS NULL
ORDER BY pct_affected DESC;

This is your data quality scorecard. In one glance, you know:

  • 78.2% missing dietary info: This is an optional field, not necessarily a quality issue. But operations should know that dietary data is only available for 218 out of 1,000 customers.
  • 5.8% invalid emails: 58 customers are unreachable. Marketing should exclude these from campaigns.
  • 3.5% placeholder phones: 35 customers have fake phone numbers (000- and 555- prefixes). Not useful for SMS campaigns.
  • 1.6% unrealistic ages: 16 records with ages under 18 or over 100. Likely data entry errors.

Over time, you can update based on new findings or requirements and build your own audit query.

How to Build Your Own Audit Query

The pattern is the same for any database:

-- Template: Data quality audit
SELECT 'Description of issue' AS issue_type,
    COUNT(*) AS issue_count,
    ROUND(
        COUNT(*)::numeric / (SELECT COUNT(*) FROM your_table) * 100, 1
    ) AS pct_affected
FROM your_table
WHERE [condition that identifies bad data]
UNION ALL
SELECT 'Next issue' AS issue_type,
    COUNT(*),
    ROUND(COUNT(*)::numeric / (SELECT COUNT(*) FROM your_table) * 100, 1)
FROM your_table
WHERE [next condition]
ORDER BY pct_affected DESC;

Each UNION ALL block checks for one type of issue. The ORDER BY at the end sorts by severity so you see the biggest problems first.

What to Check in Any Database

Here's a checklist you can adapt for your specific needs:

Contact information:

  • Emails without @ and domain → WHERE email NOT LIKE '%@%.%'
  • Phone numbers with placeholder prefixes → WHERE phone LIKE '000-%'
  • Missing required fields → WHERE field IS NULL

Dates and ages:

  • Future dates → WHERE signup_date > CURRENT_DATE
  • Unrealistic ages → WHERE age < 0 OR age > 120
  • Dates before the company existed → WHERE created_at < '2020-01-01'

Financial data:

  • Negative amounts → WHERE amount < 0
  • Zero-dollar transactions → WHERE amount = 0
  • Suspiciously large amounts → WHERE amount > 100000

Duplicates:

  • Same email, different names → GROUP BY email HAVING COUNT(*) > 1
  • Same phone, different customers → GROUP BY phone HAVING COUNT(*) > 1

When to Run Audits

  • Before any analysis. Five minutes of checking saves you from presenting conclusions based on dirty data.
  • Before sending reports to executives. A 3-row audit summary at the bottom of your report ("Note: 58 records excluded due to invalid email addresses") shows professionalism and builds trust.
  • After any data import or migration. New data loads are the most common source of quality problems.

Common Mistakes to Avoid

Mistake 1: Fixing data silently

Don't just exclude bad records without mentioning it. Document what you found and what you excluded. Transparency builds more trust than a "clean" report that hides problems.

Mistake 2: Treating all NULLs as errors

Missing dietary restrictions isn't a data quality issue, it's an optional field. Missing email addresses on records that need email outreach IS a quality issue. Context matters.

Mistake 3: Auditing once and never again

Data quality degrades over time as new records come in. Build your audit as a saved query you can re-run monthly.

Try This Monday Morning

Before your next analysis:

  1. Pick your most-used table
  2. Write three WHERE conditions that identify suspicious records
  3. Use the UNION ALL pattern to create a one-query audit summary
  4. Add a note to your report documenting what you found

It takes less than 15 minutes. And the first time it catches something before your boss does, you'll never skip it again.

Until next time,

Brian

Brian Graves, creator of Analytics in Action

Say 👋 on X/Twitter, LinkedIn, or book a call with me. You can always reply to these emails. I check them all.

P.S. Data quality thinking is woven throughout SQL for Business Impact. Every module uses real-world data with intentional messiness, because that's what you'll face at work. The course teaches you to handle it confidently. Check it out at sqlforbusinessimpact.com.

P.P.S. What's the worst data quality surprise you've found at work? Hit reply. I collect these stories because they're genuinely fascinating. I read every response.

Starting With Data

Learn to build analytics projects with SQL, Tableau, Excel, and Python. For data analysts looking to level up their career and complete beginners looking to get started. No fluff. No theory. Just step-by-step tutorials anyone can follow.

Read more from Starting With Data

Hey Reader, Last month, we covered LEFT JOIN to find customers who never booked. That's one type of absence — people who never showed up. But there's another kind that's even more expensive: people who showed up, committed, and then left. The dreaded cancellation. Customers who said "yes", then said "actually, never mind". In the Summit Adventures database (the fake adventure tourism company I created to help people learn business analytics), nearly 47% of all bookings end in cancellation....

Hello Reader, This week's newsletter is a bit different. There's SQL in here, but the real topic is career strategy. Because one of the most common questions I get from readers is some version of: "I've been learning SQL for a few months. How do I prove I can actually do this job?" The honest answer: a portfolio of 2-3 projects that demonstrate business thinking, not just technical syntax. The problem is most portfolios I've reviewed look the same. A Kaggle competition. A tutorial from...

Hello Reader, A common analytics question that sounds simple but isn't: "Which customers have booked the same type of expedition more than once?" You can't answer this with a regular JOIN between two different tables. The information lives in one table. You need to compare rows within that table to find patterns. That's what a self-JOIN does. It joins a table to itself. It sounds unusual, but once you see the pattern, you'll recognize situations where it's exactly what you need. The Business...