Leveraging Data for AI Solutions (Part 1): Data Quality in the Age of AI
Credit: Video generated by Google Veo 3 and Google AI Pro in the attempted style of New Yorker Cartoons, inspired by a particular superhero’s tech lab
Data remains the foundation of AI success, even as GPT-5 and other advanced models reshape our industries. After my five-month journey through Cornell University’s eCornell “Designing and Building AI Solutions” program, taken after my Cornell Johnson MBA, I continue to reflect on the learning and insights from Lutz Finger’s instruction in the third program module, “Leveraging Data for AI Solutions”. This first article reinforces my conviction that data quality fundamentals matter more than ever in our rapidly evolving AI landscape, regardless of model architecture or approach.
The Enduring Truth: Quality Over Quantity
Data refinement remains the foundation of AI success, regardless of model sophistication. Lutz Finger’s metaphor of data as “new oil” requiring refinement to fuel AI capabilities proves especially relevant as organizations deploy GPT-5 and multimodal models. Raw data requires systematic processing before it can generate meaningful insights.
Lutz Finger invited guest speaker Brad Cordova, founder and CEO of the AI intelligent document processing platform Super.ai, who crystallized the challenge of defining AI quality. Cordova shared that “For something to be high quality, it means it gives you answers in line with what you want.” This philosophical challenge of defining desired outcomes becomes exponentially complex when training models that will interact with millions of users across diverse contexts.
The concept of “data half-life” offers crucial guidance for modern practitioners. Cordova introduced this framework to describe how long training data remains relevant before retraining or fine-tuning becomes necessary. While facial recognition data might stay valid for decades, stock market patterns expire in mere seconds. This principle directly impacts Large Language Models (LLMs), given how quickly training data becomes outdated for current events in a frontier or foundation model.
Missing Data: The Challenge That Scales with Complexity
Under Lutz Finger’s instruction, the program’s hands-on exercises taught us to categorize missing data systematically. Lutz Finger guided us through practical applications revealing three critical categories that every AI practitioner must understand:
MCAR (Missing Completely at Random): Equipment failures affecting data collection randomly
MAR (Missing at Random): Patterns in missingness related to observed variables
MNAR (Missing Not at Random): When missingness relates to unobserved values
The program’s exercises demonstrated how modern AI amplifies these challenges. I discovered that large language models (LLMs) excel at pattern recognition but may hallucinate when imputing missing values. Lutz Finger’s instruction showed that traditional statistical approaches remain essential for reliable data preparation.
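As a minimal sketch of the kind of traditional, auditable imputation discussed above (not taken from the program materials), consider a hypothetical pandas DataFrame of sensor readings. The column names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"temperature": [38.5, np.nan, 39.1, np.nan, 38.8],
                   "activity":    [120, 95, np.nan, 110, 130]})

# Record where values were missing before imputation so downstream models
# can still "see" the missingness pattern (useful when data may be MAR/MNAR).
df["temperature_was_missing"] = df["temperature"].isna()

# Median imputation: simple, deterministic, and easy to audit,
# unlike asking an LLM to guess plausible values.
imputer = SimpleImputer(strategy="median")
df[["temperature", "activity"]] = imputer.fit_transform(df[["temperature", "activity"]])
print(df)
```

Keeping an explicit missingness indicator is a small design choice that preserves information a model might need, instead of silently papering over the gap.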
From Dairy Farm to Algorithm: The ‘Remoteness’ Problem with Data
Lutz Finger’s visit to Cornell’s Dairy Research Center illuminated a crucial reality about AI deployment. Watching researchers collect biometric data from real cows to predict illnesses highlighted how AI decisions can be detached from their subjects. The data scientist analyzing health metrics may never see the animals, just as engineers optimizing AI systems never meet end-users.
This separation between data collection and analysis on the one hand, and decision-making on the other, has profound implications. As Finger reflected, “We can set up a system where a computer takes a decision and a human then executes on that decision.” This remoteness concentrates power and calls for greater responsibility, especially as agentic AI systems gain the autonomy (and sometimes opacity) to act on predictions without human intervention.
The Competitive Edge: How Data Quality Drives Business Value
How did a 140-year-old soda company (an S&P 500 company with a US$301B market cap) become an AI powerhouse while some tech companies struggle? By recognizing a truth others missed: superior data quality creates measurable advantages across every source of competitive advantage. My Cornell Johnson MBA program taught five such sources: economies of scale, economies of scope, efficiency frontier positioning, network effects, and accumulated investments.
A 2020 McKinsey survey found that organizations that excel at data quality management are far more likely to achieve higher AI project success rates than their peers. This success translates directly to financial performance through cost reduction and revenue generation (e.g., improved segmentation for cross-selling and up-selling).
Accumulated investments in data pipelines and curation create the most durable competitive advantages for AI systems. Historical data collection, labeling, and cleaning represent sunk investments that competitors cannot instantly replicate. Just as The Coca-Cola Company has built irreplaceable brand equity over 140 years, companies accumulate “data equity” through sustained investment in high-quality datasets. A competitor cannot recreate 20 years of medical imaging datasets or search query logs any more than they can replicate Coca-Cola’s century of brand recognition. This accumulated investment compounds over time as organizations build proprietary datasets, refine data pipelines, and develop institutional knowledge about data quality management.
Leading organizations systematically track AI performance through comprehensive metrics that connect data investments to business outcomes. While these metrics represent hypothetical frameworks for thought exercises rather than universal benchmarks, they illustrate how companies might measure data quality ROI:
AI R&D ROI: (Net New AI Revenue + Cost Savings) / Total AI R&D Investment
Data Quality ROI: (Error Reduction Savings + Productivity Gains) / Data Quality Investment
Quarterly Revenue per AI Model: Total AI-Attributed Revenue / Number of Production Models
Data to Decision Velocity: Time from data collection to actionable insights per quarter
Model Performance Efficiency: Accuracy achieved per million training samples
Organizations could also track operational and outcome-focused metrics to measure competitive advantage. These hypothetical metrics serve as thought starters for developing company-specific KPIs; a short sketch after this list shows how a few of them might be computed:
Data Freshness Score: Percentage of datasets updated within required timeframes
Model Drift Detection Time: Time to identify performance degradation (Example target: Less than 48 hours)
Data Pipeline Reliability: Uptime of critical data flows (Aspirational target: 99.9%)
Cross Functional Data Reuse: Number of use cases per dataset (Example target: greater than 5)
Customer Experience Lift: NPS improvement from AI features (Hypothetical target: plus 15 points)
Decision Accuracy Improvement: % Error rate reduction in AI assisted decisions
Time to Insight Reduction: % Velocity improvement in analysis tasks
Innovation Velocity: New AI features deployed per quarter (Example target: 2x year-over-year growth)
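As a toy sketch with entirely hypothetical numbers (none of these figures are real benchmarks), here is how a team might compute two of the metrics above in practice.

```python
# Hypothetical inputs for the AI R&D ROI formula above.
ai_rd_investment = 4_000_000          # total AI R&D spend ($)
net_new_ai_revenue = 5_500_000        # revenue attributed to AI features ($)
ai_cost_savings = 1_200_000           # automation savings ($)

ai_rd_roi = (net_new_ai_revenue + ai_cost_savings) / ai_rd_investment
print(f"AI R&D ROI: {ai_rd_roi:.2f}x")

# Data Freshness Score: share of datasets refreshed within their required window.
datasets_fresh = {"sales": True, "inventory": True, "weather": False, "crm": True}
freshness_score = sum(datasets_fresh.values()) / len(datasets_fresh)
print(f"Data Freshness Score: {freshness_score:.0%}")
```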
Economies of scale and scope amplify data quality advantages through systematic reuse and amortization. Here are several company examples:
Coca-Cola exemplifies this principle by leveraging its vast data assets across multiple AI applications: it uses AI for demand forecasting that analyzes weather patterns and crop yields, and it creates personalized marketing campaigns that generated over 120,000 unique videos during the FIFA World Cup.
Google reuses search data across Assistant, Translate, and Ads.
OpenAI amortizes billion dollar training costs across millions of users through consumer ChatGPT subscriptions, API access, and enterprise partnerships.
Organizations that combine high-quality data with scale can also reach their optimal position on the “efficiency frontier”: maximum model performance for a given compute budget. High-quality, carefully curated datasets can achieve superior performance at significantly lower computational cost than large volumes of lower-quality data. This fundamentally challenges the “bigger is always better” paradigm that drove early LLM development.
The strategic imperative becomes clear: given that data is an AI input, invest in data quality as a core AI competency, not an operational expense. Companies that treat data quality as an accumulated investment rather than an ongoing cost build sustainable, defensible barriers to entry and shape the rules of the game. As Coca-Cola’s CIO Neeraj Tolmare noted, “AI is the foundation for everything we do.” This philosophy has enabled the 140-year-old company to maintain market leadership through digital transformation. Success requires patient capital, systematic collection processes, and a long-term commitment to curation excellence.
As another thought exercise, consider a hypothetical Fortune 500 company: improving data quality from average standards might generate $15 million to $25 million in annual value through reduced rework, faster time to market, and improved precision. Of course, actual results vary significantly by industry and execution.
From Program Learning to Real World: The Ongoing War for Modern Data Collection
The data acquisition landscape has transformed into a sophisticated ‘arms race’. Modern AI-powered scraping tools like Import.io, Firecrawl and ScrapeGraphAI deliver distinct and use-case-specific capabilities. These platforms use natural language prompts to extract structured data, eliminating the technical complexity that previously limited scraping to specialized teams.
Google delivered the year’s most disruptive anti-scraping update on January 15, 2025. The company mandated JavaScript execution for search results access and implemented enhanced behavioral analysis. This change disrupted major SEO tools including SemRush and SimilarWeb, forcing the industry to adopt more sophisticated circumvention techniques.
Defense mechanisms have evolved beyond simple IP blocking to include AI-powered detection. Modern systems collect over 70 unique device signals, from screen resolution to installed fonts and WebGL configurations, creating digital fingerprints that persist even when users switch IP addresses or clear cookies. These technologies enable invisible CAPTCHAs that verify the solving device matches the one that received the challenge, effectively blocking CAPTCHA farms and AI agents.
Cloudflare’s July 2025 announcement introduced infrastructure-level AI crawler blocking by default. Their system affects 20% of global web traffic and implements a “Pay Per Crawl” system enabling publishers to charge AI crawlers for access. This fundamentally shifts the economic model from free scraping to paid licensing.
For practitioners, understanding both sides of this arms race proves essential. Python remains the dominant language for web scraping, with key tools including the following (a minimal example follows the list):
Beautiful Soup: Ideal for beginners scraping static HTML content
Selenium: Essential for JavaScript-heavy sites, controlling real browsers
Scrapy: A comprehensive framework for large-scale production projects
Playwright: Microsoft’s modern alternative offering faster performance
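Here is a minimal Beautiful Soup sketch for a static page. The URL and the h2 selector are placeholder assumptions; always check a site’s robots.txt and terms of service before scraping.

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"          # hypothetical target page
response = requests.get(url, timeout=10, headers={"User-Agent": "research-bot/0.1"})
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# Extract article titles from <h2> elements (an assumption about the page layout).
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(titles)
```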
The Regulatory Reckoning: Legal Pressures Reshape Data Collection
Thomson Reuters v. Ross Intelligence delivered 2025’s most significant legal precedent. Judge Bibas reversed his 2023 decision, ruling that using copyrighted legal headnotes for AI training was not fair use as a matter of law. This landmark ruling establishes a concerning precedent for AI developers, challenging assumptions about transformative use in AI contexts.
The U.S. Copyright Office’s Part 3 Report in May 2025 provided critical guidance. The Report states that using copyrighted works for AI training may constitute prima facie infringement and that fair use analysis must be case-by-case rather than presumptively transformative. Commercial training competing with original works faces significant legal risks.
EU AI Act Phase 2 implementation in August 2025 introduced mandatory transparency requirements. General Purpose AI models with training compute exceeding 10²³ FLOP must provide technical documentation, public summaries of training content, and copyright compliance measures. Penalties reach €35 million or 7% of global annual turnover.
Federal Trade Commission (FTC) enforcement intensified with Operation AI Comply. The FTC secured a $193,000 settlement from DoNotPay for false “robot lawyer” claims. The FTC’s approach emphasizes that AI receives no exemption from existing consumer protection laws and requires transparent AI disclosure.
The Evolving Data Landscape: From Web Scraping to Partnerships
The data collection paradigm has fundamentally shifted from adversarial scraping to collaborative partnerships. Recent research from Epoch AI (Villalobos et al., 2024) projects that human-generated public text could be exhausted between 2026 and 2032, creating an unprecedented data bottleneck for AI development. This scarcity drives organizations toward more sustainable approaches to data acquisition.
Major content creators are restructuring how they protect and monetize data assets.
Reddit’s strategic pivot exemplifies this transformation, implementing API pricing that blocks free scraping while securing $60 million annual deals with companies like Google.
The New York Times lawsuit against OpenAI for alleged copyright infringement signals a broader legal reckoning.
Publishers deploy JavaScript rendering requirements, CAPTCHA systems triggered by crawler behavior, and dynamic watermarking to track content origin. Rate limiting and API honeypots are also options against web scraping.
The rise of synthetic data offers one path forward amid increasing data scarcity. Companies like Scale AI have pivoted toward providing high-quality labeled data for foundation models. This shift from quantity to quality reflects a fundamental truth: carefully filtered datasets consistently outperform larger, noisier alternatives.
Modern Platforms Transform Data Quality Management
Enterprise data platforms have evolved from storage solutions to comprehensive AI-native architectures. Industry analysis reveals that Databricks and Snowflake now offer integrated environments where natural language interfaces enable users to query and clean data through conversational prompts. These platforms achieve significant performance improvements while reducing costs compared to traditional solutions.
Open-source frameworks remain the foundation of modern data quality management, with massive credit due to passionate community maintainers. The most commonly known “Core Five” technologies — Apache Kafka, Apache Spark, Apache Airflow, cloud data warehouses (like Snowflake or Google Cloud BigQuery), and dbt from dbt Labs — cover streaming, processing, orchestration, storage, and transformation across the full spectrum of data engineering needs. Each component also has compelling alternatives: modern orchestrators like Dagster and Prefect challenge Airflow’s dominance with asset-centric approaches and an improved developer experience, while transformation tools like SQLMesh and Dataform offer dbt alternatives with enhanced performance and simpler workflows.
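To make the “Core Five” concrete, here is a minimal sketch (assuming Airflow 2.x and a dbt project available on the worker; the paths and selector names are hypothetical) of how orchestration and transformation fit together, with dbt’s built-in tests acting as a basic quality gate.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_data_quality",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_models = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/analytics && dbt run --select marts",
    )
    test_models = BashOperator(
        task_id="dbt_test",  # dbt tests act as a basic data quality gate
        bash_command="cd /opt/analytics && dbt test --select marts",
    )
    run_models >> test_models  # transform first, then validate before downstream use
```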
On top of the “Core Five” stack, data lakehouses represent the latest architectural paradigm beyond traditional data warehouses. Platform providers like Databricks (optimized for Apache Spark workloads) are pioneering lakehouse architecture that combines the flexibility and scale of data lakes with the reliability and performance of data warehouses. Unlike traditional warehouses that primarily handle structured data and require costly ETL processes, lakehouses natively support structured, semi-structured, and unstructured data — including JSON, images, videos, and documents — in a single platform with ACID transactions and schema evolution. This unified approach eliminates data silos, reduces infrastructure costs, and enables both business intelligence and machine learning workloads on the same datasets without data duplication, positioning lakehouses as a compelling alternative to the traditional “Core Five” storage layer.
Data quality monitoring for AI projects has also ‘shifted left’ from reactive to proactive. Modern systems such as the industry-leading Monte Carlo (recognized as #1 by G2 for eight consecutive quarters as of June 2025) and emerging platforms like SYNQ’s Scout autonomous agent perform root-cause analysis and directly fix data quality issues within modern data stacks. Monte Carlo’s observability agents can investigate hundreds of hypotheses across datasets and reduce incident resolution time by 80 percent, while SYNQ’s Scout generates ready-to-ship code suggestions for instant fixes. These platforms now provide sub-second anomaly detection with automated remediation, extending beyond traditional structured data to monitor unstructured data in Snowflake, Databricks, and Google BigQuery.
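The underlying idea can be illustrated with a toy sketch (this is not Monte Carlo’s or SYNQ’s actual implementation, and the counts are hypothetical): flag a table whose daily row count deviates sharply from its recent history.

```python
import statistics

recent_row_counts = [10_240, 10_118, 10_305, 10_290, 10_201, 10_350, 10_275]  # last 7 days
todays_count = 6_412                                                          # hypothetical

mean = statistics.mean(recent_row_counts)
stdev = statistics.stdev(recent_row_counts)
z_score = (todays_count - mean) / stdev

# A large deviation suggests an upstream pipeline problem worth investigating
# before the bad data reaches model training or dashboards.
if abs(z_score) > 3:
    print(f"ALERT: row count anomaly (z = {z_score:.1f}); investigate upstream pipeline")
```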
Data versioning within data pipelines has become almost as simple as saving document drafts (the keyword being “almost”). Modern tools make data versioning work like Git for code, allowing teams to experiment in isolated branches without affecting production datasets. Smart versioning systems save only the changes that affect AI performance, reducing storage costs while maintaining full reproducibility. This approach integrates naturally with dbt’s transformation workflows and Airflow’s orchestration capabilities.
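As one small sketch using DVC’s Python API (DVC is one of several Git-style data versioning tools; the file path and tag below are hypothetical), a team can read back the exact dataset version that trained a given model.

```python
import pandas as pd
import dvc.api

# "v1.2.0" is a Git tag in the repository that pins both code and data versions.
with dvc.api.open("data/train.csv", rev="v1.2.0") as f:
    train_v1 = pd.read_csv(f)

print(train_v1.shape)  # reproduce training with exactly the rows used originally
```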
The most sophisticated data quality strategies now combine open-source foundations with advanced, purpose-built features from commercial data platforms and tools. Rather than replacing traditional tools, leading organizations layer AI-driven observability platforms over their existing infrastructure (e.g., Apache Spark, Airflow, and dbt). This hybrid approach delivers the critical capabilities for maintaining reliable training data in AI and machine learning pipelines by:
Maintaining the flexibility and cost-effectiveness of open-source tools
Ensuring data quality through advanced automated monitoring, anomaly detection, and intelligent remediation
Cloud Infrastructure Transforms Data Quality Management
Modern cloud platforms have revolutionized how organizations store, version, and interact with AI training data. Competition and collaboration among AWS, Google Cloud, and Azure (e.g., cross-platform open standards like Apache Iceberg) have driven innovations in services such as Amazon Web Services (AWS) S3, Google Cloud Storage, and Microsoft Azure Data Lake that make enterprise-scale data quality management accessible to organizations of all sizes.
MinIO has emerged as a bridge between cloud providers, solving the API incompatibility problem. By providing S3-compatible APIs across any infrastructure, MinIO enables a consistent development experience whether data resides on AWS, Google Cloud, on-premises systems, or hybrid deployments.
As another example, Google Cloud BigQuery’s native Gemini integration allows users to query massive datasets using simple natural language prompts, automatically converting questions to SQL. This democratization means business users can directly verify data quality without deep technical expertise.
Time travel capabilities have become essential for reproducing AI models, experiments, training data, and even training conditions. BigQuery’s automatic 7-day versioning with FOR SYSTEM_TIME AS OF syntax enables teams to trace exactly which data version trained specific models. Amazon Web Services (AWS) requires manual implementation through S3 versioning and Apache Iceberg, while Microsoft Azure Time Series Insights provides similar functionality for temporal data.
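A short sketch of BigQuery time travel, assuming the google-cloud-bigquery client and a hypothetical project, dataset, and table: the query reads the table as it looked 24 hours ago to reproduce a training snapshot.

```python
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT *
FROM `my-project.training.features`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
"""
snapshot = client.query(query).to_dataframe()
print(len(snapshot), "rows as of 24 hours ago")
```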
Privacy-Preserving Techniques at Production Scale
Breakthrough technologies now enable organizations to use sensitive data without exposing it. Recent advances in homomorphic encryption have achieved performance improvements that allow computers to process encrypted data without decryption. Apple deployed this technology for iPhone visual search features, proving viability at massive scale.
Federated learning enables collaborative AI development without data sharing. First introduced in the paper “Flower: A Friendly Federated Learning Research Framework” (Beutel et al., 2020, arXiv) and refined in subsequent research, the Flower framework now supports millions of participants training models together. Hospitals collaborate on COVID-19 prediction models without moving patient records between institutions. Smart buildings achieve high accuracy in occupancy detection while preserving individual privacy.
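The core mechanism can be shown in a toy numpy sketch (this illustrates the federated averaging idea, not the Flower API itself, and the synthetic “client” data is purely hypothetical): each participant computes a model update on local data, and only the updates, never the raw records, are averaged centrally.

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    """One least-squares gradient step on data that never leaves the client."""
    X, y = local_data
    grad = X.T @ (X @ global_weights - y) / len(y)
    return global_weights - lr * grad

rng = np.random.default_rng(0)
# Four synthetic "clients" (e.g., four hospitals), each with its own private data.
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
weights = np.zeros(3)

for _round in range(20):
    updates = [local_update(weights, data) for data in clients]
    weights = np.mean(updates, axis=0)  # federated averaging of parameters only

print("Learned weights without centralizing any raw data:", weights.round(3))
```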
Major platforms have made privacy tools accessible to all organizations. Cloud providers offer privacy-preserving features as standard services rather than requiring specialized expertise. This democratization means small businesses can now implement privacy protections previously available only to tech giants.
Why Traditional Data Quality Matters More in the AI Era
Advanced language models amplify rather than reduce data quality requirements. GPT-5 and other multimodal systems process text, images, audio, and video simultaneously, creating multiplicative validation challenges. Text tokenization errors compound with image preprocessing artifacts, while temporal inconsistencies in video data affect cross-modal understanding. Each modality requires specific quality controls that must work in concert to prevent cascading failures across the entire system.
Synthetic data generation offers valuable solutions but cannot replace high-quality real-world data. While generative models can create training data to address class imbalances and edge cases, synthetic datasets often lack the subtle correlations and complexity found in authentic data. Epoch AI research demonstrates a critical phenomenon known as “model collapse”: when AI models are trained on AI-generated data, the quality of the model’s output gradually degrades. This occurs because synthetic data lacks the diversity of human-written content, causing models to overfit to patterns specific to the synthetic data rather than generalizing to broader real-world patterns.
Iterative training scenarios reveal particularly concerning implications for synthetic data use. Each generation of models trained on synthetic data produces increasingly homogeneous outputs, progressively losing the nuanced variability that authentic human-generated data naturally contains. Epoch AI found that synthetic data introduces biased or deceptive results due to its limited variability and correlation structures. Models that perform well on benchmarks often fail when teams deploy them in production environments. Healthcare, finance, and autonomous systems face especially critical risks where decisions carry real-world consequences.
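A toy simulation makes the collapse dynamic tangible (this is a simplified illustration of the phenomenon, not the cited research): each generation fits a simple Gaussian to a handful of samples drawn from the previous generation’s model instead of from real data, and the fitted spread typically shrinks over generations, i.e., diversity is lost.

```python
import numpy as np

rng = np.random.default_rng(7)
real_data = rng.normal(loc=0.0, scale=1.0, size=10)   # stand-in for scarce human data

mu, sigma = real_data.mean(), real_data.std()
for generation in range(1, 51):
    synthetic = rng.normal(mu, sigma, size=10)          # "train" only on model output
    mu, sigma = synthetic.mean(), synthetic.std()
    if generation % 10 == 0:
        print(f"Generation {generation:2d}: fitted std = {sigma:.3f} (true data had std 1.0)")
```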
Summary of Key Obstacles When Using Synthetic Data:
Model Collapse: Training on AI-generated data causes progressive quality degradation over successive generations
Limited Data Diversity: Synthetic datasets lack the natural variability found in human-created content
Overfitting Patterns: AI models learn synthetic-specific patterns instead of real-world complexity
Production Failures: Strong benchmark performance does not guarantee real-world success
Correlation Gaps: Synthetic data misses subtle relationships present in authentic datasets
Yet, leading AI companies recognize synthetic data as an essential tool despite its limitations:
OpenAI leverages synthetic data through their o1 model (formerly Strawberry) to generate high-quality training examples for next-generation systems, with Canvas achieving 30% accuracy improvements using entirely synthetic training data.
Meta has shifted to producing “the vast majority” of fine-tuning examples through synthetic generation for Llama 3.1, while releasing their open-source Synthetic Data Kit to democratize access.
Scale AI (which received a major investment from Meta in June 2025) combines real-world and synthetic data for enterprise clients, helping companies like Kaleido AI improve model performance from 0.657 to 0.794 IoU using mixed datasets. Scale’s technical research presentation at NeurIPS 2024 provided frameworks for optimizing synthetic data strategies based on “query budget ratios.”
Google DeepMind focuses on privacy-preserving synthetic data for sensitive applications.
Anthropic’s Constitutional AI represents the largest confirmed usage of synthetic data in production. Anthropic CEO Dario Amodei has publicly outlined Anthropic’s synthetic data strategy. Character training across Claude models uses entirely synthetic datasets generated by Claude itself, demonstrating the model’s capability for self-improvement through AI-generated personality and trait development examples.
The key lies in using synthetic data strategically rather than as a complete replacement for authentic datasets. These companies understand that synthetic data serves as a crucial supplement when high-quality, human-generated data becomes scarce.
Similarly, next-generation model architectures like world models and self-supervised systems fundamentally require high-quality data as their foundation for learning. These advanced architectures build internal representations through prediction and pattern recognition. Poor data quality during training systematically distorts the model’s learned representations rather than simply introducing occasional errors. Models absorb biases, gaps, and mislabeling from flawed training data directly into their core understanding. These embedded flaws then influence every downstream prediction and decision the model makes. Organizations must prioritize data quality from the start, because retroactively fixing a model trained on poor data proves far more costly than the upfront investment in quality data.
Practical Lessons from Data Acquisition
Lutz Finger’s instruction demonstrated how data professionals must balance value creation with ethical responsibility. Through the program, Finger shared his journey from building web scrapers at startups to protecting user data at LinkedIn, illustrating the dual nature of data work. The eCornell program curriculum emphasized creating innovative data acquisition approaches while respecting legal boundaries.
Lutz Finger also invited Andrew Fogg, founder and former Chief Data Officer of the web data extraction platform Import.io, to share practical wisdom. Fogg’s advice resonated strongly: “Can you do this without AI? Because AI is a real big pain.” This perspective, shared during the program sessions, becomes especially relevant when choosing between traditional ML and resource-intensive but sophisticated neural-network architectures (e.g., LLMs).
Key Takeaways for Modern AI Practitioners
Cornell University’s eCornell program “Designing and Building AI Solutions”, under Lutz Finger’s guidance, emphasized starting with business problems rather than technology capabilities. The curriculum consistently demonstrated that poor data governance leads to data breaches, inconsistencies, and inaccuracies that undermine decision making processes. With AI systems making increasingly autonomous decisions (especially under misalignment), these risks multiply exponentially.
The Cornell program’s practical exercises taught us to implement robust data quality frameworks. Lutz Finger’s instruction covered key areas requiring attention (a brief drift-detection sketch follows the list):
Data versioning: Track changes over time for reproducibility
Drift detection: Identify when models need retraining
Anomaly monitoring: Catch quality issues early
Bias tracking: Monitor fairness across the pipeline
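As one common drift-detection approach (a minimal sketch, not the program’s specific recipe; the data here is synthetic), a team can compare the distribution of a production feature against its training reference with a two-sample Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # reference distribution
production_feature = rng.normal(loc=0.4, scale=1.1, size=5_000)  # hypothetical drifted data

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:
    print(f"Drift detected (KS statistic = {statistic:.3f}); consider retraining")
```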
Cornell’s curriculum stressed building explainability from the ground up. Through program exercises, we learned that frameworks incorporating references, explanations, and confidence scores become essential when deploying AI in production.
Remember that bias flows from humans to data to models. No amount of model sophistication overcomes biased training data. As we deploy more powerful models, vigilance about data quality becomes even more critical.
Looking Ahead
The fundamentals of data quality are not obsolete in the supercycle age of advanced AI. They are prerequisites for highly aligned, reliable, and responsible AI deployment. As we build systems with increasing autonomy and impact, the foundations of traditional data science serve as guidance and guardrails for innovation. In the next article, I shall reflect upon my learning and application of bias detection and mitigation strategies to modern AI systems, including practical techniques I have tested during the program.
Cornell University’s eCornell “Designing and Building AI Solutions” program develops this essential balance between innovation and responsibility. Lutz Finger’s instruction and structured approach transforms abstract & fundamental concepts into practical skills. The program prepares practitioners to navigate the complex landscape where technical metrics, business objectives, and ethical considerations converge.
Begin your journey toward AI fluency today: https://ecornell.cornell.edu/certificates/technology/designing-and-building-ai-solutions/
Join the Conversation
What challenges have you encountered with data quality in your AI initiatives?
How do you balance the promise of new technologies with fundamental data science principles?
Share your experiences in the comments below.
About This Series
This is Part 1 of my series reflecting on key learnings from the third module “Leveraging Data for AI Solutions” from Cornell University’s eCornell certificate program “Designing and Building AI Solutions.”
Stay tuned for upcoming articles covering:
Bias mitigation in AI systems
Data value creation strategies
Principled AI design approaches
Follow my CoreAI newsletter for practical insights at the intersection of data quality and AI implementation.
#DataScience #ArtificialIntelligence #DataQuality #MachineLearning #AIEducation #CornellUniversity #ResponsibleAI



