Data Annotation is the new AI Consumer Gold Rush

The Consumer Playbook Series: Building consumer companies is HARD. Like, ridiculously hard. This series covers how every element of consumer company building is changing, and shares the consumer building blocks no one talks about. Even as everything changes, consumer is defined by its evergreen pillars: its scale (something millions of people can use), its GTM (low-touch or no-touch sales), and its purchasing power (agency to spend, no approval needed).

My dad keeps photo albums on the bookshelf next to the mantel. He’s a semiprofessional photographer (ask me about my middle school dance embarrassment), so we have dozens. Each one is labeled by year, with notes written on the backs of the photos for vacations, first days of school, and family gatherings.

I used to think he was just photo-obsessed, and he is, but now I realize he was doing the analog equivalent of a new AI-native job: annotation.


He was adding metadata to our memories: labeling who, when, where, and why. He turned a pile of disorganized photos into a navigable archive of our lives.

That’s what the best data work is: structured and cleaned archives. And it’s what every company is sitting on today – years of unstructured moments that, if organized and labeled, can become an entirely new business line.

The same way my dad’s albums created organized memories out of film rolls, data labeling is turning offline datasets into valuable assets for AI labs.

Today, there’s an opportunity for many consumer businesses to capitalize on the data they own by courting AI labs and data annotation vendors as paying customers for that data.

This isn’t a new refrain (“we can monetize our data”), but it’s finally believable because it’s no secret that data is the new gold rush.

So now the question is whether reliably monetizing data is a repeatable, scalable strategy that many consumer founders can adopt.

Consumer businesses are defined in part by their scale. They serve millions of customers. Guess what happens when you serve millions of customers? You create billions of data points.

The consumer monetization playbook has expanded.

Data as a New Business Line

Yes I know…eyeroll. In the past, everyone who talked about data said it was this amazing asset they were going to accumulate in the business. 

I mostly called BS on this (because who were you going to sell it to?), until 2025. Now I’m eating my words.

Because the era has finally arrived: a massive, hungry buyer of data has emerged.

This hungry tiger is the research labs (OpenAI, Anthropic, xAI, etc.) and their army of data labeling vendors (Scale AI, Mercor, Handshake, etc.), plus vendors like Labelbox.

Data accumulation strategies are finally paying off. 

Most consumer founders aren’t paying attention to this as an opportunity they could capture. But they should. Admittedly, I don’t know how long it will be until margins get squeezed.

But I’ve already seen several companies pivot a part of their business quickly to serve the insatiable data demand. For example: 

  • Take Handshake, which started as a platform for connecting students to employers. They realized their real asset wasn’t helping students create online profiles – it was the collection of specialized laborers themselves. Today, they’re repositioning themselves for the AI data economy, offering human annotators for model training.
  • Or Pesto, once an education platform in India that trained software developers. They realized their learners could also be data annotators – turning a one-time education sale into a recurring services business.
  • Or Reddit, which inked a deal with OpenAI, announced in May 2024, that made Reddit’s data more accessible to OpenAI via Reddit’s Data API.

If you’re a consumer founder, you’re probably sitting on years of behavioral data: interactions, content, ratings, images, voice, or transactions. Clean it, label it, and you’ve got a dataset that might be worth more than your current ARR.

AI labs like OpenAI, Anthropic, and DeepMind have already scraped the public internet. They need fresh, domain-specific, proprietary datasets: everything from healthcare conversations and retail purchase behavior to creative production, video context, and more.

That means if you’re a founder who touches a unique domain whose data isn’t on the public internet (beauty, music, sports, parenting, gaming, fitness), you’re holding a resource that AI labs cannot yet access.

Founders are able to monetize their data twice – once by serving their users, and again by serving the models that learn from those users.

Come on… really? Data again?

Now, data has been pitched as key for over a decade, but that was for the advertising business model, and it didn’t pan out for 99% of consumer companies. It turns out the minimum number of eyeballs you needed to command the right CPMs was very high.

The same could be said for today’s data labeling opportunity. There’s a very high bar for what meets the minimum quality requirements. 

As Michelle put it, “The gap between ‘we collected videos’ and ‘we have a dataset’ is way bigger than people think. Frame extraction, segmentation, labeling, format conversion, quality checks…”

I’m not saying this is an easy, automatic translation, but it’s the start of something that can be harvested to produce value. 

Data Laborers

What Handshake and Pesto show is that there’s a new type of worker in 2025: the data laborer.

They’re not engineers, they’re not creators, they’re not customer support reps. They’re subject matter experts who generate, label, and validate the data that powers the models.

These workforces already exist. Some sit inside education companies, gig platforms, tutoring networks, or content moderation teams. Savvy founders are reorienting them toward AI data production.

Just as every startup once needed QA testers, then growth hackers, the next generation will need data ops managers: people who organize annotation tasks, run labeling QA, and even manage relationships with AI research labs.
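To make “run labeling QA” a bit more concrete, here is a minimal sketch of one common approach: have several annotators label the same item, take the majority vote, and route low-agreement items to an expert. Everything here (the function names, the 0.6 threshold, the sample photo labels) is a hypothetical illustration, not how any particular vendor or lab does it.

```python
from collections import Counter

def majority_label(labels):
    """Return (label, agreement) for one item; label is None on a tie."""
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    # A tie for first place means there is no majority label.
    tied = len(counts) > 1 and counts[1][1] == top_count
    agreement = top_count / len(labels)
    return (None if tied else top_label), agreement

def review_queue(annotations, min_agreement=0.6):
    """Split items into accepted labels and items that need expert review.

    min_agreement is an assumed threshold; a real buyer would set its own.
    """
    accepted, needs_review = {}, []
    for item_id, labels in annotations.items():
        label, agreement = majority_label(labels)
        if label is None or agreement < min_agreement:
            needs_review.append(item_id)
        else:
            accepted[item_id] = label
    return accepted, needs_review

# Hypothetical example: three photos, each labeled by two or three annotators.
annotations = {
    "photo_001": ["beach", "beach", "beach"],
    "photo_002": ["birthday", "wedding", "birthday"],
    "photo_003": ["school", "vacation"],
}
accepted, needs_review = review_queue(annotations)
print(accepted)      # items with a clear majority label
print(needs_review)  # items annotators disagreed on
```

The design choice is deliberately simple: disagreement is treated as a signal to escalate rather than something to average away, which is roughly the job a data ops manager would own.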

Tactical Questions for Founders

Here are a few questions I’d be asking myself if I were building a consumer company today:

Data as Your Business Line

  • What proprietary data do we generate that no one else has access to? Do we own it and can we license it? 
  • Could that data be cleaned, anonymized, and licensed to model builders?
  • How quickly can we prove accuracy of the data output and beat the benchmark?  
  • Can we compete in an area that is less saturated, where the models haven’t been battle tested and there isn’t a ton of alignment around scalability? 
  • Can we really understand an AI lab’s data requirements, especially if it’s for a new type of data where the scaling function isn’t known?
  • Can we set up the functions in-house to approximate the gain from the data? Can we create tooling to interact with the models? Can we create our own post-training data? Do we want to rent our own GPUs? Can we produce rubrics that are used in evals?
  • What would it look like to build a data monetization P&L next to our product P&L?
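For the “cleaned, anonymized, and licensed” question above, here is a rough sketch of what a first cleaning pass might look like. It assumes records are simple dicts with hypothetical `user_id` and `text` fields, and it only drops empty rows, removes exact duplicates, masks email addresses, and swaps user IDs for salted hashes. Salted hashing is pseudonymization, not true anonymization, so treat this as a starting point rather than something that would satisfy a buyer’s privacy or legal review.

```python
import hashlib
import re

# Simple pattern for email-like strings; an assumption, not a complete PII filter.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def pseudonymize(user_id, salt):
    """Replace a user ID with a short salted SHA-256 digest."""
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
    return digest[:16]

def prepare_records(records, salt):
    """Drop incomplete rows, mask emails, dedupe, and pseudonymize users."""
    seen = set()
    cleaned = []
    for rec in records:
        text = (rec.get("text") or "").strip()
        user_id = rec.get("user_id")
        if not text or not user_id:
            continue  # skip rows missing the fields we need
        text = EMAIL_RE.sub("[EMAIL]", text)
        key = (user_id, text)
        if key in seen:
            continue  # skip exact duplicates from the same user
        seen.add(key)
        cleaned.append({"user": pseudonymize(user_id, salt), "text": text})
    return cleaned

# Hypothetical sample data for a beauty-products app.
records = [
    {"user_id": "u1", "text": "Loved the serum, email me at jo@example.com"},
    {"user_id": "u1", "text": "Loved the serum, email me at jo@example.com"},
    {"user_id": "u2", "text": "   "},
    {"user_id": None, "text": "orphan row"},
    {"user_id": "u3", "text": "Too greasy for daily use"},
]
cleaned = prepare_records(records, salt="replace-with-a-secret")
for row in cleaned:
    print(row)
```

Even this toy version shows why the gap between raw logs and a licensable dataset is real: most of the code is about deciding what to throw away.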

Managing Data Laborers

  • Do we already have a workforce (or community) that could be repurposed into annotation or QA? For example, subject matter experts who are already paid for knowledge-based content creation?
  • Could we design microtasks that improve our own models while producing datasets valuable to others?
  • Do we want to go to market with a data-only solution (licensing), with data plus annotation, or by simply offering our data laborers as short-term contractors to labs?

Common Misconceptions

I was initially really skeptical of this as a business or monetization model. I mean, they’re just short-term moments in time with no defensibility, right? These businesses will only see more margin squeeze from research labs.

The more a specific type of data is validated as something the AI research labs want, the more they will search for it, the more supply will come their way, and the harder they will squeeze margins.

So your efficient frontier is either to be on the edge of a new type of dataset that the labs have not yet fully accepted, or to hold large quantities of data in a known area that they want a ton of.

So be forewarned: there isn’t a ton of defensibility in these businesses. You just have to stay ahead of the curve. But this is what great consumer founders excel at – staying ahead of the underlying changes in models.

There are really only a few true sources of defensibility if you’re going to take on this new monetization layer:

  • Pivoting fast – the founder’s ability to pivot the business every quarter is the defensibility. Anything selling right now will not be selling in 3 months. Alex from Scale AI was famously great at predicting what would be in demand next quarter.
  • Defining the benchmark – if a provider comes up with the benchmarks, can they front-run the data that will help the labs climb those benchmarks over the next 3 months? Scale AI co-created Humanity’s Last Exam with the Center for AI Safety, and now AI labs build to that benchmark (see https://scale.com/leaderboard).
  • Having a low cost basis – you need to be able to withstand the squeeze that buyers will ultimately put on your business. It helps if you have offshore experts.

Snapshots 

When I think back to those photo albums, I realize my dad wasn’t just preserving moments. He was future-proofing our collective family memory. He gave context to time that would otherwise blur into a jumbled forgetting. 

In the AI age, every interaction, image, and transaction your company touches is a snapshot, but each one needs to be labeled and placed in an album before your “family” can truly enjoy it.

The next decade of consumer-company success will be defined by those that can turn raw history into structured intelligence.

The future of your business may in fact be its own [data-labeled] past.
