AI startups are increasingly abandoning outsourced data collection in favor of building proprietary, high-quality datasets in-house to gain a competitive advantage. As the industry shifts its focus from raw computing power to refined training methods, companies are finding that superior model performance depends on human-led curation.
The Quality-Over-Quantity Paradigm
The email management firm Fyxer is a prime example of this trend. The company uses AI to automate email sorting and drafting, and it found that an array of smaller models trained on highly specialized data performed better than an approach built on massive, generic datasets.
“We realized that the quality of the data, not the quantity, is the thing that really defines the performance,” founder Richard Hollingsworth explained.
Human Expertise as the Training Standard
This commitment to quality forced Fyxer to rethink its workforce. In the company’s early stages, executive assistants tasked with training the AI models frequently outnumbered engineers and managers four to one. Hollingsworth emphasized that because email response logic is inherently people-oriented, the company needed experienced executive assistants to define the fundamentals of professional communication.
The Risks and Rewards of Synthetic Data
While startups are using synthetic data to extend their datasets, the quality of the source material matters more than ever, because synthetic data magnifies both the strengths and the flaws of the dataset it is derived from. For instance, Turing estimates that 75% to 80% of its vision-based data is synthetic, extrapolated from original GoPro footage. However, experts warn that any flaw in the initial training data will carry through to the synthetic outputs.
“If the pre-training data itself is not of good quality, then whatever you do with synthetic data is also not going to be of good quality,” noted Sivaraman.
Data Collection as a Strategic Moat
Beyond model performance, keeping data collection in-house also serves as a competitive moat. In a market where open-source models are easily accessible, the ability to secure expert annotators and cultivate proprietary, human-led datasets has become a primary differentiator. For companies like Fyxer, this rigorous, custom approach to training is the main defense against competitors.
“We believe that the best way to do it is through data, through building custom models, through high-quality, human-led data training,” Hollingsworth said.
